Basic Information Theory

Josh Moller-Mara

How do you efficiently encode messages?

  • How would we encode English with bits?
  • Or, how could we encode the most information with the fewest bits?
(Source: Wikimedia Commons)
  • We should use shorter codes for more frequent letters
    and longer codes for less frequent letters
  • The most common characters are:
    1. e
    2. t
    3. a
  • Z, X, and Q are rare
Frequency of English characters
Morse code takes this into account
Here's another view, similar to a Huffman code
Notice that our code length for each letter is around $\log\left(\frac{1}{p(x)}\right)$
(Source: Wikimedia Commons)

What is entropy?

\[H(p) = \sum p(x) \underbrace{\log\left( \frac{1}{p(x)} \right)}_{\class{fragment}{\textrm{"Surprisal"}}}\]
We can think about surprisal as message length
So the entropy of a distribution is just the expected message length

Entropy of a beta distribution

$\alpha$: $\beta$:

What is cross entropy?

\[H(p, q) = \sum p(x) \log\left( \frac{1}{q(x)} \right)\]
Total Cross entropy:

What is Kullback-Leibler divergence?

\[D_{KL}(p || q) = H(p, q) - H(p)\]
Kullback-Leibler divergence:

Mutual Information

Mutual Information can be thought of as \[I(X;Y) = H(X) - H(X|Y)\] or \[I(X;Y) = \mathbb{E}_Y\left[D_{\mathrm{KL}}(p(x|y)\|p(x))\right]\]

Uses of Information Theory

  • Used for compression, like Huffman coding, etc.
  • Can be used to measure information gain: $D_{KL}\left(p(x | split) || p(x)\right)$
  • Mutual information is used in decision trees/feature selection. Decision trees try to reduce entropy