Basic Information Theory
Josh Moller-Mara
What is entropy?
\[H(p) = \sum p(x) \underbrace{\log\left( \frac{1}{p(x)} \right)}_{\class{fragment}{\textrm{"Surprisal"}}}\]
We can think of surprisal as a message length: an outcome with probability $p(x)$ takes about $\log\left(\frac{1}{p(x)}\right)$ bits to encode
So the entropy of a distribution is just the expected message length
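The "expected message length" view can be sketched directly; a minimal implementation, assuming probabilities given as a list and base-2 logs so the units are bits:

```python
import math

def entropy(p, base=2):
    """H(p) = sum_x p(x) * log(1/p(x)): the expected surprisal, in bits when base=2."""
    return sum(px * math.log(1.0 / px, base) for px in p if px > 0)

# A fair coin is maximally surprising: 1 bit per flip.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is less surprising on average, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.469
```

The `if px > 0` guard uses the convention $0 \log(1/0) = 0$, so zero-probability outcomes contribute nothing.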
Mutual Information
Mutual Information can be thought of as
\[I(X;Y) = H(X) - H(X|Y)\]
or
\[I(X;Y) = \mathbb{E}_Y\left[D_{\mathrm{KL}}(p(x|y)\|p(x))\right]\]
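We can check that the two formulas agree numerically. A small sketch, using a made-up joint distribution $p(x, y)$ over two binary variables (the specific numbers are an assumption for illustration; any valid joint works):

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1}, Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y).
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# I(X;Y) = H(X) - H(X|Y)
H_X = -sum(p * math.log2(p) for p in p_x.values())
H_X_given_Y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())
mi_1 = H_X - H_X_given_Y

# I(X;Y) = E_Y[ D_KL( p(x|y) || p(x) ) ]
mi_2 = sum(p_y[y] * sum((p_xy[(x, y)] / p_y[y])
                        * math.log2((p_xy[(x, y)] / p_y[y]) / p_x[x])
                        for x in (0, 1))
           for y in (0, 1))

print(mi_1, mi_2)  # the two formulas give the same value
```

Here observing $Y$ shifts $p(x)$ from $(0.5, 0.5)$ to $(0.8, 0.2)$, so knowing $Y$ tells us a fair amount about $X$.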
Uses of Information Theory
- Used for compression, like Huffman coding, etc.
- Can be used to measure information gain: $D_{\mathrm{KL}}\left(p(x \mid \textrm{split}) \,\|\, p(x)\right)$
- Mutual information is used in decision trees and feature selection; decision trees choose the split that most reduces entropy
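The decision-tree idea can be illustrated concretely: information gain is the parent node's entropy minus the weighted entropy of the children. A sketch with made-up label counts (the counts are assumptions for illustration):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a label distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Hypothetical node with 8 positive and 8 negative examples.
parent = [8, 8]
# A candidate split sends [7 pos, 1 neg] left and [1 pos, 7 neg] right.
left, right = [7, 1], [1, 7]

n = sum(parent)
gain = entropy(parent) - (sum(left) / n * entropy(left)
                          + sum(right) / n * entropy(right))
print(gain)  # positive: the split makes the children purer than the parent
```

A split that separates the classes well drives the children's entropy toward zero, maximizing the gain; a split that leaves both children at the parent's class ratio gains nothing.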