Entropy


The entropy of a random variable $X$ with distribution $p$, denoted $H(X)$ or sometimes $H(p)$, is a measure of its uncertainty. In particular, for a discrete variable with $K$ states, it is defined by

Tip

$$ H(X) \triangleq - \sum_{k=1}^K p(X=k) \log(p(X=k)) $$

In essence, entropy is the expected surprise we get from a RV. Intuitively, the surprise of an event is the log of the inverse of its probability, or mathematically: $$ \text{surprise}(p(X=k)) = \log\left(\frac{1}{p(X=k)}\right)$$
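For example, a fair coin assigns probability $\frac{1}{2}$ to each outcome, so observing either side carries a surprise of $$ \text{surprise}\left(\tfrac{1}{2}\right) = \log(2) \approx 0.69 \text{ nats} $$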

Notice that when $p(X=k) = 1$ the surprise value is 0. Hence the "expected surprise" of a RV $X$ is

Tip

$$ H(X) \triangleq \sum_{k=1}^K p(X=k)\log\left(\frac{1}{p(X=k)}\right)$$


$$ H(X) = \sum_{k=1}^K p(X=k)\log\left(\frac{1}{p(X=k)}\right) = \sum_{k=1}^K p(X=k)\left[\log(1) - \log(p(X=k))\right] $$

$$ H(X) = \sum_{k=1}^K p(X=k)\left[0 - \log(p(X=k))\right] $$ $$ H(X) = - \sum_{k=1}^K p(X=k) \log(p(X=k)) $$
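The two forms are easy to check numerically. Below is a minimal sketch in plain Python (the helper names and the example distribution are illustrative, not from the text), computing entropy both as expected surprise and via the standard $-\sum p \log p$ form:

```python
from math import log

def surprise(p_k):
    # surprise of an outcome with probability p_k, in nats
    return log(1.0 / p_k)

def entropy_expected_surprise(p):
    # H(X) = sum_k p_k * log(1 / p_k)
    return sum(p_k * surprise(p_k) for p_k in p if p_k > 0)

def entropy(p):
    # H(X) = -sum_k p_k * log(p_k)
    return -sum(p_k * log(p_k) for p_k in p if p_k > 0)

p = [0.5, 0.25, 0.125, 0.125]  # an illustrative 4-state distribution
print(entropy(p), entropy_expected_surprise(p))  # both ≈ 1.2130 nats
```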

Info

In the case of a binary variable (a Bernoulli RV as a function of $\theta$):

$$ H(X) = - [p(X=1) \log(p(X=1)) + p(X=0) \log(p(X=0))]$$ $$ H(X) = - [\theta \log(\theta) + (1- \theta) \log(1 - \theta)]$$
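As a quick check, a small sketch (assuming natural log; the function name is illustrative) evaluating the Bernoulli entropy at a few values of $\theta$ shows it is largest at $\theta = 0.5$ and drops to 0 at the extremes:

```python
from math import log

def bernoulli_entropy(theta):
    # H(theta) = -[theta*log(theta) + (1-theta)*log(1-theta)], in nats
    if theta in (0.0, 1.0):
        return 0.0  # lim_{p->0} p*log(p) = 0
    return -(theta * log(theta) + (1 - theta) * log(1 - theta))

for theta in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(theta, round(bernoulli_entropy(theta), 4))
# theta = 0.5 gives the maximum, log(2) ≈ 0.6931 nats
```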

Info

The discrete distribution with maximum entropy is the uniform distribution

As a result, we can use entropy to quantify how evenly the probability mass $p(X=k)$ is spread across the states of a RV: the closer the distribution is to uniform, the higher the entropy.
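A small sketch makes this concrete (the example distributions are made up for illustration): a uniform 4-state distribution attains the maximum $\log(4)$, while a skewed one has lower entropy:

```python
from math import log

def entropy(p):
    # H(p) = -sum_k p_k * log(p_k), in nats
    return -sum(p_k * log(p_k) for p_k in p if p_k > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]

print(entropy(uniform))  # log(4) ≈ 1.3863 nats, the maximum for 4 states
print(entropy(skewed))   # ≈ 0.94 nats, lower because the mass is concentrated
```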

KL divergence (relative entropy)

One way to measure the dissimilarity of two probability distributions, $p$ and $q$, is the Kullback-Leibler divergence (KL divergence), also known as relative entropy:

$$KL(p||q) \triangleq \sum_{k=1}^K p_k \log\left(\frac{p_k}{q_k}\right)$$

$$KL(p||q) = \sum_{k=1}^K p_k \log(p_k) - \sum_{k=1}^K p_k \log(q_k)$$ $$KL(p||q) = -H(p) + H(p, q)$$ where $H(p, q)$ is called **cross entropy**

Tip

$$H(p, q) \triangleq \sum_{k=1}^K p_k \log\left(\frac{1}{q_k}\right) = - \sum_{k=1}^K p_k \log(q_k)$$
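A minimal sketch (example distributions are illustrative; natural log assumed) computing both quantities and checking the identity $KL(p||q) = H(p, q) - H(p)$ numerically:

```python
from math import log

def entropy(p):
    # H(p) = -sum_k p_k * log(p_k)
    return -sum(p_k * log(p_k) for p_k in p if p_k > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k * log(q_k)
    return -sum(p_k * log(q_k) for p_k, q_k in zip(p, q) if p_k > 0)

def kl_divergence(p, q):
    # KL(p||q) = sum_k p_k * log(p_k / q_k)
    return sum(p_k * log(p_k / q_k) for p_k, q_k in zip(p, q) if p_k > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))               # ≈ 0.0253
print(cross_entropy(p, q) - entropy(p))  # same value, confirming the identity
```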


References

  1. Machine Learning: A Probabilistic Perspective, Kevin Murphy (page 46, chapter 2.8 on Information theory)
  2. https://www.youtube.com/watch?v=YtebGVx-Fxw