The entropy of a random variable $X$ with distribution $p$, denoted by $H(X)$ or sometimes $H(p)$, is a measure of its uncertainty. In particular, for a discrete variable with $K$ states, it is defined by
Tip
$$ H(X) \triangleq - \sum_{k=1}^K p(X=k) \log p(X=k) $$
In essence, entropy is the expected surprise we get from a random variable. Intuitively, the less probable an event is, the more surprising it is; mathematically, the surprise is the log of the inverse probability: $$ \text{surprise}(X=k) = \log\left(\frac{1}{p(X=k)}\right)$$
Notice that when $p(X=k) = 1$ the surprise is $0$: a certain event is not surprising at all. Hence the "expected surprise" of a random variable $X$ is
Tip
$$ H(X) \triangleq \sum_{k=1}^K p(X=k)\log\left(\frac{1}{p(X=k)}\right)$$
$$ H(X) = \sum_{k=1}^K p(X=k)\log\left(\frac{1}{p(X=k)}\right) = \sum_{k=1}^K p(X=k)\left[\log(1) - \log p(X=k)\right] $$
$$ H(X) = \sum_{k=1}^K p(X=k)\left[0 - \log p(X=k)\right] $$ $$ H(X) = - \sum_{k=1}^K p(X=k) \log p(X=k) $$
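As a quick sanity check of the derivation above, here is a minimal NumPy sketch (not from the book; the function names and the choice of base-2 logs, i.e. entropy in bits, are my own assumptions) that computes $H(X)$ both from the definition and from the expected-surprise form:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_k p_k * log2(p_k); zero-probability states are dropped (0*log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_surprise(p):
    """Same quantity written as sum_k p_k * log2(1 / p_k)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))            # 1.75 bits
print(expected_surprise(p))  # same value, as the derivation above shows
```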
Info
title: in the case of a binary variable (Bernoulli RV as a function of $\theta$)
$$ H(X) = - [p(X=1) \log p(X=1) + p(X=0) \log p(X=0)]$$ $$ H(X) = - [\theta \log(\theta) + (1- \theta) \log(1 - \theta)]$$
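A small sketch of this binary case (hypothetical helper name, base-2 logs assumed) evaluates $H(X)$ for a few values of $\theta$; the uncertainty peaks at $\theta = 0.5$ and vanishes at $\theta \in \{0, 1\}$:

```python
import numpy as np

def bernoulli_entropy(theta):
    """H(X) = -[theta*log2(theta) + (1-theta)*log2(1-theta)] for X ~ Bernoulli(theta)."""
    if theta in (0.0, 1.0):   # a degenerate variable carries no uncertainty
        return 0.0
    return -(theta * np.log2(theta) + (1 - theta) * np.log2(1 - theta))

for theta in [0.0, 0.1, 0.3, 0.5, 0.9, 1.0]:
    print(theta, round(bernoulli_entropy(theta), 3))
# entropy is 1 bit at theta = 0.5 and drops to 0 as theta approaches 0 or 1
```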
Info
The discrete distribution with maximum entropy is the uniform distribution
As a result, we can use entropy to quantify how unevenly the probability mass $p(X=k)$ is spread across the states $k$ of a random variable: the more concentrated the mass, the lower the entropy. A short numeric check follows below.
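To illustrate the maximum-entropy claim numerically, this sketch (reusing the assumed `entropy` helper from above) compares a uniform distribution over $K = 4$ states with a skewed one:

```python
import numpy as np

def entropy(p):
    """Same sketch as above: H(X) = -sum_k p_k * log2(p_k)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

K = 4
uniform = np.full(K, 1.0 / K)
skewed  = np.array([0.7, 0.1, 0.1, 0.1])
print(entropy(uniform))  # log2(4) = 2 bits, the maximum for K = 4 states
print(entropy(skewed))   # ~1.357 bits, lower because the mass is concentrated
```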
KL divergence (relative entropy)
One way to measure the dissimilarity of two probability distributions, $p$ and $q$, is the Kullback-Leibler divergence (KL divergence), also known as relative entropy:
$$KL(p||q) \triangleq \sum_{k=1}^K p_k \log\left(\frac{p_k}{q_k}\right)$$
$$KL(p||q) = \sum_{k=1}^K p_k \log(p_k) - \sum_{k=1}^K p_k \log(q_k)$$ $$KL(p||q) = -H(p) + H(p, q)$$ where $H(p, q)$ is called **cross entropy**
Tip
$$H(p, q) \triangleq \sum_{k=1}^K p_k \log\left(\frac{1}{q_k}\right) = - \sum_{k=1}^K p_k \log(q_k)$$
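Putting the last two identities together, here is a minimal sketch (helper names and base-2 logs are my assumptions, not from the source) that computes $KL(p||q)$ directly and via $H(p, q) - H(p)$:

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k * log2(p_k)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log2(q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k * log2(p_k / q_k)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))               # ~0.085 bits, > 0 since p != q
print(cross_entropy(p, q) - entropy(p))  # same value: KL(p||q) = H(p, q) - H(p)
```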
References
- Kevin Murphy, *Machine Learning: A Probabilistic Perspective* (Section 2.8, Information theory, p. 46)
- https://www.youtube.com/watch?v=YtebGVx-Fxw