$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\I}{\mathcal{I}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} \newcommand{\inv}1 \newcommand{\map}{\text{MAP}} \newcommand{\L}{\mathcal{L}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement: “… maximizing likelihood is equivalent to minimizing \( D_{KL}[P(. \vert \theta^{\ast}) \, \Vert \, P(. \vert \theta)] \), where \( P(. \vert \theta^{\ast}) \) is the true distribution and \( P(. \vert \theta) \) is our estimate …”. So here is an attempt to prove that.

\[\begin{align} D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] &= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \frac{P(x \vert \theta^*)}{P(x \vert \theta)} \right] \\[10pt] &= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \, P(x \vert \theta^*) - \log \, P(x \vert \theta) \right] \\[10pt] &= \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \, P(x \vert \theta^*) \right] - \mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \, P(x \vert \theta) \right] \\[10pt] \end{align}\]

The first term might look familiar: it is the negative entropy of \( P(x \vert \theta^*) \). It does not depend on the estimated parameter \( \theta \), however, so we can ignore it.
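To make this decomposition concrete, here is a minimal numerical sketch. The two univariate Gaussians standing in for \( P(x \vert \theta^*) \) and \( P(x \vert \theta) \) are purely illustrative assumptions, not anything from the text:

```python
# A minimal numerical check of the decomposition above, using two
# illustrative univariate Gaussians.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p_true = norm(loc=0.0, scale=1.0)    # P(x | theta*), the "true" distribution
p_model = norm(loc=1.0, scale=1.5)   # P(x | theta), our estimate

# First term:  E_{x ~ P*}[log P(x | theta*)]  (the negative entropy of P*)
term1, _ = quad(lambda x: p_true.pdf(x) * p_true.logpdf(x), -np.inf, np.inf)

# Second term: E_{x ~ P*}[log P(x | theta)]
term2, _ = quad(lambda x: p_true.pdf(x) * p_model.logpdf(x), -np.inf, np.inf)

kl_from_decomposition = term1 - term2

# Closed-form KL between two univariate Gaussians, for comparison
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 1.5
kl_closed_form = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_from_decomposition, kl_closed_form)  # both ~0.35
```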

Suppose we draw \( N \) samples \( x_i \sim P(x \vert \theta^*) \). Then, the Law of Large Numbers says that as \( N \) goes to infinity:

\[-\frac{1}{N} \sum_{i=1}^N \log \, P(x_i \vert \theta) \longrightarrow -\mathbb{E}_{x \sim P(x \vert \theta^*)}\left[\log \, P(x \vert \theta) \right]\]

which is the second term of the KL divergence above. Notice that:

\[\begin{align} -\frac{1}{N} \sum_{i=1}^N \log \, P(x_i \vert \theta) &= \frac{1}{N} \, \text{NLL} \\[10pt] &= c \, \text{NLL} \\[10pt] \end{align}\]

where NLL is the negative log-likelihood and \( c = \frac{1}{N} \) is a positive constant.
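Here is a small sanity check of this Law of Large Numbers step, reusing the same illustrative Gaussians as above: the sample average \( -\frac{1}{N} \sum_{i=1}^N \log P(x_i \vert \theta) \) should approach the exact expectation as \( N \) grows.

```python
# A small check of the LLN step above. Same illustrative Gaussians as before;
# none of these numbers come from the text.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true = norm(loc=0.0, scale=1.0)    # P(x | theta*)
p_model = norm(loc=1.0, scale=1.5)   # P(x | theta)

for n in [10, 1_000, 100_000]:
    x = p_true.rvs(size=n, random_state=rng)
    print(n, -np.mean(p_model.logpdf(x)))   # (1/N) * NLL

# Exact value of  E_{x ~ P*}[-log P(x | theta)]  for these two Gaussians
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 1.5
exact = 0.5 * np.log(2 * np.pi * s2**2) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2)
print("exact:", exact)   # ~1.77; the averages above converge to this value
```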

Hence, minimizing \( D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] \) is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.

Why does this matter, though? Because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the KL divergence between the true data distribution and our estimate. We can see MLE as a proxy for fitting our estimate to the true distribution, which cannot be done directly since the true distribution is unknown to us.
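As a closing toy experiment (a sketch under the assumption of a one-parameter family \( P(x \vert \theta) = \mathcal{N}(\theta, 1) \) with true parameter \( \theta^* = 2 \), all numbers illustrative), the \( \theta \) that minimizes the average NLL on samples from the true distribution is essentially the same \( \theta \) that minimizes the closed-form KL divergence:

```python
# Toy experiment: compare the theta minimizing the average NLL on samples
# from the true distribution with the theta minimizing the closed-form KL.
# The family N(theta, 1) and theta* = 2 are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
theta_star = 2.0
x = norm(loc=theta_star, scale=1.0).rvs(size=10_000, random_state=rng)

thetas = np.linspace(0.0, 4.0, 401)

# Average NLL of the samples under each candidate theta
avg_nll = np.array([-np.mean(norm(loc=t, scale=1.0).logpdf(x)) for t in thetas])

# Closed-form KL[N(theta*, 1) || N(theta, 1)] = (theta* - theta)^2 / 2
kl = (theta_star - thetas) ** 2 / 2

print("theta minimizing avg. NLL:", thetas[np.argmin(avg_nll)])  # ~2.0
print("theta minimizing KL      :", thetas[np.argmin(kl)])       # exactly 2.0
```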