$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} $$

Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement: “… maxmizing likelihood is equivalent to minimizing \( D_{KL}[P(. \vert \theta^{\ast}) \, \Vert \, P(. \vert \theta)] \), where \( P(. \vert \theta^{\ast}) \) is the true distribution and \( P(. \vert \theta) \) is our estimate …“. So here is an attempt to prove that.

If it looks familiar, the left term is the entropy of \( P(x \vert \theta^*) \). However it does not depend on the estimated parameter \( \theta \), so we will ignore that.

Suppose we sample \( N \) of these \( x \sim P(x \vert \theta^*) \). Then, the Law of Large Number says that as \( N \) goes to infinity:

which is the right term of the above KL-Divergence. Notice that:

where NLL is the negative log-likelihood and \( c \) is a constant.

Then, if we minimize \( D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] \), it is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.

Why does this matter, though? Because this gives MLE a nice interpretation: maximizing the likelihood of data under our estimate is equal to minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly as the real distribution is unknown to us.