Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement: “… maximizing likelihood is equivalent to minimizing $$D_{KL}[P(. \vert \theta^{\ast}) \, \Vert \, P(. \vert \theta)]$$, where $$P(. \vert \theta^{\ast})$$ is the true distribution and $$P(. \vert \theta)$$ is our estimate …”. So here is an attempt to prove that.

First, expand the KL-Divergence between the true distribution and our estimate:

$$D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] = \mathbb{E}_{x \sim P(x \vert \theta^*)}[\log P(x \vert \theta^*)] - \mathbb{E}_{x \sim P(x \vert \theta^*)}[\log P(x \vert \theta)]$$

If it looks familiar, the left term is the negative entropy of $$P(x \vert \theta^*)$$. However, it does not depend on the estimated parameter $$\theta$$, so we will ignore it and focus on the right term, $$-\mathbb{E}_{x \sim P(x \vert \theta^*)}[\log P(x \vert \theta)]$$.
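To make the decomposition concrete, here is a small numerical check (my own illustration, not from the book) that the KL-Divergence between two hypothetical discrete distributions splits into the negative entropy of the true distribution plus a cross-entropy term that depends on $$\theta$$:

```python
import numpy as np

# Hypothetical discrete distributions: the "true" P(. | theta*) and an estimate P(. | theta)
p_true = np.array([0.5, 0.3, 0.2])  # P(x | theta*)
p_est = np.array([0.4, 0.4, 0.2])   # P(x | theta)

# KL-Divergence: sum_x P*(x) * [log P*(x) - log P(x)]
kl = np.sum(p_true * (np.log(p_true) - np.log(p_est)))

# Decomposition: KL = (negative entropy of P*) + (cross-entropy of P* and P)
neg_entropy = np.sum(p_true * np.log(p_true))     # left term, independent of theta
cross_entropy = -np.sum(p_true * np.log(p_est))   # right term, depends on theta

assert np.isclose(kl, neg_entropy + cross_entropy)
```

Since the negative-entropy term is fixed, varying $$\theta$$ only moves the cross-entropy term, which is why we can ignore the former when minimizing.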

Suppose we sample $$N$$ of these $$x \sim P(x \vert \theta^*)$$. Then, the Law of Large Numbers says that as $$N$$ goes to infinity:

$$-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) \longrightarrow -\mathbb{E}_{x \sim P(x \vert \theta^*)}[\log P(x \vert \theta)]$$

which is the right term of the above KL-Divergence. Notice that:

$$-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) = \frac{1}{N} \, \text{NLL}(\theta) = c \, \text{NLL}(\theta)$$

where NLL is the negative log-likelihood and $$c = \frac{1}{N}$$ is a constant.
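As a sanity check of the Law of Large Numbers step, here is a short sketch (my own, with hypothetical unit-variance Gaussians standing in for $$P(x \vert \theta^*)$$ and $$P(x \vert \theta)$$) comparing the Monte Carlo average $$\frac{1}{N} \, \text{NLL}(\theta)$$ against the analytic expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true, mu_est = 0.0, 0.5  # hypothetical true and estimated means, unit variance
N = 1_000_000
x = rng.normal(mu_true, 1.0, size=N)  # x ~ P(x | theta*)

# Per-sample negative log-likelihood under the estimate: -log P(x_i | theta)
nll = 0.5 * np.log(2 * np.pi) + 0.5 * (x - mu_est) ** 2
mc_estimate = nll.mean()  # (1/N) * NLL(theta)

# Analytic E_{x ~ P*}[-log P(x | theta)] for unit-variance Gaussians
analytic = 0.5 * np.log(2 * np.pi) + 0.5 * (1.0 + (mu_true - mu_est) ** 2)

# By the Law of Large Numbers, the Monte Carlo average approaches the expectation
assert abs(mc_estimate - analytic) < 1e-2
```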

Then, minimizing $$D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)]$$ is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.

Why does this matter, though? Because this gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the divergence between our estimate and the true data distribution. We can see MLE as a proxy for fitting our estimate to the true distribution, which cannot be done directly because the true distribution is unknown to us.
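A minimal sketch of this interpretation, using a hypothetical Bernoulli model: the MLE is computed from samples alone, yet as $$N$$ grows it drives the KL-Divergence to the (never directly observed) true distribution toward zero:

```python
import numpy as np

rng = np.random.default_rng(42)

def kl_bernoulli(p, q):
    """KL-Divergence D_KL[Bernoulli(p) || Bernoulli(q)]."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p_true = 0.7  # hypothetical true parameter theta*
kls = []
for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, p_true, size=n)      # sample from P(x | theta*)
    p_mle = x.mean()                         # the Bernoulli MLE is the sample mean
    kls.append(kl_bernoulli(p_true, p_mle))  # divergence from the truth

# The divergence shrinks toward zero as N grows, even though fitting
# only ever touched the samples, never p_true itself.
```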