# Maximizing likelihood is equivalent to minimizing KL-Divergence

When reading Kevin Murphy’s book, I came across this statement: *“… maximizing likelihood is equivalent to minimizing* \( D_{KL}[P(. \vert \theta^{\ast}) \, \Vert \, P(. \vert \theta)] \)*, where \( P(. \vert \theta^{\ast}) \) is the true distribution and \( P(. \vert \theta) \) is our estimate …“*. So here is an attempt to prove that.

By definition,

\[ D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] = \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta^*) \right] - \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right] \, . \]

If it looks familiar, the left term is the negative entropy of \( P(x \vert \theta^*) \). It does not depend on the estimated parameter \( \theta \), so we will ignore it.

Suppose we sample \( N \) points \( x_i \sim P(x \vert \theta^*) \). Then, the Law of Large Numbers says that as \( N \) goes to infinity:

\[ \frac{1}{N} \sum_{i=1}^N \log P(x_i \vert \theta) \longrightarrow \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right] \, , \]

which is the right term of the above KL-Divergence. Notice that:

\[ D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] \approx -\frac{1}{N} \sum_{i=1}^N \log P(x_i \vert \theta) + c = \frac{1}{N} \mathrm{NLL}(\theta) + c \, , \]

where \( \mathrm{NLL}(\theta) = -\sum_{i=1}^N \log P(x_i \vert \theta) \) is the negative log-likelihood and \( c = \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta^*) \right] \) is a constant that does not depend on \( \theta \).

Hence, minimizing \( D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] \) with respect to \( \theta \) is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
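To make this concrete, here is a small numerical sketch (the Bernoulli model, the grid search, and all variable names are my own additions, not part of the derivation above): for a known true parameter we can evaluate both the average NLL and the exact KL-Divergence over a grid of candidate parameters and check that they are minimized at roughly the same \( \theta \).

```python
# Sketch: verify numerically that the NLL minimizer matches the KL minimizer
# for a Bernoulli model (this setup is illustrative, not from the post).
import math
import random

random.seed(0)

theta_star = 0.7  # true parameter of P(x | theta*)
N = 100_000
xs = [1 if random.random() < theta_star else 0 for _ in range(N)]

def avg_nll(theta):
    """(1/N) * NLL(theta) under the Bernoulli likelihood."""
    return -sum(x * math.log(theta) + (1 - x) * math.log(1 - theta)
                for x in xs) / N

def kl(theta):
    """Exact D_KL[P(. | theta*) || P(. | theta)] for Bernoulli distributions."""
    return (theta_star * math.log(theta_star / theta)
            + (1 - theta_star) * math.log((1 - theta_star) / (1 - theta)))

grid = [i / 100 for i in range(1, 100)]   # candidate thetas in (0, 1)
theta_nll = min(grid, key=avg_nll)        # minimizer of the average NLL
theta_kl = min(grid, key=kl)              # minimizer of the KL-Divergence
print(theta_nll, theta_kl)                # both should be close to 0.7
```

With a large \( N \), the two minimizers coincide on the grid, as the derivation predicts.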

Why does this matter, though? Because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the divergence between our estimate and the true data distribution. We can see MLE as a proxy for fitting our estimate to the true distribution, which cannot be done directly as the true distribution is unknown to us.