Deriving the normal equation

The normal equation is a well-known, exact solution to the least squares problem (or linear regression, in ML jargon). Let \( w \) be the learnable weights, \( X \) the data matrix, and \( y \) the target vector; then the equation has the form:
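\[
w = (X^T X)^{-1} X^T y
\]

(assuming \( X^T X \) is invertible, i.e. the columns of \( X \) are linearly independent).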

In this post, we will derive that equation using calculus.

Recall that the objective of least squares is to minimize the squared distance between the prediction and the ground truth, i.e. the loss \( L(w) = \frac{1}{2} \Vert y - Xw \Vert^2 \) (half the sum of squared errors; minimizing it is equivalent to minimizing the mean squared error). The standard recipe from calculus is to take the derivative and set it to zero. But first, let's expand the loss:
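\[
\frac{1}{2} \Vert y - Xw \Vert^2 = \frac{1}{2} (y - Xw)^T (y - Xw) = \frac{1}{2} \left( y^T y - 2 w^T X^T y + w^T X^T X w \right)
\]

where we used the fact that \( y^T X w \) is a scalar, so \( y^T X w = (y^T X w)^T = w^T X^T y \).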

Now we can take the derivative with respect to \( w \) and set it to zero:
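\[
\frac{\partial}{\partial w} \left[ \frac{1}{2} \left( y^T y - 2 w^T X^T y + w^T X^T X w \right) \right] = \frac{1}{2} \left( -2 X^T y + 2 X^T X w \right) = X^T X w - X^T y = 0
\]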

Above, \( y^T y \) vanishes because it has no dependence on \( w \); the derivative of \( -2 w^T X^T y \) is \( -2 X^T y \), just as a linear term \( a w \) differentiates to \( a \); and the derivative of \( w^T X^T X w \) is \( 2 X^T X w \) because \( X^T X \) is symmetric, analogous to the scalar rule \( \frac{d}{dw} w^2 = 2w \).

For the final step, we just need to rearrange:
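\[
X^T X w = X^T y \quad \Longrightarrow \quad w = (X^T X)^{-1} X^T y
\]

which is exactly the normal equation we started with.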

And we are done.
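As a quick sanity check (not part of the derivation), here is a minimal NumPy sketch on synthetic data: it solves the normal equation and compares the result against NumPy's built-in least squares solver. The data shapes, variable names, and noise level are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 3 features, plus a little noise.
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Normal equation: solve (X^T X) w = X^T y.
# np.linalg.solve is preferred over forming the explicit inverse for numerical stability.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```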