Overview

The normal equation is an alternative to the gradient descent algorithm for minimizing the cost function $J$.

Unlike gradient descent, the normal equation needs no iterative computation; it directly solves for the stationary point of the cost function. Since it is an analytical solution, it also has some limitations. Below is a comparison of the two methods:

| Gradient Descent | Normal Equation |
| --- | --- |
| Needs to choose a learning rate $\alpha$ | No learning rate needed |
| Requires many iterations | One-shot calculation |
| Works well even when $n$ is large | Slow if $n$ is very large (inverting $X^TX$ costs roughly $O(n^3)$) |
| Works for many kinds of models | Only suitable for linear regression |

Method

Say the dataset has $m$ examples, and each example contains $n$ features:

Put all features into an $m \times (n+1)$ matrix (one extra column for $x_0$, which always equals 1): $$ X=\begin{bmatrix}1&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n\end{bmatrix} $$ where the cost function $J$ is: $$ J(\theta)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 $$
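As a concrete illustration (not part of the original derivation), here is a minimal NumPy sketch that builds the design matrix $X$ with the extra $x_0=1$ column and evaluates $J(\theta)$; the toy values and the names `features`, `y`, and `cost` are assumptions made for this example.

```python
import numpy as np

# Toy data: m = 4 examples, n = 2 features (values invented for illustration).
features = np.array([[2.0, 3.0],
                     [1.0, 5.0],
                     [4.0, 2.0],
                     [3.0, 3.0]])          # shape (m, n)
y = np.array([10.0, 12.0, 9.0, 11.0])      # shape (m,)

m = features.shape[0]

# Design matrix: prepend a column of ones for x_0, giving shape (m, n+1).
X = np.hstack([np.ones((m, 1)), features])

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))
```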

$$ = \frac{1}{2m}\sum_{i=1}^m\left(\theta^Tx^{(i)}-y^{(i)}\right)^2 \quad \href{https://duanwenbo.github.io/2021/04/01/ML-Learning-Note—polynomial-regression-and-feature-scaling/#more}{\ previous\ derivation} $$

$$ = \frac{1}{2m}(X\theta-Y)^T(X\theta-Y) \quad \href{https://blog.csdn.net/melon__/article/details/80589759}{\ further\ derivation} $$
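A quick numerical check (continuing the sketch above) confirms that the summation form and the matrix form of $J(\theta)$ agree; `theta` here is an arbitrary parameter vector chosen just for the comparison.

```python
theta = np.array([1.0, 0.5, 2.0])  # arbitrary parameters, shape (n+1,)

# Summation form: 1/(2m) * sum_i (theta^T x^(i) - y^(i))^2
J_sum = sum((theta @ X[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

# Matrix form: 1/(2m) * (X theta - y)^T (X theta - y)
J_mat = (X @ theta - y) @ (X @ theta - y) / (2 * m)

assert np.isclose(J_sum, J_mat)
```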

To find the minimizing solution, set the gradient to zero: $$ \nabla_{\theta}J(\theta)=0 $$ Solving for $\theta$ gives: $$ \theta=(X^TX)^{-1}X^TY $$
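In code, the closed-form solution is a single line (again continuing the sketch above). This is only one way to compute it: `np.linalg.solve` is used here instead of an explicit matrix inverse because solving the linear system is numerically safer.

```python
# theta = (X^T X)^{-1} X^T y, computed by solving the linear system rather than inverting.
theta_best = np.linalg.solve(X.T @ X, X.T @ y)

# Literal translation of the formula (works, but less numerically stable):
# theta_best = np.linalg.inv(X.T @ X) @ X.T @ y
```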

Non-invertible?

Since the solution inverts $X^TX$, what happens if that matrix is non-invertible (singular)? Two common situations cause non-invertibility in the normal equation:

  • When two features are linearly dependent, e.g., the same length expressed in meters and in inches.

    Solution: remove one of them.

  • When the number of features is greater than (or equal to) the number of examples, so $X$ has more columns than rows.

    Solution: collect more examples, or remove some of the features (see the sketch below for a pseudo-inverse fallback).
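Beyond the two fixes above, a common practical fallback when $X^TX$ is singular (or nearly so) is to use a pseudo-inverse; the following is a sketch of that workaround, not part of the normal-equation derivation itself.

```python
# Moore-Penrose pseudo-inverse: still returns a least-squares solution
# even when X^T X is singular.
theta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalently, solve the least-squares problem on X directly:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```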