Overview

The normal equation is an alternative to the gradient descent algorithm for minimizing the cost function $J$.

Unlike gradient descent, the normal equation needs no iterative computation; it directly solves for the stationary point of the cost function. Since it is an analytical solution, it also has some limitations. Below is a comparison of the two methods:

| Gradient Descent | Normal Equation |
| --- | --- |
| Needs to choose a learning rate $\alpha$ | No learning rate needed |
| Requires many iterations | One-shot calculation |
| Works well even when $n$ is large | Slow if $n$ is very large (inverting $X^TX$ costs roughly $O(n^3)$) |
| Works for many kinds of models | Only suitable for linear regression |

Method

Say the dataset has $m$ examples, and each example contains $n$ features:

Put all features into an $m \times (n+1)$ matrix (one extra column for $x_0$, which always equals 1): $$ X=\begin{bmatrix}1&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n\end{bmatrix} $$ where the cost function $J$ is: $$ J(\theta)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 $$
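As a concrete illustration (not part of the original derivation), here is a minimal NumPy sketch that builds the design matrix $X$ with the extra $x_0=1$ column and evaluates $J(\theta)$; the toy values and the names `features`, `y`, and `cost` are assumptions made for this example.

```python
import numpy as np

# Toy data: m = 4 examples, n = 2 features (values invented for illustration).
features = np.array([[2.0, 3.0],
                     [1.0, 5.0],
                     [4.0, 2.0],
                     [3.0, 3.0]])          # shape (m, n)
y = np.array([10.0, 12.0, 9.0, 11.0])      # shape (m,)

m = features.shape[0]

# Design matrix: prepend a column of ones for x_0, giving shape (m, n+1).
X = np.hstack([np.ones((m, 1)), features])

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return residuals @ residuals / (2 * len(y))
```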

$$ = \frac{1}{2m}\sum_{i=1}^m\left(\theta^Tx^{(i)}-y^{(i)}\right)^2 \quad \href{https://duanwenbo.github.io/2021/04/01/ML-Learning-Note—polynomial-regression-and-feature-scaling/#more}{\ previous\ derivation} $$

$$ = \frac{1}{2m}(X\theta-Y)^T(X\theta-Y) \quad \href{https://blog.csdn.net/melon__/article/details/80589759}{\ further\ derivation} $$
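A quick numerical check (continuing the sketch above) confirms that the summation form and the matrix form of $J(\theta)$ agree; `theta` here is an arbitrary parameter vector chosen just for the comparison.

```python
theta = np.array([1.0, 0.5, 2.0])  # arbitrary parameters, shape (n+1,)

# Summation form: 1/(2m) * sum_i (theta^T x^(i) - y^(i))^2
J_sum = sum((theta @ X[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

# Matrix form: 1/(2m) * (X theta - y)^T (X theta - y)
J_mat = (X @ theta - y) @ (X @ theta - y) / (2 * m)

assert np.isclose(J_sum, J_mat)
```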

To find the minimizing solution, set the gradient to zero: $$ \nabla_{\theta}J(\theta)=0 $$ Solving for $\theta$ gives: $$ \theta=(X^TX)^{-1}X^TY $$
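In code, the closed-form solution is a single line (again continuing the sketch above). This is only one way to compute it: `np.linalg.solve` is used here instead of an explicit matrix inverse because solving the linear system is numerically safer.

```python
# theta = (X^T X)^{-1} X^T y, computed by solving the linear system rather than inverting.
theta_best = np.linalg.solve(X.T @ X, X.T @ y)

# Literal translation of the formula (works, but less numerically stable):
# theta_best = np.linalg.inv(X.T @ X) @ X.T @ y
```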

Non-invertible?

Since the solution inverts $X^TX$, what happens if that matrix is non-invertible (singular)? Two common situations cause non-invertibility in the normal equation:

  • When two features are linearly dependent, e.g., the same length expressed in meters and in inches.

    Solution: remove one of them.

  • When the number of features is greater than (or equal to) the number of examples, so $X$ has more columns than rows.

    Solution: collect more examples, or remove some of the features (see the sketch below for a pseudo-inverse fallback).
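Beyond the two fixes above, a common practical fallback when $X^TX$ is singular (or nearly so) is to use a pseudo-inverse; the following is a sketch of that workaround, not part of the normal-equation derivation itself.

```python
# Moore-Penrose pseudo-inverse: still returns a least-squares solution
# even when X^T X is singular.
theta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalently, solve the least-squares problem on X directly:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```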