Normal Equation

asrın öztin
4 min read · Oct 1, 2021

In machine learning, various optimization techniques can be used to reduce the error and thus increase accuracy. In this article, we will discuss optimizing machine learning models with the normal equation method; to understand it, we should first take a look at the concept of the cost function.

Cost Function

The cost function, although it comes in different variations (see MAE, RMSE, MSE), basically takes two inputs (y_real, y_predicted). It allows us to measure the error, in other words the difference between the actual output values and the predicted output values of a machine learning model.

Squared error cost function formula (based on linear regression with one variable):

J(θ₀, θ₁) = (1 / 2m) · Σᵢ₌₁..ₘ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

As the formula shows, summing the squares of the differences between the y values predicted by the hypothesis function and the actual y values (averaged over the m examples, with the conventional factor of 1/2) gives us the squared error cost function.
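To make this concrete, here is a minimal NumPy sketch of the squared error cost function; the function and variable names (compute_cost, theta, and the leading column of ones in X) are illustrative choices, not taken from the article.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared error cost J(theta) for linear regression.

    X     : (m, n) design matrix (first column of ones for the intercept)
    y     : (m,)   actual output values
    theta : (n,)   hypothesis parameters
    """
    m = len(y)
    predictions = X @ theta    # h_theta(x) for every example
    errors = predictions - y   # predicted minus actual outputs
    return (errors @ errors) / (2 * m)
```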

If you want to learn more about this concept, you can find my article on the subject via the link below. However, the information given here is sufficient to understand the normal equation.

Gradient Descent

I’ve mentioned the gradient descent algorithm, which minimizes the cost function iteratively, in the article linked below. The normal equation is a method of minimizing the cost function analytically, eliminating the iterative operations of gradient descent. Although gradient descent is not a prerequisite for understanding the normal equation, it is useful to mention it briefly for comparison.

The gradient descent algorithm, written over the cost function of the univariate (for the sake of simplicity) linear regression model, is as follows:

repeat until convergence:
θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ    (simultaneously for j = 0 and j = 1)

where α is the learning rate.

Gradient descent iterates until the derivative at the current point, that is, the slope, is zero (the minimum point). The coordinates of this minimum point on the cost function graph give the parameter values of the hypothesis function with the least error.
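For later comparison with the normal equation, here is a minimal NumPy sketch of these iterative updates; the learning rate alpha and the iteration count n_iters are illustrative values, not taken from the article.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Iteratively move theta against the slope of the cost function."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = (X.T @ (X @ theta - y)) / m   # partial derivatives of J
        theta -= alpha * gradient                # step towards the minimum
    return theta
```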

Although this information about gradient descent is sufficient to compare it with the normal equation, you can also find my article on the subject via the link below for more detail.

Normal Equation

To obtain the hypothesis-function parameters with the least error analytically rather than iteratively (as gradient descent does), it is sufficient to apply the following normal equation formula.

Normal equation formula:

θ = (Xᵀ X)⁻¹ Xᵀ y

Let’s go through multivariate linear regression to understand the formula with a simple example.

In the figure, we see a data set with 4 variables (x1, x2, x3, x4) and known output values (y) (see supervised learning). When we represent this data set as matrices, the X matrix holds the variables in our data set and the y matrix holds the output values. When the normal equation is computed over these X and y matrices, the resulting theta vector gives us the hypothesis-function parameters of the model with the least error and therefore the highest accuracy.
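As a minimal sketch of that computation, the snippet below builds a small synthetic data set with 4 features and applies the formula directly; the synthetic numbers, the random seed, and the leading column of ones for the intercept are illustrative assumptions, not the article’s figure.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 4                                   # 20 observations, 4 features (x1..x4)
features = rng.normal(size=(m, n))
X = np.column_stack([np.ones(m), features])    # leading column of ones = intercept term
true_theta = np.array([2.0, 1.0, -3.0, 0.5, 4.0])
y = X @ true_theta + rng.normal(scale=0.1, size=m)   # outputs with a little noise

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)   # recovers values close to true_theta
```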

So what can be done if the “X transpose times X” term in the formula is a non-invertible matrix?

  1. Feature reduction, a method also used to simplify the data set, may be applied. If variable x1 is the area of a house in square meters and variable x2 is the area of the same house in square feet, it makes sense to drop one of them.
  2. If the number of observations (a.k.a. m) is less than or equal to the number of features (a.k.a. n), some features can be deleted and/or regularization can be applied (a practical workaround is sketched below).
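As a practical aside not covered in the article, many numerical libraries sidestep a singular Xᵀ X with the Moore–Penrose pseudo-inverse; the duplicated-feature example below is an illustrative assumption.

```python
import numpy as np

# Illustrative singular case: x2 is just x1 in different units
# (square meters vs. square feet), so the columns are linearly dependent.
x1 = np.array([50.0, 80.0, 120.0, 200.0])   # area in m^2
x2 = x1 * 10.7639                            # same area in ft^2
X = np.column_stack([np.ones(4), x1, x2])
y = np.array([150.0, 230.0, 330.0, 540.0])

# np.linalg.inv(X.T @ X) is unreliable here; the pseudo-inverse still
# returns a usable least-squares theta (equivalently: np.linalg.pinv(X) @ y).
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```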

Both methods pursue the same goal of reaching the minimum point of the cost function. In some cases it is useful to choose gradient descent (for example, when the number of features is very large, since inverting Xᵀ X becomes expensive), while in other cases the normal equation is the better choice (no learning rate to tune and no iterations to run).
