Cost (Loss) Function in Machine Learning
Although the cost function comes in different variations, it essentially takes two inputs (y_real, y_predicted) and lets us measure the error of a machine learning model, that is, the difference between the actual output values and the predicted output values.
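As a minimal sketch of this idea (the function name and sample arrays below are illustrations of mine, and the mean absolute difference shown is just one of the variations mentioned):

```python
import numpy as np

def cost(y_real, y_predicted):
    """Measure the error between actual and predicted outputs.

    This sketch aggregates with the mean absolute difference; other
    variations (MSE, RMSE, ...) only change how the differences are
    combined.
    """
    return np.mean(np.abs(y_real - y_predicted))

# A perfect model has zero cost; errors push the cost up.
print(cost(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))  # 0.0
print(cost(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5])))  # ~0.33
```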
Since this article focuses on the underlying logic rather than detailed mathematical calculations, let's keep things simple and examine the subject through the linear regression model.
The figure shows the general logic of supervised algorithms, and linear regression, which is one of them, works the same way. In short, the aim is to build a hypothesis (prediction) function that produces the least error and therefore the highest accuracy.
In linear regression with one variable (kept to one for the sake of simplicity), the hypothesis function is expressed, and can be visualized, as follows.
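In the usual notation, this one-variable hypothesis is simply a straight line: hθ(x) = θ₀ + θ₁x, where θ₀ is the intercept (the constant mentioned later) and θ₁ is the slope learned from the data; plotting hθ(x) against x visualizes the model as a line through the training points.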
Although there are other variants of the cost function, as noted at the very beginning (see MAE, RMSE, MSE), in this article we will use the squared error function, one of the standard cost calculations and an effective choice for many regression problems.
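In the usual notation, this cost is J(θ₀, θ₁) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where m is the number of training examples. A minimal Python sketch (the function name is mine, not from the article):

```python
import numpy as np

def squared_error_cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) for one-variable linear regression.

    The 1/(2m) factor averages the squared errors; the extra 1/2 is a
    convention that cancels neatly when J is differentiated for gradient
    descent.
    """
    m = len(y)
    predictions = theta0 + theta1 * x  # hypothesis h(x) = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)
```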
Since the aim is to find the most accurate model, our main goal is to minimize the cost function, that is, the error.
As the image shows, the model should use the optimal theta values of the cost function J, i.e. the theta values at the point where the error is minimal. To visualize this correctly in 2D, let's simplify the function by fixing theta zero (the constant) at 0.
As seen in the figure, we start the calculation by setting theta 1 to a (randomly chosen) value of 0.5. Calculating the error gives approximately 0.58, so we mark the point (0.5, 0.58) on the graph.
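The figure's training data is not reproduced in the text, but assuming the three points (1, 1), (2, 2), (3, 3), which do reproduce the ~0.58 value quoted above, the calculation looks like this:

```python
import numpy as np

# Hypothetical training data: the best fit for these points is theta1 = 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

theta1 = 0.5  # the (randomly chosen) starting value
m = len(y)
predictions = theta1 * x  # theta0 is fixed at 0, as in the 2D simplification
J = np.sum((predictions - y) ** 2) / (2 * m)
print(round(J, 2))  # 0.58 -> the point (0.5, 0.58) on the graph
```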
The cost function we showed in 2D becomes a 3D, bowl-shaped surface when theta zero (the constant) is not fixed at 0.
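As a sketch of how that bowl can be drawn with matplotlib (the dataset is again the assumed one from the previous snippet; any dataset produces the same bowl shape):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

# Evaluate J over a grid of (theta0, theta1) pairs.
T0, T1 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-1, 3, 100))
J = sum((T0 + T1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(T0, T1, J, cmap="viridis")
ax.set_xlabel("theta0"); ax.set_ylabel("theta1"); ax.set_zlabel("J")
plt.show()
```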
Although there are other methods for finding this minimum point (B in the figure), such as the normal equation, the gradient descent algorithm finds the local minimum, in other words the point where the error is minimal, much faster when the feature count n is large, e.g. n > 10⁶.
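A minimal sketch of that algorithm for this cost function (the learning rate, iteration count, and dataset below are illustrative choices of mine, not values from the article):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Minimize the squared error cost by repeatedly stepping both thetas
    in the direction of steepest descent."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y
        # Partial derivatives of J with respect to theta0 and theta1,
        # computed before updating so both thetas move simultaneously.
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(x, y))  # converges toward (0.0, 1.0) for this data
```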
You can check out my articles below to learn more about gradient descent and normal equation.