Gradient Descent is one of the most widely used optimization algorithms for training machine learning models, and it is employed extensively in both machine learning and deep learning.
APPLICATION: To estimate the values of a function's parameters so that the value of the cost function is minimized.
Let’s study the cost function in more detail:
We place θ0 on the x-axis, θ1 on the y-axis, and the cost function J(θ0, θ1) on the z-axis. The graph below depicts this surface:
Our aim here is to somehow reach the minima of the cost function, i.e. the pits of the graph where the black marked lines end.
How do we achieve this?
We achieve this by taking the derivative of the cost function. The slope of the tangent at a point represents the derivative at that point. We take steps down the graph in the direction of steepest descent, and eventually we reach a minimum of the graph.
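The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code; it assumes a simple one-parameter cost J(θ) = θ², whose derivative is 2θ:

```python
# Minimal sketch of gradient descent on an assumed cost J(theta) = theta**2.
def gradient_descent(theta, alpha=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * theta              # derivative of J at the current point
        theta = theta - alpha * grad  # step in the direction of steepest descent
    return theta

theta_min = gradient_descent(theta=5.0)  # ends up very close to 0, the minimum
```

Each iteration evaluates the slope at the current point and moves against it, exactly as the text describes.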
The learning rate parameter (α) determines the size of each step we take along the direction of steepest descent.
In the above example, the distance between each ‘star’ represents the step taken, determined by α. The greater the learning rate (α), the larger the step taken.
NOTE: Depending on where we start, we could end up at two different minima (local optima) as shown in the above figure.
[ NOTE: We know that a = b is a truth assertion that checks whether a is equal to b, whereas a := b is an assignment operation in which the value of b is assigned to a. ]
Here we have to simultaneously update θ0 and θ1 in the following manner:
temp0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
temp1 := θ1 − α · ∂J(θ0, θ1)/∂θ1
θ0 := temp0
θ1 := temp1
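A sketch of the simultaneous update in Python may make the point concrete. The hypothesis h(x) = θ0 + θ1·x, the toy data, and the choice of α here are all assumptions for illustration, not something stated in the article:

```python
# Sketch of one simultaneous update of theta0 and theta1, assuming a linear
# hypothesis h(x) = theta0 + theta1 * x and a mean-squared-error cost.
def step(theta0, theta1, x, y, alpha):
    m = len(x)
    errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
    grad0 = sum(errors) / m
    grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
    # Both gradients are computed BEFORE either parameter is assigned:
    # this is the simultaneous update. Updating theta0 first and then
    # computing grad1 would use the wrong (already-moved) theta0.
    return theta0 - alpha * grad0, theta1 - alpha * grad1

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data: y = 2x
t0, t1 = 0.0, 0.0
for _ in range(2000):
    t0, t1 = step(t0, t1, x, y, alpha=0.1)
# t0 approaches 0 and t1 approaches 2, recovering y = 2x
```

Returning both new values in one tuple assignment mirrors the temp0/temp1 pattern in the update rule.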
Now, let us mathematically analyse the simultaneous update of function parameters (θ0 & θ1).
Let us consider one parameter (θ1) for the time being. We will plot the cost function J(θ1) against the parameter under consideration (θ1).
We have to repeat this formula until the cost function reaches its minimum value.
In the above graph, the slope is negative until θ1 reaches the value that minimizes the cost. Moving further towards the right of the minimum, the slope becomes positive and increases as θ1 increases.
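This sign behaviour is what makes the update rule self-correcting, and it is easy to check numerically. The cost J(θ1) = (θ1 − 3)² used here is an assumed example with its minimum at θ1 = 3:

```python
# Numeric check of the slope's sign around a minimum, assuming the
# example cost J(theta1) = (theta1 - 3)**2 (minimum at theta1 = 3).
def slope(theta1):
    return 2 * (theta1 - 3)   # derivative of (theta1 - 3)**2

left = slope(1.0)    # negative: the update theta1 := theta1 - alpha*slope
                     # SUBTRACTS a negative number, pushing theta1 right
right = slope(5.0)   # positive: the update pushes theta1 left
```

In both cases the step moves θ1 toward the minimum, which is why the same formula works on either side of it.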
The value of the learning rate parameter (α) should be chosen so that the gradient descent algorithm converges in a reasonable amount of time: too small a value makes convergence slow, while too large a value can overshoot the minimum and fail to converge at all.
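The effect of the choice of α can be sketched on the same assumed cost J(θ) = θ² (gradient 2θ); the three α values below are illustrative, not from the article:

```python
# Effect of the learning rate alpha on the assumed cost J(theta) = theta**2.
def run(alpha, theta=1.0, steps=50):
    for _ in range(steps):
        theta -= alpha * 2 * theta   # gradient descent update
    return theta

small = run(alpha=0.01)   # converges, but slowly
good  = run(alpha=0.1)    # converges quickly
big   = run(alpha=1.1)    # overshoots: |theta| grows every step and diverges
```

With α = 1.1 each update multiplies θ by (1 − 2.2) = −1.2, so the iterate oscillates with growing magnitude instead of settling into the minimum.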
How does gradient descent converge when the learning rate parameter is fixed?
The convergence should happen in such a way that dJ(θ1)/dθ1 approaches zero at the bottom of the cost function.
At the minimum we get dJ(θ1)/dθ1 = 0, so the update becomes θ1 := θ1 − α · 0 = θ1, and θ1 stops changing.
Gradient descent can converge to a local minimum even with a fixed value of the learning rate parameter (α), because the steps taken by gradient descent automatically decrease in size as we approach the local minimum: the derivative shrinks, so α times the derivative shrinks too.
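The shrinking steps can be observed directly, again on the assumed example cost J(θ) = θ²:

```python
# Sketch showing why a FIXED alpha still yields shrinking steps near the
# minimum, assuming the example cost J(theta) = theta**2.
theta, alpha = 4.0, 0.1
step_sizes = []
for _ in range(5):
    grad = 2 * theta                    # derivative shrinks as theta -> 0
    step_sizes.append(abs(alpha * grad))  # so does the step alpha * grad
    theta -= alpha * grad
# step_sizes is strictly decreasing even though alpha never changed
```

No schedule for α is needed in this example; the geometry of the cost function alone makes each step smaller than the last.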
If you liked the story and want to show your appreciation, you can clap as much as you like. Support our work with a constructive comment, and you can also connect with us on…
Website : https://www.societyofai.in/