Gradient Descent — An Explanation
Gradient Descent is one of the most widely used optimization algorithms out there. Ease of understanding and the its applicability to various ML algorithms makes it an attractive choice. How does it work though? Let’s find out!
We will go through various sections mentioned below to understand this algorithm. By the end of this article you will have a strong idea on what the algorithm is and what it does.
- What is Gradient Descent?
- How does it work?
- Key Parameter
- Understanding the result of the algorithm
1. What is Gradient Descent?
The formal definition of gradient descent is as follows
Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
In simple words, gradient descent is an algorithm that finds the point where the error of the function to which it is applied is close to zero. In ML, gradient descent is widely used to find the value of the coefficients of a cost function where the error is close to zero. By doing this, we an find the optimal values for the coefficients to minimize the overall cost.
2. How does it work?
To first understand the working of the algorithm, we need to understand what a gradient is. In mathematical terms, a gradient is the slope of a function. To put it in simple terms, a gradient measures the change of a parameter with respect to another one. In ML, a gradient is the measure of change in weights with respect to the change in error.
Now onto the working of the algorithm. Now imagine trying to reach the top of a mountain with minimum number of steps. At first, we take large steps to reduce our count. As we near the peak, we take small steps incrementally to reach the peak. Gradient descent works the same way, Well almost! One slight change is that the algorithm starts at the top of the mountain. It tries to climb downhill while trying to find the point that brings the error/cost close to zero.
As we “climb down” the hill or reach the local minimum, our steps become smaller. This is because the slope of the function keeps reducing and hence our steps as well. Since its an iterative algorithm, this process happens until we reach the local minimum or close to it.
Before we initiate the algorithm, we assign random values as weights. Gradient Descent then starts at at the top and after certain number of iterations reaches the local minimum.
3. Key Parameter
One key parameter that influences our algorithm is the learning rate. The size of the steps taken in the direction of the local minimum is determined by the learning rate. This also determines how fast/slow we move towards the minimum.
An optimal value for the learning rate is essential. If the value is too high, we may never reach the minimum and the algorithm bounces back and forth. If it is too low, the algorithm takes a while to reach the minimum making it inefficient. Hence the learning rate is very important and must be set to an appropriate value. This value can be determined by trial and error method and by plotting a graph.
4. Understanding the result of the algorithm
A good way to make sure that the algorithm is working is to plot a graph between the number of iterations and the cost. After every iteration, the cost must decrease. When the cost cannot be decreased any further i.e. when the algorithm has reached the local minimum, the line becomes a plateau and remains the same for the subsequent iterations.
Gradient Descent is one of the most versatile and flexible algorithms out there. Proper learning rate and good understanding of the algorithm can yield great results for your model.