Neural Networks-Part(4): Gradient Descent

Aamir Ahmad Ansari
6 min read · Oct 28, 2021

1. Introduction

What do you make of it? Of course, it's a Shutterstock image, but what else? A climber climbing a mountain along the path best suited to him. Now reverse the situation: the climber is descending from the top of the mountain along the best-suited path to the bottom. With that picture in mind, we already know what gradient descent is. Instead of the climber it's your gradient, and instead of the mountain it's your loss surface over the weights of our network. Isn't it simple and elegant? For convenience, let's have a visual of what we mean.

Figure 1: Understanding Gradient Descent

This is a 2D representation of loss versus weight; for simplicity, we assume we have a single neuron that receives a single input.

Throughout machine learning, we have always tried to build a model with minimum loss, and we did that by finding the corresponding optimal weights. We searched for a local or global minimum of the loss function, and that is what we do in every model to get good results. The goal here is the same; what changes is the method. Recall that in linear regression we had an analytical solution for the optimal weights using matrix operations, but in neural networks an analytical solution is not possible, so we use optimizers, as we did in logistic regression. The first optimizer we propose is the one we are discussing: the gradient descent optimizer.
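For a quick contrast, here is a minimal sketch of that analytical route for linear regression, the normal equation, on made-up data (the data and variable names here are mine, purely for illustration):

import numpy as np

# Made-up data: 5 samples, 2 features
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.], [5., 5.]])
y = np.array([3., 3., 7., 7., 10.])

# Normal equation: w = (X^T X)^(-1) X^T y — a closed-form solution,
# something neural networks do not have
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)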

How do we solve the problem of getting the optimal weights using gradient descent then? you ask. I'll tell you, but before that let's be clear on what a gradient is:

‘Gradient, in mathematics, a differential operator applied to a three-dimensional vector-valued function to yield a vector whose three components are the partial derivatives of the function with respect to its three variables. The symbol for gradient is ∇.’ — Gradient, Mathematics, Britannica

Here we are considering only a single weight, hence one dimension, and the partial derivative of the loss with respect to the weight (W) will be our gradient. The gradient calculated at a point on the curve always points in the direction where the slope increases, and we are going to exploit this in gradient descent.
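To see this concretely, here is a tiny sketch (my own illustration) that approximates the gradient of a simple one-dimensional loss with a central finite difference; a positive gradient says the loss rises to the right, so descent should step left:

def loss(w):
    return w**2  # a simple bowl-shaped loss, just for illustration

def numerical_gradient(f, w, h=1e-6):
    # Central finite difference: (f(w+h) - f(w-h)) / (2h)
    return (f(w + h) - f(w - h)) / (2 * h)

print(numerical_gradient(loss, 1.0))  # ~2.0: the slope increases to the right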

We start at a random position, for example w1 in Figure 1. Notice that the gradient points in the direction of increasing loss, towards the peak, so we move in the opposite direction and update our weight accordingly. We then recalculate the gradient at the new point and again step against it. Repeating this leads us to a local minimum, which in Figure 1 is also the global minimum.

Figure 2: Working of Gradient Descent

Till here, we’ve got a basic idea of what gradient descent is and how it works. Now, let’s take a deep dive and understand it better.

We know we have to move in the gradient's opposite direction, but by how much: should we take a big leap or small steps? This is the most important question here. We answer it with a learning rate, which controls the step size and hence how much we learn at each step. A very small learning rate means very small steps and little learning per step; a large learning rate means large steps and a lot of learning per step. We want a learning rate that is neither too small nor too large. We'll discuss this more, but first let's see how we update our weights with the help of the learning rate and the gradient.

Equation 1: W_new = W_prev − η · (∂L/∂W), where η (eta) is the learning rate and ∂L/∂W is the gradient of the loss at the current weight.

It should now be clearer that with a small learning rate the update decrement is small, and hence the convergence of our loss is slow, while a large value causes jumps that result in oscillations, and we may never reach the minimum. Standard learning rates lie in the range [1e-6, 1], and the optimal one varies with the problem.
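As a rough illustration of those two failure modes, consider a toy quadratic loss w² with gradient 2w (this toy example is mine, not the loss we build below): a small learning rate crawls towards the minimum, while a too-large one oscillates and diverges.

def grad(w):
    return 2 * w  # gradient of the toy loss w**2

for eta in (0.05, 1.1):
    w = 5.0
    for _ in range(5):
        w = w - eta * grad(w)  # the update from Equation 1
    print(f'eta={eta}: w after 5 steps = {w:.3f}')
# eta=0.05 shrinks w slowly towards 0; eta=1.1 flips its sign and grows it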

We've configured the step size, but we also have to stop at the minimum, and we do that using a tolerance: a threshold value for stopping the update process. If the absolute difference between the previous weight and the newly updated weight is less than that threshold, we stop.

Equation 2: stop when |W_new − W_prev| < tol, where tol is the tolerance.

The equation above makes the stopping condition precise.
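In code, this stopping rule is a one-line check; a minimal sketch (the helper name is mine):

def converged(W, W_prev, tol=0.004):
    # Equation 2: stop once the weight barely moves between updates
    return abs(W - W_prev) < tol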

In a neural network, we implement gradient descent with the help of backpropagation, which we will discuss in the upcoming articles. Gradient descent does look fancy, but it has a disadvantage: the result depends on the starting point. A new starting point can give a different result, and sometimes that hurts, as you may get stuck in a very shallow local minimum. We can try adding noise to our updates and hope it gets us out.
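To picture the noise idea, here is one possible sketch (my own, not a standard recipe): add a small Gaussian kick to each update so the weight can jitter out of a shallow basin.

import numpy as np

rng = np.random.default_rng(0)

def noisy_update(W, grad, eta=0.004, noise_scale=0.05):
    # Usual step from Equation 1, plus a small random perturbation
    # (Gaussian noise and its scale are assumptions here)
    return W - eta * grad(W) + noise_scale * rng.normal()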

2. Implementation in Python

Now that we know the theory, why not implement it? The working code can be found here, at my Machine Learning repository on GitHub, and in a live notebook on Kaggle.

Note: run the code below in Jupyter, cell by cell, or take the .py file from the links above.

Let’s start by importing the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Great! Let's define a function to generate our loss: a dummy loss function with respect to a single weight.

# Function to generate the loss function (y)
def gen_y(a):
    u = np.sin(a/10) * (5*np.cos(a+10) - np.cos(a/2000))
    return u + 20*np.sin(a)

Generating our loss function:

x=np.linspace(15,25,500)
y=gen_y(x)

Let’s plot and have a look,

plt.rcParams['figure.figsize']=(8,8)
plt.plot(x,y,color='r',linewidth=3)
plt.xlabel('Weight',color='g',fontsize=20)
plt.ylabel('Loss',color='r',fontsize=20)
plt.title('Loss Function',fontsize=25,color='b')

Awesome, we have a global minimum and some local minima! It's time for some maths: let's calculate the gradient, the derivative of the loss with respect to the weight, and we will use another function for that.

def der(a):
    # Derivative of gen_y with respect to a (by the product rule)
    u = np.sin(a/10) * ((np.sin(a/2000)/2000) - 5*np.sin(a+10))
    v = (np.cos(a/10) * (5*np.cos(a+10) - np.cos(a/2000))) / 10
    return u + v + 20*np.cos(a)

Now, it’s time to build our Gradient Descent Algorithm and we will do it with recursion.

def Gradient_Descent(W, W_prev=0, eta=0.004, tol=0.004, epochs=1):
    # Base condition, from Equation 2
    if abs(W - W_prev) < tol:
        print(f'Returning after {epochs} number of epochs')
        return W
    # We calculate the gradient value at W
    g = der(W)
    # We memorize the current W
    W_prev = W
    # We update the weight with the help of the previous one;
    # eta is the learning rate and tol is the tolerance
    W = W_prev - eta*g
    # Iterative process via recursion; we also count the number of epochs
    return Gradient_Descent(W, W_prev, eta, tol, epochs=epochs+1)
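One caveat with the recursive version: Python limits recursion depth (about 1000 frames by default), so a very small learning rate or tolerance can raise a RecursionError. An equivalent iterative loop, sketched below under the same update and stopping rules, avoids that:

def Gradient_Descent_iterative(W, eta=0.004, tol=0.004, max_epochs=10000):
    # Same update (Equation 1) and stopping rule (Equation 2) as above
    for epochs in range(1, max_epochs + 1):
        W_prev = W
        W = W_prev - eta * der(W_prev)
        if abs(W - W_prev) < tol:
            print(f'Returning after {epochs} number of epochs')
            break
    return W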

Great, now let's get the optimal weight and see the number of epochs it took to reach it. We have taken the learning rate as 0.004 and the same value for the tolerance; you may want to play with those.

start=15.25
best_weight=np.round(Gradient_Descent(start),2)
print(best_weight)

The optimal weight (best_weight) for a minimum loss is 17.47, and we reached it in 52 epochs.
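As a quick sanity check, note that at convergence the step size eta·|gradient| fell below tol, so the gradient magnitude at the converged weight is at most tol/eta = 1:

print(abs(der(best_weight)))  # small near the minimum; bounded by tol/eta = 1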

Now let’s visualize what we have done so far,

# To plot the iterations: approximate the path from start to the
# optimum with evenly spaced points (for visualization only)
x_plot = np.linspace(start+0.1, best_weight-0.3, 152)
# A visual plot of our exercise
plt.rcParams['figure.figsize'] = (10, 10)
plt.plot(x, y, color='r', linewidth=3)
plt.scatter(best_weight, gen_y(best_weight), linewidth=16, label='Optimal Weight', color='blue')
plt.scatter(start, gen_y(start), linewidth=16, label='Start', color='orange')
plt.scatter(x_plot, gen_y(x_plot), linewidth=16, label='Iteration', color='#FDDF00', alpha=0.3)
plt.xlabel('Weight', color='g', fontsize=20)
plt.ylabel('Loss', color='r', fontsize=20)
plt.title('Loss Function', fontsize=25, color='b')
plt.legend(markerscale=0.2)
plt.tight_layout()

Figure 3: The Visual Implementation of Gradient Descent

Great!! We came from our starting point to finding our Nemo (the optimal weight), and we did it with gradient descent. That's it from my side; in the upcoming articles we'll talk about learning rate decay, backpropagation, and more optimizers.
