Neural Networks-Part(3): Introduction to Multi-Layer Neural Networks and Keras Functional API

A piece of cake with different layers

Look at this image and ask yourself: why were these layers added to the cake? For taste, right! Each layer has its own unique taste, and together they make the cake even tastier. Similarly, we add layers to a neural network so that it models the problem better. Just as each layer of the cake is a whole slab of a particular bread, each layer of our network is made up of a number of nodes.

That being said, let’s learn more about these layers, starting by visualising them.

Figure 1: A Fully Connected Neural Network

This is an example of a fully connected neural network with two predictors (the inputs), 2 hidden layers (in purple) and an output layer (red). Remember that the input is not counted as a layer here because it doesn’t have any neurons. Each layer consists of nodes; we have taken 4 nodes in the first hidden layer and 3 nodes in its successor. The idea is the same as before: each neuron receives a weighted input from the previous layer and passes it through an activation function, which happens inside the neuron. Keep in mind that every connection has its own weight. For example, the first hidden neuron from the top receives W1*X1 + W2*X2 as its input, the second neuron receives W3*X1 + W4*X2, and so on. We shall now discuss each layer in depth.
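To make the weighted sums concrete, here is a minimal NumPy sketch of one forward pass through a network shaped like Figure 1. The random weights, the ReLU activation, the single output neuron and the missing bias terms are all assumptions made for illustration; the figure itself does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# One sample with the two predictors X1 and X2 from Figure 1.
x = np.array([0.5, -1.2])

# One weight per connection (values are random placeholders):
# W1 is 2x4 (inputs -> first hidden layer),
# W2 is 4x3 (first -> second hidden layer),
# W3 is 3x1 (second hidden layer -> a single assumed output neuron).
W1 = rng.normal(size=(2, 4))
W2 = rng.normal(size=(4, 3))
W3 = rng.normal(size=(3, 1))

h1 = relu(x @ W1)   # 4 activations
h2 = relu(h1 @ W2)  # 3 activations
out = h2 @ W3       # 1 value, linear output

# The weighted input of the first hidden neuron, written out by hand.
first_neuron_input = W1[0, 0] * x[0] + W1[1, 0] * x[1]
```

The matrix product `x @ W1` computes all four weighted sums at once; its first entry is the hand-written `first_neuron_input`.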

1. The Inputs

Inputs are not considered a layer for the above-stated reason, but it is important to specify the dimension of the input while creating them in Keras. We generally specify the number of columns (the predictors) and not the number of observations, as that can vary. As shown in figure 1, X1 and X2 are used as predictors, with any number of samples or observations for them. It is very important to standardize or normalize our data before introducing it to the network. Normalization especially helps when sigmoid activation is present in the layers. Learn more about activation functions in my article on Activation Functions. As you may know, the gradient of the sigmoid is largest when its input is near the origin, so keeping the inputs in that range makes us less susceptible to vanishing gradients. We will discuss vanishing gradients in the upcoming articles.
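The link between scaling and the sigmoid gradient can be checked numerically. The sketch below is only an illustration: the helper names and the toy data are made up, and standardization (zero mean, unit variance) is just one of the possible scaling schemes.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy data: two predictors on very different scales.
X = np.array([[150.0, 0.2], [160.0, 0.4], [170.0, 0.9], [180.0, 0.5]])
X_std = standardize(X)

grad_at_origin = sigmoid_grad(0.0)  # 0.25, the largest value the gradient can take
grad_far_away = sigmoid_grad(10.0)  # several orders of magnitude smaller
```

An unscaled value such as 150 would push the sigmoid deep into its flat region, while the standardized columns stay close to the origin where the gradient is useful.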

The syntax for Inputs in Keras:

#p is the number of predictors.
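The declaration that this comment refers to is not shown above. A minimal sketch of what it likely looks like, assuming the `keras.Input` call used later in this article and an example value of 4 for `p`:

```python
from tensorflow import keras

p = 4  # number of predictors (4 is only an example value)

# Only the number of columns is given; the number of observations
# (the batch dimension) is left unspecified.
inputs = keras.Input(shape=(p,))
```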

2. The Hidden Layers

The hidden layers are the region of your neural network where all the learning happens. The name hidden signifies that these layers sit between the input and the output: you feed in the inputs and read off the outputs, while the values computed in between are not directly observed. Like every other layer, a hidden layer consists of a number of neurons that receive the weighted sum of inputs from the previous layer. Think of the hidden layers of a neural network as a tool to understand different patterns in your data. For example, if you are building an image detection NN, assume for understanding that the first layer recognizes the contours, the next one some other feature, and so on. But as you keep increasing the number of hidden layers, you also increase the complexity of the network.

The syntax for adding a hidden layer to the network:

keras.layers.Dense(n_neurons, activation='Your Choice')(previous_layer)
# n_neurons is the desired number of neurons in the hidden layer
# Use an activation function suitable to your problem; to keep it simple, use ReLU.
# The layer is then called on the previous layer, which becomes its input.

3. The Output Layer

We are now in the end game: the output layer. Note that we count the output as a layer because it contains neurons, remember? Awesome. It’s just like a hidden layer, with weighted inputs and an activation, but here is the real deal: the output is what we did all this hard work for, and hence the outputs are accessible to us. Different problems require different activations at the output. For example, a classification problem with n classes requires a softmax activation on n output neurons, so that the outputs sum to 1 and can be read as probabilities. In a regression problem, you’d use ReLU if negative values are not desired; if they are allowed, go with a simple linear activation, which is just a multiplication by 1: if the input is WX, then so is the output. The syntax is similar to how we add a hidden layer.

Great! We’ve learnt the basics of each of them so far, so now let’s implement it in Python.
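The three output activations mentioned above can be compared side by side in a few lines of NumPy. This is a sketch for intuition only; the input values are made up and Keras ships its own implementations.

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):
    return np.maximum(z, 0.0)

def linear(z):
    return z  # identity: the input passes through unchanged

# Made-up weighted inputs arriving at 3 output neurons.
z = np.array([2.0, 1.0, -1.0])

probs = softmax(z)     # non-negative values that sum to 1
nonneg = relu(z)       # negative values are clipped to 0
unchanged = linear(z)  # same values as z
```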

4. Implementation in Python

As promised, we are going to use the Keras functional API on the iris dataset, already available in sklearn. To use the Keras functional API, we need TensorFlow. You can learn more about the installation from here.

Let’s get started.

What do we always do first? Yeah, import the libraries!!!

import tensorflow as tf 
from tensorflow import keras
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

Great, let's quickly move towards our data set.


Remember, they will be NumPy arrays: X holds the features and y is an array of strings for the 3 different species.
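The loading code itself is not shown here. One way to end up with the arrays described, sketched under the assumption that `load_iris` from sklearn is used: it returns integer class codes, so mapping them through `target_names` gives the species names as strings.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                       # shape (150, 4): the four measurements
y = iris.target_names[iris.target]  # shape (150,): species names as strings
```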

The architecture of our NN consists of 4 input features, 3 hidden layers with 16 neurons each (you can play with that) using ReLU activation, which keeps things simple and gives a non-zero gradient for values greater than 0, and an output layer with 3 neurons (as we have 3 classes) with softmax activation (to convert the outputs into probabilities).

Ready to get your hands dirty on our first NN? Well, I am. Let’s get started!!!

Let’s transform our target variable into 3 columns, one for each of the 3 species, using one-hot encoding.

y=y.reshape(-1,1) #Conversion to a 2D array
# We will now have 3 binary columns, one for each class.
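The encoding step can be sketched as follows. The toy labels and the variable names `encoder` and `y_encoded` are illustrative assumptions; calling `.toarray()` on the default sparse result avoids depending on the name of the dense-output parameter, which has changed between sklearn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels standing in for the 150 iris species names.
y = np.array(["setosa", "versicolor", "virginica", "setosa"])
y = y.reshape(-1, 1)  # the encoder expects a 2D array

encoder = OneHotEncoder()
y_encoded = encoder.fit_transform(y).toarray()  # one binary column per class
```

The columns follow the sorted category order stored in `encoder.categories_`, so each row holds a single 1 in the column of its species.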

Let’s split the data set, eh?


We have used stratify here so that the class proportions are the same in the train and test data.
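A minimal sketch of a stratified split, so the effect on class counts is visible. The `test_size=0.2` and `random_state=42` values are assumptions, and integer labels are used here instead of the one-hot encoded target to keep the counting simple.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # y holds integer class codes 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Count how many samples of each class landed in each part.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
```

Since iris has 50 samples per class, stratifying keeps the three classes equally represented on both sides of the split.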

NN time and we start by configuring the input

#Constructing input layer
inputs1 = keras.Input(shape=(4,))  # shape is a tuple: 4 features per sample

Great!!! Now let’s set our dense layers,

#Adding hidden layer 

Notice how we have provided the previous layer as an input to each new layer. Dense represents a fully connected layer.
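The hidden-layer code is not reproduced above, so here is a sketch of the chaining just described, following the stated architecture of 3 hidden layers with 16 ReLU neurons each. The names `hidden1` to `hidden3` are assumptions.

```python
from tensorflow import keras

inputs1 = keras.Input(shape=(4,))

# Each Dense layer is called on the layer before it.
hidden1 = keras.layers.Dense(16, activation="relu")(inputs1)
hidden2 = keras.layers.Dense(16, activation="relu")(hidden1)
hidden3 = keras.layers.Dense(16, activation="relu")(hidden2)
```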

What’s next, the output config.

#Adding Output layer 

Again, notice we have provided the previous layer as an input for our output.
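A sketch of the output step in the same style, with 3 softmax neurons as described in the architecture. The hidden layers are rebuilt in a loop only to keep the snippet self-contained, and the name `outputs` is an assumption.

```python
from tensorflow import keras

inputs1 = keras.Input(shape=(4,))

x = inputs1
for _ in range(3):
    x = keras.layers.Dense(16, activation="relu")(x)

# The last hidden layer is the input of the output layer.
outputs = keras.layers.Dense(3, activation="softmax")(x)
```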

It’s time to get crazy baby, model config now,


In the first line, we initiate our model by passing it the inputs and outputs. Then we compile it, which brings together everything we have defined so far (the inputs, the hidden layers and the output layer) along with some extra information. The optimizer defines how to minimize the loss function; we used the Adam optimizer, a fast and reliable method, and in future lectures you will see why. The loss function measures the error. We have specifically used ‘categorical_crossentropy’ because we have provided the target as one-hot encoded values; had the target been binary, we would have used binary cross-entropy. Finally, to measure how well our model predicts, we use accuracy as a metric. You may use multiple metrics and include true positives and true negatives; explore it.

model.summary() provides us with information on the model: the hidden layers, the number of parameters, etc. Parameters are the weights and biases.

Notice that the first entry of each output shape is None, as the number of observations varies between the training and test sets.
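Putting the description together, the model configuration might look like the sketch below. The `build_model` wrapper and the variable names are assumptions; the optimizer, loss and metric are the ones named in the text.

```python
from tensorflow import keras

def build_model():
    """Assemble the 4 -> 16 -> 16 -> 16 -> 3 network and compile it."""
    inputs1 = keras.Input(shape=(4,))
    x = inputs1
    for _ in range(3):
        x = keras.layers.Dense(16, activation="relu")(x)
    outputs = keras.layers.Dense(3, activation="softmax")(x)

    model = keras.Model(inputs=inputs1, outputs=outputs)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
model.summary()
```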
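The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per incoming connection plus one bias per neuron. A small sketch for the architecture described in this article (the helper name `dense_params` is made up):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 4 inputs, three hidden layers of 16 neurons, 3 output neurons.
layer_sizes = [4, 16, 16, 16, 3]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [80, 272, 272, 51] 675
```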

It’s time to train our model on the train set.

history = model.fit(X_train, y_train, batch_size=128, epochs=800, validation_split=0.2)

We use history to store the information for each and every epoch. We took a batch size of 128, neither too small nor too large. We are also validating our model at the end of every epoch, hence a validation split of 0.2, which holds out 20% of the training data for validation.
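It helps to know which samples the validation split actually uses. According to the Keras documentation, `validation_split` takes the last fraction of the arrays passed to `fit`, before any shuffling. The helper below only mimics that slicing on toy data; it is an illustrative sketch with a made-up name, not Keras code.

```python
import math
import numpy as np

def split_like_fit(X, y, validation_split):
    """Hold out the last `validation_split` fraction of the samples."""
    split_at = int(math.floor(len(X) * (1.0 - validation_split)))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])

# Ten toy samples, numbered so the split is easy to read.
X = np.arange(10)
y = np.arange(10)

(X_fit, y_fit), (X_val, y_val) = split_like_fit(X, y, validation_split=0.2)
```

Because the held-out part is the tail of the array, data that is sorted by class should be shuffled before it is passed to `fit`.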

You’ll notice this information for each epoch here. If you put verbose=0, it will not appear.

Let’s plot the train and validation graph, to check how we are doing.

fig, ax = plt.subplots(figsize=(6, 7))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss Convergence', fontsize=18)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()  # show the labels passed to plt.plot

That’s pretty good, eh? No overfitting and a minimum value of the loss, which is around 0.08 for training and 0.09 for validation after 400 epochs.

Let’s evaluate our model against our validation set now.


Bravo, we got an accuracy of 1. Explore the confusion matrix for this result and tell me below whether the other metrics are as good as our accuracy.
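As a starting point for that exploration, here is a sketch of how a confusion matrix can be built from one-hot targets and softmax outputs. The arrays are made-up stand-ins for the test labels and the model's predicted probabilities, and they do not reproduce the result above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up one-hot targets and predicted class probabilities for 4 samples.
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
    [0.10, 0.30, 0.60],  # a class-1 sample predicted as class 2
])

# Convert both back to class indices before comparing them.
y_true = y_true_onehot.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
accuracy = np.trace(cm) / cm.sum()
```

Correct predictions sit on the diagonal, so an accuracy of 1 corresponds to a matrix with zeros everywhere else.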

That’s it for today, thank you.



Sharing knowledge is gaining knowledge. Data Science Enthusiast and Master AI & ML fellow @ Univ.Ai

Aamir Ahmad Ansari