How to perform Image classification using CNN

In the last few decades, machine learning has gained a lot of popularity in fields such as healthcare, autonomous vehicles, web search, and image recognition. The increased use of the internet produces a massive amount of data, and many tech giants harness this data to understand customer behavior, among other things. The medical industry will likely change as well, since deep learning helps doctors predict or detect cancer earlier, which can save lives.

On the financial front, machine learning and deep learning are poised to help companies and even individuals save money, invest more wisely, and allocate resources more efficiently. And these three areas are only the start of future trends for machine learning and deep learning. We have already released a series of machine learning tutorials here.

However, every technology has its pros and cons: traditional machine learning struggles to take advantage of massive amounts of data. This is where deep learning comes into the picture. Deep learning can handle enormous amounts of data and make predictions from it, but it requires heavy computing power to do so. Graphics processing units (GPUs) are one type of hardware commonly used for deep learning, while traditional machine learning programs can run on lower-end machines without as much computing power.

One of the cool applications of deep learning is image classification. In this article, we will learn how to solve the image classification problem using a convolutional neural network.

What is a convolutional neural network

A convolutional neural network (CNN) is a special architecture of artificial neural network, most commonly applied to image problems. The architecture was popularized by Yann LeCun's work in the late 1980s.

For example, Amazon uses this algorithm to generate product recommendations, Google uses it to let users search among their photos, and Facebook uses it for automatic tagging in pictures.

The main task of image classification is to take an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and be able to differentiate one from the other. The human eye does this very easily, while the computer sees the image quite differently.

The computer sees an image as a grid of numeric pixel values. A grayscale image has only 1 channel, while a color image has 3 channels (red, green, and blue). Each pixel value in a channel ranges between 0 and 255.
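As a rough illustration of this representation, the sketch below builds a tiny grayscale image and a tiny color image as nested Python lists. The pixel values and the helper names image_shape and rescale are made up for this example; real code would normally use NumPy arrays instead.

```python
# A tiny 2x2 grayscale image: one intensity value (0-255) per pixel.
gray_image = [
    [0, 128],
    [200, 255],
]

# A color image of the same size: three values (red, green, blue) per pixel.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def image_shape(image):
    """Return (height, width, channels) for a nested-list image."""
    height = len(image)
    width = len(image[0])
    first_pixel = image[0][0]
    channels = len(first_pixel) if isinstance(first_pixel, tuple) else 1
    return height, width, channels


def rescale(image):
    """Scale grayscale pixel values from the 0-255 range into 0.0-1.0."""
    return [[pixel / 255 for pixel in row] for row in image]


print(image_shape(gray_image))  # (2, 2, 1)
print(image_shape(rgb_image))   # (2, 2, 3)
```

Dividing by 255, as rescale does here, is the same kind of normalization that is usually applied to images before they are fed to a network.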

A plain neural network consists of 3 parts: the input layer, hidden layers, and the output layer. A convolutional neural network has five parts: the input layer, convolution layer, pooling layer, fully connected layer, and output layer.

Basic building blocks of convolutional neural network

Convolution layer

The convolutional layer is the first step in building a CNN model. In this step, we extract the features from the input image.

There are 3 necessary elements involved in performing the convolution operation.

Input image: The image that is given as input to the convolution layer.
Feature detector: Also called a kernel or filter, it produces a feature map by being multiplied element-wise with each region of the input image.

The filter size is n*m, where “n” is the height and “m” is the width of the filter. Common filter sizes are 3×3, 5×5, and 7×7.

Feature map: The result of applying the feature detector to the input image.

How exactly does the Convolution Operation work?

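We take two matrices, the input image and the feature detector. The feature detector slides over the input image; at each position we multiply the overlapping values element-wise and sum the products. The resulting matrix is the feature map, also called the convolved feature.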

In the above GIF diagram, the green cells are the input image and the yellow cells are the feature detector, or kernel. The output is a feature map.
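The operation can be sketched in plain Python as follows. This is a minimal illustration with stride 1 and no padding, not how Keras implements it; the function name convolve2d and the 5×5 image and 3×3 kernel values are chosen only for this example.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    feature_map = []
    for top in range(image_h - kernel_h + 1):
        row = []
        for left in range(image_w - kernel_w + 1):
            # Multiply the overlapping values element-wise and sum the products.
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

Strictly speaking this computes a cross-correlation (the kernel is not flipped), which is also what deep learning libraries conventionally call convolution.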

Now the question arises of how to calculate the dimensions of the feature map.

The size of the output feature map depends on 3 parameters: filter size, stride, and zero padding.
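For one spatial dimension, the usual relationship is output = floor((W - F + 2P) / S) + 1, where W is the input size, F the filter size, P the zero padding on each side, and S the stride. The helper below is a small sketch of that formula; the function name feature_map_size is made up for this example.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Output size along one dimension: floor((W - F + 2P) / S) + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1


# A 150-pixel-wide image with a 3x3 filter, stride 1, no padding.
print(feature_map_size(150, 3))                       # 148
# Padding of 1 keeps a 5-pixel input at 5 pixels with a 3x3 filter.
print(feature_map_size(5, 3, stride=1, padding=1))    # 5
# A stride of 2 roughly halves the output size.
print(feature_map_size(7, 3, stride=2))               # 3
```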

Before moving to ReLU directly, let us learn what an activation function is. It is a mathematical function that determines the output of a neuron. It is applied to each neuron in the network and decides whether that neuron should fire, based on whether its input is relevant to the model's prediction.

Rectified linear unit (ReLU)

The purpose of using a ReLU function is to increase the non-linearity in our network. This matters because images are naturally non-linear.

If we look at a dog image, for example, we find that it contains a lot of non-linear features (e.g. the transitions between pixels, the borders, etc.).

ReLU replaces every negative value in the feature map with zero, which breaks up the linearity that we might impose on an image when we put it through the convolution operation.
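ReLU itself is simply max(0, x) applied to every value. A minimal sketch on a small feature map with made-up values:

```python
def relu(feature_map):
    """Apply max(0, x) to every value of a 2D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [-3, 2, 0],
    [5, -1, 4],
]
activated = relu(feature_map)
print(activated)  # [[0, 2, 0], [5, 0, 4]]
```

Positive values pass through unchanged, while negative responses are cut to zero.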

Pooling layer

The pooling layer is used to gradually decrease the spatial size of the representation, which reduces the amount of computation and the number of parameters in the later layers of the network. It also helps to prevent overfitting.

Several types of pooling methods are available. The most important ones are given below.

Max pooling: A sliding-window operation in which the window extracts the maximum value of the area it covers.
Min pooling: Similarly, in min pooling the window extracts the minimum value of the area it covers.
Average pooling: In average pooling, the window extracts the average value of the area it covers.

In our model, we use the max-pooling operation. Max pooling is a form of downsampling.
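The three pooling variants differ only in the function applied to each window. The sketch below uses a 2×2 window with stride 2 on a made-up 4×4 feature map; the names pool2d and average are illustrative and not part of Keras.

```python
def pool2d(feature_map, reduce, size=2, stride=2):
    """Apply `reduce` (e.g. max, min) to each size x size window of the feature map."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[top + i][left + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]

print(pool2d(feature_map, max))      # [[6, 4], [7, 9]]
print(pool2d(feature_map, min))      # [[1, 1], [2, 0]]
print(pool2d(feature_map, average))  # [[3.75, 2.25], [4.0, 4.5]]
```

In every case the 4×4 input shrinks to 2×2, which is the downsampling effect described above.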

Flattening

Now we flatten the output of the pooled feature map to create a single long feature vector.

Fully Connected layer

After completing all the above steps, it is necessary to attach a fully connected layer. A fully connected layer takes the output of the convolutional layers as its input. Attaching a fully connected layer to the end of the network results in an N-dimensional vector, where N is the number of classes from which the model selects the desired class.

The total number of parameters in a convolutional layer is “(n*m*l+1)*k”, where n*m denotes the filter size, l is the number of input feature maps, and k is the number of output feature maps. The +1 accounts for the bias of each filter.
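As a quick check of this formula, here is a small sketch. The function name conv_layer_params is made up for this example; the numbers correspond to 3×3 filters applied to an RGB input and to 64 input feature maps.

```python
def conv_layer_params(n, m, l, k):
    """Parameters of a convolutional layer: (n*m*l + 1) * k.

    n, m: filter height and width
    l:    number of input feature maps (channels)
    k:    number of output feature maps (filters)
    """
    return (n * m * l + 1) * k


# 64 filters of size 3x3 over a 3-channel (RGB) input.
print(conv_layer_params(3, 3, 3, 64))   # 1792
# 64 filters of size 3x3 over 64 input feature maps.
print(conv_layer_params(3, 3, 64, 64))  # 36928
```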

Softmax

Now, as the last step, we apply a softmax function to convert the outputs to probability values for each class. Then we choose the class with the maximum probability and print it.
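The last three stages (flattening, the fully connected layer, and softmax) can be sketched together in plain Python. The weights, biases, class names, and function names below are hypothetical values chosen for illustration; in a real network the weights are learned during training.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one long vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


pooled = [[[1, 0], [0, 1]]]            # one hypothetical 2x2 pooled feature map
weights = [                             # hypothetical weights, one row per class
    [0.5, 0.1, 0.1, 0.5],
    [0.2, 0.2, 0.2, 0.2],
    [-0.3, 0.4, 0.4, -0.3],
]
biases = [0.0, 0.0, 0.0]
class_names = ["paper", "rock", "scissors"]

vector = flatten(pooled)
probabilities = softmax(dense(vector, weights, biases))
predicted = class_names[probabilities.index(max(probabilities))]
print(vector)
print(predicted)
```

With these particular weights the first class receives the highest score, so the sketch prints "paper".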

What is Image classification

Image classification is the task of categorizing and labeling groups of pixels or vectors within an image based on specific rules.

An image classification model is trained to recognize different classes of images. For example, a CNN model might be trained to recognize photos representing three different types of animals: cats, hamsters, and dogs.

Workflow to Solve Image classification problem

A CNN takes an image as the input.
Then it applies many different kernels to create feature maps.
After that, we use the ReLU activation function to increase the non-linearity.
Then we apply the pooling layer to each feature map to reduce its dimensions.
After that, we flatten the pooled feature maps into one long vector.
Now the vector is fed into a fully connected layer.
Next, we use softmax to output the probability of each class and choose the class with the maximum probability.
Finally, we train through forward propagation and backpropagation for several epochs, repeating until the network learns to classify the images correctly.

Understand the real-world dataset

In this tutorial, we are going to classify rock, paper, and scissors. In our dataset, we have a total of 900 images in our training set and 300 images in our testing set.

The dataset is downloaded from Kaggle. The link to the rock, paper, scissor dataset is here.

The sample images are given below.

Implementation of CNN to solve image classification using keras

Let us start to code CNN.

import os
import time

import numpy as np
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dropout, Dense
from keras.models import Model, load_model
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

The first step is to import all the required Python packages. Input and Model are used to define the network with the functional API, Conv2D performs the convolution operations, MaxPooling2D performs pooling, Flatten converts the pooled feature maps into a single long vector, Dropout and Dense build the classifier head, and ImageDataGenerator is used to preprocess the images.

train_data_dir = 'data/train'
validation_data_dir = 'data/validation'

The next step is to load the training and validation images.

img_width, img_height = 150, 150
datagen = ImageDataGenerator(rescale=1./255)
batch_size = 32

train_generator = datagen.flow_from_directory(
       train_data_dir,
       target_size=(img_width, img_height),
       batch_size=batch_size,
       class_mode='categorical')

validation_generator = datagen.flow_from_directory(
       validation_data_dir,
       target_size=(img_width, img_height),
       batch_size=batch_size,
       class_mode='categorical')

# Found 900 images belonging to 3 classes.
# Found 300 images belonging to 3 classes.
image_size = 150

input_shape = (image_size, image_size, 3)
print(input_shape)

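# Note: the generators above were already created with batch_size=32;
# this value is only used for the steps_per_epoch arithmetic at training time.
batch_size = 128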
kernel_size = 3
filters = 64
dropout = 0.3

Now we set the width and height of the input images to 150px. Next, we use an ImageDataGenerator to rescale the pixel values from the 0-255 range to the 0-1 range.

Here we initialize the training and validation generator. Each generator will provide batches of images on demand, as is denoted by the batch_size parameter. The class mode will be categorical because it has more than two classes.
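With class_mode='categorical', each label is delivered as a one-hot vector, and flow_from_directory assigns class indices by sorting the subfolder names. The sketch below mimics that in plain Python; the folder names are an assumption about how the dataset is laid out, and one_hot is a made-up helper.

```python
# Assumed subfolder names inside data/train (hypothetical layout).
class_folders = ["rock", "scissors", "paper"]

# Class indices follow the sorted order of the folder names.
class_indices = {name: index for index, name in enumerate(sorted(class_folders))}


def one_hot(class_name):
    """Return the one-hot label vector for a class name."""
    vector = [0] * len(class_indices)
    vector[class_indices[class_name]] = 1
    return vector


print(class_indices)    # {'paper': 0, 'rock': 1, 'scissors': 2}
print(one_hot("rock"))  # [0, 1, 0]
```

Under this assumed layout, index 0 is paper, 1 is rock, and 2 is scissors.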

Next, we set the input_shape by giving the width, height, and number of RGB channels. We also set batch_size=128, kernel_size=3, filters=64, and dropout=0.3. Dropout is used in a neural network to prevent overfitting.

Common mini-batch sizes are 64, 128, 256, or 512. The kernel size gives the width and height of the kernel, so kernel_size=3 means a 3×3 kernel.

# use functional API to build cnn layers
inputs = Input(shape=input_shape)
y = Conv2D(filters=filters,
           kernel_size=kernel_size,
           activation='relu')(inputs)
y = MaxPooling2D()(y)
y = Conv2D(filters=filters,
           kernel_size=kernel_size,
           activation='relu')(y)
y = MaxPooling2D()(y)
y = Conv2D(filters=filters,
           kernel_size=kernel_size,
           activation='relu')(y)

# image to vector before connecting to dense layer
y = Flatten()(y)

# dropout regularization
y = Dropout(dropout)(y)
outputs = Dense(3, activation='softmax')(y)

# build the model by supplying inputs/outputs
model = Model(inputs=inputs, outputs=outputs)

# network model in text
model.summary()

# classifier loss, Adam optimizer, classifier accuracy
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Three Conv2D => ReLU layers are defined here, each with 64 filters, and the first two are each followed by a max-pooling layer. I’ve applied Dropout as well.

The Flatten, Dropout, and softmax Dense layers make up the head of the network. The output of the softmax classifier will be the probability for each class our model can predict.

Next, we print the summary of the model. Our model is compiled with the Adam optimizer and a “categorical_crossentropy” loss function (because we have more than two classes with one-hot encoded labels).

The input_1 (InputLayer) has shape (None,150,150,3) and 0 parameters. In the whole program the stride is the default of 1 and kernel_size=3*3.

Convolutional_1 : (kernel_height*kernel_width*input_channels+1)*filters = (3*3*3+1)*64 = 1792 parameters. The first convolutional layer has 64 filters.

Max_pooling_2d: This layer reduces the size of the feature maps and has no trainable parameters.

Convolutional_2 : Convolutional_1 outputs 64 feature maps, so the number of trainable parameters in this layer is (3*3*64+1)*64 = 36928.

Convolutional_3 : (3*3*64+1)*64 = 36928.

Dense : The flattened vector has 34*34*64 = 73984 values, so the softmax layer has (73984+1)*3 = 221955 parameters.

At last, all the parameters are summed together.

Total Parameters = 297,603 Trainable Parameters = 297,603

Non-Trainable Parameters = 0.
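The total can be reproduced by walking through the layers by hand. The sketch below assumes the Keras defaults used in the model code (stride 1 and no padding for the convolutions, 2×2 pooling); the helper names are made up for this example.

```python
def conv_params(kernel, in_channels, out_channels):
    """(kernel*kernel*in_channels + 1) * out_channels, the +1 being the bias."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(inputs, outputs):
    """One weight per input plus one bias, for every output unit."""
    return (inputs + 1) * outputs


size, channels = 150, 3      # input image: 150x150 RGB
kernel, filters = 3, 64
total = 0

for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        total += conv_params(kernel, channels, filters)
        size = size - kernel + 1   # stride 1, no padding
        channels = filters
    else:
        size = size // 2           # 2x2 max pooling halves each dimension

flattened = size * size * channels
total += dense_params(flattened, 3)

print(size)       # 34
print(flattened)  # 73984
print(total)      # 297603
```

Most of the parameters sit in the final Dense layer, because it connects all 73,984 flattened values to each of the 3 classes.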

epochs = 17
train_samples = 900
validation_samples = 300
model.fit(  # fit_generator is deprecated; fit accepts generators directly
        train_generator,
        steps_per_epoch=train_samples // batch_size,
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=validation_samples // batch_size)

model.save('model.h5')
model.save_weights('weights.h5')

The first 9 lines start the training process. After training is complete, we save the model and its weights in HDF5 format. Afterward, we evaluate the model on the testing data.
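To make the step arithmetic concrete, here is a small sketch using the values from the code above. It assumes the generators still yield batches of 32 images, since they were created before batch_size was changed to 128; the name generator_batch_size is introduced here only for illustration.

```python
train_samples = 900
validation_samples = 300
generator_batch_size = 32   # batch size the generators were created with (assumed)
batch_size = 128            # value of batch_size when training starts

steps_per_epoch = train_samples // batch_size
validation_steps = validation_samples // batch_size
images_per_epoch = steps_per_epoch * generator_batch_size

print(steps_per_epoch, validation_steps, images_per_epoch)  # 7 2 224

# Using the generator's own batch size would cover almost the whole training set.
print(train_samples // generator_batch_size)                # 28
```

Under that assumption each epoch draws 224 of the 900 training images, so keeping one batch size for both the generators and the step calculation would let every epoch see nearly all of the data.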

The model reaches perfect accuracy on the training data but only 73.4% on the testing data. The image features are easy to learn, but a gap of this size suggests the model is overfitting the training set.

start = time.time()

# Define paths
model_path = 'model.h5'
model_weights_path = 'weights.h5'
test_path = 'data/test'

# Load the trained model and its weights
model = load_model(model_path)
model.load_weights(model_weights_path)

# Define image parameters
img_width, img_height = 150, 150

# Class indices, assuming the class folders sort as paper, rock, scissors
class_names = {0: "paper", 1: "rock", 2: "scissor"}

# Prediction function
def predict(file):
  x = load_img(file, target_size=(img_width, img_height))
  # Apply the same 1/255 rescaling that was used during training
  x = img_to_array(x) / 255.0
  # Add a batch dimension: (150, 150, 3) -> (1, 150, 150, 3)
  x = np.expand_dims(x, axis=0)
  array = model.predict(x)
  result = array[0]
  answer = int(np.argmax(result))
  print("Predicted: " + class_names[answer])
  return answer


# Walk the directory for every image
for dirpath, dirnames, filenames in os.walk(test_path):
  for filename in filenames:
    # Skip hidden files such as .DS_Store
    if filename.startswith("."):
      continue

    print(os.path.join(dirpath, filename))
    result = predict(os.path.join(dirpath, filename))
    print(" ")

Now we start the timer and define the model path, its weight file, and the location of the testing data. Next, we load the trained model and define the image parameters, that is, the width and height.

Second to last, we define a predict function in which we load an image, convert it to an array, expand the shape of that array to add a batch dimension, and then take the index of the class with the maximum probability. If the output is 1 it is rock, 0 means paper, and 2 means scissor.

Finally, we loop over the testing images and make the predictions.

data/test/283.jpg Predicted: paper

data/test/284.jpg Predicted: rock

data/test/285.jpg Predicted: scissor

Wrap up the Session

Finally, we have made it to the end of the tutorial. In this tutorial we have learned the following things:

What is a convolutional neural network?
Basic building blocks of convolutional neural network
What is Image classification?
Workflow to Solve Image classification problem.
Understand the real-world dataset
Implementation of CNN to solve image classification

Hope you enjoyed the tutorial. Feel free to comment if you have any doubts regarding the topic.

Free cheatsheets on machine learning, data science, deep learning, and computer vision are available in our Telegram channel.