A Brief Overview of Single Shot Object Detection

Introduction

Hello everyone, and welcome to this tutorial, where you will learn about object detection using deep learning and OpenCV. Object detection is one of the hottest topics in computer vision.

Object detection is breaking into a wide range of industries, with use cases ranging from personal safety to workplace productivity. Object detection and recognition are applied in many areas of computer vision, including image retrieval, security, surveillance, automated license plate recognition, optical character recognition, traffic control, medicine, agriculture and many more.

With so many applications, you must be excited, right? Let's jump in and learn as much as possible.

In this blog post, we learn how to categorize an object and find where it resides in an image. For this, we have to obtain the bounding box, i.e. the (x, y)-coordinates of the object in the image. We call this process object detection. Predicting the bounding box is treated as a regression problem, which makes it easy to understand.
Whenever we talk about object detection, we mainly talk about these primary detection methods.

1- Faster R-CNN
2- Single Shot Detector (SSD)
3- You Only Look Once (YOLO)

Faster R-CNN performs detection on various region proposals and so ends up making predictions multiple times for various regions in an image. Its speed varies from 5 to 7 frames per second.

Another type of detector is YOLO, which is quite popular among real-time object detectors in computer vision. Its architecture is similar to Faster R-CNN. YOLO sees the whole image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. In simple words, we pass the image once through the network and output its predictions.

Properties of YOLO
• Its processing speed is 45 frames per second, which is better than real time.
• It makes fewer background errors than R-CNN.
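
To put the frame rates quoted above in perspective, a frame rate can be converted into a time budget per frame. The snippet below is only an illustrative sketch; the helper name ms_per_frame is hypothetical and not part of any library.

```python
def ms_per_frame(fps):
    """Return the time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Frame rates quoted in this post: Faster R-CNN (5 to 7 fps) and YOLO (45 fps)
for name, fps in [("Faster R-CNN (low)", 5), ("Faster R-CNN (high)", 7), ("YOLO", 45)]:
    print("{}: {:.1f} ms per frame".format(name, ms_per_frame(fps)))
```

At 45 frames per second the detector has roughly 22 ms per image, compared with 143 to 200 ms at 5 to 7 frames per second.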

YOLO uses a k-means clustering strategy on the training dataset to determine its default bounding boxes. The main problem with YOLO is that it leaves much accuracy to be desired. In an upcoming blog post we will also implement single shot object detection on a video stream.
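
The clustering idea mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration rather than YOLO's actual code: it assumes boxes are given as (width, height) pairs, that similarity is measured with IOU between boxes aligned at a shared corner (the approach described for YOLOv2), and the function names and toy boxes are hypothetical.

```python
import random


def iou_wh(box, centroid):
    """IOU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_boxes(boxes, k, iterations=20, seed=0):
    """Cluster (width, height) pairs; returns k centroid boxes sorted by width."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # highest IOU is the same as the smallest (1 - IOU) distance
            best = max(range(k), key=lambda j: iou_wh(box, centroids[j]))
            clusters[best].append(box)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (
                    sum(b[0] for b in members) / len(members),
                    sum(b[1] for b in members) / len(members),
                )
    return sorted(centroids)


# Toy data (made up for illustration): three small boxes and three large ones
toy_boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 52)]
anchors = kmeans_boxes(toy_boxes, k=2)
print(anchors)
```

Each resulting centroid is a typical box shape in the data, which is what a detector would use as a default box.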

Single Shot Detectors for object detection

In this blog post we talk about single shot detection, so let us understand what single shot object detection is.

Single shot object detection takes one single shot to detect multiple objects within the image. As you can see in the above image, we are detecting coffee, an iPhone, a notebook and glasses at the same time.

It is composed of two parts:
• Extract feature maps, and
• Apply convolution filters to detect objects
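
To make the two parts above concrete, the sketch below counts how many default boxes are scored in one forward pass. The feature map sizes and boxes per location are taken from the VGG-based SSD300 configuration described in the original SSD paper; they are not stated in this post, and the MobileNet-SSD model used later may use different layers.

```python
# (feature map side length, default boxes per location) for SSD300,
# as described in the original SSD paper (an assumption, not from this post)
SSD300_FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(feature_maps):
    """Total default boxes = sum over maps of (side * side * boxes per location)."""
    return sum(side * side * boxes for side, boxes in feature_maps)


total = count_default_boxes(SSD300_FEATURE_MAPS)
print(total)  # 8732
```

All of these boxes are classified and refined by convolution filters in the same pass, which is why no separate region proposal step is needed.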

SSD was developed by a team of Google researchers to strike a balance between the two, that is, YOLO and R-CNN.

Two main SSD models are available:

1- SSD300: In this model the input size is fixed to 300×300. It is used for lower-resolution images and has a faster processing speed, but it is less accurate than SSD512.

2- SSD512: In this model the input size is fixed to 512×512. It is used for higher-resolution images and is more accurate than SSD300.

SSD is faster than R-CNN because R-CNN needs two shots, one for generating region proposals and one for detecting objects, whereas SSD does both in a single shot.

The MobileNet Single Shot Detection model was first trained on the COCO dataset and then fine-tuned on PASCAL VOC, reaching 72.7% mAP (mean average precision).

For example, on the PASCAL VOC 2007 dataset, SSD300 reaches 79.6% mAP and SSD512 reaches 81.6% mAP, both higher than the 78.8% mAP of Faster R-CNN.

In this blog post we are going to use MobileNet-SSD (a Caffe implementation of the MobileNet-SSD detection network with weights pretrained on VOC0712 and mAP=0.727).

VOC0712 is an image dataset for object class recognition (the combined PASCAL VOC 2007 and 2012 sets), and mAP (mean average precision) is the most common metric used in object detection.
If we combine the MobileNet architecture with the Single Shot Detector (SSD) framework, we arrive at a fast, efficient deep learning-based method for object detection.

The question arises: why do we use MobileNet and not ResNet, VGG or AlexNet?
The answer is simple: ResNet, VGG and AlexNet are large networks, which increases the amount of computation, whereas MobileNet has a simple architecture built from 3×3 depthwise convolutions, each followed by a 1×1 pointwise convolution.
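
The saving from a depthwise plus pointwise pair can be checked with simple arithmetic. The sketch below compares weight counts for a standard convolution and a depthwise separable one; the layer size (256 input and 256 output channels) is a hypothetical example, and bias terms are ignored.

```python
def standard_conv_params(kernel, in_ch, out_ch):
    """Weights in a standard kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch


def separable_conv_params(kernel, in_ch, out_ch):
    """Weights in a depthwise (kernel x kernel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * in_ch  # one filter per input channel
    pointwise = in_ch * out_ch           # 1x1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 256, 256  # hypothetical layer size
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(sep / std, 3))  # 589824 67840 0.115
```

For this layer the separable version needs roughly one ninth of the weights, which is where most of MobileNet's efficiency comes from.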

Pseudo-code implementation of Single shot object detection (SSD)

Let us take a look at the pseudo-code so we get an overview of how to implement this single shot object detection algorithm.

• The first step is to load a pre-trained object detection network with OpenCV's dnn (deep neural network) module.

• This allows us to pass input images through the network and obtain the bounding box (x, y)-coordinates of each object in the image.

• Next, we write the code to print the names of the detected objects and their confidence scores.

• Finally, we look at the output of the MobileNet Single Shot Detector for our input images.

Project Structure

The project contains three files and one folder of images.

1- main.py

It is the main file that contains object detection code

2- MobileNetSSD_deploy.prototxt.txt

Caffe deploy prototxt file

Get the prototxt file from the following link

https://github.com/chuanqi305/MobileNet-SSD/blob/master/deploy.prototxt

3- MobileNetSSD_deploy.caffemodel

Caffe pretrained model

Get the Caffe pretrained model from the following link

https://drive.google.com/file/d/0B3gersZ2cHIxRm5PMWRoTkdHdHc/view

4- Finally, the image folder containing the images used to test the code.

Prerequisites

· Basic programming knowledge

Python: It is one of the simplest languages to learn, and we use it throughout this project. You need a basic understanding of Python to follow the code easily. Note: advanced programming knowledge is not required; a basic understanding is enough.

· Installed tools

1- For this program, you will need Python installed on your computer

2- We also have to install the OpenCV and NumPy libraries to run our program

pip install numpy

pip install opencv-python

The NumPy library is used for numerical computation and provides tools for working with arrays, and OpenCV is used to load an image, display it and save it back.
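
Before moving on, it can help to confirm that both libraries are importable. The check below is a small sketch that uses only the standard library; the helper name missing_packages is hypothetical. Note that OpenCV is imported under the module name cv2.

```python
import importlib.util


def missing_packages(modules=("numpy", "cv2")):
    """Return the module names from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Please install: {}".format(", ".join(missing)))
else:
    print("All required packages are available")
```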

Now let us jump right into the code

Step 1- Load all the required libraries

import numpy as np
import cv2
import argparse

NumPy is used for numerical computations. cv2 is used to load the input image and display the output. argparse makes it easy to write user-friendly command-line interfaces; in this code we use it to parse the command-line arguments.

Step 2- The next step is to parse our command-line arguments as follows

ap = argparse.ArgumentParser(description="Program to run MobileNet-SSD object detection network")

ap.add_argument("-i", "--image", type=str, required=True, help="path to input image")

ap.add_argument("-p", "--prototxt", required=True,
    help="Path to Caffe 'deploy' prototxt file")

ap.add_argument("-m", "--model", required=True,
    help="Path to weights for Caffe model")

ap.add_argument("-c", "--confidence", type=float, default=0.2, help="minimum probability to filter weak detections")

In the above block of code

--image: path to the input image

--prototxt: path to the Caffe prototxt file

--model: path to the pretrained model

--confidence: the minimum probability threshold used to filter weak detections; its default value is 0.2.

Now we parse the above arguments

args=vars(ap.parse_args())
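
To see what the parsed dictionary looks like without running the script from a terminal, the same parser can be given an explicit argument list. This is a self-contained sketch; the build_parser helper is a hypothetical wrapper, and the file names are simply the ones used in this post.

```python
import argparse


def build_parser():
    """Build the same command-line interface described in Step 2."""
    ap = argparse.ArgumentParser(
        description="Program to run MobileNet-SSD object detection network")
    ap.add_argument("-i", "--image", type=str, help="path to input image")
    ap.add_argument("-p", "--prototxt", required=True,
                    help="Path to Caffe 'deploy' prototxt file")
    ap.add_argument("-m", "--model", required=True,
                    help="Path to weights for Caffe model")
    ap.add_argument("-c", "--confidence", type=float, default=0.2,
                    help="minimum probability to filter weak detections")
    return ap


# Passing a list instead of reading sys.argv makes the result easy to inspect
args = vars(build_parser().parse_args([
    "--image", "images/car.jpg",
    "--prototxt", "MobileNetSSD_deploy.prototxt.txt",
    "--model", "MobileNetSSD_deploy.caffemodel",
]))
print(args["image"], args["confidence"])  # images/car.jpg 0.2
```

Because --confidence was not supplied, it falls back to its default of 0.2.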

Step 3- The next step is to define the class labels and the colors of the bounding boxes.

CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
"sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

The model detects only the objects listed in CLASSES, and each class is assigned a random color that is used for its bounding boxes.

Step 4- After that, we load the model using the paths from the command-line arguments

print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"],
                args["model"])

Step 5- Now we load the input image and construct an input blob for it (in OpenCV's dnn module, a blob is the preprocessed image stored as a 4-dimensional array: batch, channels, height, width). The image is resized to a fixed 300×300 pixels and then normalized (note: the normalization follows the one used by the authors of the MobileNet-SSD implementation).

image = cv2.imread(args["image"])
(h, w) = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843, (300, 300), 127.5)
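
The two numbers passed to blobFromImage are the scale factor (0.007843, roughly 1/127.5) and the mean (127.5). OpenCV subtracts the mean and then multiplies by the scale factor, which maps pixel values from the range 0 to 255 into roughly -1 to 1. The sketch below reproduces that arithmetic for single pixel values in plain Python; the helper name normalize_pixel is hypothetical.

```python
MEAN = 127.5
SCALE = 0.007843  # approximately 1 / 127.5


def normalize_pixel(value, mean=MEAN, scale=SCALE):
    """Apply the same per-pixel arithmetic as the blob step: (value - mean) * scale."""
    return (value - mean) * scale


for value in (0, 127.5, 255):
    print(value, "->", round(normalize_pixel(value), 4))
```

A black pixel ends up close to -1, a mid-gray pixel at 0, and a white pixel close to 1.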

Step 6- After that we pass the blob through our neural network

net.setInput(blob)
detections = net.forward()

The lines above set the blob as the network input and then compute the forward pass to obtain the detections and predictions.
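
For this kind of detection model, the array returned by net.forward() has shape (1, 1, N, 7), where each of the N rows holds [image id, class id, confidence, x1, y1, x2, y2] with the box coordinates expressed as fractions of the image size. The sketch below decodes such rows using plain Python lists so that it runs without OpenCV; the rows themselves are made-up values and decode_detections is a hypothetical helper.

```python
def decode_detections(rows, width, height, min_confidence=0.2):
    """Keep confident rows and scale their boxes to pixel coordinates."""
    results = []
    for _image_id, class_id, confidence, x1, y1, x2, y2 in rows:
        if confidence <= min_confidence:
            continue  # weak detection, skip it
        box = (int(x1 * width), int(y1 * height),
               int(x2 * width), int(y2 * height))
        results.append((int(class_id), confidence, box))
    return results


# Made-up rows for illustration: one confident detection and one weak one
rows = [
    [0.0, 7.0, 0.99, 0.25, 0.25, 0.75, 0.5],
    [0.0, 15.0, 0.05, 0.125, 0.5, 0.25, 0.75],
]
decoded = decode_detections(rows, width=640, height=480)
print(decoded)  # [(7, 0.99, (160, 120, 480, 240))]
```

Class id 7 corresponds to "car" in the CLASSES list defined in Step 3.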

Step 7- This step is used to determine what and where the objects are in the image.

for i in np.arange(0, detections.shape[2]):
    # confidence (probability) of the i-th detection
    confidence = detections[0, 0, i, 2]
    if confidence > args["confidence"]:
        # class index, then box corners scaled back to the image size
        idx = int(detections[0, 0, i, 1])
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        (startX, startY, endX, endY) = box.astype("int")
        label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
        print("[INFO] {}".format(label))
        cv2.rectangle(image, (startX, startY), (endX, endY), COLORS[idx], 2)
        # draw the label above the box, or inside it when too close to the top
        m = startY - 15 if startY - 15 > 15 else startY + 15
        cv2.putText(image, label, (startX, m), cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

Here we loop over all the detections and extract the confidence score of each one. We then filter out the weak detections whose probability is below the threshold (20% by default). Finally, we print each detected object and its confidence score (it tells us how confident the model is that the box contains an object and how accurate it thinks the predicted box is).

Mathematically, it is calculated as

confidence score = probability of an object * IOU

IOU stands for Intersection over Union. It is the ratio of the area of overlap between the ground-truth box and the predicted box to the area of their union.

IOU = area of overlap / area of union
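
The IOU formula can be written as a short function. This is a minimal sketch, assuming boxes are given as (startX, startY, endX, endY) corner coordinates like the ones in Step 7; the function name iou is our own.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes give 1.0
```

An IOU of 1 means the predicted box matches the ground truth exactly, and 0 means the boxes do not overlap at all.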

Step 8- Finally, we use cv2's imshow function to display the output image on screen until a key is pressed.

cv2.imshow("Output", image)
cv2.waitKey(0)

Object detection results

To run the code, use the following command.

In this example we detect multiple cars using deep learning-based object detection

python main.py --prototxt MobileNetSSD_deploy.prototxt.txt --model MobileNetSSD_deploy.caffemodel --image images/car.jpg

[INFO] loading model...
[INFO] computing object detections
[INFO] car: 99.25%
[INFO] car: 99.78%

Our first result shows that both cars were detected with confidence scores above 99%.

In the next example we detect an aeroplane using deep learning-based object detection:

Wrap up the Session

In today's blog post we learned about single shot object detection using OpenCV and deep learning. There are many flavors of object detection, such as YOLO and region-based convolutional neural networks (R-CNN). In this blog post, we used SSD, which speeds up the process by eliminating the region proposal network. To offset the resulting drop in accuracy, we combine MobileNet and SSD. We have also learned that YOLO has a faster processing speed than the other detection methods.