Introduction to Object Detection and Its Evolution: R-CNN, Fast R-CNN, Faster R-CNN, YOLO

Sumeet Sewate
4 min read · Apr 23, 2021


Hi guys, I have been fascinated by object detection since my college days. When I first heard about face recognition it was a wow moment, and it encouraged me on my journey to becoming a data scientist.

# Object Detection

How does a computer see an image and tell us what is in it, along with its exact location? The answer lies in deep learning and computer vision. In classic computer vision, people relied on hand-crafted features such as HOG and SIFT to recognize objects; the influential papers on those techniques appeared around 2004. The big game-changer came in 2012, when ImageNet Classification with Deep Convolutional Neural Networks made its mark in the field of CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

Let us divide the whole detection process into two stages.

(1) Region Proposal (2) Object Detection and Classification.

  1. Region Proposal: The idea is to obtain the regions that might contain objects. The simplest approach is to slide a classifier window of different sizes over the image and identify the regions containing objects; this is known as the sliding window technique (see the sketch after this list). Another technique is Selective Search, which we'll see in a moment.
  2. Object Detection and Classification: All the proposed regions that might contain objects are fed to a CNN (convolution and pooling layers), which extracts the necessary features. Dense layers follow these CNN layers, and the remaining work is split into detecting the object (classification, with a softmax output) and localizing it (bounding-box regression, with a linear output).
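
To make the cost of step (1) concrete, here is a minimal sketch of the sliding window technique. The `classify_crop` callback is a hypothetical stand-in for whatever classifier (HOG + SVM, a small CNN, etc.) scores each crop; the point is simply that every window position triggers one classifier call.

```python
import numpy as np

def sliding_window(image, window_size=(128, 128), stride=32):
    """Yield (x, y, crop) for every window position over an HxWxC image."""
    win_h, win_w = window_size
    h, w = image.shape[:2]
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

def detect(image, classify_crop, window_size=(128, 128), stride=32, threshold=0.5):
    """Run the (hypothetical) classify_crop on every window and keep confident hits."""
    detections = []
    for x, y, crop in sliding_window(image, window_size, stride):
        label, score = classify_crop(crop)           # one classifier call per window
        if score >= threshold:
            detections.append((x, y, window_size[1], window_size[0], label, score))
    return detections

# Even one 480x640 image and a single window size already means ~200 classifier calls
image = np.zeros((480, 640, 3), dtype=np.uint8)
print(len(list(sliding_window(image))))              # 204 windows
```

Multiply that by several window sizes and aspect ratios and the cost per image explodes, which is exactly what the architectures below try to avoid.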

# Different Algorithms to Achieve Object Detection

(1) R-CNN
(2) Fast R-CNN
(3) Faster R-CNN
(4) YOLO

(1) R-CNN: In the sliding window approach, every window needs a lot of computation just to find candidate regions. To overcome this, the authors of R-CNN came up with the idea of using Selective Search. It produces nearly 2,000 region proposals per image, which are then warped, meaning all ~2k crops are scaled to a fixed square shape so they are compatible with the feature extractor, i.e. AlexNet. For classification, they used SVM classifiers on top of the extracted features.
It was still computationally expensive and slow, and there was no learning in the region proposal step.

Fig 1. R-CNN
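
As a rough illustration of the R-CNN pipeline above, the sketch below warps proposal crops to a fixed square, extracts features with a pretrained AlexNet from torchvision, and leaves classification to a linear SVM. The `boxes` argument stands in for Selective Search output, and the exact `weights` argument depends on your torchvision version, so treat this as an assumption-laden outline rather than the original implementation.

```python
import torch
import torchvision.transforms as T
from torchvision.models import alexnet
from sklearn.svm import LinearSVC   # R-CNN trains linear SVMs on top of the CNN features

# Warp every proposal to a fixed 227x227 square, ignoring aspect ratio, as R-CNN does
warp = T.Compose([
    T.ToPILImage(),
    T.Resize((227, 227)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

backbone = alexnet(weights="DEFAULT")            # pretrained AlexNet; argument name varies by torchvision version
backbone.classifier = backbone.classifier[:-1]   # drop the final layer, keep 4096-d fc7 features
backbone.eval()

@torch.no_grad()
def proposal_features(image, boxes):
    """image: HxWx3 uint8 array; boxes: (x, y, w, h) proposals, e.g. from Selective Search."""
    crops = torch.stack([warp(image[y:y + h, x:x + w]) for x, y, w, h in boxes])
    return backbone(crops).numpy()               # one 4096-d feature vector per proposal

# The classifier stage would then look roughly like:
# svm = LinearSVC().fit(proposal_features(train_image, train_boxes), train_labels)
```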

(2) Fast R-CNN: The idea is very similar to the R-CNN algorithm. But instead of feeding the region proposals to the CNN, the whole input image is fed to the CNN once to generate a convolutional feature map. The region proposals are projected onto this feature map, and an RoI pooling layer reshapes each of them into a fixed size so that it can be fed into a fully connected layer, followed by a softmax layer to classify the object in the proposed region. It is faster than R-CNN because there is no need to feed ~2,000 warped crops per image through the CNN.

Fig 2. Fast R-CNN
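
The key new ingredient here is RoI pooling, and torchvision ships an operator for it. The sketch below uses random tensors just to show the shapes: however different the proposals are in image coordinates, each one comes out of `roi_pool` as the same fixed-size feature block, ready for the fully connected layers. The 1/16 spatial scale is an assumed backbone stride, not something mandated by the paper.

```python
import torch
from torchvision.ops import roi_pool

# Pretend this is the conv feature map of a 512x512 image (stride-16 backbone -> 32x32 map)
feature_map = torch.randn(1, 256, 32, 32)

# Two proposals in image coordinates (x1, y1, x2, y2), very different in size
proposals = [torch.tensor([[ 64.,  64., 256., 256.],
                           [128., 300., 480., 500.]])]

# spatial_scale projects image coordinates onto the feature map (32 / 512 = 1/16)
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)

print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -- every RoI becomes the same fixed size
```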

(3) Faster R-CNN: The biggest change in this implementation is that no Selective Search algorithm is used to identify the region proposals. The input image is fed to a CNN to obtain the feature map, a separate network (the Region Proposal Network) predicts the region proposals, these are pooled by RoI pooling to a fixed size, and finally the object within each proposed region is classified.

Fig 3. Faster R-CNN
Figure: training and test time comparison
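
For completeness, here is a minimal inference sketch using torchvision's bundled Faster R-CNN (ResNet-50 FPN backbone with a built-in Region Proposal Network). The `weights` argument and the random input image are placeholders; in practice you would pass real images scaled to [0, 1].

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN: ResNet-50 FPN backbone + built-in Region Proposal Network
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # argument name varies by torchvision version
model.eval()

# The model takes a list of 3xHxW float tensors with values in [0, 1]
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])

# Each prediction dict holds boxes, class labels, and confidence scores
print(predictions[0]["boxes"].shape)
print(predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```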

(4) YOLO (You Only Look Once): YOLO is much faster than the implementations above. The sliding window approach was very expensive in time and computation because a variable-sized window had to be moved over the whole image. To overcome this, the authors of YOLO replaced the dense layers that would serve as the sliding window with convolutional layers, which removes the need for explicit windows altogether. In a nutshell, it divides the image into a fixed-size grid, and for each grid cell it performs classification and predicts the object and its bounding boxes.

Fig 4. YOLO
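
To make the grid idea concrete, here is a small sketch of decoding a YOLOv1-style output tensor of shape S × S × (B·5 + C). The tensor here is random and the confidence threshold is arbitrary; the point is only how each grid cell carries its own box and class predictions, so the whole image is covered in a single pass.

```python
import torch

S, B, C = 7, 2, 20                      # 7x7 grid, 2 boxes per cell, 20 classes (YOLOv1-style layout)

# Stand-in for the network output, reshaped to (S, S, B*5 + C)
output = torch.rand(S, S, B * 5 + C)

def decode_cell(cell_pred, row, col, conf_thresh=0.25):
    """Turn one grid cell's prediction into image-relative boxes above a confidence threshold."""
    boxes = []
    cls = int(cell_pred[B * 5:].argmax())                 # the cell's most likely class
    for b in range(B):
        x, y, w, h, conf = cell_pred[b * 5: b * 5 + 5]    # box b: offsets, size, confidence
        if conf < conf_thresh:
            continue
        cx, cy = (col + x) / S, (row + y) / S             # cell offset -> whole-image coordinates
        boxes.append((cx.item(), cy.item(), w.item(), h.item(), conf.item(), cls))
    return boxes

# Every cell is decoded independently -- one pass over the grid, no sliding window
detections = [d for r in range(S) for c in range(S) for d in decode_cell(output[r, c], r, c)]
print(len(detections))
```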
