Advanced Computer Vision

Society of AI
10 min read · Sep 17, 2020

Object Detection
● Object detection is one of the traditional computer vision problems.
● Object detection is a computer vision technique used to locate instances of objects within images or videos.


● Object detection comprises two tasks:

○ Image classification:

■ Image classification is the process of assigning a class label to an image.

○ Object localization:

■ Object localization is the process of drawing a bounding box around one or more objects in an image. A minimal sketch of a combined detection result follows this list.
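To make the two tasks concrete, here is a minimal Python sketch of how a single detection, combining a class label (classification) with a bounding box (localization), might be represented. The class names and box format are illustrative assumptions, not a specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # image classification: what the object is
    confidence: float  # how sure the model is, in [0, 1]
    box: tuple         # object localization: (x_min, y_min, x_max, y_max) pixels

# A hypothetical detector output for one image: each object gets a class label
# (classification) and a bounding box (localization).
detections = [
    Detection(label="dog", confidence=0.91, box=(48, 60, 210, 300)),
    Detection(label="person", confidence=0.88, box=(220, 30, 380, 310)),
]

for d in detections:
    print(f"{d.label} ({d.confidence:.2f}) at {d.box}")
```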

Object Segmentation

● Object segmentation indicates instances of known objects by marking the object's specific pixels instead of a coarse bounding box.

● There are many techniques for object detection:

○ YOLO v2

○ R-CNN

○ Fast R-CNN

○ Faster R-CNN

○ Aggregate Channel Features (ACF)

○ SVM Classification with HOG Features

○ Viola-Jones algorithm

○ Gaussian Mixture Models (GMM)

○ Blob Analysis

○ Template Matching

○ Feature Extraction and Matching

● These techniques fall into three main categories:

○ Object detection using deep learning

■ Deep learning-based algorithms use convolutional neural networks.

■ Examples: R-CNN, Fast R-CNN, YOLO v2

○ Object detection using machine learning

■ Machine learning-based algorithms use handcrafted feature extraction before training a model.

■ Examples: Aggregate Channel Features, Viola-Jones algorithm

○ Classical Computer Vision

■ Classical computer vision methods may be sufficient depending on the application

■ Examples: template matching, image segmentation, and blob analysis

Applications of Object Detection

● There are many applications of object detection. Some examples are shown below.

○ Optical Character Recognition

○ Self-driving cars

○ Tracking Objects

○ Face detection and face recognition

○ Identity verification through iris codes

○ Object recognition as image search

YOLO, R-CNN, and SSD

YOLO

● YOLO stands for You Only Look Once. The YOLO algorithm is a fresh approach to object detection.

● YOLO is a clever convolutional neural network (CNN) for performing object detection in real time.

● The YOLO detection system

There are three steps, sketched in code after this list:

○ First, it resizes the input image to 448 × 448.

○ Second, it runs a single convolutional network on the image.

○ Third, it thresholds the resulting detections by the model's confidence.
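The three steps map onto a short pipeline. In the hedged sketch below, `model` stands for any trained YOLO-style network and `decode_detections` is a hypothetical helper that turns the raw output into (box, label, confidence) tuples; dummy stand-ins are provided so the sketch runs end to end.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.25  # step 3: keep only confident detections

def nearest_resize(image, size):
    """Step 1 helper: nearest-neighbour resize to the fixed network input."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows][:, cols]

def yolo_detect(image, model, decode_detections, threshold=CONFIDENCE_THRESHOLD):
    resized = nearest_resize(image, (448, 448))            # step 1: resize
    raw_output = model(resized)                            # step 2: one conv net pass
    detections = decode_detections(raw_output)             # hypothetical decoder
    return [d for d in detections if d[-1] >= threshold]   # step 3: threshold

# Dummy stand-ins so the sketch runs:
dummy_model = lambda x: x.mean()                             # placeholder "network"
dummy_decode = lambda y: [((10, 20, 100, 200), "dog", 0.9)]  # one fake detection
print(yolo_detect(np.zeros((600, 800, 3)), dummy_model, dummy_decode))
```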

● This approach reframes object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities for those boxes.

● Using this method, you only look once at an image to determine what objects are present and where they are.

● This method uses a unified model. A single convolutional neural network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has the following advantages over traditional methods of object detection:

○ YOLO is extremely fast. Since detection is framed as a regression problem, no complex pipeline is needed. We simply run the neural network on a new image at test time to predict detections. The base network runs at 45 frames per second with no batch processing on a Titan X GPU, and a fast version runs at over 150 fps. That means we can process streaming video in real time with less than 25 milliseconds of latency.

○ YOLO achieves more than twice the mean average precision of other real-time systems.

○ Unlike sliding-window and region proposal-based techniques, YOLO sees the entire image during training and test time, so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it cannot see the larger context. YOLO makes less than half the number of background errors of Fast R-CNN.

Architecture of YOLO

● The model is implemented as a convolutional neural network. The initial convolutional layers of the network extract features from the image, while the fully connected layers predict the output probabilities and coordinates. The network has 24 convolutional layers followed by 2 fully connected layers; here, 1 × 1 reduction layers followed by 3 × 3 convolutional layers are used.
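To make the 1 × 1 reduction pattern concrete, here is a minimal PyTorch sketch of one such block. The channel counts are illustrative assumptions, not the exact YOLO configuration; the leaky ReLU activation follows the YOLO v1 paper.

```python
import torch
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """A 1x1 'reduction' conv shrinks channels cheaply before the 3x3 conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 reduction layer
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1),
    )

block = reduction_block(512, 256, 512)  # illustrative channel counts
x = torch.randn(1, 512, 14, 14)         # dummy feature map
print(block(x).shape)                   # torch.Size([1, 512, 14, 14])
```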

Working of the YOLO Algorithm

○ The system divides the input image into an S × S grid. If an object's center falls into a grid cell, that grid cell is responsible for detecting the object.

○ Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object, and also how accurate it thinks its predicted box is. Formally, confidence is defined as Pr(Object) × IOU, where IOU is the intersection over union between the predicted box and the ground truth.

○ If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

○ Each bounding box consists of five predictions: x, y, w, h, and confidence. The coordinates (x, y) represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the entire image. Finally, the confidence prediction represents the IOU between the predicted box and any ground truth box.

○ Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B. At test time, the conditional class probabilities are multiplied by the individual box confidence predictions, Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU, which gives class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object. The sketch after this list shows how these scores fall out of the network's output tensor.

○ There were three main variations of the approach at the time of writing: YOLOv1, YOLOv2, and YOLOv3. The first version proposed the general architecture, the second improved the design and used predefined anchor boxes to enhance the bounding box proposals, and the third further improved the architecture and training method.
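Putting the grid predictions together: for the YOLO v1 settings on PASCAL VOC (S = 7, B = 2, C = 20), the network's output is an S × S × (B × 5 + C) tensor. Below is a small NumPy sketch, with random numbers standing in for a real prediction, of how class-specific confidence scores are computed from that tensor.

```python
import numpy as np

S, B, C = 7, 2, 20                        # YOLO v1 settings for PASCAL VOC
output = np.random.rand(S, S, B * 5 + C)  # stand-in for a network prediction

boxes = output[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence)
box_conf = boxes[..., 4]                          # Pr(Object) * IOU, per box
class_probs = output[..., B * 5:]                 # Pr(Class_i | Object), per cell

# Class-specific confidence: Pr(Class_i | Object) * Pr(Object) * IOU.
# Shape (S, S, B, C): one score per box per class.
class_scores = box_conf[..., None] * class_probs[:, :, None, :]
print(class_scores.shape)  # (7, 7, 2, 20)
```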

R-CNN

● R-CNN stands for Regions with Convolutional Neural Network features.

● Region proposal algorithms

● Such methods take an image as input and output bounding boxes corresponding to all the patches in the image that are most likely to be objects. These region proposals can be noisy and overlapping and may not contain the object perfectly, but among them there will be a proposal that is very close to the actual object in the picture. We can then classify such proposals using an object recognition model; the region proposals with high probability scores are the locations of the object.

● Selective Search

○ A selective search approach is used to find region proposals for classification in R-CNN.

○ Selective search can capture objects at all possible scales with relatively low computational complexity. Around 2,000 region proposals are selected by selective search.

● The R-CNN algorithm consists of two main steps, sketched in code after this list:

○ First, with the help of selective search, it identifies a manageable number of bounding-box object region candidates.

○ Second, it extracts CNN features independently from each region for classification.
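A hedged Python sketch of those two steps, assuming OpenCV's contrib module (opencv-contrib-python) for selective search; `classify_region` stands in for any CNN classifier and is a hypothetical parameter, not a real API.

```python
import cv2

def rcnn_sketch(image, classify_region, max_proposals=2000):
    # Step 1: selective search produces ~2000 candidate regions.
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()[:max_proposals]  # each rect is (x, y, w, h)

    # Step 2: extract CNN features / classify each region independently.
    results = []
    for (x, y, w, h) in rects:
        crop = image[y:y + h, x:x + w]
        warped = cv2.resize(crop, (224, 224))   # fixed CNN input size
        label, score = classify_region(warped)  # hypothetical CNN classifier
        results.append(((x, y, w, h), label, score))
    return results
```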

SSD

● SSD stands for Single Shot Detector.

● Faster R-CNN uses a region proposal network to create bounding boxes and uses those boxes to classify objects. While it is considered the state of the art in accuracy, the entire process runs at only 7 frames per second, far below what real-time processing needs.

● SSD speeds up the process by eliminating the need for a region proposal network. To recover the drop in accuracy, SSD applies a few improvements, including multi-scale features and default boxes. These improvements allow SSD to match Faster R-CNN's accuracy using lower-resolution images, which pushes the speed even higher.

● SSD object detection is composed of two parts, illustrated in the sketch after the figure below:

○ Extract feature maps, and

○ Apply convolution filters to detect objects.

Single Shot Detector
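Below is a hedged PyTorch sketch of those two parts. The layer sizes and the number of default boxes are illustrative assumptions, not the published SSD configuration: a small backbone extracts feature maps at two scales, and a 3 × 3 convolutional head on each map predicts class scores and box offsets for k default boxes per location.

```python
import torch
import torch.nn as nn

class TinySSD(nn.Module):
    """Toy two-scale SSD-style detector; all sizes are illustrative only."""
    def __init__(self, num_classes=20, k=4):   # k default boxes per location
        super().__init__()
        out_ch = k * (num_classes + 4)         # class scores + 4 box offsets
        self.stage1 = nn.Sequential(           # part 1: extract feature maps
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.head1 = nn.Conv2d(64, out_ch, 3, padding=1)   # part 2: conv filters
        self.head2 = nn.Conv2d(128, out_ch, 3, padding=1)  # detect at each scale

    def forward(self, x):
        f1 = self.stage1(x)   # higher-resolution map: smaller objects
        f2 = self.stage2(f1)  # lower-resolution map: larger objects
        return self.head1(f1), self.head2(f2)

preds = TinySSD()(torch.randn(1, 3, 256, 256))
print([p.shape for p in preds])
```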

Semantic Segmentation

● We can think of semantic segmentation as image classification at a pixel level.

● Semantic segmentation is the process of linking each pixel in an image to a class label.

● These labels could be person, car, dog, flower, etc.

● Specifically the goal of semantic segmentation is to label each pixel of an image with a corresponding class of what is being represented.

● In simple image classification, a single label is assigned to the whole picture. Semantic segmentation, by contrast, labels every pixel, and it treats multiple objects of the same class as a single entity.

● However, a separate class of models, known as instance segmentation, is able to label the separate instances where an object appears in an image. This kind of segmentation can be very useful in applications that count objects, such as measuring foot traffic in a mall. Instance segmentation treats multiple objects of the same class as distinct individual objects (or instances). Typically, instance segmentation is harder than semantic segmentation. The sketch below makes the pixel-level view concrete.
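Because semantic segmentation is classification at the pixel level, a model's raw output can be viewed as one score per class per pixel. A minimal NumPy sketch, where the shapes and class names are illustrative:

```python
import numpy as np

num_classes, H, W = 3, 4, 4                  # e.g. background, person, dog
logits = np.random.randn(num_classes, H, W)  # stand-in for network output

# Semantic segmentation = image classification at every pixel:
# pick the highest-scoring class independently for each pixel.
label_map = logits.argmax(axis=0)  # shape (H, W), values in {0, 1, 2}
print(label_map)
```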

● Various semantic segmentation techniques:

○ Thresholding method

■ This method uses the histogram peaks of the image to find particular threshold values (see the thresholding sketch after this list).

○ Edge based method

■ It is based on detecting discontinuities (edges) in the image.

○ Region based method

■ It is based on partitioning an image into homogeneous regions.

○ Clustering method

■ It is based on dividing the image into homogeneous clusters.

○ Watershed method

■ It is based on a topological interpretation of the image.

○ PDE based method

■ It is based on solving partial differential equations.

○ ANN Based method

■ It is based on simulating the learning process used for decision making.
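As a concrete example of the thresholding method from the list above, here is a short sketch using OpenCV's Otsu thresholding, which picks a threshold automatically from the image histogram; the synthetic image is only for demonstration.

```python
import cv2
import numpy as np

# Synthetic grayscale image: a bright square on a dark background.
img = np.zeros((100, 100), dtype=np.uint8)
img[30:70, 30:70] = 200
img = img + np.random.randint(0, 30, img.shape, dtype=np.uint8)  # mild noise

# Otsu's method picks the threshold that best separates the histogram's peaks,
# giving a crude foreground/background segmentation.
threshold, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(f"chosen threshold: {threshold}")
```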

● There are some approaches to building semantic segmentation models:

○ Weakly- and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation

○ Fully Convolutional Networks for Semantic Segmentation

○ U-Net: Convolutional Networks for Biomedical Image Segmentation

○ Fully Convolutional DenseNets for Semantic Segmentation

○ Multi-Scale Context Aggregation by Dilated Convolutions

○ Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

U-Net

● The U-Net architecture was developed by Olaf Ronneberger et al. for biomedical image segmentation.

● The U-Net architecture contains two paths:

○ The first path is the contraction path (encoder), which is used to capture the context in the image.

○ The encoder is a traditional stack of convolutional and max pooling layers.

○ The second path is the symmetric expanding path (decoder), which is used to enable precise localization using transposed convolutions.

○ Thus it is an end-to-end fully convolutional network (FCN): it contains only convolutional layers and no dense layers, which is why it can accept images of any size. A minimal code sketch appears after the architecture figures below.

○ U-Net architecture:

U-Net Architecture

○ Detailed architecture:

Detailed Architecture
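A heavily simplified PyTorch sketch of the U-Net idea, with a single downsampling and upsampling stage instead of the four used in the paper; the skip connection concatenates encoder features into the decoder, and every layer is convolutional, so the network accepts any even-sized input.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level U-Net sketch: encoder, decoder, and a skip connection."""
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # contraction path
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)          # per-pixel classes

    def forward(self, x):
        e = self.enc(x)                          # encoder features (full res)
        m = self.mid(self.pool(e))               # bottleneck at half resolution
        u = self.up(m)                           # transposed-conv upsampling
        d = self.dec(torch.cat([u, e], dim=1))   # skip connection: concatenate
        return self.head(d)                      # logits per pixel

out = TinyUNet()(torch.randn(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```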

Face Recognition using Siamese networks

● In face recognition, we want to be able to identify an individual by feeding just one picture of that person's face to the system. If the system does not recognize the image, it means that this person's picture is not stored in the system's database.

● For this we cannot use an ordinary convolutional classifier, because it cannot work well on such a small training set, and it is not convenient to retrain the model every time we add a picture of a new person to the system.

● However, we can use Siamese networks for face recognition.

● The Siamese network's goal is to figure out how similar two inputs are (e.g., for signature verification or facial recognition). The network has two identical subnetworks, both of which share the same parameters and weights.

Siamese Network

● As shown in the figure, the first subnetwork's input is an image, followed by a sequence of convolutional, pooling, and fully connected layers, and finally a feature vector (no softmax is used for classification). The final vector f(x1) is the encoding of the input x1. We then do the same for the image x2, feeding it to the second subnetwork, which is identical to the first, to get the encoding f(x2) of the input x2.

● To compare the two images x1 and x2, we compute the distance d between their encodings f(x1) and f(x2). If it is less than a threshold (a hyperparameter), the two pictures are of the same person; if not, they are two different persons. A minimal sketch of this comparison follows the figure below.

Working of the Siamese network
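A minimal sketch of that comparison step, assuming `embed` is the shared subnetwork that maps a face image to its encoding; the Euclidean distance and the threshold value are illustrative choices, and a dummy embedding is supplied so the sketch runs.

```python
import numpy as np

THRESHOLD = 0.7  # hyperparameter: tune on a validation set

def same_person(x1, x2, embed):
    """Verify two face images with a shared embedding network."""
    f1, f2 = embed(x1), embed(x2)  # both images go through the SAME network
    d = np.linalg.norm(f1 - f2)    # distance between the two encodings
    return d < THRESHOLD           # small distance => same identity

# Dummy stand-in embedding: mean-pool each image into an 8-value vector.
dummy_embed = lambda img: img.reshape(8, -1).mean(axis=1)
a, b = np.random.rand(8, 16), np.random.rand(8, 16)
print(same_person(a, b, dummy_embed))
```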

If you liked the story and want to appreciate us, you can clap as much as you can. Appreciate our work with your constructive comments, and you can also connect with us on:

YouTube: https://www.youtube.com/channel/SocietyOFAI

LinkedIn : https://www.linkedin.com/company/society-of-ai

Facebook: https://www.facebook.com/societyofai/

Website : https://www.societyofai.in/
