YOLO: Revolutionising Object Detection in Computer Vision

Maria James
May 29, 2023


In recent years, computer vision has witnessed remarkable advancements, particularly in the field of object detection. One breakthrough that has transformed the landscape of object detection algorithms is YOLO (You Only Look Once). YOLO offers a revolutionary approach, allowing real-time and accurate detection of objects in images and videos. In this blog post, we will delve into the workings of YOLO, its key features, and its impact on the world of computer vision.

YOLO explained

The YOLO algorithm works by dividing the input image into an SxS grid of equally sized cells. Each cell is responsible for detecting and localizing any object whose center falls inside it.

Correspondingly, each cell predicts B bounding boxes, with coordinates expressed relative to the cell, along with a confidence score for an object being present and the class probabilities for that object.
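To make the output layout concrete, here is a minimal sketch of how a YOLO output tensor can be shaped and indexed. The values of S=7, B=2, and C=20 match the original paper's configuration; the tensor here is random, purely for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

# Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores.
output = np.random.rand(S, S, B * 5 + C)

row, col = 3, 4  # pick one grid cell
cell = output[row, col]

boxes = cell[: B * 5].reshape(B, 5)   # per-box (x, y, w, h, conf)
class_probs = cell[B * 5:]            # shared class probabilities

# Box centers are predicted relative to the cell; convert to image-relative.
x_rel, y_rel, w, h, conf = boxes[0]
x_img = (col + x_rel) / S
y_img = (row + y_rel) / S

print(output.shape)  # (7, 7, 30)
```

The last dimension, B * 5 + C = 30, is why the original YOLO output is often described as a 7x7x30 tensor.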

This design keeps computation low, since detection and recognition are handled in a single pass over the grid. However, it produces many duplicate predictions, because multiple cells often detect the same object with slightly different bounding boxes.

YOLO makes use of Non-Maximum Suppression (NMS) to deal with this issue.

In Non-Maximum Suppression, YOLO discards bounding boxes whose probability scores are lower than those of overlapping boxes.

YOLO achieves this by first selecting the bounding box with the highest probability score. It then suppresses every remaining box whose Intersection over Union (IoU) with the selected box exceeds a threshold.

This step is repeated until the final set of bounding boxes is obtained.
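The procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation, not YOLO's actual code; boxes are given as (x1, y1, x2, y2) corners, and the IoU threshold of 0.5 is a common but arbitrary choice:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over Union of one box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, suppress overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]  # highest probability first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]  # drop high-IoU duplicates
    return keep

# Two overlapping detections of the same object plus one distinct box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first heavily (IoU ≈ 0.68), so it is suppressed; the third box does not overlap and survives.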

YOLO Architecture

A comparison of YOLO models:

YOLO:

  • 24 convolutional layers followed by 2 fully connected layers, inspired by GoogLeNet
  • predicts a single object per grid cell.
  • keeps the overall model simple.

YOLO v2:

  • DarkNet-19 containing a total of 19 convolutional layers and 5 max-pooling layers.
  • introduces batch normalization, which increases the network’s mean Average Precision (mAP) by as much as 2 percent.
  • allows the prediction of multiple bounding boxes from a single cell by having the network predict 5 anchor boxes per cell.

YOLO9000:

  • detects far more classes: 9,000 in total
  • has a lower mean Average Precision than YOLOv2

YOLOv3:

  • DarkNet-53 as the model backbone; the full detection network is 106 layers deep, complete with residual blocks and upsampling layers.
  • YOLOv3’s architectural novelty allows it to predict at 3 different scales, with the feature maps being extracted at layers 82, 94, and 106 for these predictions.
  • The architecture concatenates upsampled layer outputs with feature maps from earlier layers, preserving fine-grained features and making smaller objects easier to detect.
  • YOLOv3 predicts only 3 bounding boxes per cell, but it makes predictions at three different scales, for a total of 9 anchor boxes.
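To see what this multi-scale scheme means in raw box counts, here is a quick back-of-the-envelope calculation. It assumes the standard 416x416 input resolution, for which the three YOLOv3 feature maps are 13x13, 26x26, and 52x52:

```python
# YOLOv3 predicts 3 boxes per cell at each of 3 scales.
boxes_per_cell = 3
grid_sizes = [13, 26, 52]  # strides 32, 16, 8 on a 416x416 input

total = sum(s * s * boxes_per_cell for s in grid_sizes)
print(total)  # 10647 candidate boxes before NMS
```

Those 10,647 candidates are exactly why the Non-Maximum Suppression step described earlier is essential at inference time.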

YOLOv4:

  • adds Weighted Residual Connections, Cross mini-Batch Normalization, Cross Stage Partial connections, Self-Adversarial Training, and the Mish activation function

YOLOv5:

  • consists of a family of object detection models and methods based on YOLO, pre-trained on the COCO dataset
