ACF Based Region Proposal Extraction for YOLOv3 Network Towards High-Performance Cyclist Detection in High Resolution Images

Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf

Why?

Before starting my MS, I was fortunate to watch one of my seniors go through the application process for PhD admissions. He was among the top 3 of his MS class and had some publications, but when I asked him whether he was thinking of applying to top institutes like MIT, Harvard or Stanford, to my surprise he told me that he had no plans to do so. When I asked why, he said that his papers had not been published at top-tier conferences, so he had little chance of being accepted at a top-ranked university. That day, I realized the importance of “where” a paper gets published.

My major reason for starting an MS was to improve my GPA. I had a 2.6/4 in my BS, because of which I was unable to get admission at any of the 9 universities in Germany where I had applied in 2017. What a feeling that was: “humiliation at its peak, feeling useless and thinking about how I wasted my 4 years and obviously, realizing the importance of GPA”. Luckily, I also applied to a university in the UK, just to see if I had a chance, as I didn’t know that it was easier to get admission there. I got in, but decided not to go and instead to aim for a better PhD by improving my profile through publications and a better GPA in my MS. With these goals in mind, I started my MS journey.

After my first semester, I decided to search for a potential topic for my thesis, as I thought I had a better chance of making a reasonable contribution in my research area if I started early. I began discussions with different professors to see what I could do with them. After meeting some of them, I started working with one, but found that my goal couldn’t be achieved that way, so I decided to email professors at other universities to explore whether something was possible. Surprisingly, I got a response from one such professor and went to meet him, with the aim of initially working for 3 months and then deciding, depending on the results, whether to continue for my thesis. Instead, he offered me an MS thesis, so I asked for a week to think about it and then decided to commit. I had a couple of options, out of which I chose to work with a PhD student working on Cross View Image Retrieval.

After about 3 months, we got a paper published at ICONIP-2019. During that time, I realized that there are certain types of contributions possible in a paper:

Idea, Code, Write-up

I further noticed that I knew how to code but knew nothing about idea generation and writing, so I decided to invest my time in these areas. I decided to find an idea and work on it individually, so as to go through each and every step myself and learn along the way. For this, I was ready to extend my MS to 3 years if I had to, which eventually I did.

What?

I asked my supervisor (not technically my supervisor yet) whether it was possible to start searching for a new topic and work on it alone. After some discussion, he agreed, and I started thinking about what to do next. While going through my previous work, I remembered that I had already done some work in the field of Autonomous Driving, which was as follows:

I had worked on Autonomous Driving in Car Simulations where the goal was to predict steering angle:

Autonomous Driving Demo

I had also worked on an indoor self driving car in which the goal was to be able to drive autonomously inside the university campus.

Through these projects, I had developed a deep interest in Autonomous Driving, and now I had the chance to do research on any topic of my choice. So I decided to work further in this domain, started searching for potential topics in the field, and came up with the following:

Part of my slide on Topics in Autonomous Driving

At this point, I had decided to work on Cyclist Detection so I started reading about it in detail and found the following paper about a recent dataset for Cyclist and Pedestrian detection (Tsinghua-Daimler):

After the dataset, the next step was finding the state of the art. When I started, the state of the art was (to the best of my knowledge):

After a couple of iterations, I understood it as follows:

The YOLO network cannot achieve high precision when detecting small objects in high-resolution images.

Hypothesis

If the potential regions can be extracted first, high-resolution images can be cropped into regions of interest (ROI), and YOLO- or SSD-based methods can then be applied to these small regions to achieve better performance.

To address the above-mentioned problem, the paper proposed:

An effective region proposal extraction method for YOLO network to constitute an entire detection structure named ACF-PR-YOLO.

ACF-PR-YOLO structure includes three main parts:

Region Proposal extraction method based on aggregated channel features (ACF) called ACF-PR.

In ACF-PR, ACF is first used to quickly extract candidates, and then a bounding box merging and extending method merges the bounding boxes into correct region proposals for the subsequent YOLOnet.

YOLOnet for fine detection in the region proposals generated by ACF-PR.

Post processing step in which the results of YOLOnet are mapped into the original image giving the detection and localization results.

The most important part is the region proposal step, which needs to be discussed in some detail.

ACF-PR Region Proposal Generation Method

Generate large potential regions containing objects for the following deep network.

Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf

Given an input image, the ACF computes several channels, sums every block of pixels, smooths the resulting lower resolution channels and uses boosting to distinguish objects.

ACF builds a fast feature pyramid P = {p1, p2, p3, …, pn}, where n is the number of layers. It uses 10 channels: normalized gradient magnitude (1 channel), histogram of oriented gradients (6 channels), and LUV color channels (3 channels).
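The "sums every block of pixels" step can be illustrated with a minimal sketch in plain Python. The function name `aggregate_channel` and the block size of 4 are my own assumptions for illustration; this is not the code of the ACF toolbox or of the paper, and it leaves out channel computation, smoothing and boosting.

```python
def aggregate_channel(channel, block=4):
    """Sum every block x block patch of a 2D channel given as a list of rows.

    Rows and columns that do not fill a complete block are dropped.
    The block size is an illustrative assumption.
    """
    rows = len(channel) // block
    cols = len(channel[0]) // block
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows * block):
        for c in range(cols * block):
            # Each pixel contributes to the single output cell covering it.
            out[r // block][c // block] += channel[r][c]
    return out


if __name__ == "__main__":
    # An 8 x 8 channel of ones becomes a 2 x 2 map of block sums.
    channel = [[1.0] * 8 for _ in range(8)]
    print(aggregate_channel(channel, block=4))
```

The result is a lower-resolution version of each channel, which is what makes the subsequent sliding-window classification cheap.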

Some extras about ACF

Source: https://vision.cornell.edu/se3/wp-content/uploads/2014/09/DollarPAMI14pyramids_0.pdf

The ACF detector is a fast and effective sliding window detector (30 fps on a single core). It is an evolution of the Viola and Jones (VJ) detector, but with a ~100-fold decrease in false positives (at the same detection rate).

ACF is best suited for quasi-rigid object detection (e.g. faces, pedestrians, cars).

The ACF paper is:

A bit about Merger and Extension of Bounding Boxes

One example of the process of merging bounding boxes. Source: https://www.mdpi.com/1424-8220/19/12/2671

In this process, all bounding boxes are divided into two cases according to the distances between them. In the first case, two bounding boxes partially overlap or the distance between them is short. In the second case, two bounding boxes are far away from each other.

In the first case, each detected cyclist instance is marked with several different bounding boxes. In order to merge these into one correct box that covers the entire cyclist instance, two small boxes are merged into one when the distance between them is within a certain range.

The minimum values of the x and y coordinates of the two boxes are represented using:

Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf
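The merging idea can be sketched roughly as follows. This is my own illustration, not the paper's formula: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the distance is assumed to be the gap between box edges, and `max_gap` is a hypothetical threshold standing in for the paper's "certain range".

```python
def box_gap(a, b):
    """Gap in pixels between two (x1, y1, x2, y2) boxes; 0 if they overlap or touch."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def merge_close_boxes(boxes, max_gap):
    """Repeatedly merge any two boxes whose gap is at most max_gap."""
    boxes = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) <= max_gap:
                    a, b = boxes[i], boxes[j]
                    # Merged box: minimum of the top-left corners,
                    # maximum of the bottom-right corners.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


if __name__ == "__main__":
    detections = [(10, 10, 50, 80), (45, 20, 90, 100), (500, 500, 560, 620)]
    print(merge_close_boxes(detections, max_gap=20))
```

Restarting the scan after every merge keeps the logic simple at the cost of speed, which is acceptable for the handful of candidates ACF returns per image.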

In the second case, two bounding boxes are far apart from each other, which means that they belong to different instances and do not need to be merged. Such a bounding box may contain the entire object instance, but it may also contain only part of the instance or just background. For fine detection and localization, these bounding boxes also need to be sent into the following deep network. If the distance between two bounding boxes is not within a certain range, the boxes are regarded as two separate objects. In order to contain as many entire object instances as possible, these bounding boxes are extended into potential regions that serve as inputs for the following network.

All bounding boxes are extended to m × m pixels to serve as potential regions, which ensures that the potential regions contain the whole objects. In this paper, m is set to 832, which is the maximum size of the cyclist instances.

The relationship between the potential region and the bounding box is,

Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf
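One way to picture the extension step is the sketch below. It assumes the m × m window is centered on the bounding box and shifted to stay inside the image; that centering and clamping rule, the function name `extend_to_region` and the example image size are my assumptions, and the paper's exact relationship may differ.

```python
def extend_to_region(box, img_w, img_h, m=832):
    """Extend an (x1, y1, x2, y2) box to an m x m potential region.

    Assumes the image is at least m pixels wide and high.
    """
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    # Top-left corner of an m x m window centered on the box.
    rx = int(round(cx - m / 2))
    ry = int(round(cy - m / 2))
    # Shift the window so that it stays inside the image.
    rx = max(0, min(rx, img_w - m))
    ry = max(0, min(ry, img_h - m))
    return (rx, ry, rx + m, ry + m)


if __name__ == "__main__":
    # Illustrative image size of 2048 x 1024 pixels.
    print(extend_to_region((1000, 400, 1100, 600), 2048, 1024))
    print(extend_to_region((10, 10, 60, 120), 2048, 1024))
```

Because every region has the same size, the crops can be fed to the network without rescaling.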

YOLO Network for Cyclist Detection

The shortcut connections are constructed similarly to those in ResNet. The route layers either combine two feature maps or fetch the feature map of a previous layer. The up-sample layer up-samples the feature map with a stride of 2 via bilinear interpolation. In addition, a batch normalization layer is used to improve convergence. It is not shown in the structure diagram below because every convolutional layer is followed by one.

Network Structure. Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf

Post Processing

The detection results of YOLOv3 are expressed in the coordinates of the potential regions, which are themselves crops of the original image. To get the final detection results, the bounding boxes need to be mapped from the potential regions back into the original image.

The relationship between coordinates of bounding boxes in potential regions and final coordinates in the original images is,

Source: https://www.mdpi.com/1424-8220/19/12/2671/pdf
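Since a potential region is a crop of the original image, the mapping amounts to adding the region's top-left corner to the box coordinates. Here is a minimal sketch under that assumption; the function name `map_to_original` and the `(x1, y1, x2, y2)` pixel layout are mine, not the paper's notation.

```python
def map_to_original(box_in_region, region):
    """Map an (x1, y1, x2, y2) box from region coordinates to image coordinates.

    `region` is the (x1, y1, x2, y2) position of the crop in the original image.
    """
    rx, ry = region[0], region[1]
    x1, y1, x2, y2 = box_in_region
    # A pure translation by the region's top-left corner.
    return (x1 + rx, y1 + ry, x2 + rx, y2 + ry)


if __name__ == "__main__":
    region = (634, 84, 1466, 916)
    print(map_to_original((100, 50, 180, 250), region))
```

If YOLO returns normalized coordinates, they would first have to be multiplied by the region size before applying this offset.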

How?

The problem was that its code was not available, and my supervisor thought it would be difficult to first implement a paper and then improve it. Still, he agreed to let me code it and asked me how much time I would need, and my reply was:

It took me 3 months to implement Alpha Zero.

I had implemented Alpha Zero on Tic Tac Toe in 2018:

He asked me to provide a detailed plan of how I was going to achieve this after which I came up with the following:

A slide of my proposed plan

After planning, it was about implementation, starting off with ACF using:

After training ACF on Inria, I got the following result:

Test on Inria Dataset

Now, the task was to do the same for Cyclist Detection on the Tsinghua-Daimler Dataset. What I used and what I got are as follows:

Train ACF
Modify Detector
ACF on Tsinghua-Daimler Dataset
Test on Tsinghua-Daimler Dataset

After testing ACF on the Tsinghua-Daimler Dataset, the bounding boxes had to be merged and extended to form potential regions. What I used and what I got are as follows:

merge and extend bounding boxes
Results using merge and extend

Before training YOLO, the ACF patches needed to be saved along with their labels in the specific format required by YOLO, for which I used:

Detection using ACF
Crop Patches
Save Patches after ACF execution
Complete Pipeline using ACF to Prepare Data for YOLO
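For context, Darknet-style YOLO training expects one text line per object of the form `class x_center y_center width height`, with all four values normalized by the patch size. Below is a minimal sketch of that conversion for a ground-truth box inside a cropped patch. The function name `to_yolo_label`, the single class id 0 and the clipping behavior are my assumptions for illustration, not the code from my repository.

```python
def to_yolo_label(box, region, class_id=0):
    """Convert an (x1, y1, x2, y2) box in image coordinates to a YOLO label line.

    `region` is the (x1, y1, x2, y2) patch cropped from the image. The box is
    clipped to the patch and normalized by the patch width and height.
    """
    rx1, ry1, rx2, ry2 = region
    w = rx2 - rx1
    h = ry2 - ry1
    # Clip to the patch, then shift into patch coordinates.
    x1 = min(max(box[0], rx1), rx2) - rx1
    y1 = min(max(box[1], ry1), ry2) - ry1
    x2 = min(max(box[2], rx1), rx2) - rx1
    y2 = min(max(box[3], ry1), ry2) - ry1
    xc = (x1 + x2) / 2 / w
    yc = (y1 + y2) / 2 / h
    bw = (x2 - x1) / w
    bh = (y2 - y1) / h
    return f"{class_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}"


if __name__ == "__main__":
    patch = (634, 84, 1466, 916)
    print(to_yolo_label((734, 134, 814, 334), patch))
```

Each patch image then gets a `.txt` file with the same base name, containing one such line per cyclist in the patch.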

Next, I trained YOLO on the patches produced by the previous step using:

The cfg file which I used:

cfg file used for training Tiny-YOLOv3

After training, to get predictions from YOLO, I used:

Test YOLO

The results which I got were as follows:

Results after training Tiny-YOLOv3

The complete process can be summarized as follows:

Complete Process

Code

Code can be accessed using this repository.

Demo

Cyclist Detection Demo

A great Failure

After the implementation, I started working on improving the performance, for which I ran some experiments, all of which failed. One of them was:

What’s Next?

Currently, I am working on another idea and hoping that something will come out of it. Let’s see.
