Research Proposal 2

Unmanned Aerial Vehicle (UAV) Visual Perception for Safe Landing

Talha Hanif Butt
Dec 5, 2022

I wrote two research proposals for my PhD applications. I have previously written about one of them here:

Today, I will write about my second research proposal which is as follows:


Drones can make goods delivery more efficient because they are not affected by traffic congestion or shortages of delivery personnel. Companies like Amazon, Uber, UPS and FedEx are advancing drone delivery projects for the development of smart cities. Development has nonetheless been slow, for reasons that include scene complexity, insufficient safety guarantees, and the excessive weight of accessories. The state-of-the-art performance of visual analysis algorithms is still not good enough for real applications, mainly due to the challenges of visual perception. Visual perception for UAVs is more challenging than for autonomous driving for two reasons: (1) the 3D motion of drones makes scenes much more dynamic, complex, and flexible; (2) because network signals are unstable, on-board processing is safer than remote processing. The computing capacity of a common drone is far lower than that of an unmanned vehicle because of limits on size and weight capacity, which imposes a need for efficient visual perception algorithms.

As a long-term project, the aim is to study perception for UAVs by investigating crucial problems including SLAM, scene reconstruction, scene recognition, semantic segmentation, and autopilot, towards the application of UAVs in smart cities. The main focus of this thesis is the safe landing of UAVs in urban areas [2, 9], where visual perception is more challenging due to the high complexity of the scene, including static objects of various shapes (e.g., buildings and trees), moving objects (e.g., cars, pedestrians, and animals), and limited space for flying. The goal is to study depth map estimation and camera motion estimation and to advance the state of the art in both.

Related Work

There has been little research on the problem of drone safe landing through the use of on-board cameras. [2] addressed the problem of human crowd detection from drones, proposing a very lightweight FCN classification model to distinguish between crowded and non-crowded scenes. The VisDrone dataset [26] was used for training. The method is based on two loss functions: one for classifying whether the image contains a crowd, and the other for predicting the people count, which helps the classification task. The method also provides heatmaps that can be used to semantically enrich the flight maps. However, the authors highlight an open issue, the lack of a well-defined concept of crowdedness, and suggest it as the cause of poor classification performance when the people count is around 10. They recommend investigating a concept of crowdedness based on the spatial density of the crowd. [8] compared different strategies for crowd detection and confirmed that density maps are to be preferred over detect-then-count techniques as the crowd density increases. It was also observed that these density maps could be used to identify safe landing regions in crowded scenes. Density maps provide the count along with spatial information from crowded scenes. The requirements for a lightweight architecture are not formally defined in the literature, so [9] consider as lightweight any architecture with fewer than 1 million parameters. They trained on the UCF-QNRF dataset [13] and tested their approach on the ShanghaiTech dataset [24] and the Venice dataset [18] without specifically training on the latter two. [3] introduced the multi-column CNN (MCNN) for image classification. [24] propose a three-column CNN for crowd counting in an arbitrary still image. The input of the MCNN is the image, and its output is a crowd density map whose integral gives the overall crowd count. They argue that a density map of the crowd (say, how many people per square meter) is better than a head count because it preserves more information: it gives the spatial distribution of the crowd in the given image, and the learned filters are more adapted to heads of different sizes, hence more suitable for arbitrary inputs whose perspective effect varies significantly. [20] focused on designing a sparse network structure to reduce the number of parameters by using three stacked filters of different sizes and using a merged feature map at once.
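
To make the density-map idea concrete, here is a minimal sketch in plain Python. It is not the implementation of [24] or [9]. It assumes each annotated head is spread with a fixed-width Gaussian that is normalized to sum to one; the function names, the `sigma` value, and the small grid sizes are illustrative assumptions. Under that assumption, summing the whole map recovers the crowd count, and summing a sub-window gives a local estimate that could help flag candidate landing regions.

```python
import math


def gaussian_density_map(points, height, width, sigma=1.5):
    """Build a density map from (row, col) head annotations.

    Each annotation contributes one Gaussian blob normalized over the grid,
    so every person adds exactly 1 to the sum of the map.
    """
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        # Unnormalized Gaussian evaluated at every pixel of the grid.
        kernel = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(sum(row) for row in kernel)
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density


def crowd_count(density):
    """Discrete 'integral' of the density map: the estimated number of people."""
    return sum(sum(row) for row in density)


def region_count(density, y0, y1, x0, x1):
    """Estimated number of people inside the window [y0, y1) x [x0, x1)."""
    return sum(sum(row[x0:x1]) for row in density[y0:y1])
```

A window whose `region_count` stays close to zero contains almost no estimated crowd mass, which is the property a landing-region proposal step would look for.
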
[9] developed a new lightweight architecture for density map generation based on CCNN (Compact Convolutional Neural Network), trained using the Bayesian loss and pruned to reduce the number of parameters. The resulting density maps are then used to propose emergency landing regions using the Polylabel algorithm. [19] propose the Bayesian loss, which constructs a density contribution probability model from the point annotations. The expected count at each annotated point is calculated by summing, over all pixels, the product of the contribution probability and the estimated density; it can be reliably supervised by the ground-truth count value, which is one for each annotated point. They suggest incorporating knowledge such as specific foreground or background priors, scale and temporal likelihoods, and other facts to further improve the method. The goal of pruning is to increase the sparsity of a CNN, reducing the number of parameters [16]. An idea that emerged in [16] is that the closer to zero the norm of a channel is, the less relevant it is to the final inference. However, [9] suggest that this is not always the case, so each channel has to be tried separately and in groups, keeping in mind that increasing sparsity, especially in lightweight models, could significantly decrease the accuracy, even to the point of making the architecture useless.
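
The expected-count computation described above can be sketched as follows. This is a simplified illustration rather than the reference code of [19]: it assumes an isotropic Gaussian likelihood around each annotation with a hand-picked `sigma`, uses an absolute-error penalty, and leaves out any background modelling. The function name and the arguments are hypothetical.

```python
import math


def bayesian_loss(density, points, sigma=2.0):
    """Return (loss, expected_counts) for an estimated density map.

    density: 2D list of estimated density values.
    points:  list of (row, col) head annotations.
    """
    height, width = len(density), len(density[0])
    expected = [0.0] * len(points)
    for y in range(height):
        for x in range(width):
            # Likelihood of this pixel under a Gaussian centred on each annotation.
            lik = [
                math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                for py, px in points
            ]
            norm = sum(lik)
            if norm == 0.0:
                continue  # numerically too far from every annotation
            for n, value in enumerate(lik):
                # Contribution probability times the estimated density at this pixel.
                expected[n] += (value / norm) * density[y][x]
    # Each annotated point should account for exactly one person.
    loss = sum(abs(1.0 - e) for e in expected)
    return loss, expected
```

With this formulation the supervision target is the count per annotation, so no ground-truth density map has to be rendered from the point labels.
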

Depth estimation is essential for understanding the 3D structure of scenes from 2D images. [5] discretize depth and recast depth network learning as an ordinal regression problem, where ordinal regression aims to learn a rule to predict labels from an ordinal scale. The discretization uses a spacing-increasing discretization (SID) strategy: because the uncertainty in depth prediction increases with the underlying ground-truth depth, SID allows a relatively larger error when predicting larger depths, which limits the influence of large depth values on the training process. The main hurdle is the rapid decrease in the spatial resolution of feature maps due to repeated pooling operations in deep feature extractors [21, 11, 4, 7, 14, 15, 23]. To obtain a high-resolution depth map, previous networks incorporate multi-scale features as well as full-image features in a complex architecture, which complicates network training and largely increases the computational cost. In addition, training a regression network for depth estimation suffers from slow convergence and unsatisfactory local solutions. [5] introduce a network that uses dilated convolutions and an image encoder to produce a high-resolution depth map. To improve the training of the network, an ordinal regression training loss and a depth discretization strategy were integrated.
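
As an illustration of spacing-increasing discretization, the sketch below computes bin thresholds that are uniform in log space, t_i = exp(log(alpha) + log(beta / alpha) * i / K), so bins widen as depth grows. The depth range `alpha` to `beta`, the number of bins `k`, and the helper names are assumptions chosen for the example, not settings taken from [5].

```python
import math


def sid_thresholds(alpha, beta, k):
    """Return the K + 1 SID thresholds t_0..t_K covering the range [alpha, beta]."""
    return [
        math.exp(math.log(alpha) + math.log(beta / alpha) * i / k)
        for i in range(k + 1)
    ]


def depth_to_ordinal_label(depth, thresholds):
    """Index of the bin [t_i, t_{i+1}) containing depth, clamped to 0..K-1."""
    k = len(thresholds) - 1
    for i in range(k):
        if depth < thresholds[i + 1]:
            return i
    return k - 1  # depths at or beyond the last threshold fall in the last bin
```

For example, `sid_thresholds(1.0, 80.0, 8)` yields bins whose widths grow with depth, so a fixed label error corresponds to a larger metric error at long range than at short range.
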

Motion estimation is a fundamental part of mobile robotic systems. [22] propose a CNN combined with a recurrent neural network (RNN) to detect keypoints and generate the corresponding descriptors. The focus is frame-to-frame motion estimation using information from a vision sensor. They suggest using Deep Compression [10] and SqueezeNet [12] to reduce network depth and storage consumption. A CNN can also be used to produce representative binary features [17]. A shallow backbone can also be used at the cost of some accuracy loss.

Domain mapping, or image-to-image translation, aims to translate an image from one domain to another. [6] enforce geometry consistency for unsupervised domain mapping and compare it with DistanceGAN [1] and CycleGAN [25].

The aim is to study depth map estimation and camera motion estimation for safe and precise landing.


The goal of the study can be achieved by the following methodology:
1. Jointly estimating the depth map and the camera motion in an efficient manner. The plan is to explore the temporal relations between adjacent frames to infer the depth map and camera motion jointly in a Bayesian framework that models occlusion, moving objects, and illumination in a unified way.
2. To allow fast and better generalization of the learned model to new scenarios, domain adaptation methods will be investigated.
3. Development of computationally and memory-efficient lightweight networks for on-board inference of depth and camera motion by exploring network binarization and compression techniques.
We will exploit knowledge distillation to improve the estimation accuracy of light-weight networks by transferring knowledge from large networks.
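
To give a feel for what network binarization involves, here is a small sketch of one common scheme, the scaled sign approximation popularized by XNOR-Net, in which a weight vector is replaced by a single scale factor times a vector of signs. It is shown only as an example and is not necessarily the technique this proposal would adopt; the function names are hypothetical.

```python
def binarize_weights(weights):
    """Approximate real-valued weights by alpha * sign(w).

    alpha is the mean absolute weight, which minimizes the squared error of
    the approximation for a fixed sign pattern.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def reconstruct(alpha, signs):
    """Recover the binarized approximation of the original weights."""
    return [alpha * s for s in signs]
```

Storing one sign per weight plus a single scale factor per filter is what makes this kind of approximation attractive for memory-constrained on-board inference, at the cost of some accuracy.
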


The results will provide insights to address UAV visual perception problems and will also lay a foundation to work on more UAV related projects, toward the development of smart cities.


[1] Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. arXiv preprint arXiv:1706.00826, 2017.
[2] Giovanna Castellano, Ciro Castiello, Corrado Mencar, and Gennaro Vessio. Crowd detection for drone safe landing through fully-convolutional neural networks. In International conference on current trends in theory and practice of informatics, pages 301–312. Springer, 2020.
[3] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition, pages 3642–3649. IEEE, 2012.
[4] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pages 2650–2658, 2015.
[5] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018.
[6] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2427–2436, 2019.
[7] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European conference on computer vision, pages 740–756. Springer, 2016.
[8] Javier Gonzalez-Trejo and Diego Mercado-Ravell. Dense crowds detection and surveillance with drones using density maps. In 2020 International Conference on Unmanned Aircraft Systems (ICUAS), pages 1460–1467. IEEE, 2020.
[9] Javier Antonio Gonzalez-Trejo and Diego A Mercado-Ravell. Lightweight density map architecture for uavs safe landing in crowded areas. Journal of Intelligent & Robotic Systems, 102(1):1–15, 2021.
[10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[12] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[13] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–546, 2018.
[14] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6647–6655, 2017.
[15] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international conference on 3D vision (3DV), pages 239–248. IEEE, 2016.
[16] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[17] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1183–1192, 2016.
[18] Weizhe Liu, Krzysztof Lis, Mathieu Salzmann, and Pascal Fua. Geometric and physical constraints for drone-based head plane crowd density estimation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 244–249. IEEE, 2019.
[19] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6142–6151, 2019.
[20] Xiaowen Shi, Xin Li, Caili Wu, Shuchen Kong, Jing Yang, and Liang He. A real-time deep network for crowd counting. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2328–2332. IEEE, 2020.
[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] Jiexiong Tang, John Folkesson, and Patric Jensfelt. Geometric correspondence network for camera motion estimation. IEEE Robotics and Automation Letters, 3(2):1010–1017, 2018.
[23] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In European conference on computer vision, pages 842–857. Springer, 2016.
[24] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.
[25] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
[26] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge. arXiv preprint arXiv:1804.07437, 2018.

That’s it for now. See you later.