Research Proposal

An end-to-end model for Autonomous Driving

Talha Hanif Butt
Nov 18, 2022

I got a PhD offer from Halmstad University, Sweden, on AI for commercial vehicles, and I have accepted it. It is my third offer: I could not take up the first because I had not yet completed my Master's, and I could not take up the second because I fell ill. Hopefully everything goes well this time.

This article presents the research proposal I wrote for my PhD applications:

Summary of the proposal

I want to work at the intersection of Reinforcement Learning and Computer Vision for navigation. The idea is to combine Reinforcement Learning with Student-Teacher networks into an end-to-end framework for navigation, similar in spirit to AlphaZero [8], with Student-Teacher networks handling sub-tasks including Localization, Detection, Segmentation etc. The ultimate objective is to train a student network to perform all tasks by learning from different teachers, much as a person does while going to college, and then to continue its learning after graduation from its own mistakes using Reinforcement Learning, just as an intelligent person does throughout life.


In March 2016, DeepMind's AlphaGo [7] beat world champion Go player Lee Sedol 4–1. In October 2017, AlphaGo Zero [8] defeated AlphaGo 100–0. In December 2017, DeepMind released another paper [8] showing how AlphaGo Zero could be adapted to beat the world-champion programs Stockfish and Elmo at chess and shogi. An algorithm for getting good at something without any prior knowledge of human expert strategy was born.

The idea of teacher-student learning emerged from the thought of compressing the knowledge of an ensemble of DNNs into a single DNN in Model Compression [2]. Instead of performing time-consuming labeling, the authors of [5] proposed to take a trained network (the teacher network) and use its soft outputs on unlabeled data as targets for a small-size network (the student network). Pioneering work by [4] showed that the additional information encapsulated in the soft outputs of a teacher DNN helps when training a student DNN. [9] proposed to make use of enhanced features in the student-teacher learning paradigm. The enhanced features are used as input to a teacher network to obtain soft targets, while a student network tries to mimic the teacher network's outputs using the original noisy features as input, so that speech enhancement is implicitly performed within the student network. In [6], a stage-wise teacher-student learning scheme is proposed: in a first step, the student network learns an intermediate feature representation of the teacher network, and only then is it trained with the actual soft outputs from the teacher network.
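The soft-target idea can be illustrated with a small sketch. It is not taken from any of the cited papers: the temperature value, the logits, and the function names are assumptions chosen for illustration, and the loss shown is the commonly used KL divergence between temperature-softened teacher and student outputs.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one unlabeled sample (illustrative values only).
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]
loss = distillation_loss(student_logits, teacher_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's softened distribution exactly and grows as the two diverge, which is why no ground-truth label is needed for the sample.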

Building on these ideas, I propose to design an end-to-end framework for autonomous driving consisting of a student-teacher architecture based on reward functions.

Research Questions

Is it possible to develop a student or students such that all driving tasks are performed efficiently?
How can a teacher or teachers be obtained to train such a student or students?
How should a reward function for such a system be designed, and would it suffice to aggregate multiple reward functions?
Is such a system really feasible given the resources required to train it?
Is it possible to get real-time performance from such a system?
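To make the reward-aggregation question concrete, one simple candidate is a weighted sum of per-objective rewards. The sketch below is only an illustration of that candidate: the objective names, the weights, and the function name are hypothetical, and whether such an aggregate suffices is exactly what the research question asks.

```python
def aggregate_reward(sub_rewards, weights):
    """Weighted sum of per-objective rewards for a single time step."""
    missing = set(sub_rewards) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in sub_rewards.items())

# Hypothetical objectives and weights (assumptions, not tuned values).
weights = {"progress": 1.0, "lane_keeping": 0.5, "collision": 10.0}

# One step: good progress, slight lane deviation, no collision.
step = {"progress": 0.8, "lane_keeping": -0.2, "collision": 0.0}
r = aggregate_reward(step, weights)

# The same step with a collision penalty of -1.0 dominates the total.
crash = dict(step, collision=-1.0)
r_crash = aggregate_reward(crash, weights)
print(r, r_crash)
```

A fixed weighted sum makes the trade-off between objectives explicit, but it also means the weights themselves become a design decision that would need to be justified.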

Proposed Approach

Autonomous navigation imposes strict requirements on model size and energy efficiency so that an embedded system can perform real-time on-board object detection. Low-precision neural networks are a popular technique for reducing computation requirements and memory footprint. Among them, binary weight neural networks (BWNs) are the extreme case, quantizing each floating-point weight to just 1 bit. BWNs are difficult to train and suffer from accuracy degradation due to the extremely low-bit representation. To address this problem, [10] propose a knowledge transfer (KT) method that aids the training of a BWN using a full-precision teacher network.
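A minimal sketch of what 1-bit weight quantization means is given below. It assumes a per-layer scaling factor equal to the mean absolute weight, which is a common choice in the binary-weight literature and not necessarily the scheme used in [10]; the function names and weight values are illustrative.

```python
def binarize(weights):
    """Quantize weights to {-alpha, +alpha}, with alpha = mean(|w|) (assumed)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]  # 1 bit per weight
    return alpha, signs

def dequantize(alpha, signs):
    """Reconstruct the approximate full-precision weights."""
    return [alpha * s for s in signs]

# Hypothetical full-precision weights of one layer.
w = [0.4, -0.2, 0.1, -0.7]
alpha, signs = binarize(w)
w_hat = dequantize(alpha, signs)

# The quantization error is what makes BWNs hard to train.
error = sum((a - b) ** 2 for a, b in zip(w, w_hat))
print(alpha, signs, error)
```

Only the sign bits and one scalar per layer need to be stored, which is where the memory saving comes from; the reconstruction error shows the price paid in precision.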

End-to-end learning from sensory data has shown promising results in autonomous navigation. While employing many sensors enhances world perception and should lead to more robust and reliable behavior of autonomous vehicles, it is challenging to train and deploy such a network, and at least two problems are encountered in this setting. The first is the increase of computational complexity with the number of sensing devices. The other is the phenomenon of the network overfitting to the simplest and most informative input. [3] address both challenges with a novel, carefully tailored multi-modal experts network architecture and propose a multi-stage training procedure. The network contains a gating mechanism, which selects the most relevant input at each inference time step using a mixed discrete-continuous policy.
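The effect of a gating mechanism on inference cost can be sketched as follows. This is a deliberately simplified hard-selection gate, not the mixed discrete-continuous policy of [3]: the gate scores are given rather than learned, and the two expert functions are hypothetical stand-ins for per-sensor networks.

```python
def select_expert(gate_scores):
    """Return the index of the highest-scoring input (hard gating)."""
    return max(range(len(gate_scores)), key=lambda i: gate_scores[i])

def gated_forward(inputs, experts, gate_scores):
    """Evaluate only the selected expert, so cost does not grow with sensor count."""
    k = select_expert(gate_scores)
    return k, experts[k](inputs[k])

# Hypothetical per-sensor experts (placeholders for real networks).
def camera_expert(x):
    return sum(x) / len(x)

def lidar_expert(x):
    return min(x)

experts = [camera_expert, lidar_expert]
inputs = [[0.2, 0.4], [5.0, 3.0]]   # one reading per sensor at this time step
gate_scores = [0.3, 0.9]            # assumed to come from a gating network

k, out = gated_forward(inputs, experts, gate_scores)
print(k, out)
```

Because only one expert runs per time step, adding a sensor adds a candidate for the gate to choose from without adding a forward pass at inference time.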

The trend towards autonomous systems in today's technology comes with the need for environment perception. Deep neural networks (DNNs) have consistently shown state-of-the-art performance over the last few years in visual machine perception, e.g., semantic segmentation. While DNNs work well on uncorrupted data, recently introduced adversarial examples (AEs) lead to misclassification with high confidence. This lack of robustness against adversarial attacks calls into question the use of DNNs in safety-critical autonomous systems, e.g., autonomous robots. [1] address this problem with a redundant teacher-student framework, consisting of a static teacher network (T), a static student network (S), and a constantly adapting student network (A). By using this triplet in combination with a novel inverse feature matching (IFM) loss, they show that a significant robustness increase of student DNNs against adversarial attacks is achievable, while maintaining semantic segmentation quality at a reasonably high level.


During my PhD, I want to work on revolutionary ideas with an impact similar to that of Mastering the game of Go without human knowledge, Distinctive Image Features from Scale-Invariant Keypoints, A New Approach to Linear Filtering and Prediction Problems, Histograms of Oriented Gradients for Human Detection, ImageNet classification with deep convolutional neural networks etc. Specifically, as far as autonomous driving is concerned, I believe that an end-to-end solution similar to AlphaZero is the way forward, with sub-tasks including Driver monitoring, Localization, Detection, Segmentation etc. A possible direction is to combine Reinforcement Learning with Student-Teacher networks, which I would love to explore in further detail during my PhD.


[1] Andreas Bär, Fabian Hüger, Peter Schlicht, and Tim Fingscheidt. On the robustness of redundant teacher-student frameworks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[2] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
[3] Shihong Fang and Anna Choromanska. Multi-modal experts network for autonomous driving. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 6439–6445. IEEE, 2020.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[5] Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size DNN with output-distribution-based criteria. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[6] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[7] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[8] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[9] Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, and John R Hershey. Student-teacher network learning with enhanced features. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5275–5279. IEEE, 2017.
[10] Jiaolong Xu, Yiming Nie, Peng Wang, and Antonio M López. Training a binary weight object detector by knowledge transfer for autonomous driving. In 2019 International Conference on Robotics and Automation (ICRA), pages 2379–2384. IEEE, 2019.

That’s it for now. See you later.