How I got my first paper — Part — 2

Literature Review

Talha Hanif Butt
Aug 21, 2020


I previously wrote about how I started working towards my first paper, describing the problem at hand and the direction we thought was possible.

Today, I will focus on the Literature Review which I did after understanding the problem definition.

Base Paper: Localizing and Orienting Street Views using Overhead Imagery



Determine location and orientation of a ground level query image by matching to a reference database of overhead (satellite) images.


  1. Classification and Hybrid architectures are accurate but slow since they allow only partial feature precomputation.
  2. New loss function which significantly improves the accuracy of Siamese and Triplet embedding networks while maintaining their applicability to large scale retrieval tasks.
  3. New dataset of 1 million pairs from 11 cities in the US.
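
To illustrate why embedding networks suit large scale retrieval, here is a minimal sketch, assuming every overhead image has already been reduced to a feature vector offline. The vectors, the `retrieve` helper and the use of plain Euclidean distance are illustrative assumptions, not the paper's actual pipeline.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query_embedding, reference_embeddings, top_k=1):
    """Return indices of the top_k reference embeddings closest to the query."""
    ranked = sorted(
        range(len(reference_embeddings)),
        key=lambda i: euclidean(query_embedding, reference_embeddings[i]),
    )
    return ranked[:top_k]

# Hypothetical precomputed overhead-image embeddings (one per map tile).
reference_embeddings = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.9, 0.2],
]
# Hypothetical embedding of a ground-level query image.
query_embedding = [1.0, 0.1]

best = retrieve(query_embedding, reference_embeddings, top_k=2)
print(best)
```

Because the reference features are computed once, a query only costs one forward pass plus a nearest-neighbour search, which is the property the classification and hybrid architectures lack.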


  1. View point difference b/w ground level and overhead imagery.
  2. Orientation of street view images is unknown.

Solutions Tried

  1. Training for rotation invariance.
  2. Sampling possible rotations at query time.
  3. Explicitly predicting relative rotation of ground and overhead images with deep networks.
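
The second option, sampling rotations at query time, can be sketched roughly as below. This is a toy illustration only: `toy_embed` stands in for a learned CNN embedding, rotations are limited to multiples of 90 degrees, and rotating the overhead tile (rather than the ground image) is a choice made here for simplicity.

```python
def rotate90(image):
    """Rotate a square 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def toy_embed(image):
    """Stand-in for a learned embedding: simply flattens the pixels (assumption)."""
    return [pixel for row in image for pixel in row]

def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def best_rotation(query_feature, overhead_image, steps=4):
    """Try `steps` rotations of 90 degrees; return (distance, angle) of the best one."""
    best = None
    image = overhead_image
    for k in range(steps):
        d = squared_distance(query_feature, toy_embed(image))
        if best is None or d < best[0]:
            best = (d, 90 * k)
        image = rotate90(image)
    return best

overhead = [[1, 2],
            [3, 4]]
# Pretend the query corresponds to the overhead tile rotated by 180 degrees.
query_feature = toy_embed(rotate90(rotate90(overhead)))
distance, angle = best_rotation(query_feature, overhead)
print(distance, angle)
```

The cost of this approach is one distance computation per sampled rotation, which is why the notes also list rotation-invariant training and explicit rotation prediction as alternatives.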


  1. Explicit orientation supervision also improves location prediction accuracy.
  2. The resulting best performing architectures are roughly 2.5 times more accurate than the commonly used Siamese network baseline.
  3. Developed better loss functions using the novel distance based logistic (DBL) layer.
  4. Good representations can be learned by incorporating rotational invariance (RI) and orientation regression (OR) during training.
  5. Siamese like CNN for learning image features
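    During training, contrastive loss is used.
    L(A, B, l) = l * D + (1 - l) * max(0, m - D), where D = D(A, B) is the distance between the two features, l is 1 for a matching pair and 0 otherwise, and m is the margin.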
    This loss function encourages the two features to be similar if the images are a match and separates them otherwise.
  6. Triplet Network for learning image features (Ranking Network)
    L(A, B, C) = max(0, m+D(A, B)-D(A, C)), hinge loss for triplet.
    (A, B) is a match and (A, C) is not.
    This loss encourages distance of the more relevant pair to be smaller than the less relevant pair.
  7. Distance based logistic layer for pair of inputs.
    (Figure: mathematical and graphical representation of the DBL layer, here with m = 10.)

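  8. For optimization, use log loss:
    L(A, B, l) = logloss(p(A, B), l)
    For a triplet, whose label is always 1, this gives L(A, B, C) = -log p(A, B, C) = log(1 + exp(D(A, B) - D(A, C))), with p(A, B, C) as defined below.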

Graphical Representation

Log loss takes into account the uncertainty of a prediction, based on how far the predicted probability is from the actual label.
This gives us a more nuanced view of the model's performance.
For triplet network:
p(A, B, C) = 1/(1 + exp(D(A, B)-D(A, C)))
This represents the probability that the triplet is valid, i.e. that B is more relevant to A than C is to A.
p(A, B, C) + p(A, C, B) = 1
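
The losses above can be written out in a few lines. This is a minimal sketch on scalar distances, assuming the distances D(A, B) and D(A, C) have already been computed by the network; the function names and the sample values are mine, not the paper's.

```python
import math

def contrastive_loss(d, label, margin):
    """L = l * D + (1 - l) * max(0, m - D); label is 1 for a match, 0 otherwise."""
    return label * d + (1 - label) * max(0.0, margin - d)

def triplet_hinge_loss(d_ab, d_ac, margin):
    """L = max(0, m + D(A, B) - D(A, C)), where (A, B) is the matching pair."""
    return max(0.0, margin + d_ab - d_ac)

def triplet_probability(d_ab, d_ac):
    """p(A, B, C) = 1 / (1 + exp(D(A, B) - D(A, C)))."""
    return 1.0 / (1.0 + math.exp(d_ab - d_ac))

def triplet_log_loss(d_ab, d_ac):
    """-log p(A, B, C) = log(1 + exp(D(A, B) - D(A, C)))."""
    return math.log1p(math.exp(d_ab - d_ac))

# Hypothetical distances: the matching pair is closer than the non-matching one.
d_ab, d_ac = 2.0, 5.0
p_valid = triplet_probability(d_ab, d_ac)
p_swapped = triplet_probability(d_ac, d_ab)

print(contrastive_loss(3.0, 1, 10.0), contrastive_loss(3.0, 0, 10.0))
print(triplet_hinge_loss(d_ab, d_ac, 10.0))
print(p_valid + p_swapped)
```

Swapping B and C flips the sign of the exponent, which is why the two probabilities always sum to 1. Unlike the hinge loss, the log loss never becomes exactly zero, so easy triplets still contribute a small gradient.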

Cross View Image Synthesis using Conditional GANs



Cross view image synthesis, aerial to street view and vice versa, using conditional generative adversarial networks (CGAN)

Proposed Architecture

Cross View Fork (X-Fork) and Cross View Sequential (X-Seq) are proposed to generate scenes with resolutions of 64 x 64 and 256 x 256 pixels.

X-Fork architecture has a single discriminator and a single generator.
The generator hallucinates both the image and its semantic segmentation in the target view.

X-Seq architecture utilizes two CGANs.
The first one generates the target image which is subsequently fed to the second CGAN for generating its corresponding semantic segmentation map.
The feedback from the second CGAN helps the first CGAN generate sharper images.

Challenges pertaining to cross view synthesis task

  1. Aerial images cover wider regions of the ground than the street view images, whereas street view images contain more details about objects (house, road, trees etc.) than aerial images.
  2. The information in aerial images is noisy and also less informative for street view image synthesis.
  3. Transient objects like cars, people etc. are not present at the corresponding locations, since the two views are captured at different times.
  4. Houses that are different in street view look similar from aerial view.
  5. Variation among roads in two views due to perspective and occlusions.
    Road edges are nearly linear and visible in street view, but they are often occluded by dense vegetation and contorted in aerial view.
  6. When using model generated segmentation maps as ground truth to improve the quality of generated images, label noise and model errors may introduce some artifacts in the results.


Learning to generate the segmentation map along with the image does improve the quality of the generated image.

Basic GANs

Mathematical Representation
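
The original figure is not reproduced here. As a stand-in, the standard GAN minimax objective is usually written as follows, where G is the generator, D the discriminator, x a real image and z a noise vector (the distribution names are the conventional ones, not taken from this post):

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

D is trained to assign high scores to real images and low scores to generated ones, while G is trained to make D(G(z)) as large as possible.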

Conditional GANs

Mathematical Representation
  1. Conditional GANs synthesize images conditioned on an auxiliary variable, which may be labels, text embeddings or images.
  2. In conditional GANs, both the discriminator and the generator networks receive the conditioning variable represented by c.
  3. x’ = G(z, c) is the generated image.
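
Putting the three points together, the conditional objective is commonly written as below. This is the usual textbook form, given here as a sketch because the original figure is missing; c is the conditioning variable passed to both networks.

```latex
\min_G \max_D \;
  \mathbb{E}_{x, c}\big[\log D(x, c)\big]
  + \mathbb{E}_{z, c}\big[\log\big(1 - D(G(z, c), c)\big)\big]
```

The only change from an unconditional GAN is that both D and G see c, so the discriminator judges whether an image is realistic for that particular condition.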

A survey on Visual-Based Localization: On the benefit of heterogeneous data


Visual Based Localization (VBL) consists of retrieving the pose (position + orientation) of a visual query material within a known space representation.

Cross View Image Generation using Geometry Guided Conditional GANs



Due to the difference in viewpoints, there is only a small overlapping region (field of view) and little common content between the two views.

Try to preserve the pixel information b/w the views so that the generated image is a realistic representation of cross view input image.

Propose to use homography as a guide to map the images b/w the views based on the common field of view to preserve details in the input image and then use GANs to inpaint the missing regions in the transformed image and add realism to it.

Evaluation & model comparison demonstrate that utilizing geometry constraints adds fine details to the generated images and can be a better approach for cross view image synthesis than purely pixel based synthesis methods.

The view synthesis task is very challenging due to the presence of multiple objects in the scene: the network needs to learn the object relations and occlusions in the scene.

Generating top-view natural scenes from street view is very difficult, mainly because there is very little overlap between the corresponding fields of view.

The approach taken to solve the cross view image synthesis is to exploit the geometric relation b/w the views to guide the synthesis.

Addressing the problem of synthesizing ground level images from overhead imagery and vice versa using CGANs and also when possible, guiding the networks by feeding homography transformed images as inputs to improve the synthesized results.

First compute the homography matrix and then project the aerial images to street view perspective.
This yields an intermediate image that is very close to the target view image, but less realistic and with missing regions.
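
As a rough illustration of the projection step, the sketch below maps a single pixel coordinate through a 3 x 3 homography in homogeneous coordinates. The matrix values are made up for the example; in practice the homography would be estimated from the camera geometry or point correspondences, and a library routine would warp the whole image.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by w to return from homogeneous to pixel coordinates.
    return (xh / w, yh / w)

# Hypothetical homography: scale by 2, shift x by 1, mild perspective term.
H = [
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.5, 1.0],
]
projected = apply_homography(H, (1.0, 2.0))
print(projected)
```

Target pixels that no source pixel maps onto stay empty after the warp, which is where the missing regions come from.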

Different CGANs that work specifically for inpainting and realism tasks to preserve the pixel information from the homography transformed image in a controlled manner have also been used.

Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix)


Investigate conditional adversarial networks as a general purpose solution to image-to-image translation problems.

These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping.

Just as we no longer hand-engineer our mapping functions, this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

If we ask the CNN to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring.
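
A tiny numeric example makes the averaging effect concrete. Assume a single pixel that could plausibly be either black (0.0) or white (1.0); the values and the candidate grid are illustrative only.

```python
def mean_squared_error(prediction, targets):
    """Average squared Euclidean error of one prediction against several targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Two equally plausible values for one pixel, e.g. a black or a white edge.
plausible = [0.0, 1.0]

# Candidate predictions 0.0, 0.1, ..., 1.0.
candidates = [i / 10 for i in range(11)]
best = min(candidates, key=lambda p: mean_squared_error(p, plausible))
print(best, mean_squared_error(best, plausible))
```

The prediction with the lowest error is the grey value halfway between the two plausible answers, which matches neither of them. Applied over a whole image, this is what shows up as blur.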

GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss.
Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

CGANs are suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

Primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results.

Second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices.

Conditional GANs learn a structured loss.
Structured losses penalize the joint configuration of the output.

GANs learn a mapping from a random noise vector z to an output image y, G: z -> y.
In contrast, conditional GANs learn a mapping from an observed image x and a random noise vector z to y, G: {x, z} -> y.

The discriminator’s job remains unchanged, but the generator is tasked not only with fooling the discriminator but also with staying near the ground truth output, measured with L1 distance rather than L2 because L1 encourages less blurring.

Mathematical Representation
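
The original figure is not reproduced here. The Pix2Pix objective, as I understand it from the paper, combines the conditional adversarial loss with an L1 term weighted by a hyperparameter lambda:

```latex
\begin{align}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x, y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x, z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big] \\
G^{*} &= \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
\end{align}
```

Here x is the input image, y the ground truth output and z the noise vector, matching the notation G: {x, z} -> y used above.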

That’s it for the literature review. I hope to write soon about the next step we took.



Talha Hanif Butt

PhD Student -- Signal and Systems Engineering, Halmstad University, Volvo Trucks
