Machine learning applied to aerial images – the results from my master’s thesis

The rapid development of deep learning has given rise to new and improved ways of extracting information from images. Still, little work has been done on applying these new techniques specifically to remotely sensed imagery.

As a geoinformatics student choosing a subject for my master’s thesis, I found this a very interesting field to look further into. I therefore decided to investigate how well new machine learning techniques could extract building information from remotely sensed imagery. This post goes through the basic concepts of deep learning and convolutional neural networks, and then presents the results from my thesis.

Machine Learning

In recent years we have heard a lot of talk about machine learning, and the development is extremely fast and impressive. The idea behind these methods is to use computers for what they do best: performing a lot of computations fast. The algorithms allow the computer to learn things you have not explicitly programmed, by detecting patterns in large amounts of data. Instead of writing code, you feed data to a generic algorithm, and it builds its own logic based on that data.

Neural networks are used for several types of machine learning. They consist of an input layer, an output layer, and depending on the model, a number of hidden layers in-between.


Neural networks are modelled as collections of nodes connected in a directed acyclic graph.

Each connection between nodes has a weight, W, that represents how “important” the particular connection is. Each node takes the weighted sum of the input and processes it through an activation function. The output of the activation function gives the output of the node.
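To make the description of a single node concrete, here is a minimal sketch in plain Python. The sigmoid activation, the optional bias term and all the numbers are assumptions chosen for illustration; they are not taken from the thesis model.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs, passed through the activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Illustrative values only: three inputs and one weight per connection.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]
print(node_output(inputs, weights))
```

A connection with a large positive or negative weight moves the weighted sum a lot, which is what it means for that connection to be “important”.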

Convolutional neural networks – CNN

CNNs are similar to ordinary neural networks, as they are made up of neurons that have learnable weights and biases. However, CNNs make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. This is a key reason why CNNs are fast despite their depth: it allows for a vast reduction in the number of parameters in the network.
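A rough calculation shows how large the reduction in parameters can be. The sketch below compares a fully connected layer with a convolution layer. The image size (256 x 256 RGB), the 64 outputs and the 3 x 3 filters are hypothetical numbers picked for the comparison, not the sizes used in the thesis.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus one bias per output node in a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv_params(filter_h: int, filter_w: int, in_channels: int, n_filters: int) -> int:
    """Weights plus one bias per filter in a convolution layer."""
    return filter_h * filter_w * in_channels * n_filters + n_filters


# Hypothetical input: a 256 x 256 image with three colour channels.
n_inputs = 256 * 256 * 3

dense = dense_params(n_inputs, 64)  # every pixel value connected to 64 nodes
conv = conv_params(3, 3, 3, 64)     # 64 filters of size 3 x 3 shared across the image
print(dense, conv)
```

The convolution layer needs far fewer parameters because the same small filter is reused at every position in the image.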

Convolution layer

The core building block of a CNN is the convolution layer. The layer takes an image and a filter as input.

Images are represented as matrices with pixel values, often three dimensional with different channels for different colours (e.g. RGB-image with red, green and blue channels). A filter is a small matrix that has the capability of recognizing specific features in an image. What features it recognizes is dependent on the values in the filter. A typical feature is a line, curve or a colour blob.

The filter is convolved across the input image, meaning you slide it across the image and compute a dot product at each position. The results are saved into a new matrix called an activation map. The activation map becomes a new representation of the image, containing information about where in the image the feature exists.
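The sliding dot product can be sketched in a few lines of plain Python. This version uses no padding and a stride of one, and, like most deep learning libraries, it does not flip the filter. The 5 x 5 image and the vertical-edge filter are made-up examples.

```python
def convolve2d(image, kernel):
    """Slide the filter across the image (no padding, stride 1) and store each dot product."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for top in range(img_h - k_h + 1):
        row = []
        for left in range(img_w - k_w + 1):
            # Dot product between the filter and the image patch it currently covers.
            dot = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k_h)
                for j in range(k_w)
            )
            row.append(dot)
        out.append(row)
    return out


# Greyscale image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# A filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

activation_map = convolve2d(image, vertical_edge)
for row in activation_map:
    print(row)
```

The activation map is zero where the filter only sees the flat dark area and high where the filter overlaps the edge.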


Instead of performing this process for just one filter, you repeat it with different filters – all producing different activation maps.


But you don’t stop there – you apply new filters to the activation maps, allowing the network to recognize not just the simple features (e.g. lines, curves, colour blobs) but a combination of them. An example of the features recognized in a network, and how they become more and more complex the further into the network you get, can be seen in the image below.

The further into the network, the more complex features the filters can recognize.

Pooling layer

Another important layer type is the pooling layer, which is used to lower the number of parameters in the network by downscaling the image. The most used pooling layer is max pooling, which extracts the highest pixel value from each subpart of the image and thereby downscales it.
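As a small illustration, max pooling over non-overlapping 2 x 2 blocks can be written as below. The block size and the input values are assumptions for the example.

```python
def max_pool(image, size=2):
    """Keep the highest value in each non-overlapping size x size block."""
    pooled = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            block = [image[top + i][left + j] for i in range(size) for j in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled


# Made-up 4 x 4 activation map.
activation_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool(activation_map)
for row in pooled:
    print(row)
```

Each side of the map is halved, so the layers that follow have a quarter as many values to process.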



Since the network downscales the images through pooling layers, you need a way to upscale them again to produce output of the desired size. This is best achieved with a transposed convolution, where the image is padded with borders of zeros and the network learns the filters that upscale the image to the desired size in the best way.
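The upscaling step can be sketched as follows. Instead of padding with zeros explicitly, this sketch uses the scatter formulation of a transposed convolution: every input value stamps a scaled copy of the filter into the larger output. The stride of 2 and the fixed filter of ones are assumptions for illustration; in a real network the filter values are learned.

```python
def transposed_conv2d(image, kernel, stride=2):
    """Upscale by stamping a scaled copy of the filter into the output for every input value."""
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(in_h):
        for c in range(in_w):
            for i in range(k_h):
                for j in range(k_w):
                    # Overlapping stamps are summed.
                    out[r * stride + i][c * stride + j] += image[r][c] * kernel[i][j]
    return out


small = [
    [1.0, 2.0],
    [3.0, 4.0],
]
# Fixed here for illustration; a trained network would learn these values.
kernel = [
    [1.0, 1.0],
    [1.0, 1.0],
]

upscaled = transposed_conv2d(small, kernel)
for row in upscaled:
    print(row)
```

With this particular filter the result is a plain 2x enlargement. Other filter values give smoother or sharper upscaling, which is why it pays to let the network learn them.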


Training the model

What does it actually mean to train a CNN? As explained, the network’s convolution layers contain filters that can recognize features. The goal is for these filters to recognize the features that exist in the objects you want to detect. You therefore feed the network a lot of example images of what you want it to learn, and the network adjusts the values in the filters. By testing different values, the network eventually figures out which filters activate (give high values in the activation map) only on the desired objects. If you, for example, train a network to recognize buildings, the filters should activate only when it is given an image of a building.
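In practice, neural networks usually adjust their values with gradient descent: measure how wrong the output is, then nudge each value in the direction that reduces the error. The sketch below does this for a single weight, which stands in for one filter value. The data, the learning rate and the number of steps are made up, and the post does not say which optimizer was used in the thesis.

```python
def train_weight(xs, ys, learning_rate=0.05, steps=100):
    """Fit a single weight w so that w * x approximates y, using gradient descent."""
    w = 0.0  # arbitrary starting value
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Nudge w in the direction that lowers the error.
        w -= learning_rate * grad
    return w


# Made-up training examples where the correct answer is y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

learned = train_weight(xs, ys)
print(learned)
```

A real CNN does the same thing for all of its filter values at once, using labelled example images instead of number pairs.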


Instead of only determining what an image depicts, you can also determine where in the image different objects are. This is achieved by performing a segmentation. A segmentation is a per-pixel classification, meaning that each pixel is assigned a class value. In this case, the classes are “building” and “not building”.


When performing a segmentation you extract small patches of the image and run each patch through the CNN. Based on the classification given to the patch, the centre pixel gets assigned a class. This continues until each pixel in the image is assigned a class.
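The patch-by-patch procedure can be sketched like this. The function `bright_patch_classifier` is a hypothetical stand-in for the trained CNN; it simply calls a patch “building” when it is mostly bright. The 3 x 3 patch size and the zero padding at the image border are also assumptions.

```python
def segment(image, classify_patch, patch_size=3):
    """Assign a class to every pixel by classifying the patch centred on it."""
    half = patch_size // 2
    h, w = len(image), len(image[0])
    # Zero-pad the image so that border pixels also get a full patch.
    padded = [[0] * (w + 2 * half) for _ in range(h + 2 * half)]
    for r in range(h):
        for c in range(w):
            padded[r + half][c + half] = image[r][c]
    labels = []
    for r in range(h):
        row = []
        for c in range(w):
            patch = [p_row[c:c + patch_size] for p_row in padded[r:r + patch_size]]
            row.append(classify_patch(patch))  # class of the centre pixel
        labels.append(row)
    return labels


def bright_patch_classifier(patch):
    """Hypothetical stand-in for the CNN: 1 (building) if the patch is mostly bright, else 0."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0


# Toy image with one bright 3 x 3 "building" in the middle.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

labels = segment(image, bright_patch_classifier)
for row in labels:
    print(row)
```

The output has the same size as the input, with one class value per pixel.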

Source: Li, F.-f., Karpathy, A., and Johnson, J. (2016). Lecture 13 : Segmentation and Attention.

The model’s architecture

The implemented architecture is based on the SegNet architecture and can be seen in the figure below.


Constructed dataset

The dataset for this project was constructed from FKB building data and aerial images of Norway. An automatic mapping between the two was performed, giving aerial input images with segmented label images for the corresponding area.

In addition to the dataset with RGB images, a dataset with infrared images was constructed to test whether it would give better performance.


The results show that there is a slight performance improvement when the infrared images are used, compared to the RGB images.

The images below show examples of the results when the model is trained and tested using infrared images. The yellow-green pixels are the ones classified as buildings, and we can observe that they match the actual position of the buildings well. The problem is that the correct shape of the buildings is hard to detect, since the predicted edges are usually rounded.


The results indicate that you can make good estimates of how much of an area is covered by buildings, which in turn can be used for change detection. Another possibility is to detect where buildings exist, which could help the process of creating digital maps and make it possible to detect new and unregistered buildings.
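Both uses are simple to express once you have a segmentation mask. The sketch below computes the share of pixels classified as building, and flags pixels that are predicted as building but missing from a registered mask. The two 4 x 4 masks are made up, and the names `registered_mask` and `predicted_mask` are hypothetical.

```python
def building_coverage(mask):
    """Fraction of pixels classified as building (1) in a segmentation mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def unregistered_pixels(registered_mask, predicted_mask):
    """Pixels predicted as building (1) that are not building in the registered mask."""
    return [
        [int(pred == 1 and reg == 0) for reg, pred in zip(reg_row, pred_row)]
        for reg_row, pred_row in zip(registered_mask, predicted_mask)
    ]


# Made-up masks: what the map says, and what the model predicts from a newer image.
registered_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted_mask = [
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(building_coverage(registered_mask), building_coverage(predicted_mask))
candidates = unregistered_pixels(registered_mask, predicted_mask)
for row in candidates:
    print(row)
```

A rise in coverage between two dates hints that something has changed, and the flagged pixels show where to look.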


The model for this project was implemented in TensorFlow, and the code is available on GitHub.

