Self-supervised multi-task learning for semantic segmentation of urban scenes
Presentation + Paper, 12 September 2021
Abstract
The task of semantic segmentation plays a vital role in the analysis of remotely sensed imagery. Currently, this task is mainly solved using supervised pre-training, where very Deep Convolutional Neural Networks (DCNNs) are trained on large annotated datasets, mostly to solve a classification problem. Such networks are useful for many visual recognition tasks but depend heavily on the amount and quality of the annotations to learn a mapping function that predicts well on new data. Motivated by the plethora of data generated every day, researchers have developed alternatives such as Self-Supervised Learning (SSL). These methods boost the progress of deep learning without the need for expensive labeling: they derive supervision signals from the data itself and solve a challenge known as a Pretext Task (PTT) to learn robust representations. The learned features are then transferred to solve the so-called Downstream Task (DST), which can be any of a range of computer vision applications such as classification or object detection. The current work explores the design of a DCNN and a training strategy that jointly predict on multiple PTTs in order to learn general visual representations that could lead to more accurate semantic segmentations. The first Pretext Task is Image Colorization (IC), which identifies the objects and related parts present in a grayscale image in order to paint those areas with the right colors. The second is Spatial Context Prediction (SCP), which captures visual similarity across images to discover the spatial configuration of patches extracted from an image. The DCNN architecture is constructed around the particular objective of each Pretext Task. It is subsequently trained, and the acquired knowledge is transferred into an SSL trunk network on top of which a Fully Convolutional Network (FCN) is built. Through fine-tuning, the FCN with the SSL trunk learns a combination of features that ultimately predicts the semantic segmentation. To evaluate the quality of the learned representations, the performance of the trained model is compared with inference results of an FCN-ResNet101 architecture pre-trained on ImageNet, using the F1-score as the quality metric. Experiments show that the method is capable of learning general feature representations that can effectively be employed for semantic segmentation.
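To make the multi-task setup concrete, the following is a minimal PyTorch sketch of how a single shared trunk could feed the two pretext heads described above. The abstract does not specify the trunk depth, head designs, patch configuration, or loss weighting, so the ResNet-50 backbone, the head shapes, the eight relative patch positions, and the weighting factor lam below are all illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of joint pretext training (IC + SCP) on a shared trunk.
# Concrete choices (ResNet-50, head shapes, lam) are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskSSL(nn.Module):
    def __init__(self, num_patch_positions=8):
        super().__init__()
        # Shared trunk: a ResNet backbone without its pooling/classification head.
        resnet = models.resnet50(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])
        # Image Colorization head: regress 2 chroma channels from trunk features,
        # upsampled back to input resolution (trunk output stride is 32).
        self.color_head = nn.Sequential(
            nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 2, 1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )
        # Spatial Context Prediction head: classify the relative position
        # (one of e.g. 8 neighbors) of a patch pair from pooled trunk features.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.context_head = nn.Linear(2 * 2048, num_patch_positions)

    def forward(self, gray_img, patch_a, patch_b):
        # Colorization branch: (N, 1, H, W) grayscale replicated to 3 channels.
        feat = self.trunk(gray_img.repeat(1, 3, 1, 1))
        chroma = self.color_head(feat)
        # Context branch: both patches embedded by the *shared* trunk.
        fa = self.pool(self.trunk(patch_a)).flatten(1)
        fb = self.pool(self.trunk(patch_b)).flatten(1)
        position_logits = self.context_head(torch.cat([fa, fb], dim=1))
        return chroma, position_logits

# Joint pretext loss: regression for colorization, classification for SCP.
# lam is an assumed hyperparameter balancing the two tasks.
def pretext_loss(chroma, chroma_gt, logits, pos_gt, lam=1.0):
    return nn.functional.l1_loss(chroma, chroma_gt) + \
           lam * nn.functional.cross_entropy(logits, pos_gt)

In such a setup, once pretext training converges, the trunk weights would be loaded as the encoder of an FCN segmentation head and fine-tuned on labeled urban scenes, analogous to how torchvision's fcn_resnet101 wraps an ImageNet-pretrained ResNet-101 backbone.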
Conference Presentation
© (2021) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jonathan Gonzalez Santiago, Fabian Schenkel, and Wolfgang Middelmann "Self-supervised multi-task learning for semantic segmentation of urban scenes", Proc. SPIE 11862, Image and Signal Processing for Remote Sensing XXVII, 118620G (12 September 2021); https://doi.org/10.1117/12.2600194
KEYWORDS: Image segmentation, RGB color model, Visualization, Classification systems, Convolution, Remote sensing