Self-supervised multi-task learning for semantic segmentation of urban scenes
Presentation + Paper, 12 September 2021
Abstract
The task of semantic segmentation plays a vital role in the analysis of remotely sensed imagery. Currently, this task is mainly solved using supervised pre-training, where very Deep Convolutional Neural Networks (DCNNs) are trained on large annotated datasets, mostly to solve a classification problem. Such networks are useful for many visual recognition tasks but depend heavily on the amount and quality of the annotations to learn a mapping function that predicts well on new data. Motivated by the plethora of data generated every day, researchers have developed alternatives such as Self-Supervised Learning (SSL). These methods boost the progress of deep learning without the need for expensive labeling: they derive supervision signals from the data itself and solve a challenge known as a Pretext Task (PTT) to learn robust representations. The learned features are then transferred to solve the so-called Downstream Task (DST), which can be any of a range of computer vision applications such as classification or object detection. The current work explores the design of a DCNN and a training strategy that jointly predict on multiple PTTs in order to learn general visual representations that could lead to more accurate semantic segmentations. The first Pretext Task is Image Colorization (IC), which identifies the objects and related parts present in a grayscale image in order to paint those areas with the right colors. The second is Spatial Context Prediction (SCP), which captures visual similarity across images to discover the spatial configuration of patches extracted from an image. The DCNN architecture is constructed around the particular objective of each Pretext Task. It is subsequently trained, and the acquired knowledge is transferred into an SSL trunk network on top of which a Fully Convolutional Network (FCN) is built. Through fine-tuning, the FCN with the SSL trunk learns a combination of features that ultimately predicts the semantic segmentation. To evaluate the quality of the learned representations, the performance of the trained model is compared with inference results of an FCN-ResNet101 architecture pre-trained on ImageNet, using the F1-score as the quality metric. Experiments show that the method is capable of learning general feature representations that can effectively be employed for semantic segmentation.
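To make the multi-task setup concrete, the following is a minimal PyTorch sketch of how a single shared trunk could feed the two pretext heads described above. The abstract does not specify the trunk depth, head designs, patch configuration, or loss weighting, so the ResNet-50 backbone, the head shapes, the eight relative patch positions, and the weighting factor lam below are all illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of joint pretext training (IC + SCP) on a shared trunk.
# Concrete choices (ResNet-50, head shapes, lam) are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskSSL(nn.Module):
    def __init__(self, num_patch_positions=8):
        super().__init__()
        # Shared trunk: a ResNet backbone without its pooling/classification head.
        resnet = models.resnet50(weights=None)
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])
        # Image Colorization head: regress 2 chroma channels from trunk features,
        # upsampled back to input resolution (trunk output stride is 32).
        self.color_head = nn.Sequential(
            nn.Conv2d(2048, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 2, 1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )
        # Spatial Context Prediction head: classify the relative position
        # (one of e.g. 8 neighbors) of a patch pair from pooled trunk features.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.context_head = nn.Linear(2 * 2048, num_patch_positions)

    def forward(self, gray_img, patch_a, patch_b):
        # Colorization branch: (N, 1, H, W) grayscale replicated to 3 channels.
        feat = self.trunk(gray_img.repeat(1, 3, 1, 1))
        chroma = self.color_head(feat)
        # Context branch: both patches embedded by the *shared* trunk.
        fa = self.pool(self.trunk(patch_a)).flatten(1)
        fb = self.pool(self.trunk(patch_b)).flatten(1)
        position_logits = self.context_head(torch.cat([fa, fb], dim=1))
        return chroma, position_logits

# Joint pretext loss: regression for colorization, classification for SCP.
# lam is an assumed hyperparameter balancing the two tasks.
def pretext_loss(chroma, chroma_gt, logits, pos_gt, lam=1.0):
    return nn.functional.l1_loss(chroma, chroma_gt) + \
           lam * nn.functional.cross_entropy(logits, pos_gt)

In such a setup, once pretext training converges, the trunk weights would be loaded as the encoder of an FCN segmentation head and fine-tuned on labeled urban scenes, analogous to how torchvision's fcn_resnet101 wraps an ImageNet-pretrained ResNet-101 backbone.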
Conference Presentation
© (2021) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jonathan Gonzalez Santiago, Fabian Schenkel, and Wolfgang Middelmann "Self-supervised multi-task learning for semantic segmentation of urban scenes", Proc. SPIE 11862, Image and Signal Processing for Remote Sensing XXVII, 118620G (12 September 2021); https://doi.org/10.1117/12.2600194
KEYWORDS: Image segmentation, RGB color model, Visualization, Classification systems, Convolution, Remote sensing