Presentation + Paper
23 October 2019 A depth estimation framework based on unsupervised learning and cross-modal translation
Proceedings Volume 11158, Target and Background Signatures V; 1115807 (2019) https://doi.org/10.1117/12.2532666
Event: SPIE Security + Defence, 2019, Strasbourg, France
Abstract
In recent years, with the rapid development of artificial intelligence and autonomous driving technology, scene perception has become increasingly important. Unsupervised deep learning methods have demonstrated a degree of robustness and accuracy in challenging scenes. By inferring depth from a single input image without any ground-truth labels, considerable time and resources can be saved. However, unsupervised depth estimation still lacks robustness and accuracy in complex environments, which can be improved by modifying the network structure and incorporating information from other modalities. In this paper, we propose an unsupervised monocular depth estimation network that achieves high speed and accuracy, together with a learning framework built on this network that improves depth performance by incorporating images translated across modalities. The depth estimator is an encoder-decoder network that generates multi-scale dense depth maps. A sub-pixel convolutional layer replaces the up-sampling branches to obtain depth super-resolution. Cross-modal depth estimation using near-infrared and RGB images outperforms estimation from RGB images alone. During training, both images are translated to the same modality, and super-resolved depth estimation is then carried out for each stereo camera pair. Compared with the baseline results using only RGB images, experiments verify that our depth estimation network with the proposed cross-modal fusion system achieves better performance on public datasets and on a multi-modal dataset collected by our stereo vision sensor.
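The sub-pixel convolution mentioned above upscales a feature map by rearranging channel groups into spatial positions rather than interpolating. As a rough illustration only (the paper's actual network is not reproduced here), the following NumPy sketch shows the channel-to-space rearrangement that such a layer performs after its convolution:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel rearrangement: (C*r^2, H, W) -> (C, H*r, W*r).

    Each group of r*r channels is interleaved into the spatial axes,
    so a low-resolution map is upscaled without interpolation -- the
    idea behind replacing up-sampling branches for super-resolution.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# toy example: 4 channels at 2x2 become 1 channel at 4x4 (scale r=2)
lr = np.arange(16, dtype=np.float32).reshape(4, 2, 2)
hr = pixel_shuffle(lr, 2)
print(hr.shape)  # (1, 4, 4)
```

In a real network this rearrangement follows a convolution that produces the C·r² channels, so the layer learns the upscaling filters end to end instead of using a fixed bilinear kernel.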
Conference Presentation
© (2019) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jiafeng Shen, Kaiwei Wang, Kailun Yang, Kaite Xiang, Lei Fei, Xinxin Hu, Huabing Li, and Hao Chen "A depth estimation framework based on unsupervised learning and cross-modal translation", Proc. SPIE 11158, Target and Background Signatures V, 1115807 (23 October 2019); https://doi.org/10.1117/12.2532666
KEYWORDS
RGB color model, Near infrared, Image restoration, Convolution, Image analysis, Network architectures, Image processing