A Semi-Supervised Approach to Monocular Depth Estimation, Depth Refinement, and Semantic Segmentation of Driving Scenes using a Siamese Triple Decoder Architecture

John Paul Tan Yusiong, Prospero Clara Naval

Abstract

Depth estimation and semantic segmentation are two fundamental tasks in scene understanding. They are usually solved separately, although they are highly correlated and have complementary properties. Solving them jointly is beneficial for real-world applications that require both geometric and semantic information. Within this context, this paper presents a unified learning framework for generating a refined depth map and a semantic segmentation map from a single image. Specifically, it proposes a novel architecture called JDSNet, a Siamese triple-decoder network that simultaneously performs depth estimation, depth refinement, and semantic labeling of a scene by exploiting the interaction between depth and semantic information. JDSNet is trained with a semi-supervised method that learns features for both tasks: the depth estimation task relies on geometry-based image reconstruction rather than ground-truth depth labels, while the semantic segmentation task requires ground-truth semantic labels. The KITTI driving dataset is used to evaluate the effectiveness of the proposed approach. The experimental results show that the approach achieves excellent performance on both tasks, indicating that the model effectively utilizes both geometric and semantic information.
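The semi-supervised objective described in the abstract can be sketched as a weighted sum of an unsupervised photometric reconstruction term (standing in for depth ground truth) and a supervised per-pixel cross-entropy term for segmentation. The sketch below is a simplified NumPy illustration, not the paper's exact loss: the function names, weights, and the plain L1 photometric term are assumptions (the actual formulation would follow the cited Godard et al. appearance-matching loss, which also includes an SSIM component and view warping).

```python
import numpy as np

def l1_photometric(recon, target):
    # Mean absolute difference between the reconstructed (warped) view and
    # the real view; this replaces ground-truth depth supervision.
    return np.mean(np.abs(recon - target))

def cross_entropy(logits, labels, eps=1e-8):
    # Per-pixel softmax cross-entropy for the supervised segmentation branch.
    # logits: (..., num_classes) floats; labels: (...) integer class indices.
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    return -np.mean(np.log(picked + eps))

def joint_loss(recon, target, seg_logits, seg_labels, w_photo=1.0, w_seg=1.0):
    # Semi-supervised objective: image reconstruction supervises the depth
    # branch, while the segmentation branch uses ground-truth labels.
    return (w_photo * l1_photometric(recon, target)
            + w_seg * cross_entropy(seg_logits, seg_labels))
```

In practice, both terms would be computed on network outputs at multiple scales and minimized jointly with a stochastic optimizer such as Adam, as in the cited training setup.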

References


L. Chen, Z. Yang, J. Ma, and Z. Luo (2018) Driving Scene Perception Network: Real-time Joint Detection, Depth Estimation and Semantic Segmentation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision, IEEE, pp. 1283-1291. https://doi.org/10.1109/WACV.2018.00145.

G. Giannone and B. Chidlovskii (2019) Learning Common Representation from RGB and Depth Images, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, IEEE.

R. Cipolla, Y. Gal and A. Kendall (2018) Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 7482-7491. https://doi.org/10.1109/CVPR.2018.00781.

J. Liu, Y. Wang, Y. Li, J. Fu, J. Li, and H. Lu (2018) Collaborative Deconvolutional Neural Networks for Joint Depth Estimation and Semantic Segmentation, IEEE Transactions on Neural Networks and Learning Systems, IEEE, vol. 29, no. 11, pp. 5655-5666. https://doi.org/10.1109/TNNLS.2017.2787781.

D. Sanchez-Escobedo, X. Lin, J. R. Casas, and M. Pardas (2018) Hybridnet for Depth Estimation and Semantic Segmentation, Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 1563-1567. https://doi.org/10.1109/ICASSP.2018.8462433.

P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille (2015) Towards unified depth and semantic prediction from a single image, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 2800-2809. https://doi.org/10.1109/CVPR.2015.7298897.

B. Liu, S. Gould, and D. Koller (2010) Single image depth estimation from predicted semantic labels, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 1253-1260. https://doi.org/10.1109/CVPR.2010.5539823.

L. Ladicky, J. Shi, and M. Pollefeys (2014) Pulling things out of perspective, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 89-96. https://doi.org/10.1109/CVPR.2014.19.

C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016) Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture, Proceedings of the Asian Conference on Computer Vision, Springer, pp. 213-228. https://doi.org/10.1007/978-3-319-54181-5_14.

O. H. Jafari, O. Groth, A. Kirillov, M. Y. Yang, and C. Rother (2017) Analyzing modular CNN architectures for joint depth prediction and semantic segmentation, Proceedings of the 2017 International Conference on Robotics and Automation, IEEE, pp. 4620-4627. https://doi.org/10.1109/ICRA.2017.7989537.

V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen and I. Reid (2019) Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations, Proceedings of the 2019 International Conference on Robotics and Automation, IEEE, pp. 7101-7107. https://doi.org/10.1109/ICRA.2019.8794220.

A. Mousavian, H. Pirsiavash, and J. Kosecka (2016) Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks, Proceedings of the 2016 Fourth International Conference on 3D Vision, IEEE, pp. 611-619. https://doi.org/10.1109/3DV.2016.69.

P. Z. Ramirez, M. Poggi, F. Tosi, S. Mattoccia, and L. Di Stefano (2018) Geometry meets semantic for semi-supervised monocular depth estimation, Proceedings of the 14th Asian Conference on Computer Vision, Springer, pp. 611-619. https://doi.org/10.1007/978-3-030-20893-6_19.

C. Godard, O. M. Aodha and G. J. Brostow (2017) Unsupervised Monocular Depth Estimation with Left-Right Consistency, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 6602-6611. https://doi.org/10.1109/CVPR.2017.699.

J. P. Yusiong and P. Naval (2019) AsiANet: Autoencoders in Autoencoder for Unsupervised Monocular Depth Estimation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision, IEEE, pp. 443-451. https://doi.org/10.1109/WACV.2019.00053.

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing, IEEE, vol. 13, no. 4, pp. 600-612.

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks, Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 2017-2025.

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3213-3223. https://doi.org/10.1109/CVPR.2016.350.

A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 3354-3361. https://doi.org/10.1109/CVPR.2012.6248074.

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al. (2016) Tensorflow: a system for large-scale machine learning, Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, USENIX Association, pp. 265-283.

D. Kingma and J. Ba (2015) Adam: A method for stochastic optimization, Proceedings of the International Conference on Learning Representations.

T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 6612-6619. https://doi.org/10.1109/CVPR.2017.700.

R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 5667-5675. https://doi.org/10.1109/CVPR.2018.00594.

Z. Yin and J. Shi (2018) GeoNet: Unsupervised learning of dense depth, optical flow and camera pose, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 1983-1992. https://doi.org/10.1109/CVPR.2018.00212.




DOI: https://doi.org/10.31449/inf.v44i4.3018

This work is licensed under a Creative Commons Attribution 3.0 License.