Deep Learning in Visual SLAM Survey
ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) has attracted considerable attention
in recent years. This task involves using visual sensors to localize a robot while simultaneously con-
structing an internal representation of its environment. Traditional VSLAM methods involve the laborious
hand-crafted design of visual features and complex geometric models. As a result, they are generally
limited to simple environments with easily identifiable textures. Recent years, however, have witnessed the
development of deep learning techniques for VSLAM. This is primarily due to their capability of modeling
complex features of the environment in a completely data-driven manner. In this paper, we present a survey
of relevant deep learning-based VSLAM methods and suggest a new taxonomy for the subject. We also
discuss some of the current challenges and possible directions for this field of study.
INDEX TERMS Visual SLAM, deep learning, joint learning, active learning, survey.
in illumination, etc.). Further, when VSLAM is used to accomplish a higher level task such as search and rescue, autonomous driving, or finding a fire extinguisher to put out a fire in a burning structure, the agent in these applications requires semantic knowledge about the environment. Thus, the geometric information embedded in SIFT and ORB features is not always sufficient.

An alternative approach is the use of deep learning techniques, which have been shown to automatically learn complex features from visual inputs. This characteristic has been leveraged to develop highly accurate image classification and object recognition models, which have been successfully applied to VSLAM for camera pose estimation [20], place recognition [21], loop closure detection [22], and semantic mapping [23], among other tasks.

To the best of our knowledge, existing VSLAM surveys are limited to methods that learn specific pipeline components in isolation from one another. Recent techniques based on jointly optimising the various VSLAM components have been shown to achieve better performance. In this survey, we provide a comprehensive review of VSLAM methods including these recent techniques.

Our main contributions are listed below:
• We propose a novel taxonomy for deep learning techniques applied to VSLAM;
• We present a comprehensive review of the most important deep learning methods applied to VSLAM;
• We explore deep learning-based VSLAM from a holistic standpoint rather than focusing on individual VSLAM components;
• We discuss the strengths and weaknesses of the different deep learning-based approaches to VSLAM;
• We discuss some current challenges and future directions for deep learning-based VSLAM.

The remainder of the paper is organised as follows. In section II, we review existing surveys on VSLAM. We then briefly present the general framework of a VSLAM system in section III. Section IV describes the proposed taxonomy. Sections V, VI, VII and VIII discuss existing works on deep learning based VSLAM. Open problems and suggestions for future research directions are presented in section IX. Finally, conclusions are drawn in section X.

II. RELATED WORK
Many surveys and tutorials deal with different aspects of SLAM without focusing on VSLAM. For instance, the probabilistic SLAM formulation is well explained in [1], [24], and [2]. The authors of [25] give a thorough discussion of SLAM development and briefly illustrate the application of deep learning to SLAM. In [26], the deep learning approaches used in SLAM are reviewed in detail.

A few surveys focus on VSLAM. Model-based approaches to VSLAM are surveyed in [27] and [28]. The authors of [29] provide an overview of deep learning methods for VSLAM, focusing on visual odometry and loop closure detection. Other survey papers cover individual aspects of VSLAM such as depth estimation [30], [31], optical flow estimation [32], [33], visual odometry estimation [34], or loop closure detection [22].

Table 1 summarizes existing surveys on VSLAM. A quick analysis of the table reveals the need for a survey to encompass the recent advancements in deep learning based approaches for VSLAM.

III. VSLAM GENERAL FRAMEWORK
Classical VSLAM techniques can be divided into two categories: feature-based approaches [15], [16], [35] and dense approaches [36], [37], [38]. In the feature-based approach, input frames are first preprocessed to extract salient, robust, and transformation-invariant keypoints. On the other hand, in dense methods, frames are directly processed. In what follows, we will describe a unified architecture that is applicable to both categories, see Fig. 1; in the case of dense approaches, the feature extractor is the identity function.

The visual odometry module [39], [40] first examines the features, and then uses feature matching and outlier rejection techniques to find reliable pixel-pair correspondences in consecutive frames. These correspondences are further exploited to estimate the optical flow. Simultaneously, the visual odometry module estimates the depth of various points. Finally, by combining the depth and optical flow estimations, the visual odometry module generates a relative pose estimate.

The local mapping module creates a local representation of the agent's surroundings by projecting the scene elements onto an internal local coordinate frame, annotated with their corresponding estimated depth. Afterwards, local maps are fused into a global map with the help of relative pose estimation.

Each local map that has been added to the global map is then fed to the local optimizer, which tries to minimize short-term error accumulation that may come from inaccurate measurements and/or estimations. Basically, it iteratively refines map and pose estimates to enforce alignment between consecutive local maps, resulting in a more consistent local representation of the environment.

Generally, since the local optimizer is based on the relative alignment of consecutive local maps, errors in one map may cause successive maps to align in the wrong direction. While these errors may seem insignificant in the short term, they continuously accumulate over time and become significant in the long run, thus negatively impacting the maps' and poses' global consistency.

The global optimization module solves this issue by relying on loop events detected by the loop closure detection module. At each time step, this module tries to check if the current scene has been previously visited by comparing the current frame with a stored database of previously visited places. In the case of a loop detection event, the current frame is projected back onto the previously constructed map. The global optimizer then corrects for the disparity between the
current frame and its earlier representation in the internal map, yielding globally consistent maps and poses.
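To make the data flow between these modules concrete, the following minimal Python sketch shows one way the unified pipeline of Fig. 1 can be organised. All module names (feature_extractor, vo, local_mapper, and so on) are hypothetical placeholders introduced here for illustration; this is a structural sketch of the architecture described above, not an implementation of any specific system.

```python
# Illustrative VSLAM loop following the unified architecture of Fig. 1.
# All module callables are hypothetical placeholders supplied by the caller.

def vslam_loop(frames, feature_extractor, vo, local_mapper,
               local_optimizer, loop_detector, global_optimizer):
    global_map, poses = [], []
    prev_features = None
    for frame in frames:
        features = feature_extractor(frame)        # identity function for dense methods
        if prev_features is not None:
            # visual odometry: combines internal depth and optical flow estimates
            depth, rel_pose = vo(prev_features, features)
            local_map = local_mapper(features, depth, rel_pose)      # local mapping
            global_map, poses = local_optimizer(global_map, poses,
                                                local_map, rel_pose)  # short-term refinement
            loop = loop_detector(frame, global_map)                   # place recognition
            if loop is not None:
                # correct long-term drift when a previously visited place is recognised
                global_map, poses = global_optimizer(global_map, poses, loop)
        prev_features = features
    return global_map, poses
```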
IV. DEEP VISUAL SLAM TAXONOMY
The proposed taxonomy (Fig. 2) divides deep learning techniques for VSLAM into four categories: modular learning, joint learning, confidence learning, and active learning.

Modular learning encapsulates methods that learn individual modules within the classical VSLAM pipeline in isolation from the others. Despite the high performance of individually learned VSLAM modules, their efficient fusion within the system requires additional effort, and, usually, the overall VSLAM performance may not be as high as that of the individual modules.

Joint learning methods, on the other hand, aim to mitigate the above-mentioned issue by either jointly optimizing various modules at once or by learning the full VSLAM pipeline. Generally, these methods exhibit better overall performance, but require significant effort in the training phase.

Confidence learning techniques are another alternative. They account for the uncertainty of the whole process either by estimating it or by reducing it. Hence, by leveraging probabilistic reasoning capabilities, VSLAM can be made more resilient to the different hazards it may encounter.
features from the inference process. In some circumstances, however, when parts of the scene are ambiguous or under hard environmental conditions, for example, more than one local feature may be required to ensure accurate predictions.

To tackle the first issue, scale-invariant loss functions have been proposed. For instance, [47], [48] used the relative distance between pixel pairs as a supervisory signal, hence overcoming the scale ambiguity that is generally produced when relying solely on ground truth global depth maps. Furthermore, the authors of [47] proposed to divide the problem into two learning phases. In the first one, they learn a coarse depth map, using a sequence of 2D convolutions followed by fully connected layers. The objective of the coarse predictions is to exploit global cues to infer the global scale of the scene. In the second phase, the coarse predictions are refined using a Fine network composed only of 2D convolutions that takes the input image in addition to the coarse depth map and outputs refined per-pixel depths. We argue that one important implementation detail that contributes to the success of this architecture relates to the field of view allowed for each output unit of the two networks. In the Coarse network, each output pixel can see the whole input frame thanks to the last fully connected layer, while in the Fine network, each output unit can only examine a local patch of the input image.
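As a concrete illustration of the scale-invariant supervision mentioned above, the sketch below implements a scale-invariant log-depth loss in the spirit of [47], in which a global rescaling of the prediction leaves the loss unchanged. This is a minimal PyTorch sketch written for this survey, not the authors' original code; the weighting factor lam is an illustrative choice.

```python
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """Scale-invariant log-depth loss in the spirit of [47].

    A global scaling of pred_depth only shifts the log differences by a constant,
    which the second term cancels, so the loss penalises relative (pixel-pair)
    errors rather than the absolute scale of the prediction.
    """
    valid = gt_depth > 0                                   # ignore missing ground truth
    d = torch.log(pred_depth[valid] + eps) - torch.log(gt_depth[valid] + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)

# Example usage with random tensors standing in for network output and ground truth.
pred = torch.rand(2, 1, 64, 64) * 10 + 0.1
gt = torch.rand(2, 1, 64, 64) * 10 + 0.1
print(scale_invariant_loss(pred, gt))
```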
For the second problem, some researchers developed new architectures that consider features at different scales. In this regard, the authors of [49] improved their previous ''Coarse to Fine'' network [47] to progressively refine coarse predictions to higher resolutions. Their new architecture is based on a three-stage refinement process. In a first step, they predict a coarse depth map, using an architecture similar to their previous coarse network. Their refinement procedure is however slightly different. First, instead of using the coarse depth output map, they pass the multi-channel feature map of the coarse network, providing their model with more contextual information that can be further exploited by the refining network. Second, the refining process is done in a progressive manner (two steps in their experiments), where at each scale they incorporate a more detailed but narrower view of the image, using convolutions with finer strides at higher resolution. The same architecture was also tested on other tasks such as surface normal estimation and semantic labeling and achieved high performance on those tasks as well. This may indicate that the progressive refinement used by this architecture can serve as an important building block of future research on deep visual scene understanding.

However, in the above architectures, the depth estimation errors are weighted the same regardless of the object's proximity to the camera. In [50], the authors devised a spacing-increasing discretization (SID) strategy that allows for larger estimation errors when predicting deeper depths and recasts depth estimation as an ordinal regression problem. This is motivated by the fact that, in real life, humans find it easier to estimate the depth of closer objects than faraway ones.

Another difficulty is that most of the proposed solutions are unable to generalize to various types of cameras. To solve this issue, [51] adds a camera model to the depth prediction network, which allows it to estimate appropriate camera-related projections and transformations. This makes the network able to predict the depth seen by different kinds of cameras as it explicitly includes entries for their specific parameters.

2) SELF SUPERVISED LEARNING
The process of manually annotating each input image to create suitable datasets for supervised learning is time-consuming and error-prone. Instead, researchers concentrated their efforts on the automatic extraction of auxiliary supervisory signals from the raw data itself.

A common approach consists of using video inter-frame geometric constraints. In this setting, each pixel in the subsequent frame is projected back onto the preceding one, and pixel differences between the real and reconstructed frames are used as a supervisory signal. However, standard deep learning architectures cannot directly learn this projection operation since it requires knowledge of depth maps and inter-frame transformations as well.

In this context, [52] used a multitask architecture that seeks to predict both depth maps and relative poses between frame sequences, then aggregates those predictions to train the model based on novel view synthesis. The proposed framework also estimates an explainability mask which tries to down-weight regions where the model believes it may fail, hence neglecting their associated errors in the training loss function. In [53], the authors demonstrated that strategies relying on pose network estimation do not properly solve the scale ambiguity problem as they overlook geometric correlations between depth and poses. They proposed a differentiable implementation of the commonly used direct visual odometry algorithm to estimate camera poses instead. This leads to better performance as relative poses are more accurately computed.

It is well established that stereo-based approaches give better results compared to their monocular counterparts. Inspired by this consideration, many researchers tried to exploit known relative poses between left and right stereo pairs to train their models on stereo cameras first, while performing testing on monocular cameras. A simple approach [54] consists of estimating disparity maps, which are then used to synthesize a warped image from the original (right) image. The difference between the warped and original (left) images is then used as a supervisory signal to train the network. The authors also introduced a smoothness loss to guide the training toward continuous disparity estimation. The authors of [55] then improved left-right disparity consistency by extending this design to include both left and right reconstruction losses.
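The stereo self-supervision of [54], [55] can be summarised in a few lines: a predicted disparity is used to warp the right image towards the left one, and the photometric difference drives training. The following is a minimal PyTorch sketch of that idea; the bilinear warp via grid_sample and the plain L1 photometric term are our illustrative choices, not the exact losses of those papers.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Synthesise the left view by sampling the right image at x - d(x, y).

    img_right: (B, 3, H, W) right image
    disp_left: (B, 1, H, W) disparity predicted for the left view, in pixels
    """
    b, _, h, w = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().to(img_right.device).expand(b, h, w)
    ys = ys.float().to(img_right.device).expand(b, h, w)
    x_src = xs - disp_left.squeeze(1)                 # horizontal shift by the disparity
    # normalise coordinates to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * x_src / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(img_right, grid, align_corners=True)

def stereo_photometric_loss(img_left, img_right, disp_left):
    # L1 photometric difference between the real and reconstructed left images.
    return (img_left - warp_right_to_left(img_right, disp_left)).abs().mean()
```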
at different pyramid scales, then minimizing flow residuals between each predicted flow and the warped flow from the previous pyramid level. Generally, the residual motion between successive pyramid levels is small, and it can be approximated with a small network. As a result, SPyNet is simpler and 96% smaller than FlowNet. Nevertheless, despite its computational efficiency, it does not perform as well as FlowNet2.

One reason could be the use of pyramidal processing directly on the raw image frames. Most state-of-the-art classical approaches first extract lighting-invariant features in a preprocessing stage before estimating optical flow with more discriminative representations such as a cost volume. Inspired by this observation, [64] learns to extract feature pyramids from the raw images and then leverages optical flow upsampled from subsequent pyramid layers to warp the feature representations of two consecutive frames against each other at each pyramid level. Finally, a cost volume between the feature pyramids is computed, which is then utilized to estimate the actual optical flow. And, since the warping operation and cost-volume construction do not require any learnable parameters, the result is a lighter network that is 17 times smaller and performs better than FlowNet2.

Nevertheless, estimating optical flow at low resolution first and then refining at higher resolutions presents some difficulties. For example, errors at coarse resolution are hardly recovered, and small fast-moving objects are often missed. In RAFT [65], the authors maintain and update a single high-resolution optical flow by building multi-scale 4D correlation volumes of all pairs of pixels to capture frame similarity. They then use recurrent units to iteratively update optical flow estimates, thus achieving the best performance on all public optical flow benchmarks.

2) SELF SUPERVISED LEARNING
A typical method for learning optical flow in an unsupervised manner is by using warping operations on input frames [66]. In this context, the optical flow between two frames can be estimated and then used to synthesize one frame from the other. If the estimate is accurate, the result should be a synthetic image that is identical to the original. The model can then be trained on the observed discrepancies by comparing pixel brightness differences between the two images and imposing some regularization, such as smoothness, on the flow estimate, hence indirectly producing accurate optical flows.
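The warping-based objective just described can be written compactly: warp the second frame back with the estimated flow, compare brightness, and regularise the flow. Below is a minimal PyTorch sketch of such an unsupervised flow loss; the specific weighting and the first-order smoothness term are illustrative choices, not those of any particular paper.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img2, flow):
    """Backward-warp frame 2 towards frame 1 using a dense flow field.

    img2: (B, 3, H, W); flow: (B, 2, H, W) in pixels (flow[:, 0] = dx, flow[:, 1] = dy).
    """
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(img2.device).expand(b, 2, h, w)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)

def unsupervised_flow_loss(img1, img2, flow, smooth_weight=0.1):
    # brightness-constancy (photometric) term between real and reconstructed frame 1
    photometric = (img1 - warp_with_flow(img2, flow)).abs().mean()
    # first-order smoothness regularizer: penalise large spatial flow gradients
    smooth = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean() + \
             (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photometric + smooth_weight * smooth
```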
Based on the brightness constancy assumption, [67] leveraged the prominent accuracy of classical approaches to train a CNN model for estimating optical flow. It uses proxy ground truth optical flow, derived by traditional methods, as a supervisory signal. In addition, an image reconstruction loss is introduced to the learning scheme to prevent the model from learning the classical approaches' failure situations.

Yet, relying solely on photometric loss can be problematic in areas with important illumination variations, repetitive textures, and occlusions. Some researchers attempted to solve the occlusion problem by learning occlusion masks and training their models to reconstruct only the non-occluded portions of consecutive frames [68], [69]. This may, however, simply remove outliers from the training sample, leaving the optical flow near occlusion regions uncertain.

To overcome this challenge, [70] introduced global epipolar constraints to the training process. The presented method differs from classical photometric matching approaches by considering the two frames as belonging to different views of the same scene, and since each pixel of one view is related to a pixel of the other view by the same fundamental matrix, optical flow estimation can be reduced to estimating this fundamental matrix. Then, the optical flow is iteratively refined to comply with this estimate.

Finally, to address difficulties with illumination variations, [71] first adjusts the input frames to similar illumination conditions before estimating optical flow. More specifically, it leverages the structure-from-motion technique to extract the relative camera pose between successive frames and uses it as a supervisory signal for learning light-invariant optical flow. In recent work, some researchers have tried to estimate optical flow under hard illumination conditions such as low-light environments [72] or foggy scenes [73].

C. VISUAL ODOMETRY
Visual odometry deals with the task of estimating the relative motion of the camera by analyzing pixel displacements in consecutive frames. In other words, the vision system tracks visual features to estimate the vehicle's relative rotation and translation between two time steps. The global trajectory can then be recovered by integrating the relative movements along the travel period.

The main challenge here lies in the fact that we want to estimate 3D motion from 2D images. 2D images generally represent 2D projections of the 3D real world, captured at various time intervals. Hence, the dimension corresponding to the depth information is lacking in the projected images. The classical pipeline for visual odometry estimation therefore tries to recover this missing dimension. To this effect, it first extracts persistent visual features from the consecutive frames (i.e., features that can be re-extracted from different viewpoints, under different illuminations, scales, and various geometric transformations of the scene that may result from the vehicle motion). Then, it exploits the one-to-one feature correspondence between successive image pairs provided by the optical flow computation to estimate the best transformation (in the 3D world) that explains the planar displacement of pixels in the two consecutive frames (in the 2D image plane). In other words, it uses various optimization techniques to estimate how the camera should move to observe the induced feature displacements between frames.

Compared to traditional approaches, deep learning based visual odometry does not require feature extraction, feature matching, or complex geometric operations, leading to more direct solutions.

1) SUPERVISED LEARNING
Early applications of deep learning to computer vision were limited to object recognition and classification. These objects had to be recognized regardless of their position and orientation within an image. Hence, convolutional neural networks were found appropriate for this task as they inherently extract translation and rotation invariant features from a given image. Learning relative motion, however, necessitates extracting temporal features in addition to spatial ones and keeping track of the geometrical transformations that occur between frames. As a result, learning visual odometry requires a different architectural design.

First attempts to learn visual odometry focused on learning specific parts of the classical pipeline. For instance, [75], [76] learn to extract keypoints, [77], [78] learn keypoint descriptors, and [79] learns to match visual features. However, those approaches proved inefficient for the full odometry estimation problem as they need to be embedded in a more general pipeline with several modules. Over time, errors within each module accumulate, leading to a degradation in the performance of the odometry system. One of the first works that directly predicts relative motion from consecutive image frames was [80]. It formulates visual odometry as a classification problem. The proposed architecture takes as input five stacked consecutive left and right frames, processes each individual frame independently in the early layers of a CNN network, then aggregates the learned representations in the last layer to learn discrete changes in direction and velocity on the stacked inputs. This can be viewed as trying to learn, for each individual frame, spatial features that are relevant for inferring temporal dependencies between frames. However, this method suffers from high error accumulation due to the discretization process. Moreover, the extracted temporal features rely on a high level of abstraction of the initial frames, since they are learned only in the last layers of the neural network, and some of the inter-frame dependencies that are easily accessible at low levels of abstraction are ignored by the proposed model. Alternatively, the authors of [81] proposed two different architectures for learning spatio-temporal features from consecutive frames. Their first network, ''Early Fusion'', tries to extract the main temporal features from the first layer, using a CNN network where the temporal dimension is collapsed in the first layer and subsequent layers just build upon those main features to learn interactions between
them. Their second network, ''Slow Fusion'', slowly reduces the temporal dimension of the input frames until it completely vanishes. It processes the input frames using a sequence of 3D convolutions and max pooling operators which reduce the temporal dimension by two at each step. Then, when the temporal dimension is completely consumed, standard 2D convolutions are used to extract more complex features. Hence, the progressive reduction of the temporal dimension can be seen as learning, at each step, temporal features at varying levels of abstraction. The two architectures were tested on the KITTI odometry benchmark and the authors showed that the ''Slow Fusion'' network slightly outperforms ''Early Fusion'' on translational and rotational errors. Moreover, the number of training epochs used by the ''Slow Fusion'' architecture is considerably smaller than the ''Early Fusion'' one (350 for ''Slow Fusion'' compared to 1000 for ''Early Fusion''). This may indicate that learning inter-frame dependencies in a progressive manner can be more beneficial for the visual odometry task.

One key observation about the use of 3D convolutions in the above architectures is that they try to simultaneously extract spatial and temporal features and learn interactions between both of them at the same time. We also want to point out that since each 3D convolution examines only pixels
(or data points in a feature map) that are close in space and time, only short-range dependencies are considered at each layer to generate the features of the next layer. Hence, some of the long-range interactions that may be easily identifiable in the first layers may be hard to learn by these architectures.

DeepVO [82] solves the visual odometry problem by adding a recurrent neural network (RNN) on top of a CNN. In the proposed framework, appearance features are first extracted from the current frame using the CNN. They are then fed to an RNN, which tracks temporal appearance changes across frames to infer relative odometry. The architecture also incorporates prior odometry knowledge to perform estimation, which potentially prevents the model from overfitting to the training dataset. It implements the CNN as a FlowNet architecture, which is specialized in extracting optical flow data from image sequences.

DeepVO showed impressive performance on the KITTI dataset [83] even in previously unseen scenarios, competing in terms of localization error with state-of-the-art monocular visual odometry algorithms and establishing DeepVO as a baseline architecture for end-to-end VO learning.

We can argue that the great performance of DeepVO is mainly related to the decoupling of learning spatial and temporal features. Since each input frame is by itself representative of its spatial features, it may be easier for the CNN to look at each frame individually to extract spatial features that are relevant for the visual odometry task. Then, the role of the RNN would be to find short- and long-range non-linear dependencies between those extracted features in the time domain. Another benefit is that supervised visual odometry learning naturally makes pose predictions from monocular images with the global scale maintained. This is thanks to deep learning networks' ability to implicitly encode scale-related features from a large collection of images.
2) SELF SUPERVISED LEARNING
To solve odometry estimation without manual supervision, [52] proposed using novel view synthesis as a supervisory signal. Their architecture is composed of two networks: DepthNet and PoseNet. The depth network is based on an auto-encoder design and implemented as a DispNet network. It takes a single image as input and generates its corresponding depth map. The pose network, which consists of a convnet architecture, estimates the relative transformation between a source and a target view. The two estimations are then used to construct a synthetic image from a source image. If the depth and relative motion were accurately estimated, the synthesized target should match the ground truth one. For this purpose, the authors trained their network on matching pixel brightness.

Initially, two major difficulties remained unsolved. First, the generated pose estimate is scale ambiguous, which limits its usage in practical scenarios. Second, relying on photometric error implicitly assumes that the scene is static and free of occlusions. While the authors of [52] tried to mitigate this by introducing an explainability mask, their approach does not fully address the problem.

To fix those issues, a large body of work followed. To solve the global scale challenge, the authors of [84] trained their network on stereo image pairs and performed testing on monocular ones. As the geometrical transformation between left and right views is fixed and known during the whole training, an additional left-right photometric consistency is introduced to the model. As a result, their network benefits from the stereo view's capacity to capture scale information. In [85], the authors introduced a 3D geometrical loss to the training process by enforcing consistency between the predicted depth map and a reconstructed one. More specifically, the network predicts the depth map of both the source and target images, as well as their relative transformation. The predicted depth is then projected onto 3D space, and the estimated relative transformation is used to align the two depths (predicted and reconstructed), thus yielding a consistent scale of the predictions.

D. LEARN TO MAP
Recent years have seen a surge in algorithms and techniques to model the three-dimensional physical world and perform efficient reasoning on the generated representations.

3D object modeling was first studied in this context in the field of computer graphics, where practitioners devote laborious efforts to redesigning complex objects in CAD systems. Mapping is another technique used in robotics to represent the perceived real world. It designates the mobile agent's capacity to build a consistent representation of the scenes it perceives during operation. Mapping differs from 3D object modeling in that it does not require human intervention. Instead, it relies solely on external sensory information such as visual inputs and range data, and some internal processing to represent the perceived scene as a whole rather than individually modeling each object within the scene.

When creating a representation of the real-world environment, many factors must be considered. Perhaps the oldest and most widely used method seeks to determine whether or not specific regions of the search space are occupied, with the goal of achieving safe and collision-free navigation. This yields a space-free map representation. A second consideration consists of answering the question of what the world looks like. This results in a geometrical representation that describes the layout and shape of the perceived scene. Finally, knowing what each part of the search space relates to, recognizing the seen items, and dividing them into well-defined classes can all be useful. This gives rise to a semantic representation of the environment. In the following, we will describe each map representation in more detail.
1) SPACE-FREE MAPS
To describe an environment in terms of space-free regions, two major paradigms have been extensively used in the literature: grid-based and topological [86].

Grid-based approaches produce accurate metric maps that represent the environment through evenly-spaced grids. However, exploiting those maps suffers from high temporal and spatial complexity. Topological maps, on the other hand, use a graph structure to describe the environment, with nodes denoting recognizable places and edges designating direct collision-free paths between them. This results in a more efficient and compact map, though the burden of maintainability increases when the environment becomes larger.

Metric and topological maps are orthogonal by nature, with the weaknesses of one remedied by the strengths of the other, and choosing between the two requires a trade-off between accuracy and efficiency. In practice, both maps are coupled to each other [87], with grid maps being used first to provide an accurate estimate of the obstacles' location and disentangle similarly looking places, followed by topological representations for more efficient planning and navigation.
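For readers unfamiliar with grid-based maps, the classical way to maintain one is a per-cell log-odds update from sensor observations. The short sketch below shows this standard update; the inverse sensor model probabilities and grid size are arbitrary illustrative values.

```python
import numpy as np

class OccupancyGrid:
    """Minimal evenly-spaced occupancy grid maintained with per-cell log-odds."""

    def __init__(self, shape=(100, 100), p_hit=0.7, p_miss=0.4):
        self.logodds = np.zeros(shape)                      # 0 log-odds == p(occupied) = 0.5
        self.l_hit = np.log(p_hit / (1.0 - p_hit))          # increment for an occupied reading
        self.l_miss = np.log(p_miss / (1.0 - p_miss))       # decrement for a free reading

    def update(self, cell, occupied):
        i, j = cell
        self.logodds[i, j] += self.l_hit if occupied else self.l_miss

    def probability(self):
        return 1.0 / (1.0 + np.exp(-self.logodds))          # back to occupancy probabilities

grid = OccupancyGrid()
grid.update((10, 12), occupied=True)
grid.update((10, 13), occupied=False)
print(grid.probability()[10, 12:14])
```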
2) GEOMETRIC MAPS
The physical layout and structure of the environment can be represented in a variety of ways. The most prevalent ones are depth maps, point clouds, meshes and volumetric voxel representations. In the depth map representation, the depth of every pixel in the perceived scene is estimated. This corresponds to predicting the distances separating the camera from each pixel in the image, which gives rise to a dense depth view of the surroundings.

As was previously stated, there is a plethora of ways for learning depth maps from visual inputs in the literature, varying essentially in the type of input used in training and testing (either monocular or stereo), the architectural design, and the supervisory signals used during the learning process.

Point clouds, on the other hand [88], consider only a subsample of the image pixels to build a 3D model of the perceived scene. More accurately, they sample pixels from the viewed images and project them back into 3D space. The benefit is an easier understanding and manipulation of the representation. However, the same geometry can be represented by different point clouds and, inversely, the same point cloud can model different geometries, which may lead to ambiguity.

Mesh encoding [89] is another representation alternative where objects are modeled by encoding their salient features such as edges and surfaces.

Last, the volumetric voxel representation [90], [91] describes a given scene by populating a uniform grid with elementary volumetric units that constitute parts of solid objects in the scene.

It is worth mentioning that mesh and volumetric representations can endow visual systems with great expressive power, as they explicitly encode the structure and shape of the perceived elements. However, the high maintainability costs they incur hinder their wide application in practice. Therefore, most visual SLAM systems prefer using depth and point cloud representations instead.
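The relation between the depth-map and point-cloud representations discussed above is a simple back-projection through the camera intrinsics. The following NumPy sketch converts a depth map into a point cloud under an assumed pinhole camera model; the intrinsic values in the example are placeholders.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (us.reshape(-1) - cx) * z / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (vs.reshape(-1) - cy) * z / fy
    points = np.stack((x, y, z), axis=1)
    return points[z > 0]                        # drop pixels with no valid depth

# Placeholder intrinsics and a random depth map for illustration.
cloud = depth_to_pointcloud(np.random.rand(48, 64) * 5.0, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
print(cloud.shape)
```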
the mismatch between the current observation and the built estimates. Nonetheless, visual aliasing and perceptual variability make it challenging to recognize visited locations. Perceptual aliasing (false positives) refers to a high degree of similarity between different places that leads to incorrect loop detection. Conversely, perceptual variability (false negatives) designates the change in the appearance of the same scene caused by factors such as changing arrangements of movable objects within the scene, which may prevent loop closure identification.

Traditional methods for loop closure detection usually build a database of the perceived images expressed in a hand-crafted feature space. They use an image description technique such as Bag-of-Words [107], Fisher kernels [108], or vectors of locally aggregated descriptors [109], [110], and compare each recently observed image with every entry in the database using a similarity distance such as the cosine distance. This results in a visual similarity matrix where loop closures are spotted in off-diagonals with high similarity scores [111]. However, this approach rapidly becomes intractable as the environment becomes larger due to the many image pairs that need to be compared. Alternatively, [112] proposed an incremental implementation of the problem by casting loop closure detection in a Bayesian framework, resulting in a real-time scene recognition method. However, their solution needs a careful design of a probabilistic transition model that strongly depends on the environment and robot motion.

On the other hand, recent advances in deep learning have shown superior performance in image classification tasks [113], suggesting that the features extracted by deep neural networks are more convenient for visual tasks. In [114], the authors compared the performance of the features learned from various layers of a CNN network with SIFT descriptors on a descriptor matching benchmark; deep features from different layers were shown to consistently outperform SIFT. More intriguingly, the CNN network was trained on a classification challenge rather than a descriptor matching task. This implies that the learnt deep CNN features encompass relevant visual features that can be applied to various visual tasks. Many academics were inspired to develop new deep learning algorithms to tackle the loop closure problem as a result of this.

1) SUPERVISED LEARNING
Several researchers tried to use supervised methods for loop closure detection. For instance, [115] combined features extracted from a pretrained CNN network with spatio-temporal filtering to perform place recognition. The proposed method produces, for each layer of the deep network, a corresponding confusion matrix M_k, k = 1, . . . , 21, where each element M_k(i, j) indicates the Euclidean distance between the feature vector responses to the i-th training image and the j-th testing image, and candidate loop closures occur at minima locations. Then, these hypotheses are further
validated through spatial and temporal filters, with the spatial filter enforcing the spatial proximity of the plausible closures and the temporal filter verifying their temporal closeness; a precision of 100% and a recall improvement of 75% were achieved on the measured data set.

The authors of [116] evaluated the performance of image representations generated by a pretrained CNN model at intermediate layers in terms of their ability to detect loop closures. Two key findings were the ability of deep features to surpass their state-of-the-art competitors in the presence of significant lighting changes and the fast feature extraction capability of deep learning methods compared to hand-crafted ones.

Similarly, [117] compares the performance of four popular deep learning architectures (PCANet [118], AlexNet [113], CaffeNet [119] and GoogLeNet [120]) to two hand-crafted techniques, one based on local BoW features [107] and the other on global GIST descriptors [121], in the problem of loop closure detection. In their approach, the authors used the learned last-layer features of each deep network as image descriptors. Then, each pair of image descriptors (from the same network) is concatenated into a single vector, together with a ground truth label that indicates whether the two images close a loop. Finally, a Support Vector Machine (SVM) classifier is trained on the constructed dataset for loop closure detection. This procedure was carried out for each deep learning and hand-crafted method, and the results demonstrated a significant gain in accuracy and processing time for the deep learning approaches.
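A common denominator of these approaches is to reduce each image to a global descriptor and search for highly similar, temporally distant pairs. The sketch below illustrates that basic recipe with cosine similarities between (assumed precomputed) CNN descriptors; the similarity threshold and the minimum temporal gap are arbitrary illustrative values, and real systems add the validation stages described above.

```python
import numpy as np

def loop_closure_candidates(descriptors, min_gap=30, threshold=0.9):
    """Return candidate loop closures (i, j, score) from global image descriptors.

    descriptors: (N, D) array, one CNN (or hand-crafted) descriptor per frame.
    A pair is a candidate if its cosine similarity is high and the frames are
    far enough apart in time to exclude trivially similar neighbours.
    """
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    similarity = d @ d.T                          # visual similarity matrix
    candidates = []
    n = len(d)
    for i in range(n):
        for j in range(i + min_gap, n):           # off-diagonal entries only
            if similarity[i, j] > threshold:
                candidates.append((i, j, float(similarity[i, j])))
    return candidates

# Example with random vectors standing in for deep descriptors.
print(loop_closure_candidates(np.random.rand(100, 128))[:5])
```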
2) SELF SUPERVISED LEARNING
Other researchers were interested in learning a loop closure detector without explicit supervision. Reference [122] presented a self-supervised method for learning loop closure detection. With the goal of building a more resilient network, the proposed approach relies on a stacked auto-encoder trained to automatically recover input patches that were randomly and intentionally affected by noise in the form of pixel dropout. The encoder part of the network was then used as an image descriptor, and additional weight was assigned to each output unit response according to its discriminatory property to penalize redundant features. The effects of corruption were then evaluated by experiments.

Further, based on an auto-encoder architecture, the method in [123] alters the input data with random projective transformations to enforce viewpoint invariance in the learnt description. The neural network is then trained on recovering HOG features [124] rather than original images to take advantage of their robustness to changes in illumination. This resulted in an appearance-invariant image descriptor, which is more suitable for measuring place similarity.

VI. JOINT LEARNING
It has been widely established that many visual SLAM modules are linked together by various geometric constraints. Hence, learning representations that take those interdependencies into account can lead to more accurate models. This is the essence of joint learning, in which the learning architecture is made up of several sub-networks, each of which is responsible for learning a specific sub-task. However, the individual sub-tasks are not explicitly learned. Rather, they are jointly optimised to perform a more general objective, in the sense that learning the global task will be possible only if each individual sub-task has been correctly, though implicitly, learnt. The benefit is a more reliable model, as it must use a solid theoretical prior for connecting the sub-networks to make implicit learning of the sub-tasks possible.

As opposed to modular learning, joint learning can exploit the full relationship between the different modules, albeit at the expense of a more complex architecture. To the best of our knowledge, only depth, optical flow, and ego-motion have been jointly optimized in the context of a deep learning framework, due to their well established interdependencies. Other newly proposed approaches use a single end-to-end deep learning architecture to directly optimize the entire visual SLAM pipeline. In the following, we will go through each of these approaches in further depth.

A. DEPTH, OPTICAL FLOW, AND EGO-MOTION
As mentioned earlier, depth and ego-motion are related by well-known geometric constraints. However, when considering optical flow as well, a stronger relationship has been established in recent works, resulting in models that better describe the motion perceived within the scene and overcome the limitations of using depth and ego-motion alone when applied to dynamic regions.

Sfm-Net [125] is, to the best of our knowledge, the first architecture that combines depth, ego-motion and optical flow in a unified framework. The method is based on two autoencoders, one for estimating scene structure and the other for estimating motion. It dedicates a separate channel to each source of motion (camera and moving objects). Then, object masks are generated to assign each pixel to its corresponding motion channel. Finally, a warping technique based on optical flow computation is used to assess the consistency of the learnt estimates. However, Sfm-Net needs to know a priori how many moving objects are within the scene (in the experiments conducted in [125], only 3 dynamic objects were considered), limiting its application to environments with only a few dynamic elements.

Perhaps another limitation of Sfm-Net relates to the nature of learning object motion. The latter is modeled from scratch, neglecting the fact that the apparent movement of pixels is the result of both object motion and ego-motion. GeoNet [126] tries to address this issue by using a two-stage learning scheme. The core idea consists of separating the motion of objects within the scene from that resulting solely from the movement of the camera. More precisely, the source image's pixel-wise 3D locations P_s are first computed by projecting
each pixel p_s into 3D space using the predicted depth map D_s(p_s) and the camera intrinsic parameters K; see Eq. (2). The camera's estimated relative motion T_{s→t} is then used to track, in 3D space, each computed 3D point; see Eq. (3). Following that, each 3D target point P_t is reprojected back to the image plane; see Eq. (4). The discrepancy between the pixel source and pixel target coordinates results in the rigid flow produced by the camera motion alone; see Eq. (5). Finally, this rigid flow is iteratively refined using a ResNet network to match the motion of each dynamic object within the scene.

P_s = D_s(p_s) K^{-1} p_s    (2)
P_t = T_{s→t} P_s    (3)
p_t = K P_t    (4)
f^{rig}_{s→t} = p_t − p_s    (5)
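Equations (2)-(5) translate directly into a few tensor operations. The sketch below computes the rigid flow induced by camera motion from a predicted depth map and relative pose, in the spirit of [126]; the tensor shapes, the homogeneous-coordinate handling and the example intrinsics are our own illustrative choices.

```python
import torch

def rigid_flow(depth, T, K):
    """Rigid flow from camera motion alone, following Eqs. (2)-(5).

    depth: (H, W) predicted depth of the source frame
    T:     (4, 4) relative camera motion T_{s->t} in homogeneous form
    K:     (3, 3) camera intrinsics
    Returns a (H, W, 2) flow field f_rig = p_t - p_s.
    """
    h, w = depth.shape
    vs, us = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ones = torch.ones_like(depth)
    p_s = torch.stack((us.float(), vs.float(), ones), dim=0).reshape(3, -1)  # homogeneous pixels

    # Eq. (2): back-project to 3D, P_s = D_s(p_s) K^{-1} p_s
    P_s = depth.reshape(1, -1) * (torch.linalg.inv(K) @ p_s)
    # Eq. (3): move the points with the relative camera motion, P_t = T_{s->t} P_s
    P_t = (T @ torch.cat((P_s, ones.reshape(1, -1)), dim=0))[:3]
    # Eq. (4): re-project to the image plane, p_t = K P_t (then normalise by depth)
    p_t = K @ P_t
    p_t = p_t[:2] / p_t[2:3].clamp(min=1e-6)
    # Eq. (5): rigid flow is the displacement of the pixel coordinates
    return (p_t - p_s[:2]).reshape(2, h, w).permute(1, 2, 0)

# Example with an identity motion and placeholder intrinsics (flow should be ~0).
K = torch.tensor([[60.0, 0.0, 32.0], [0.0, 60.0, 24.0], [0.0, 0.0, 1.0]])
flow = rigid_flow(torch.rand(48, 64) + 1.0, torch.eye(4), K)
print(flow.shape)   # torch.Size([48, 64, 2])
```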
We observe that one major benefit of such an approach is that it explicitly separates features common to all scenes, representing camera-induced motion, from those specific to each scene, resulting from the motion of dynamic objects. This results in disentangled representations that may generalize better to unseen environments. Similarly, [127] employs residual flow estimation in the case of stereo video. However, residual flow can only correct for small errors and generally tends to fail in the case of large pixel displacements. To improve performance, [128] introduced cross-task learning into optical flow estimation. The main idea is to enforce matches between warped scenes coming from residual flow and those produced by a dedicated optical flow network, under the insight that simultaneous learning of the same task under diverse cues provides more consistent supervisory signals. Nonetheless, the aforementioned approaches handle dynamic objects in an implicit manner by correcting for inconsistent estimates.

Other works suggest leveraging semantic segmentation instead for a more robust estimation. For instance, in [129] and [130], the authors propose modeling the motion in the dynamic areas as well. They first segment the image into static and dynamic parts. Then, for each moving object that has been detected, they apply a separate network to estimate its 3D rigid transformation. Finally, they populate the warped scene resulting from ego-motion alone with that generated by dynamic object motion using the expression in Eq. (6), where T̂^{obj} refers to an object's rigid motion (the only difference between the pixel correspondence from source to target in the static and dynamic setups is the introduction of T̂^{obj}). The network is then trained to match the original frame with the warped one.

p_t = K T_{s→t} T̂^{obj} D_s(p_s) K^{-1} p_s    (6)

Another interesting direction was proposed in [131], which casts depth, ego-motion, optical flow, and motion segmentation as a game problem. In this setting, the network modules are assimilated into players who compete and collaborate to reduce warping losses. The key idea consists of training the network in two phases: competing and collaborating. In the competing phase, the static scene reconstructor and moving region reconstructor networks compete against each other to minimize losses only in their assigned regions, given by the motion segmentation network. Then, in the collaborating phase, those two networks form a consensus to improve the segmentation network's pixel assignment. The benefit is that a more robust segmentation can be used for improved structure and motion estimation.

B. END-TO-END LEARNING
End-to-end learning is a very promising direction to solve the VSLAM problem as it directly optimises all VSLAM modules at once. As a consequence, it provides models that are more resilient to noise and uncertainties. Yet, building an end-to-end learning architecture is not simple as it involves careful handling of all inter-module dependencies in a differentiable manner to make learning through backpropagation possible. The recent introduction of a differentiable implementation of the particle filter [132], a widely used algorithm in classical VSLAM approaches, has made end-to-end learning of VSLAM possible.

For instance, [133] has introduced DMN, a differentiable mapping network that uses a differentiable particle filter to learn a view embedding map of the environment that is specifically optimized for visual localization. However, since the map's representation is abstract (not easily interpretable), it cannot be used for tasks other than localization. DeepSLAM [134], on the other hand, jointly learns both the robot's pose and the environment's 3D map in an end-to-end unsupervised manner. It combines an autoencoder-based mapping network to regress a depth map of the environment together with an RCNN-based tracking network to estimate the 6-DoF pose of the robot. Then, using a pretrained loop closure detector together with a graph optimization procedure, it enables global consistency of the generated 3D map and pose of the robot. More specifically, at training time, DeepSLAM jointly minimizes a stereo warping loss between left and right stereo pairs together with a temporal warping loss between consecutive frames. Several supervisory signals are employed during the training phase, including stereo photometric warping loss, consistency between right and left estimated poses, novel view synthesis, and 3D geometric registration. Thus, it allows for a more consistent estimate of the map and the pose of the robot. Another clear advantage of DeepSLAM is that it uses only RGB input at test time, which makes it suitable for both indoor and outdoor scenarios. Yet, despite including a module for processing error maps between consecutive estimations, the uncertainty handling is only directed towards outlier rejection rather than tracking the true confidence in the various predictions.

More recently, SLAM-net [135] proposes a differentiable implementation of the particle filter-based FastSLAM algorithm [136] to learn transition and observation models, which are subsequently used to perform mapping and localization in a probabilistically consistent manner. Nevertheless,
it assumes planar motion of the robot, restricting its application in more complex environments.
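Since several of the end-to-end systems above build on differentiable particle filters, it is worth illustrating the one step that breaks differentiability in a classical filter, resampling, and the soft-resampling workaround commonly used to restore gradients in that line of work [132]. The sketch below is a simplified PyTorch illustration of that trick; the mixture coefficient alpha and the Gaussian observation model in the example are placeholders.

```python
import torch

def soft_resample(particles, log_w, alpha=0.5):
    """Soft resampling: draw indices from a mixture of the particle weights and a
    uniform distribution, then apply an importance-weight correction so that the
    new weights remain differentiable with respect to the old ones.

    particles: (N, D) particle states, log_w: (N,) unnormalised log weights.
    """
    n = log_w.shape[0]
    w = torch.softmax(log_w, dim=0)
    q = alpha * w + (1.0 - alpha) / n                  # sampling distribution
    idx = torch.multinomial(q, n, replacement=True)    # the non-differentiable draw
    new_particles = particles[idx]
    new_w = w[idx] / q[idx]                            # differentiable importance correction
    new_log_w = torch.log(new_w / new_w.sum())
    return new_particles, new_log_w

# One filter step with a placeholder Gaussian observation model.
particles = torch.randn(128, 3, requires_grad=True)     # e.g. (x, y, heading)
log_w = -((particles[:, :2] - torch.tensor([1.0, 2.0])) ** 2).sum(dim=1)
p, lw = soft_resample(particles, log_w)
lw.sum().backward()                                     # gradients reach the original particles
print(p.shape, particles.grad is not None)
```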
VII. CONFIDENCE LEARNING
SLAM is essentially a state estimation problem where uncertainty is inherently present. Most existing work in the literature uses a fixed model for describing the camera noise distribution [137]. However, this might not always be the case in real-world applications due to many unpredictable factors such as occlusions, the sudden appearance of obstacles, and texture-less environments. In SLAM, each current prediction serves as input to the next estimate. Thus, even if the estimation errors are small, as is the case when using deep learning, they may accumulate over time and lead, in the long term, to inconsistent maps and poses. As a result, keeping track of those uncertainties is of utmost importance for building more accurate and reliable SLAM solutions. This motivated many researchers to explore deep learning's potential for handling SLAM uncertainty. In this context, we find two different approaches: those that try to directly reduce uncertainty without explicit estimation, and others that directly estimate the different uncertainty values.

A. UNCERTAINTY REDUCTION
Most deep unsupervised approaches for learning VSLAM rely on the brightness constancy assumption, which stipulates that pixels of different frames that correspond to the same scene coordinate (in 3D) must share the same color. This assumption is generally violated in real-world circumstances due to illumination variations, non-Lambertian surfaces, and the presence of dynamic objects [138]. Initial attempts to reduce the uncertainty associated with this assumption were made in the context of depth and ego-motion estimation by training an explainability mask, which outputs the model's belief of where it might succeed. This results in a per-pixel soft mask that down-weights predictions in regions of high uncertainty [52], leading to an implicit uncertainty reduction. However, we observe that this method only acts as a filter that prevents ambiguous features from being considered during the training phase. This means that unmodeled artefacts that had not been seen during training may still mislead the model predictions.
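To make the role of the explainability mask explicit, the following minimal sketch shows how a per-pixel soft mask can down-weight a photometric error map during training, with a small regularizer that keeps the mask from collapsing to zero. The regularizer and its weight are illustrative choices in the spirit of [52], not the exact formulation of that work.

```python
import torch

def masked_photometric_loss(photo_error, mask_logits, reg_weight=0.2):
    """Down-weight per-pixel photometric errors with a learned explainability mask.

    photo_error: (B, 1, H, W) absolute photometric error between real and warped frames
    mask_logits: (B, 1, H, W) raw network output for the explainability mask
    """
    mask = torch.sigmoid(mask_logits)                  # soft mask in (0, 1)
    weighted = (mask * photo_error).mean()
    # Regularizer: encourage the mask to stay close to 1 so it cannot
    # trivially explain every pixel away.
    prior = -torch.log(mask + 1e-6).mean()
    return weighted + reg_weight * prior
```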
Other authors have proposed specialized networks for error correction. For instance, the authors in [139] designed DPC-Net, a deep neural network for pose correction that can be added to an existing pose estimator to fuse small pose adjustments into the original estimates. The network takes as input two successive stereo pairs and learns, by means of a convolutional neural network, geometric transformations to be applied to the original estimate. It parameterizes the predictive corrections using a Lie algebra formulation [140] to take into account the correlation between the translational and rotational errors, and demonstrates an accuracy improvement even in situations of poorly calibrated lens distortion. Similarly, the authors of [141] explored the use of a stacked LSTM network to correct visual SLAM pose estimation errors. The deep network takes as input the trajectory generated by a conventional semantic SLAM algorithm. It then identifies and corrects probable pose estimation errors under a variety of uncertainties, including measurement errors, sensor failures, and data processing faults.

Other studies investigated how uncertainty in scene understanding could help SLAM systems, in addition to motion uncertainty. In this regard, [142] proposed an information-theoretic strategy to reduce uncertainty in selected map keypoints. Their approach is based on careful feature selection of points that provide the highest reduction of Shannon entropy. Consequently, a sparse map of features that can be reliably detected over long distances was produced.

B. UNCERTAINTY ESTIMATION
A common deep learning architecture for estimating uncertainty is the Bayesian neural network [143]. A variant which is well suited to visual tasks is the Bayesian convolutional neural network [144], which has been widely used in computer vision.

In the visual SLAM context, [145] uses a Bayesian convolutional neural network to regress camera 6-DoF poses directly from raw RGB images. Their model was able to measure camera relocalization errors, which were then exploited to improve the estimates further, obtaining 2 m and 6° of accuracy for large-scale outdoor environments and 0.5 m and 10° of accuracy indoors. Similarly, [146] leveraged a convolutional Bayesian network to incorporate the global orientation from the sun into the visual odometry estimation. On the other hand, the authors of [147] introduced a novel monocular depth estimation network trained without supervision on stereo videos. Their method is based on modeling the pixel photometric uncertainty and, to avoid wrong data associations that may come from left and right image illumination variations, they first align both images to the same brightness conditions using predicted brightness transformation parameters. This makes their uncertainty modeling more resilient to other non-trivial artefacts such as non-Lambertian surfaces, featureless areas and moving obstacles, achieving state-of-the-art performance on the KITTI [83] and EuRoC MAV [148] datasets.

In addition, mapping uncertainty was also investigated in [149] and [150]. The authors of [149] introduced CMP, a probabilistic mapper and planner which incrementally updates its belief about the map's free space and occupied regions. It warps previous beliefs (from past frames) with the current egomotion to predict how the map will change as a result of motion. Then, it aligns its updated belief with the current observation, and a deep model is trained end-to-end to optimize selected actions to achieve a high-level planning task. The benefit of such an approach is that the generated maps are task-oriented. For instance, the authors show that CMP can predict free space in regions that have not yet been fully observed and that lead to a location of interest. To the best of our knowledge, this was the first successful
implementation of a deep visual SLAM system under classical SLAM principles (Bayesian updates of beliefs). Yet, despite focusing only on mapping (ego-motion is assumed to be provided to the system), CMP can serve as an important guide to shrink the gap between deep learning and classical approaches. Another limitation of CMP concerns its static scene assumption.

The authors of [150] developed a deep learning architecture to probabilistically capture the trajectory of dynamic vehicles on highways. Their approach involves integrating the attention mechanism with LSTMs to create a dynamic occupancy grid map that represents the new vehicle locations after a fixed time. This is motivated by the insight that some vehicles influence the behavior of other vehicles more than others, and capturing temporal features of those elements of interest may ease the learning process.

However, the previous methods treat the VSLAM uncertainty estimate as a unimodal estimation, which goes against the inherent interdependence between the various SLAM modules (the uncertainty of one module strongly affects the uncertainty of others). Very recently, [135] proposed a differentiable implementation of FastSLAM, a classical SLAM system, and was able to encode a deep learning model capable of tracking the various uncertainties of mapping and localization and adjusting its beliefs accordingly in a probabilistically consistent framework.
obstacle avoidance [158].
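As a rough illustration of how a FastSLAM-style filter keeps track of pose uncertainty, the sketch below performs the classical (non-differentiable) particle weighting and resampling step. It is only a schematic stand-in for the differentiable formulation of [135]; the Gaussian observation model and all parameter values are assumptions.

```python
import numpy as np

def update_belief(particles, weights, predicted_obs, actual_obs, sigma=0.5):
    """One classical particle-filter correction step: reweight particles by the
    likelihood of the current observation, normalize, and resample when the
    effective sample size collapses. The spread of the surviving particles is
    the filter's running estimate of pose uncertainty."""
    err = predicted_obs - actual_obs
    likelihood = np.exp(-0.5 * np.sum(err ** 2, axis=1) / sigma ** 2)
    weights = weights * likelihood
    weights /= weights.sum() + 1e-12

    # Resample if the effective number of particles is too small.
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights

# Toy usage with 100 particles over a 3-DoF pose (x, y, yaw).
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=(100, 3))
weights = np.full(100, 1.0 / 100)
particles, weights = update_belief(particles, weights,
                                   predicted_obs=particles[:, :2],
                                   actual_obs=np.array([0.2, -0.1]))
```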
VIII. ACTIVE LEARNING

Active SLAM refers to the robot's ability to intelligently explore its environment and take optimal decisions and actions to improve its localization, map, and perception of the environment. It can be addressed from two different perspectives: exploration and perception. Active exploration [152] focuses on controlling the robot motion to reduce pose and map uncertainty, while active perception [153], [154] is defined as the process of ingeniously controlling the robot sensors to obtain relevant information about the environment and reduce sensing errors.

Aside from traditional methods, such as model predictive control [155], frontier-based methods [156], and random tree strategies [157], the problem of selecting the most convenient action during robot navigation has recently been addressed by deep learning methods.
A. ACTIVE EXPLORATION

Initially, any SLAM system is completely unaware of the map of its environment. At each time step it can only build a local description of its surroundings, limited by the range of its on-board sensors. Those local maps are then integrated in a subsequent step to form an overall estimate of the global map.

Ideally, a SLAM system needs to visit each area of its environment at least once to build a full description of the global map. However, in most cases a single pass through an area is not sufficient. This is mainly due to errors in the estimation model or the sensors. In theory, revisiting the same area multiple times during the robot exploration phase may reduce the map and pose uncertainties. However, constantly looping around a single location may be a waste of time and resources. Moreover, not all locations and views of the environment are equally informative, and some areas may even mislead the estimates due to factors such as poor lighting and clutter. Hence, a VSLAM system needs an appropriate exploration strategy to achieve more accurate map and pose estimates.

In practice, most work on VSLAM uses a human operator in the exploration phase to initiate the robot's internal map representation by ensuring maximal coverage of the area of interest. However, this soon becomes impractical and laborious as the environment grows in size. On the other hand, some recent techniques have been proposed to make VSLAM systems able to learn how to efficiently explore the environment in an active manner, without human intervention.

Active exploration involves appropriate reactions to the various events that a robot may encounter during navigation. In VSLAM, active exploration uses the information gained from past views of the environment to decide the next action the vehicle should take to gain in performance or to visit new interesting areas of the environment. In this regard, deep reinforcement learning has attracted many researchers thanks to its capability of learning through interactions. It was initially used in the context of collision-free navigation, in which a mobile robot uses simple extrinsic rewards to encourage obstacle avoidance [158].

However, efficient navigation in the context of VSLAM entails more than just avoiding obstacles; it also requires an optimal selection of actions to gain in performance or to reduce the uncertainty in mapping the environment and localizing the robot. In this context, the authors of [159] used deep reinforcement learning to directly map robot observations to the actions needed to effectively explore new environments in an end-to-end manner. To this effect, they designed a new intrinsic reward that favors discovering unexplored areas by computing, at each step, the difference in coverage between the current estimated map and the map built at the previous time step. Any increase in coverage is then rewarded positively. The model was also pretrained using imitation learning, by mimicking expert demonstrations, to overcome the challenge of sparse rewards in complex real 3D environments. This helped in accelerating the learning process.
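The sketch below shows one minimal way to compute such a coverage-based intrinsic reward from successive occupancy estimates; the binary "explored" masks and the thresholding are assumptions made for illustration and do not reproduce the exact formulation of [159].

```python
import numpy as np

def coverage_reward(prev_map, curr_map, threshold=0.5):
    """Intrinsic reward = newly covered area between two map estimates.

    Both maps are per-cell 'explored' confidences in [0, 1]; a cell counts as
    covered once its confidence exceeds `threshold`. The reward is the number
    of newly covered cells, so any increase in coverage is rewarded positively."""
    prev_covered = prev_map > threshold
    curr_covered = curr_map > threshold
    newly_covered = np.logical_and(curr_covered, np.logical_not(prev_covered))
    return float(newly_covered.sum())

# Toy usage on a 4x4 grid: three more cells became confidently explored.
prev_map = np.zeros((4, 4))
curr_map = np.zeros((4, 4))
curr_map[0, :3] = 0.9
print(coverage_reward(prev_map, curr_map))  # 3.0
```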
However, [160] pointed out that directly learning the low-level actions end-to-end can suffer from sample inefficiency due to the high complexity of real-world environments and the many scenarios that need to be explored. Instead, they proposed a hierarchical approach consisting of three learnable modules. First, a neural SLAM module uses supervised learning to predict maps and robot poses from the incoming RGB images and sensor readings. Those maps and poses are then consumed by a global policy module to produce long-term goals using reinforcement learning, with a training scheme that favors high coverage of the area to explore, similar to [159]. The long-term goal is then converted into a short-term goal using an analytical method. Finally, a local policy network is trained using imitation learning to map the short-term goal to the action the robot needs to execute. This hierarchical decomposition made their model able to generalize better to unseen environments and to outperform previous methods on the exploration task.
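The skeleton below shows how such a hierarchical decomposition can be wired together at inference time. The three module interfaces, the analytical planner, and all names are hypothetical placeholders chosen to mirror the description above, not the actual implementation of [160].

```python
from types import SimpleNamespace

def hierarchical_exploration_step(rgb, sensor_reading, modules, planner):
    """One control step of a hierarchical exploration agent.

    `modules` is expected to expose three callables (hypothetical interfaces):
      - neural_slam(rgb, sensor_reading) -> (map_estimate, pose_estimate)
      - global_policy(map_estimate, pose_estimate) -> long_term_goal
      - local_policy(short_term_goal) -> action
    `planner` is an analytical planner turning a long-term goal into a nearby
    short-term goal, given the current map and pose."""
    map_estimate, pose_estimate = modules.neural_slam(rgb, sensor_reading)
    long_term_goal = modules.global_policy(map_estimate, pose_estimate)
    short_term_goal = planner(map_estimate, pose_estimate, long_term_goal)
    return modules.local_policy(short_term_goal)

# Toy stubs standing in for the learned modules, just to exercise the wiring.
stub_modules = SimpleNamespace(
    neural_slam=lambda rgb, s: ({"occupancy": None}, (0.0, 0.0, 0.0)),
    global_policy=lambda m, p: (5.0, 5.0),
    local_policy=lambda goal: "move_forward",
)
stub_planner = lambda m, p, goal: (1.0, 0.0)
print(hierarchical_exploration_step(None, None, stub_modules, stub_planner))
```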
Other interesting lines of research have focused on favoring actions that maximize the robot's knowledge of the environment, either by steering the robot towards areas that are hard to predict [161], or by using information theory to maximize the reduction of the environment's entropy [162]. From a SLAM standpoint, this has numerous advantages, as it explicitly reduces the uncertainty of the environment, resulting in more accurate estimates.
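To make the information-theoretic criterion concrete, the sketch below scores a candidate action by the expected reduction in the Shannon entropy of an occupancy grid. The per-cell independence assumption and the hypothetical predicted map are simplifications, not the exact formulation of [162].

```python
import numpy as np

def map_entropy(occupancy_probs, eps=1e-9):
    """Shannon entropy (in bits) of an occupancy grid, assuming independent
    cells with occupancy probability p and free probability 1 - p."""
    p = np.clip(occupancy_probs, eps, 1.0 - eps)
    return float(np.sum(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)))

def information_gain(current_map, predicted_map):
    """Expected entropy reduction if the map evolves from `current_map`
    to `predicted_map` after executing a candidate action."""
    return map_entropy(current_map) - map_entropy(predicted_map)

# Toy usage: driving two completely unknown cells (p = 0.5) towards certainty
# yields about 1.4 bits of information gain; the other cells are unchanged.
current = np.array([[0.5, 0.5], [0.1, 0.9]])
predicted = np.array([[0.05, 0.95], [0.1, 0.9]])
print(round(information_gain(current, predicted), 2))  # ~1.43
```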
B. ACTIVE PERCEPTION

Active perception can be defined as an intelligent data acquisition process that guides a robot's decisions in situations of partial observability. It builds incremental beliefs about the state of the environment based on successive observations and directs the robot's motion and perception toward locations that will improve its understanding of the environment. In other words, it is the robot that decides how to perceive the world based on its current and past observations.

In theory, complete certainty about what we observe is only achieved if all possible views of the scene are explored. Yet, as the environment grows in complexity, an extensive exploration of the environment becomes laborious and time-consuming. However, in practice, not every part and view of the environment is equally informative. For example, not every part of an object needs to be fully observed in order to recognize the object [163]. Hence, many approaches to learning how to efficiently acquire informative visual observations exist in the literature.

Initial attempts targeted active recognition. The objective was to learn which action to take to remove ambiguity from perceived objects. For instance, [164] trained a recurrent neural network to learn motion policies that improve the internal representation of the environment conditioned on all past views. Similarly, [165] learned visual feature representations conditioned on driving inputs to predict the action that leads to the next best view for more accurate recognition.

In the context of visual SLAM, active perception is more concerned with reducing the uncertainty in the estimates. To this effect, [166] used reinforcement learning to train an agent to select actions that reduce its uncertainty about the unobserved parts of the environment. The proposed method was able to understand 3D shapes from very few viewpoints.

IX. CHALLENGES AND FUTURE OPPORTUNITIES

Although deep learning has shown astonishing success in solving the visual SLAM problem, there are still many open challenges that need careful attention.
A. VSLAM WITH HIGH LEVEL SEMANTIC MAPS

Most of the work that has been carried out in VSLAM is limited to representing the appearance and geometrical structure of the environment. Although there have lately been some attempts to equip VSLAM systems with semantic understanding, the extracted semantic information is generally confined to segmenting the different entity classes present in the environment. However, a true comprehension of the real world should go beyond merely recognizing what is present in the surroundings. It also needs knowledge of what is happening, what each entity is doing, whether a place is safe or risky, how each element interacts with the others, and the context in which the actions are taking place. All of these issues should be addressed in order to create more meaningful semantic maps that will strongly enhance system performance on other downstream tasks. In our opinion, one major challenge that requires attention to enhance the significance of semantic maps is maintaining consistency among semantic classes across consecutive frames. Presently, the majority of studies in this field focus only on extracting semantics from the current frame, neglecting the linking of previously found semantics. However, it is often the case that consecutive frames contain numerous shared elements, and instances of the same object should not deviate greatly from their location in both frames. As such, it is imperative to investigate new approaches that can evaluate both spatial and temporal consistency among discovered semantics, in order to generate more accurate and meaningful semantic maps.
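As a toy illustration of the kind of spatio-temporal consistency check advocated here, the sketch below associates instance masks across two consecutive frames by intersection-over-union and flags matched instances whose class label changes. The IoU threshold and the mask representation are assumptions, not a method drawn from the surveyed literature.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean instance masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def check_temporal_consistency(prev_instances, curr_instances, iou_thr=0.5):
    """Match instances across consecutive frames by mask overlap and report
    matches whose semantic class changed (a likely labeling inconsistency).

    Each instance is a dict with a boolean 'mask' and a string 'label'."""
    inconsistencies = []
    for prev in prev_instances:
        for curr in curr_instances:
            overlap = mask_iou(prev["mask"], curr["mask"])
            if overlap >= iou_thr and prev["label"] != curr["label"]:
                inconsistencies.append((prev["label"], curr["label"]))
    return inconsistencies

# Toy usage: the same region is labeled 'chair' then 'sofa' -> flagged.
m = np.zeros((8, 8), dtype=bool)
m[2:6, 2:6] = True
print(check_temporal_consistency([{"mask": m, "label": "chair"}],
                                 [{"mask": m, "label": "sofa"}]))
```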
B. EFFICIENCY

Despite the great accuracy of many deep learning VSLAM methods, there has been little effort to reduce the computation and energy consumption of the proposed solutions, which are both critical for real-time tasks and long-term autonomy. Most of the proposed VSLAM solutions were tested under constrained environmental conditions, primarily within indoor settings and with few dynamic elements, thereby reducing the computational demands of such systems. However, many real-world applications such as oceanic exploration, search and rescue operations, or space exploration need to operate for extended periods of time over large-scale and constantly changing dynamic environments. For these applications, the internal map and trajectory representations entail substantial memory usage as a result of the ongoing exploration of new areas and the need to track and distinguish multiple dynamic elements from the static surroundings simultaneously. This may be unsuitable for deployment on low-resource devices or when the environment becomes very large. One idea to address this issue is to construct maps that include only those elements of the scene that are relevant for the task at hand. Although some recent works proposed compact representations based on learning map embeddings [167], [168], [169], [170], they were only tested in static environments with simple illumination variations. It would therefore be very interesting if such representations could be extended to more challenging contexts.
C. INTERPRETABILITY

Deep learning techniques have enjoyed tremendous success in solving the VSLAM problem. However, the proposed models are often regarded as black boxes, and most of the underlying mechanisms are kept hidden. In most real-life scenarios, it is strongly desired to comprehend why the deep model produces a given prediction. This may reinforce people's trust in deep learning based solutions and help with their wide real-world adoption. Without intermediary guidance to solve a learning task, the predictions of a deep learning model generally rely on complex features of the inputs. For instance, it has been shown in [171] that, in a traditional CNN, a high-level filter may represent a mixture of patterns. A task as simple as identifying cats in an image may simultaneously examine the head and the leg parts of a cat to produce its predictions. This leads to entangled representations that are hard to interpret. Most of the research addressing deep learning interpretability has been carried out in a post-processing stage. The main objective is to detect the most important features the model pays attention to when trained on a learning task. For instance, the authors of [172] presented Network Dissection, a general framework for interpreting the latent representations learned by a CNN model at different layers. In [173], a new method was introduced to analyse the most critical frames and spatial features that a 3D CNN and a convolutional LSTM attend to when solving a video classification task. However, in our opinion, those approaches are only able to explain representations that are easily interpretable, and most of the complex features may be ignored. Few authors have tried to force their network to attend to interpretable features of the input at the learning stage. An interesting approach was presented in [171] to increase CNN interpretability. The authors suggested adding an additional loss on each high-level filter that encourages its activation only in regions that are close in space, therefore forcing it to describe a single part of the object. However, the field of interpretable deep learning is still in its infancy, and its application to VSLAM is very little studied.
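A crude way to picture such a localization-encouraging loss is sketched below: it penalizes the spatial spread of a filter's activation mass around its centroid. This is only a simplified stand-in for the filter loss of [171], whose actual formulation differs; the penalty and all names here are assumptions.

```python
import numpy as np

def spatial_spread_penalty(activation):
    """Penalize a filter whose (non-negative) activation map is spread out.

    The penalty is the activation-weighted variance of pixel coordinates around
    the activation centroid: a filter firing on a single compact blob gets a low
    penalty, one firing all over the image gets a high penalty."""
    act = np.maximum(activation, 0.0)
    total = act.sum()
    if total == 0:
        return 0.0
    ys, xs = np.mgrid[0:act.shape[0], 0:act.shape[1]]
    cy = (act * ys).sum() / total
    cx = (act * xs).sum() / total
    return float((act * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum() / total)

# A compact blob scores much lower than a scattered activation pattern.
compact = np.zeros((16, 16)); compact[3:5, 3:5] = 1.0
scattered = np.zeros((16, 16)); scattered[::5, ::5] = 1.0
print(spatial_spread_penalty(compact), spatial_spread_penalty(scattered))
```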
D. GENERALIZATION

In general, deep learning based VSLAM solutions are trained on specific scenarios, and the validity of the model is ensured by testing on unseen parts of the dataset with some of the configurations changed. However, nothing guarantees that all the situations that may arise in a real deployment have been covered at the training stage. We argue that the lack of generalization capability in deep learning based VSLAM methods is mainly due to the difficulty of constructing a dataset that encompasses all potential scenarios that a vehicle may encounter. The existing datasets primarily consist of data gathered under typical conditions, such as favorable weather, normal traffic flow, and sound road infrastructure. Nevertheless, when deployed, a VSLAM system may be subject to adverse weather conditions such as heavy rain or snow. This can add visibility concerns that, if not already handled at the training stage, may lead to impaired performance. Additionally, unexpected events, such as animal crossings, temporary constructions, road damage, or anomalous driver behavior, may have a significant impact if not detected by the model. Therefore, it is of utmost importance to consider such edge cases in the construction of the dataset. Furthermore, the collected data may be biased towards more prevalent examples, such as straight driving. As a result, balancing the dataset and ensuring its diversity is another crucial factor that must be addressed to achieve generalization in the trained models. Last, the authors of [174] found a way to apply imperceptible changes to the input images that completely mislead the predictions of previously correctly classified inputs. This indicates that merely testing on an unseen dataset is not sufficient to endow the deep model with robust generalization capabilities. Designing new methods for proving the generalization of deep learning models is thus necessary. Perhaps a good direction may be to assess deep learning performance under stress conditions, with aggressive driving behaviours, or by introducing meaningful perturbations on parts of the input.
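To illustrate the kind of imperceptible perturbation referred to above, the sketch below implements the later fast-gradient-sign method rather than the optimization-based attack of [174]; the tiny placeholder classifier and the loss choice are assumptions, and epsilon bounds how small (and visually imperceptible) the perturbation stays.

```python
import torch
import torch.nn as nn

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Fast-gradient-sign perturbation: one signed gradient step on the input,
    bounded by epsilon so the change stays visually imperceptible."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Toy usage with a placeholder linear classifier on a 3x32x32 image.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)
label = torch.tensor([3])
adv = fgsm_perturb(model, image, label)
print((adv - image).abs().max())  # bounded by epsilon
```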
E. PROBABILISTIC 3D VSLAM

The majority of deep learning-based VSLAM architectures approach the problem by means of deterministic map and robot pose predictions, neglecting its inherent probabilistic nature and leading in most situations to sub-optimal solutions. Recent efforts have shown that it is possible to learn VSLAM in an end-to-end fashion by encoding the probabilistic dependence between its underlying components in a differentiable manner. However, the proposed uncertainty-aware methods were only applied to simple 2D static environments. It would be very interesting to extend such approaches to more challenging 3D and dynamic environments.

F. REAL WORLD DEPLOYMENT

In the context of VSLAM, prediction accuracy is generally regarded as the gold standard by which to evaluate the proposed methods, neglecting other important factors that may impact their practical deployment in the real world. For example, system analysis in failure situations is little studied in the literature. Many sources of failure may be encountered during deployment, including measurement inaccuracies, hardware malfunctions, and incorrect predictions. Most of the proposed approaches blindly perform the robot pose and map prediction without any assessment mechanism. However, it is well known that deep learning predictions are not immune to errors. Moreover, in VSLAM, the error at a single stage can accumulate, leading to subsequent mispredictions. This raises safety concerns about the use of VSLAM solutions in the real world.

Besides, most proposed methods are evaluated using datasets with unchanging environmental characteristics (e.g., same layout, weather, illumination properties, etc.). This may pose an issue if the system is deployed for a long period of time, especially when drastic changes in the environment happen.

Lastly, robustness to unexpected events has not been well researched in the literature. For example, robot motion on uneven terrain, a push from a moving obstacle, or impaired vision due to insufficient lighting, rain, snow, or fog may induce perturbations in both localization and perception. Hence, assessing robustness to those kinds of uncertainties is of utmost importance for reliable and safe real-world deployment.

X. CONCLUSION

This work provides a comprehensive overview of deep learning methods to solve the visual SLAM problem. It proposes a novel taxonomy, covering the subject from various perspectives. Applying learning strategies to tackle robots' localization and perception remarkably boosts VSLAM performance. It allows robots to benefit from deep learning architectures' tremendous capacity to capture complex, hard-to-model features of the environment and to easily cope with the various uncertainties of visual sensory inputs. As a result, more robust solutions for real-world applications are available. In addition, deep learning methods can be easily optimised for the task of interest in a purely data-driven and human-intervention-free manner. This makes them a compelling alternative to classical hand-crafted solutions. Although deep learning based VSLAM is still in its infancy, it has outperformed classical state-of-the-art methods in many challenging scenarios, including environments with variable illumination, few or repetitive textures, occlusions, and dynamic elements. This suggests that deep learning methods are the way to go for making robots able to perceive, understand, and act in the real world.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, ''Simultaneous localization and mapping: Part I,'' IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99–110, Jun. 2006.
[2] T. Bailey and H. Durrant-Whyte, ''Simultaneous localization and mapping (SLAM): Part II,'' IEEE Robot. Autom. Mag., vol. 13, no. 3, pp. 108–117, Sep. 2006.
[3] F. Zeng, C. Wang, and S. S. Ge, ''A survey on visual navigation for artificial agents with deep reinforcement learning,'' IEEE Access, vol. 8, pp. 135426–135442, 2020.
[4] J. A. M. Rodriguez, Laser Scanner Technology. Norderstedt, Germany: Books on Demand, 2012.
[5] T. Neff, The Laser That's Changing the World. Amherst, MA, USA: Prometheus Books, 2018.
[6] G. R. Curry, Radar Essentials: A Concise Handbook for Radar Design and Performance Analysis, vol. 2. Edison, NJ, USA: IET, 2011.
[7] N. Kolev, Sonar Systems. Norderstedt, Germany: Books on Demand, 2011.
[8] X. Zhang, J. Lai, D. Xu, H. Li, and M. Fu, ''2D LiDAR-based SLAM and path planning for indoor rescue using mobile robots,'' J. Adv. Transp., vol. 2020, pp. 1–14, Nov. 2020.
[9] D. Droeschel and S. Behnke, ''Efficient continuous-time SLAM for 3D LiDAR-based online mapping,'' in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 5000–5007.
[10] J. Ruan, B. Li, Y. Wang, and Z. Fang, ''GP-SLAM+: Real-time 3D LiDAR SLAM based on improved regionalized Gaussian process map reconstruction,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 5171–5178.
[11] G. Ros, A. Sappa, D. Ponsa, and A. M. Lopez, ''Visual SLAM for driverless cars: A brief survey,'' in Proc. Intell. Vehicles Symp. (IV) Workshops, vol. 2, 2012, pp. 1–6.
[12] B. Tang and S. Cao, ''A review of VSLAM technology applied in augmented reality,'' IOP Conf. Ser., Mater. Sci. Eng., vol. 782, no. 4, Mar. 2020, Art. no. 042014.
[13] Y. Cheng, M. Maimone, and L. Matthies, ''Visual odometry on the Mars exploration rovers,'' in Proc. IEEE Int. Conf. Syst., Man Cybern., vol. 1, Oct. 2005, pp. 903–910.
[14] L. Ruotsalainen, S. Gröhn, M. Kirkko-Jaakkola, L. Chen, R. Guinness, [39] D. Scaramuzza and F. Fraundorfer, ‘‘Visual odometry [tutorial],’’ IEEE
and H. Kuusniemi, ‘‘Monocular visual SLAM for tactical situational Robot. Autom. Mag., vol. 18, no. 4, pp. 80–92, Dec. 2011.
awareness,’’ in Proc. Int. Conf. Indoor Positioning Indoor Navigat. [40] F. Fraundorfer and D. Scaramuzza, ‘‘Visual odometry: Part II: Matching,
(IPIN), Oct. 2015, pp. 1–9. robustness, optimization, and applications,’’ IEEE Robot. Autom. Mag.,
[15] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, ‘‘MonoSLAM: vol. 19, no. 2, pp. 78–90, Jun. 2012.
Real-time single camera SLAM,’’ IEEE Trans. Pattern Anal. Mach. [41] S. Zagoruyko and N. Komodakis, ‘‘Learning to compare image patches
Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007. via convolutional neural networks,’’ in Proc. IEEE Conf. Comput. Vis.
[16] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, ‘‘ORB-SLAM: A Pattern Recognit. (CVPR), Jun. 2015, pp. 4353–4361.
versatile and accurate monocular SLAM system,’’ IEEE Trans. Robot., [42] A. Shaked and L. Wolf, ‘‘Improved stereo matching with con-
vol. 31, no. 5, pp. 1147–1163, Oct. 2015. stant highway networks and reflective confidence learning,’’ in Proc.
[17] D. G. Lowe, ‘‘Distinctive image features from scale-invariant keypoints,’’ IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Dec. 2004. pp. 4641–4650.
[18] E. Karami, S. Prasad, and M. Shehata, ‘‘Image matching using SIFT, [43] H. Park and K. M. Lee, ‘‘Look wider to match image patches with
SURF, BRIEF and ORB: Performance comparison for distorted images,’’ convolutional neural networks,’’ IEEE Signal Process. Lett., vol. 24,
2017, arXiv:1710.02726. no. 12, pp. 1788–1792, Dec. 2017.
[19] A. M. Andrew, Multiple View Geometry in Computer Vision, R. Hartley [44] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang, ‘‘A deep visual corre-
and A. Zisserman, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2000, spondence embedding model for stereo matching costs,’’ in Proc. IEEE
p. 607. Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 972–980.
[20] Y. Shavit and R. Ferens, ‘‘Introduction to camera pose estimation with [45] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and
deep learning,’’ 2019, arXiv:1907.05272. T. Brox, ‘‘A large dataset to train convolutional networks for disparity,
[21] X. Zhang, L. Wang, and Y. Su, ‘‘Visual place recognition: A survey optical flow, and scene flow estimation,’’ in Proc. IEEE Conf. Comput.
from deep learning perspective,’’ Pattern Recognit., vol. 113, May 2021, Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4040–4048.
Art. no. 107760. [46] D. Hoiem, A. A. Efros, and M. Hebert, ‘‘Automatic photo pop-up,’’ in
[22] S. Arshad and G.-W. Kim, ‘‘Role of deep learning in loop closure detec- Proc. ACM SIGGRAPH Papers, Jul. 2005, pp. 577–584.
tion for visual and LiDAR SLAM: A survey,’’ Sensors, vol. 21, no. 4,
[47] D. Eigen, C. Puhrsch, and R. Fergus, ‘‘Depth map prediction from a
p. 1243, Feb. 2021.
single image using a multi-scale deep network,’’ in Proc. Adv. Neural Inf.
[23] F. Martín, F. González, J. M. Guerrero, M. Fernández, and J. Ginés, Process. Syst., vol. 27, 2014, pp. 1–9.
‘‘Semantic 3D mapping from deep image segmentation,’’ Appl. Sci.,
[48] W. Chen, Z. Fu, D. Yang, and J. Deng, ‘‘Single-image depth perception
vol. 11, no. 4, p. 1953, Feb. 2021.
in the wild,’’ 2016, arXiv:1604.03901.
[24] S. Thrun, ‘‘Probabilistic algorithms in robotics,’’ AI Mag., vol. 21, no. 4,
[49] D. Eigen and R. Fergus, ‘‘Predicting depth, surface normals and semantic
p. 93, 2000.
labels with a common multi-scale convolutional architecture,’’ in Proc.
[25] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2650–2658.
I. Reid, and J. J. Leonard, ‘‘Past, present, and future of simultaneous local-
[50] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, ‘‘Deep
ization and mapping: Toward the robust-perception age,’’ IEEE Trans.
ordinal regression network for monocular depth estimation,’’ in
Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[26] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham, ‘‘A survey on
pp. 2002–2011.
deep learning for localization and mapping: Towards the age of spatial
machine intelligence,’’ 2020, arXiv:2006.12567. [51] J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and
J. Civera, ‘‘CAM-convs: Camera-aware multi-scale convolutions for
[27] T. Taketomi, H. Uchiyama, and S. Ikeda, ‘‘Visual SLAM algorithms: A
single-view depth,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
survey from 2010 to 2016,’’ IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1,
Recognit. (CVPR), Jun. 2019, pp. 11826–11835.
pp. 1–11, Dec. 2017.
[28] M. Servières, V. Renaudin, A. Dupuis, and N. Antigny, ‘‘Visual and [52] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, ‘‘Unsupervised learning
visual-inertial SLAM: State of the art, classification, and experimental of depth and ego-motion from video,’’ in Proc. IEEE Conf. Comput. Vis.
benchmarking,’’ J. Sensors, vol. 2021, pp. 1–26, Feb. 2021. Pattern Recognit. (CVPR), Jul. 2017, pp. 1851–1858.
[29] C. Duan, S. Junginger, J. Huang, K. Jin, and K. Thurow, ‘‘Deep learning [53] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, ‘‘Learning depth
for visual SLAM in transportation robotics: A review,’’ Transp. Saf. from monocular videos using direct methods,’’ in Proc. IEEE/CVF Conf.
Environ., vol. 1, no. 3, pp. 177–184, Dec. 2019. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2022–2030.
[30] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, ‘‘Monocular depth [54] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, ‘‘Unsupervised CNN for
estimation based on deep learning: An overview,’’ Sci. China Technol. single view depth estimation: Geometry to the rescue,’’ in Proc. Eur. Conf.
Sci., vol. 63, pp. 1612–1627, Jun. 2020. Comput. Vis. USA: Springer, 2016, pp. 740–756.
[31] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, ‘‘A survey on [55] C. Godard, O. M. Aodha, and G. J. Brostow, ‘‘Unsupervised monocular
deep learning techniques for stereo-based depth estimation,’’ IEEE Trans. depth estimation with left-right consistency,’’ in Proc. IEEE Conf. Com-
Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 1738–1764, Apr. 2022. put. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 270–279.
[32] S. Savian, M. Elahi, and T. Tillo, ‘‘Optical flow estimation with deep [56] D. Fortun, P. Bouthemy, and C. Kervrann, ‘‘Optical flow modeling and
learning, a survey on recent advances,’’ in Deep Biometrics. USA: computation: A survey,’’ Comput. Vis. Image Understand., vol. 134,
Springer, 2020, pp. 257–287. pp. 1–21, May 2015.
[33] J. Hur and S. Roth, ‘‘Optical flow estimation in the deep learning age,’’ [57] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić,
in Modelling Human Motion. USA: Springer, 2020, pp. 119–140. X. Wang, and P. Westling, ‘‘High-resolution stereo datasets with
[34] K. Wang, S. Ma, J. Chen, F. Ren, and J. Lu, ‘‘Approaches, challenges, and subpixel-accurate ground truth,’’ in Pattern Recognition. Münster,
applications for deep visual odometry: Toward complicated and emerging Germany: Springer, Sep. 2014, pp. 31–42.
areas,’’ IEEE Trans. Cogn. Develop. Syst., vol. 14, no. 1, pp. 35–49, [58] M. Menze and A. Geiger, ‘‘Object scene flow for autonomous vehicles,’’
Mar. 2022. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
[35] G. Klein and D. Murray, ‘‘Parallel tracking and mapping for small AR pp. 3061–3070.
workspaces,’’ in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented [59] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, ‘‘Indoor segmentation
Reality, Nov. 2007, pp. 225–234. and support inference from RGBD images,’’ in Proc. ECCV, vol. 7576,
[36] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, ‘‘DTAM: Dense 2012, pp. 746–760.
tracking and mapping in real-time,’’ in Proc. Int. Conf. Comput. Vis., [60] A. Saxena, M. Sun, and A. Y. Ng, ‘‘Make3D: Learning 3D scene structure
Nov. 2011, pp. 2320–2327. from a single still image,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
[37] J. Engel, T. Schöps, and D. Cremers, ‘‘LSD-SLAM: Large-scale direct vol. 31, no. 5, pp. 824–840, May 2008.
monocular SLAM,’’ in Proc. Eur. Conf. Comput. Vis. USA: Springer, [61] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov,
2014, pp. 834–849. P. Van Der Smagt, D. Cremers, and T. Brox, ‘‘FlowNet: Learning optical
[38] J. Engel, V. Koltun, and D. Cremers, ‘‘Direct sparse odometry,’’ IEEE flow with convolutional networks,’’ in Proc. IEEE Int. Conf. Comput. Vis.
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2017. (ICCV), Dec. 2015, pp. 2758–2766.
[62] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, [87] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, ‘‘Topometric
‘‘FlowNet 2.0: Evolution of optical flow estimation with deep networks,’’ localization with deep learning,’’ in Robotics Research. USA: Springer,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, 2020, pp. 505–520.
pp. 2462–2470. [88] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, ‘‘PointNet:
[63] A. Ranjan and M. J. Black, ‘‘Optical flow estimation using a spatial Deep learning on point sets for 3D classification and segmentation,’’
pyramid network,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
(CVPR), Jul. 2017, pp. 4161–4170. pp. 652–660.
[64] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, ‘‘PWC-Net: CNNs for optical [89] Y. Zhou, C. Wu, Z. Li, C. Cao, Y. Ye, J. Saragih, H. Li, and Y. Sheikh,
flow using pyramid, warping, and cost volume,’’ in Proc. IEEE/CVF Conf. ‘‘Fully convolutional mesh autoencoder using efficient spatially varying
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8934–8943. kernels,’’ 2020, arXiv:2006.04325.
[65] Z. Teed and J. Deng, ‘‘RAFT: Recurrent all-pairs field transforms for [90] M. Muglikar, Z. Zhang, and D. Scaramuzza, ‘‘Voxel map for visual
optical flow,’’ in Proc. Eur. Conf. Comput. Vis. USA: Springer, 2020, SLAM,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020,
pp. 402–419. pp. 4181–4187.
[66] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha, ‘‘Unsupervised deep [91] J. Malik, I. Abdelaziz, A. Elhayek, S. Shimada, S. A. Ali, V. Golyanik,
learning for optical flow estimation,’’ in Proc. AAAI Conf. Artif. Intell., C. Theobalt, and D. Stricker, ‘‘HandVoxNet: Deep voxel-based network
vol. 31, 2017, pp. 1–7. for 3D hand shape and pose estimation from a single depth map,’’ in Proc.
[67] Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann, ‘‘Guided optical flow IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
learning,’’ 2017, arXiv:1702.02295. pp. 7113–7122.
[68] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, ‘‘Occlusion [92] X. Han, S. Li, X. Wang, and W. Zhou, ‘‘Semantic mapping for mobile
aware unsupervised learning of optical flow,’’ in Proc. IEEE/CVF Conf. robots in indoor scenes: A survey,’’ Information, vol. 12, no. 2, p. 92,
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4884–4893. Feb. 2021.
[69] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, ‘‘Unsupervised [93] D. Lu and Q. Weng, ‘‘A survey of image classification methods and
learning of multi-frame optical flow with occlusions,’’ in Proc. Eur. Conf. techniques for improving classification performance,’’ Int. J. Remote
Comput. Vis. (ECCV), 2018, pp. 690–706. Sens., vol. 28, no. 5, pp. 823–870, 2007.
[70] Y. Zhong, P. Ji, J. Wang, Y. Dai, and H. Li, ‘‘Unsupervised deep epipolar [94] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and
flow for stationary or dynamic scenes,’’ in Proc. IEEE/CVF Conf. Com- M. Pietikäinen, ‘‘Deep learning for generic object detection: A survey,’’
put. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12095–12104. Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, Oct. 2020.
[71] B. Liao, J. Hu, and R. O. Gilmore, ‘‘Optical flow estimation com- [95] L. Guo, ‘‘Indoor scene reconstruction using the Manhattan assumption,’’
bining with illumination adjustment and edge refinement in live- Ph.D. thesis, Tianjin Univ., China, 2015.
stock UAV videos,’’ Comput. Electron. Agricult., vol. 180, Jan. 2021, [96] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, ‘‘Seman-
Art. no. 105910. ticFusion: Dense 3D semantic mapping with convolutional neural net-
[72] Y. Zheng, M. Zhang, and F. Lu, ‘‘Optical flow in the dark,’’ in Proc. works,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4628–4635.
pp. 6749–6757. [97] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison,
[73] W. Yan, A. Sharma, and R. T. Tan, ‘‘Optical flow in dense foggy scenes ‘‘ElasticFusion: Dense SLAM without a pose graph,’’ in Proc. Robot.,
using semi-supervised learning,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Sci. Syst. XI, Jul. 2015, pp. 1–9.
Pattern Recognit. (CVPR), Jun. 2020, pp. 13259–13268. [98] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler,
[74] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, ‘‘A naturalistic open R. Siegwart, and J. Nieto, ‘‘Incremental object database: Building 3D
source movie for optical flow evaluation,’’ in Computer Vision—ECCV models from multiple partial observations,’’ in Proc. IEEE/RSJ Int. Conf.
2012. Florence, Italy: Springer, Oct. 2012, pp. 611–625. Intell. Robots Syst. (IROS), Oct. 2018, pp. 6835–6842.
[75] W. Hartmann, M. Havlena, and K. Schindler, ‘‘Predicting matchability,’’ [99] K. Tateno, F. Tombari, and N. Navab, ‘‘Real-time and scalable incremen-
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 9–16. tal segmentation on dense SLAM,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
[76] Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit, ‘‘TILDE: A temporally Robots Syst. (IROS), Sep. 2015, pp. 4465–4472.
invariant learned DEtector,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [100] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, ‘‘Meaningful
Recognit. (CVPR), Jun. 2015, pp. 5279–5288. maps with object-oriented semantic mapping,’’ in Proc. IEEE/RSJ Int.
[77] M. Brown, G. Hua, and S. Winder, ‘‘Discriminative learning of local Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079–5085.
image descriptors,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, [101] S. Yang, Y. Huang, and S. Scherer, ‘‘Semantic 3D occupancy mapping
no. 1, pp. 43–57, Jan. 2011. through efficient high order CRFs,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
[78] K. Simonyan, A. Vedaldi, and A. Zisserman, ‘‘Learning local feature Robots Syst. (IROS), Sep. 2017, pp. 590–597.
descriptors using convex optimisation,’’ IEEE Trans. Pattern Anal. Mach. [102] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart,
Intell., vol. 36, no. 8, pp. 1573–1585, Aug. 2014. and J. Nieto, ‘‘Volumetric instance-aware semantic mapping and 3D
[79] C. Liu, J. Yuen, and A. Torralba, ‘‘Sift flow: Dense correspondence across object discovery,’’ IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 3037–3044,
scenes and its applications,’’ IEEE Trans. Pattern Anal. Mach. Intell., Jul. 2019.
vol. 33, no. 5, pp. 978–994, Aug. 2011. [103] M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, ‘‘Fast odom-
[80] K. Konda and R. Memisevic, ‘‘Learning visual odometry with a convolu- etry and scene flow from RGB-D cameras based on geometric clus-
tional network,’’ in Proc. VISAPP, 2015, pp. 486–490. tering,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
[81] M. Weber, C. Rist, and J. M. Zöllner, ‘‘Learning temporal features with pp. 3992–3999.
CNNs for monocular visual ego motion estimation,’’ in Proc. IEEE 20th [104] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers,
Int. Conf. Intell. Transp. Syst. (ITSC), Oct. 2017, pp. 1–6. ‘‘StaticFusion: Background reconstruction for dense RGB-D SLAM in
[82] S. Wang, R. Clark, H. Wen, and N. Trigoni, ‘‘DeepVO: Towards end- dynamic environments,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
to-end visual odometry with deep recurrent convolutional neural net- May 2018, pp. 3849–3856.
works,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, [105] Y. Ai, T. Rui, M. Lu, L. Fu, S. Liu, and S. Wang, ‘‘DDL-SLAM: A robust
pp. 2043–2050. RGB-D SLAM in dynamic environments combined with deep learning,’’
[83] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driv- IEEE Access, vol. 8, pp. 162335–162342, 2020.
ing? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. [106] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers,
Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361. ‘‘A benchmark for the evaluation of RGB-D SLAM systems,’’ in Proc.
[84] R. Li, S. Wang, Z. Long, and D. Gu, ‘‘UnDeepVO: Monocular visual IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
odometry through unsupervised deep learning,’’ in Proc. IEEE Int. Conf. [107] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, ‘‘Visual
Robot. Autom. (ICRA), May 2018, pp. 7286–7291. categorization with bags of keypoints,’’ in Proc. Workshop Stat.
[85] J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and Learn. Comput. Vis. (ECCV), vol. 1, Prague, Czech Republic, 2004,
I. Reid, ‘‘Unsupervised scale-consistent depth and ego-motion learning pp. 1–2.
from monocular video,’’ 2019, arXiv:1908.10553. [108] F. Perronnin and C. Dance, ‘‘Fisher kernels on visual vocabularies for
[86] S. Thrun, ‘‘Learning metric-topological maps for indoor mobile robot image categorization,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
navigation,’’ Artif. Intell., vol. 99, no. 1, pp. 21–71, Feb. 1998. nit., Jun. 2007, pp. 1–8.
[109] H. Jégou, M. Douze, C. Schmid, and P. Pérez, ‘‘Aggregating local descrip- [132] R. Jonschkowski, D. Rastogi, and O. Brock, ‘‘Differentiable par-
tors into a compact image representation,’’ in Proc. IEEE Comput. Soc. ticle filters: End-to-end learning with algorithmic priors,’’ 2018,
Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3304–3311. arXiv:1805.11122.
[110] R. Arandjelovic and A. Zisserman, ‘‘All about VLAD,’’ in Proc. IEEE [133] P. Karkus, A. Angelova, V. Vanhoucke, and R. Jonschkowski, ‘‘Dif-
Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1578–1585. ferentiable mapping networks: Learning structured map representations
[111] K. L. Ho and P. Newman, ‘‘Detecting loop closure with scene sequences,’’ for sparse visual localization,’’ in Proc. IEEE Int. Conf. Robot. Autom.
Int. J. Comput. Vis., vol. 74, no. 3, pp. 261–286, Sep. 2007. (ICRA), May 2020, pp. 4753–4759.
[112] A. Angeli, S. Doncieux, J.-A. Meyer, and D. Filliat, ‘‘Real-time [134] R. Li, S. Wang, and D. Gu, ‘‘DeepSLAM: A robust monocular SLAM
visual loop-closure detection,’’ in Proc. IEEE Int. Conf. Robot. Autom., system with unsupervised deep learning,’’ IEEE Trans. Ind. Electron.,
May 2008, pp. 1842–1847. vol. 68, no. 4, pp. 3577–3587, Apr. 2021.
[113] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification [135] P. Karkus, S. Cai, and D. Hsu, ‘‘Differentiable SLAM-Net: Learning
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. particle SLAM for visual navigation,’’ in Proc. IEEE/CVF Conf. Comput.
Process. Syst., vol. 25, 2012, pp. 1097–1105. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 2815–2825.
[114] P. Fischer, A. Dosovitskiy, and T. Brox, ‘‘Descriptor matching [136] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, ‘‘FastSLAM: A
with convolutional neural networks: A comparison to SIFT,’’ 2014, factored solution to the simultaneous localization and mapping problem,’’
arXiv:1405.5769. in Proc. AAAI/IAAI, 2002, pp. 593–598.
[115] Z. Chen, O. Lam, A. Jacobson, and M. Milford, ‘‘Convolutional neural [137] J. Wu and H. Zhang, ‘‘Camera sensor model for visual SLAM,’’ in Proc.
network-based place recognition,’’ 2014, arXiv:1411.1509. 4th Can. Conf. Comput. Robot Vis. (CRV), May 2007, pp. 149–156.
[116] Y. Hou, H. Zhang, and S. Zhou, ‘‘Convolutional neural network-based [138] M. Klodt and A. Vedaldi, ‘‘Supervising the new with the old: Learning
image representation for visual loop closure detection,’’ in Proc. IEEE SFM from SFM,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018,
Int. Conf. Inf. Autom., Aug. 2015, pp. 2238–2245. pp. 698–713.
[117] Y. Xia, J. Li, L. Qi, H. Yu, and J. Dong, ‘‘An evaluation of deep learning in [139] V. Peretroukhin and J. Kelly, ‘‘DPC-Net: Deep pose correction for visual
loop closure detection for visual SLAM,’’ in Proc. IEEE Int. Conf. Inter- localization,’’ IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2424–2431,
net Things (iThings) IEEE Green Comput. Commun. (GreenCom) IEEE Jul. 2018.
Cyber, Phys. Social Comput. (CPSCom) IEEE Smart Data (SmartData), [140] T. Drummond and R. Cipolla, ‘‘Visual tracking and control using lie alge-
Jun. 2017, pp. 85–91. bras,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.,
[118] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, ‘‘PCANet: A simple vol. 2, Jun. 1999, pp. 652–657.
deep learning baseline for image classification?’’ IEEE Trans. Image [141] R. Azzam, Y. Alkendi, T. Taha, S. Huang, and Y. Zweiri, ‘‘A stacked
Process, vol. 24, no. 12, pp. 5017–5032, Dec. 2015. LSTM-based approach for reducing semantic pose estimation error,’’
[119] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, IEEE Trans. Instrum. Meas., vol. 70, pp. 1–14, 2021.
S. Guadarrama, and T. Darrell, ‘‘Caffe: Convolutional architecture for [142] P. Ganti and S. L. Waslander, ‘‘Network uncertainty informed semantic
fast feature embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia, feature selection for visual SLAM,’’ in Proc. 16th Conf. Comput. Robot
Nov. 2014, pp. 675–678. Vis. (CRV), May 2019, pp. 121–128.
[120] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, [143] E. Goan and C. Fookes, ‘‘Bayesian neural networks: An introduction
V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ and survey,’’ in Case Studies in Applied Bayesian Data Science. USA:
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, Springer, 2020, pp. 45–87.
pp. 1–9. [144] Y. Gal and Z. Ghahramani, ‘‘Bayesian convolutional neural networks with
[121] A. Oliva and A. Torralba, ‘‘Modeling the shape of the scene: A holistic Bernoulli approximate variational inference,’’ 2015, arXiv:1506.02158.
representation of the spatial envelope,’’ Int. J. Comput. Vis., vol. 42, no. 3, [145] A. Kendall and R. Cipolla, ‘‘Modelling uncertainty in deep learning for
pp. 145–175, 2001. camera relocalization,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
[122] X. Gao and T. Zhang, ‘‘Loop closure detection for visual SLAM systems May 2016, pp. 4762–4769.
using deep neural networks,’’ in Proc. 34th Chin. Control Conf. (CCC), [146] V. Peretroukhin, L. Clement, and J. Kelly, ‘‘Reducing drift in visual
Jul. 2015, pp. 5851–5856. odometry by inferring sun direction using a Bayesian convolutional neural
[123] N. Merrill and G. Huang, ‘‘Lightweight unsupervised deep loop closure,’’ network,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
2018, arXiv:1805.07703. pp. 2035–2042.
[124] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human [147] N. Yang, L. von Stumberg, R. Wang, and D. Cremers, ‘‘D3VO: Deep
detection,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern depth, deep pose and deep uncertainty for monocular visual odome-
Recognit. (CVPR), vol. 1, Jun. 2005, pp. 886–893. try,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
[125] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and Jun. 2020, pp. 1281–1292.
K. Fragkiadaki, ‘‘SfM-Net: Learning of structure and motion from [148] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari,
video,’’ 2017, arXiv:1704.07804. M. W. Achtelik, and R. Siegwart, ‘‘The EuRoC micro aerial vehicle
[126] Z. Yin and J. Shi, ‘‘GeoNet: Unsupervised learning of dense depth, optical datasets,’’ Int. J. Robot. Res., vol. 35, no. 10, pp. 1157–1163, 2016.
flow and camera pose,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern [149] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, ‘‘Cogni-
Recognit., Jun. 2018, pp. 1983–1992. tive mapping and planning for visual navigation,’’ in Proc. IEEE Conf.
[127] S. Lee, S. Im, S. Lin, and I. S. Kweon, ‘‘Learning residual flow as dynamic Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2616–2625.
motion from stereo videos,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots [150] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi,
Syst. (IROS), Nov. 2019, pp. 1180–1186. ‘‘Attention based vehicle trajectory prediction,’’ IEEE Trans. Intell. Vehi-
[128] Y. Zou, Z. Luo, and J.-B. Huang, ‘‘DF-Net: Unsupervised joint learning of cles, vol. 6, no. 1, pp. 175–185, Mar. 2021.
depth and flow using cross-task consistency,’’ in Proc. Eur. Conf. Comput. [151] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,
Vis. (ECCV), 2018, pp. 36–53. U. Franke, S. Roth, and B. Schiele, ‘‘The cityscapes dataset for semantic
[129] Q. Dai, V. Patii, S. Hecker, D. Dai, L. Van Gool, and K. Schindler, ‘‘Self- urban scene understanding,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
supervised object motion and depth estimation from video,’’ in Proc. Recognit. (CVPR), Jun. 2016, pp. 3213–3223.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), [152] L. Carlone, J. Du, M. K. Ng, B. Bona, and M. Indri, ‘‘Active SLAM
Jun. 2020, pp. 1004–1005. and exploration with particle filters using Kullback–Leibler divergence,’’
[130] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova, ‘‘Unsupervised J. Intell. Robot. Syst., vol. 75, no. 2, pp. 291–311, Aug. 2014.
monocular depth and ego-motion learning with structure and semantics,’’ [153] Q. V. Le, A. Saxena, and A. Y. Ng, ‘‘Active perception: Interac-
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops tive manipulation for improving object detection,’’ Standford Univ. J.,
(CVPRW), Jun. 2019, pp. 1–8. 2008.
[131] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and [154] R. Bajcsy, Y. Aloimonos, and J. Tsotsos, ‘‘Revisiting active perception,’’
M. J. Black, ‘‘Competitive collaboration: Joint unsupervised learning of Auto. Robots, vol. 42, no. 2, pp. 177–196, May 2018.
depth, camera motion, optical flow and motion segmentation,’’ in Proc. [155] C. Leung, S. Huang, and G. Dissanayake, ‘‘Active SLAM using model
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, predictive control and attractor based exploration,’’ in Proc. IEEE/RSJ
pp. 12240–12249. Int. Conf. Intell. Robots Syst., Oct. 2006, pp. 5026–5031.
[156] B. Yamauchi, ‘‘A frontier-based approach for autonomous exploration,’’ DANIEL BONILLA LICEA received the M.Sc.
in Proc. IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA) Towards degree from the Centro de Investigación y Estudios
New Comput. Princ. Robot. Autom., Jul. 1997, pp. 146–151. Avanzados (CINVESTAV), Mexico City, in 2011,
[157] S. Rodriguez, X. Tang, J.-M. Lien, and N. M. Amato, ‘‘An obstacle-based and the Ph.D. degree from the University of Leeds,
rapidly-exploring random tree,’’ in Proc. IEEE Int. Conf. Robot. Autom. U.K, in 2016. From May 2011 to June 2012,
(ICRA), May 2006, pp. 895–900. he was an Intern with the Signal Processing Team,
[158] L. Tai and M. Liu, ‘‘Mobile robots exploration through CNN-based Intel Labs, Guadalajara, Mexico. In 2016, he was
reinforcement learning,’’ Robot. Biomimetics, vol. 3, no. 1, pp. 1–8,
invited for a Short Research Visit with the Centre
Dec. 2016.
de Recherche en Automatique de Nancy (CRAN),
[159] T. Chen, S. Gupta, and A. Gupta, ‘‘Learning exploration policies for
navigation,’’ 2019, arXiv:1903.01959. France. In 2017, he has collaborated in a research
[160] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhut- project with the Centro de Investigación en Computacion (CIC), Mexico.
dinov, ‘‘Learning to explore using active neural SLAM,’’ 2020, From 2017 to 2020, he held a postdoctoral position with the International
arXiv:2004.05155. University of Rabat, Morocco. Currently, he holds a postdoctoral position
[161] O. Zhelo, J. Zhang, L. Tai, M. Liu, and W. Burgard, ‘‘Curiosity-driven with the Czech Technical University in Prague, Czech Republic. His research
exploration for mapless navigation with deep reinforcement learning,’’ interests include signal processing and communications-aware robotics.
2018, arXiv:1804.00456.
[162] F. Chen, S. Bai, T. Shan, and B. Englot, ‘‘Self-learning exploration and
mapping for mobile robots via deep reinforcement learning,’’ in Proc.
AIAA Scitech Forum, Jan. 2019, p. 396.
[163] K. C. Soska and S. P. Johnson, ‘‘Development of three-dimensional object
completion in infancy,’’ Child Develop., vol. 79, no. 5, pp. 1230–1236,
Sep. 2008.
[164] D. Jayaraman and K. Grauman, ‘‘Look-ahead before you leap: End-to-
end active recognition by forecasting the effect of motion,’’ in Proc. Eur.
Conf. Comput. Vis. USA: Springer, 2016, pp. 489–505.
BASSMA GUERMAH received the Engineering
[165] D. Jayaraman and K. Grauman, ‘‘Learning image representations tied to
ego-motion,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015,
degree in software engineering (major of promo-
pp. 1413–1421. tion) from the National Institute of Statistics and
[166] D. Jayaraman and K. Grauman, ‘‘Learning to look around: Intelligently Applied Economics (INSEA), in 2014, and the
exploring unseen environments for unknown tasks,’’ in Proc. IEEE/CVF Ph.D. degree in computer science and telecom-
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1238–1247. munications from the National Institute of Posts
[167] J. Zhang, L. Tai, M. Liu, J. Boedecker, and W. Burgard, ‘‘Neural SLAM: and Telecommunications (INPT), in 2018. She
Learning to explore with external memory,’’ 2017, arXiv:1706.09520. is currently a Professor with the Computer Sci-
[168] M. Zhang, K. T. Ma, S.-C. Yen, J. H. Lim, Q. Zhao, and J. Feng, ence Engineering School, International University
‘‘Egocentric spatial memory,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots of Rabat (UIR) and a member of TICLab. Her
Syst. (IROS), Oct. 2018, pp. 137–144. research interests include machine learning/deep learning (artificial intelli-
[169] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, gence), signal processing, robotics, context-aware service-oriented comput-
‘‘CodeSLAM—Learning a compact, optimisable representation for dense ing, ontologies, and semantic web.
visual SLAM,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 2560–2568.
[170] S. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison, ‘‘SceneCode:
Monocular dense semantic reconstruction using learned encoded scene
representations,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog-
nit. (CVPR), Jun. 2019, pp. 11776–11785.
[171] Q. Zhang, Y. N. Wu, and S.-C. Zhu, ‘‘Interpretable convolutional neural
networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 8827–8836.
[172] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, ‘‘Network
dissection: Quantifying interpretability of deep visual representations,’’ MOUNIR GHOGHO (Fellow, IEEE) received the
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, M.Sc. and Ph.D. degrees from the National Poly-
pp. 6541–6549. technic Institute of Toulouse, France, in 1993 and
[173] J. Manttari, S. Broomé, J. Folkesson, and H. Kjellstrom, ‘‘Interpreting 1997, respectively. He was an EPSRC Research
video features: A comparison of 3D convolutional networks and con- Fellow with the University of Strathclyde (Scot-
volutional LSTM networks,’’ in Proc. Asian Conf. Comput. Vis., 2020, land), from September 1997 to November 2001.
pp. 1–16. In December 2001, he joined the School of Elec-
[174] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, tronic and Electrical Engineering, University of
and R. Fergus, ‘‘Intriguing properties of neural networks,’’ 2013, Leeds, U.K., where he was promoted to a Full
arXiv:1312.6199. Professor, in 2008. He is still affiliated with the
University of Leeds. In 2010, he joined the International University of
Rabat, where he is currently the Dean of the College of Doctoral Studies
and the Director of the ICT Research Laboratory (TICLab). He has coor-
dinated around 20 research projects and supervised over 30 Ph.D. students
SAAD MOKSSIT received the Engineering in the U.K. and Morocco. He is the Co-Founder and the Co-Director of the
degree in embedded and mobile systems engi- CNRS-associated with the International Research Laboratory DataNet in the
neering (major of promotion) from the l’Ecole field of big data and artificial intelligence. His research interests include
Nationale Supérieure d’Informatique et d’Analyse machine learning, signal processing, and wireless communication. He is
des Systèmes (ENSIAS), in 2018. He is currently a fellow of the Asia–Pacific AI Association (AAIA). He was a recipient
pursuing the Ph.D. degree with the International of the 2013 IBM Faculty Award and the 2000 U.K. Royal Academy of
University of Rabat (UIR). He is also a member Engineering Research Fellowship. He served as an Associate Editor for many
of TICLab. His research interests include machine journals, including the IEEE Signal Processing Magazine and the IEEE
learning/deep learning, robotics, autonomous driv- TRANSACTIONS ON SIGNAL PROCESSING.
ing, and computer vision.
Monocular depth estimation is inherently more challenging than stereo depth estimation due to the lack of a second viewpoint, which makes the task ambiguous since an infinite number of 3D scenes can generate the same 2D image. Initial methods used oversimplified models classifying pixels as sky, ground, or objects, and applied hand-crafted cues assuming a vertical stacking of objects, resulting in detail-lacking and poorly generalizable maps. Deep learning techniques address these challenges by utilizing CNNs for their superior image processing capabilities, introducing scale-invariant loss functions to overcome scale ambiguities, and employing end-to-end networks that regress depth from a single view, reducing the reliance on pixel classification. However, CNNs can produce low-resolution outputs as network depth increases if pooling operations remove too many features.
Stereo cameras offer improved accuracy in depth estimation by leveraging the parallax between two perspectives, making them more precise than monocular cameras, which must hypothesize depth from a single viewpoint. This benefit is particularly pronounced in scenarios requiring high depth accuracy, as stereo systems effectively reduce ambiguity in depth interpretation. However, stereo systems carry the disadvantages of increased cost, weight, and resource demands, making them less suitable for applications with limited resources. In contrast, monocular systems are more versatile and cost-effective, but face challenges with scale ambiguity and inherently produce less precise depth maps due to their reliance on a single viewpoint.
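As a reminder of why the parallax helps, the sketch below converts a stereo disparity into metric depth using the standard rectified-stereo pinhole relation Z = f·B/d; the focal length and baseline values are made-up examples.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Standard rectified-stereo relation: depth Z = f * B / d.
    Larger disparities (closer objects) map to smaller depths."""
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(d, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Example: a 720-pixel focal length and a 12 cm baseline.
print(disparity_to_depth([40.0, 8.0], focal_px=720.0, baseline_m=0.12))
# -> [ 2.16 10.8 ] metres
```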
Ordinal regression has been applied in monocular depth estimation to recast the problem into a more intuitive task for neural networks, one that aligns more closely with human perception. Humans find depth estimation easier for nearby objects than for distant ones, suggesting that depth perception can be understood as an ordinal task. This approach allows the model to focus on ranking depths relative to each other rather than predicting absolute measures, thereby improving its ability to understand spatial relationships within the scene. It mitigates the errors typically associated with predictions at larger depths by providing structured learning objectives that prioritize correct orderings in the depth map over precise scalar distances.
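The sketch below shows one common way to recast depth regression as an ordinal task: depth is discretized into bins that widen with distance (here log-spaced, loosely following the spacing-increasing discretization idea of [50], though the exact scheme there differs), and the network is asked to predict how many bin thresholds each pixel's depth exceeds.

```python
import numpy as np

def ordinal_depth_targets(depth, d_min=1.0, d_max=80.0, num_bins=10):
    """Discretize metric depth into log-spaced ordinal bins.

    Returns, for each depth value, a binary vector whose k-th entry answers
    'is the depth beyond threshold k?'. A network trained on such targets
    learns depth orderings rather than absolute scalar distances."""
    thresholds = np.geomspace(d_min, d_max, num_bins + 1)[1:-1]  # interior edges
    depth = np.asarray(depth, dtype=float)[..., None]
    return (depth > thresholds).astype(np.float32)

# Example: a near and a far pixel exceed different numbers of thresholds.
print(ordinal_depth_targets([2.0, 40.0]).sum(axis=-1))  # -> [1. 8.]
```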
Creating suitable datasets is challenging because precise optical flow ground truth labeling is complex and labor-intensive. Additionally, real-world scenes involve variable lighting and complex motions that are difficult to replicate consistently across datasets. To address this, researchers developed synthetic datasets like the 'Flying Chairs', which use controlled backgrounds and object motions to simulate realistic optical flow scenarios, allowing for the training of deep networks in a supervised manner. While synthetic, the dataset provided a sufficient baseline to illustrate the capabilities of deep learning for optical flow estimation by creating a rich set of example motions and environments.
To address the scale ambiguity problem in self-supervised depth estimation models, researchers have proposed using multi-task architectures to jointly predict both depth maps and relative poses. This approach helps accurately compute relative poses and provides a better grounding for depth estimates by incorporating knowledge of scene structure. Additionally, a differentiable implementation of direct visual odometry has been introduced to more accurately estimate camera poses, thus improving depth estimation. These strategies have proven effective in enhancing model performance by providing a means to incorporate geometric constraints and utilize known poses, reducing reliance on ground-truth depth labels and improving scalability across different environments.
Self-supervised learning automates the extraction of auxiliary supervisory signals from the raw data itself, reducing the need for manual annotation, which is time-consuming and error-prone. In visual SLAM, it exploits inter-frame geometric constraints in video: pixels of a subsequent frame are projected back onto the preceding frame, and the resulting photometric discrepancies are used as the supervisory signal. This formulation allows depth maps and relative poses between frames to be predicted simultaneously, with camera pose estimation driven by novel view synthesis. Explainability masks let the model ignore regions that are prone to error during training, improving robustness to pixels that violate the underlying assumptions (e.g., moving objects and occlusions). However, early methods struggled to account directly for the geometric coupling between depth and motion.
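In the view-synthesis formulation commonly used by these methods (the notation here is generic rather than tied to a single paper), a pixel $p_t$ in the target frame is projected into the source frame using the predicted depth $\hat{D}_t(p_t)$, the predicted relative pose $\hat{T}_{t \to s}$, and the camera intrinsics $K$:
$$ p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} p_t , $$
and the photometric discrepancy $\sum_p \lvert I_t(p) - \hat{I}_s(p) \rvert$, where $\hat{I}_s$ is the source frame warped to the target view by differentiable bilinear sampling at $p_s$, serves as the self-supervised training loss.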
Stereo-based approaches yield better depth estimates because they exploit the geometric cues provided by the parallax between two views, which resolves depth ambiguity more effectively than monocular approaches can. The disparity maps computed from stereo pairs provide precise depth cues that monocular systems inherently lack. Stereo knowledge can nevertheless be transferred to monocular systems: models are first trained with stereo imagery to learn disparity and depth perception, and these insights are then applied to monocular data by reproducing similar depth cues through disparity estimation and view synthesis. This strategy exploits the rich geometric information in stereo data to inform monocular models, improving their performance without requiring two cameras at test time.
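A minimal sketch of this stereo-supervised training signal, under the assumption of rectified image pairs: the network sees only the left image and predicts a disparity map, the right image is warped back to the left view with that disparity, and the photometric reconstruction error drives learning. Tensor shapes and the plain L1 loss are simplifications for illustration.

# Sketch of stereo-supervised monocular training: predicted left-view disparity is used
# to warp the right image back to the left view; the reconstruction error is the loss.
import torch
import torch.nn.functional as F

def stereo_photometric_loss(left, right, disp_left):
    """left, right: (B, 3, H, W) rectified images; disp_left: (B, 1, H, W) disparity in pixels."""
    b, _, h, w = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as expected by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=left.device),
        torch.linspace(-1, 1, w, device=left.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift x-coordinates by the predicted disparity (converted to normalized units).
    grid[..., 0] = grid[..., 0] - 2.0 * disp_left.squeeze(1) / (w - 1)
    right_warped = F.grid_sample(right, grid, align_corners=True)
    return (left - right_warped).abs().mean()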
Classical hand-crafted methods for optical flow estimation rely on assumptions such as constant pixel intensity over time and locally translational motion, which restrict them to small displacements and cause them to fail under large motion or illumination changes. These assumptions make them ill-suited to real-world scenarios with large object movements or lighting variations. Deep learning-based methods address these limitations by extracting complex features from the input without imposing such hard constraints, allowing them to learn and adapt to diverse motion patterns under varying lighting conditions. Because they learn directly from data, they can handle large displacements and capture more intricate aspects of motion.
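Concretely, starting from the brightness-constancy assumption $I(x, y, t) = I(x + u, y + v, t + 1)$ and a first-order Taylor expansion (valid only for small displacements), classical methods arrive at the linearized constraint
$$ I_x u + I_y v + I_t = 0 , $$
which provides one equation in two unknowns per pixel; Lucas–Kanade-style methods resolve this ambiguity by additionally assuming the flow $(u, v)$ is constant over a small local window. Both assumptions break down precisely in the large-motion and illumination-change regimes described above.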
Explainability masks in self-supervised depth estimation are designed to identify and down-weight regions where the network expects its predictions to be unreliable, excluding those areas from contributing directly to the training loss. This allows the model to concentrate on regions with more reliable predictions and prevents hard-to-predict regions from degrading training. The masks improve robustness in scenarios where parts of the image carry unreliable information or where model confidence is low. By down-weighting these regions, models produce more accurate and consistent predictions across a variety of inputs.
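One way such a mask can gate the photometric objective is sketched below: per-pixel weights in [0, 1] scale the reconstruction error, while a regularizer discourages the trivial all-zero mask. The 0.2 regularization weight is an illustrative placeholder, not a value prescribed by any particular method.

# Sketch of a mask-gated photometric loss: mask values in [0, 1] down-weight regions
# the network deems unreliable; a cross-entropy term against an all-ones target
# keeps the mask from collapsing to zero everywhere.
import torch
import torch.nn.functional as F

def masked_photometric_loss(target, warped, mask, reg_weight=0.2):
    """target, warped: (B, 3, H, W) images; mask: (B, 1, H, W) with values in [0, 1]."""
    photometric = (target - warped).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    data_term = (mask * photometric).mean()
    mask_reg = F.binary_cross_entropy(mask.clamp(1e-6, 1 - 1e-6), torch.ones_like(mask))
    return data_term + reg_weight * mask_reg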
Scale-invariant loss functions improve monocular depth estimation by making the learning objective robust to the global scale of the prediction. By penalizing relative (e.g., log-space) differences between pixels rather than absolute depth values, these losses sidestep the scale ambiguity that arises when relying solely on ground-truth depth maps for supervision. The model can therefore focus on the depth relationships within the scene rather than on absolute measurements, which improves accuracy across scales and generalization to new environments.
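A compact sketch of such an objective, in the spirit of the scale-invariant log-depth error used in single-image depth prediction: errors are measured on log-depth differences, and the second term removes the penalty for a global scaling of the prediction. Setting lam = 1 gives full scale invariance, while lam = 0 recovers a plain L2 loss in log space; the default lam = 0.5 is a common compromise.

# Sketch of a scale-invariant log-depth loss (per batch, for illustration).
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """pred_depth, gt_depth: (B, 1, H, W) positive depth maps."""
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / (n ** 2)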