Deep Learning in Visual SLAM Survey
ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) has attracted considerable attention
in recent years. This task involves using visual sensors to localize a robot while simultaneously con-
structing an internal representation of its environment. Traditional VSLAM methods involve the laborious
hand-crafted design of visual features and complex geometric models. As a result, they are generally
limited to simple environments with easily identifiable textures. Recent years, however, have witnessed the
development of deep learning techniques for VSLAM. This is primarily due to their capability of modeling
complex features of the environment in a completely data-driven manner. In this paper, we present a survey
of relevant deep learning-based VSLAM methods and suggest a new taxonomy for the subject. We also
discuss some of the current challenges and possible directions for this field of study.
INDEX TERMS Visual SLAM, deep learning, joint learning, active learning, survey.
in illumination, etc.). Further, when VSLAM is used to accomplish a higher level task such as search and rescue, autonomous driving, or finding a fire extinguisher to put out a fire in a burning structure, the agent in these applications requires semantic knowledge about the environment. Thus, the geometric information embedded in SIFT and ORB features is not always sufficient.

An alternative approach is the use of deep learning techniques, which have been shown to automatically learn complex features from visual inputs. This characteristic has been leveraged to develop highly accurate image classification and object recognition models, which have been successfully applied to VSLAM for camera pose estimation [20], place recognition [21], loop closure detection [22], and semantic mapping [23], among other tasks.

To the best of our knowledge, existing VSLAM surveys are limited to methods that learn specific pipeline components in isolation from one another. Recent techniques based on jointly optimising the various VSLAM components have been shown to achieve better performance. In this survey, we provide a comprehensive review of VSLAM methods including these recent techniques.

Our main contributions are listed below:
• We propose a novel taxonomy for deep learning techniques applied to VSLAM;
• We present a comprehensive review of the most important deep learning methods applied to VSLAM;
• We explore deep learning-based VSLAM from a holistic standpoint rather than focusing on individual VSLAM components;
• We discuss the strengths and weaknesses of the different deep learning-based approaches to VSLAM;
• We discuss some current challenges and future directions for deep learning-based VSLAM.

The remainder of the paper is organised as follows. In section II, we review existing surveys on VSLAM. We then briefly present the general framework of a VSLAM system in section III. Section IV describes the proposed taxonomy. Sections V, VI, VII and VIII discuss existing works on deep learning based VSLAM. Open problems and suggestions for future research directions are presented in section IX. Finally, conclusions are drawn in section X.

II. RELATED WORK
Many surveys and tutorials deal with different aspects of SLAM without focusing on VSLAM. For instance, the probabilistic SLAM formulation is well explained in [1], [24], and [2]. The authors of [25] give a thorough discussion of SLAM development and briefly illustrate the application of deep learning to SLAM. In [26], the deep learning approaches used in SLAM are reviewed in detail.

A few surveys focus on VSLAM. Model-based approaches to VSLAM are surveyed in [27] and [28]. The authors of [29] provide an overview of deep learning methods for VSLAM, focusing on visual odometry and loop closure detection. Other survey papers cover individual aspects of VSLAM such as depth estimation [30], [31], optical flow estimation [32], [33], visual odometry estimation [34], or loop closure detection [22].

Table 1 summarizes existing surveys on VSLAM. A quick analysis of the table reveals the need for a survey to encompass the recent advancements in deep learning based approaches for VSLAM.

III. VSLAM GENERAL FRAMEWORK
Classical VSLAM techniques can be divided into two categories: feature-based approaches [15], [16], [35] and dense approaches [36], [37], [38]. In the feature-based approach, input frames are first preprocessed to extract salient, robust, and transformation-invariant keypoints. On the other hand, in dense methods, frames are directly processed. In what follows, we will describe a unified architecture that is applicable to both categories, see Fig. 1; in the case of dense approaches, the feature extractor is the identity function.

The visual odometry module [39], [40] first examines the features, and then uses feature matching and outlier rejection techniques to find reliable pixel-pair correspondences in consecutive frames. These correspondences are further exploited to estimate the optical flow. Simultaneously, the visual odometry module estimates the depth of various points. Finally, by combining the depth and optical flow estimations, the visual odometry module generates a relative pose estimate.

The local mapping module creates a local representation of the agent's surroundings by projecting the scene elements onto an internal local coordinate frame, annotated with their corresponding estimated depth. Afterwards, local maps are fused into a global map with the help of relative pose estimation.

Each local map that has been added to the global map is then fed to the local optimizer, which tries to minimize short-term error accumulation that may come from inaccurate measurements and/or estimations. Basically, it iteratively refines map and pose estimates to enforce alignment between consecutive local maps, resulting in a more consistent local representation of the environment.

Generally, since the local optimizer is based on the relative alignment of consecutive local maps, errors in one map may cause successive maps to align in the wrong direction. While these errors may seem insignificant in the short term, they continuously accumulate over time and become significant in the long run, thus negatively impacting the maps' and poses' global consistency.

The global optimization module solves this issue by relying on loop events detected by the loop closure detection module. At each time step, this module tries to check if the current scene has been previously visited by comparing the current frame with a stored database of previously visited places. In the case of a loop detection event, the current frame is projected back onto the previously constructed map. The global optimizer then corrects for the disparity between the
current frame and its earlier representation in the internal map, yielding globally consistent maps and poses.
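To make the data flow between these modules concrete, the following minimal Python sketch shows one way the unified pipeline of Fig. 1 can be organised. All module names (feature_extractor, vo, local_mapper, and so on) are hypothetical placeholders introduced here for illustration; this is a structural sketch of the architecture described above, not an implementation of any specific system.

```python
# Illustrative VSLAM loop following the unified architecture of Fig. 1.
# All module callables are hypothetical placeholders supplied by the caller.

def vslam_loop(frames, feature_extractor, vo, local_mapper,
               local_optimizer, loop_detector, global_optimizer):
    global_map, poses = [], []
    prev_features = None
    for frame in frames:
        features = feature_extractor(frame)        # identity function for dense methods
        if prev_features is not None:
            # visual odometry: combines internal depth and optical flow estimates
            depth, rel_pose = vo(prev_features, features)
            local_map = local_mapper(features, depth, rel_pose)      # local mapping
            global_map, poses = local_optimizer(global_map, poses,
                                                local_map, rel_pose)  # short-term refinement
            loop = loop_detector(frame, global_map)                   # place recognition
            if loop is not None:
                # correct long-term drift when a previously visited place is recognised
                global_map, poses = global_optimizer(global_map, poses, loop)
        prev_features = features
    return global_map, poses
```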
IV. DEEP VISUAL SLAM TAXONOMY
The proposed taxonomy (Fig. 2) divides deep learning techniques for VSLAM into four categories: modular learning, joint learning, confidence learning, and active learning.

Modular learning encapsulates methods that learn individual modules within the classical VSLAM pipeline in isolation from the others. Despite the high performance of individually learned VSLAM modules, their efficient fusion within the system requires additional effort, and, usually, the overall VSLAM performance may not be as high as that of the individual modules.

Joint learning methods, on the other hand, aim to mitigate the above-mentioned issue by either jointly optimizing various modules at once or by learning the full VSLAM pipeline. Generally, these methods exhibit better overall performance, but require significant effort in the training phase.

Confidence learning techniques are another alternative. They account for the uncertainty of the whole process either by estimating it or by reducing it. Hence, by leveraging probabilistic reasoning capabilities, VSLAM can be made more resilient to the different hazards it may encounter.
features from the inference process. In some circumstances, however, when parts of the scene are ambiguous or under hard environmental conditions, for example, more than one local feature may be required to ensure accurate predictions.

To tackle the first issue, scale-invariant loss functions have been proposed. For instance, [47], [48] used the relative distance between pixel pairs as a supervisory signal, hence overcoming the scale ambiguity that is generally produced when relying solely on ground truth global depth maps. Furthermore, the authors of [47] proposed to divide the problem into two learning phases. In the first one, they learn a coarse depth map, using a sequence of 2D convolutions followed by fully connected layers. The objective of the coarse predictions is to exploit global cues to infer the global scale of the scene. In the second phase, the coarse predictions are refined using a Fine network composed only of 2D convolutions that takes the input image in addition to the coarse depth map and outputs refined per-pixel depths. We argue that one important implementation detail that contributes to the success of this architecture relates to the field of view allowed for each output unit of the two networks. In the Coarse network, each output pixel can see the whole input frame thanks to the last fully connected layer, while in the Fine network, each output unit can only examine a local patch of the input image.
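As a concrete illustration of the scale-invariant supervision mentioned above, the sketch below implements a scale-invariant log-depth loss in the spirit of [47], in which a global rescaling of the prediction leaves the loss unchanged. This is a minimal PyTorch sketch written for this survey, not the authors' original code; the weighting factor lam is an illustrative choice.

```python
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """Scale-invariant log-depth loss in the spirit of [47].

    A global scaling of pred_depth only shifts the log differences by a constant,
    which the second term cancels, so the loss penalises relative (pixel-pair)
    errors rather than the absolute scale of the prediction.
    """
    valid = gt_depth > 0                                   # ignore missing ground truth
    d = torch.log(pred_depth[valid] + eps) - torch.log(gt_depth[valid] + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)

# Example usage with random tensors standing in for network output and ground truth.
pred = torch.rand(2, 1, 64, 64) * 10 + 0.1
gt = torch.rand(2, 1, 64, 64) * 10 + 0.1
print(scale_invariant_loss(pred, gt))
```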
For the second problem, some researchers developed new architectures that consider features at different scales. In this regard, the authors of [49] improved their previous ''Coarse to Fine'' network [47] to progressively refine coarse predictions to higher resolutions. Their new architecture is based on a three-stage refinement process. In a first step, they predict a coarse depth map, using an architecture similar to their previous coarse network. Their refinement procedure is however slightly different. First, instead of using the coarse depth output map, they pass the multi-channel feature map of the coarse network, providing their model with more contextual information that can be further exploited by the refining network. Second, the refining process is done in a progressive manner (two steps in their experiments), where at each scale they incorporate a more detailed but narrower view of the image, using convolutions with finer strides at higher resolution. The same architecture was also tested on other tasks such as surface normal estimation and semantic labeling and achieved high performance on those tasks as well. This may indicate that the progressive refinement used by this architecture can serve as an important building block of future research on deep visual scene understanding.

However, in the above architectures, the depth estimation errors are weighted the same regardless of the object's proximity to the camera. In [50], the authors devised a spacing-increasing discretization (SID) strategy that allows for larger estimation errors when predicting deeper depths and recasts depth estimation as an ordinal regression problem. This is motivated by the fact that, in real life, humans find it easier to estimate the depth of closer objects than faraway ones.

Another difficulty is that most of the proposed solutions are unable to generalize to various types of cameras. To solve this issue, [51] adds a camera model to the depth prediction network, which allows it to estimate appropriate camera-related projections and transformations. This makes the network able to predict the depth seen by different kinds of cameras as it explicitly includes entries for their specific parameters.

2) SELF SUPERVISED LEARNING
The process of manually annotating each input image to create suitable datasets for supervised learning is time-consuming and error-prone. Instead, researchers concentrated their efforts on the automatic extraction of auxiliary supervisory signals from the raw data itself.

A common approach consists of using video inter-frame geometric constraints. In this setting, each pixel in the subsequent frame is projected back onto the preceding one, and pixel differences between the real and reconstructed frames are used as a supervisory signal. However, standard deep learning architectures cannot directly learn this projection operation since it requires knowledge of depth maps and inter-frame transformations as well.

In this context, [52] used a multitask architecture that seeks to predict both depth maps and relative poses between frame sequences, then aggregates those predictions to train the model based on novel view synthesis. The proposed framework also estimates an explainability mask which tries to down-weight regions where the model believes it may fail, hence neglecting their associated errors in the training loss function. In [53], the authors demonstrated that strategies relying on pose network estimation do not properly solve the scale ambiguity problem as they overlook geometric correlations between depth and poses. They proposed a differentiable implementation of the commonly used direct visual odometry algorithm to estimate camera poses instead. This leads to better performance as relative poses are more accurately computed.

It is well established that stereo-based approaches give better results compared to their monocular counterparts. Inspired by this consideration, many researchers tried to exploit known relative poses between left and right stereo pairs to train their models on stereo cameras first, while performing testing on monocular cameras. A simple approach [54] consists of estimating disparity maps, which are then used to synthesize a warped image from the original (right) image. The difference between the warped and original (left) images is then used as a supervisory signal to train the network. The authors also introduced a smoothness loss to guide the training toward continuous disparity estimation. The authors of [55] then improved left-right disparity consistency by extending this design to include both left and right reconstruction losses.
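The stereo self-supervision of [54], [55] can be summarised in a few lines: a predicted disparity is used to warp the right image towards the left one, and the photometric difference drives training. The following is a minimal PyTorch sketch of that idea; the bilinear warp via grid_sample and the plain L1 photometric term are our illustrative choices, not the exact losses of those papers.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Synthesise the left view by sampling the right image at x - d(x, y).

    img_right: (B, 3, H, W) right image
    disp_left: (B, 1, H, W) disparity predicted for the left view, in pixels
    """
    b, _, h, w = img_right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().to(img_right.device).expand(b, h, w)
    ys = ys.float().to(img_right.device).expand(b, h, w)
    x_src = xs - disp_left.squeeze(1)                 # horizontal shift by the disparity
    # normalise coordinates to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * x_src / (w - 1) - 1.0
    grid_y = 2.0 * ys / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(img_right, grid, align_corners=True)

def stereo_photometric_loss(img_left, img_right, disp_left):
    # L1 photometric difference between the real and reconstructed left images.
    return (img_left - warp_right_to_left(img_right, disp_left)).abs().mean()
```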
at different pyramid scales, then minimizing flow residuals between each predicted flow and the warped flow from the previous pyramid level. Generally, the residual motion between successive pyramid levels is small, and it can be approximated with a small network. As a result, SPyNet is simpler and 96% smaller than FlowNet. Nevertheless, despite its computational efficiency, it does not perform as well as FlowNet2.

One reason could be the use of pyramidal processing directly on the raw image frames. Most state-of-the-art classical approaches first extract lighting-invariant features in a preprocessing stage before estimating optical flow with more discriminative representations such as a cost volume. Inspired by this observation, [64] learns to extract feature pyramids from the raw images and then leverages optical flow upsampled from subsequent pyramid layers to warp the feature representations of two consecutive frames against each other at each pyramid level. Finally, a cost volume between the feature pyramids is computed, which is then utilized to estimate the actual optical flow. And, since the warping operation and cost-volume construction do not require any learnable parameters, the result is a lighter network that is 17 times smaller and performs better than FlowNet2.

Nevertheless, estimating optical flow at low resolution first and then refining at higher resolutions presents some difficulties. For example, errors at coarse resolution are hardly recovered, and small fast-moving objects are often missed. In RAFT [65], the authors maintain and update a single high-resolution optical flow by building multi-scale 4D correlation volumes of all pairs of pixels to capture frame similarity. They then use recurrent units to iteratively update optical flow estimates, thus achieving the best performance on all public optical flow benchmarks.

2) SELF SUPERVISED LEARNING
A typical method for learning optical flow in an unsupervised manner is by using warping operations on input frames [66]. In this context, the optical flow between two frames can be estimated and then used to synthesize one frame from the other. If the estimate is accurate, the result should be a synthetic image that is identical to the original. The model can then be trained on the observed discrepancies by comparing pixel brightness differences between the two images and imposing some regularization, such as smoothness, on the flow estimate, hence indirectly producing accurate optical flows.
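The warping-based objective just described can be written compactly: warp the second frame back with the estimated flow, compare brightness, and regularise the flow. Below is a minimal PyTorch sketch of such an unsupervised flow loss; the specific weighting and the first-order smoothness term are illustrative choices, not those of any particular paper.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img2, flow):
    """Backward-warp frame 2 towards frame 1 using a dense flow field.

    img2: (B, 3, H, W); flow: (B, 2, H, W) in pixels (flow[:, 0] = dx, flow[:, 1] = dy).
    """
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(img2.device).expand(b, 2, h, w)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)

def unsupervised_flow_loss(img1, img2, flow, smooth_weight=0.1):
    # brightness-constancy (photometric) term between real and reconstructed frame 1
    photometric = (img1 - warp_with_flow(img2, flow)).abs().mean()
    # first-order smoothness regularizer: penalise large spatial flow gradients
    smooth = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean() + \
             (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photometric + smooth_weight * smooth
```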
Based on the brightness constancy assumption, [67] leveraged the prominent accuracy of classical approaches to train a CNN model for estimating optical flow. It uses proxy ground truth optical flow, derived by traditional methods, as a supervisory signal. In addition, an image reconstruction loss is introduced to the learning scheme to prevent the model from learning the classical approaches' failure situations.

Yet, relying solely on photometric loss can be problematic in areas with important illumination variations, repetitive textures, and occlusions. Some researchers attempted to solve the occlusion problem by learning occlusion masks and training their models to reconstruct only the non-occluded portions of consecutive frames [68], [69]. This may, however, simply remove outliers from the training sample, leaving the optical flow near occlusion regions uncertain.

To overcome this challenge, [70] introduced global epipolar constraints to the training process. The presented method differs from classical photometric matching approaches by considering the two frames as belonging to different views of the same scene, and since each pixel of one view is related to a pixel of the other view by the same fundamental matrix, optical flow estimation can be reduced to estimating this fundamental matrix. Then, the optical flow is iteratively refined to comply with this estimate.

Finally, to address difficulties with illumination variations, [71] first adjusts the input frames to similar illumination conditions before estimating optical flow. More specifically, it leverages the structure-from-motion technique to extract the relative camera pose between successive frames and uses it as a supervisory signal for learning light-invariant optical flow. In recent work, some researchers have tried to estimate optical flow under hard illumination conditions such as low-light environments [72] or foggy scenes [73].

C. VISUAL ODOMETRY
Visual odometry deals with the task of estimating the relative motion of the camera by analyzing pixel displacements in consecutive frames. In other words, the vision system tracks visual features to estimate the vehicle's relative rotation and translation between two time steps. The global trajectory can then be recovered by integrating the relative movements along the travel period.

The main challenge here lies in the fact that we want to estimate 3D motion from 2D images. 2D images generally represent 2D projections of the 3D real world, captured at various time intervals. Hence, the dimension corresponding to the depth information is lacking in the projected images. The classical pipeline for visual odometry estimation therefore tries to recover this missing dimension. To this effect, it first extracts persistent visual features from the consecutive frames (i.e., features that can be re-extracted from different viewpoints, under different illuminations, scales, and various geometric transformations of the scene that may result from the vehicle motion). Then, it exploits the one-to-one feature correspondence between successive image pairs provided by the optical flow computation to estimate the best transformation (in the 3D world) that explains the planar displacement of pixels in the two consecutive frames (in the 2D image plane). In other words, it uses various optimization techniques to estimate how the camera should move to observe the induced feature displacements between frames.

Compared to traditional approaches, deep learning based visual odometry does not require feature extraction, feature matching, or complex geometric operations, leading to more direct solutions.

1) SUPERVISED LEARNING
Early applications of deep learning to computer vision were limited to object recognition and classification. These objects had to be recognized regardless of their position and orientation within an image. Hence, convolutional neural networks were found appropriate for this task as they inherently extract translation and rotation invariant features from a given image. Learning relative motion, however, necessitates extracting temporal features in addition to spatial ones and keeping track of the geometrical transformations that occur between frames. As a result, learning visual odometry requires a different architectural design.

First attempts to learn visual odometry focused on learning specific parts of the classical pipeline. For instance, [75], [76] learn to extract keypoints, [77], [78] learn keypoint descriptors, and [79] learns to match visual features. However, those approaches proved inefficient for the full odometry estimation problem as they need to be embedded in a more general pipeline with several modules. Over time, errors within each module accumulate, leading to a degradation in the performance of the odometry system. One of the first works that directly predicts relative motion from consecutive image frames was [80]. It formulates visual odometry as a classification problem. The proposed architecture takes as input five stacked consecutive left and right frames, processes each individual frame independently in the early layers of a CNN network, then aggregates the learned representations in the last layer to learn discrete changes in direction and velocity on the stacked inputs. This can be viewed as trying to learn, for each individual frame, spatial features that are relevant for inferring temporal dependencies between frames. However, this method suffers from high error accumulation due to the discretization process. Moreover, the extracted temporal features rely on a high level of abstraction of the initial frames, since they are learned only in the last layers of the neural network, and some of the inter-frame dependencies that are easily accessible at low levels of abstraction are ignored by the proposed model. Alternatively, the authors of [81] proposed two different architectures for learning spatio-temporal features from consecutive frames. Their first network, ''Early Fusion'', tries to extract the main temporal features from the first layer, using a CNN network where the temporal dimension is collapsed in the first layer and subsequent layers just build upon those main features to learn interactions between
them. Their second network, ''Slow Fusion'', slowly reduces the temporal dimension of the input frames until it completely vanishes. It processes the input frames using a sequence of 3D convolutions and max pooling operators which reduce the temporal dimension by two at each step. Then, when the temporal dimension is completely consumed, standard 2D convolutions are used to extract more complex features. Hence, the progressive reduction of the temporal dimension can be seen as learning, at each step, temporal features at varying levels of abstraction. The two architectures were tested on the KITTI odometry benchmark and the authors showed that the ''Slow Fusion'' network slightly outperforms ''Early Fusion'' on translational and rotational errors. Moreover, the number of training epochs used by the ''Slow Fusion'' architecture is considerably smaller than the ''Early Fusion'' one (350 for ''Slow Fusion'' compared to 1000 for ''Early Fusion''). This may indicate that learning inter-frame dependencies in a progressive manner can be more beneficial for the visual odometry task.

One key observation about the use of 3D convolutions in the above architectures is that they try to simultaneously extract spatial and temporal features and learn interactions between both of them at the same time. We also want to point out that since each 3D convolution examines only pixels
(or data points in a feature map) that are close in space and time, only short-range dependencies are considered at each layer to generate the features of the next layer. Hence, some of the long-range interactions that may be easily identifiable in the first layers may be hard to learn by these architectures.

DeepVO [82] solves the visual odometry problem by adding a recurrent neural network (RNN) on top of a CNN. In the proposed framework, appearance features are first extracted from the current frame using the CNN. They are then fed to an RNN, which tracks temporal appearance changes across frames to infer relative odometry. The architecture also incorporates prior odometry knowledge to perform estimation, which potentially prevents the model from overfitting to the training dataset. It implements the CNN as a FlowNet architecture, which is specialized in extracting optical flow data from image sequences.

DeepVO showed impressive performance on the KITTI dataset [83] even in previously unseen scenarios, competing in terms of localization error with state-of-the-art monocular visual odometry algorithms and establishing DeepVO as a baseline architecture for end-to-end VO learning.

We can argue that the great performance of DeepVO is mainly related to the decoupling of learning spatial and temporal features. Since each input frame is by itself representative of its spatial features, it may be easier for the CNN to look at each frame individually to extract spatial features that are relevant for the visual odometry task. Then, the role of the RNN would be to find short- and long-range non-linear dependencies between those extracted features in the time domain. Another benefit is that supervised visual odometry learning naturally makes pose predictions from monocular images with the global scale maintained. This is thanks to deep learning networks' ability to implicitly encode scale-related features from a large collection of images.
2) SELF SUPERVISED LEARNING
To solve odometry estimation without manual supervision, [52] proposed using novel view synthesis as a supervisory signal. Their architecture is composed of two networks: DepthNet and PoseNet. The depth network is based on an auto-encoder design and implemented as a DispNet network. It takes a single image as input and generates its corresponding depth map. The pose network, which consists of a convnet architecture, estimates the relative transformation between a source and a target view. The two estimations are then used to construct a synthetic image from a source image. If the depth and relative motion were accurately estimated, the synthesized target should match the ground truth one. For this purpose, the authors trained their network on matching pixel brightness.

Initially, two major difficulties remained unsolved. First, the generated pose estimate is scale ambiguous, which limits its usage in practical scenarios. Second, relying on photometric error implicitly assumes that the scene is static and free of occlusions. While the authors of [52] tried to mitigate this by introducing an explainability mask, their approach does not fully address the problem.

To fix those issues, a large body of work followed. To solve the global scale challenge, the authors of [84] trained their network on stereo image pairs and performed testing on monocular ones. As the geometrical transformation between left and right views is fixed and known during the whole training, an additional left-right photometric consistency is introduced to the model. As a result, their network benefits from the stereo view's capacity to capture scale information. In [85], the authors introduced a 3D geometrical loss to the training process by enforcing consistency between the predicted depth map and a reconstructed one. More specifically, the network predicts the depth map of both the source and target images, as well as their relative transformation. The predicted depth is then projected onto 3D space, and the estimated relative transformation is used to align the two depths (predicted and reconstructed), thus yielding a consistent scale of the predictions.

D. LEARN TO MAP
Recent years have seen a surge in algorithms and techniques to model the three-dimensional physical world and perform efficient reasoning on the generated representations.

3D object modeling was first studied in this context in the field of computer graphics, where practitioners devote laborious efforts to redesigning complex objects in CAD systems. Mapping is another technique used in robotics to represent the perceived real world. It designates the mobile agent's capacity to build a consistent representation of the scenes it perceives during operation. Mapping differs from 3D object modeling in that it does not require human intervention. Instead, it relies solely on external sensory information such as visual inputs and range data, and some internal processing to represent the perceived scene as a whole rather than individually modeling each object within the scene.

When creating a representation of the real-world environment, many factors must be considered. Perhaps the oldest and most widely used method seeks to determine whether or not specific regions of the search space are occupied, with the goal of achieving safe and collision-free navigation. This yields a space-free map representation. A second consideration consists of answering the question of what the world looks like. This results in a geometrical representation that describes the layout and shape of the perceived scene. Finally, knowing what each part of the search space relates to, recognizing the seen items, and dividing them into well-defined classes can all be useful. This gives rise to a semantic representation of the environment. In the following, we will describe each map representation in more detail.
1) SPACE-FREE MAPS
To describe an environment in terms of space-free regions, two major paradigms have been extensively used in the literature: grid-based and topological [86].

Grid-based approaches produce accurate metric maps that represent the environment through evenly-spaced grids. However, exploiting those maps suffers from high temporal and spatial complexity. Topological maps, on the other hand, use a graph structure to describe the environment, with nodes denoting recognizable places and edges designating direct collision-free paths between them. This results in a more efficient and compact map, though the burden of maintainability increases when the environment becomes larger.

Metric and topological maps are orthogonal by nature, with the weaknesses of one remedied by the strengths of the other, and choosing between the two requires a trade-off between accuracy and efficiency. In practice, both maps are coupled to each other [87], with grid maps being used first to provide an accurate estimate of the obstacles' location and disentangle similarly looking places, followed by topological representations for more efficient planning and navigation.
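For readers unfamiliar with grid-based maps, the classical way to maintain one is a per-cell log-odds update from sensor observations. The short sketch below shows this standard update; the inverse sensor model probabilities and grid size are arbitrary illustrative values.

```python
import numpy as np

class OccupancyGrid:
    """Minimal evenly-spaced occupancy grid maintained with per-cell log-odds."""

    def __init__(self, shape=(100, 100), p_hit=0.7, p_miss=0.4):
        self.logodds = np.zeros(shape)                      # 0 log-odds == p(occupied) = 0.5
        self.l_hit = np.log(p_hit / (1.0 - p_hit))          # increment for an occupied reading
        self.l_miss = np.log(p_miss / (1.0 - p_miss))       # decrement for a free reading

    def update(self, cell, occupied):
        i, j = cell
        self.logodds[i, j] += self.l_hit if occupied else self.l_miss

    def probability(self):
        return 1.0 / (1.0 + np.exp(-self.logodds))          # back to occupancy probabilities

grid = OccupancyGrid()
grid.update((10, 12), occupied=True)
grid.update((10, 13), occupied=False)
print(grid.probability()[10, 12:14])
```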
2) GEOMETRIC MAPS
The physical layout and structure of the environment can be represented in a variety of ways. The most prevalent ones are depth maps, point clouds, meshes and volumetric voxel representations. In the depth map representation, the depth of every pixel in the perceived scene is estimated. This corresponds to predicting the distances separating the camera from each pixel in the image, which gives rise to a dense depth view of the surroundings.

As was previously stated, there is a plethora of ways for learning depth maps from visual inputs in the literature, varying essentially in the type of input used in training and testing (either monocular or stereo), the architectural design, and the supervisory signals used during the learning process.

Point clouds, on the other hand [88], consider only a subsample of the image pixels to build a 3D model of the perceived scene. More accurately, they sample pixels from the viewed images and project them back into 3D space. The benefit is an easier understanding and manipulation of the representation. However, the same geometry can be represented by different point clouds and, inversely, the same point cloud can model different geometries, which may lead to ambiguity.

Mesh encoding [89] is another representation alternative where objects are modeled by encoding their salient features such as edges and surfaces.

Last, the volumetric voxel representation [90], [91] describes a given scene by populating a uniform grid with elementary volumetric units that constitute parts of solid objects in the scene.

It is worth mentioning that mesh and volumetric representations can endow visual systems with great expressive power, as they explicitly encode the structure and shape of the perceived elements. However, the high maintainability costs they incur hinder their wide application in practice. Therefore, most visual SLAM systems prefer using depth and point cloud representations instead.
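The relation between the depth-map and point-cloud representations discussed above is a simple back-projection through the camera intrinsics. The following NumPy sketch converts a depth map into a point cloud under an assumed pinhole camera model; the intrinsic values in the example are placeholders.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (N, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (us.reshape(-1) - cx) * z / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (vs.reshape(-1) - cy) * z / fy
    points = np.stack((x, y, z), axis=1)
    return points[z > 0]                        # drop pixels with no valid depth

# Placeholder intrinsics and a random depth map for illustration.
cloud = depth_to_pointcloud(np.random.rand(48, 64) * 5.0, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
print(cloud.shape)
```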
the mismatch between the current observation and the built estimates. Nonetheless, visual aliasing and perceptual variability make it challenging to recognize visited locations. Perceptual aliasing (false positives) refers to a high degree of similarity between different places that leads to incorrect loop detection. Conversely, perceptual variability (false negatives) designates the change in the appearance of the same scene caused by factors such as changing arrangements of movable objects within the scene, which may prevent loop closure identification.

Traditional methods for loop closure detection usually build a database of the perceived images expressed in a hand-crafted feature space. They use an image description technique such as Bag-of-Words [107], Fisher kernels [108], or vectors of locally aggregated descriptors [109], [110], and compare each recently observed image with every entry in the database using a similarity distance such as the cosine distance. This results in a visual similarity matrix where loop closures are spotted in off-diagonals with high similarity scores [111]. However, this approach rapidly becomes intractable as the environment becomes larger due to the many image pairs that need to be compared. Alternatively, [112] proposed an incremental implementation of the problem by casting loop closure detection in a Bayesian framework, resulting in a real-time scene recognition method. However, their solution needs a careful design of a probabilistic transition model that strongly depends on the environment and robot motion.

On the other hand, recent advances in deep learning have shown superior performance in image classification tasks [113], suggesting that the features extracted by deep neural networks are more convenient for visual tasks. In [114], the authors compared the performance of the features learned from various layers of a CNN network with SIFT descriptors on a descriptor matching benchmark; deep features from different layers were shown to consistently outperform SIFT. More intriguingly, the CNN network was trained on a classification challenge rather than a descriptor matching task. This implies that the learnt deep CNN features encompass relevant visual features that can be applied to various visual tasks. Many academics were inspired to develop new deep learning algorithms to tackle the loop closure problem as a result of this.

1) SUPERVISED LEARNING
Several researchers tried to use supervised methods for loop closure detection. For instance, [115] combined features extracted from a pretrained CNN network with spatio-temporal filtering to perform place recognition. The proposed method produces, for each layer of the deep network, a corresponding confusion matrix M_k, k = 1, . . . , 21, where each element M_k(i, j) indicates the Euclidean distance between the feature vector responses to the i-th training image and the j-th testing image, and candidate loop closures occur at minima locations. Then, these hypotheses are further
validated through spatial and temporal filters, with the spatial filter enforcing the spatial proximity of the plausible closures and the temporal filter verifying their temporal closeness; a precision of 100% and a recall improvement of 75% were achieved on the measured data set.

The authors of [116] evaluated the performance of image representations generated by a pretrained CNN model at intermediate layers in terms of their ability to detect loop closures. Two key findings were the ability of deep features to surpass their state-of-the-art competitors in the presence of significant lighting changes and the fast feature extraction capability of deep learning methods compared to hand-crafted ones.

Similarly, [117] compares the performance of four popular deep learning architectures (PCANet [118], AlexNet [113], CaffeNet [119] and GoogLeNet [120]) to two hand-crafted techniques, one based on local BoW features [107] and the other on global GIST descriptors [121], in the problem of loop closure detection. In their approach, the authors used the learned last-layer features of each deep network as image descriptors. Then, each pair of image descriptors (from the same network) is concatenated into a single vector, together with a ground truth label that indicates whether the two images close a loop. Finally, a Support Vector Machine (SVM) classifier is trained on the constructed dataset for loop closure detection. This procedure was carried out for each deep learning and hand-crafted method, and the results demonstrated a significant gain in accuracy and processing time for the deep learning approaches.
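A common denominator of these approaches is to reduce each image to a global descriptor and search for highly similar, temporally distant pairs. The sketch below illustrates that basic recipe with cosine similarities between (assumed precomputed) CNN descriptors; the similarity threshold and the minimum temporal gap are arbitrary illustrative values, and real systems add the validation stages described above.

```python
import numpy as np

def loop_closure_candidates(descriptors, min_gap=30, threshold=0.9):
    """Return candidate loop closures (i, j, score) from global image descriptors.

    descriptors: (N, D) array, one CNN (or hand-crafted) descriptor per frame.
    A pair is a candidate if its cosine similarity is high and the frames are
    far enough apart in time to exclude trivially similar neighbours.
    """
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    similarity = d @ d.T                          # visual similarity matrix
    candidates = []
    n = len(d)
    for i in range(n):
        for j in range(i + min_gap, n):           # off-diagonal entries only
            if similarity[i, j] > threshold:
                candidates.append((i, j, float(similarity[i, j])))
    return candidates

# Example with random vectors standing in for deep descriptors.
print(loop_closure_candidates(np.random.rand(100, 128))[:5])
```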
2) SELF SUPERVISED LEARNING
Other researchers were interested in learning a loop closure detector without explicit supervision. Reference [122] presented a self-supervised method for learning loop closure detection. With the goal of building a more resilient network, the proposed approach relies on a stacked auto-encoder trained to automatically recover input patches that were randomly and intentionally affected by noise in the form of pixel dropout. The encoder part of the network was then used as an image descriptor, and additional weight was assigned to each output unit response according to its discriminatory property to penalize redundant features. The effects of corruption were then evaluated by experiments.

Further, based on an auto-encoder architecture, the method in [123] alters the input data with random projective transformations to enforce viewpoint invariance in the learnt description. The neural network is then trained on recovering HOG features [124] rather than original images to take advantage of their robustness to changes in illumination. This resulted in an appearance-invariant image descriptor, which is more suitable for measuring place similarity.

VI. JOINT LEARNING
It has been widely established that many visual SLAM modules are linked together by various geometric constraints. Hence, learning representations that take those interdependencies into account can lead to more accurate models. This is the essence of joint learning, in which the learning architecture is made up of several sub-networks, each of which is responsible for learning a specific sub-task. However, the individual sub-tasks are not explicitly learned. Rather, they are jointly optimised to perform a more general objective, in the sense that learning the global task will be possible only if each individual sub-task has been correctly, though implicitly, learnt. The benefit is a more reliable model, as it must use a solid theoretical prior for connecting the sub-networks to make implicit learning of the sub-tasks possible.

As opposed to modular learning, joint learning can exploit the full relationship between the different modules, albeit at the expense of a more complex architecture. To the best of our knowledge, only depth, optical flow, and ego-motion have been jointly optimized in the context of a deep learning framework, due to their well established interdependencies. Other newly proposed approaches use a single end-to-end deep learning architecture to directly optimize the entire visual SLAM pipeline. In the following, we will go through each of these approaches in further depth.

A. DEPTH, OPTICAL FLOW, AND EGO-MOTION
As mentioned earlier, depth and ego-motion are related by well-known geometric constraints. However, when considering optical flow as well, a stronger relationship has been established in recent works, resulting in models that better describe the motion perceived within the scene and overcome the limitations of using depth and ego-motion alone when applied to dynamic regions.

Sfm-Net [125] is, to the best of our knowledge, the first architecture that combines depth, ego-motion and optical flow in a unified framework. The method is based on two autoencoders, one for estimating scene structure and the other for estimating motion. It dedicates a separate channel to each source of motion (camera and moving objects). Then, object masks are generated to assign each pixel to its corresponding motion channel. Finally, a warping technique based on optical flow computation is used to assess the consistency of the learnt estimates. However, Sfm-Net needs to know a priori how many moving objects are within the scene (in the experiments conducted in [125], only 3 dynamic objects were considered), limiting its application to environments with only a few dynamic elements.

Perhaps another limitation of Sfm-Net relates to the nature of learning object motion. The latter is modeled from scratch, neglecting the fact that the apparent movement of pixels is the result of both object motion and ego-motion. GeoNet [126] tries to address this issue by using a two-stage learning scheme. The core idea consists of separating the motion of objects within the scene from that resulting solely from the movement of the camera. More precisely, the source image's pixel-wise 3D locations P_s are first computed by projecting
each pixel p_s into 3D space using the predicted depth map D_s(p_s) and the camera intrinsic parameters K; see Eq. (2). The camera's estimated relative motion T_{s→t} is then used to track, in 3D space, each computed 3D point; see Eq. (3). Following that, each 3D target point P_t is reprojected back to the image plane; see Eq. (4). The discrepancy between the pixel source and pixel target coordinates results in the rigid flow produced by the camera motion alone; see Eq. (5). Finally, this rigid flow is iteratively refined using a ResNet network to match the motion of each dynamic object within the scene.

P_s = D_s(p_s) K^{-1} p_s    (2)
P_t = T_{s→t} P_s    (3)
p_t = K P_t    (4)
f^{rig}_{s→t} = p_t − p_s    (5)
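Equations (2)-(5) translate directly into a few tensor operations. The sketch below computes the rigid flow induced by camera motion from a predicted depth map and relative pose, in the spirit of [126]; the tensor shapes, the homogeneous-coordinate handling and the example intrinsics are our own illustrative choices.

```python
import torch

def rigid_flow(depth, T, K):
    """Rigid flow from camera motion alone, following Eqs. (2)-(5).

    depth: (H, W) predicted depth of the source frame
    T:     (4, 4) relative camera motion T_{s->t} in homogeneous form
    K:     (3, 3) camera intrinsics
    Returns a (H, W, 2) flow field f_rig = p_t - p_s.
    """
    h, w = depth.shape
    vs, us = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ones = torch.ones_like(depth)
    p_s = torch.stack((us.float(), vs.float(), ones), dim=0).reshape(3, -1)  # homogeneous pixels

    # Eq. (2): back-project to 3D, P_s = D_s(p_s) K^{-1} p_s
    P_s = depth.reshape(1, -1) * (torch.linalg.inv(K) @ p_s)
    # Eq. (3): move the points with the relative camera motion, P_t = T_{s->t} P_s
    P_t = (T @ torch.cat((P_s, ones.reshape(1, -1)), dim=0))[:3]
    # Eq. (4): re-project to the image plane, p_t = K P_t (then normalise by depth)
    p_t = K @ P_t
    p_t = p_t[:2] / p_t[2:3].clamp(min=1e-6)
    # Eq. (5): rigid flow is the displacement of the pixel coordinates
    return (p_t - p_s[:2]).reshape(2, h, w).permute(1, 2, 0)

# Example with an identity motion and placeholder intrinsics (flow should be ~0).
K = torch.tensor([[60.0, 0.0, 32.0], [0.0, 60.0, 24.0], [0.0, 0.0, 1.0]])
flow = rigid_flow(torch.rand(48, 64) + 1.0, torch.eye(4), K)
print(flow.shape)   # torch.Size([48, 64, 2])
```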
We observe that one major benefit of such an approach is that it explicitly separates features common to all scenes, representing camera-induced motion, from those specific to each scene, resulting from the motion of dynamic objects. This results in disentangled representations that may generalize better to unseen environments. Similarly, [127] employs residual flow estimation in the case of stereo video. However, residual flow can only correct for small errors and generally tends to fail in the case of large pixel displacements. To improve performance, [128] introduced cross-task learning into optical flow estimation. The main idea is to enforce matches between warped scenes coming from residual flow and those produced by a dedicated optical flow network, under the insight that simultaneous learning of the same task under diverse cues provides more consistent supervisory signals. Nonetheless, the aforementioned approaches handle dynamic objects in an implicit manner by correcting for inconsistent estimates.

Other works suggest leveraging semantic segmentation instead for a more robust estimation. For instance, in [129] and [130], the authors propose modeling the motion in the dynamic areas as well. They first segment the image into static and dynamic parts. Then, for each moving object that has been detected, they apply a separate network to estimate its 3D rigid transformation. Finally, they populate the warped scene resulting from ego-motion alone with that generated by dynamic object motion using the expression in Eq. (6), where T̂^{obj} refers to an object's rigid motion (the only difference between the pixel correspondence from source to target in the static and dynamic setups is the introduction of T̂^{obj}). The network is then trained to match the original frame with the warped one.

p_t = K T_{s→t} T̂^{obj} D_s(p_s) K^{-1} p_s    (6)

Another interesting direction was proposed in [131], which casts depth, ego-motion, optical flow, and motion segmentation as a game problem. In this setting, the network modules are assimilated into players who compete and collaborate to reduce warping losses. The key idea consists of training the network in two phases: competing and collaborating. In the competing phase, the static scene reconstructor and moving region reconstructor networks compete against each other to minimize losses only in their assigned regions, given by the motion segmentation network. Then, in the collaborating phase, those two networks form a consensus to improve the segmentation network's pixel assignment. The benefit is that a more robust segmentation can be used for improved structure and motion estimation.

B. END-TO-END LEARNING
End-to-end learning is a very promising direction to solve the VSLAM problem as it directly optimises all VSLAM modules at once. As a consequence, it provides models that are more resilient to noise and uncertainties. Yet, building an end-to-end learning architecture is not simple as it involves careful handling of all inter-module dependencies in a differentiable manner to make learning through backpropagation possible. The recent introduction of a differentiable implementation of the particle filter [132], a widely used algorithm in classical VSLAM approaches, has made end-to-end learning of VSLAM possible.

For instance, [133] has introduced DMN, a differentiable mapping network that uses a differentiable particle filter to learn a view embedding map of the environment that is specifically optimized for visual localization. However, since the map's representation is abstract (not easily interpretable), it cannot be used for tasks other than localization. DeepSLAM [134], on the other hand, jointly learns both the robot's pose and the environment's 3D map in an end-to-end unsupervised manner. It combines an autoencoder-based mapping network to regress a depth map of the environment together with an RCNN-based tracking network to estimate the 6-DoF pose of the robot. Then, using a pretrained loop closure detector together with a graph optimization procedure, it enables global consistency of the generated 3D map and pose of the robot. More specifically, at training time, DeepSLAM jointly minimizes a stereo warping loss between left and right stereo pairs together with a temporal warping loss between consecutive frames. Several supervisory signals are employed during the training phase, including stereo photometric warping loss, consistency between right and left estimated poses, novel view synthesis, and 3D geometric registration. Thus, it allows for a more consistent estimate of the map and the pose of the robot. Another clear advantage of DeepSLAM is that it uses only RGB input at test time, which makes it suitable for both indoor and outdoor scenarios. Yet, despite including a module for processing error maps between consecutive estimations, the uncertainty handling is only directed towards outlier rejection rather than tracking the true confidence in the various predictions.

More recently, SLAM-net [135] proposes a differentiable implementation of the particle filter-based FastSLAM algorithm [136] to learn transition and observation models, which are subsequently used to perform mapping and localization in a probabilistically consistent manner. Nevertheless,
it assumes planar motion of the robot, restricting its application in more complex environments.
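Since several of the end-to-end systems above build on differentiable particle filters, it is worth illustrating the one step that breaks differentiability in a classical filter, resampling, and the soft-resampling workaround commonly used to restore gradients in that line of work [132]. The sketch below is a simplified PyTorch illustration of that trick; the mixture coefficient alpha and the Gaussian observation model in the example are placeholders.

```python
import torch

def soft_resample(particles, log_w, alpha=0.5):
    """Soft resampling: draw indices from a mixture of the particle weights and a
    uniform distribution, then apply an importance-weight correction so that the
    new weights remain differentiable with respect to the old ones.

    particles: (N, D) particle states, log_w: (N,) unnormalised log weights.
    """
    n = log_w.shape[0]
    w = torch.softmax(log_w, dim=0)
    q = alpha * w + (1.0 - alpha) / n                  # sampling distribution
    idx = torch.multinomial(q, n, replacement=True)    # the non-differentiable draw
    new_particles = particles[idx]
    new_w = w[idx] / q[idx]                            # differentiable importance correction
    new_log_w = torch.log(new_w / new_w.sum())
    return new_particles, new_log_w

# One filter step with a placeholder Gaussian observation model.
particles = torch.randn(128, 3, requires_grad=True)     # e.g. (x, y, heading)
log_w = -((particles[:, :2] - torch.tensor([1.0, 2.0])) ** 2).sum(dim=1)
p, lw = soft_resample(particles, log_w)
lw.sum().backward()                                     # gradients reach the original particles
print(p.shape, particles.grad is not None)
```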
VII. CONFIDENCE LEARNING
SLAM is essentially a state estimation problem where uncertainty is inherently present. Most existing work in the literature uses a fixed model for describing the camera noise distribution [137]. However, this might not always be the case in real-world applications due to many unpredictable factors such as occlusions, the sudden appearance of obstacles, and texture-less environments. In SLAM, each current prediction serves as input to the next estimate. Thus, even if the estimation errors are small, as is the case when using deep learning, they may accumulate over time and lead, in the long term, to inconsistent maps and poses. As a result, keeping track of those uncertainties is of utmost importance for building more accurate and reliable SLAM solutions. This motivated many researchers to explore deep learning's potential for handling SLAM uncertainty. In this context, we find two different approaches: those that try to directly reduce uncertainty without explicit estimation, and others that directly estimate the different uncertainty values.

A. UNCERTAINTY REDUCTION
Most deep unsupervised approaches for learning VSLAM rely on the brightness constancy assumption, which stipulates that pixels of different frames that correspond to the same scene coordinate (in 3D) must share the same color. This assumption is generally violated in real-world circumstances due to illumination variations, non-Lambertian surfaces, and the presence of dynamic objects [138]. Initial attempts to reduce the uncertainty associated with this assumption were made in the context of depth and ego-motion estimation by training an explainability mask, which outputs the model's belief of where it might succeed. This results in a per-pixel soft mask that down-weights predictions in regions of high uncertainty [52], leading to an implicit uncertainty reduction. However, we observe that this method only acts as a filter that prevents ambiguous features from being considered during the training phase. This means that unmodeled artefacts that had not been seen during training may still mislead the model predictions.
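To make the role of the explainability mask explicit, the following minimal sketch shows how a per-pixel soft mask can down-weight a photometric error map during training, with a small regularizer that keeps the mask from collapsing to zero. The regularizer and its weight are illustrative choices in the spirit of [52], not the exact formulation of that work.

```python
import torch

def masked_photometric_loss(photo_error, mask_logits, reg_weight=0.2):
    """Down-weight per-pixel photometric errors with a learned explainability mask.

    photo_error: (B, 1, H, W) absolute photometric error between real and warped frames
    mask_logits: (B, 1, H, W) raw network output for the explainability mask
    """
    mask = torch.sigmoid(mask_logits)                  # soft mask in (0, 1)
    weighted = (mask * photo_error).mean()
    # Regularizer: encourage the mask to stay close to 1 so it cannot
    # trivially explain every pixel away.
    prior = -torch.log(mask + 1e-6).mean()
    return weighted + reg_weight * prior
```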
Other authors have proposed specialized networks for error correction. For instance, the authors in [139] designed DPC-Net, a deep neural network for pose correction that can be added to an existing pose estimator to fuse small pose adjustments into the original estimates. The network takes as input two successive stereo pairs and learns, by means of a convolutional neural network, geometric transformations to be applied to the original estimate. It parameterizes the predictive corrections using a Lie algebra formulation [140] to take into account the correlation between the translational and rotational errors, and demonstrates an accuracy improvement even in situations of poorly calibrated lens distortion. Similarly, the authors of [141] explored the use of a stacked LSTM network to correct visual SLAM pose estimation errors. The deep network takes as input the trajectory generated by a conventional semantic SLAM algorithm. It then identifies and corrects probable pose estimation errors under a variety of uncertainties, including measurement errors, sensor failures, and data processing faults.

Other studies investigated how uncertainty in scene understanding could help SLAM systems, in addition to motion uncertainty. In this regard, [142] proposed an information-theoretic strategy to reduce uncertainty in selected map keypoints. Their approach is based on careful feature selection of points that provide the highest reduction of Shannon entropy. Consequently, a sparse map of features that can be reliably detected over long distances was produced.

B. UNCERTAINTY ESTIMATION
A common deep learning architecture for estimating uncertainty is the Bayesian neural network [143]. A variant which is well suited to visual tasks is the Bayesian convolutional neural network [144], which has been widely used in computer vision.

In the visual SLAM context, [145] uses a Bayesian convolutional neural network to regress camera 6-DoF poses directly from raw RGB images. Their model was able to measure camera relocalization errors, which were then exploited to improve the estimates further, obtaining 2 m and 6° of accuracy for large-scale outdoor environments and 0.5 m and 10° of accuracy indoors. Similarly, [146] leveraged a convolutional Bayesian network to incorporate the global orientation from the sun into the visual odometry estimation. On the other hand, the authors of [147] introduced a novel monocular depth estimation network trained without supervision on stereo videos. Their method is based on modeling the pixel photometric uncertainty and, to avoid wrong data associations that may come from left and right image illumination variations, they first align both images to the same brightness conditions using predicted brightness transformation parameters. This makes their uncertainty modeling more resilient to other non-trivial artefacts such as non-Lambertian surfaces, featureless areas and moving obstacles, achieving state-of-the-art performance on the KITTI [83] and EuRoC MAV [148] datasets.

In addition, mapping uncertainty was also investigated in [149] and [150]. The authors of [149] introduced CMP, a probabilistic mapper and planner which incrementally updates its belief about the map's free space and occupied regions. It warps previous beliefs (from past frames) with the current egomotion to predict how the map will change as a result of motion. Then, it aligns its updated belief with the current observation, and a deep model is trained end-to-end to optimize selected actions to achieve a high-level planning task. The benefit of such an approach is that the generated maps are task-oriented. For instance, the authors show that CMP can predict free space in regions that have not yet been fully observed and that lead to a location of interest. To the best of our knowledge, this was the first successful
implementation of a deep visual SLAM system under classical SLAM principles (Bayesian updates of beliefs). Yet, despite focusing only on mapping (ego-motion is assumed to be provided to the system), CMP can serve as an important guide to shrink the gap between deep learning and classical approaches. Another limitation of CMP concerns its static scene assumption.

The authors of [150] developed a deep learning architecture to probabilistically capture the trajectory of dynamic vehicles on highways. Their approach involves integrating the attention mechanism with LSTMs to create a dynamic occupancy grid map that represents the new vehicle locations after a fixed time. This is motivated by the insight that some vehicles influence the behavior of other vehicles more than others, and capturing temporal features of those elements of interest may ease the learning process.

However, the previous methods treat the VSLAM uncertainty estimate as a unimodal estimation, which goes against the inherent interdependence between the various SLAM modules (the uncertainty of one module strongly affects the uncertainty of others). Very recently, [135] proposed a differentiable implementation of FastSLAM, a classical SLAM system, and was able to encode a deep learning model capable of tracking the various uncertainties of mapping and localization and adjusting its beliefs accordingly in a probabilistically consistent framework.
obstacle avoidance [158].
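As a rough illustration of how a FastSLAM-style filter keeps track of pose uncertainty, the sketch below performs the classical (non-differentiable) particle weighting and resampling step. It is only a schematic stand-in for the differentiable formulation of [135]; the Gaussian observation model and all parameter values are assumptions.

```python
import numpy as np

def update_belief(particles, weights, predicted_obs, actual_obs, sigma=0.5):
    """One classical particle-filter correction step: reweight particles by the
    likelihood of the current observation, normalize, and resample when the
    effective sample size collapses. The spread of the surviving particles is
    the filter's running estimate of pose uncertainty."""
    err = predicted_obs - actual_obs
    likelihood = np.exp(-0.5 * np.sum(err ** 2, axis=1) / sigma ** 2)
    weights = weights * likelihood
    weights /= weights.sum() + 1e-12

    # Resample if the effective number of particles is too small.
    n_eff = 1.0 / np.sum(weights ** 2)
    if n_eff < 0.5 * len(weights):
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights

# Toy usage with 100 particles over a 3-DoF pose (x, y, yaw).
rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=(100, 3))
weights = np.full(100, 1.0 / 100)
particles, weights = update_belief(particles, weights,
                                   predicted_obs=particles[:, :2],
                                   actual_obs=np.array([0.2, -0.1]))
```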
VIII. ACTIVE LEARNING

Active SLAM refers to the robot's ability to intelligently explore its environment and take optimal decisions and actions to improve its localization, map, and perception of the environment. It can be addressed from two different perspectives: exploration and perception. Active exploration [152] focuses on controlling the robot motion to reduce pose and map uncertainty, while active perception [153], [154] is defined as the process of ingeniously controlling the robot sensors to obtain relevant information about the environment and reduce sensing errors.

Aside from traditional methods, such as model predictive control [155], frontier-based methods [156], and random tree strategies [157], the problem of selecting the most convenient action during robot navigation has recently been addressed by deep learning methods.
A. ACTIVE EXPLORATION

Initially, any SLAM system is completely unaware of the map of its environment. At each time step it can only build a local description of its surroundings, limited by the range of its on-board sensors. Those local maps are then integrated in a subsequent step to form an overall estimate of the global map.

Ideally, a SLAM system needs to visit each area of its environment at least once to build a full description of the global map. However, in most cases a single pass through an area is not sufficient. This is mainly due to errors in the estimation model or the sensors. In theory, revisiting the same area multiple times during the robot exploration phase may reduce the map and pose uncertainties. However, constantly looping around a single location may be a waste of time and resources. Moreover, not all locations and views of the environment are equally informative, and some areas may even mislead the estimates due to factors such as poor lighting and clutter. Hence, a VSLAM system needs an appropriate exploration strategy to achieve more accurate map and pose estimates.

In practice, most work on VSLAM uses a human operator in the exploration phase to initiate the robot's internal map representation by ensuring maximal coverage of the area of interest. However, this soon becomes impractical and laborious as the environment grows in size. On the other hand, some recent techniques have been proposed to make VSLAM systems able to learn how to efficiently explore the environment in an active manner, without human intervention.

Active exploration involves appropriate reactions to the various events that a robot may encounter during navigation. In VSLAM, active exploration uses the information gained from past views of the environment to decide the next action the vehicle should take to gain in performance or to visit new interesting areas of the environment. In this regard, deep reinforcement learning has attracted many researchers thanks to its capability of learning through interactions. It was initially used in the context of collision-free navigation, in which a mobile robot uses simple extrinsic rewards to encourage obstacle avoidance [158].

However, efficient navigation in the context of VSLAM entails more than just avoiding obstacles; it also requires an optimal selection of actions to gain in performance or to reduce the uncertainty in mapping the environment and localizing the robot. In this context, the authors of [159] used deep reinforcement learning to directly map robot observations to the actions needed to effectively explore new environments in an end-to-end manner. To this effect, they designed a new intrinsic reward that favors discovering unexplored areas by computing, at each step, the difference in coverage between the current estimated map and the map built at the previous time step. Any increase in coverage is then rewarded positively. The model was also pretrained using imitation learning, by mimicking expert demonstrations, to overcome the challenge of sparse rewards in complex real 3D environments. This helped in accelerating the learning process.
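The sketch below shows one minimal way to compute such a coverage-based intrinsic reward from successive occupancy estimates; the binary "explored" masks and the thresholding are assumptions made for illustration and do not reproduce the exact formulation of [159].

```python
import numpy as np

def coverage_reward(prev_map, curr_map, threshold=0.5):
    """Intrinsic reward = newly covered area between two map estimates.

    Both maps are per-cell 'explored' confidences in [0, 1]; a cell counts as
    covered once its confidence exceeds `threshold`. The reward is the number
    of newly covered cells, so any increase in coverage is rewarded positively."""
    prev_covered = prev_map > threshold
    curr_covered = curr_map > threshold
    newly_covered = np.logical_and(curr_covered, np.logical_not(prev_covered))
    return float(newly_covered.sum())

# Toy usage on a 4x4 grid: three more cells became confidently explored.
prev_map = np.zeros((4, 4))
curr_map = np.zeros((4, 4))
curr_map[0, :3] = 0.9
print(coverage_reward(prev_map, curr_map))  # 3.0
```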
However, [160] pointed out that directly learning the low-level actions end-to-end can suffer from sample inefficiency due to the high complexity of real-world environments and the many scenarios that need to be explored. Instead, they proposed a hierarchical approach consisting of three learnable modules. First, a neural SLAM module uses supervised learning to predict maps and robot poses from the incoming RGB images and sensor readings. Those maps and poses are then consumed by a global policy module to produce long-term goals using reinforcement learning, with a training scheme that favors high coverage of the area to explore, similar to [159]. The long-term goal is then converted into a short-term goal using an analytical method. Finally, a local policy network is trained using imitation learning to map the short-term goal to the action the robot needs to execute. This hierarchical decomposition made their model able to generalize better to unseen environments and to outperform previous methods on the exploration task.
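The skeleton below shows how such a hierarchical decomposition can be wired together at inference time. The three module interfaces, the analytical planner, and all names are hypothetical placeholders chosen to mirror the description above, not the actual implementation of [160].

```python
from types import SimpleNamespace

def hierarchical_exploration_step(rgb, sensor_reading, modules, planner):
    """One control step of a hierarchical exploration agent.

    `modules` is expected to expose three callables (hypothetical interfaces):
      - neural_slam(rgb, sensor_reading) -> (map_estimate, pose_estimate)
      - global_policy(map_estimate, pose_estimate) -> long_term_goal
      - local_policy(short_term_goal) -> action
    `planner` is an analytical planner turning a long-term goal into a nearby
    short-term goal, given the current map and pose."""
    map_estimate, pose_estimate = modules.neural_slam(rgb, sensor_reading)
    long_term_goal = modules.global_policy(map_estimate, pose_estimate)
    short_term_goal = planner(map_estimate, pose_estimate, long_term_goal)
    return modules.local_policy(short_term_goal)

# Toy stubs standing in for the learned modules, just to exercise the wiring.
stub_modules = SimpleNamespace(
    neural_slam=lambda rgb, s: ({"occupancy": None}, (0.0, 0.0, 0.0)),
    global_policy=lambda m, p: (5.0, 5.0),
    local_policy=lambda goal: "move_forward",
)
stub_planner = lambda m, p, goal: (1.0, 0.0)
print(hierarchical_exploration_step(None, None, stub_modules, stub_planner))
```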
Other interesting lines of research have focused on favoring actions that maximize the robot's knowledge of the environment, either by steering the robot towards areas that are hard to predict [161], or by using information theory to maximize the reduction of the environment's entropy [162]. From a SLAM standpoint, this has numerous advantages, as it explicitly reduces the uncertainty of the environment, resulting in more accurate estimates.
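To make the information-theoretic criterion concrete, the sketch below scores a candidate action by the expected reduction in the Shannon entropy of an occupancy grid. The per-cell independence assumption and the hypothetical predicted map are simplifications, not the exact formulation of [162].

```python
import numpy as np

def map_entropy(occupancy_probs, eps=1e-9):
    """Shannon entropy (in bits) of an occupancy grid, assuming independent
    cells with occupancy probability p and free probability 1 - p."""
    p = np.clip(occupancy_probs, eps, 1.0 - eps)
    return float(np.sum(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)))

def information_gain(current_map, predicted_map):
    """Expected entropy reduction if the map evolves from `current_map`
    to `predicted_map` after executing a candidate action."""
    return map_entropy(current_map) - map_entropy(predicted_map)

# Toy usage: driving two completely unknown cells (p = 0.5) towards certainty
# yields about 1.4 bits of information gain; the other cells are unchanged.
current = np.array([[0.5, 0.5], [0.1, 0.9]])
predicted = np.array([[0.05, 0.95], [0.1, 0.9]])
print(round(information_gain(current, predicted), 2))  # ~1.43
```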
B. ACTIVE PERCEPTION

Active perception can be defined as an intelligent data acquisition process that guides a robot's decisions in situations of partial observability. It builds incremental beliefs about the state of the environment based on successive observations and directs the robot's motion and perception toward locations that will improve its understanding of the environment. In other words, it is the robot that decides how to perceive the world based on its current and past observations.

In theory, complete certainty about what we observe is only achieved if all possible views of the scene are explored. Yet, as the environment grows in complexity, an extensive exploration of the environment becomes laborious and time-consuming. However, in practice, not every part and view of the environment is equally informative. For example, not every part of an object needs to be fully observed in order to recognize the object [163]. Hence, many approaches to learning how to efficiently acquire informative visual observations exist in the literature.

Initial attempts targeted active recognition. The objective was to learn which action to take to remove ambiguity from perceived objects. For instance, [164] trained a recurrent neural network to learn motion policies that improve the internal representation of the environment conditioned on all past views. Similarly, [165] learned visual feature representations conditioned on driving inputs to predict the action that leads to the next best view for more accurate recognition.

In the context of visual SLAM, active perception is more concerned with reducing the uncertainty in the estimates. To this effect, [166] used reinforcement learning to train an agent to select actions that reduce its uncertainty about the unobserved parts of the environment. The proposed method was able to understand 3D shapes from very few viewpoints.

IX. CHALLENGES AND FUTURE OPPORTUNITIES

Although deep learning has shown astonishing success in solving the visual SLAM problem, there are still many open challenges that need careful attention.
A. VSLAM WITH HIGH LEVEL SEMANTIC MAPS

Most of the work that has been carried out in VSLAM is limited to representing the appearance and geometrical structure of the environment. Although there have lately been some attempts to equip VSLAM systems with semantic understanding, the extracted semantic information is generally confined to segmenting the different entity classes present in the environment. However, a true comprehension of the real world should go beyond merely recognizing what is present in the surroundings. It also needs knowledge of what is happening, what each entity is doing, whether a place is safe or risky, how each element interacts with the others, and the context in which the actions are taking place. All of these issues should be addressed in order to create more meaningful semantic maps that will strongly enhance system performance on other downstream tasks. In our opinion, one major challenge that requires attention to enhance the significance of semantic maps is maintaining consistency among semantic classes across consecutive frames. Presently, the majority of studies in this field focus only on extracting semantics from the current frame, neglecting the linking of previously found semantics. However, it is often the case that consecutive frames contain numerous shared elements, and instances of the same object should not deviate greatly from their location in both frames. As such, it is imperative to investigate new approaches that can evaluate both spatial and temporal consistency among discovered semantics, in order to generate more accurate and meaningful semantic maps.
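As a toy illustration of the kind of spatio-temporal consistency check advocated here, the sketch below associates instance masks across two consecutive frames by intersection-over-union and flags matched instances whose class label changes. The IoU threshold and the mask representation are assumptions, not a method drawn from the surveyed literature.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean instance masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def check_temporal_consistency(prev_instances, curr_instances, iou_thr=0.5):
    """Match instances across consecutive frames by mask overlap and report
    matches whose semantic class changed (a likely labeling inconsistency).

    Each instance is a dict with a boolean 'mask' and a string 'label'."""
    inconsistencies = []
    for prev in prev_instances:
        for curr in curr_instances:
            overlap = mask_iou(prev["mask"], curr["mask"])
            if overlap >= iou_thr and prev["label"] != curr["label"]:
                inconsistencies.append((prev["label"], curr["label"]))
    return inconsistencies

# Toy usage: the same region is labeled 'chair' then 'sofa' -> flagged.
m = np.zeros((8, 8), dtype=bool)
m[2:6, 2:6] = True
print(check_temporal_consistency([{"mask": m, "label": "chair"}],
                                 [{"mask": m, "label": "sofa"}]))
```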
B. EFFICIENCY

Despite the great accuracy of many deep learning VSLAM methods, there has been little effort to reduce the computation and energy consumption of the proposed solutions, which are both critical for real-time tasks and long-term autonomy. Most of the proposed VSLAM solutions were tested under constrained environmental conditions, primarily within indoor settings and with few dynamic elements, thereby reducing the computational demands of such systems. However, many real-world applications such as oceanic exploration, search and rescue operations, or space exploration need to operate for extended periods of time over large-scale and constantly changing dynamic environments. For these applications, the internal map and trajectory representations entail substantial memory usage as a result of the ongoing exploration of new areas and the need to track and distinguish multiple dynamic elements from the static surroundings simultaneously. This may be unsuitable for deployment on low-resource devices or when the environment becomes very large. One idea to address this issue is to construct maps that include only those elements of the scene that are relevant for the task at hand. Although some recent works proposed compact representations based on learning map embeddings [167], [168], [169], [170], they were only tested in static environments with simple illumination variations. It would therefore be very interesting if such representations could be extended to more challenging contexts.
C. INTERPRETABILITY

Deep learning techniques have enjoyed tremendous success in solving the VSLAM problem. However, the proposed models are often regarded as black boxes, and most of the underlying mechanisms are kept hidden. In most real-life scenarios, it is strongly desired to comprehend why the deep model produces a given prediction. This may reinforce people's trust in deep learning based solutions and help with their wide real-world adoption. Without intermediary guidance to solve a learning task, the predictions of a deep learning model generally rely on complex features of the inputs. For instance, it has been shown in [171] that, in a traditional CNN, a high-level filter may represent a mixture of patterns. A task as simple as identifying cats in an image may simultaneously examine the head and the leg parts of a cat to produce its predictions. This leads to entangled representations that are hard to interpret. Most of the research addressing deep learning interpretability has been carried out in a post-processing stage. The main objective is to detect the most important features the model pays attention to when trained on a learning task. For instance, the authors of [172] presented Network Dissection, a general framework for interpreting the latent representations learned by a CNN model at different layers. In [173], a new method was introduced to analyse the most critical frames and spatial features that a 3D CNN and a convolutional LSTM attend to when solving a video classification task. However, in our opinion, those approaches are only able to explain representations that are easily interpretable, and most of the complex features may be ignored. Few authors have tried to force their network to attend to interpretable features of the input at the learning stage. An interesting approach was presented in [171] to increase CNN interpretability. The authors suggested adding an additional loss on each high-level filter that encourages its activation only in regions that are close in space, therefore forcing it to describe a single part of the object. However, the field of interpretable deep learning is still in its infancy, and its application to VSLAM is very little studied.
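A crude way to picture such a localization-encouraging loss is sketched below: it penalizes the spatial spread of a filter's activation mass around its centroid. This is only a simplified stand-in for the filter loss of [171], whose actual formulation differs; the penalty and all names here are assumptions.

```python
import numpy as np

def spatial_spread_penalty(activation):
    """Penalize a filter whose (non-negative) activation map is spread out.

    The penalty is the activation-weighted variance of pixel coordinates around
    the activation centroid: a filter firing on a single compact blob gets a low
    penalty, one firing all over the image gets a high penalty."""
    act = np.maximum(activation, 0.0)
    total = act.sum()
    if total == 0:
        return 0.0
    ys, xs = np.mgrid[0:act.shape[0], 0:act.shape[1]]
    cy = (act * ys).sum() / total
    cx = (act * xs).sum() / total
    return float((act * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum() / total)

# A compact blob scores much lower than a scattered activation pattern.
compact = np.zeros((16, 16)); compact[3:5, 3:5] = 1.0
scattered = np.zeros((16, 16)); scattered[::5, ::5] = 1.0
print(spatial_spread_penalty(compact), spatial_spread_penalty(scattered))
```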
D. GENERALIZATION

In general, deep learning based VSLAM solutions are trained on specific scenarios, and the validity of the model is ensured by testing on unseen parts of the dataset with some of the configurations changed. However, nothing guarantees that all the situations that may arise in a real deployment have been covered at the training stage. We argue that the lack of generalization capability in deep learning based VSLAM methods is mainly due to the difficulty of constructing a dataset that encompasses all potential scenarios that a vehicle may encounter. The existing datasets primarily consist of data gathered under typical conditions, such as favorable weather, normal traffic flow, and sound road infrastructure. Nevertheless, when deployed, a VSLAM system may be subject to adverse weather conditions such as heavy rain or snow. This can add visibility concerns that, if not already handled at the training stage, may lead to impaired performance. Additionally, unexpected events, such as animal crossings, temporary constructions, road damage, or anomalous driver behavior, may have a significant impact if not detected by the model. Therefore, it is of utmost importance to consider such edge cases in the construction of the dataset. Furthermore, the collected data may be biased towards more prevalent examples, such as straight driving. As a result, balancing the dataset and ensuring its diversity is another crucial factor that must be addressed to achieve generalization in the trained models. Last, the authors of [174] found a way to apply imperceptible changes to the input images that completely mislead the predictions of previously correctly classified inputs. This indicates that merely testing on an unseen dataset is not sufficient to endow the deep model with robust generalization capabilities. Designing new methods for proving the generalization of deep learning models is thus necessary. Perhaps a good direction may be to assess deep learning performance under stress conditions, with aggressive driving behaviours, or by introducing meaningful perturbations on parts of the input.
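To illustrate the kind of imperceptible perturbation referred to above, the sketch below implements the later fast-gradient-sign method rather than the optimization-based attack of [174]; the tiny placeholder classifier and the loss choice are assumptions, and epsilon bounds how small (and visually imperceptible) the perturbation stays.

```python
import torch
import torch.nn as nn

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Fast-gradient-sign perturbation: one signed gradient step on the input,
    bounded by epsilon so the change stays visually imperceptible."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Toy usage with a placeholder linear classifier on a 3x32x32 image.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
image = torch.rand(1, 3, 32, 32)
label = torch.tensor([3])
adv = fgsm_perturb(model, image, label)
print((adv - image).abs().max())  # bounded by epsilon
```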
E. PROBABILISTIC 3D VSLAM

The majority of deep learning-based VSLAM architectures approach the problem by means of deterministic map and robot pose predictions, neglecting its inherent probabilistic nature and leading in most situations to sub-optimal solutions. Recent efforts have shown that it is possible to learn VSLAM in an end-to-end fashion by encoding the probabilistic dependence between its underlying components in a differentiable manner. However, the proposed uncertainty-aware methods were only applied to simple 2D static environments. It would be very interesting to extend such approaches to more challenging 3D and dynamic environments.

F. REAL WORLD DEPLOYMENT

In the context of VSLAM, prediction accuracy is generally regarded as the gold standard by which to evaluate the proposed methods, neglecting other important factors that may impact their practical deployment in the real world. For example, system analysis in failure situations is little studied in the literature. Many sources of failure may be encountered during deployment, including measurement inaccuracies, hardware malfunctions, and incorrect predictions. Most of the proposed approaches blindly perform the robot pose and map prediction without any assessment mechanism. However, it is well known that deep learning predictions are not immune to errors. Moreover, in VSLAM, the error at a single stage can accumulate, leading to subsequent mispredictions. This raises safety concerns about the use of VSLAM solutions in the real world.

Besides, most proposed methods are evaluated using datasets with unchanging environmental characteristics (e.g., same layout, weather, illumination properties, etc.). This may pose an issue if the system is deployed for a long period of time, especially when drastic changes in the environment happen.

Lastly, robustness to unexpected events has not been well researched in the literature. For example, robot motion on uneven terrain, a push from a moving obstacle, or impaired vision due to insufficient lighting, rain, snow, or fog may induce perturbations in both localization and perception. Hence, assessing robustness to those kinds of uncertainties is of utmost importance for reliable and safe real-world deployment.

X. CONCLUSION

This work provides a comprehensive overview of deep learning methods to solve the visual SLAM problem. It proposes a novel taxonomy, covering the subject from various perspectives. Applying learning strategies to tackle robots' localization and perception remarkably boosts VSLAM performance. It allows robots to benefit from deep learning architectures' tremendous capacity to capture complex, hard-to-model features of the environment and to easily cope with the various uncertainties of visual sensory inputs. As a result, more robust solutions for real-world applications are available. In addition, deep learning methods can be easily optimised for the task of interest in a purely data-driven and human-intervention-free manner. This makes them a compelling alternative to classical hand-crafted solutions. Although deep learning based VSLAM is still in its infancy, it has outperformed classical state-of-the-art methods in many challenging scenarios, including environments with variable illumination, few or repetitive textures, occlusions, and dynamic elements. This suggests that deep learning methods are the way to go for making robots able to perceive, understand, and act in the real world.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, ''Simultaneous localization and mapping: Part I,'' IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99–110, Jun. 2006.
[2] T. Bailey and H. Durrant-Whyte, ''Simultaneous localization and mapping (SLAM): Part II,'' IEEE Robot. Autom. Mag., vol. 13, no. 3, pp. 108–117, Sep. 2006.
[3] F. Zeng, C. Wang, and S. S. Ge, ''A survey on visual navigation for artificial agents with deep reinforcement learning,'' IEEE Access, vol. 8, pp. 135426–135442, 2020.
[4] J. A. M. Rodriguez, Laser Scanner Technology. Norderstedt, Germany: Books on Demand, 2012.
[5] T. Neff, The Laser That's Changing the World. Amherst, MA, USA: Prometheus Books, 2018.
[6] G. R. Curry, Radar Essentials: A Concise Handbook for Radar Design and Performance Analysis, vol. 2. Edison, NJ, USA: IET, 2011.
[7] N. Kolev, Sonar Systems. Norderstedt, Germany: Books on Demand, 2011.
[8] X. Zhang, J. Lai, D. Xu, H. Li, and M. Fu, ''2D LiDAR-based SLAM and path planning for indoor rescue using mobile robots,'' J. Adv. Transp., vol. 2020, pp. 1–14, Nov. 2020.
[9] D. Droeschel and S. Behnke, ''Efficient continuous-time SLAM for 3D LiDAR-based online mapping,'' in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 5000–5007.
[10] J. Ruan, B. Li, Y. Wang, and Z. Fang, ''GP-SLAM+: Real-time 3D LiDAR SLAM based on improved regionalized Gaussian process map reconstruction,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2020, pp. 5171–5178.
[11] G. Ros, A. Sappa, D. Ponsa, and A. M. Lopez, ''Visual SLAM for driverless cars: A brief survey,'' in Proc. Intell. Vehicles Symp. (IV) Workshops, vol. 2, 2012, pp. 1–6.
[12] B. Tang and S. Cao, ''A review of VSLAM technology applied in augmented reality,'' IOP Conf. Ser., Mater. Sci. Eng., vol. 782, no. 4, Mar. 2020, Art. no. 042014.
[13] Y. Cheng, M. Maimone, and L. Matthies, ''Visual odometry on the Mars exploration rovers,'' in Proc. IEEE Int. Conf. Syst., Man Cybern., vol. 1, Oct. 2005, pp. 903–910.
[14] L. Ruotsalainen, S. Gröhn, M. Kirkko-Jaakkola, L. Chen, R. Guinness, [39] D. Scaramuzza and F. Fraundorfer, ‘‘Visual odometry [tutorial],’’ IEEE
and H. Kuusniemi, ‘‘Monocular visual SLAM for tactical situational Robot. Autom. Mag., vol. 18, no. 4, pp. 80–92, Dec. 2011.
awareness,’’ in Proc. Int. Conf. Indoor Positioning Indoor Navigat. [40] F. Fraundorfer and D. Scaramuzza, ‘‘Visual odometry: Part II: Matching,
(IPIN), Oct. 2015, pp. 1–9. robustness, optimization, and applications,’’ IEEE Robot. Autom. Mag.,
[15] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, ‘‘MonoSLAM: vol. 19, no. 2, pp. 78–90, Jun. 2012.
Real-time single camera SLAM,’’ IEEE Trans. Pattern Anal. Mach. [41] S. Zagoruyko and N. Komodakis, ‘‘Learning to compare image patches
Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007. via convolutional neural networks,’’ in Proc. IEEE Conf. Comput. Vis.
[16] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, ‘‘ORB-SLAM: A Pattern Recognit. (CVPR), Jun. 2015, pp. 4353–4361.
versatile and accurate monocular SLAM system,’’ IEEE Trans. Robot., [42] A. Shaked and L. Wolf, ‘‘Improved stereo matching with con-
vol. 31, no. 5, pp. 1147–1163, Oct. 2015. stant highway networks and reflective confidence learning,’’ in Proc.
[17] D. G. Lowe, ‘‘Distinctive image features from scale-invariant keypoints,’’ IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Dec. 2004. pp. 4641–4650.
[18] E. Karami, S. Prasad, and M. Shehata, ‘‘Image matching using SIFT, [43] H. Park and K. M. Lee, ‘‘Look wider to match image patches with
SURF, BRIEF and ORB: Performance comparison for distorted images,’’ convolutional neural networks,’’ IEEE Signal Process. Lett., vol. 24,
2017, arXiv:1710.02726. no. 12, pp. 1788–1792, Dec. 2017.
[19] A. M. Andrew, Multiple View Geometry in Computer Vision, R. Hartley [44] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang, ‘‘A deep visual corre-
and A. Zisserman, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2000, spondence embedding model for stereo matching costs,’’ in Proc. IEEE
p. 607. Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 972–980.
[20] Y. Shavit and R. Ferens, ‘‘Introduction to camera pose estimation with [45] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and
deep learning,’’ 2019, arXiv:1907.05272. T. Brox, ‘‘A large dataset to train convolutional networks for disparity,
[21] X. Zhang, L. Wang, and Y. Su, ‘‘Visual place recognition: A survey optical flow, and scene flow estimation,’’ in Proc. IEEE Conf. Comput.
from deep learning perspective,’’ Pattern Recognit., vol. 113, May 2021, Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4040–4048.
Art. no. 107760. [46] D. Hoiem, A. A. Efros, and M. Hebert, ‘‘Automatic photo pop-up,’’ in
[22] S. Arshad and G.-W. Kim, ‘‘Role of deep learning in loop closure detec- Proc. ACM SIGGRAPH Papers, Jul. 2005, pp. 577–584.
tion for visual and LiDAR SLAM: A survey,’’ Sensors, vol. 21, no. 4,
[47] D. Eigen, C. Puhrsch, and R. Fergus, ‘‘Depth map prediction from a
p. 1243, Feb. 2021.
single image using a multi-scale deep network,’’ in Proc. Adv. Neural Inf.
[23] F. Martín, F. González, J. M. Guerrero, M. Fernández, and J. Ginés, Process. Syst., vol. 27, 2014, pp. 1–9.
‘‘Semantic 3D mapping from deep image segmentation,’’ Appl. Sci.,
[48] W. Chen, Z. Fu, D. Yang, and J. Deng, ‘‘Single-image depth perception
vol. 11, no. 4, p. 1953, Feb. 2021.
in the wild,’’ 2016, arXiv:1604.03901.
[24] S. Thrun, ‘‘Probabilistic algorithms in robotics,’’ AI Mag., vol. 21, no. 4,
[49] D. Eigen and R. Fergus, ‘‘Predicting depth, surface normals and semantic
p. 93, 2000.
labels with a common multi-scale convolutional architecture,’’ in Proc.
[25] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2650–2658.
I. Reid, and J. J. Leonard, ‘‘Past, present, and future of simultaneous local-
[50] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, ‘‘Deep
ization and mapping: Toward the robust-perception age,’’ IEEE Trans.
ordinal regression network for monocular depth estimation,’’ in
Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[26] C. Chen, B. Wang, C. X. Lu, N. Trigoni, and A. Markham, ‘‘A survey on
pp. 2002–2011.
deep learning for localization and mapping: Towards the age of spatial
machine intelligence,’’ 2020, arXiv:2006.12567. [51] J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and
J. Civera, ‘‘CAM-convs: Camera-aware multi-scale convolutions for
[27] T. Taketomi, H. Uchiyama, and S. Ikeda, ‘‘Visual SLAM algorithms: A
single-view depth,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
survey from 2010 to 2016,’’ IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1,
Recognit. (CVPR), Jun. 2019, pp. 11826–11835.
pp. 1–11, Dec. 2017.
[28] M. Servières, V. Renaudin, A. Dupuis, and N. Antigny, ‘‘Visual and [52] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, ‘‘Unsupervised learning
visual-inertial SLAM: State of the art, classification, and experimental of depth and ego-motion from video,’’ in Proc. IEEE Conf. Comput. Vis.
benchmarking,’’ J. Sensors, vol. 2021, pp. 1–26, Feb. 2021. Pattern Recognit. (CVPR), Jul. 2017, pp. 1851–1858.
[29] C. Duan, S. Junginger, J. Huang, K. Jin, and K. Thurow, ‘‘Deep learning [53] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, ‘‘Learning depth
for visual SLAM in transportation robotics: A review,’’ Transp. Saf. from monocular videos using direct methods,’’ in Proc. IEEE/CVF Conf.
Environ., vol. 1, no. 3, pp. 177–184, Dec. 2019. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2022–2030.
[30] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, ‘‘Monocular depth [54] R. Garg, V. K. Bg, G. Carneiro, and I. Reid, ‘‘Unsupervised CNN for
estimation based on deep learning: An overview,’’ Sci. China Technol. single view depth estimation: Geometry to the rescue,’’ in Proc. Eur. Conf.
Sci., vol. 63, pp. 1612–1627, Jun. 2020. Comput. Vis. USA: Springer, 2016, pp. 740–756.
[31] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, ‘‘A survey on [55] C. Godard, O. M. Aodha, and G. J. Brostow, ‘‘Unsupervised monocular
deep learning techniques for stereo-based depth estimation,’’ IEEE Trans. depth estimation with left-right consistency,’’ in Proc. IEEE Conf. Com-
Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 1738–1764, Apr. 2022. put. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 270–279.
[32] S. Savian, M. Elahi, and T. Tillo, ‘‘Optical flow estimation with deep [56] D. Fortun, P. Bouthemy, and C. Kervrann, ‘‘Optical flow modeling and
learning, a survey on recent advances,’’ in Deep Biometrics. USA: computation: A survey,’’ Comput. Vis. Image Understand., vol. 134,
Springer, 2020, pp. 257–287. pp. 1–21, May 2015.
[33] J. Hur and S. Roth, ‘‘Optical flow estimation in the deep learning age,’’ [57] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić,
in Modelling Human Motion. USA: Springer, 2020, pp. 119–140. X. Wang, and P. Westling, ‘‘High-resolution stereo datasets with
[34] K. Wang, S. Ma, J. Chen, F. Ren, and J. Lu, ‘‘Approaches, challenges, and subpixel-accurate ground truth,’’ in Pattern Recognition. Münster,
applications for deep visual odometry: Toward complicated and emerging Germany: Springer, Sep. 2014, pp. 31–42.
areas,’’ IEEE Trans. Cogn. Develop. Syst., vol. 14, no. 1, pp. 35–49, [58] M. Menze and A. Geiger, ‘‘Object scene flow for autonomous vehicles,’’
Mar. 2022. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015,
[35] G. Klein and D. Murray, ‘‘Parallel tracking and mapping for small AR pp. 3061–3070.
workspaces,’’ in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented [59] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, ‘‘Indoor segmentation
Reality, Nov. 2007, pp. 225–234. and support inference from RGBD images,’’ in Proc. ECCV, vol. 7576,
[36] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, ‘‘DTAM: Dense 2012, pp. 746–760.
tracking and mapping in real-time,’’ in Proc. Int. Conf. Comput. Vis., [60] A. Saxena, M. Sun, and A. Y. Ng, ‘‘Make3D: Learning 3D scene structure
Nov. 2011, pp. 2320–2327. from a single still image,’’ IEEE Trans. Pattern Anal. Mach. Intell.,
[37] J. Engel, T. Schöps, and D. Cremers, ‘‘LSD-SLAM: Large-scale direct vol. 31, no. 5, pp. 824–840, May 2008.
monocular SLAM,’’ in Proc. Eur. Conf. Comput. Vis. USA: Springer, [61] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov,
2014, pp. 834–849. P. Van Der Smagt, D. Cremers, and T. Brox, ‘‘FlowNet: Learning optical
[38] J. Engel, V. Koltun, and D. Cremers, ‘‘Direct sparse odometry,’’ IEEE flow with convolutional networks,’’ in Proc. IEEE Int. Conf. Comput. Vis.
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2017. (ICCV), Dec. 2015, pp. 2758–2766.
[62] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, [87] G. L. Oliveira, N. Radwan, W. Burgard, and T. Brox, ‘‘Topometric
‘‘FlowNet 2.0: Evolution of optical flow estimation with deep networks,’’ localization with deep learning,’’ in Robotics Research. USA: Springer,
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, 2020, pp. 505–520.
pp. 2462–2470. [88] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, ‘‘PointNet:
[63] A. Ranjan and M. J. Black, ‘‘Optical flow estimation using a spatial Deep learning on point sets for 3D classification and segmentation,’’
pyramid network,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
(CVPR), Jul. 2017, pp. 4161–4170. pp. 652–660.
[64] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, ‘‘PWC-Net: CNNs for optical [89] Y. Zhou, C. Wu, Z. Li, C. Cao, Y. Ye, J. Saragih, H. Li, and Y. Sheikh,
flow using pyramid, warping, and cost volume,’’ in Proc. IEEE/CVF Conf. ‘‘Fully convolutional mesh autoencoder using efficient spatially varying
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8934–8943. kernels,’’ 2020, arXiv:2006.04325.
[65] Z. Teed and J. Deng, ‘‘RAFT: Recurrent all-pairs field transforms for [90] M. Muglikar, Z. Zhang, and D. Scaramuzza, ‘‘Voxel map for visual
optical flow,’’ in Proc. Eur. Conf. Comput. Vis. USA: Springer, 2020, SLAM,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020,
pp. 402–419. pp. 4181–4187.
[66] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha, ‘‘Unsupervised deep [91] J. Malik, I. Abdelaziz, A. Elhayek, S. Shimada, S. A. Ali, V. Golyanik,
learning for optical flow estimation,’’ in Proc. AAAI Conf. Artif. Intell., C. Theobalt, and D. Stricker, ‘‘HandVoxNet: Deep voxel-based network
vol. 31, 2017, pp. 1–7. for 3D hand shape and pose estimation from a single depth map,’’ in Proc.
[67] Y. Zhu, Z. Lan, S. Newsam, and A. G. Hauptmann, ‘‘Guided optical flow IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
learning,’’ 2017, arXiv:1702.02295. pp. 7113–7122.
[68] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, ‘‘Occlusion [92] X. Han, S. Li, X. Wang, and W. Zhou, ‘‘Semantic mapping for mobile
aware unsupervised learning of optical flow,’’ in Proc. IEEE/CVF Conf. robots in indoor scenes: A survey,’’ Information, vol. 12, no. 2, p. 92,
Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4884–4893. Feb. 2021.
[69] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, ‘‘Unsupervised [93] D. Lu and Q. Weng, ‘‘A survey of image classification methods and
learning of multi-frame optical flow with occlusions,’’ in Proc. Eur. Conf. techniques for improving classification performance,’’ Int. J. Remote
Comput. Vis. (ECCV), 2018, pp. 690–706. Sens., vol. 28, no. 5, pp. 823–870, 2007.
[70] Y. Zhong, P. Ji, J. Wang, Y. Dai, and H. Li, ‘‘Unsupervised deep epipolar [94] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and
flow for stationary or dynamic scenes,’’ in Proc. IEEE/CVF Conf. Com- M. Pietikäinen, ‘‘Deep learning for generic object detection: A survey,’’
put. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12095–12104. Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, Oct. 2020.
[71] B. Liao, J. Hu, and R. O. Gilmore, ‘‘Optical flow estimation com- [95] L. Guo, ‘‘Indoor scene reconstruction using the Manhattan assumption,’’
bining with illumination adjustment and edge refinement in live- Ph.D. thesis, Tianjin Univ., China, 2015.
stock UAV videos,’’ Comput. Electron. Agricult., vol. 180, Jan. 2021, [96] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, ‘‘Seman-
Art. no. 105910. ticFusion: Dense 3D semantic mapping with convolutional neural net-
[72] Y. Zheng, M. Zhang, and F. Lu, ‘‘Optical flow in the dark,’’ in Proc. works,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 4628–4635.
pp. 6749–6757. [97] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison,
[73] W. Yan, A. Sharma, and R. T. Tan, ‘‘Optical flow in dense foggy scenes ‘‘ElasticFusion: Dense SLAM without a pose graph,’’ in Proc. Robot.,
using semi-supervised learning,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Sci. Syst. XI, Jul. 2015, pp. 1–9.
Pattern Recognit. (CVPR), Jun. 2020, pp. 13259–13268. [98] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler,
[74] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, ‘‘A naturalistic open R. Siegwart, and J. Nieto, ‘‘Incremental object database: Building 3D
source movie for optical flow evaluation,’’ in Computer Vision—ECCV models from multiple partial observations,’’ in Proc. IEEE/RSJ Int. Conf.
2012. Florence, Italy: Springer, Oct. 2012, pp. 611–625. Intell. Robots Syst. (IROS), Oct. 2018, pp. 6835–6842.
[75] W. Hartmann, M. Havlena, and K. Schindler, ‘‘Predicting matchability,’’ [99] K. Tateno, F. Tombari, and N. Navab, ‘‘Real-time and scalable incremen-
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 9–16. tal segmentation on dense SLAM,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
[76] Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit, ‘‘TILDE: A temporally Robots Syst. (IROS), Sep. 2015, pp. 4465–4472.
invariant learned DEtector,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [100] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, ‘‘Meaningful
Recognit. (CVPR), Jun. 2015, pp. 5279–5288. maps with object-oriented semantic mapping,’’ in Proc. IEEE/RSJ Int.
[77] M. Brown, G. Hua, and S. Winder, ‘‘Discriminative learning of local Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079–5085.
image descriptors,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, [101] S. Yang, Y. Huang, and S. Scherer, ‘‘Semantic 3D occupancy mapping
no. 1, pp. 43–57, Jan. 2011. through efficient high order CRFs,’’ in Proc. IEEE/RSJ Int. Conf. Intell.
[78] K. Simonyan, A. Vedaldi, and A. Zisserman, ‘‘Learning local feature Robots Syst. (IROS), Sep. 2017, pp. 590–597.
descriptors using convex optimisation,’’ IEEE Trans. Pattern Anal. Mach. [102] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart,
Intell., vol. 36, no. 8, pp. 1573–1585, Aug. 2014. and J. Nieto, ‘‘Volumetric instance-aware semantic mapping and 3D
[79] C. Liu, J. Yuen, and A. Torralba, ‘‘Sift flow: Dense correspondence across object discovery,’’ IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 3037–3044,
scenes and its applications,’’ IEEE Trans. Pattern Anal. Mach. Intell., Jul. 2019.
vol. 33, no. 5, pp. 978–994, Aug. 2011. [103] M. Jaimez, C. Kerl, J. Gonzalez-Jimenez, and D. Cremers, ‘‘Fast odom-
[80] K. Konda and R. Memisevic, ‘‘Learning visual odometry with a convolu- etry and scene flow from RGB-D cameras based on geometric clus-
tional network,’’ in Proc. VISAPP, 2015, pp. 486–490. tering,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
[81] M. Weber, C. Rist, and J. M. Zöllner, ‘‘Learning temporal features with pp. 3992–3999.
CNNs for monocular visual ego motion estimation,’’ in Proc. IEEE 20th [104] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers,
Int. Conf. Intell. Transp. Syst. (ITSC), Oct. 2017, pp. 1–6. ‘‘StaticFusion: Background reconstruction for dense RGB-D SLAM in
[82] S. Wang, R. Clark, H. Wen, and N. Trigoni, ‘‘DeepVO: Towards end- dynamic environments,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
to-end visual odometry with deep recurrent convolutional neural net- May 2018, pp. 3849–3856.
works,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, [105] Y. Ai, T. Rui, M. Lu, L. Fu, S. Liu, and S. Wang, ‘‘DDL-SLAM: A robust
pp. 2043–2050. RGB-D SLAM in dynamic environments combined with deep learning,’’
[83] A. Geiger, P. Lenz, and R. Urtasun, ‘‘Are we ready for autonomous driv- IEEE Access, vol. 8, pp. 162335–162342, 2020.
ing? The KITTI vision benchmark suite,’’ in Proc. IEEE Conf. Comput. [106] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers,
Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361. ‘‘A benchmark for the evaluation of RGB-D SLAM systems,’’ in Proc.
[84] R. Li, S. Wang, Z. Long, and D. Gu, ‘‘UnDeepVO: Monocular visual IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
odometry through unsupervised deep learning,’’ in Proc. IEEE Int. Conf. [107] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, ‘‘Visual
Robot. Autom. (ICRA), May 2018, pp. 7286–7291. categorization with bags of keypoints,’’ in Proc. Workshop Stat.
[85] J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and Learn. Comput. Vis. (ECCV), vol. 1, Prague, Czech Republic, 2004,
I. Reid, ‘‘Unsupervised scale-consistent depth and ego-motion learning pp. 1–2.
from monocular video,’’ 2019, arXiv:1908.10553. [108] F. Perronnin and C. Dance, ‘‘Fisher kernels on visual vocabularies for
[86] S. Thrun, ‘‘Learning metric-topological maps for indoor mobile robot image categorization,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
navigation,’’ Artif. Intell., vol. 99, no. 1, pp. 21–71, Feb. 1998. nit., Jun. 2007, pp. 1–8.
[109] H. Jégou, M. Douze, C. Schmid, and P. Pérez, ‘‘Aggregating local descrip- [132] R. Jonschkowski, D. Rastogi, and O. Brock, ‘‘Differentiable par-
tors into a compact image representation,’’ in Proc. IEEE Comput. Soc. ticle filters: End-to-end learning with algorithmic priors,’’ 2018,
Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3304–3311. arXiv:1805.11122.
[110] R. Arandjelovic and A. Zisserman, ‘‘All about VLAD,’’ in Proc. IEEE [133] P. Karkus, A. Angelova, V. Vanhoucke, and R. Jonschkowski, ‘‘Dif-
Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1578–1585. ferentiable mapping networks: Learning structured map representations
[111] K. L. Ho and P. Newman, ‘‘Detecting loop closure with scene sequences,’’ for sparse visual localization,’’ in Proc. IEEE Int. Conf. Robot. Autom.
Int. J. Comput. Vis., vol. 74, no. 3, pp. 261–286, Sep. 2007. (ICRA), May 2020, pp. 4753–4759.
[112] A. Angeli, S. Doncieux, J.-A. Meyer, and D. Filliat, ‘‘Real-time [134] R. Li, S. Wang, and D. Gu, ‘‘DeepSLAM: A robust monocular SLAM
visual loop-closure detection,’’ in Proc. IEEE Int. Conf. Robot. Autom., system with unsupervised deep learning,’’ IEEE Trans. Ind. Electron.,
May 2008, pp. 1842–1847. vol. 68, no. 4, pp. 3577–3587, Apr. 2021.
[113] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification [135] P. Karkus, S. Cai, and D. Hsu, ‘‘Differentiable SLAM-Net: Learning
with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. particle SLAM for visual navigation,’’ in Proc. IEEE/CVF Conf. Comput.
Process. Syst., vol. 25, 2012, pp. 1097–1105. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 2815–2825.
[114] P. Fischer, A. Dosovitskiy, and T. Brox, ‘‘Descriptor matching [136] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, ‘‘FastSLAM: A
with convolutional neural networks: A comparison to SIFT,’’ 2014, factored solution to the simultaneous localization and mapping problem,’’
arXiv:1405.5769. in Proc. AAAI/IAAI, 2002, pp. 593–598.
[115] Z. Chen, O. Lam, A. Jacobson, and M. Milford, ‘‘Convolutional neural [137] J. Wu and H. Zhang, ‘‘Camera sensor model for visual SLAM,’’ in Proc.
network-based place recognition,’’ 2014, arXiv:1411.1509. 4th Can. Conf. Comput. Robot Vis. (CRV), May 2007, pp. 149–156.
[116] Y. Hou, H. Zhang, and S. Zhou, ‘‘Convolutional neural network-based [138] M. Klodt and A. Vedaldi, ‘‘Supervising the new with the old: Learning
image representation for visual loop closure detection,’’ in Proc. IEEE SFM from SFM,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018,
Int. Conf. Inf. Autom., Aug. 2015, pp. 2238–2245. pp. 698–713.
[117] Y. Xia, J. Li, L. Qi, H. Yu, and J. Dong, ‘‘An evaluation of deep learning in [139] V. Peretroukhin and J. Kelly, ‘‘DPC-Net: Deep pose correction for visual
loop closure detection for visual SLAM,’’ in Proc. IEEE Int. Conf. Inter- localization,’’ IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2424–2431,
net Things (iThings) IEEE Green Comput. Commun. (GreenCom) IEEE Jul. 2018.
Cyber, Phys. Social Comput. (CPSCom) IEEE Smart Data (SmartData), [140] T. Drummond and R. Cipolla, ‘‘Visual tracking and control using lie alge-
Jun. 2017, pp. 85–91. bras,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.,
[118] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, ‘‘PCANet: A simple vol. 2, Jun. 1999, pp. 652–657.
deep learning baseline for image classification?’’ IEEE Trans. Image [141] R. Azzam, Y. Alkendi, T. Taha, S. Huang, and Y. Zweiri, ‘‘A stacked
Process, vol. 24, no. 12, pp. 5017–5032, Dec. 2015. LSTM-based approach for reducing semantic pose estimation error,’’
[119] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, IEEE Trans. Instrum. Meas., vol. 70, pp. 1–14, 2021.
S. Guadarrama, and T. Darrell, ‘‘Caffe: Convolutional architecture for [142] P. Ganti and S. L. Waslander, ‘‘Network uncertainty informed semantic
fast feature embedding,’’ in Proc. 22nd ACM Int. Conf. Multimedia, feature selection for visual SLAM,’’ in Proc. 16th Conf. Comput. Robot
Nov. 2014, pp. 675–678. Vis. (CRV), May 2019, pp. 121–128.
[120] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, [143] E. Goan and C. Fookes, ‘‘Bayesian neural networks: An introduction
V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ and survey,’’ in Case Studies in Applied Bayesian Data Science. USA:
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, Springer, 2020, pp. 45–87.
pp. 1–9. [144] Y. Gal and Z. Ghahramani, ‘‘Bayesian convolutional neural networks with
[121] A. Oliva and A. Torralba, ‘‘Modeling the shape of the scene: A holistic Bernoulli approximate variational inference,’’ 2015, arXiv:1506.02158.
representation of the spatial envelope,’’ Int. J. Comput. Vis., vol. 42, no. 3, [145] A. Kendall and R. Cipolla, ‘‘Modelling uncertainty in deep learning for
pp. 145–175, 2001. camera relocalization,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
[122] X. Gao and T. Zhang, ‘‘Loop closure detection for visual SLAM systems May 2016, pp. 4762–4769.
using deep neural networks,’’ in Proc. 34th Chin. Control Conf. (CCC), [146] V. Peretroukhin, L. Clement, and J. Kelly, ‘‘Reducing drift in visual
Jul. 2015, pp. 5851–5856. odometry by inferring sun direction using a Bayesian convolutional neural
[123] N. Merrill and G. Huang, ‘‘Lightweight unsupervised deep loop closure,’’ network,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017,
2018, arXiv:1805.07703. pp. 2035–2042.
[124] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human [147] N. Yang, L. von Stumberg, R. Wang, and D. Cremers, ‘‘D3VO: Deep
detection,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern depth, deep pose and deep uncertainty for monocular visual odome-
Recognit. (CVPR), vol. 1, Jun. 2005, pp. 886–893. try,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
[125] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and Jun. 2020, pp. 1281–1292.
K. Fragkiadaki, ‘‘SfM-Net: Learning of structure and motion from [148] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari,
video,’’ 2017, arXiv:1704.07804. M. W. Achtelik, and R. Siegwart, ‘‘The EuRoC micro aerial vehicle
[126] Z. Yin and J. Shi, ‘‘GeoNet: Unsupervised learning of dense depth, optical datasets,’’ Int. J. Robot. Res., vol. 35, no. 10, pp. 1157–1163, 2016.
flow and camera pose,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern [149] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, ‘‘Cogni-
Recognit., Jun. 2018, pp. 1983–1992. tive mapping and planning for visual navigation,’’ in Proc. IEEE Conf.
[127] S. Lee, S. Im, S. Lin, and I. S. Kweon, ‘‘Learning residual flow as dynamic Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2616–2625.
motion from stereo videos,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots [150] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi,
Syst. (IROS), Nov. 2019, pp. 1180–1186. ‘‘Attention based vehicle trajectory prediction,’’ IEEE Trans. Intell. Vehi-
[128] Y. Zou, Z. Luo, and J.-B. Huang, ‘‘DF-Net: Unsupervised joint learning of cles, vol. 6, no. 1, pp. 175–185, Mar. 2021.
depth and flow using cross-task consistency,’’ in Proc. Eur. Conf. Comput. [151] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,
Vis. (ECCV), 2018, pp. 36–53. U. Franke, S. Roth, and B. Schiele, ‘‘The cityscapes dataset for semantic
[129] Q. Dai, V. Patii, S. Hecker, D. Dai, L. Van Gool, and K. Schindler, ‘‘Self- urban scene understanding,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
supervised object motion and depth estimation from video,’’ in Proc. Recognit. (CVPR), Jun. 2016, pp. 3213–3223.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), [152] L. Carlone, J. Du, M. K. Ng, B. Bona, and M. Indri, ‘‘Active SLAM
Jun. 2020, pp. 1004–1005. and exploration with particle filters using Kullback–Leibler divergence,’’
[130] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova, ‘‘Unsupervised J. Intell. Robot. Syst., vol. 75, no. 2, pp. 291–311, Aug. 2014.
monocular depth and ego-motion learning with structure and semantics,’’ [153] Q. V. Le, A. Saxena, and A. Y. Ng, ‘‘Active perception: Interac-
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops tive manipulation for improving object detection,’’ Standford Univ. J.,
(CVPRW), Jun. 2019, pp. 1–8. 2008.
[131] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and [154] R. Bajcsy, Y. Aloimonos, and J. Tsotsos, ‘‘Revisiting active perception,’’
M. J. Black, ‘‘Competitive collaboration: Joint unsupervised learning of Auto. Robots, vol. 42, no. 2, pp. 177–196, May 2018.
depth, camera motion, optical flow and motion segmentation,’’ in Proc. [155] C. Leung, S. Huang, and G. Dissanayake, ‘‘Active SLAM using model
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, predictive control and attractor based exploration,’’ in Proc. IEEE/RSJ
pp. 12240–12249. Int. Conf. Intell. Robots Syst., Oct. 2006, pp. 5026–5031.
[156] B. Yamauchi, ‘‘A frontier-based approach for autonomous exploration,’’ DANIEL BONILLA LICEA received the M.Sc.
in Proc. IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA) Towards degree from the Centro de Investigación y Estudios
New Comput. Princ. Robot. Autom., Jul. 1997, pp. 146–151. Avanzados (CINVESTAV), Mexico City, in 2011,
[157] S. Rodriguez, X. Tang, J.-M. Lien, and N. M. Amato, ‘‘An obstacle-based and the Ph.D. degree from the University of Leeds,
rapidly-exploring random tree,’’ in Proc. IEEE Int. Conf. Robot. Autom. U.K, in 2016. From May 2011 to June 2012,
(ICRA), May 2006, pp. 895–900. he was an Intern with the Signal Processing Team,
[158] L. Tai and M. Liu, ‘‘Mobile robots exploration through CNN-based Intel Labs, Guadalajara, Mexico. In 2016, he was
reinforcement learning,’’ Robot. Biomimetics, vol. 3, no. 1, pp. 1–8,
invited for a Short Research Visit with the Centre
Dec. 2016.
de Recherche en Automatique de Nancy (CRAN),
[159] T. Chen, S. Gupta, and A. Gupta, ‘‘Learning exploration policies for
navigation,’’ 2019, arXiv:1903.01959. France. In 2017, he has collaborated in a research
[160] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhut- project with the Centro de Investigación en Computacion (CIC), Mexico.
dinov, ‘‘Learning to explore using active neural SLAM,’’ 2020, From 2017 to 2020, he held a postdoctoral position with the International
arXiv:2004.05155. University of Rabat, Morocco. Currently, he holds a postdoctoral position
[161] O. Zhelo, J. Zhang, L. Tai, M. Liu, and W. Burgard, ‘‘Curiosity-driven with the Czech Technical University in Prague, Czech Republic. His research
exploration for mapless navigation with deep reinforcement learning,’’ interests include signal processing and communications-aware robotics.
2018, arXiv:1804.00456.
[162] F. Chen, S. Bai, T. Shan, and B. Englot, ‘‘Self-learning exploration and
mapping for mobile robots via deep reinforcement learning,’’ in Proc.
AIAA Scitech Forum, Jan. 2019, p. 396.
[163] K. C. Soska and S. P. Johnson, ‘‘Development of three-dimensional object
completion in infancy,’’ Child Develop., vol. 79, no. 5, pp. 1230–1236,
Sep. 2008.
[164] D. Jayaraman and K. Grauman, ‘‘Look-ahead before you leap: End-to-
end active recognition by forecasting the effect of motion,’’ in Proc. Eur.
Conf. Comput. Vis. USA: Springer, 2016, pp. 489–505.
BASSMA GUERMAH received the Engineering
[165] D. Jayaraman and K. Grauman, ‘‘Learning image representations tied to
ego-motion,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015,
degree in software engineering (major of promo-
pp. 1413–1421. tion) from the National Institute of Statistics and
[166] D. Jayaraman and K. Grauman, ‘‘Learning to look around: Intelligently Applied Economics (INSEA), in 2014, and the
exploring unseen environments for unknown tasks,’’ in Proc. IEEE/CVF Ph.D. degree in computer science and telecom-
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1238–1247. munications from the National Institute of Posts
[167] J. Zhang, L. Tai, M. Liu, J. Boedecker, and W. Burgard, ‘‘Neural SLAM: and Telecommunications (INPT), in 2018. She
Learning to explore with external memory,’’ 2017, arXiv:1706.09520. is currently a Professor with the Computer Sci-
[168] M. Zhang, K. T. Ma, S.-C. Yen, J. H. Lim, Q. Zhao, and J. Feng, ence Engineering School, International University
‘‘Egocentric spatial memory,’’ in Proc. IEEE/RSJ Int. Conf. Intell. Robots of Rabat (UIR) and a member of TICLab. Her
Syst. (IROS), Oct. 2018, pp. 137–144. research interests include machine learning/deep learning (artificial intelli-
[169] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, gence), signal processing, robotics, context-aware service-oriented comput-
‘‘CodeSLAM—Learning a compact, optimisable representation for dense ing, ontologies, and semantic web.
visual SLAM,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 2560–2568.
[170] S. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison, ‘‘SceneCode:
Monocular dense semantic reconstruction using learned encoded scene
representations,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog-
nit. (CVPR), Jun. 2019, pp. 11776–11785.
[171] Q. Zhang, Y. N. Wu, and S.-C. Zhu, ‘‘Interpretable convolutional neural
networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
Jun. 2018, pp. 8827–8836.
[172] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, ‘‘Network
dissection: Quantifying interpretability of deep visual representations,’’ MOUNIR GHOGHO (Fellow, IEEE) received the
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, M.Sc. and Ph.D. degrees from the National Poly-
pp. 6541–6549. technic Institute of Toulouse, France, in 1993 and
[173] J. Manttari, S. Broomé, J. Folkesson, and H. Kjellstrom, ‘‘Interpreting 1997, respectively. He was an EPSRC Research
video features: A comparison of 3D convolutional networks and con- Fellow with the University of Strathclyde (Scot-
volutional LSTM networks,’’ in Proc. Asian Conf. Comput. Vis., 2020, land), from September 1997 to November 2001.
pp. 1–16. In December 2001, he joined the School of Elec-
[174] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, tronic and Electrical Engineering, University of
and R. Fergus, ‘‘Intriguing properties of neural networks,’’ 2013, Leeds, U.K., where he was promoted to a Full
arXiv:1312.6199. Professor, in 2008. He is still affiliated with the
University of Leeds. In 2010, he joined the International University of
Rabat, where he is currently the Dean of the College of Doctoral Studies
and the Director of the ICT Research Laboratory (TICLab). He has coor-
dinated around 20 research projects and supervised over 30 Ph.D. students
SAAD MOKSSIT received the Engineering in the U.K. and Morocco. He is the Co-Founder and the Co-Director of the
degree in embedded and mobile systems engi- CNRS-associated with the International Research Laboratory DataNet in the
neering (major of promotion) from the l’Ecole field of big data and artificial intelligence. His research interests include
Nationale Supérieure d’Informatique et d’Analyse machine learning, signal processing, and wireless communication. He is
des Systèmes (ENSIAS), in 2018. He is currently a fellow of the Asia–Pacific AI Association (AAIA). He was a recipient
pursuing the Ph.D. degree with the International of the 2013 IBM Faculty Award and the 2000 U.K. Royal Academy of
University of Rabat (UIR). He is also a member Engineering Research Fellowship. He served as an Associate Editor for many
of TICLab. His research interests include machine journals, including the IEEE Signal Processing Magazine and the IEEE
learning/deep learning, robotics, autonomous driv- TRANSACTIONS ON SIGNAL PROCESSING.
ing, and computer vision.
Monocular depth estimation is inherently more challenging than stereo depth estimation due to the lack of a second viewpoint, which makes the task ambiguous since an infinite number of 3D scenes can generate the same 2D image. Initial methods used oversimplified models classifying pixels as sky, ground, or objects, and applied hand-crafted cues assuming a vertical stacking of objects, resulting in detail-lacking and poorly generalizable maps. Deep learning techniques address these challenges by utilizing CNNs for their superior image processing capabilities, introducing scale-invariant loss functions to overcome scale ambiguities, and employing end-to-end networks that regress depth from a single view, reducing the reliance on pixel classification. However, CNNs can produce low-resolution outputs as network depth increases if pooling operations remove too many features.
Stereo cameras offer improved accuracy in depth estimation by leveraging the parallax between two perspectives, making them more precise than monocular cameras, which must hypothesize depth from a single viewpoint. This benefit is particularly pronounced in scenarios requiring high depth accuracy, as stereo systems effectively reduce ambiguity in depth interpretation. However, stereo systems carry the disadvantages of increased cost, weight, and resource demands, making them less suitable for applications with limited resources. In contrast, monocular systems are more versatile and cost-effective, but face challenges with scale ambiguity and inherently produce less precise depth maps due to their reliance on a single viewpoint.
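As a reminder of why the parallax helps, the sketch below converts a stereo disparity into metric depth using the standard rectified-stereo pinhole relation Z = f·B/d; the focal length and baseline values are made-up examples.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Standard rectified-stereo relation: depth Z = f * B / d.
    Larger disparities (closer objects) map to smaller depths."""
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(d, np.inf)
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Example: a 720-pixel focal length and a 12 cm baseline.
print(disparity_to_depth([40.0, 8.0], focal_px=720.0, baseline_m=0.12))
# -> [ 2.16 10.8 ] metres
```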
Ordinal regression has been applied in monocular depth estimation to recast the problem into a more intuitive task for neural networks, one that aligns more closely with human perception. Humans find depth estimation easier for nearby objects than for distant ones, suggesting that depth perception can be understood as an ordinal task. This approach allows the model to focus on ranking depths relative to each other rather than predicting absolute measures, thereby improving its ability to understand spatial relationships within the scene. It mitigates the errors typically associated with predictions at larger depths by providing structured learning objectives that prioritize correct orderings in the depth map over precise scalar distances.
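The sketch below shows one common way to recast depth regression as an ordinal task: depth is discretized into bins that widen with distance (here log-spaced, loosely following the spacing-increasing discretization idea of [50], though the exact scheme there differs), and the network is asked to predict how many bin thresholds each pixel's depth exceeds.

```python
import numpy as np

def ordinal_depth_targets(depth, d_min=1.0, d_max=80.0, num_bins=10):
    """Discretize metric depth into log-spaced ordinal bins.

    Returns, for each depth value, a binary vector whose k-th entry answers
    'is the depth beyond threshold k?'. A network trained on such targets
    learns depth orderings rather than absolute scalar distances."""
    thresholds = np.geomspace(d_min, d_max, num_bins + 1)[1:-1]  # interior edges
    depth = np.asarray(depth, dtype=float)[..., None]
    return (depth > thresholds).astype(np.float32)

# Example: a near and a far pixel exceed different numbers of thresholds.
print(ordinal_depth_targets([2.0, 40.0]).sum(axis=-1))  # -> [1. 8.]
```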
Creating suitable datasets is challenging because precise optical flow ground truth labeling is complex and labor-intensive. Additionally, real-world scenes involve variable lighting and complex motions that are difficult to replicate consistently across datasets. To address this, researchers developed synthetic datasets like the 'Flying Chairs', which use controlled backgrounds and object motions to simulate realistic optical flow scenarios, allowing for the training of deep networks in a supervised manner. While synthetic, the dataset provided a sufficient baseline to illustrate the capabilities of deep learning for optical flow estimation by creating a rich set of example motions and environments.
To address the scale ambiguity problem in self-supervised depth estimation models, researchers have proposed using multi-task architectures to jointly predict both depth maps and relative poses. This approach helps accurately compute relative poses and provides a better grounding for depth estimates by incorporating knowledge of scene structure. Additionally, a differentiable implementation of direct visual odometry has been introduced to more accurately estimate camera poses, thus improving depth estimation. These strategies have proven effective in enhancing model performance by providing a means to incorporate geometric constraints and utilize known poses, reducing reliance on ground-truth depth labels and improving scalability across different environments.
Self-supervised learning automates the extraction of auxiliary supervisory signals from the raw data itself, reducing the need for manual annotation, which is time-consuming and error-prone. In visual SLAM, it exploits inter-frame geometric constraints in video: pixels of a subsequent frame are projected back onto the preceding frame, and the resulting photometric discrepancies are used as the supervisory signal. This formulation allows depth maps and relative poses between frames to be predicted simultaneously, with camera pose estimation driven by novel view synthesis. Explainability masks let the model ignore regions that are prone to error during training, improving robustness to pixels that violate the underlying assumptions (e.g., moving objects and occlusions). However, early methods struggled to account directly for the geometric coupling between depth and motion.
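In the view-synthesis formulation commonly used by these methods (the notation here is generic rather than tied to a single paper), a pixel $p_t$ in the target frame is projected into the source frame using the predicted depth $\hat{D}_t(p_t)$, the predicted relative pose $\hat{T}_{t \to s}$, and the camera intrinsics $K$:
$$ p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} p_t , $$
and the photometric discrepancy $\sum_p \lvert I_t(p) - \hat{I}_s(p) \rvert$, where $\hat{I}_s$ is the source frame warped to the target view by differentiable bilinear sampling at $p_s$, serves as the self-supervised training loss.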
Stereo-based approaches yield better depth estimates because they exploit the geometric cues provided by the parallax between two views, which resolves depth ambiguity more effectively than monocular approaches can. The disparity maps computed from stereo pairs provide precise depth cues that monocular systems inherently lack. Stereo knowledge can nevertheless be transferred to monocular systems: models are first trained with stereo imagery to learn disparity and depth perception, and these insights are then applied to monocular data by reproducing similar depth cues through disparity estimation and view synthesis. This strategy exploits the rich geometric information in stereo data to inform monocular models, improving their performance without requiring two cameras at test time.
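A minimal sketch of this stereo-supervised training signal, under the assumption of rectified image pairs: the network sees only the left image and predicts a disparity map, the right image is warped back to the left view with that disparity, and the photometric reconstruction error drives learning. Tensor shapes and the plain L1 loss are simplifications for illustration.

# Sketch of stereo-supervised monocular training: predicted left-view disparity is used
# to warp the right image back to the left view; the reconstruction error is the loss.
import torch
import torch.nn.functional as F

def stereo_photometric_loss(left, right, disp_left):
    """left, right: (B, 3, H, W) rectified images; disp_left: (B, 1, H, W) disparity in pixels."""
    b, _, h, w = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as expected by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=left.device),
        torch.linspace(-1, 1, w, device=left.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift x-coordinates by the predicted disparity (converted to normalized units).
    grid[..., 0] = grid[..., 0] - 2.0 * disp_left.squeeze(1) / (w - 1)
    right_warped = F.grid_sample(right, grid, align_corners=True)
    return (left - right_warped).abs().mean()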
Classical hand-crafted methods for optical flow estimation rely on assumptions such as constant pixel intensity over time and locally translational motion, which restrict them to small displacements and cause them to fail under large motion or illumination changes. These assumptions make them ill-suited to real-world scenarios with large object movements or lighting variations. Deep learning-based methods address these limitations by extracting complex features from the input without imposing such hard constraints, allowing them to learn and adapt to diverse motion patterns under varying lighting conditions. Because they learn directly from data, they can handle large displacements and capture more intricate aspects of motion.
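Concretely, starting from the brightness-constancy assumption $I(x, y, t) = I(x + u, y + v, t + 1)$ and a first-order Taylor expansion (valid only for small displacements), classical methods arrive at the linearized constraint
$$ I_x u + I_y v + I_t = 0 , $$
which provides one equation in two unknowns per pixel; Lucas–Kanade-style methods resolve this ambiguity by additionally assuming the flow $(u, v)$ is constant over a small local window. Both assumptions break down precisely in the large-motion and illumination-change regimes described above.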
Explainability masks in self-supervised depth estimation are designed to identify and down-weight regions where the network expects its predictions to be unreliable, excluding those areas from contributing directly to the training loss. This allows the model to concentrate on regions with more reliable predictions and prevents hard-to-predict regions from degrading training. The masks improve robustness in scenarios where parts of the image carry unreliable information or where model confidence is low. By down-weighting these regions, models produce more accurate and consistent predictions across a variety of inputs.
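One way such a mask can gate the photometric objective is sketched below: per-pixel weights in [0, 1] scale the reconstruction error, while a regularizer discourages the trivial all-zero mask. The 0.2 regularization weight is an illustrative placeholder, not a value prescribed by any particular method.

# Sketch of a mask-gated photometric loss: mask values in [0, 1] down-weight regions
# the network deems unreliable; a cross-entropy term against an all-ones target
# keeps the mask from collapsing to zero everywhere.
import torch
import torch.nn.functional as F

def masked_photometric_loss(target, warped, mask, reg_weight=0.2):
    """target, warped: (B, 3, H, W) images; mask: (B, 1, H, W) with values in [0, 1]."""
    photometric = (target - warped).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    data_term = (mask * photometric).mean()
    mask_reg = F.binary_cross_entropy(mask.clamp(1e-6, 1 - 1e-6), torch.ones_like(mask))
    return data_term + reg_weight * mask_reg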
Scale-invariant loss functions improve monocular depth estimation by making the learning objective robust to the global scale of the prediction. By penalizing relative (e.g., log-space) differences between pixels rather than absolute depth values, these losses sidestep the scale ambiguity that arises when relying solely on ground-truth depth maps for supervision. The model can therefore focus on the depth relationships within the scene rather than on absolute measurements, which improves accuracy across scales and generalization to new environments.
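A compact sketch of such an objective, in the spirit of the scale-invariant log-depth error used in single-image depth prediction: errors are measured on log-depth differences, and the second term removes the penalty for a global scaling of the prediction. Setting lam = 1 gives full scale invariance, while lam = 0 recovers a plain L2 loss in log space; the default lam = 0.5 is a common compromise.

# Sketch of a scale-invariant log-depth loss (per batch, for illustration).
import torch

def scale_invariant_loss(pred_depth, gt_depth, lam=0.5, eps=1e-6):
    """pred_depth, gt_depth: (B, 1, H, W) positive depth maps."""
    d = torch.log(pred_depth + eps) - torch.log(gt_depth + eps)
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / (n ** 2)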