cv unit 4
Feature-based alignment is the problem of estimating the motion between two or more
sets of matched 2D or 3D points.
2D alignment using least squares. Given a set of matched feature points $\{(\mathbf{x}_i, \mathbf{x}'_i)\}$ and a planar parametric transformation of the form
$$\mathbf{x}' = f(\mathbf{x}; \mathbf{p}),$$
how can we produce the best estimate of the motion parameters $\mathbf{p}$? The usual way to do this is to use least squares, i.e., to minimize the sum of squared residuals
$$E_{\mathrm{LS}} = \sum_i \|\mathbf{r}_i\|^2,$$
where $\mathbf{r}_i = \hat{\mathbf{x}}'_i - \tilde{\mathbf{x}}'_i$ is the residual between the measured location $\hat{\mathbf{x}}'_i$ and its corresponding current predicted location $\tilde{\mathbf{x}}'_i = f(\mathbf{x}_i; \mathbf{p})$.
2. Feature Detection: Identify distinctive key points or features in each image that are
robust to changes in scale, rotation, and illumination.
3. Feature Description: Describe the detected key points in a way that allows for
reliable matching across images despite variations in viewpoint or lighting
conditions.
6. Robustness: Ensure that the alignment process is robust to noise, occlusions, and
other artifacts present in the images.
7. Efficiency: Perform the alignment process efficiently to allow for real-time or near-
real-time performance in applications such as augmented reality or video
processing.
8. Applicability: Enable alignment for various computer vision tasks such as image
stitching, object recognition, image registration, and augmented reality.
The minimum can be found by solving the symmetric positive definite (SPD) system of normal equations
$$A\,\mathbf{p} = \mathbf{b},$$
where
$$A = \sum_i J^\top(\mathbf{x}_i)\,J(\mathbf{x}_i)$$
is called the Hessian, $\mathbf{b} = \sum_i J^\top(\mathbf{x}_i)\,\Delta\mathbf{x}_i$ with $\Delta\mathbf{x}_i = \hat{\mathbf{x}}'_i - \mathbf{x}_i$, and $J = \partial f/\partial\mathbf{p}$ is the Jacobian of the transformation with respect to the motion parameters. For the case of pure translation, the resulting equations have a particularly simple form, i.e., the translation is the average translation between corresponding points or, equivalently, the translation of the point centroids.
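The pure-translation case can be made concrete with a short sketch (Python/NumPy, not part of the original text; the function and variable names are illustrative). It estimates the translation as the difference of the point centroids, which is exactly what the normal equations reduce to for this model.

```python
import numpy as np

def estimate_translation(x, x_prime):
    """Least squares translation for the model x' = x + t.

    For pure translation, the normal equations reduce to the
    difference of the point centroids.
    x, x_prime: (N, 2) arrays of corresponding 2D points.
    """
    x = np.asarray(x, dtype=float)
    x_prime = np.asarray(x_prime, dtype=float)
    return x_prime.mean(axis=0) - x.mean(axis=0)

# toy example: shift a point set by (3, -1) and recover it from noisy matches
rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=(20, 2))
t_true = np.array([3.0, -1.0])
pts2 = pts + t_true + 0.1 * rng.standard_normal(pts.shape)
print(estimate_translation(pts, pts2))   # approximately [3, -1]
```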
Uncertainty weighting. The above least squares formulation assumes that all feature points are matched with the same accuracy. This is often not the case, since certain points may fall into more textured regions than others. If we associate a scalar variance estimate $\sigma_i^2$ with each correspondence, we can instead minimize the weighted least squares problem
$$E_{\mathrm{WLS}} = \sum_i \sigma_i^{-2}\,\|\mathbf{r}_i\|^2.$$
Application: Panography
One of the simplest (and most fun) applications of image alignment is a special form of image
stitching called panography. In a panograph, images are translated and optionally rotated
and scaled before being blended with simple averaging.
In most of the examples seen on the web, the images are aligned by hand for best artistic
effect.
Consider a simple translational model. We want all the corresponding features in different images to line up as best as possible. Let $\mathbf{t}_j$ be the location of the $j$th image coordinate frame in the global composite frame and $\mathbf{x}_{ij}$ be the location of the $i$th matched feature in the $j$th image. In order to align the images, we wish to minimize the least squares error
$$E_{\mathrm{PLS}} = \sum_{ij} \|\mathbf{t}_j + \mathbf{x}_{ij} - \mathbf{x}_i\|^2,$$
where $\mathbf{x}_i$ is the unknown consensus (global) position of feature $i$.
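The following is a minimal sketch of how this least squares problem might be solved (Python/NumPy, illustrative names only). It assumes every image and every feature has at least one observation, and it uses simple alternating updates of the consensus feature positions and the per-image translations rather than a single joint sparse solve.

```python
import numpy as np

def align_panograph(obs, n_images, n_feats, n_iters=100):
    """Minimize sum_ij ||t_j + x_ij - x_i||^2 by alternating least squares.

    obs: list of (i, j, xy) observations, meaning feature i was seen at
    2D location xy (length-2 array) in image j.
    Returns per-image translations t (n_images x 2) with the gauge fixed
    so that t[0] = 0, plus the consensus feature positions x.
    """
    t = np.zeros((n_images, 2))
    x = np.zeros((n_feats, 2))
    for _ in range(n_iters):
        # best consensus positions x_i given the current translations t_j
        acc = np.zeros((n_feats, 2)); cnt = np.zeros(n_feats)
        for i, j, xy in obs:
            acc[i] += t[j] + xy; cnt[i] += 1
        x = acc / cnt[:, None]
        # best translations t_j given the current consensus positions x_i
        acc = np.zeros((n_images, 2)); cnt = np.zeros(n_images)
        for i, j, xy in obs:
            acc[j] += x[i] - xy; cnt[j] += 1
        t = acc / cnt[:, None]
    # remove the global translation ambiguity (shift both t and x together)
    offset = t[0].copy()
    t -= offset
    x -= offset
    return t, x
```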
To minimize a non-linear least squares problem, which arises for motion models that are not linear in their parameters, we iteratively compute an update $\Delta\mathbf{p}$ to the current parameter estimate $\mathbf{p}$ by minimizing
$$E(\Delta\mathbf{p}) = \sum_i \|f(\mathbf{x}_i; \mathbf{p} + \Delta\mathbf{p}) - \hat{\mathbf{x}}'_i\|^2 \approx \sum_i \|J(\mathbf{x}_i; \mathbf{p})\,\Delta\mathbf{p} - \mathbf{r}_i\|^2,$$
where $J = \partial f/\partial\mathbf{p}$ is the Jacobian evaluated at the current estimate.
For the other 2D motion models, the derivatives in Table 8.1 are all fairly straightforward, except for the projective 2D motion (homography), which arises in image-stitching applications.
3D alignment
Instead of aligning 2D sets of image features, many computer vision applications require the
alignment of 3D points. In the case where the 3D transformations are linear in the motion parameters, e.g., for translation, similarity, and affine, regular least squares can be used. The case of rigid (Euclidean) motion, which arises more frequently and is often called the absolute orientation problem, requires slightly different techniques. If only scalar weightings are being used (as opposed to full 3D per-point anisotropic covariance estimates), the weighted centroids of the two point clouds $\mathbf{c}$ and $\mathbf{c}'$ can be used to estimate the translation $\hat{\mathbf{t}} = \mathbf{c}' - R\,\mathbf{c}$. We are then left with the problem of estimating the rotation between the two centered point sets $\{\hat{\mathbf{x}}_i = \mathbf{x}_i - \mathbf{c}\}$ and $\{\hat{\mathbf{x}}'_i = \mathbf{x}'_i - \mathbf{c}'\}$.
One commonly used technique is called the orthogonal Procrustes algorithm and involves computing the singular value decomposition (SVD) of the $3 \times 3$ correlation matrix
$$C = \sum_i \hat{\mathbf{x}}'_i\,\hat{\mathbf{x}}_i^\top = U\,\Sigma\,V^\top.$$
The rotation matrix is then obtained as $R = U\,V^\top$. (Verify this for yourself when $\hat{\mathbf{x}}' = R\,\hat{\mathbf{x}}$.)
Another technique is the absolute orientation algorithm for estimating the unit quaternion corresponding to the rotation matrix $R$, which involves forming a $4 \times 4$ matrix from the entries in $C$ and then finding the eigenvector associated with its largest positive eigenvalue.
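A minimal NumPy sketch of the orthogonal Procrustes solution described above is given below (illustrative names; the sign correction guards against reflections).

```python
import numpy as np

def procrustes_rotation(X, Xp):
    """Orthogonal Procrustes: rotation R minimizing sum ||x'_i - R x_i||^2.

    X, Xp: (N, 3) arrays of centered (centroid-subtracted) corresponding points.
    """
    C = Xp.T @ X                                   # 3x3 correlation matrix
    U, S, Vt = np.linalg.svd(C)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    return U @ D @ Vt

# quick check: recover a known rotation from synthetic centered points
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3)); X -= X.mean(axis=0)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
print(np.allclose(procrustes_rotation(X, X @ R_true.T), R_true))  # True
```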
Pose estimation
This pose estimation problem is also known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera parameters such as focal length. The problem of recovering pose from three correspondences, which is the minimal amount of information necessary, is known as the perspective-3-point problem (P3P), with extensions to larger numbers of points collectively known as PnP. In this section, we look at some of the techniques that have been developed to solve such problems, starting with the direct linear transform (DLT), which recovers a 3 × 4 camera matrix, followed by other "linear" algorithms, and then looking at statistically optimal iterative algorithms.
Linear algorithms
The simplest way to recover the pose of the camera is to form a set of rational linear equations analogous to those used for 2D motion estimation from the camera matrix form of perspective projection,
$$x_i = \frac{p_{00}X_i + p_{01}Y_i + p_{02}Z_i + p_{03}}{p_{20}X_i + p_{21}Y_i + p_{22}Z_i + p_{23}}, \qquad
y_i = \frac{p_{10}X_i + p_{11}Y_i + p_{12}Z_i + p_{13}}{p_{20}X_i + p_{21}Y_i + p_{22}Z_i + p_{23}},$$
where $(x_i, y_i)$ are the measured 2D feature locations and $(X_i, Y_i, Z_i)$ are the known 3D feature locations. As in the 2D case, this system of equations can be solved in a linear fashion for the unknowns in the camera matrix $P$ by multiplying through by the denominator on both sides of each equation. Because $P$ is known only up to a scale, we can either fix one of the entries, e.g., $p_{23} = 1$, or find the smallest singular vector of the resulting set of linear equations. The resulting algorithm is called the direct linear transform (DLT). To compute the unknowns in $P$, at least six correspondences between 3D and 2D locations must be known.
Fig: Pose estimation by the direct linear transform and by measuring visual angles and
distances between pairs of points.
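Below is a hedged sketch of the DLT just described (Python/NumPy, illustrative names). The sanity check uses noiseless correspondences and recovers $P$ only up to scale.

```python
import numpy as np

def dlt_camera_matrix(pts3d, pts2d):
    """Direct linear transform: estimate the 3x4 camera matrix P
    from >= 6 correspondences between 3D points and 2D image points.

    Each correspondence contributes two homogeneous linear equations
    obtained by multiplying through by the projective denominator.
    """
    A = []
    for (X, Y, Z), (x, y) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z, -x])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z, -y])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)          # smallest right singular vector

# sanity check: project with a known P, then recover it up to scale
P_true = np.hstack([np.eye(3), np.array([[0.1], [0.2], [2.0]])])
pts3d = np.random.default_rng(2).uniform(-1, 1, (8, 3)) + [0, 0, 5]
proj = (P_true @ np.c_[pts3d, np.ones(8)].T).T
pts2d = proj[:, :2] / proj[:, 2:]
P_est = dlt_camera_matrix(pts3d, pts2d)
print(np.allclose(P_est / P_est[2, 3], P_true / P_true[2, 3], atol=1e-6))
```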
As with the case of estimating homographies, more accurate results for the entries in $P$ can be obtained by directly minimizing the set of projection equations using non-linear least squares with a small number of iterations. Note that instead of taking the ratios of the $X/Z$ and $Y/Z$ values, it is also possible to take the cross product of the 3-vector $(x_i, y_i, 1)$ image measurement and the 3D ray $(X, Y, Z)$ and set the three elements of this cross product to 0. The resulting three equations, when interpreted as a set of least squares constraints, in effect compute the squared sine of the angle between the two rays.
As with other areas of computer vision, deep neural networks have also been applied to pose estimation, both for object pose estimation and for the location recognition tasks discussed later in this section. There is also a very active community around estimating pose from RGB-D images, with the most recent papers evaluated on the BOP benchmark (Benchmark for 6D Object Pose Estimation).
The most accurate and flexible way to estimate pose is to directly minimize the squared (or robust) reprojection error for the 2D points as a function of the unknown pose parameters in $(R, \mathbf{t})$ and optionally $K$ using non-linear least squares,
$$E_{\mathrm{NLP}} = \sum_i \rho\!\left(\|\mathbf{r}_i\|\right), \qquad \mathbf{r}_i = \hat{\mathbf{x}}_i - \tilde{\mathbf{x}}_i,$$
where $\mathbf{r}_i$ is the current residual vector (2D error in predicted position) and the partial derivatives are taken with respect to the unknown pose parameters (rotation, translation, and optionally calibration). The robust loss function $\rho$ is used to reduce the influence of outlier correspondences. The resulting projection equations can be written as a chain of simpler transformations,
$$\tilde{\mathbf{x}}_i = f\!\left(\mathbf{p}_i;\, \mathbf{c}_j, \mathbf{q}_j, \mathbf{k}\right),$$
which maps the 3D point $\mathbf{p}_i$ through the rigid-body transform defined by the camera center $\mathbf{c}_j$ and rotation quaternion $\mathbf{q}_j$, followed by perspective division and the calibration (intrinsics) parameters $\mathbf{k}$. Note that in these equations, we have indexed the camera centers $\mathbf{c}_j$ and camera rotation quaternions $\mathbf{q}_j$ by an index $j$, in case more than one pose of the calibration object is being used.
The advantage of this chained set of transformations is that each one has a simple partial derivative with respect both to its parameters and to its input. Thus, once the predicted value of $\tilde{\mathbf{x}}_i$ has been computed based on the 3D point location $\mathbf{p}_i$ and the current values of the pose parameters $(\mathbf{c}_j, \mathbf{q}_j, \mathbf{k})$, we can obtain all of the required partial derivatives using the chain rule,
$$\frac{\partial \mathbf{r}_i}{\partial \mathbf{p}^{(k)}} = \frac{\partial \mathbf{r}_i}{\partial \tilde{\mathbf{x}}_i}\,\frac{\partial \tilde{\mathbf{x}}_i}{\partial \mathbf{p}^{(k)}},$$
where $\mathbf{p}^{(k)}$ indicates one of the parameter vectors that is being optimized. (This same "trick" is used in neural networks as part of backpropagation.) The one special case in this formulation that can be considerably simplified is the computation of the rotation update. Instead of directly computing the derivatives of the $3 \times 3$ rotation matrix $R(\mathbf{q})$ as a function of the unit quaternion entries, you can prepend an incremental rotation matrix $\Delta R(\boldsymbol{\omega})$ to the current rotation matrix and compute the partial derivative of the transform with respect to the incremental rotation parameters $\boldsymbol{\omega}$, which results in a simple cross product of the backward-chaining partial derivative and the outgoing 3D vector.
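A compact sketch of this kind of iterative pose refinement is shown below. It is not the exact formulation above: it relies on SciPy's least_squares with a Huber robust loss and parameterizes the rotation with an axis-angle vector (via scipy.spatial.transform.Rotation) instead of a quaternion plus incremental update; the calibration matrix K is assumed known and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(params, pts3d, pts2d, K):
    """Residuals r_i = observed - predicted 2D point, with the pose
    parameterized as a 3-vector rotation (axis-angle) plus translation."""
    rvec, t = params[:3], params[3:6]
    R = Rotation.from_rotvec(rvec).as_matrix()
    cam = (R @ pts3d.T).T + t                 # points in the camera frame
    proj = (K @ cam.T).T
    pred = proj[:, :2] / proj[:, 2:]          # perspective division
    return (pts2d - pred).ravel()

def refine_pose(rvec0, t0, pts3d, pts2d, K):
    """Non-linear least squares pose refinement with a robust (Huber) loss
    to reduce the influence of outlier correspondences."""
    x0 = np.hstack([rvec0, t0])
    res = least_squares(reprojection_residuals, x0, loss='huber', f_scale=1.0,
                        args=(pts3d, pts2d, K))
    return res.x[:3], res.x[3:6]
```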
The inference of human pose (head, body, and limb locations and attitude) from a single image can be viewed as yet another kind of segmentation task. We have already discussed some pose estimation techniques in the section on pedestrian detection. Starting with seminal early work, 2D and 3D pose detection and estimation rapidly developed into an active research area, with important advances and datasets.
One of the most exciting applications of pose estimation is in the area of location recognition,
which can be used both in desktop applications (“Where did I take this holiday snap?”) and
in mobile smartphone applications. The latter case includes not only finding out your current
location based on a cell-phone image, but also providing you with navigation directions or
annotating your images with useful information, such as building names and restaurant
reviews (i.e., a pocketable form of augmented reality). This problem is also often called visual
(or image-based) localization.
Some approaches to location recognition assume that the photos consist of architectural
scenes for which vanishing directions can be used to pre-rectify the images for easier
matching.
The main difficulty in location recognition is in dealing with the extremely large community
(user-generated) photo collections on websites such as Flickr.
In the latter case, the overlap between adjacent database images can be used to verify and prune potential matches using "temporal" filtering, i.e., requiring the query image to match nearby overlapping database images before accepting the match. Similar ideas have been used to improve location recognition from panoramic video sequences. Recognizing indoor locations inside buildings and shopping malls poses its own set of challenges, including textureless areas and repeated elements.
Some of the most recent approaches to localization use deep networks to generate feature descriptors, perform large-scale instance retrieval, map images to 3D scene coordinates, or perform end-to-end scene coordinate regression, absolute pose regression (APR), or relative pose regression (RPR). Recent evaluations of these techniques have shown that classical approaches based on feature matching followed by geometric pose optimization typically outperform pose regression approaches in terms of accuracy and generalization. The Long-Term Visual Localization benchmark has a leaderboard listing the best-performing localization systems. Another variant on location recognition is the automatic discovery of landmarks, i.e., frequently photographed objects and locations.
The concept of organizing the world's photo collections by location has even been recently extended to organizing all of the universe's (astronomical) photos in an application called astrometry. The technique used to match any two star fields is to take quadruplets of nearby stars (a pair of stars and another pair inside their diameter) to form a 30-bit geometric hash by encoding the relative positions of the second pair of points using the inscribed square as the reference frame. Traditional information retrieval techniques (k-d trees built for different parts of a sky atlas) are then used to find matching quads as potential star field location hypotheses, which can then be verified using a similarity transform.
Triangulation
Triangulation is the problem of determining a point's 3D position from a set of corresponding image locations and known camera positions. One of the simplest ways to solve this problem is to find the 3D point $\mathbf{p}$ that lies closest to all of the 3D rays corresponding to the matched 2D feature locations. The optimal value for $\mathbf{p}$, which lies closest to all of the rays, can be computed as a regular least squares problem by summing the squared point-to-ray distances $r_j^2$ over all cameras and finding the optimal value of $\mathbf{p}$. An alternative formulation, which is more statistically optimal and which can produce significantly better estimates if some of the cameras are closer to the 3D point than others, is to minimize the residual in the measurement equations
$$x_j = \frac{p_{00}^{(j)}X + p_{01}^{(j)}Y + p_{02}^{(j)}Z + p_{03}^{(j)}}{p_{20}^{(j)}X + p_{21}^{(j)}Y + p_{22}^{(j)}Z + p_{23}^{(j)}}, \qquad
y_j = \frac{p_{10}^{(j)}X + p_{11}^{(j)}Y + p_{12}^{(j)}Z + p_{13}^{(j)}}{p_{20}^{(j)}X + p_{21}^{(j)}Y + p_{22}^{(j)}Z + p_{23}^{(j)}},$$
where $(x_j, y_j)$ are the measured 2D feature locations and $p_{00}^{(j)}, \ldots, p_{23}^{(j)}$ are the known entries in camera matrix $P_j$. As before, this set of non-linear equations can be converted into a linear least squares problem by multiplying both sides by the denominator, again resulting in the direct linear transform (DLT) formulation. Note that if we use homogeneous coordinates $\mathbf{p} = (X, Y, Z, W)$, the resulting set of equations is homogeneous and is best solved as a singular value decomposition (SVD) or eigenvalue problem (looking for the smallest singular vector or eigenvector). If we set $W = 1$, we can use regular linear least squares, but the resulting system may be singular or poorly conditioned, e.g., if all of the viewing rays are parallel, as occurs for points far away from the camera.
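The homogeneous (DLT) triangulation just described can be sketched in a few lines of NumPy (illustrative names; it assumes the homogeneous $W$ component of the solution is non-zero):

```python
import numpy as np

def triangulate_dlt(P_list, xy_list):
    """Linear (DLT) triangulation of one 3D point from >= 2 views.

    P_list: list of 3x4 camera matrices; xy_list: matching (x, y) pixels.
    Each view contributes two rows, x*P[2] - P[0] and y*P[2] - P[1];
    the homogeneous solution is the smallest right singular vector.
    """
    A = []
    for P, (x, y) in zip(P_list, xy_list):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]                 # de-homogenize (assumes W != 0)
```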
So far in our study of 3D reconstruction, we have always assumed that either the 3D point
positions or the 3D camera poses are known in advance. In this section, we take our first
look at structure from motion, which is the simultaneous recovery of 3D structure and pose
from image correspondences. In particular, we examine techniques that operate on just
two frames with point correspondences. We divide this section into the study of classic “n-
point” algorithms, special (degenerate) cases, projective (uncalibrated) reconstruction, and
self-calibration for cameras whose intrinsic calibrations are unknown.
Consider the figure showing a 3D point $\mathbf{p}$ being viewed from two cameras whose relative position can be encoded by a rotation $R$ and a translation $\mathbf{t}$. As we do not know anything about the camera positions, without loss of generality, we can set the first camera at the origin $\mathbf{c}_0 = \mathbf{0}$ and at a canonical orientation $R_0 = I$. The 3D point $\mathbf{p}_0 = d_0\hat{\mathbf{x}}_0$, observed in the first image at location $\hat{\mathbf{x}}_0$ and at a $z$ distance of $d_0$, is mapped into the second image by the transformation
$$d_1\hat{\mathbf{x}}_1 = R\,\mathbf{p}_0 + \mathbf{t} = d_0 R\,\hat{\mathbf{x}}_0 + \mathbf{t},$$
where $\hat{\mathbf{x}}_0$ and $\hat{\mathbf{x}}_1$ are the (local) ray direction vectors. Taking the cross product of both sides with $\mathbf{t}$ in order to annihilate it on the right-hand side yields
$$d_1\,[\mathbf{t}]_\times\hat{\mathbf{x}}_1 = d_0\,[\mathbf{t}]_\times R\,\hat{\mathbf{x}}_0.$$
Taking the dot product of both sides with $\hat{\mathbf{x}}_1$ then gives
$$d_0\,\hat{\mathbf{x}}_1^\top[\mathbf{t}]_\times R\,\hat{\mathbf{x}}_0 = d_1\,\hat{\mathbf{x}}_1^\top[\mathbf{t}]_\times\hat{\mathbf{x}}_1 = 0,$$
because the right-hand side is a triple product with two identical entries. (Another way to say this is that the cross product matrix $[\mathbf{t}]_\times$ is skew symmetric and returns 0 when pre- and post-multiplied by the same vector.) We therefore arrive at the basic epipolar constraint
$$\hat{\mathbf{x}}_1^\top E\,\hat{\mathbf{x}}_0 = 0, \qquad \text{where} \quad E = [\mathbf{t}]_\times R$$
is called the essential matrix.
An alternative way to derive the epipolar constraint is to notice that, for the two viewing rays to intersect in 3D at point $\mathbf{p}$, the vector connecting the two camera centers and the rays corresponding to pixels $\mathbf{x}_0$ and $\mathbf{x}_1$ must be coplanar. This requires that the triple product of these three vectors vanish, which again yields the epipolar constraint.
Eight-point algorithm. Given this fundamental epipolar relationship, how can we use it to recover the camera motion encoded in the essential matrix $E$? If we have $N$ corresponding measurements $\{(\hat{\mathbf{x}}_{i0}, \hat{\mathbf{x}}_{i1})\}$, we can form $N$ homogeneous equations in the nine elements of $E = \{e_{00}, \ldots, e_{22}\}$.
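A minimal sketch of the resulting linear (eight-point) estimate is shown below, assuming the correspondences are already expressed in normalized (calibrated) coordinates; names are illustrative.

```python
import numpy as np

def eight_point(x0, x1):
    """Linear eight-point estimate from N >= 8 matches (x0_i, x1_i).

    Builds one homogeneous equation x1_i^T E x0_i = 0 per match and
    enforces the rank-2 constraint on the result afterwards.
    """
    x0h = np.c_[np.asarray(x0, float), np.ones(len(x0))]
    x1h = np.c_[np.asarray(x1, float), np.ones(len(x1))]
    # each row is the outer product x1_i x0_i^T flattened row-major
    A = np.einsum('ni,nj->nij', x1h, x0h).reshape(len(x0), 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # project onto the set of rank-2 matrices
    U, S, Vt2 = np.linalg.svd(E)
    S[2] = 0.0
    # (for an essential matrix, the two remaining singular values would
    # additionally be set equal to each other)
    return U @ np.diag(S) @ Vt2
```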
Self-calibration
The results of structure from motion computation are much more useful if a metric
reconstruction is obtained, i.e., one in which parallel lines are parallel, orthogonal walls are
at right angles, and the reconstructed model is a scaled version of reality. Over the years, a
large number of self-calibration (or auto-calibration) techniques have been developed for
converting a projective reconstruction into a metric one, which is equivalent to recovering
the unknown calibration matrices $K_j$ associated with each image. In situations where additional information is known about the scene, different methods may be employed. For
example, if there are parallel lines in the scene, three or more vanishing points, which are
the images of points at infinity, can be used to establish the homography for the plane at
infinity, from which focal lengths and rotations can be recovered. If two or more finite
orthogonal vanishing points have been observed, the single-image calibration method based
on vanishing points can be used instead
A particularly simple case arises when the calibration matrices $K_j = \mathrm{diag}(f_j, f_j, 1)$ encode only the unknown focal lengths, which can then be estimated directly from the fundamental matrix by rewriting the relevant numerators and denominators in a common form.
Factorization
While two-frame techniques are useful for reconstructing sparse geometry from stereo image pairs
and for initializing larger-scale 3D reconstructions, most applications can benefit from the much
larger number of images that are usually available in photo collections and videos of scenes.
In this section, we briefly review an older technique called factorization, which can provide useful
solutions for short video sequences, and then turn to the more commonly used bundle adjustment
approach, which uses non-linear least squares to obtain optimal solutions under general camera
configurations
When processing video sequences, we often get extended feature tracks from which it is possible to
recover the structure and motion using a process called factorization. Consider the tracks generated
by a rotating ping-pong ball, which has been marked with dots to make its shape and motion more discernible. We can readily see from the shape of the tracks that the moving object must be a sphere,
but how can we infer this mathematically? It turns out that, under orthography or related models we
discuss below, the shape and motion can be recovered simultaneously using a singular value
decomposition.
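The core of the factorization idea can be sketched as a rank-3 SVD of the registered measurement matrix (a hedged illustration; the metric upgrade step that removes the remaining affine ambiguity is not shown):

```python
import numpy as np

def factorize_orthographic(W):
    """Rank-3 factorization of a registered measurement matrix.

    W: (2F, N) matrix stacking the x and y track coordinates for F frames
    and N points, with the per-row means already subtracted (registration).
    Returns motion M (2F, 3) and shape S (3, N) with W ~= M @ S, defined
    only up to an invertible 3x3 affine ambiguity (removed by the metric
    upgrade constraints, which are not shown here).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U3, s3, Vt3 = U[:, :3], s[:3], Vt[:3]
    M = U3 * np.sqrt(s3)                 # split the singular values evenly
    S = np.sqrt(s3)[:, None] * Vt3
    return M, S
```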
Once the rotation matrices and 3D point locations have been recovered, there still exists a bas-relief
ambiguity, i.e., we can never be sure if the object is rotating left to right or if its depth reversed version
is moving the other way. (This can be seen in the classic rotating Necker Cube visual illusion.)
Additional cues, such as the appearance and disappearance of points, or perspective effects, both of
which are discussed below, can be used to remove this ambiguity.
For motion models other than pure orthography, e.g., for scaled orthography or paraperspective, the
approach above must be extended in the appropriate manner. Such techniques are relatively
straightforward to derive from first principles; more details can be found in papers that extend the
basic factorization approach to these more flexible models. Additional extensions of the original
factorization algorithm include multi-body rigid motion, sequential updates to the factorization, the
addition of lines and planes and re-scaling the measurements to incorporate individual location
uncertainties.
A disadvantage of factorization approaches is that they require a complete set of tracks, i.e., each point must be visible in each frame, for the factorization approach to work. Extensions deal with this problem by first applying factorization to smaller, denser subsets and then using known camera (motion) or point (structure) estimates to hallucinate additional missing values, which allows more features and cameras to be incorporated incrementally.
Bundle adjustment
As we have mentioned several times before, the most accurate way to recover structure and motion
is to perform robust non-linear minimization of the measurement (re-projection) errors, which is
commonly known in the photogrammetry (and now computer vision) communities as bundle
adjustment.
The biggest difference between these formulas and full bundle adjustment is that our feature
location measurements xij now depend not only on the point (track) index i but also on the camera
pose index j,
and that the 3D point positions pi are also being simultaneously updated. In addition, it is common
to add a stage for radial distortion parameter estimation
if the cameras being used have not been pre-calibrated, as shown in the figure.
While most of the boxes (transforms) have previously been explained, the leftmost box has not. This box performs a robust comparison of the predicted and measured 2D locations $\hat{\mathbf{x}}_{ij}$ and $\tilde{\mathbf{x}}_{ij}$ after re-scaling by the measurement noise covariance $\Sigma_{ij}$. In more detail, this operation can be written as
$$E_{\mathrm{BA}} = \sum_{ij} \rho\!\left(\mathbf{r}_{ij}^\top\,\Sigma_{ij}^{-1}\,\mathbf{r}_{ij}\right), \qquad \mathbf{r}_{ij} = \hat{\mathbf{x}}_{ij} - \tilde{\mathbf{x}}_{ij}.$$
The advantage of the chained representation introduced above is that it not only makes the
computations of the partial derivatives and Jacobians simpler but it can also be adapted to any
camera configuration. Consider for example a pair of cameras mounted on a robot that is moving
around in the world, as shown in Figure 11.15a. By replacing the rightmost two transformations in
Figure 11.14 with the transformations shown in Figure 11.15b, we can simultaneously recover the
position of the robot at each time and the calibration of each camera with respect to the rig, in
addition to the 3D structure of the world.
Exploiting sparsity
Large bundle adjustment problems, such as those involving reconstructing 3D scenes from thousands
of internet photographs can require solving non-linear least squares problems with millions of
measurements (feature matches) and tens of thousands of unknown parameters (3D point positions
and camera poses). Unless some care is taken, these kinds of problems can become intractable,
because the (direct) solution of dense least squares problems is cubic in the number of unknowns.
Fortunately, structure from motion is a bipartite problem in structure and motion. Each feature point
xij in a given image depends on one 3D point position pi and one 3D camera pose (Rj , cj ). This is
illustrated in Figure 11.16a, where each circle (1–9) indicates a 3D point, each square (A–D) indicates
a camera, and lines (edges) indicate which points are visible in which cameras (2D features). If the
values for all the points are known or fixed, the equations for all the cameras become independent,
and vice versa.
If we order the structure variables before the motion variables in the Hessian matrix $A$ (and hence also the right-hand side vector $\mathbf{b}$), we obtain a structure for the Hessian shown in Figure 11.16c. When such a system is solved using sparse Cholesky factorization, the fill-in occurs in the smaller motion Hessian $A_{CC}$. More recent work explores the use of iterative (conjugate gradient) techniques for the solution of bundle adjustment problems.
In more detail, the reduced motion Hessian is computed using the Schur complement,
$$A'_{CC} = A_{CC} - A_{PC}^\top A_{PP}^{-1} A_{PC}, \qquad \mathbf{b}'_C = \mathbf{b}_C - A_{PC}^\top A_{PP}^{-1}\,\mathbf{b}_P,$$
where $A_{PP}$ is the point (structure) Hessian (the top left block of Figure 11.16c), $A_{PC}$ is the point–camera Hessian (the top right block), and $A_{CC}$ and $A'_{CC}$ are the motion Hessians before and after the point variable elimination (the bottom right block of Figure 11.16c).
Notice that $A'_{CC}$ has a non-zero entry between two cameras if they see any 3D point in common. This is indicated with dashed arcs in Figure 11.16a and light blue squares in Figure 11.16c.
Whenever there are global parameters present in the reconstruction algorithm, such as camera
intrinsics that are common to all of the cameras, or camera rig calibration parameters such as those
shown in Figure 11.15, they should be ordered last (placed along the right and bottom edges of A) to
reduce fill-in
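A dense, purely illustrative sketch of the Schur complement step is shown below; in a real bundle adjuster the block structure of $A_{PP}$ (3 × 3 blocks, one per point) and the sparsity of $A_{PC}$ would be exploited rather than forming dense inverses, and all names here are illustrative.

```python
import numpy as np

def reduced_camera_system(A_PP, A_PC, A_CC, b_P, b_C):
    """Form and solve the reduced (Schur complement) camera system.

    A_PP is block diagonal in practice (3x3 blocks, one per point), so it
    would normally be inverted block by block; a dense inverse is used
    here purely for illustration.
    """
    A_PP_inv = np.linalg.inv(A_PP)
    A_CC_red = A_CC - A_PC.T @ A_PP_inv @ A_PC     # reduced motion Hessian
    b_C_red = b_C - A_PC.T @ A_PP_inv @ b_P
    delta_cams = np.linalg.solve(A_CC_red, b_C_red)    # camera update
    delta_pts = A_PP_inv @ (b_P - A_PC @ delta_cams)   # back-substitution
    return delta_cams, delta_pts
```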
One of the neatest applications of structure from motion is to estimate the 3D motion of a video or
film camera, along with the geometry of a 3D scene, in order to superimpose 3D graphics or
computer-generated images (CGI) on the scene. In the visual effects industry, this is known as the
match move problem (Roble 1999), as the motion of the synthetic 3D camera used to render the
graphics must be matched to that of the real-world camera. For very small motions, or motions
involving pure camera rotations, one or two tracked points can suffice to compute the necessary
visual motion.
For planar surfaces moving in 3D, four points are needed to compute the homography, which can
then be used to insert planar overlays, e.g., to replace the contents of advertising billboards during
sporting events. The general version of this problem requires the estimation of the full 3D camera
pose along with the focal length (zoom) of the lens and potentially its radial distortion parameters
When the 3D structure of the scene is known ahead of time, pose estimation techniques such as view correlation or through-the-lens camera control (Gleicher and Witkin 1992) can be used, as described earlier. For more complex scenes, it is usually preferable to recover the 3D structure simultaneously with the camera motion using structure-from-motion techniques. The trick with using such techniques is that, to prevent any visible jitter between the synthetic graphics and the actual scene, features must be tracked to very high accuracy and ample feature tracks must be available in the vicinity of the insertion location.
The most general algorithms for structure from motion make no prior assumptions about the objects
or scenes that they are reconstructing. In many cases, however, the scene contains higher-level
geometric primitives, such as lines and planes. These can provide information complementary to
interest points and also serve as useful building blocks for 3D modeling and visualization.
Furthermore, these primitives are often arranged in particular relationships, i.e., many lines and
planes are either parallel or orthogonal to each other.
Sometimes, instead of exploiting regularity in the scene structure, it is possible to take advantage of a constrained motion model. For example, if the object of interest is rotating on a turntable, i.e., around a fixed but unknown axis, specialized techniques can be used to recover this motion. In other situations, the camera itself may be moving in a fixed arc around some center of rotation (Shum and He 1999). Specialized capture setups, such as mobile stereo camera rigs or moving vehicles equipped with multiple fixed cameras, can also take advantage of the knowledge that individual cameras are (mostly) fixed with respect to the capture rig, as shown in the figure.
Line-based techniques
It is well known that pairwise epipolar geometry cannot be recovered from line matches alone, even
if the cameras are calibrated. To see this, think of projecting the set of lines in each image into a set
of 3D planes in space. You can move the two cameras around into any configuration you like and still
obtain a valid reconstruction for 3D lines. When lines are visible in three or more views, the trifocal tensor can be used to transfer lines from one pair of images to another. One widely used technique matches 2D lines based on the average of 15 × 15 pixel correlation scores evaluated at all pixels along their common line segment intersection. In this
system, the epipolar geometry is assumed to be known, e.g., computed from point matches. For wide
baselines, all possible homographies corresponding to planes passing through the 3D line are used
to warp pixels and the maximum correlation score is used. For triplets of images, the trifocal tensor
is used to verify that the lines are in geometric correspondence before evaluating the correlations
between line segments. Figure 11.22 shows the results of using their system.
Instead of reconstructing 3D lines, an alternative approach uses RANSAC to group lines into likely coplanar subsets. Four lines
are chosen at random to compute a homography, which is then verified for these and other plausible
line segment matches by evaluating color histogram-based correlation scores. The 2D intersection
points of lines belonging to the same plane are then used as virtual measurements to estimate the
epipolar geometry, which is more accurate than using the homographies directly.
Another approach describes a 3D modeling system that constructs calibrated panoramas from multiple images and
then has the user draw vertical and horizontal lines in the image to demarcate the boundaries of
planar regions. The lines are used to establish an absolute rotation for each panorama and are then
used (along with the inferred vertices and planes) to build a 3D structure, which can be recovered up
to scale from one or more images
Plane-based techniques
In scenes that are rich in planar structures, e.g., in architecture, it is possible to directly estimate
homographies between different planes, using either feature-based or intensity-based methods. In
principle, this information can be used to simultaneously infer the camera poses and the plane
equations, i.e., to compute plane-based structure from motion.
It has been shown that a fundamental matrix can be directly computed from two or more homographies using
algebraic manipulations and least squares. Unfortunately, this approach often performs poorly,
because the algebraic errors do not correspond to meaningful reprojection errors.
A better approach is to hallucinate virtual point correspondences within the areas from which each
homography was computed and to feed them into a standard structure from motion algorithm. An
even better approach is to use full bundle adjustment with explicit plane equations, as well as
additional constraints to force reconstructed co-planar features to lie exactly on their corresponding
planes. (A principled way to do this is to establish a coordinate frame for each plane, e.g., at one of
the feature points, and to use 2D in-plane parameterizations for the other points.)
While the computer vision community has been studying structure from motion, i.e., the
reconstruction of sparse 3D models from multiple images and videos, since the early 1980s, the mobile robotics community has in parallel been studying the automatic construction of 3D maps from moving robots. In robotics, the problem was formulated as the simultaneous estimation of 3D
robot and landmark poses (Figure 11.23), and was known as probabilistic mapping and simultaneous
localization and mapping (SLAM). In the computer vision community, the problem was originally
called visual odometry although that term is now usually reserved for shorter-range motion
estimation that does not involve building a global map with loop closing.
Early versions of such algorithms used range-sensing techniques, such as ultrasound, laser range
finders, or stereo matching, to estimate local 3D geometry, which could then be fused into a 3D
model.
SLAM differs from bundle adjustment in two fundamental aspects. First, it allows for a variety of
sensing devices, instead of just being restricted to tracked or matched feature points. Second, it solves
the localization problem online, i.e., with no or very little lag in providing the current sensor pose.
This makes it the method of choice for time-critical robotics applications such as autonomous navigation.
As you can tell from this very brief overview, SLAM is an incredibly rich and rapidly evolving field of
research, full of challenging robust optimization and real-time performance problems. A good source
for finding a list of the most recent papers and algorithms is the KITTI Visual Odometry/SLAM
Evaluation.
Application: Autonomous navigation. Since the early days of artificial intelligence and robotics, computer vision has been used to enable manipulation for dextrous robots and navigation for autonomous robots (Janai, Güney et al. 2020; Kubota 2019). Some of the earliest vision-based navigation systems include the Stanford Cart and CMU Rover, the Terregator, and the CMU Navlab, which originally could only advance 4 m every 10 sec (< 1 mph) and which was also the first system to use a neural network for driving. The early algorithms and technologies advanced rapidly, with the VaMoRs system operating a 25 Hz Kalman filter loop and driving at 100 km/h on roads with good lane markings. By the mid 2000s, when DARPA introduced their Grand Challenge and Urban Challenge, vehicles equipped with both range-finding lidars and stereo cameras were able to traverse rough outdoor terrain and navigate city streets at regular human driving speeds. These systems led to the formation of industrial research projects at companies such as Google and Tesla, as well as numerous startups, many of which exhibit their vehicles at computer vision conferences.
Translational alignment
The simplest way to establish an alignment between two images or image patches is to shift one
image relative to the other. Given a template image $I_0(\mathbf{x})$ sampled at discrete pixel locations $\{\mathbf{x}_i = (x_i, y_i)\}$, we wish to find where it is located in image $I_1(\mathbf{x})$. A least squares solution to this problem is to find the minimum of the sum of squared differences (SSD) function
$$E_{\mathrm{SSD}}(\mathbf{u}) = \sum_i \left[I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)\right]^2 = \sum_i e_i^2,$$
where $\mathbf{u} = (u, v)$ is the displacement and $e_i = I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)$ is called the residual error (or the displaced frame difference in the video coding literature). (We ignore for the moment the possibility that parts of $I_0$ may lie outside the boundaries of $I_1$ or be otherwise not visible.) The assumption that corresponding pixel values remain the same in the two images is often called the brightness constancy constraint.
In general, the displacement u can be fractional, so a suitable interpolation function must be applied
to image I1(x). In practice, a bilinear interpolant is often used, but bicubic interpolation can yield
slightly better results.
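As a small illustration (not from the original text), bilinear sampling of $I_1$ at a fractional displacement might look as follows:

```python
import numpy as np

def sample_bilinear(img, x, y):
    """Sample a grayscale image at fractional coordinates (x, y)
    using bilinear interpolation (no bounds checking, for brevity)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    ax, ay = x - x0, y - y0
    return ((1 - ay) * ((1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1]) +
            ay * ((1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]))
```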
Color images can be processed by summing differences across all three color channels, although it is also possible to first transform the images into a different color space or to only use the luminance (which is often done in video encoders).

Robust error metrics. We can make the above error metric more robust to outliers by replacing the squared error terms with a robust function $\rho(e_i)$,
$$E_{\mathrm{SRD}}(\mathbf{u}) = \sum_i \rho\!\left(I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)\right) = \sum_i \rho(e_i).$$
The robust norm $\rho(e)$ is a function that grows less quickly than the quadratic penalty associated with least squares. One such function, sometimes used in motion estimation for video coding because of its speed, is the sum of absolute differences (SAD) metric, or $L_1$ norm,
$$E_{\mathrm{SAD}}(\mathbf{u}) = \sum_i \left|I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)\right| = \sum_i |e_i|.$$
However, because this function is not differentiable at the origin, it is not well suited to gradient-descent approaches such as the ones presented here. Instead, a smoothly varying function that is quadratic for small values but grows more slowly away from the origin is often used, such as the Geman–McClure function
$$\rho_{\mathrm{GM}}(x) = \frac{x^2}{1 + x^2/a^2},$$
where $a$ is a constant that can be thought of as an outlier threshold. An appropriate value for the threshold can itself be derived using robust statistics, e.g., by computing the median absolute deviation, $\mathrm{MAD} = \mathrm{med}_i\,|e_i|$, and multiplying it by 1.4826 to obtain a robust estimate of the standard deviation of the inlier noise process. More recent work proposes a generalized robust loss function that can model various outlier distributions and thresholds, and also provides a Bayesian method for estimating the loss function parameters.
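A short sketch of these robust ingredients (the Geman–McClure penalty and the MAD-based noise estimate) is given below; names are illustrative.

```python
import numpy as np

def geman_mcclure(e, a):
    """Smooth robust penalty: quadratic near 0, saturating for |e| >> a."""
    return e**2 / (1.0 + (e / a)**2)

def robust_sigma(residuals):
    """Robust estimate of the inlier noise level via the median absolute deviation."""
    return 1.4826 * np.median(np.abs(residuals))
```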
Spatially varying weights. The error metrics above ignore the fact that, for a given alignment, some of the pixels being compared may lie outside the original image boundaries. Furthermore, we may want to partially or completely downweight the contributions of certain pixels. For example, we may want to selectively "erase" some parts of an image from consideration when stitching a mosaic where unwanted foreground objects have been cut out. For applications such as background stabilization, we may want to downweight the middle part of the image, which often contains independently moving objects being tracked by the camera.

All of these tasks can be accomplished by associating a spatially varying per-pixel weight with each of the two images being matched. The error metric then becomes the weighted (or windowed) SSD function
$$E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w_0(\mathbf{x}_i)\,w_1(\mathbf{x}_i + \mathbf{u})\,\left[I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)\right]^2,$$
where $w_0$ and $w_1$ are the per-pixel weighting (windowing) functions, set to zero outside the valid image regions.
The simplest solution is to do a full search over some range of shifts, using either integer or sub-pixel
steps. This is often the approach used for block matching in motion compensated video compression,
where a range of possible motions (say, ±16 pixels) is explored.
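A brute-force block-matching search of the kind described above might be sketched as follows (illustrative; integer displacements only, with out-of-bounds candidates simply skipped):

```python
import numpy as np

def block_match(block, ref, cx, cy, max_shift=16):
    """Full search over integer displacements in [-max_shift, max_shift],
    as used for block matching in motion-compensated video compression.

    block: (B, B) block from the current frame with top-left corner
    (cx, cy); ref: reference frame. Returns the (u, v) displacement
    minimizing the SSD.
    """
    B = block.shape[0]
    best, best_uv = np.inf, (0, 0)
    for v in range(-max_shift, max_shift + 1):
        for u in range(-max_shift, max_shift + 1):
            x, y = cx + u, cy + v
            if x < 0 or y < 0 or x + B > ref.shape[1] or y + B > ref.shape[0]:
                continue                     # candidate falls outside ref
            cand = ref[y:y + B, x:x + B].astype(float)
            ssd = np.sum((cand - block.astype(float)) ** 2)
            if ssd < best:
                best, best_uv = ssd, (u, v)
    return best_uv
```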
To accelerate this search process, hierarchical motion estimation is often used: an image pyramid is
constructed and a search over a smaller number of discrete pixels (corresponding to the same range
of motion) is first performed at coarser levels.
The motion estimate from one level of the pyramid is then used to initialize a smaller local search at
the next finer level. Alternatively, several seeds (good solutions) from the coarse level can be used to
initialize the fine-level search. While this is not guaranteed to produce the same result as a full search,
it usually works almost as well and is much faster.
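The coarse-to-fine strategy can be sketched as below (a hedged illustration with simple 2× averaging for the pyramid and only non-negative displacements handled; parameter names are illustrative):

```python
import numpy as np

def pyramid_translation(I0, I1, levels=3, coarse_range=8, refine_range=2):
    """Coarse-to-fine translational alignment of template I0 inside I1.

    A wider brute-force SSD search is done at the coarsest level; the
    estimate is then doubled and refined with a small local search at
    each finer level.
    """
    def down(im):
        h, w = im.shape[0] // 2 * 2, im.shape[1] // 2 * 2
        im = im[:h, :w]
        return 0.25 * (im[0::2, 0::2] + im[1::2, 0::2] +
                       im[0::2, 1::2] + im[1::2, 1::2])

    def search(t, im, rng):
        th, tw = t.shape
        best, best_uv = np.inf, (0, 0)
        for v in range(rng + 1):
            for u in range(rng + 1):
                if v + th > im.shape[0] or u + tw > im.shape[1]:
                    continue
                ssd = np.sum((im[v:v + th, u:u + tw] - t) ** 2)
                if ssd < best:
                    best, best_uv = ssd, (u, v)
        return best_uv

    pyr0, pyr1 = [I0.astype(float)], [I1.astype(float)]
    for _ in range(levels - 1):
        pyr0.append(down(pyr0[-1]))
        pyr1.append(down(pyr1[-1]))

    u = v = 0
    for lvl in range(levels - 1, -1, -1):
        u, v = 2 * u, 2 * v                                  # propagate estimate
        rng = coarse_range if lvl == levels - 1 else refine_range
        u0, v0 = max(u - rng, 0), max(v - rng, 0)            # allow downward correction
        du, dv = search(pyr0[lvl], pyr1[lvl][v0:, u0:], 2 * rng)
        u, v = u0 + du, v0 + dv
    return u, v
```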
Fourier-based alignment
When the search range corresponds to a significant fraction of the larger image (as is the case in image stitching), the hierarchical approach may not work that well, as it is often not possible to
coarsen the representation too much before significant features are blurred away. In this case, a
Fourier-based approach may be preferable.
Windowed correlation.
Unfortunately, the Fourier convolution theorem only applies when the summation over xi is
performed over all the pixels in both images, using a circular shift of the image when accessing pixels
outside the original boundaries. While this is acceptable for small shifts and comparably sized images,
it makes no sense when the images overlap by a small amount or one image is a small subset of the
other.
Phase correlation.
A variant of regular correlation that is sometimes used for motion estimation is phase correlation. Here, the spectrum of the two signals being matched is whitened by dividing each per-frequency product by the magnitudes of the Fourier transforms before taking the inverse transform, which ideally yields a single sharp peak at the location of the displacement.
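A minimal NumPy sketch of phase correlation is shown below (illustrative; it assumes same-sized, circularly shifted images and recovers integer displacements).

```python
import numpy as np

def phase_correlation(I0, I1):
    """Estimate the integer translation between two same-sized images by
    whitening the cross-power spectrum and locating the peak of its
    inverse FFT (displacements are recovered modulo the image size)."""
    F0, F1 = np.fft.fft2(I0), np.fft.fft2(I1)
    cross = F1 * np.conj(F0)
    cross /= np.abs(cross) + 1e-12             # whitening (keep phase only)
    corr = np.real(np.fft.ifft2(cross))
    v, u = np.unravel_index(np.argmax(corr), corr.shape)
    # map peaks in the upper half back to negative shifts
    if u > I0.shape[1] // 2: u -= I0.shape[1]
    if v > I0.shape[0] // 2: v -= I0.shape[0]
    return u, v
```

For example, on a pair related by a circular shift, I1 = np.roll(I0, (4, 7), axis=(0, 1)), the function returns (u, v) = (7, 4).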
Parametric motion
In this section, we consider motion estimation approaches that estimate the parameters of a motion model. As previously discussed, these models can be applied to a coherently moving region of support. An important special case is when a single region of support corresponding to the whole image is selected.
In this case, referred to as global motion estimation, the dominant motion is estimated. This dominant motion typically results from camera motion, such as dolly, track, boom, pan, tilt, and roll, which are widely used cinematic techniques in filmmaking and video production. Hereafter, we describe two classes of techniques for parametric motion estimation. We also discuss difficulties arising due to outliers and the related robust estimators.
Indirect parametric motion estimation. A first class of approaches indirectly computes the motion parameters from a dense motion field rather than from the image pixels. More specifically, a dense motion field is first estimated, and then the parametric motion model is fitted to the obtained motion vectors. A least mean squares (LMS) technique is commonly used for this model fitting; the motion parameters are obtained by minimizing the squared difference between the model-predicted motion vectors and the estimated dense motion field.
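As a small illustration of this indirect approach (not part of the original text), a six-parameter affine motion model can be fitted to a dense flow field with ordinary linear least squares; a robust variant would iteratively reweight or discard outlier vectors, as mentioned above.

```python
import numpy as np

def fit_affine_motion(flow):
    """Indirect global motion estimation: fit a 6-parameter affine model
    u(x, y) = a0 + a1*x + a2*y,  v(x, y) = a3 + a4*x + a5*y
    to a dense flow field by linear least squares.

    flow: (H, W, 2) array of per-pixel motion vectors (u, v).
    Returns the parameters (a0, ..., a5).
    """
    H, W, _ = flow.shape
    ys, xs = np.mgrid[0:H, 0:W]
    A = np.c_[np.ones(H * W), xs.ravel(), ys.ravel()]
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    au, *_ = np.linalg.lstsq(A, u, rcond=None)
    av, *_ = np.linalg.lstsq(A, v, rcond=None)
    return np.r_[au, av]
```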