
Construction of three-dimensional structures from video sequences has wide applications for intelligent video analysis. This paper summarizes the key issues of the theory and surveys the recent advances in the state of the art. Reconstruction of a scene object from video sequences often takes the basic principle of structure from motion with an uncalibrated camera. This paper lists the typical strategies and summarizes the typical solutions or algorithms for modeling of complex three-dimensional structures. Open difficult problems are also suggested for further study.

Over the past two decades, many researchers have sought to reconstruct the model of a three-dimensional (3D) scene structure and camera motion from video sequences taken with an uncalibrated camera or from unordered photo collections on the Internet. Traditionally, depth measurement and 3D metric reconstruction can be done from two uncalibrated stereo images [

The basic concepts and background of the problem can be found in the fundamentals of multiview geometry, through books and theses such as

For multiview modeling of a rigid scene, an approach is presented in [

Sparse 3D measurements of real scenes are readily estimated from N-view image sequences using structure-from-motion techniques. There is a fast algorithm for rigid structure from image sequences in [

Progress has been reported in automatically recovering 3D scene structures together with 3D camera positions from a sequence of images acquired by an unknown camera undergoing unknown movement [

In fact, reconstruction of nonrigid scenes is very important in structure from motion. The recovery of 3D structure and camera motion for nonrigid scenes from single-camera video footage is a key problem in computer vision. For an implicit imaging model of nonrigid scenes, there is a nonrigid structure-from-motion algorithm based on computing matching tensors over subsequences. Each nonrigid matching tensor is computed, along with the rank of the subsequence, using a robust estimator that incorporates a model selection criterion to detect erroneous image points [

With the development of structure-from-motion algorithms, geometric constraints and optimization have become necessary for reconstructing a good 3D model of the object or scene. Several useful approaches have been proposed. For example, a technique is proposed in [

3D affine measurements may be computed from a single perspective view of a scene given only minimal geometric information determined from the image. This minimal information is typically the vanishing line of a reference plane and a vanishing point for a direction not parallel to the plane. Without camera parameters, Criminisi et al. [

The remainder of this paper is organized as follows. Section

For 3D reconstruction of an object or building, Pollefeys et al. present a complete system for building visual models with a hand-held camera [

The 3D models of historical relics and buildings, for example, the Emperor Qin’s Terra-cotta Warriors and Piazza San Marco, are of great significance to archeologists. A system that can match and reconstruct 3D scenes from extremely large collections of photographs has been developed by Agarwal et al. [

Modeling and recognizing landmarks at world scale is a useful yet challenging task. There exists no readily available list of worldwide landmarks. Obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large-scale system. Zheng et al. leverage the vast amount of multimedia data on the web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address these issues [

Modeling the world and reconstructing a city present many challenges for a visualization system in computer vision. Such a system can draw on products such as Google Earth and Google Maps. For instance, Pollefeys et al. [

Once a model of the world or a reconstruction of a city is complete, we can obtain the relative locations of buildings and find related views for the navigation of robots or other vision systems. Photo Tourism can enable full 3D navigation and exploration of the set of images and world geometry, along with auxiliary information such as overhead maps [

In the literature, there are applications that can employ SfM algorithms successfully in practical engineering. For instance, based on structure from controlled motion or on robust statistics, a visual servoing system is presented in [

3D reconstruction has important applications in face recognition, facial expression analysis, and so on. Fidaleo and Medioni [

Reconstruction of 3D scene geometry is an important element for scene understanding, autonomous vehicle and robot navigation, image retrieval, and 3D television [

The goal of structure from motion is the automatic recovery of camera motion and scene structure from two or more images. The problem of using pixel correspondences or tracked points to determine camera and point geometry in this manner is known as structure from motion. It is a self-calibration technique, also called automatic camera tracking or match moving. We must consider several questions, such as the following.

Correspondence (feature extracting and tracking or matching): given a point in one image, how does it constrain the position of the corresponding point in other images?

Scene geometry (structure): given point matches in two or more images, where are the corresponding points in 3D?

Camera geometry (motion): given a set of corresponding points in two or more images, what are the camera matrices for these views?
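To make the scene geometry question concrete, the following is a minimal sketch of linear (DLT) triangulation of one point from two views, assuming the camera matrices are already known and the correspondences are noise free. The function names `triangulate_dlt` and `project`, the two camera matrices, and the test point are illustrative assumptions, not taken from any of the surveyed systems.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 are 3x4 camera matrices; x1, x2 are the matched 2D points.
    """
    # Each view contributes two linear equations in the homogeneous point X.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix (hypothetical helper)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras: one at the origin, one shifted along the x axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noisy matches the linear estimate is usually only a starting point for a nonlinear refinement of the reprojection error.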

Based on these questions, we can give the 3D reconstruction pipeline as in Figure

3D reconstruction pipeline.

First, there are several well-known feature tracking algorithms for image sequences or videos; a popular one is the KLT tracker [

Example set of detected KLT features.

Tracking trajectory of KLT tracker through a video sequence.

In the KLT tracker [

In the papers of Lucas and Kanade [

Then symmetric definition for the dissimilarity between two windows, one in image

On the other hand, for a completely unorganized set of images, the tracker is no longer applicable. Another popular algorithm in computer vision is the scale-invariant feature transform (SIFT) [

Example set of detected SIFT features.

SIFT feature matches between images.

Assume that we have obtained a set of correspondences between images or across a video sequence. We then use this set to reconstruct the 3D position of each point in the set and to recover the motion of the camera. This task is called structure from motion. The problem has been an active research topic in computer vision since the development of the Longuet-Higgins eight-point algorithm [
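As an illustration of the eight-point idea, the sketch below estimates a fundamental matrix from point correspondences. It uses the commonly taught normalized variant with a rank-2 enforcement step, which is an assumption here and not necessarily the exact formulation of the cited work. The helper names, the synthetic two-view setup, and the identity camera intrinsics are all hypothetical.

```python
import numpy as np

def normalize(pts):
    """Translate and scale 2D points so their mean distance from the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(x1, x2):
    """Estimate F such that x2^T F x1 = 0 from at least eight matches."""
    n1, T1 = normalize(x1)
    n2, T2 = normalize(x2)
    # One row of the linear system A f = 0 per correspondence (f is F row-major).
    A = np.column_stack([
        n2[:, 0] * n1[:, 0], n2[:, 0] * n1[:, 1], n2[:, 0],
        n2[:, 1] * n1[:, 0], n2[:, 1] * n1[:, 1], n2[:, 1],
        n1[:, 0], n1[:, 1], np.ones(len(n1)),
    ])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    # Undo the normalization and fix the arbitrary scale.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)

# Synthetic, noise-free test scene (hypothetical values).
rng = np.random.default_rng(0)
n = 12
X = np.column_stack([rng.uniform(-2, 2, n), rng.uniform(-2, 2, n), rng.uniform(4, 8, n)])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([-1.0, 0.0, 0.2])
x1 = X[:, :2] / X[:, 2:3]
Xc = X @ R.T + t
x2 = Xc[:, :2] / Xc[:, 2:3]

F = eight_point(x1, x2)
h1 = np.column_stack([x1, np.ones(n)])
h2 = np.column_stack([x2, np.ones(n)])
epipolar_residuals = np.abs(np.sum(h2 * (h1 @ F.T), axis=1))
```

On real data the matches contain outliers, so an estimator of this kind is normally wrapped in a robust scheme rather than applied to all matches at once.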

There is a popular factorization algorithm for image streams under orthography, which uses many images and tracks many feature points to obtain highly redundant feature position information. It was first developed by Tomasi and Kanade [

However, an orthographic formulation limits the range of motions the method can accommodate. Paraperspective projection is a projection model that closely approximates perspective projection by modeling several effects not modeled under orthographic projection, while retaining linear algebraic properties [

With the development of factorization methods, a factorization-based algorithm for multi-image projective structure and motion was developed by Sturm and Triggs [

Because matrix factorization is a key component for solving several computer vision problems, Tardif et al. have proposed batch algorithms for matrix factorization [

In mathematical expression of the factorization algorithm, assume that the tracked points are

Then we can compute SVD decomposition of

The method can also handle and obtain a full solution from a partially filled-in measurement matrix, which occurs when features appear and disappear in the video due to occlusions or tracking failures [
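The SVD step of the factorization approach can be sketched as follows, assuming an orthographic camera and a fully observed measurement matrix (the partially filled case mentioned above is not handled). The name `factorize`, the row layout of `W`, and the synthetic data are assumptions made for illustration. The recovered motion and shape are only defined up to an invertible 3x3 transformation, and the metric upgrade step is omitted.

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a 2F x P measurement matrix.

    Returns motion M (2F x 3), shape S (3 x P), and the registered matrix.
    """
    # Register the measurements by subtracting the centroid of each row.
    W_reg = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    # Keep the three dominant singular values and split them evenly.
    root = np.sqrt(s[:3])
    M = U[:, :3] * root
    S = root[:, None] * Vt[:3]
    return M, S, W_reg

# Synthetic orthographic image stream (hypothetical values).
rng = np.random.default_rng(1)
n_frames, n_points = 5, 20
X = rng.normal(size=(3, n_points))
rows = []
for _ in range(n_frames):
    # Two orthonormal rows act as the orthographic camera axes of this frame.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    rows.append(Q[:2] @ X + rng.normal(size=(2, 1)))
W = np.vstack(rows)

M, S, W_reg = factorize(W)
```

For noise-free orthographic data the registered matrix has rank three, so the product of `M` and `S` reproduces it; with noise, the truncation gives the best rank-3 approximation in the least-squares sense.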

Example of recovering structure and motion.

Bundle adjustment is a significant component of most structure from motion systems. It is the joint nonlinear refinement of camera and point parameters, so it can consume a large amount of time for large problems. Unfortunately, the optimization underlying structure from motion involves a complex, nonlinear objective function with no closed-form solution, due to nonlinearities in perspective geometry. Most modern approaches use nonlinear least squares algorithms [
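A toy version of this joint refinement is sketched below, under strong simplifying assumptions: camera rotations are held fixed, intrinsics are the identity, the Jacobian is computed by finite differences, and no sparsity is exploited. It is a small Levenberg-Marquardt loop over camera translations and 3D points, not a production bundle adjuster, and all function names and scene values are hypothetical.

```python
import numpy as np

def project(R, t, X):
    """Pinhole projection with identity intrinsics; X has shape (N, 3)."""
    Xc = X @ R.T + t
    return Xc[:, :2] / Xc[:, 2:3]

def residuals(params, rotations, observations, n_points):
    """Stacked reprojection errors over all cameras and points."""
    n_cams = len(rotations)
    ts = params[:3 * n_cams].reshape(n_cams, 3)
    X = params[3 * n_cams:].reshape(n_points, 3)
    res = [project(R, t, X) - obs for R, t, obs in zip(rotations, ts, observations)]
    return np.concatenate(res).ravel()

def numeric_jacobian(f, p, eps=1e-6):
    """Forward-difference Jacobian (dense; only suitable for tiny problems)."""
    r0 = f(p)
    J = np.empty((r0.size, p.size))
    for k in range(p.size):
        dp = np.zeros_like(p)
        dp[k] = eps
        J[:, k] = (f(p + dp) - r0) / eps
    return J

def levenberg_marquardt(f, p, iters=30, lam=1e-3):
    """Damped Gauss-Newton: accept a step only if it lowers the cost."""
    cost = np.sum(f(p) ** 2)
    for _ in range(iters):
        r = f(p)
        J = numeric_jacobian(f, p)
        step = np.linalg.solve(J.T @ J + lam * np.eye(p.size), -J.T @ r)
        new_cost = np.sum(f(p + step) ** 2)
        if new_cost < cost:
            # Damping is kept above a floor because the gauge freedom makes J^T J singular.
            p, cost, lam = p + step, new_cost, max(lam * 0.1, 1e-9)
        else:
            lam *= 10.0
    return p, cost

# Synthetic scene: three cameras with known rotations, eight points.
rng = np.random.default_rng(2)
n_points = 8
X_true = np.column_stack([rng.uniform(-1, 1, n_points),
                          rng.uniform(-1, 1, n_points),
                          rng.uniform(4, 6, n_points)])
rotations = [np.eye(3)] * 3
t_true = np.array([[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [-2.0, 0.0, 0.1]])
observations = [project(R, t, X_true) for R, t in zip(rotations, t_true)]

truth = np.concatenate([t_true.ravel(), X_true.ravel()])
start = truth + 0.05 * rng.normal(size=truth.size)
f = lambda p: residuals(p, rotations, observations, n_points)
initial_cost = np.sum(f(start) ** 2)
refined, final_cost = levenberg_marquardt(f, start)
```

Practical systems differ mainly in scale: they also refine rotations and intrinsics, use analytic Jacobians, and exploit the sparse block structure of the normal equations.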

To upgrade the projective and affine reconstruction to a metric reconstruction (i.e., determined up to an arbitrary Euclidean transformation and a scale factor), calibration techniques are required, for which we follow the approach described in [

Traditional SFM algorithms using just two images often produce inaccurate 3D reconstructions, mainly due to incorrect estimation of the camera’s motion. Thomas and Oliensis [

In contrast to incremental algorithms that solve progressively larger bundle adjustment problems, Crandall et al. present an alternative formulation for structure from motion based on finding a coarse initial solution using a hybrid discrete-continuous optimization and then improving the solution using bundle adjustment. The initial optimization step uses a discrete Markov random field (MRF) formulation, coupled with a continuous Levenberg-Marquardt refinement [

For time efficiency, Havlena et al. present a method of efficient structure from motion by graph optimization [

For duplicate or similar structures, Roberts et al. apply an expectation maximization (EM) algorithm to structure from motion for scenes with large duplicate structures [

For the problem of camera motion and 3D structure reconstruction from line correspondences across multiple views, there is a triangulation algorithm that outperforms standard linear and bias-corrected quasi-linear algorithms. Bundle adjustment using an orthonormal representation yields results similar to the standard maximum likelihood trifocal tensor algorithm, while being usable for any number of views [

Tubic et al. [

Liang and Wong [

Multiview stereo (MVS) techniques take as input a set of images with known camera parameters (i.e., position and orientation of the camera, focal length, image distortion parameters) [

There are clustering techniques to partition the image set into groups of related images, based on the visual structure represented in the image connectivity graph for the collection [

While algorithms of structure from motion have been developed for 3D reconstruction in many applications, some problems of reconstructing geometry from video sequences still exist in computer vision and photography. Until recently, however, there have been no good computer vision techniques for recovering this kind of structure from motion. Many researchers are still making efforts to improve the methods mainly in the following aspects.

Zhang et al. give a robust and efficient algorithm for nonconsecutive feature tracking in structure from motion, which proceeds in two main steps, namely, consecutive point tracking and nonconsecutive track matching [

The method is based on the structure from controlled motion that constrains camera motions to obtain an optimal estimation of the 3D structure of a geometrical primitive [

To solve the resulting large-scale nonlinear optimization, the scene is reconstructed incrementally, starting from a single pair of images, then adding new images and points in rounds, and running a global nonlinear optimization after each round [

This paper has summarized the recent development of structure-from-motion algorithms that are able to metrically reconstruct complex scenes and objects. Their wide applications in computer vision have been addressed. Typical contributions are introduced for feature point detection, tracking, matching, factorization, bundle adjustment, multiview stereo, self-calibration, line detection and matching, modeling, and so forth. Representative works are listed to give readers a general overview of the state of the art. Finally, existing problems and future trends of structure modeling are summarized.

This work was supported by the National Natural Science Foundation of China and Microsoft Research Asia (NSFC nos. 61173096, 60870002, and 60802087), Zhejiang Provincial S&T Department (2010R10006, 2010C33095), and Zhejiang Provincial Natural Science Foundation (R1110679).