This paper presents a hybrid (geometry- and image-based) framework for providing photorealistic
walkthroughs of large, complex outdoor scenes at interactive frame rates. To this end, based on just a sparse
set of real stereoscopic views of the scene, a set of

One of the main
research problems in computer graphics is the creation of modeling and
rendering systems capable of providing photorealistic
representations of complex, real-world environments. Desirable characteristics
for such systems include minimal human intervention during the modeling
process and real-time operation during rendering, a property that allows
walkthroughs at interactive frame rates. Two approaches have proved the most
prominent in this regard so far. The first, more
traditional way of building such systems is to take a purely
geometric approach. In this case, a full 3D geometric model of the whole scene
is constructed first, and that model is then used for the rendering of the
scene afterwards. One big advantage of this approach is that it offers great
flexibility with respect to manipulating or editing the scene's properties. For
example, due to a full 3D model being available, operations such as altering
the viewpoint of the camera, changing the position of light sources, modifying
part of the geometry, and so forth can be attained very easily in this case.
The price for this flexibility, however, lies in having to construct the
underlying 3D model of the scene in the first place. Unless done
automatically, this can be an extremely time-consuming and daunting task,
especially for large-scale scenes. For this reason, much recent effort has
been devoted to obtaining such a geometric model automatically, using only
real images as input
[

Relatively recently, however, researchers realized
that geometry is often unnecessary for obtaining a realistic
representation of the scene [

This issue with the huge amount of input data is also
one of the reasons why the great majority of existing IBR techniques apply only
to scenes of either small or medium scale. By contrast, far fewer IBR
methods have been proposed for the case of large-scale scenes. Motivated by
this fact, the work presented in this article describes a novel, hybrid (i.e.,
both geometry-based and image-based) system that can deal efficiently with the
visual reconstruction of large-scale, natural outdoor environments. We note
that this reconstruction is always photorealistic, while its visualization can
take place at interactive frame rates, that is, in real time. For reducing the
otherwise huge amount of input data, our system makes use of a new compact data
representation of a 3D scene, called

Besides the proposal of a novel hybrid representation
for a 3D scene, our system also includes the following new algorithms and
techniques as part of its image-based modeling and rendering pipeline: (1) a
robust photometric morphing procedure that is based on automatically extracting
a dense field of 2D correspondences between wide-baseline images (Section

Of course, besides our system, other IBR techniques
for large-scale scenes exist as well. In particular, work on unstructured/sparse
lumigraphs has been proposed by various authors. One such example is the work
of Buehler et al. [

As already
mentioned, our system is capable of providing photorealistic walkthroughs of
large-scale scenes at interactive frame rates. Assuming, for simplicity, that during the
walkthrough the user motion takes place along a predefined path inside the
environment (we note, of course, that our system
can be readily extended to handle the case where the stereoscopic views have
been captured not just along a predefined path, but throughout the
scene), the input to our system is then a sparse
set of stereoscopic views captured at certain locations (called

Initially, a sparse set of stereoscopic views is captured at key-positions along the path. Based on these stereoscopic views, a series of local 3D models is then constructed. As the user traverses the path, a morphable 3D model is displayed during rendering. In this manner, a continuous morphing between successive local models takes place at all times, this morphing being both photometric and geometric.

Our system can also be extended to handle the
existence of multiple stereoscopic views per key position, all of them related
by a pure rotation of the stereoscopic camera. In this case, there will also be
multiple local models per key-position. Therefore, before applying the morphing
procedure, a so-called

Based on the above discussion, it follows that the 3D
modeling pipeline of our framework consists of the stages that are listed
below.

As already mentioned in the previous section, a
set of local 3D models needs to be extracted during the first stage of the 3D
modeling pipeline. There will be one local 3D model per stereoscopic view (and
hence per key-position as well, since we initially assume a one-to-one
correspondence between key-positions and stereoscopic views). Such a 3D model
is supposed to provide both a photometric and a geometric representation
of the scene, but only at a local level (i.e., only near the corresponding
key-position along the path). To this end, a stereo matching procedure is
applied between the left

A geometric
depth map

These maps will form the local geometric
representation of the scene, whereas the left image

For solving the stereo matching problem, we estimate a
discrete disparity field by minimizing the energy of a second-order Markov
random field (MRF). The nodes of this MRF are the pixels of the left image, while the labels
belong to a discrete set of disparities
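As a concrete illustration of this formulation, the following sketch builds the unary cost volume and approximately minimizes the MRF energy. The truncated-linear smoothness term, the parameter values, and the ICM minimizer used here are simplifying assumptions made only for this example, not the system's actual choices:

```python
import numpy as np

def stereo_mrf_icm(left, right, max_disp, lam=0.05, iters=3):
    """Estimate a discrete disparity field by (approximately) minimizing
    a pairwise MRF energy: a unary data term (absolute intensity
    difference between left pixel (y, x) and right pixel (y, x - d))
    plus a truncated-linear smoothness term over 4-connected neighbors.
    ICM (iterated conditional modes) is used here purely for illustration.
    """
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), 1e3)        # unary cost volume
    for d in range(max_disp + 1):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
    disp = cost.argmin(axis=0)                       # winner-take-all init
    for _ in range(iters):                           # ICM sweeps
        for y in range(h):
            for x in range(w):
                best_d, best_e = disp[y, x], np.inf
                for d in range(max_disp + 1):
                    e = cost[d, y, x]                # data term
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w:
                            e += lam * min(abs(d - disp[ny, nx]), 3)
                    if e < best_e:
                        best_d, best_e = d, e
                disp[y, x] = best_d
    return disp
```

On a synthetic pair in which the right image is the left one shifted by a constant amount, the recovered field concentrates on the true disparity.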

The single node potential

Given any two
successive local models

Therefore, the pose estimation problem is reduced to
that of extracting a sparse set of correspondences between

However, unlike left/right images of a stereoscopic
view,

Of course, the scale factor

Quantize the scale space of

Rescale

Given any point

This way, apart
from a matching point

The above
strategy has proved very effective, giving a high percentage of exact
matches even in cases with very large looming. Such an example can be seen in
Figure
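The matching strategy above can be sketched as follows: the scale space is quantized into a few candidate factors, the patch is rescaled by each factor, and the (position, scale) pair with the highest normalized cross-correlation is kept. This is a hedged, brute-force illustration; the function names and the dense search over all positions are our own assumptions made for clarity:

```python
import numpy as np

def rescale_nn(patch, s):
    """Nearest-neighbor rescaling of a 2D patch by factor s."""
    h, w = patch.shape
    nh, nw = max(2, int(round(h * s))), max(2, int(round(w * s)))
    ys = (np.arange(nh) * h / nh).astype(int)
    xs = (np.arange(nw) * w / nw).astype(int)
    return patch[np.ix_(ys, xs)]

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else -1.0

def match_with_scale(patch, image, scales=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Matching under an unknown scale change (looming): quantize the
    scale space into candidate factors, rescale the patch by each one,
    and keep the (row, col, scale) with the highest NCC score.
    """
    H, W = image.shape
    best = (-2.0, 0, 0, 1.0)                 # (score, row, col, scale)
    for s in scales:
        q = rescale_nn(patch, s)
        qh, qw = q.shape
        if qh > H or qw > W:
            continue                         # rescaled patch too large
        for y in range(H - qh + 1):
            for x in range(W - qw + 1):
                score = ncc(q, image[y:y + qh, x:x + qw])
                if score > best[0]:
                    best = (score, y, x, s)
    return best[1], best[2], best[3]
```

For example, a patch that appears at twice its original size (large looming) is correctly re-localized at half scale.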

(a) Image

The output of
the previous stage will thus be a series of approximate local 3D models (along
with approximate estimates of the relative pose between every successive two).
Rather than trying to create a consistent global model by combining all local
ones (a rather tedious task, requiring, among other things, high-quality
geometry and pose estimation), we will instead follow a different approach,
based on the following observation. Let

When seeking a correspondence for a pixel

For this purpose, a two-step procedure will be followed,
depending on whether a point in

The main idea is that if a point in

The rest of the points of

The two important remaining issues (which also constitute
the core of the morphing procedure) are as follows:

how to compute the mapping

and how to obtain the missing values of the
destination geometric-maps for

In general,
obtaining a reliable, dense optical flow field between images

On the one hand, this means that the resulting
displacement vectors can be large, which, in turn, implies that, when searching
for the correspondence of a pixel

On the other hand, for extracting pixel
correspondences, one usually needs to compare patches from images

Finally, even if both of the above problems are solved, optical flow estimation is inherently an ill-posed problem and so additional regularization assumptions need to be imposed.

For dealing with the first of these issues, we
will make use of the underlying geometric maps

For dealing with the second problem, we will use a
technique similar to the one described in Section

However, even after the above enhancements, estimating
correspondences independently for each pixel will produce many errors,
precisely because of the ill-posedness mentioned above. Hence, to obtain
high-quality results, one needs to regularize the problem by exploiting the
fact that optical flow vectors at neighboring pixels typically exhibit a high
correlation. Discrete Markov random fields (MRFs) [

Given all these label sets, obtaining an optical
flow field then amounts to picking one element from the Cartesian product

On the other hand, as already mentioned, the role
of the pairwise terms

On the top
row, we show again the two images

After estimating function

whose values in

but whose values in

In the middle,
we show an incomplete destination depth map

In this manner, however, the resulting
destination maps

We show rendered views of a morphable 3D-model during
the transition from one key-position (corresponding to image

Instead, the
proper way to fill in the destination vertices at region

Along the boundary of

In the interior of

In mathematical
terms, the first condition obviously translates to
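The two conditions above (fixed values along the boundary of the missing region, smoothness in its interior) can be read as a discrete Laplace equation with Dirichlet boundary values. The following is a minimal sketch, with the grid discretization and the plain Gauss-Seidel solver as illustrative assumptions:

```python
import numpy as np

def fill_membrane(depth, hole_mask, iters=500):
    """Fill the missing depth values inside `hole_mask` so that (i) the
    result matches the known values along the hole's boundary and
    (ii) it varies smoothly in the hole's interior. Each missing pixel
    is repeatedly replaced by the average of its 4 neighbors
    (Gauss-Seidel sweeps on the discrete Laplace equation).
    Assumes the hole does not touch the image border.
    """
    z = depth.astype(float).copy()
    ys, xs = np.nonzero(hole_mask)
    z[hole_mask] = z[~hole_mask].mean()      # rough initialization
    for _ in range(iters):
        for y, x in zip(ys, xs):
            z[y, x] = 0.25 * (z[y-1, x] + z[y+1, x] + z[y, x-1] + z[y, x+1])
    return z
```

Since harmonic interpolation reproduces any linear function exactly, punching a hole into a planar depth map and refilling it recovers the plane.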

An important
advantage of our framework is that, regardless of the scene's size, only one
“morphable 3D-model”

More specifically, for implementing the photometric
morphing of model

(a): Pixel shader code (and the associated vertex shader code), written in GLSL (OpenGL Shading Language), for implementing the photometric morphing. (b): Skeleton code in C for applying vertex blending in OpenGL.

On the other hand, for implementing the geometric
morphing, the following procedure is used: two 2D triangulations of regions

Therefore, based on the above observations, rendering a morphable model simply amounts to feeding just four textured triangle meshes into the GPU. This rendering path is highly optimized on all modern GPUs, and so a considerable amount of 3D acceleration is obtained during the rendering process.
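As a hedged CPU-side sketch of the two morphing operations (which the actual system performs on the GPU, via the pixel-shader and vertex-blending code discussed above), both reduce to linear blends driven by a single parameter t going from 0 at the source key-position to 1 at the destination:

```python
import numpy as np

def photometric_morph(img_src, img_dst, t):
    """Per-pixel linear blend of the source and destination textures:
    out = (1 - t) * src + t * dst (what the pixel shader computes)."""
    t = float(np.clip(t, 0.0, 1.0))
    return (1.0 - t) * img_src + t * img_dst

def geometric_morph(verts_src, verts_dst, t):
    """Vertex blending for the geometric morphing: every 3D vertex
    moves linearly from its source position to its destination one."""
    t = float(np.clip(t, 0.0, 1.0))
    return (1.0 - t) * verts_src + t * verts_dst
```

Both operations are trivially parallel per pixel and per vertex, which is precisely why they map so well onto the GPU.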

Up to now, we have assumed that a full local 3D model
is constructed each time, that is, all points of the image grid are included as
vertices in the 2D triangulations

For completely specifying the decimation process, all
that remains to be defined is the error function

(a) Estimated disparity field corresponding to a local 3D model

For
simplicity, up to this point we have been assuming that, during the image
acquisition process, only one stereoscopic view has been captured per
key-position along the path. Our framework, however, can be readily extended to
the case where multiple stereoscopic views (and hence multiple local models)
exist per key-position, related to each other by a pure rotation of the stereoscopic
camera. Such a scenario can be useful, for example, when an extended field of
view is required during the virtual walkthrough (as in VR installations with
large screens). To reduce this case to the one already examined, it suffices
that a single local model (called

Intuitively, each 3D-mosaic should correspond to a
local model produced from a stereoscopic camera with a wider field of view.
Technically, let

The estimated transformations

In (a) we show a portion of a 3D-mosaic constructed by combining two local models. We show the result both when no geometric rectification is applied to the local models and when geometric rectification is applied. In the former case, notice the geometric inconsistency at the boundary between the two local models. In (b) we show another 3D-mosaic that has been constructed by combining three local models.

As already
mentioned, an important property of our geometric rectification algorithm is
that the true relative 3D structure is preserved in the resulting 3D-mosaic.
This is in sharp contrast to other methods for fusing geometric maps, such as
feathering, as illustrated by the following toy example. Let us assume that we
have two local models,

The “morphable 3D-mosaics” framework has been
successfully applied to the visual 3D reconstruction of the well-known Samaria
Gorge in Crete (a gorge that was recently awarded a Diploma First Class by the
Council of Europe, as one of Europe's most beautiful spots).
Based on this 3D reconstruction, and by also using a 3D virtual reality
installation, the ultimate goal of that work has been to provide a lifelike
virtual tour of the Samaria Gorge to all visitors of the National History
Museum of Crete, located in the city of Heraklion. To this end, the most
beautiful spots along the gorge have been selected, and for each such spot a
predefined path over 100 meters long has been chosen. About 15
key-positions have been selected along each path and approximately 45
stereoscopic views have been acquired at these positions, with 3 stereoscopic
views per position (in this manner, a field of view that was approximately

Two stereoscopic views as they would be rendered by the VR system (for illustration purposes, these are shown in the form of red-blue images).

Despite the fact that a common graphics card has
been used, very high frame rates of about 25 fps in stereo mode (i.e., 50 fps
in mono mode) were obtained. On the one hand, this is due to the fact that,
regardless of the actual size of the scene, only one morphable 3D-model needs
to be displayed at any time during rendering. This is an important advantage,
which makes our framework extremely scalable to large-scale scenes. On the
other hand, the achieved frame rates are also due to the highly optimized
rendering pipeline of our system. The reason for this is that both the
photometric and the geometric morphings can be implemented directly on
the GPU. For example, the photometric morphing operation (associated with, say,
morphable model

Sample results from the visual
reconstruction of the Samaria Gorge are shown in Figure

Some rendered views that are produced as the virtual camera traverses a path through the so-called “Iron Gates” area, which is the most famous part of the Samaria Gorge. In this case, the virtual camera passes through multiple morphable 3D models.

One of the additional benefits of having a virtual 3D reconstruction of the gorge is the ability, for example, to add synthetic visual effects. Here, for instance, a synthetically generated volumetric fog has been added to the scene.

Another
difficulty that we had to face during the visual reconstruction of the Samaria
Gorge was related to the fact that a small river passed through a certain
part of the gorge. This was a problem for the construction of the corresponding
local 3D models, as our stereo matching algorithm could not extract
disparity (i.e., find correspondences) for the points on the water surface.
This was because the water was not always static and, furthermore, the
sun reflections on its surface violated the Lambertian
assumption during stereo matching (see Figure

(a) The left image of a stereoscopic pair captured at a region where a small river passes through. (b) The disparity estimated using the stereo matching procedure. As expected, the disparity field contains many errors for points on the water surface, especially those lying near the sun reflections on the water. (c) The corresponding disparity when a 2D homography is used to fill in the left-right correspondences for the points on the water. In this case, the water surface is implicitly approximated by a 3D plane.
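The homography-based fill described in the caption can be sketched as follows, assuming a rectified stereo pair; the function name and calling convention are illustrative:

```python
import numpy as np

def disparity_from_homography(H, mask):
    """For the pixels flagged by `mask` (here, the water surface),
    replace the unreliable stereo matches by the left-to-right
    correspondences induced by a 2D homography H, which implicitly
    approximates the surface by a 3D plane. For a rectified pair the
    induced disparity is d(x, y) = x - x', where (x', y', 1) ~ H (x, y, 1).
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys, np.ones_like(xs)]).astype(float)  # homogeneous
    mapped = H @ pts
    xp = mapped[0] / mapped[2]               # x' after perspective divide
    disp = np.zeros((h, w))
    disp[ys, xs] = xs - xp
    return disp
```

For instance, a homography that is a pure horizontal translation corresponds to a fronto-parallel plane of constant disparity.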

In conclusion, we have presented a new approach for obtaining photorealistic and interactive walkthroughs of large outdoor scenes. To this end, a novel hybrid data structure, called “morphable 3D-mosaics”, has been proposed. The resulting framework offers a series of important advantages. (1) To start with, no global 3D model of the environment needs to be assembled, a process that can be extremely cumbersome and error-prone, especially for large-scale scenes, where many local models need to be registered to each other. (2) Furthermore, rendering such a global model would be a very inefficient operation, with no inherent support for scalability, thus leading to very low frame rates for large-scale scenes. By contrast, our method is extremely scalable to large-scale environments, as only one morphable 3D-model needs to be displayed at any time. (3) In addition, it makes use of a rendering path that is highly optimized in modern 3D graphics hardware. (4) Moreover, by utilizing an image-based data representation, our framework is capable of fully reproducing the photorealistic richness of the scene. (5) As another advantage, the data acquisition process is very easy (e.g., collecting the stereoscopic images for a path over 100 meters long took us only about 20 minutes) and requires no special or expensive equipment (just a pair of digital cameras and a tripod). (6) Finally, our framework constitutes an end-to-end system, providing an almost fully automated processing of the input data, which consist of just a sparse set of stereoscopic images of the scene.

In the future, we intend to eliminate the need for calibrating the stereoscopic camera, as well as to allow the stereo baseline to vary during the acquisition of the stereoscopic views (these enhancements would allow an even more flexible data acquisition process). Another issue that we wish to investigate is the possibility of capturing the dynamic appearance of moving objects, such as moving water or grass, that are frequently encountered in outdoor scenes (instead of rendering these objects as static). To this end, we plan to enhance our “morphable 3D-mosaics” framework so that it can also make use of real video textures previously captured inside the scene. Finally, a current limitation of our method is the assumption that the lighting conditions across the scene are not drastically different (which is not always true in outdoor environments). One possible way to deal with this issue is to obtain the radiometric response function of each photograph. In this case, when constructing the morphable 3D models, one should also use the estimated response functions to compensate for image differences due to the varying lighting conditions.

This paper is part of the 03ED417 research project, implemented within the framework of the Reinforcement Programme of Human Research Manpower (PENED) and cofinanced by National and Community Funds (75% from E.U.-European Social Fund and 25% from the Greek Ministry of Development-General Secretariat of Research and Technology).