Three-dimensional city models are becoming a valuable resource because of their close geospatial, geometrical, and visual relationship with the physical world. However, ground-oriented applications in virtual reality, 3D navigation, and civil engineering require a novel modeling approach, because the existing large-scale 3D city modeling methods do not provide rich visual information at ground level. This paper proposes a new framework for generating 3D city models that satisfy both the visual and the physical requirements for ground-oriented virtual reality applications. To ensure its usability, the framework must be cost-effective and allow for automated creation. To achieve these goals, we leverage a mobile mapping system that automatically gathers high-resolution images together with supplementary sensor information such as the position and direction of the captured images. To resolve problems stemming from sensor noise and occlusions, we develop a fusion technique that incorporates digital map data. This paper describes the major processes of the overall framework and the proposed techniques for each step and presents experimental results from a comparison with an existing 3D city model.
Three-dimensional city models are widely used in applications in various fields. Such models represent either real or virtual cities. Virtual 3D city models are frequently used in movies or video games, where a geospatial context is not necessary. Real 3D city models can be used in virtual reality, navigation systems, or civil engineering, as they are closely related to our physical world. Google Earth [
In the process of 3D city modeling, both the cost and the quality requirements must be considered. The cost can be estimated as the time and resource consumption of modeling the target area. The quality factor considers both visual quality and physical reliability. The visual quality is proportional to the degree of visual satisfaction, which affects the level of presence and reality. The physical reliability is the geospatial and geometrical similarity between the objects—in our case, mainly buildings—in the modeled and physical worlds. Generally, accomplishing a satisfactory level for both requirements is difficult.
Numerous techniques can be used for 3D city modeling. For instance, color and geometry data from LiDAR are mainly used if the application requires detailed building models for a small area. If the city model covers a large area and does not need detailed features, reconstruction from satellite/aerial images is more efficient [
Existing 3D city modeling methods can be divided into manual modeling methods, BIM (Building Information Model) data-based methods, LiDAR data-based methods, and image-based methods.
The manual modeling method is highly dependent on the modeling experts. Older versions of Google Earth and Terra Vista employed this method in their modeling systems. Although current applications employ manual modeling because of its high quality, the method is also a high-cost, labor-intensive process. Hence, it is not efficient for urban environments that include numerous buildings. The BIM data-based method facilitates the use of building design data from the construction stage and is applied in city planning [
To address these problems, remote sensing techniques are being aggressively adopted and studies of LiDAR data-based methods and image-based methods are increasingly common [
Image-based methods include those based on stereo matching and inverse procedural modeling approaches. In earlier research [
In our research, we propose an approach that preserves cost-efficiency by using an existing image database while increasing physical reliability. An image database is easier to access than a LiDAR database, so the cost of data collection can be reduced. On the other hand, stereo matching-based methods require numerous images for large-scale modeling, which limits their general applicability. We therefore prefer an inverse procedural modeling approach for our objective, while the physical reliability can be increased by combining accurate reference data [
In this study, we propose a framework that uses a massive number of images gathered from a mobile mapping system (MMS). This addresses many problems in existing methods, which cannot simultaneously provide feasible levels of cost effectiveness, visual quality, and physical reliability. An MMS collects and processes data from sensor devices mounted on a vehicle. Services such as Google Street View [
- Nationwide or even worldwide coverage following the development of remote sensing technologies and map services.
- Rich, visual, and omnidirectional information.
- Sensor information that allows geospatial coordination with the physical world.
Using these advantages, we can model a diverse city area for ground-oriented interactive systems in a cost-effective way with the existing image database. Moreover, high visual quality at ground level can be provided by high-resolution panoramic images [
- Sensor data includes noise, which lowers its physical reliability.
- The number of images in a given area is limited and is insufficient for stereo matching-based reconstruction.
- An enormous amount of unnecessary visual information is included, such as occlusions, cars, and pedestrians.
Noise is unavoidable in the sensing process. The amount of error this introduces differs according to the surrounding environment; a ±5 m positional error and ±6° directional error have been reported in Google Street View data [
To address these problems with MMS data, we propose a method that incorporates 2D digital map data. Digital maps have accurate geospatial information about various features in the physical world. For instance, the 1 : 5000 digital maps applied in our framework have a horizontal accuracy of 1 m, which is five times better than that of the MMS position data. Therefore, by combining the data, the problems of sensor errors can be overcome and visual information can be used selectively. On the other hand, the geometry of the buildings is restricted to a quasi-Manhattan world model: the assumption that structures consist of vertical and horizontal planes. This extends the Manhattan world model, which additionally requires those planes to be orthogonal to each other.
The proposed framework is illustrated in Figure
Overview of the proposed 3D city modeling framework.
The detailed process is illustrated in Figure
- Image/building analysis.
- Error correction/compensation.
- Segmentation and validation.
- Texture mapping.
Detailed process of the proposed approach.
The error correction/compensation and segmentation and validation processes include a feedback loop to sequentially obtain the texture of individual buildings.
This stage analyzes the correlation between each image and the digital map. To do this, a base 3D model is generated from the footprint of the buildings by extending the model in the vertical direction. The footprint data consists of the geospatial coordinates of the building contour projected to the ground surface. This data should retain a certain level of accuracy and therefore contains precise information about the buildings.
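The vertical extension of a footprint into a base 3D model can be sketched in a few lines. This is a minimal Python sketch, not the framework's Unity3D implementation; the square footprint and the 20 m height are illustrative placeholders, since the real heights are refined in a later step.

```python
def extrude_footprint(footprint, height):
    """Extend a 2D building footprint (CCW list of (x, y) map coordinates)
    vertically into a prismatic base model: vertices plus wall quads."""
    n = len(footprint)
    bottom = [(x, y, 0.0) for x, y in footprint]
    top = [(x, y, height) for x, y in footprint]
    vertices = bottom + top
    # One vertical quad per footprint edge, indexed into `vertices`.
    walls = [(i, (i + 1) % n, n + (i + 1) % n, n + i) for i in range(n)]
    return vertices, walls

# Illustrative 10 m x 6 m footprint extruded to a placeholder height.
verts, faces = extrude_footprint([(0, 0), (10, 0), (10, 6), (0, 6)], 20.0)
```

The same prism structure is what the later refinement step adjusts, primarily by replacing the placeholder height with the estimated one.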
The input image can then be positioned in the 3D environment according to the GPS/INS sensor information. Next, buildings are classified according to the three criteria listed below. The objective of this classification is to separate the buildings into those whose texture can be acquired at high resolution and the rest. The proposed criteria are as follows:
- The distance between the building and the location reported by the GPS sensor for the image.
- The occlusion between the buildings.
- The region occupied by the building in the image.
The distance criterion is quite straightforward: more distant buildings are less likely to appear in the image. The occlusion criterion is also reasonable, as an occluded building cannot appear in the image. The third criterion provides a rough estimate of the texture resolution obtainable from each image, computed from the façade's viewing angle and its width in the image. Assuming
These criteria are illustrated in Figure
Criteria of the image/building analysis. (a) Distance between the image and buildings; (b) occlusion between buildings; and (c) facing viewing angle of the building façade, which is used to calculate the texture resolution.
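The three criteria above can be sketched as a simple classifier. This is a hedged Python sketch: the thresholds MAX_DIST and MIN_RES are assumptions for illustration only, and occlusion is taken as a precomputed flag (e.g., from ray casting against the base 3D model).

```python
import math

MAX_DIST = 50.0   # assumed cutoff distance in metres (illustrative)
MIN_RES = 2.0     # assumed minimum texture resolution in px/m (illustrative)

def texture_resolution(cam, p0, p1, px_per_rad):
    """Rough horizontal texture resolution (px/m) of the facade p0-p1
    as seen from camera position `cam`, all in 2D map coordinates."""
    a0 = math.atan2(p0[1] - cam[1], p0[0] - cam[0])
    a1 = math.atan2(p1[1] - cam[1], p1[0] - cam[0])
    # Angle subtended by the facade, wrapped to [0, pi].
    fov = abs(math.atan2(math.sin(a1 - a0), math.cos(a1 - a0)))
    width = math.hypot(p1[0] - p0[0], p1[1] - p0[1])
    return fov * px_per_rad / width

def texture_acquirable(dist, occluded, resolution):
    """Combine the three criteria into a single yes/no classification."""
    return dist <= MAX_DIST and not occluded and resolution >= MIN_RES

# A 5400-px-wide panorama maps 2*pi radians to 5400 pixels.
PX_PER_RAD = 5400 / (2 * math.pi)
res = texture_resolution((0, 0), (10, -5), (10, 5), PX_PER_RAD)
```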
The correction/compensation process turns the omnidirectional building detection problem into the simple problem of segmenting a single image. Noise in the GPS/INS sensor is the primary source of the mismatch between the image/building analysis results and the ground truth.
First, correction of the GPS/INS sensor error is performed using image-based localization methods. Image-based localization is a major research issue in robot vision and location-based services. The localization of a panoramic image can be conducted by applying semantic segmentation [
The error compensation utilizes this error bound to set the region of interest (ROI). Our objective in error compensation is to partition and reproject the single 360-degree panoramic image into several normal-lens images, one for each target building. As mentioned above, the error correction reduces the position and orientation error, but mismatches of up to 2 m in position remain between the base 3D model from the digital map and the panoramic image. We therefore calculate the FOV for each targeted building, that is, the ROI, with a 2-m margin. The panoramic image is then reprojected using a rectilinear projection, which preserves straight lines in 3D space as straight lines in the projected image. As a result, the ROI contains the complete image of the target building and the complexity of the image-segmentation process is reduced.
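The reprojection step can be illustrated by the inverse mapping from an output pinhole pixel to panorama coordinates. This is a sketch under assumed conventions (equirectangular panorama, yaw measured relative to the panorama centre, horizontal FOV only); the actual framework also handles the vertical FOV and the 2-m ROI margin.

```python
import math

def rectilinear_pixel(u, v, out_w, out_h, yaw, hfov, pano_w, pano_h):
    """Map output pixel (u, v) of a rectilinear view with the given yaw and
    horizontal FOV to (px, py) coordinates in an equirectangular panorama."""
    f = (out_w / 2) / math.tan(hfov / 2)      # focal length in pixels
    x, y = u - out_w / 2, v - out_h / 2
    lon = yaw + math.atan2(x, f)              # longitude of the viewing ray
    lat = math.atan2(-y, math.hypot(x, f))    # latitude of the viewing ray
    px = (lon / (2 * math.pi) + 0.5) * pano_w
    py = (0.5 - lat / math.pi) * pano_h
    return px, py
```

Sampling the panorama at `(px, py)` for every output pixel (with interpolation) produces the normal-lens ROI image; straight 3D lines stay straight because the mapping is a true pinhole projection.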
This step involves the segmentation and validation of individual buildings. As noted in earlier studies [
(a) Detected horizontal line segments (green and blue); (b) segmentation results using [
For validation, the segmented building boundary is evaluated with three checks on the segmented image:
- The existence of the building in the segmented region.
- The equality of the segmented building with a building in the 2D map.
- The completeness of the segmentation, which checks for pixel loss in the segmentation stage.
If the building consists of a set of planes, its existence can be confirmed by means of vanishing point estimation. The relationship between the normal directions of the planes can be calculated and compared to the relationship between the normal directions from the digital map to check whether they are consistent. This data can also be fed back to the second step to correct the position and orientation information.
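The comparison of normal-direction relationships can be sketched as follows. This is a Python sketch, not the paper's implementation: the 5° tolerance is an assumption, the map-side normals come from footprint edges, and wrap-around cases near ±π in the offset comparison are not handled.

```python
import math

def facade_normals(footprint):
    """Normal angle of each facade, from the edges of a CCW footprint."""
    angles = []
    n = len(footprint)
    for i in range(n):
        (x0, y0), (x1, y1) = footprint[i], footprint[(i + 1) % n]
        edge = math.atan2(y1 - y0, x1 - x0)
        angles.append(edge - math.pi / 2)   # rotate edge direction by -90 deg
    return angles

def angles_match(map_angles, vp_angles, tol=math.radians(5)):
    """Check that the pairwise angle relationships agree up to one global
    rotation (the unknown camera heading). Each offset should be equal."""
    offsets = [(a - b + math.pi) % (2 * math.pi) - math.pi
               for a, b in zip(map_angles, vp_angles)]
    return max(offsets) - min(offsets) <= tol
```

When the check passes, the common offset itself is an orientation correction that can be fed back to the error correction step.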
Finally, pixel loss can occur even when the error correction/compensation process is complete. We expect that the continuity of the image segments used in the vanishing point estimation process can be used with the color information to guarantee subpixel accuracy in the segmented result. The resulting image is expected to resemble that shown in Figure
Image segmentation and validation. From the regional image of the building to the individual façade.
After segmentation, a process of model refinement and texture mapping is required. Model refinement involves modifying the geometrical shape of the buildings, with height being the most significant factor. The building height can be estimated from the vertical field of view in the original panoramic image, which was mapped inversely from the segmented vertical edge in the previous step. In addition, because we refer to digital map data, building heights can be estimated by simple trigonometry, combining this view angle with the distance from the camera to the building edge. Additional refinements can be done by adopting the methods in [
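The trigonometric height estimate reduces to one line once the roofline's elevation angle and the map distance to the facade are known. In this sketch the 2 m camera height is an assumed MMS rig parameter, not a value from the paper.

```python
import math

def building_height(dist, top_angle, cam_height=2.0):
    """Estimate building height from the horizontal distance to the facade
    (known from the digital map, in metres) and the elevation angle of the
    segmented roof edge above the horizon (in radians)."""
    return cam_height + dist * math.tan(top_angle)
```

For example, a roof edge whose elevation angle satisfies tan(angle) = 18/20, seen from 20 m away, yields a height of 18 m above the camera, i.e. 20 m in total with the assumed rig height.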
The texture mapping stage is quite straightforward once the previous steps are complete. The resulting model is rendered using color data from the texture image and the UV-coordinate values of the vertices in the 3D model. To simplify the UV-coordinate modification, we let the width of the single texture image represent the total perimeter of the building and assign to each façade a horizontal span proportional to the ratio of its length to that perimeter. The resulting texture is illustrated in Figure
Resulting texture image.
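The perimeter-ratio UV assignment described above can be sketched as follows (a minimal Python sketch; the facade widths are illustrative):

```python
def facade_u_ranges(facade_widths):
    """Horizontal texture range (u_start, u_end) in [0, 1] per facade,
    proportional to each facade's share of the building perimeter."""
    perimeter = sum(facade_widths)
    ranges, u = [], 0.0
    for w in facade_widths:
        ranges.append((u, u + w / perimeter))
        u += w / perimeter
    return ranges

# A 10 m x 5 m building: the long facades each receive 1/3 of the texture
# width, the short facades 1/6.
ranges = facade_u_ranges([10, 5, 10, 5])
```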
For the experiments, we obtained panorama images from around Daejeon, South Korea, as well as coupled position and orientation data from a GPS/INS sensor. The source panorama images have a resolution of 5400 × 2700 pixels and a 360° horizontal/180° vertical field of view. Each panorama image was generated by stitching six perspective images using equidistant cylindrical mapping. A total of 233 images were gathered in the area bounded by [36.358933°, 36.339820°] latitude and [127.432165°, 127.436540°] longitude. The applied digital map was drawn at a scale of 1 : 5000, giving a horizontal accuracy of 1 m. By selecting digital map data through its classification code, only information related to the buildings was utilized.
For the 3D image/building analysis environment, we adopted Unity3D version 4.3 with ray casting and vector calculation functionality. Before the image/building analysis, the digital map data was parsed using the shapefile C Library in the Unity3D environment to generate the base 3D model. Error correction/compensation, image segmentation, and validation processes were implemented using MATLAB R2013a. A standard desktop PC (Intel Core i5 CPU 3.4 GHz, 4 GB RAM, Windows 7) was used as hardware.
The resulting model, illustrated in Figure
- Rich visual information from the photo-realistic appearance.
- Height values identical to those of the buildings in the physical world.
- Accurate geospatial information, identical to that of the digital map.
- A quasi-Manhattan world model with a complete prismatic mesh structure.
Resulting textured building models.
The photo-realistic appearance provides an identical visual experience to the user, which is an important factor in immersive virtual environments. Additionally, the remaining text on buildings and signs provides additional visual information. The difference in visual quality is illustrated in Figure
Comparison between our method (a) and aerial image-based method (b).
Scene from the same geospatial position in the resulting city model (a) and the panoramic image mapped onto a unit sphere (b).
Since the objective of our research is the offline generation of the model from the source data, computation time is not a primary concern. Nevertheless, the image/building analysis takes 3.2 seconds for about 120 image locations with 7711 buildings in the digital map. The computation time of this process depends on the number of images and buildings in the targeted region. On the other hand, the computation time of the vanishing point estimation and reprojection in the error correction/compensation process, as well as of the segmentation, depends on various factors, including the number of detected line segments, the resolution of the panoramic image, and the region occupied by the target building. For instance, the reprojection takes 36 seconds for a 30-degree horizontal and 120-degree vertical FOV, which produces a 647 × 2977 image from a 5400 × 2700 panorama. The processing time increases to 71 seconds for a 60-degree horizontal FOV with the same vertical FOV, which produces a 1311 × 2977 image. Vanishing point estimation takes 10 seconds on average using the line segment detector algorithm [
In this paper, we proposed a framework for the automatic generation of a textured high-resolution 3D city model for ground-level applications. Our approach was employed to produce a complete prismatic mesh based on a quasi-Manhattan world model. The cost-effectiveness of the framework is ensured by its use of MMS data, which allows access to a massive number of images and a large coverage area. To address the problems of existing MMS-based methods, our framework uses digital map data in four major steps. The proposed approach combines existing techniques with novel processes in each of these steps.
In future work, we will consider combining data from different sources. This is because data from the MMS are restricted: images of façades facing away from the road cannot be gathered. The missing data can be supplemented by processing aerial images [
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work was supported by the Human Resources Development Program (no. 20134030200300) of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government Ministry of Trade, Industry, and Energy and Development of Integration and Automation Technology for Nuclear Plant Life-cycle Management grant funded by the Korea government Ministry of Knowledge Economy (2011T100200145).