Individual Building Rooftop and Tree Crown Segmentation from High-Resolution Urban Aerial Optical Images

We segment buildings and trees from aerial photographs by using superpixels, and we estimate tree parameters by using a cost function proposed in this paper. A method based on image complexity is proposed to refine superpixel boundaries. In order to distinguish buildings from ground and trees from grass, salient feature vectors that include colors, Features from Accelerated Segment Test (FAST) corners, and Gabor edges are extracted from the refined superpixels. The vectors are used to train a Naive Bayes classifier, and the trained classifier labels each refined superpixel as object or nonobject. The properties of a tree, including its location and radius, are estimated by minimizing the cost function, and the tree height is calculated from its shadow using the sun angle and the time when the image was taken. Our segmentation algorithm is compared with two other state-of-the-art segmentation algorithms, and the tree parameters obtained in this paper are compared to ground truth data. Experiments show that the proposed method segments trees and buildings appropriately, yielding higher precision and better recall rates, and that the tree parameters are in good agreement with the ground truth data.


Introduction
With the fast pace of industrialization and urbanization, 3D models are increasingly necessary for urban planning, flight simulators, and military training. Identifying buildings and trees in high-resolution aerial photographs in real time is a key step toward displacement maps and 3D city models, because buildings and trees not only are significant features for city modeling but also often occlude other elements in 3D urban models. Therefore, the first step of displacement mapping is to detect buildings and trees in aerial photographs. Automatic detection of trees and buildings is challenging because, first, the input aerial-photograph data sets are huge; second, the features of buildings and trees vary widely, and it is hard to find salient ones for training; and third, it is difficult to distinguish buildings from ground and trees from grass in desert climate regions such as the state of Arizona in the USA.
Recently, many methods have been proposed for object detection in aerial images. In this section, we review prior work on building and tree detection from three aspects: (1) building and tree segmentation; (2) superpixel refinement; (3) salient region features.
(A) Building and Tree Segmentation. Building and tree segmentation from aerial images is used in many applications and analyses, for example, mapping, video games, 3D modeling, environmental management and monitoring, and disaster management. Huertas and Nevatia [1] used a generic model of structure shapes that were rectangular or composed of rectangular components to detect buildings. However, they assumed that visible building surfaces consisted of smooth regions and that building sides consisted mostly of vertical structures, which makes it difficult to detect complex buildings. Liow and Pavlidis [2] proposed two methods integrating region growing and edge detection to extract buildings; however, some small buildings were not detected. Rottensteiner and Briese [3] used a hierarchic application of robust interpolation with a skew error distribution function to extract building points from LiDAR data, but this method needed much time to calculate the digital terrain model (DTM). Rottensteiner et al. [4] proposed a hierarchic approach to extract buildings from LiDAR data and multispectral images by using a classification. This method extracted both large and small buildings; however, it put more emphasis on detecting all buildings in the test data set than on reducing the false alarm rate. Cheng and his colleagues [5] proposed an algorithm integrating LiDAR data and optical images for building roof segmentation; however, point normal estimation is needed for initial segmentation, which makes the algorithm complicated. To achieve roof segmentation from aerial images, El Merabet et al. [6] proposed a line- and region-based watershed method built on the cooperation of edge- and region-based segmentation; however, this method is not robust to variations in aerial image resolution, contrast, noise, and so forth.
There are many methods for tree segmentation from aerial urban images. Iovan et al. [7] proposed a supervised classification system based on Support Vector Machines (SVM) to detect trees in high-density urban areas using high-resolution color infrared images. Unlike our proposed method, this algorithm uses only textures to separate trees from buildings, which reduces segmentation accuracy. Another pixel-level classifier, based on AdaBoost, was implemented to segment trees from aerial urban images [8]; however, this method cannot separate neighboring trees. Chen and Zakhor [9] proposed a random forest classifier for tree segmentation without using color information; in this method, the edges of buildings and trees were not discriminated because of the lack of other features. To segment trees accurately, Brook and Ben-Dor [10] proposed an approach based on a sequential merge of different classifiers; however, this classifier was complex and difficult to implement. Risojević et al. [11] evaluated an SVM classifier using only Gabor features; however, because of the lack of color information, objects with smooth structures cannot be correctly segmented. Swetnam and Falk [12] proposed a method for tree canopy segmentation and tree parameter estimation using LiDAR data. To calculate forest structural parameters, Huang et al. [13] proposed a method for calculating tree height and crown width from high-resolution aerial imagery together with a low-density LiDAR system.
(B) Superpixel Refinement. In order to overcome undersegmentation, the number and boundaries of superpixels should be determined appropriately. A superpixel is a patch whose boundary matches the edge of an object. Superpixels represent a restricted form of region segmentation, balancing the conflicting goals of reducing image complexity through pixel grouping while avoiding undersegmentation. They have been adopted primarily by those attempting to segment, classify, or label images from labeled training data. Superpixels serve as basic units for further image processing: their purpose is to reduce redundancy in the image and increase the efficiency of the next processing task. Many methods for computing superpixels have been presented.
Levinshtein et al. [14] presented a method to produce smooth segment boundaries. Their Turbopixels algorithm is a geometric-flow-based algorithm for computing a dense oversegmentation of an image, often referred to as superpixels. It produces segments that respect local image boundaries while limiting undersegmentation through a compactness constraint. It is very fast, with complexity approximately linear in image size, and can be applied to megapixel-sized images with high superpixel densities in a matter of minutes. However, it produces a fixed number of superpixels. Engel et al. [15] used medial features to detect undersegmented regions and the Sobel operator to detect their edges; however, the computational burden of this method was high because of its iterations, which increased the run time. Kluckner and Bischof [16] proposed an automatic building extraction method based on superpixels, which is similar to our method; however, they applied a conditional random field (CRF) to refine the superpixels twice, which is computationally expensive. Therefore, we propose a fast and simple method based on image complexity to determine the number of superpixels and refine their boundaries.
In summary, a drawback of most of these methods is their high computational complexity and hence long computation time. Moreover, a sophisticated method for marker image calculation that respects both the remaining natural edges in the image and the regularity of marker placement in still regions of the image has yet to be developed.
(C) Salient Region Features. A salient region, a concept introduced by Itti et al. [17], attracts much more attention from the human brain and visual system than other regions in an image. The concept is inspired by the Human Vision System (HVS), and features extracted from salient regions are invariant to viewpoint change, insensitive to image perturbations, and repeatable under intraclass variation [18]. These are key properties for image classification, so salient regions have been introduced for this task. Hahn et al. [19] presented a representation for 3D salient region features, which were used for medical image registration; the authors used a region-growing-based approach to extract them. To extract cartographic features from aerial and satellite images, Kicherer et al. [20] proposed an interactive method based on image texture and gray-level information for image region segmentation. Bi et al. [21] proposed a method for detecting salient tiles across the entire image using low-level features; however, because only tiles with regular edges were selected, some object edges were missed and much computation was needed. Therefore, we extract salient features from superpixels and use them for aerial image classification.
(D) Research Objective. To automatically generate a realistic 3D urban model from a single large high-resolution 2D urban image, we should segment and detect all the objects in the 2D image, such as trees, buildings, and streets. However, in this work we limit our focus to buildings and trees. Our aim is to produce a label for each pixel in an aerial photograph indicating whether a building or tree is present at its location. We are given a high-resolution aerial or satellite photograph from a nadir viewpoint that includes visible color, which is an important feature for building and tree segmentation. We do not consider additional raster data that may be available, such as LiDAR, infrared, hyperspectral, or stereo data or digital surface models, because they can be more difficult to obtain. Therefore, we claim the following specific contributions: (1) We propose a method to automatically estimate the number of superpixels for an aerial photo based on a measurement of image complexity and to detect the boundaries of undersegmented regions. The new method makes our approach robust against missing or erroneous metadata about image resolution.
(2) We propose a salient feature vector for training the classifier, which is simple, efficient, and easy to implement.
(3) We propose an approach for estimating the location and radius of a tree crown and the tree height without additional information, which is likewise simple, efficient, and easy to implement.

Study Area.
The study area is located in Phoenix, Arizona, United States, at (33°27′N, 112°4′W), as shown in Figure 1. The area of the study region is about 291.73 km². Arizona has an abundance of mountains and plateaus in addition to its desert climate. Despite the state's aridity, 27% of Arizona is forest, and the largest stand of Ponderosa pine trees in the world is found there. The geography of Arizona therefore makes it more difficult to segment buildings and trees from images than in other areas.

Aerial Imagery.
The test images are from urban or suburban regions of Arizona, USA. These aerial images were collected in August 2008 using the ADS40 airborne digital sensor at a flight height of 600 meters. This sensor incorporates a line array of charge-coupled devices (CCDs) and is capable of acquiring visible to near-infrared stereo data at ground resolutions of 0.5 m/pixel; details of this sensor can be found in [22]. The resolution of each aerial image shown in Figure 2 is 8540 × 8540 pixels. The main EXIF information of the experimental images is shown in Table 1. Moreover, to train the classifier, 50 images labeled with trees, buildings, and others are used; examples of the original labeled images, with a resolution of 2048 × 1642 pixels, are shown in Figure 3. Furthermore, 500 images are used to evaluate the proposed algorithm.

Proposed Algorithm
In this paper, we propose a building and tree segmentation method using superpixels from a single large high-resolution urban image. The procedure of this algorithm is shown in Figure 4.

Presegmentation Based on Refined Simple Superpixel.
In this paper, an improved method is proposed to presegment the large-scale image with high accuracy. This method is based on Turbopixels, in which the number of superpixels is given by the user. Unlike Turbopixels, however, our method uses two formulas to overcome the undersegmentation that Turbopixels causes.
To use superpixels [14, 23] to presegment images, reducing computation while preserving exact pixel boundaries, different types of images need different numbers of superpixels. It is therefore necessary to calculate the number of superpixels automatically from the processed image and to overcome undersegmentation by refining the results given by Turbopixels. Accordingly, we use the superpixel defined in Section 3.1.1, whose number is calculated automatically from the image complexity and whose boundaries are refined using edges. This superpixel is called the Refined Simple Superpixel.

Simple Superpixel Segmentation.
The aim of superpixels is to simplify the problem by replacing pixels with regularly spaced, similarly sized image patches whose boundaries lie on edges between objects in the image, so that a pixel-exact object segmentation can be accomplished by classifying superpixel patches rather than individual pixels. However, the boundaries of the superpixels will not match the edges of the objects well if the number of superpixels is too small, and the computation will be expensive if the number is too large.
In this paper, we use the image complexity to calculate the number of superpixels. Image complexity is defined as a measure of the inherent difficulty of finding a true target in a given image. The image complexity metric is determined by the gray-level feature or the edge-level feature [24]. According to Rigau et al. [25], image complexity depends on the entropy of the image intensity histogram; therefore, we use the entropy to measure the image complexity. The following formula is used to calculate the complexity C:

C = -∑_{i=0}^{255} p_i log₂ p_i, (1)

where p_i is the probability of the i-th gray level. This metric was originally developed for automatic target recognition. However, based on extensive experiments, we suggest another novel use of this metric, choosing the number of superpixels N by the following formula:

N = C · S / λ, (2)

where S is the image size and λ is a weighting value. According to (2), the number of superpixels depends on the image complexity and the image size. The bigger C is, the more useful information is included in the aerial image and the more objects it contains; therefore, N must be bigger, meaning that more superpixels whose boundaries lie on edges between objects are needed to segment the image. However, to balance segmentation accuracy against computation, the penalty weight λ is introduced in (2).
Testing different types of images collected by our lab, we suggest that the interval of λ is 6–13; in our experiment λ = 8.1. In order to balance under- and oversegmentation, several parameters were introduced in (2). According to our research, oversegmentation or undersegmentation is caused by an inappropriate number of superpixels, and a more complex image includes more details; therefore, the entropy in (2) measures the image complexity. Besides, a larger image includes more details than a smaller one, which means that more superpixels are required for its segmentation; therefore, the image size appears in (2). Finally, the parameter λ is introduced to suppress oversegmentation. As shown in Figure 2, the trees, buildings, and other objects are covered by different superpixel patches, even when the trees are adjacent, which makes it easy to segment buildings and trees. From Figure 2, we can also see that most of the superpixels cover pixels belonging to the same object. However, some boundaries of the superpixels are missed; therefore, edge detection is used to recover these missing boundaries in Section 3.1.2.
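As a concrete illustration of (1) and (2), the sketch below computes the histogram entropy of a grayscale image and derives a superpixel count from it. Only the entropy formula and the weight λ ≈ 8.1 come from the text; the normalizer `scale` is a hypothetical constant added so the count lands in a practical range for large images.

```python
import numpy as np

def image_entropy(gray):
    """Shannon entropy of the gray-level histogram, Eq. (1)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty bins (0 * log 0 := 0)
    return float(-np.sum(p * np.log2(p)))

def superpixel_count(gray, lam=8.1, scale=2**13):
    """Number of superpixels from complexity and size, Eq. (2) sketch.
    `scale` is an assumed normalizer, not a constant from the paper."""
    C = image_entropy(gray)
    S = gray.size
    return max(1, int(round(C * S / (lam * scale))))
```

A uniform image has zero entropy and therefore yields the minimum count, while a high-entropy image of the same size yields proportionally more superpixels.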

Refined Superpixel.
In this paper, we use a method based on the Canny detector to overcome undersegmentation. Canny [26] suggested that edges in an image are usually defined as sets of points with a strong gradient magnitude; an edge can have almost arbitrary shape and may include junctions. Therefore, if the variance of the pixels in a superpixel is larger than a threshold, the Canny detector is used to refine that superpixel and a new boundary can be found; otherwise, the superpixel does not need to be refined.
In this paper, we propose a function edge to determine which superpixels need to be refined:

edge(i): apply Canny detection if σ_i > thr; otherwise do not refine, (3)

where σ_i is the standard deviation of the i-th superpixel and thr is a threshold obtained by experiment. In this paper, we take thr from the interval [9, 19], calculated based on [27].
Note that simply fixing thr over the three-dimensional color space can cause inconsistent behavior across superpixel sizes: a threshold that produces compact superpixels adhering well to image boundaries at one superpixel size may fail to do so for smaller superpixels, for which the converse is true.
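The refinement test of (3) can be sketched as follows. A plain gradient-magnitude detector stands in for the Canny detector here to keep the sketch dependency-free, and `edge_thr` is an assumed constant.

```python
import numpy as np

def needs_refinement(pixels, thr=12.0):
    """Eq. (3): refine a superpixel only when the standard deviation
    of its pixels exceeds thr (thr taken from [9, 19])."""
    return float(np.std(pixels)) > thr

def refine_superpixel(patch, thr=12.0, edge_thr=50.0):
    """Return an edge mask for the patch if it needs refinement.
    A simple gradient-magnitude detector stands in for Canny here."""
    if not needs_refinement(patch, thr):
        return None                       # boundary already adequate
    gy, gx = np.gradient(np.asarray(patch, dtype=float))
    mag = np.hypot(gx, gy)                # gradient magnitude per pixel
    return mag > edge_thr                 # candidate new boundary pixels
```

A homogeneous patch is left untouched, while a patch straddling two regions is re-examined for a missing boundary.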

Salient Features Extraction from Refined Superpixels.
A superpixel-based classifier is proposed for segmenting buildings and trees from a presegmented image. In order to train the classifier, we assign a label from {building, tree, others} to each superpixel to create the training data, and then a salient feature vector is derived from each entire superpixel. In this paper, we extract color features, corners, and texture features to train the classifier.

Color Features.
In our method, RGB and HSV (hue, saturation, and value) channels are extracted as color features. The input indices fill a color cube which the HSV hexcone was designed to fit exactly [28].
RGB is a basic color space and can be transformed into any other color space, including HSV. At the same time, HSV is a more perceptually reasonable color space with a close relationship to the Human Vision System (HVS). Therefore, both are important for training a classifier. The following features, calculated over each superpixel from the RGB and HSV channels, are extracted:
mean R, mean G, mean B, mean H, mean S, and mean V: the means of each channel of RGB and HSV;
std R, std G, std B, std H, std S, and std V: the standard deviations of each channel of RGB and HSV;
max R, max G, max B, max H, max S, and max V: the values of the pixel with the greatest intensity in the superpixel;
min R, min G, min B, min H, min S, and min V: the values of the pixel with the lowest intensity in the superpixel.
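The 24-dimensional color descriptor above can be assembled as in the sketch below; the per-pixel `colorsys` conversion and the exact feature ordering are implementation choices, not prescribed by the paper.

```python
import colorsys
import numpy as np

def color_features(rgb_patch):
    """24-D color descriptor of one superpixel: mean, std, max, min of
    each RGB and HSV channel (input R, G, B in [0, 255])."""
    rgb = rgb_patch.reshape(-1, 3).astype(float) / 255.0
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in rgb])
    feats = []
    for chans in (rgb, hsv):              # first RGB, then HSV
        for c in range(3):
            v = chans[:, c]
            feats += [v.mean(), v.std(), v.max(), v.min()]
    return np.array(feats)
```

For a patch of pure red pixels, for instance, the mean-R entry is 1.0 and the std-R entry is 0.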

Robust FAST Corners Extraction.
A corner is a small point of interest with variation in two dimensions. Corner detectors can be classified into two categories. (1) Corner detectors based on edges: an edge in an image corresponds to the boundary between two areas, and this boundary changes direction at corners; many early algorithms detected corners from such intensity changes. (2) Corner detectors based on direct intensity comparison, such as FAST, which tests each candidate pixel against a surrounding circle of pixels and is accelerated by machine learning [29], as shown by (4). However, the FAST corner detector is not robust because the threshold t in (4) is a constant; therefore, a novel function for calculating the threshold is proposed in this paper, shown by (5). The segment test is

|I_x − I_p| > t for a contiguous arc of circle pixels x, (4)

where I_p is the gray value of a candidate corner in a superpixel and I_x are the gray values of the neighbors around it, and FAST-corner denotes the number of FAST corners so detected. The threshold t is calculated from the image variance

σ² = (1/N) ∑_i (I_i − μ)², (5)

where σ² is the variance of the aerial image I(x, y) and I_i is the value of the i-th pixel in the image. According to (4) and (5), the bigger σ² is, the more FAST corners are extracted from the superpixels, because an image with larger variance contains more salient corners.
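A minimal version of the segment test (4) with a variance-tied threshold in the spirit of (5) might look as follows. The linear form t = k·std(image) is an assumption, since the exact expression of (5) is not reproduced here.

```python
import numpy as np

# Offsets of the 16-pixel Bresenham circle (radius 3) used by FAST.
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2),
          (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0),
          (-3, 1), (-2, 2), (-1, 3)]

def adaptive_threshold(gray, k=0.5):
    """Variance-tied threshold sketch for Eq. (5): t = k * std(image).
    The linear form and k are assumptions for illustration."""
    return k * float(np.std(gray))

def is_fast_corner(gray, y, x, t, n=12):
    """Segment test of Eq. (4): (y, x) is a corner if at least n
    contiguous circle pixels are all brighter or all darker than the
    center by more than t."""
    ip = float(gray[y, x])
    diffs = [float(gray[y + dy, x + dx]) - ip for dx, dy in CIRCLE]
    for sign in (1, -1):                  # brighter arc, then darker arc
        hits = [d * sign > t for d in diffs]
        hits = hits + hits                # wrap around the circle
        run = 0
        for h in hits:
            run = run + 1 if h else 0
            if run >= n:
                return True
    return False
```

A dark pixel ringed by bright pixels passes the test, while any pixel of a uniform image fails it.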

Gabor Textures.
In our method, we use Gabor filters to extract structural features of buildings and trees. Structure is an important feature for segmenting buildings from streets and ground, and it is also useful for segmenting trees from grass. Furthermore, the frequency and orientation representations of Gabor filters are similar to those of the Human Visual System (HVS), and they have been found to be particularly appropriate for texture representation and discrimination. Among the various approaches for extracting texture features, the Gabor filter has emerged as one of the most popular [30]. A Gabor descriptor of an image is computed by passing the image through a filter bank of Gabor filters. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a complex sinusoidal plane wave [31]:

g(x, y) = exp(−x²/(2σ_x²) − y²/(2σ_y²)) exp(jΩx), (6)

where Ω is the frequency of the Gabor function and σ_x and σ_y determine its bandwidth. From the zoomed-in region of Figure 2, we can see that (1) buildings have a similar color to the ground and streets and (2) trees and grass have similar colors. These two characteristics make it difficult to segment buildings and trees from aerial images; however, the textures of buildings and trees are much more pronounced than those of ground and grass. Besides, in order to balance the conflicting goals of reducing computation and achieving satisfactory accuracy, we use 5 frequencies (Ω₁ = √2/2, Ω₂ = √2/3, Ω₃ = √2/4, Ω₄ = √2/5, and Ω₅ = √2/6) and 8 orientations (θ₁ = 0, θ₂ = π/8, θ₃ = π/4, θ₄ = 3π/8, θ₅ = π/2, θ₆ = 5π/8, θ₇ = 3π/4, and θ₈ = 7π/8).
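The Gabor bank of (6) with the 5 frequencies and 8 orientations listed above can be sketched as below. Responses are taken as magnitudes of a single inner product rather than a full convolution, and the bandwidth `sigma` and kernel size are assumed values.

```python
import numpy as np

def gabor_kernel(omega, theta, sigma=2.0, size=15):
    """2D complex Gabor of Eq. (6): Gaussian envelope modulated by a
    plane wave of frequency omega along orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.exp(1j * omega * xr)

def gabor_features(patch):
    """Mean response magnitudes for the 5-frequency x 8-orientation
    bank of the paper: a 40-D texture descriptor."""
    freqs = [np.sqrt(2) / d for d in (2, 3, 4, 5, 6)]
    thetas = [k * np.pi / 8 for k in range(8)]
    feats = []
    for om in freqs:
        for th in thetas:
            k = gabor_kernel(om, th)
            # A full descriptor would convolve; an inner product with a
            # same-size crop of the patch keeps the sketch short.
            feats.append(np.abs(np.vdot(k, patch[:15, :15])))
    return np.array(feats)
```

Textured patches (trees, rooftops) produce larger responses at their dominant orientation than smooth patches (grass, asphalt).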
By concatenating the above features, a salient feature vector is generated for each superpixel. We then use this feature vector to train a classifier for building and tree segmentation. The Naive Bayes classifier is chosen for tree and building classification in our algorithm because it is efficient and easy to implement [32].
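A self-contained Gaussian Naive Bayes over such concatenated feature vectors might be sketched as follows; the paper relies on an existing implementation [32], so the Gaussian likelihood model here is an assumption.

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes over superpixel feature vectors,
    with labels such as {building, tree, others}."""

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.stats = {}
        for c in sorted(set(y)):
            Xc = X[y == c]
            # per-class channel means, variances (smoothed), and prior
            self.stats[c] = (Xc.mean(0), Xc.var(0) + 1e-6,
                             len(Xc) / len(X))
        return self

    def predict(self, X):
        out = []
        for x in np.asarray(X, float):
            best, best_lp = None, -np.inf
            for c, (mu, var, prior) in self.stats.items():
                # log posterior up to a constant, independence assumed
                lp = (np.log(prior)
                      - 0.5 * np.sum(np.log(2 * np.pi * var))
                      - 0.5 * np.sum((x - mu) ** 2 / var))
                if lp > best_lp:
                    best, best_lp = c, lp
            out.append(best)
        return out
```

Training on labeled superpixel vectors and predicting the label of a new superpixel is then a two-line affair.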

Proposing a Cost Function for Tree Parameters Estimation.
After segmentation, the tree location (x₀, y₀), tree radius r, and tree height h are estimated by a cost function that incorporates shadows.

The Proposed Cost Function.
The tree model matching is applied to the three channels Y, Cb, and Cr of the image I(x, y). The main advantage of this model is that the luminance and the color information are independent, so the luminance component can be processed without affecting the color content [33]. In order to match trees in I(x, y) with the tree model T(x, y), we propose the cost function

E = w₀ (w_Y E_Y + w_Cb E_Cb + w_Cr E_Cr + w_s E_s), (7)

where w₀, w_Y, w_Cb, w_Cr, and w_s are weights that control the convergence speed of the cost function; E_Y, E_Cb, and E_Cr calculate the root of the square error (RSE) of the Y, Cb, and Cr channels between I(x, y) and T(x, y) within the tree's canopy; and E_s calculates the RSE of the shadows in I(x, y) and T(x, y), with the shadow modeled as a circle. Each canopy term is an RSE taken over a circular region of the model, where (x₀, y₀) is the initial center of the model, r₀ is the initial radius, and the normalized region weights lie in [0, 1]. The shadow region is likewise assumed to be a circle with the same radius as the tree model, and its position depends on the centroid of the detected tree. In Section 1, we showed that there is a relationship between a tree and its shadow, illustrated in Figure 5, which can be written as

x_s = x₀ + h₀ cot(β) sin(α),
y_s = y₀ + h₀ cot(β) cos(α), (11)

where α is the sun azimuth angle and β is the sun elevation angle, both of which can be calculated from the time when the image was taken; h₀ is the tree height; (x₀, y₀) is the center of the tree crown; and (x_s, y_s) is the center of the shadow. We estimate the tree height h using (11) and the cost function. It is noted that all these parameters are initialization parameters and that the unit of the tree parameters is the pixel, because we calculate them without additional information.
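The shadow geometry of (11) and its inversion for height estimation can be sketched as below; the axis and angle conventions are assumptions and must match how the azimuth and elevation are measured in the image.

```python
import numpy as np

def shadow_center(x0, y0, h, azimuth, elevation):
    """Eq. (11) sketch: project the crown center (x0, y0) of a tree of
    height h along the sun direction to the shadow center.
    Angles in radians; axis conventions are assumed."""
    L = h / np.tan(elevation)            # shadow length on the ground
    return x0 + L * np.sin(azimuth), y0 + L * np.cos(azimuth)

def tree_height(x0, y0, xs, ys, elevation):
    """Invert Eq. (11): recover the height (in pixels) from the
    measured crown-to-shadow offset and the sun elevation angle."""
    L = np.hypot(xs - x0, ys - y0)
    return L * np.tan(elevation)
```

Round-tripping a crown center through `shadow_center` and `tree_height` recovers the original height exactly, which is why the shadow offset suffices as a height observation.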

The Outline of Calculating Tree's Parameters.
In order to calculate the tree's parameters, the gradient descent method is employed iteratively, with a robust convergence speed. The procedure comprises the following steps.
Step 1. Calculate the gradient g₀ of the cost function (7) at the initialization parameters int₀.
Step 2. Calculate the norm of g₀, denoted ‖g₀‖; the initial step size is speed₀ = 5.
Step 3. Update the parameters along −g₀ with the current step size to obtain the new parameters int₁.
Step 4. Calculate the new gradient g₁ using int₁.
Step 5. Calculate the norm of g₁, denoted ‖g₁‖.
Step 6. Calculate the ratio of ‖g₀‖ and ‖g₁‖ to obtain rate.
Step 7. If rate > τ, newspeed = (rate/k) × oldspeed; else newspeed = oldspeed, where τ is a threshold and k is a constant with k > 1; in our experiment τ = 100. In our method, the initialization centroid (x₀, y₀) is the centroid of a superpixel belonging to a tree, and the number of iterations is 600.
This procedure repeats until the minimum of the cost function is reached, yielding the tree parameters (x*, y*, r*, h*).
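Steps 1–7 above can be sketched as a gradient descent loop with the adaptive step rule of Step 7; the default constants and the generic test gradient below are illustrative, not the paper's cost function.

```python
import numpy as np

def minimize(grad, p0, speed=5.0, thr=100.0, k=2.0, iters=600):
    """Gradient descent with the step-size rule of Steps 1-7: when the
    gradient-norm ratio exceeds thr, rescale the step by rate / k.
    thr = 100 follows the text; k > 1 is an assumed constant."""
    p = np.asarray(p0, float)
    g0 = grad(p)                          # Step 1: initial gradient
    n0 = np.linalg.norm(g0)               # Step 2: its norm
    for _ in range(iters):
        p = p - speed * g0                # Step 3: update parameters
        g1 = grad(p)                      # Step 4: new gradient
        n1 = np.linalg.norm(g1)           # Step 5: its norm
        rate = n0 / max(n1, 1e-12)        # Step 6: ratio of the norms
        if rate > thr:                    # Step 7: adapt the step size
            speed = (rate / k) * speed
        g0, n0 = g1, n1
    return p
```

On a simple quadratic cost, for example, the loop drives the parameter vector toward the minimizer.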
Our method was implemented using Microsoft Visual Studio 2010 and Matlab 2010a, integrating C# and Matlab code. The proposed algorithm was tested on many large high-resolution urban images of Phoenix, Arizona, USA.

Results
In our experiment, the method is evaluated in three steps: (1) the presegmentation result is compared with the results obtained by Turbopixels and Entropy Rate Superpixels (ERS) [34]; (2) the building and tree segmentation result is compared to the results obtained using Turbopixels and ERS; (3) the tree parameter estimation result is evaluated by the root mean square error (RMSE). Figure 2 is an original image that includes trees, grass, buildings, and other objects. The zoomed-in area of Figure 2 shows that the tree color is similar to the grass color; the textures of trees are coarse and salient, while the textures of grass are smooth. Moreover, the shapes of buildings and trees are different. Figure 6 shows the presegmentation result. Comparing Figures 6(a), 6(b), and 6(c), we can see that the superpixels in Figure 6(a) were refined by our method. It is noted that the number of superpixels is 8200 before the aerial images are presegmented using Turbopixels. Figure 6(a) shows the result obtained by (3); the original boundaries are marked in red, and the boundaries obtained by (3) are marked in green. Figure 7 shows four segmentation results: the first is based on our method, the second on Turbopixels, the third on ERS, and the last is the ground truth segmentation. Subjectively comparing these results, we find that the segmentation obtained by our method is the best. Because the features are trained on superpixels, the computation of our method is less than that of ERS: the computational complexity of the superpixel-level segmentation is O(C × N / 2¹⁰), whereas the computational complexity at the pixel level is O(N), where C is the image complexity and N is the number of pixels. Therefore, the running time of our method is reduced considerably, which is very important in large-scale urban aerial image processing.
The average running time of our method is 45% less than the average running time of ERS.

Segmentation Result.
Furthermore, two standard metrics, undersegmentation error [35] and boundary recall [36], were used to evaluate the quality of the superpixels. Undersegmentation error (UE) measures the fraction of pixels that leak across ground truth boundaries, as shown by (12); the comparison results of UE are shown in Figure 8(a):

UE = (1/N) [ ∑_i ( ∑_{s_j ∩ t_i ≠ ∅} |s_j| ) − N ], (12)

where T = {t₁, t₂, ..., t_M} represents the ground truth segmentation, |·| denotes the segment size, s_j is the j-th superpixel, and N is the number of pixels. Boundary recall (BR) measures the percentage of the natural boundaries recovered by the superpixel boundaries, as represented by (13); the comparison results of BR are shown in Figure 8(b):

BR = (1/|B_T|) ∑_{p ∈ B_T} 1[min_{q ∈ B_S} ‖p − q‖ < ε], (13)

where B_T and B_S denote the union sets of ground truth boundaries and superpixel boundaries, respectively, and the indicator function 1[·] checks whether the nearest superpixel boundary pixel is within distance ε. In our experiments we set ε = 2. These performance metrics are plotted against the number of superpixels in an image.
According to Figure 8, the proposed algorithm performs significantly better than the state of the art on all metrics when the number of superpixels is calculated by the proposed function.
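The two metrics (12) and (13) can be computed as in the following sketch, a brute-force implementation intended for small label maps:

```python
import numpy as np

def undersegmentation_error(labels, gt):
    """Eq. (12): total area of superpixels that overlap each ground
    truth segment, minus the image area, normalized by image size."""
    N = labels.size
    leak = 0
    for g in np.unique(gt):
        region = gt == g
        for s in np.unique(labels[region]):
            leak += np.sum(labels == s)   # whole superpixel is counted
    return (leak - N) / N

def boundary_recall(sp_boundary, gt_boundary, eps=2):
    """Eq. (13): fraction of ground truth boundary pixels with a
    superpixel boundary pixel within eps pixels."""
    ys, xs = np.nonzero(sp_boundary)
    hit, total = 0, 0
    for y, x in zip(*np.nonzero(gt_boundary)):
        total += 1
        if len(ys) and np.min(np.hypot(ys - y, xs - x)) <= eps:
            hit += 1
    return hit / max(total, 1)
```

A segmentation identical to the ground truth yields UE = 0 and BR = 1, the best values of both metrics.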

Building and Tree Classification Result.
To justify the use of the Naive Bayes classifier trained by the proposed algorithm, we also tried the popular intersection kernel and Chi-square kernel on our feature vector of color, FAST corner, and Gabor features for comparison. Figure 9 shows the building and tree classification results with zoomed-in regions.

Tree Parameter Estimation Result.
In our method, we calculated the tree parameters one by one. Each tree with a shadow was extracted from every segmentation image, such as Figure 7, and then matched with the tree model presented in this paper. The estimation result is shown in the panel of the GUI, where the tree is marked with a red circle, as shown in Figure 10. After calculating the parameters, all of the trees are covered by a round model. RMSE is introduced to evaluate the estimation results:

RMSE = √( (1/n) ∑_{i=1}^{n} (p_i − p_i^true)² ), (14)

where n is the number of trees, p_i is the estimated parameter, and p_i^true is the manually measured ground truth; p_i and p_i^true describe the same tree in the image. The RMSE of x, y, r, and h is 1.142 pixels, 1.381 pixels, 1.993 pixels, and 5.992 pixels, respectively. We manually measured the parameters for 270 trees chosen from Figure 9(a). The height of a tree cannot be measured directly because the image is 2D; however, the parameters of the tree's shadow can be measured manually, so the height can be calculated by (11) and treated as ground truth. These ground truth parameters are then compared with the parameters obtained using the proposed cost function (7). The result is shown in Figure 11.
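The RMSE of (14) is straightforward to compute:

```python
import numpy as np

def rmse(estimated, ground_truth):
    """Eq. (14): root mean square error between estimated tree
    parameters and manually measured ground truth (same tree order)."""
    e = np.asarray(estimated, float)
    t = np.asarray(ground_truth, float)
    return float(np.sqrt(np.mean((e - t) ** 2)))
```

One call per parameter (x, y, r, h) over all trees reproduces the per-parameter figures reported above.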

Discussion
In this paper, we proposed a novel method for building and tree segmentation from large-scale urban aerial images, together with a new approach for estimating tree parameters. In Section 4.2, we compared our segmentation method with the method proposed by Levinshtein et al. [14]. Figure 6 shows that our superpixels follow the edges of objects, while the method of Levinshtein and his colleagues merges several objects into one superpixel. We also compared our segmentation results with the pixel-level segmentation approach using ERS. In Figures 9(a) and 9(c), the results indicate that our method not only detects smaller building objects but also produces fewer false detections of building objects on the street. In this paper, the superpixel code is programmed in Matlab, so the time complexity is somewhat high; however, we are porting this algorithm to C# to reduce the run time and ease its use in commercial settings.
In Section 4.4, we presented a novel approach for calculating tree parameters, such as locations, radii, and heights, from a single aerial image without additional information. Our method needs less data than existing methods [9, 37, 38], which require additional information from LiDAR data or other databases. Based on our segmentation method and high-resolution aerial photographs, the tree parameters can be estimated with high accuracy because (a) the edges of neighboring trees are detected more precisely, since our segmentation method separates adjacent trees into different superpixels; (b) the edges of tree shadows are clearer and easier to segment in high-resolution aerial images; and (c) the time information needed to calculate heights from tree shadows can be obtained from the EXIF metadata of the aerial photographs. However, our approach has two restrictions. First, the color of a building roof cannot be the same as the color of the ground. Second, trees cannot be too close to each other; otherwise, nearby trees cannot be separated and are treated as a single tree.
In Section 1(A), we reviewed many existing segmentation algorithms for aerial urban image processing that use the pixel level as the underlying representation. However, the pixel level is not a natural representation of visual scenes: a single pixel cannot capture the geometric structure of an object, whereas a superpixel is a perceptually meaningful and consistent unit [23]. The comparison of Figures 9(a), 9(b), and 9(c) shows that the pixel-level segmentation misses some building edges, while the superpixel-level segmentation preserves them. Furthermore, a superpixel image has a low computational cost and reduces the complexity of high-resolution aerial images. Therefore, many researchers have adopted superpixels for image segmentation. For example, Saxena et al. [39] used superpixels to segment objects when modeling 3D scene structure from a single stationary image. Kluckner et al. [40] used superpixels to segment large-scale aerial images and refined the result with a conditional random field (CRF). The segmentation approach presented here goes one step further than existing methods by using additional information between the superpixels, so that most of the missing superpixel edges are recovered.
Tree models are widely used to detect trees. Morsdorf et al. [41] used the local maxima of a canopy height model (CHM) to calculate tree height from airborne laser scanning raw data. Vastaranta et al. [42] detected individual trees using a raster CHM created from normalized data. Both methods require LiDAR data for tree detection. In contrast, our proposed method detects individual trees from the aerial image alone, without any additional data. The experiments show that the parameters of the trees can be calculated from a single high-resolution aerial image. To calculate these parameters, we use the shadows, the tree crowns, and the time when the aerial image was taken as input, and estimate the height of a tree based on (11).
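The geometric core of the shadow-based height estimate can be sketched as follows. The exact form of (11) is given in the paper; this sketch assumes flat ground and a sun elevation angle already derived from the image timestamp and location (the function name and values are ours, for illustration only).

```python
import math

def tree_height_from_shadow(shadow_len, sun_elevation_deg):
    """Estimate tree height from its shadow length and the sun's
    elevation angle (derivable from the image timestamp and location).
    Assumes flat ground: h = L * tan(elevation)."""
    return shadow_len * math.tan(math.radians(sun_elevation_deg))

# e.g. a 6 m shadow at 45 degrees sun elevation implies a ~6 m tree
print(tree_height_from_shadow(6.0, 45.0))
```

Shadow length is here expressed in the same unit as the returned height (pixels or meters, given the image ground sampling distance).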
Finally, the experiments show that the locations and radii of the trees estimated by our approach are close to those obtained by manual picking. However, some tree heights are higher or lower than the ground truth, as shown in Figure 11(d). This is probably because the shapes of those trees are irregular. This problem could be solved with a tree model more complex than the one proposed in this paper.

Conclusion
In this paper, we proposed a building and tree detection algorithm using improved superpixels for large high-resolution urban images, and we also proposed a method to calculate tree parameters based on a cost function and shadows. One function automatically determines the number of superpixels, and another refines the boundaries of the superpixels. We provided a new tree model with a cost function that can be minimized using gradient descent to identify the optimal properties of each individual tree. We evaluated our method on many aerial images and compared it with two other state-of-the-art methods. The experiments showed that our method is fast and robust while remaining simple and efficient, and they also indicate that the shadow is a good feature for estimating tree height. The results of our method can be used to generate 3D urban models. The main goal of this paper is fast and accurate segmentation and classification rather than the learning process itself. In future work, we will examine different supervised learning algorithms to improve the effectiveness of the proposed algorithm, and we will compare our method with other supervised algorithms, including methods that integrate additional data sources into our solution.
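To illustrate the idea of fitting a circular tree model by gradient descent, the sketch below minimizes a generic circle-fit cost, $\sum_i (\|p_i - c\| - r)^2$, over crown boundary points. This is not the paper's exact cost function (7); it is a simplified stand-in with hypothetical names, showing only the optimization pattern.

```python
import numpy as np

def fit_circle_gd(points, steps=500, lr=0.1):
    """Gradient-descent fit of a circle (center c, radius r) to crown
    boundary points, minimizing sum((||p - c|| - r)^2).
    Generic sketch, not the exact cost function of the paper."""
    pts = np.asarray(points, dtype=float)
    c = pts.mean(axis=0)                          # initial center: centroid
    r = np.mean(np.linalg.norm(pts - c, axis=1))  # initial radius: mean distance
    for _ in range(steps):
        d = np.linalg.norm(pts - c, axis=1)
        d = np.maximum(d, 1e-9)                   # avoid division by zero
        res = d - r                               # signed radial residuals
        grad_c = -2.0 * np.sum((res / d)[:, None] * (pts - c), axis=0)
        grad_r = -2.0 * np.sum(res)
        c -= lr * grad_c / len(pts)
        r -= lr * grad_r / len(pts)
    return c, r

# Hypothetical usage: four boundary points on a unit circle around (3, 4)
center, radius = fit_circle_gd([(4.0, 4.0), (2.0, 4.0), (3.0, 5.0), (3.0, 3.0)])
```

In the paper's setting, the boundary points would come from the refined superpixel edges of a segmented tree crown, and the converged center and radius give the tree's location and radius.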
The parameter values used in the rule sets are sample values chosen specifically for our study area and may vary for other locations of interest, but similar principles and procedures can be applied to other areas. Using additional ancillary data, such as 1-meter-resolution LiDAR (Light Detection and Ranging), may further help generate more accurate land-cover maps, which is part of our planned future work.