Estimating the Aboveground Biomass of an Evergreen Broadleaf Forest in Xuan Lien Nature Reserve, Thanh Hoa, Vietnam, Using SPOT-6 Data and the Random Forest Algorithm

Forest biomass is an important ecological indicator for the sustainable management of forests. The aim of this study was to estimate forest aboveground biomass (AGB) by integrating SPOT-6 data with field-based measurements using the random forest (RF) algorithm. In total, 52 remote sensing variables, including spectral bands, vegetation indices, topography data, and textures, were extracted from SPOT-6 images to predict the forest AGB of Xuan Lien Nature Reserve, Vietnam. To determine the optimal predictor variables for AGB estimation, 10 different RF models were built. To evaluate these models, 10-fold cross-validation was applied. We found that a combination of spectral bands, vegetation indices, and topography variables offers the best prediction results (R²adj = 0.74 and RMSE = 61.24 Mg ha⁻¹). Adding texture features to the predictor variables did not improve the model performance. In addition, the SPOT-6 sensor has the potential to predict forest AGB using the RF algorithm.


Introduction
The forest ecosystem is one of the primary sources of carbon storage in the terrestrial ecosystem and constitutes approximately 80% of all living terrestrial biomass [1]. With a massive carbon pool, the forest ecosystem plays an important role in reducing global warming [2,3]. Most activities related to forest biomass assessments focus on the aboveground biomass (AGB) of living trees because AGB represents the largest share of total biomass in forests. The accurate assessment and evaluation of forest AGB stores and their spatiotemporal patterns are important for the sustainable management of forests [4,5]. Estimating AGB is one of the most important steps in measuring and evaluating the carbon stocks and carbon sequestration of forests [6].
In general, field measurements (including destructive sampling or using allometric equations/conversion factors) and remote sensing (RS) are the main methods used to estimate forest AGB [7]. The traditional method based on field measurements is the most accurate, but it is difficult to apply over large areas because it is expensive, labor-intensive, time-consuming, and harmful to nature at a large scale [8-10]. Compared to the traditional approach, the RS technique has the advantage of obtaining effective and repeatable vegetation information over large areas, especially for remote regions [11]. The forest AGB can be estimated from different RS sensor types, including synthetic aperture radars [11-13], light detection and ranging (LiDAR), and optical sensors. The radar and LiDAR datasets have the advantage of penetrating the forest canopy to obtain more information about the structures beneath it, such as trunks and branches, which contain more than 60% of the AGB [11]; this information helps achieve a higher accuracy. The limitations of these types of datasets are their high costs and the large data volumes required to capture information over large-scale areas [14-17]. Besides radar and LiDAR, very high-resolution (VHR) optical images, such as IKONOS, WorldView-2, GeoEye-1, and SPOT-6 or SPOT-7, also allow one to estimate AGB with acceptable accuracy by using the empirical relationships between the AGB and RS spectral bands, vegetation indices (VIs), and texture and topographic information. For example, Motlagh et al. [18], Hirata et al. [19], Hussin et al. [20], Karna et al. [21], Li et al. [22], and Gara et al. [23] successfully predicted forest biomass based on those LiDAR and VHR sensors combined with sufficient field data.
Regardless of the RS data source, no RS technique is capable of providing a direct measurement of biomass. As a result, biomass prediction accuracy increases when RS data are combined with field-sampled data, especially when using machine learning approaches to build biomass models [4,24-26]. Machine learning algorithms allow one to analyze a large number of predictor variables from remote sensing data, thereby filling in the missing data and reducing the error of the prediction models [27-29]. A wide variety of machine learning algorithms have been employed to estimate AGB, including artificial neural networks (ANN), K-nearest neighbor (KNN), support vector machines (SVM), and random forest (RF). In recent years, RF has been widely used to develop predictive models for AGB at local, regional, and global scales because it can run efficiently on large datasets with a high accuracy. Furthermore, RF has the ability to determine the importance of variables [30,31].
Evergreen broadleaf (EB) forests are estimated to cover more than 57% of national forests [32-35] and harbour approximately 44% of the total forest carbon stock in Vietnam [35]. This forest type plays a key role in ecosystem carbon sequestration in Vietnam. However, in Vietnam, there have only been a few studies on forest carbon estimation with the integration of RS techniques, especially using VHR sensors. For instance, Dang et al. [36] used Sentinel-2 satellite images, combined with field-measured data, to estimate the AGB in Yok Don National Park. Pham and Brabyn [8] predicted the AGB of mangrove forests in Can Gio with an accuracy of 73% by integrating spectral information, vegetation type, texture features, and vegetation indices from SPOT-4 and SPOT-5 images.
The main objectives of this study are (i) to test the ability of the spectral bands, vegetation indices, and topographic and texture features derived from SPOT-6 images to predict AGB in combination with field data using the RF algorithm and (ii) to identify the most desirable predictors for AGB estimation.

Study Site.
This study was conducted at Xuan Lien Nature Reserve, Thanh Hoa, Vietnam, located at 19°52′-20°02′N, 104°58′-105°15′E, which covers 23,404 ha of two forest types in the southwest of Thanh Hoa province. This reserve is bordered by the Cao River in the north, Nghe An province in the south and west, and the Ta Leo and Bu Khong mountains and the confluence of the Cao and Chu Rivers to the east (Figure 1). The study area is situated in a belt of mountains from Sam Neua in Laos to the Thuong Xuan and Nhu Xuan districts in Thanh Hoa province, which contains some high peaks (e.g., Ta Leo (1400 m), Bu Cho (1563 m), Bu Hon Han (1208 m), and an unnamed 1605 m peak). The mean temperature is about 23-24°C, and the mean annual rainfall is approximately 1700-1900 mm, which occurs mainly from May to October and accounts for 90% of the total annual rainfall [37]. The main soil in the nature reserve is feralite soil: feralite humus soil in the medium-high mountains (FH), feralite soil in the lowlands (F), and alluvial soil (P) associated with streams or rivers and the valley bottom [37].
The vegetation in the study area was mainly closed evergreen broadleaf forest, which was classified into three forest types [38] based on the classification of Thai Van Trung [39]. The first forest type is distributed from medium to high montane elevations and consists of mixed coniferous and broadleaf evergreen forests (MCBEV) between 800 m and 1605 m a.s.l. This forest type is generally undisturbed and dominated by upper-storey broadleaf tree species from the families Fagaceae, Lauraceae, Euphorbiaceae, Fabaceae, Magnoliaceae, Dipterocarpaceae, and Sapotaceae [37,38]. The second forest type is low montane broadleaf evergreen forest (BEV), which is distributed below 800 m a.s.l. and has been weakly impacted by human activities. Common families include Leguminosae, Euphorbiaceae, Lauraceae, Rutaceae, Rosaceae, and Meliaceae [37,38]. The final forest type is secondary forest (SF), which is mainly a mix of Neohouzeaua dullooa, Dendrocalamus patellaris, Bambusa sp., and broadleaf evergreen forest [37,38].
These plots were randomly generated in ArcGIS 10.4 and then located in the field using a GPS device with errors of up to 5 m. Within the plots, the diameter at breast height (dbh) and the total height (h) of each living tree with a dbh greater than 5 cm were measured using a diameter tape and a Vertex Hypsometer, respectively. Tree species were also recorded for each measured tree.

Aboveground Biomass Estimation.
We considered only the aboveground living tree biomass for carbon estimations. Aboveground biomass (AGB) was estimated as the sum of the individual components (stumps, stems, bark, branches, seeds, and foliage) of the individual living trees that were predicted using appropriate allometric equations [6]. These allometric equations were carefully chosen depending on the forest types and the tree or bamboo species available in the input dataset. For the evergreen broadleaf forests, we used the biomass equation developed by Huy et al. [40], which was specifically developed for evergreen broadleaf forests in the North Central region of Vietnam (1). For bamboo forests, we opted for the equation from Vu et al. [41], which was developed for bamboo forests at a national scale (2). For mixed forests of bamboo and evergreen broadleaf forests, both (1) and (2) were used to estimate the total biomass. All of the selected equations above are based on the tree/bamboo diameters (dbh) and total heights (h).

International Journal of Forestry Research

Finally, to match the estimated AGB of each sample plot to the remotely sensed data, the AGB values were scaled to obtain per-hectare values.
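The per-hectare scaling described above is simple unit arithmetic, and a minimal sketch may make it concrete. The function name, the tree biomass values, and the 900 m² plot area are all hypothetical; the plot size used in the field survey is not stated in this section.

```python
def agb_per_hectare(tree_agb_kg, plot_area_m2):
    """Scale the summed tree biomass of one plot to Mg per hectare."""
    if plot_area_m2 <= 0:
        raise ValueError("plot area must be positive")
    total_mg = sum(tree_agb_kg) / 1000.0     # kg -> Mg (tonnes)
    plot_area_ha = plot_area_m2 / 10_000.0   # m^2 -> ha
    return total_mg / plot_area_ha


# Hypothetical plot: three trees measured on a 900 m^2 plot (assumed area).
example_agb = agb_per_hectare([450.0, 1200.0, 150.0], plot_area_m2=900.0)
print(round(example_agb, 2))
```

The input values would come from the allometric equations (1) and (2), which are not reproduced here.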

Remotely Sensed (RS) Data.
SPOT-6 data were chosen as the RS data for this study because of their availability. SPOT-6 is an optical satellite that was developed by Astrium with the capacity to obtain panchromatic and multispectral imagery at spatial resolutions of 1.5 m and 6 m, respectively [42]. Two orthorectified scenes of SPOT-6 images taken on 20 May and 05 December 2013 were obtained for this research. Both image scenes consist of four multispectral bands (blue: 450-520 nm, green: 530-590 nm, red: 625-695 nm, and near-infrared (NIR): 760-890 nm), each with a 6 m spatial resolution, and one panchromatic band (450-745 nm) with a 1.5 m spatial resolution [42]. The digital numbers (DN) of the SPOT-6 images were first converted to radiance and then to reflectance using atmospheric correction in ENVI 5.4. We applied the FLAASH (Fast Line-of-Sight Atmospheric Analysis of Spectral Hypercube) radiative transfer model to correct the atmospheric interference in each image [43]. A digital elevation model (DEM) with a 6 m spatial resolution was first created from a topographic map with 5 m contour lines [44] using the "Topo to Raster" interpolation method in ArcGIS 10.4. The topographic data (elevation, slope, and aspect) were then generated from this 6 m DEM.
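To illustrate the DN-to-radiance-to-reflectance chain, the sketch below computes top-of-atmosphere (TOA) reflectance only. It is not the FLAASH correction used in this study, which additionally models atmospheric scattering and absorption. The calibration convention (radiance = DN / gain + bias), the function names, and every numeric value are assumptions for illustration; real gains, biases, and solar irradiances would come from the image metadata.

```python
import math


def dn_to_radiance(dn, gain, bias):
    """Convert a digital number to at-sensor radiance.

    Assumes the convention radiance = DN / gain + bias.
    """
    return dn / gain + bias


def radiance_to_toa_reflectance(radiance, esun, sun_elevation_deg,
                                earth_sun_dist_au=1.0):
    """Convert radiance to top-of-atmosphere reflectance (no atmospheric model)."""
    zenith = math.radians(90.0 - sun_elevation_deg)  # solar zenith angle
    return (math.pi * radiance * earth_sun_dist_au ** 2
            / (esun * math.cos(zenith)))


# Hypothetical calibration values, not taken from any SPOT-6 metadata file.
radiance = dn_to_radiance(dn=850, gain=10.0, bias=0.0)
reflectance = radiance_to_toa_reflectance(radiance, esun=1000.0,
                                          sun_elevation_deg=60.0)
print(round(reflectance, 4))
```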

Variables for AGB Prediction.
To explore the effectiveness of the SPOT-6 sensor for estimating forest AGB, different types of RS features were considered. These features included raw spectral bands, topographic data, vegetation indices (VIs), and texture (Table 1). Based on the coordinates, size, and shape of each sample plot, we created a polygon shapefile using the "rectangles, ovals, and diamonds" plugin in QGIS 1.8.0 [52], which we then overlaid onto the RS data. The values of all pixels within each polygon plot were derived for the four different spectral bands and then averaged for each plot. The extracted values were then used to calculate the 9 VIs. We used the vegetation indices most often used in remote sensing-based studies on forest biomass and its properties [4,45,53,54]: NDVI (normalized difference vegetation index), RDVI (renormalized difference vegetation index), RVI (ratio vegetation index), DVI (difference vegetation index), MSR (modified simple ratio), and EVI (enhanced vegetation index). Since some locations in the study area have low vegetation cover (Figure 1), we additionally used SAVI (soil-adjusted vegetation index), OSAVI (optimized soil-adjusted vegetation index), and GEMI (global environment monitoring index) to minimize the effect of soil background reflectance [47]. The topographical conditions, including elevation, slope, and aspect, were also considered as factors affecting the forest's structure, composition, and distribution [55-57].
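As a sketch of how the nine VIs named above can be derived from plot-averaged reflectance, the function below uses the commonly published definitions of each index. These may differ in detail from the formulas in Table 1, which is not reproduced here; the function name, the soil factor default of 0.5 for SAVI, and the sample reflectance values are assumptions.

```python
import math


def vegetation_indices(blue, red, nir, soil_l=0.5):
    """Compute nine vegetation indices from surface reflectance (0-1 scale)."""
    sr = nir / red  # simple ratio, reused by RVI and MSR
    eta = ((2.0 * (nir ** 2 - red ** 2) + 1.5 * nir + 0.5 * red)
           / (nir + red + 0.5))  # intermediate term of GEMI
    return {
        "NDVI": (nir - red) / (nir + red),
        "RDVI": (nir - red) / math.sqrt(nir + red),
        "RVI": sr,
        "DVI": nir - red,
        "MSR": (sr - 1.0) / (math.sqrt(sr) + 1.0),
        "EVI": 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0),
        "SAVI": (1.0 + soil_l) * (nir - red) / (nir + red + soil_l),
        "OSAVI": (nir - red) / (nir + red + 0.16),
        "GEMI": eta * (1.0 - 0.25 * eta) - (red - 0.125) / (1.0 - red),
    }


# Hypothetical plot-averaged reflectances for a densely vegetated plot.
plot_vis = vegetation_indices(blue=0.03, red=0.05, nir=0.45)
for name, value in plot_vis.items():
    print(f"{name}: {value:.3f}")
```

Only the blue, red, and NIR bands are needed for these nine indices; the green band enters the models as a raw spectral predictor and through its texture features.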
The texture feature calculations were carried out using PCI Geomatica 2013. These calculations were performed on all images using a 5 × 5 window of 6 m pixels (30 m × 30 m, i.e., 900 m²) [50]. For each spectral band, eight texture parameters, as per Haralick et al. [51], were calculated. In total, 52 independent variables were used.
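To show what one such texture parameter measures, the sketch below computes the GLCM mean of a single window. It assumes pixel values already quantized to integer grey levels, a single horizontal offset of one pixel, and symmetric pair counting; the settings actually used in PCI Geomatica are not given in this section, and the function name and sample window are hypothetical.

```python
def glcm_mean(window, levels):
    """GLCM mean of a 2-D window of integer grey levels in [0, levels).

    Pairs are taken between horizontally adjacent pixels and counted
    symmetrically, then normalized to probabilities P(i, j).
    The mean is the sum over i and j of i * P(i, j).
    """
    counts = [[0] * levels for _ in range(levels)]
    for row in window:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            counts[right][left] += 1  # symmetric counting
    total = sum(sum(r) for r in counts)
    weighted = sum(i * counts[i][j]
                   for i in range(levels) for j in range(levels))
    return weighted / total


# Window footprint implied by the text: 5 pixels x 6 m = 30 m per side.
window_side_m = 5 * 6
window_area_m2 = window_side_m ** 2

# Hypothetical 5 x 5 window quantized to 4 grey levels.
sample_window = [
    [0, 1, 1, 2, 3],
    [0, 1, 2, 2, 3],
    [1, 1, 2, 3, 3],
    [0, 0, 1, 2, 2],
    [1, 2, 2, 3, 3],
]
print(window_area_m2, round(glcm_mean(sample_window, levels=4), 3))
```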

Correlation between the AGB and RS Data.
The analysis of the relationship between the AGB and RS data was carried out using the RF algorithm as implemented in the randomForest package in R [58]. RF is an ensemble machine learning algorithm that has been widely used in biomass modeling, with the advantages of being able to handle a large number of input variables and identify the most significant variables, as well as to reduce or even overcome the overfitting problem and thereby improve model accuracy [8,59,60]. The RF algorithm was first developed by Breiman [30]. This ensemble learning method generates many decision trees, each grown from a bootstrap sample of the original data that serves as the training dataset for that tree. The features considered at each node of the decision trees are also randomly selected.
The results are then obtained by averaging the predictions from all decision trees. To estimate the model errors, the samples left out of each bootstrap sample (called out-of-bag data or OOB data) are used as validation samples. These OOB data are not only used to calculate prediction errors by comparing the predictions from the training dataset with the OOB data but are also used to measure the importance of the variables [30]. In RF modeling, there are two important training parameters that need specification: ntree is the number of trees to grow in the forest, and mtry is the number of randomly selected variables tried at each node of a tree. A good RF model, which is built from suitable values of ntree and mtry, will have a low root mean square error (RMSE). To find a suitable ntree value, different ntree values varying from 50 to 1000 with an interval of 50 were tested. The final ntree value was selected based on the stability of the RMSE (see Figure 2). To identify the optimal mtry values, we used the tuneRF function in the randomForest package.
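The bootstrap and out-of-bag mechanism, together with the ntree grid described above, can be sketched in a few lines. This is an illustration in plain Python rather than the R randomForest code used in the study; the function name, the sample count of 1000, and the random seed are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample and return (in-bag indices, OOB indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in chosen]
    return in_bag, oob


# Candidate ntree values: 50 to 1000 in steps of 50, as described in the text.
ntree_grid = list(range(50, 1001, 50))

rng = random.Random(42)                 # fixed seed, for repeatability only
in_bag, oob = bootstrap_split(1000, rng)
oob_fraction = len(oob) / 1000          # close to 1/e (about 0.37) for large n

print(len(ntree_grid), round(oob_fraction, 3))
```

Because roughly a third of the samples are left out of each tree, every sample is out-of-bag for many trees, which is what lets RF estimate prediction error without a separate hold-out set.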
To evaluate the importance of each variable, RF defines two measures, which are computed from the OOB data. The first measure is the percent increase in the mean square error (%IncMSE), which is obtained for each tree by comparing the OOB prediction error before and after the values of a variable are randomly permuted [31]. Higher %IncMSE values indicate a more important predictor. The second measure is the total decrease in node impurity (IncNodePurity) from splitting on the variable, measured by the residual sum of squares and averaged over all trees [31]. Higher IncNodePurity values indicate a more important variable. According to Strobl et al. [61], the IncNodePurity method is biased and not recommended for use. Therefore, in this study, we only use the %IncMSE measure to identify the importance of variables.
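The idea behind the permutation-based measure can be sketched as follows. This is a simplified illustration, not the randomForest implementation: the package applies the permutation to each tree's OOB data and normalizes the averaged differences, whereas this sketch permutes one column once for a fixed stand-in model and returns the raw increase in MSE. The function names, the stand-in model, and the synthetic data are all hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a prediction function over a dataset."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_increase_mse(model, rows, targets, column, rng):
    """Increase in MSE after randomly permuting one predictor column."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:]
                for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline


# Synthetic data: column 0 mimics an informative predictor (e.g. elevation),
# column 1 is pure noise. The stand-in model uses only column 0.
rng = random.Random(0)
rows = [[rng.uniform(100.0, 1600.0), rng.random()] for _ in range(50)]
targets = [0.2 * r[0] + rng.gauss(0.0, 5.0) for r in rows]


def stand_in_model(row):
    return 0.2 * row[0]


inc_informative = permutation_increase_mse(stand_in_model, rows, targets, 0, rng)
inc_noise = permutation_increase_mse(stand_in_model, rows, targets, 1, rng)
print(inc_informative > inc_noise)
```

Permuting a predictor the model relies on breaks its link to the response and inflates the error, while permuting an unused predictor leaves the error unchanged, which is why larger increases indicate more important variables.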
Overall, 10 RF models were built to determine the most desirable predictor for forest AGB estimation (Table 2).

Model Validation.
For validation, the original data were randomly divided into two separate parts: a training dataset (70%) and a testing dataset (30%). Each RF model's performance was validated through a 10-fold cross-validation. The validation measures include the adjusted coefficient of determination (R²adj) and the root mean square error (RMSE).

Tree AGB Estimation from Field Data.
Table 3 and Figure 3 show the results of tree AGB calculations for each forest type at the plot level from field data measurements. The results show that the forest AGB ranges from 18.32 Mg ha⁻¹ to 543.86 Mg ha⁻¹. The average AGB estimated for Xuan Lien Nature Reserve was 158.23 Mg ha⁻¹ for the four forest types. The MCBEV forests had the highest AGB followed by the BEV and SF forests. Secondary forests had the lowest AGB and were mostly mixed bamboo and evergreen forests or developed on abandoned agricultural land.
In total, 189 species from 55 families were recorded in the field. The five most dominant species were Castanopsis indica, Engelhardia roxburghiana, Ormosia sp., Fokienia hodginsii, and Archidendron balansae.

Variable Importance and Variable Selection for the Final RF Models.
Because models RF2, RF3, RF4, RF5, and RF6 are combinations of spectral features, vegetation indices, topographic data, and texture features, only models RF1 (all variables), RF7 (spectral variables), RF8 (vegetation indices), RF9 (topographic variables), and RF10 (texture variables) were used to investigate the importance of the predictor variables. Each RF model was run 100 times to determine the variation of each variable's importance. Among all the variables (model RF1), the 10 most important variables are elevation, spectral band 3 (red), the GLCM mean of the green and blue bands, RVI, MSR, NDVI, OSAVI, RDVI, and SAVI (Figure 4, RF1). When using spectral features as predictors, the most important band is band 3 (red) followed by band 4 (NIR), band 2 (green), and band 1 (blue) (Figure 4, RF7). Among the nine VIs used for AGB estimation, there is no large difference in the %IncMSE values (Figure 4, RF8), of which DVI and EVI have the highest values. If we use only topographic data to predict AGB, elevation has the largest influence followed by aspect and slope (Figure 4, RF9). For texture features, the result from model RF10 reveals that the texture means of the green and blue bands are the two most important variables for AGB estimation (Figure 4, RF10).
To select variables for the final models, knowing the importance of each variable is not enough. Each model must have an optimal number of variables, which will improve the model's accuracy. To obtain the optimal variables, we used a 10-fold cross-validation method (repeated 100 times). The optimal variables for the 10 models are shown in Figure 5.
Figure 6 shows the results of the ten models, which are consistent with the processing done in Section 3.2. Model RF6 shows the lowest accuracy followed by models RF10, RF8, and RF7. Models RF4 and RF9 have similar accuracy and present a slightly lower result than model RF2. Model RF1 presents the highest result but requires 52 variables and resulted in overfitting. Although models RF3 and RF5 have a lower result than model RF1, they require only 7 and 16 variables, respectively, and do not show over- or underfitting in their results.
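The two validation measures used to compare these models, and the fold assignment behind a 10-fold cross-validation, can be sketched as below. The exact way R² was computed in this study is not stated, so this uses the common 1 - SSres/SStot definition with the usual adjustment for the number of predictors; the function names, the shuffling seed, and the sample values are hypothetical.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def adjusted_r2(observed, predicted, n_predictors):
    """Adjusted R^2, penalizing R^2 for the number of predictors used."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Hypothetical observed and predicted plot AGB values (Mg per hectare).
observed = [120.0, 95.0, 210.0, 180.0, 60.0, 150.0]
predicted = [110.0, 100.0, 190.0, 185.0, 75.0, 140.0]
print(round(rmse(observed, predicted), 2),
      round(adjusted_r2(observed, predicted, n_predictors=2), 3))
print([len(fold) for fold in k_fold_indices(25, 10)])
```

The adjustment term matters when comparing a 52-variable model such as RF1 with a 7-variable model such as RF3, since plain R² never decreases when predictors are added.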

Discussion
The main objective of this study was to test the possibility of using SPOT-6 images for estimating the AGB of evergreen broadleaf forests in Xuan Lien Nature Reserve using the random forest algorithm. Figure 4 (RF1) shows that elevation was the most important variable for predicting AGB. This is mainly because vegetation types strongly vary along the altitudinal gradient within the study area [38]. Similarly, other studies have shown that forest biomass has a significant relationship with vegetation types and elevation [...] red reflectance and the two textures derived from the green and blue bands. Among the four spectral bands, the red band has the strongest correlation with forest AGB (Figure 4, RF7) followed by NIR and the green and blue bands. A possible reason for this result is that red reflectance and NIR reflectance are more sensitive to vegetation characteristics (e.g., tree species or stem volume) than reflectance in the other visible bands [65]. The performance of the regression improved when we combined vegetation indices and/or topographical features with spectral band reflectance. Similar findings were presented by Pandit et al. [4] and Adam et al. [66], who stated that using VIs improves results because VIs diminish the influence of environmental conditions and shadow effects on reflectance.
In this study, we have shown that the AGB and different VIs have a significant correlation with each other. The most useful VI for predicting forest AGB was DVI followed by EVI, RVI, OSAVI, SAVI, MSR, GEMI, NDVI, and RDVI; however, the difference between these VIs was not very high. Figure 5 clearly shows that the texture features (models RF10, RF6, and RF2, those that included texture as a predictor) are less important than other features in AGB estimation. In other words, the accuracy of the model was not improved when using texture as an additional predictor for AGB estimation. This result is similar to that of Pham and Brabyn [8].

Conclusions
This study used the RF algorithm for modeling and predicting the forest AGB in Xuan Lien Nature Reserve, Vietnam, using VHR SPOT-6 data combined with field-based data. The results showed a significant statistical relationship between the AGB and the SPOT-6 data. The SPOT-6 data effectively predicted the AGB of the EB forest with R²adj = 0.74 and RMSE = 61.24 Mg ha⁻¹. The accuracy of AGB estimation was affected by many factors, among which elevation was indicated to be the most important for AGB models. The RF variable selection showed that using elevation, vegetation indices, and spectral reflectance could significantly improve biomass estimations in evergreen broadleaf forests. The RF algorithm is also suitable for estimating the AGB of evergreen broadleaf forests.
Based on the results of the study above, some future work should be considered. For example, when applying the model from this study to different types of forests in different ecoregions, the topography, spectral reflectance, and texture should be taken into consideration. Although the method introduced in this study is applicable to other forest ecosystems in Vietnam, forest type should be evaluated as one of the necessary predictors.
It is also possible to use other RS data sources or machine learning algorithms, which may give better-fitting estimations. Therefore, the next step of this study is to compare different machine learning techniques to predict forest AGB using two optical sensor types (SPOT-6 and Sentinel-2 MSI).
In this study, we found that elevation is one of the most important predictors for forest AGB estimation. Therefore, when designing a field survey for forest biomass estimation, the elevation should always be recorded.

Data Availability
The data that support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.