The Sloping Mire Soil-Landscape of Southern Ecuador: Influence of Predictor Resolution and Model Tuning on Random Forest Predictions

The sloping mire landscape of the investigation area, in the southern Andes of Ecuador, is dominated by stagnic soils with thick organic layers. The recursive partitioning algorithm Random Forest was used to predict the spatial water stagnation pattern and the thickness of the organic layer from terrain attributes. Terrain smoothing from 10 to 30m raster resolution was applied in order to obtain the best possible model. For the same purpose, several model tuning parameters were tested and a prepredictor selection with the R-package Boruta was applied. Model versions were evaluated and compared by 100 repetitions of the calculation of the residual mean square error of a five-fold cross-validation. Position specific density functions of the predicted soil parameters were then used to display prediction uncertainty. Prepredictor selection and tuning of the Random Forest algorithm in some cases resulted in an improved model performance. We therefore recommend testing prepredictor selection and tuning to make sure that the best possible model is chosen. This needs particular emphasis in complex tropical mountain soil-landscapes which provide a real challenge to any soil mapping approach but where Random Forest has proven to be successful due to the testing of model tuning and prepredictor selection.


Introduction
Tropical forests store significant amounts of organic carbon not only in their aboveground biomass but particularly in their soils.In tropical mountain forests which receive a high amount of precipitation, soil wetness is usually assumed to even increase organic layer thickness and, therefore, soil carbon stocks due to a lower nutrient availability and turnover rate [1,2].Furthermore, with the altitude decreasing temperatures were often reported to cause an additional increase of the thickness of the organic layer.Accordingly, Dieleman et al. [3] claim that montane tropical forests consistently contain larger amounts of soil organic carbon compared to tropical lowland forests.We will show in this study that the relation between climate and paludification (the formation of peatlands) in tropical mountain landscapes is still not completely understood and needs further investigation.
The soil landscape of the tropical mountain forest area in San Francisco, Ecuador, is dominated by soils with hydromorphic properties and thick organic layers [4,5].The soils are influenced by slope processes such as shallow slope parallel subsurface flow within the organic layer and the stagnic soil horizon [6,7].Rainfall and the geomorphology of the landscape have a strong influence on the genesis of stagnic soil properties [7] in determining how much water is accumulated.The more water reaches a particular soil compartment due to rainfall and shallow subsurface flow and the lesser the discharge capacities of the soil are due to its saturated hydraulic conductivity and its position in the landscape, the more likely it is that stagnic properties will develop.The term catena [8] refers to the relief determined pattern of soils on hillslopes.It is defined as a sequence of soils of about the same age derived from similar parent material and occurring under similar climatic conditions but having different characteristics due to variation in relief and drainage [9].
Being an important control factor of soil formation and water logging in particular, relief parameters calculated from a digital terrain model (DTM) are often used to regionalise soil properties.Walker et al. [10], Thompson et al. [11], and Chaplot et al. [12] predicted hydromorphic soil properties by landform parameters.Furthermore, the relation between terrain parameters and soil drainage classes was analysed by Troeh [13] and Peng et al. [14].Soil horizon thickness was regionalised by Moore et al. [15].Moreover, Park et al. [16] classified the soil morphological pattern based on soil landscape units.For a detailed overview regarding digital soil terrain modelling please refer to [17].However, so far only a few attempts have been made to apply digital soil mapping techniques in tropical mountain landscapes, due to their often high heterogeneity, difficult terrain, and low accessibility.Moreover, rainfall induced soil slides are an important factor within the development of this particular soil landscape.Accordingly, first attempts to predict the histic and stagnic soil layer [7] by classification and regression trees (CART) [5] resulted in very high prediction uncertainty.
The recursive partitioning algorithm Random Forest is known for its strong predictive force due to the prediction by multiple decision trees and the implemented algorithms to guarantee decision tree variability.However, for the random selection of a subset of predictors to develop each tree within Random Forest, each single tree is also more prone to lose its predictive force while weak predictors are included.This is a well-known problem concerning machine learning algorithms [18].Kursa and Rudnicki [19] provided the R-package Boruta to select the minimal predictor set and exclude predictors without predictive force to solve this so-called minimal-optimal problem.Hence, different predictor resolutions, Boruta pre selection of predictors, and two tuning parameters to optimize the Random Forest algorithm shall be tested in order to obtain the best possible spatial prediction of three soil parameters: (1) the occurrence probability of hydromorphic/stagnic properties in the soil, (2) the vertical extent of these properties, that is, the thickness of the stagnic soil layer, and (3) the thickness of the organic layer.In order to refer to the former two, the term water stagnation pattern will be used.Furthermore, we use the term stagnic layer and not stagnic horizon to differentiate from its usage within the World Reference Base of Soil resources [7], where a stagnic soil horizon needs a clearly defined minimum thickness.

Dataset.
The investigated soil-landscape comprises an area of c. 26 km 2 around the research station San Francisco (Figure 1).Soils were assessed by 56 soil profiles and 315 auger sampling points.To guarantee a representative dataset, sampling sites were selected according to a 24-terrain classes comprising sampling design [4].While soil profiles, due to limited time, had to be positioned close to the existing footpath network, auger sampling took place along transects laid from hilltops to valley bottoms (Figure 1).Transect positions also had to be selected due to accessibility of the often very steep terrain.

Prediction Parameters.
In earlier studies [4,5] the terrain parameters altitude, aspect, slope, terrain curvature, distance to the channel network (OFD), and specific upslope contributing catchment area, calculated by the kinematic routing algorithm (KRAA), were used as prediction parameters for organic layer and stagnic layer thickness [7].
According to Ließ et al. [4,21], Stagnosol [7] probability increases above an altitude of about 2150 m a.s.l. on slope angles <40 ∘ .Schrumpf et al. [22] also report an increase in hydromorphic properties with altitude.The increase with altitude can be attributed to the increasing rainfall [23].Lesser steep slope angles account for a slower discharge.Against all expectations, rather small specific upslope contributing catchment areas showed the highest Histosol [7] probability [21].
Apart from parameters being related to rainfall intensity like altitude or parameters describing relative water accumulation such as slope, OFD, and KRAA more predictors describing water accumulation are added in this study, that is, convergence index and Saga wetness index (SWI).We further assume that the soil's relative position on the slope is of high importance not only due to received rainfall and water accumulation potential but also regarding wind and incoming solar radiation, which are expected to have a stronger drying effect along the exposed mountain ridges and on the eastern slopes that are exposed to the main wind direction.To best describe these processes, the terrain parameters valley depth, normalised height, wind effect, terrain ruggedness index (TRI), and potential incoming solar radiation (PISR) were included as prediction parameters.All considered terrain parameters were The predictor altitude, which corresponds to air temperature [36] as well as a different forest composition and structure [37], is assumed to strongly influence the organic layer thickness due to a lower nutrient availability and turnover rate [1,2].The relative position on the slope, represented by valley depth and normalised height, is the second important factor determining forest structure and results in a denser more light consuming vegetation with taller trees in the deep side valleys compared to the ridge structures [37].Furthermore, Oesker et al. [38] report a different soil nutrient status for ridges and gorges within this area.
We, therefore, expect the complex interactions of (1) climatic factors such as rain and temperature represented by altitude, (2) factors representing water accumulation and discharge such as slope, KRAA, KRAS, SWI, and TRI, and (3) factors influencing the drying mechanisms such as wind effect, PISR, and relative slope position, represented by normalised height and valley depth to determine the spatial water stagnation pattern.The thickness of the organic layer is assumed to be determined by (1) the forest structure, (2) the soil nutrient status, (3) air temperature represented by altitude, valley depth, and normalised height, and (4) the soil water stagnation pattern.
The influence of the predictor resolution, particularly of the DTM, on digital soil mapping has been widely analysed and has been summarized by Behrens et al. [39].Ließ [40] also assumed an effect of the DTM resolution on model uncertainty for this particular research area.Therefore, this study aims at finding the best possible Random Forest model to predict the spatial water stagnation pattern and the organic layer thickness within this tropical mountain landscape by means of a range of predictor resolutions and various model tuning parameters which include predictor selection and the size of the predictor set for each Random Forest tree.

Random Forest Algorithm.
Methods from the field of statistical learning theory are often applied to understand the influence of terrain parameters on soil properties and use this relation to develop digital soil maps.Recursive partitioning methods, that is, CART, boosted classification trees, and Random Forest (RF) are used to predict soil units, clay content, or soil drainage classes from terrain parameters (e.g., [41][42][43]).These tree methods are amongst the most popular and widely used techniques for nonparametric regression and classification [44].
Recursive partitioning [45] is a procedure by which a data set, comprising one dependent (e.g., soil property) and many predictor (e.g., terrain parameters) variables, is progressively split into a dichotomously branching tree that optimizes the homogeneity of samples within subsets based on the dependent variable.At each splitting location within a tree, it is possible to determine the value of the predictor variable that best predicts the split in the data.For a regression problem (e.g., organic layer thickness), the optimal split is found by minimizing the mean square error.For a classification problem (e.g., occurrence of stagnic properties), the optimization criterion is the Gini Index [45].
RF, developed by Breiman [46], is an ensemble method which grows a number of classification or regression trees.Model stability is achieved through tree diversity by (1) choosing at random a subset of predictor variables (mtry) to grow each tree and (2) sampling with replacement (bootstrapping) and thereby varying the input dataset.The size of mtry has to be selected by the user.It is a sensitive parameter determining model strength for it defines the strength of each individual tree and the correlation between any two trees in the forest.Tree strength improves model performance, whereas correlation among trees weakens it.mtry can be optimized by the R-function tuneRF.The tuning parameter sampsize determines the size of the data subset used for model development.It is set to 2/3 by default but can be varied.Hence, Random Forest contains several tuning parameters which control internal random processes.
RF was performed within the open-source data analysis environment R (version 2.13.2;R Development Core Team, 2011).It is implemented with the package randomForest which is based on Breiman and Cutler's FORTRAN code.The R-package Boruta for predictor selection is described by Rudnicki and Kursa [47].Several tuning strategies were applied and compared to the Random Forest model with default parameters resulting in four model versions.
(1) No predictor selection and no tuning (R default values for mtry and sampsize).
(2) Additional predictor selection with the R package Boruta [19].All confirmed and tentative predictors were used.In case predictor selection with Boruta led to better models compared to (1) it was also included into (3) and ( 4).
As analysed [5], not only CART methodology, but also RF shows a strong dependence on the used dataset, especially while the dataset is small.In order to compare model performance and to estimate modelling uncertainty of the 12 adapted models-three spatial predictor resolutions times four model tuning strategies-a 5-fold cross-validation scheme was computed and conducted in 100-fold repetition to account for the effect of internal and external random effects.Internal effects refer to the bootstrapping and predictor selection procedure (mtry) implemented within Random Forest; external effects refer to the sample attribution to crossvalidation groups.The so obtained RMSE distributions of the 12 models are then also compared to the RMSE distribution of the mean of the data which was calculated by the same scheme.

Results and Discussion
3.1.Preliminary Data Mining.Thick organic layers in the tropics are assumed to develop above poorly drained basins and depressions or in highland areas with a high precipitation/evapotranspiration ratio [7].However, Ließ et al. [48] state that the wettest points in highland landscapes might be too wet to carry thick organic layers.It is often assumed that soil water logging limits organic matter turnover [49,50] which results in the expectation of a positive correlation between the occurrence of stagnic properties and organic layer thickness.However, Figures 2(a) and 2(b) clearly show that there is no correlation between the occurrence of stagnic properties and the organic layer thickness for the investigated area as was already assumed by Ließ et al. [48].The tuning of sampsize also had an ambivalent effect on model performance (model version 3), for it improved model performance for the models predicting the occurrence probability of stagnic properties, using 20 or 30 m terrain resolution, and even resulted in the best model (Figures 3(b) and 3(c)).The same is true for the prediction of the stagnic layer thickness, where it improved the models of all three terrain resolutions (Figures 4(a)-4(c)) and resulted in the overall best model for one of them (Figure 4(b)).In the latter case of 20 m terrain resolution being used for the prediction of the stagnic layer thickness, all other models (versions 1, 2, and 4) resulted in models worse than the data mean.In predicting organic layer thickness, the tuning of sampsize impaired model performance for 20 Because of their lowest median of the RMSE distribution the following models are considered best and will be used for prediction and map generation.(i) Occurrence probability of stagnic properties: 10 m DTM, model 1 (no predictor selection, no tuning).
Of these three models, a combination of tuning of mtry and sampsize might have resulted in an even better model only for the model to predict stagnic layer thickness.The model to predict the occurrence probability of hydromorphic properties reduced the Median RMSE by 18%, the model to predict stagnic layer thickness reduced it by 3%, and the model to predict organic layer thickness reduced it by 11% compared to using the data mean for prediction.That higher predictor resolution resulted in better models to predict water stagnation was also reported by Chaplot et al. [12].According to Campling et al. [51], vegetation indices and terrain parameters have a complementary role in predicting soil drainage classes.Hence, classified satellite image information could improve model performance and will, therefore, be included in future modelling approaches.The digital soil map of the median occurrence probability of stagnic properties and its interquartile range is shown in Figures 6(a) and 6(b).The best of the 12 models predicted by terrain parameters of 10 m resolution with no tuning and no predictor selection was selected.For the development of the digital soil map of the occurrence probability of stagnic properties, all terrain parameters were included.This indicates that it is the complex pattern of climate (altitude, PISR), water accumulation (curvature, convergence, KRAA), water discharge (slope, KRAS), the insulating effect of the heterogeneous geomorphology with the ridge-side valley structure in particular (TRI, normalised height, valley depth), and the wind effect (wind effect, aspect) which lead to the distribution pattern of stagnic soil properties within the investigation area.
The spatial pattern of stagnic properties occurrence probability in Figure 6(a) follows that described by Ließ et al. [48] with a minimum median probability of 0.2 throughout the area and a particularly high probability between 2100 and 2500 m a.s.l.The lower probability below 2100 m a.s.l.must be attributed to the higher inclination that supports a higher discharge of surface and subsurface flow and the higher bulk soil density [52].In contrast, particularly the flat platform-like areas above 2100 m show a much higher probability.The lower probability above 2500 m a.s.l according to Ließ et al. [48] can be attributed to a higher soil hydraulic conductivity due to a sandier soil texture [5] and therefore less chance for the development of stagnic soil properties.A low interquartile range, <0.1 for 99% of the area (Figure 9(a)), shows that the dataset is well suited to model the spatial pattern of hydromorphic properties within this area.However, particularly some parts along the upper mountain ridges (Figure 6(b) south-eastern part) display a rather high interquartile range.This shows that the model is better suited for certain parts of the landscape.However, another possible explanation is that the degree of soilscape complexity for certain geomorphological positions might be higher than for others.
The model to regionalise stagnic layer thickness is less stable than the model to predict the horizon's occurrence probability.This is indicated by the rather high interquartile ranges in Figure 7(b).Still 70% of the area displays a range <6 cm (Figure 9(b)).The vertical development of the soil profile is less influenced by surface processes.According to Park and Vlek [53], soil attributes whose vertical distribution is strongly determined by vertical pedogenesis or unknown factors are poorly modelled by environmental variables.Therefore, terrain attributes can only explain horizon thickness to some extent.Bauer [6] limited slope-parallel subsurface flow within the research area to the topsoil (stagnic layer).The frequent change of parent material within one soil profile [5] might be the reason why stagnic layer thickness cannot be explained by geomorphology alone.
The chosen model version includes prepredictor selection and the tuning of mtry so that we assume that at least some of the predictors are not or are less suitable for predicting the stagnic layer thickness.The thickest stagnic layers >40 or even >60 cm are found along the mountain ridges, with decreasing thickness while proceeding down slope towards the side valley creeks.In contrast to the occurrence probability of stagnic properties, this pattern seems to apply throughout the area.
For the development of the digital soil map of the organic layer thickness, all terrain parameters were included.This means that the assumptions about the predictors were reasonable.However, from the final model it is not clear in which way the predictors influence the spatial pattern of the organic layer thickness.The digital soil map is shown in Figure 8.Its uncertainty expressed as the interquartile range (Figure 8(b)) according to the dataset is very low; that is, different data subsets predict a similar organic layer thickness.97% of the area displays an interquartile range <5 cm (Figure 9(c)).This indicates a stable model.The thickest organic layers are not found along the mountain ridges which are supposed to be the wettest due to their exposedness and low slope angle, resulting in the highest stagnic layer thickness (and probability).The organic layer thickness rather seems to be the highest on mid slope positions, decreasing towards the creeks and towards the crests.Water logging surely limits organic layer decomposition, but is not the only cause.A higher direct solar radiation and wind exposure might favour decomposition rates along the exposed and mostly flat mountain ridges.In addition, wind might also be responsible for a lower litter fall rate on these sites.Accordingly, apart from prediction parameters indicating rainfall (altitude) soil water accumulation (plan curvature, KRAA, SWI, TRI) and water discharge (slope), solar radiation (PISR) and wind effect are also important.Last but not least, the relative slope position (normalised height, valley depth) is a good indicator for organic layer thickness.
With the overall median organic layer thickness >21 cm and 46 area % even ≥40 cm, the area is influenced by paludification, controlled by factors such as climate, geomorphology, and soil water stagnation.International mire classification [54] acknowledges soligenous surface flow mires, the socalled sloopy fens.German Bavarian classification [55] is more precise in acknowledging the fact that sloping mires are influenced by rainwater and shallow subsurface flow at the same time and refers to them as soliombrogen sloping mires.However, information on tropical mountain mires is still scarce.Chimner and Karlberg [56] state that tropical mountain peatlands unlike lowland peatlands are covered by cushion plants, bryophytes, and herbaceous.We can now report that the mires within the tropical Andes of southern Ecuador are also found within the tropical mountain forest zone at 1800-2800 m a.s.l. and under páramo vegetation starting above c.2800 m a.s.l.
Furthermore, organic layer thickness does not increase with altitude as was concluded by studies based on a dataset with a smaller spatial coverage [13,47] and is usually explained by a decrease in organic matter turnover and limitation in nutrient supply due to decreasing mean temperatures and increasing rainfall [1,20,57].
The landslides occurring with high frequency within the investigation area might give one possible explanation.However, old landslide positions, covered again by dense forest, were sampled by chance and make up less than 5% of the data.Furthermore, open landslide scars, which are visible on aerial photographs, are mainly found at upper slope positions and on steep slopes and therefore cannot explain the low organic layer thickness along the ridges which are left unaffected.

Conclusions
Soliombrogen sloping mires do not only occur in tropical páramo landscapes, but also under tropical mountain forest vegetation.However, there is no simple relation between the water stagnation pattern and organic layer thickness, but a complex interaction of various parameters has to be considered.Furthermore, soil organic layer thickness within this tropical mountain landscape does not increase along the altitudinal gradient as was assumed by authors analysing datasets of a smaller spatial coverage.
The Random Forest algorithm was successfully applied to predict the spatial distribution of the occurrence probability of stagnic properties and of the organic layer thickness in this complex tropical landscape that is influenced by landslides.The RMSE as compared to the mean was reduced by 18% by the model to predict the former and 11% by the model to predict the latter.Still stagnic layer thickness was the most difficult parameter to be predicted, as indicated by the low improvement in RMSE by only 3% as was also described by Ließ et al. [48].
Our modeling exercise has shown that position specific density functions of soil properties, characterized by median and interquartile range (Figures 6, 7, and 8), may be an appropriate way of mapping prediction uncertainty.They show that a particular model is better suited for certain parts of the landscape.Furthermore, they might indicate that the degree of soilscape complexity for certain geomorphological positions might be higher than for others, which needs to be further investigated in the near future.
All considered terrain parameters proved to be important predictors for the spatial water stagnation pattern and the paludification process in this particular landscape and should, therefore, be considered while investigating other tropical mountain landscapes, and-in case they are available-amended by others.As expected, DTM resolution also showed an impact on model performance.Higher resolution was favoured for the prediction of hydromorphic properties, while using a smoothened DTM lead to better results concerning the organic layer thickness models.Tropical soils may vary in their hydromorphic properties within a few meters.Nevertheless, it would have been interesting to extend the range of the GIS raster resolutions beyond 30 m.However, with the available dataset this was not possible.As the spatial distance between many data points is not more than 25 meters, a further terrain smoothing would have only yielded in a higher noise and model uncertainty.
However, no general decision should be made in regard to whether Random Forest tuning is necessary or not.Prepredictor selection with Boruta improved model performance in one of the three predicted soil parameters.The tuning of mtry resulted twice in the overall best model.Sampsize reduced model variation as indicated by the RMSE range and improved model performance in some cases.We, therefore, recommend always testing prepredictor selection and model tuning in order to make sure that the best possible model is chosen.This needs particular emphasis in complex tropical mountain soil landscapes which provide a real challenge to any soil mapping approach but where a supervised learning technique has proven to be successful due to the testing of model tuning and prepredictor selection.

Figure 2 :
Figure 2: (a) Boxplots of organic layer thickness for soils with and without stagnic properties based on collected data set.(b) Scatter plot relating stagnic and organic layer thickness based on collected data set.

3. 2 .
Model Performance.Predictor selection with Boruta (model version 2) resulted in no model improvement concerning the prediction of the occurrence probability of stagnic properties (Figures 3(a)-3(c)) and organic layer thickness (Figures 5(a)-5(c)) as indicated by the median RMSE.It rather impaired model performance.For the prediction of the thickness of the stagnic soil horizon it did, however, have a positive impact for the models using 10 and 20 m terrain resolution (Figures 4(a) and 4(b)).

Figure 3 :Figure 4 :Figure 5 :
Figure 3: Boxplots of the 100 RMSE of the models predicting the occurrence probability of stagnic properties.(a) 10 m terrain resolution, (b) 20 m terrain resolution, and (c) 30 m terrain resolution. 1 = default Random Forest parameters, 2 = additional predictor selection, 3 = 1, 2+ tuning of sampsize, and 4 = 1, 2+ tuning of mtry.The lines refer to the boxplot parameters of the RMSE distribution of the mean of the data as prediction model.Solid line: median, dashed line: upper/lower hinges, and dotted line: upper/lower whiskers.

3. 3 .
Digital Soil Maps.The digital soil maps display the soil parameters' distribution function for every point in the landscape.While the median (Figures 6(a), 7(a), and 8(a)) displays the spatial prediction estimate, the interquartile range (Figures 6(b), 7(b), and 8(b)) provides a spatial uncertainty estimate due to the data.

Figure 9 :
Figure 9: Covered cumulative area percentage in dependence of the interquartile range.(a) Occurrence probability of stagnic properties, (b) stagnic layer thickness, and (c) organic layer thickness.

Table 1 :
SAGA modules to calculate terrain parameters.