Prediction of Soil Organic Carbon for Ethiopian Highlands Using Soil Spectroscopy

Soil spectroscopy was applied to predict soil organic carbon (SOC) in the highlands of Ethiopia. Soil samples were acquired from Ethiopia's National Soil Testing Centre and through direct field sampling. The reflectance of the samples was measured using a FieldSpec 3 diffuse reflectance spectrometer. Outliers and relationships among samples were evaluated using principal component analysis (PCA), and models were developed through partial least squares regression (PLSR). For the nine watersheds sampled, 20% of the samples were set aside to test prediction and 80% were used to develop calibration models. Depending on the number of samples per watershed, either cross validation or independent validation was used. The stability of the models was evaluated using the coefficient of determination (R²), the root mean square error (RMSE), and the ratio of performance deviation (RPD). The R² (%), RMSE (%), and RPD for validation were, respectively, Anjeni (88, 0.44, 3.05), Bale (86, 0.52, 2.7), Basketo (89, 0.57, 3.0), Benishangul (91, 0.30, 3.4), Kersa (82, 0.44, 2.4), Kola tembien (75, 0.44, 1.9), Maybar (84, 0.57, 2.5), Megech (85, 0.15, 2.6), and Wondo Genet (86, 0.52, 2.7), indicating that the models were stable. Models performed better for areas with high SOC values than for areas with lower SOC values. Overall, soil spectroscopy performance ranged from very good to good.


Introduction
Ethiopia is one of the largest countries in Africa. The vast majority of the population (∼80%) depends on agriculture, which contributes around 42% of the gross domestic product (GDP) [1]. Despite its strategic importance for the country's economic development, the agricultural sector suffers from low efficiency, population pressure, ineffective land management, and unfavourable land use practices, leading to widespread land degradation. Still, the country is strongly committed to environmental protection, rehabilitation, and sustainable land management, as evident from Article 92 of its constitution [2], and various internationally supported initiatives have been launched to stop and reverse this trend. Among these, the Growth and Transformation Plan (GTP) is the major one. To address these challenges, efficient and affordable land management practices are required in all regions of the country, and their effectiveness should be analysed and monitored. In addition, there is a pronounced need to improve knowledge of soil resources and to collect reliable data on the state and dynamics of soils, allowing for more precise operational assessment and monitoring of this important resource [3]. Such information is critically important for research and development interventions.
Traditionally, such information has been collected through comparatively expensive and slow wet chemistry methods. Every year, many soil samples are collected and analysed with conventional (wet chemistry) laboratory methods to determine soil properties critical to soil management and crop husbandry. However, the country faces a critical challenge in continuing with conventional methods, as the cost of chemicals is very high and the efficiency of conventional laboratories is low. Soil spectroscopy as an alternative approach has yet to be established in Ethiopia; therefore, conventional methods of soil analysis currently remain the only available option. Since its early beginnings, spectroscopy has offered a promising prospect for soil science and is nowadays successfully applied in many parts of the world [3]. Soil spectroscopy is an analytical technique used for soil analysis [4]. It is fast and efficient, producing comprehensive soil information in a short span of time [3-7]. Many soil attributes can be determined with a single scan [4,6,8]. Moreover, soil spectroscopy requires only a small amount of soil, and it is a nondestructive procedure that allows samples to be stored and reanalysed. Finally, it requires less complex infrastructure than conventional soil analysis. Despite these benefits, the technology is not yet widely used in Ethiopia. To our knowledge, the only scientific report on soil spectroscopy in Ethiopia was published by Vågen et al. [9]. In view of the country's urgent need for more efficient technologies, a study on the applicability of soil spectroscopy for the prediction of SOC was conducted in the highlands of Ethiopia.
Source: https://doi.org/10.7892/boris.42691 | downloaded: 14.6.2020 | ISRN Soil Science
Soil organic carbon is an important indicator of soil health and fertility and also serves as a measure of the carbon storage potential of soils in relation to climate change mitigation [10]. Therefore, the objective of the study was to evaluate the feasibility of visible/near-infrared (VNIR) soil spectroscopy for predicting SOC in the highlands of Ethiopia.

Study Area.
The chosen study sites cover most of the agroecological zones, climate conditions, geologies, and soils of the Ethiopian highlands (Figure 1 and Table 1).
The study sites of Bale Mountain, Basketo, Maybar, and Wondogenet are characterised by a bimodal rainfall pattern, which occurs in the eastern, northeastern, and southeastern parts of the country. The other study sites have one prolonged rainy season [11]. The general characteristics of the study sites are given in Table 1. The southern and western parts of the country are mostly under perennial crops, whereas the northern parts are dominated by cereal crops. The farming systems are characterised by small-scale mixed crop-livestock production. In all of the study sites, sustainable land management practices have been introduced in the course of land rehabilitation and reclamation campaigns.

Soil Sample Sets.
For this study, a total of 1159 soil samples were chemically and spectrally analysed. The major share of the soil samples (713) was acquired from the soil archive of Ethiopia's National Soil Testing Centre (NSTC), representing the soils of the dominant highland areas of the country. We explored the soil archive for existing soil samples together with the analysed soil parameters, method of analysis, year of sampling, and origin of the samples. Sites for this study were then selected based on the availability of organic carbon data and their representativeness of the highlands of Ethiopia. Soil samples in the catchments of Anjeni and Maybar (Figure 1) were collected according to the field sampling procedure presented in Amare et al. [12]. The sampling design was optimized to account for the spatial variability of soil properties due to land use and land management activities, as well as for pedogenetic factors varying across spatial and temporal scales. Samples from all other sites, analysed and stored at the NSTC, were collected following sampling designs similar to the one outlined above, addressing soil characterisation and the effects of land use and management practices in the watersheds.

Chemical Soil Analysis.
The soil organic carbon content of all samples from the NSTC was determined using the wet oxidation Walkley-Black method [13], a standard procedure in Ethiopia. The additionally collected samples from Anjeni and Maybar were dried in the shade, ground, and sieved through a 2 mm sieve, and soil organic carbon was then determined in the same way as for the other sample sets. All samples were analysed in the same laboratory following the same procedure, which limited errors due to laboratory procedure and helped to work with a homogeneous sample set.

Spectral Measurement.
Pretreatment and spectral measurements were carried out in line with the protocol developed by the global soil spectroscopy group [14,15]. The reflectance of all soil samples was measured at Ethiopia's National Soil Testing Centre in Addis Abeba in February 2011 using a FieldSpec 3 diffuse reflectance spectrometer (Analytical Spectral Devices, Boulder, CO, USA). The spectra were measured from 350 to 2500 nm at 1 nm intervals. A high-intensity mug light source was used to illuminate the samples from below through borosilicate Duran glass petri dishes. A soil layer of roughly 1 cm was poured into the sample holders. A Spectralon panel (Labsphere, North Sutton, USA) was used as a white reference standard between measurements. The spectrometer was optimised after every 10 samples and whenever switching between sets of different soil samples. Two replicate spectra, with the sample rotated by 90 degrees between scans, were collected in order to increase the precision of the measurements. Whenever high variation occurred between the scans of the same sample, the measurement was repeated. After checking the spectral signatures of each repeated measurement, the average spectrum of the two readings was computed. Spectral inconsistencies (splices) were observed for all samples at 1000 and 1830 nm and were corrected by applying an offset. In order to reduce the data volume and the processing time, only every 10th reading starting from 380 nm was kept for further analysis [14]. Spectral regions below 380 nm and above 2450 nm were removed because they were affected by noise [3].
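The averaging, splice correction, and band reduction described above can be illustrated with a short Python sketch. It is not the authors' processing code: the function names are hypothetical, and the splice correction assumes the simplest possible offset, namely shifting everything after a splice so that the first band after it equals the last band before it.

```python
def average_replicates(scan_a, scan_b):
    """Average two replicate reflectance spectra of the same sample."""
    return [(a + b) / 2.0 for a, b in zip(scan_a, scan_b)]


def correct_splices(wavelengths, reflectance, splices=(1000, 1830)):
    """Remove detector steps by offsetting every band after each splice.

    Assumption: the offset is the jump between the last band at or before
    the splice wavelength and the first band after it.
    """
    corrected = list(reflectance)
    for splice in splices:
        # Index of the first band beyond the splice wavelength.
        idx = next((i for i, w in enumerate(wavelengths) if w > splice), None)
        if idx is None or idx == 0:
            continue
        offset = corrected[idx - 1] - corrected[idx]
        for i in range(idx, len(corrected)):
            corrected[i] += offset
    return corrected


def subsample(wavelengths, reflectance, start=380, stop=2450, step=10):
    """Keep every `step`-th reading from `start` to `stop` (inclusive)."""
    kept = [(w, r) for w, r in zip(wavelengths, reflectance)
            if start <= w <= stop and (w - start) % step == 0]
    return [w for w, _ in kept], [r for _, r in kept]


# Synthetic example: flat spectra with an artificial step after 1000 nm.
wavelengths = list(range(350, 2501))  # 1 nm sampling, 350 to 2500 nm
scan_a = [0.30 + (0.05 if w > 1000 else 0.0) for w in wavelengths]
scan_b = [0.32 + (0.05 if w > 1000 else 0.0) for w in wavelengths]

mean_scan = average_replicates(scan_a, scan_b)
clean = correct_splices(wavelengths, mean_scan)
bands, spectrum = subsample(wavelengths, clean)
print(len(bands), bands[0], bands[-1])
```

With these settings each spectrum is reduced from 2151 readings to 208 bands between 380 and 2450 nm.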

Data Analysis and Model Development.
The geological background, soil types, and agro-ecology of the study areas are diverse (Table 1 and Figure 1). Environmental conditions are highly variable not only between but also within watersheds (Table 2). For this reason, a separate modelling strategy was followed in this study, with one model developed for each watershed. This choice is in line with the main objective of the study, which was to test the applicability of soil spectroscopy under various environmental settings in the Ethiopian highlands, and was further motivated by the findings and discussions of Stenberg et al. [16] and McDowell et al. [17]. For exploratory data analysis, the R statistical language [18] was used, while modelling was done in Unscrambler 10.1 (CAMO, Oslo, Norway). The dataset was screened for outliers using PCA as well as model residuals. One sample from Bale, one from Megech, and three from Wondogenet were flagged for removal due to high residual variance as well as high leverage and score distance; we speculate that sample handling errors or fundamentally different soil characteristics (out-of-population samples) could be reasons for these deviations.
Various spectral preprocessing options were tested. Preprocessing with the first derivative was found to improve model performance most in terms of stability and interpretability, except for the Bale case study site, where the second derivative scored higher. Taking derivatives is a widely used preprocessing technique [16], which has a baseline correction effect and is able to enhance weak signals [16,19]. Before modelling, all spectra further underwent autoscaling (mean-centring and variance-scaling), which is a recommended standard procedure for most cases [20].
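A minimal Python sketch of these two preprocessing steps is given below. The text does not state which derivative algorithm was used in Unscrambler, so a plain finite difference between neighbouring bands is assumed here; the function names and the toy spectra are hypothetical.

```python
import statistics


def first_derivative(spectrum, step=10.0):
    """Finite-difference first derivative (assumed; band spacing in nm)."""
    return [(b - a) / step for a, b in zip(spectrum, spectrum[1:])]


def second_derivative(spectrum, step=10.0):
    """Second derivative obtained by differencing twice."""
    return first_derivative(first_derivative(spectrum, step), step)


def autoscale(spectra):
    """Mean-centre and variance-scale each band (column) across samples."""
    columns = list(zip(*spectra))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, sds)]
            for row in spectra]


# Toy spectra: the second is the first plus a constant baseline offset.
raw = [
    [0.10, 0.20, 0.40, 0.70],
    [0.15, 0.25, 0.45, 0.75],
    [0.20, 0.25, 0.35, 0.50],
]
derivs = [first_derivative(s) for s in raw]
scaled = autoscale(derivs)
print(derivs[0], derivs[1])
```

The first two toy spectra differ only by a constant offset, so their first derivatives coincide, which illustrates the baseline correction effect mentioned above.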
The raw spectra displayed in Figure 2 illustrate the variability of the VNIR spectra in the sample set, using the Bale case study site as an example. The shape of all VNIR spectra was basically similar, showing a steep ascent from 400 to 750 nm, which is characteristic of iron oxides [16]. The dominant absorption regions (reflectance minima) around 1450 and 1900 nm are usually attributable to OH⁻ and H₂O, masking most other signals [16]. The higher similarity among the derivative spectra in the lower part of Figure 2 results from the removal of additive and multiplicative scatter effects achieved by the derivative preprocessing.

Calibration Models.
Partial least squares regression (PLSR) was used to model the relationship between soil spectral reflectance and soil organic carbon determined by wet chemistry. The number of samples selected for calibration, validation, and prediction varied from watershed to watershed depending on the number of samples available. For all watersheds except Anjeni, models were developed with 80% of the samples using cross validation, while 20% of the samples were randomly selected and set aside for model testing. The SOC distributions of the samples randomly selected for independent model testing were compared with those of the samples used for calibration and validation using descriptive statistics (Table 2) and the statistical tests discussed later.
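The models in this study were fitted in Unscrambler; the plain-Python sketch below only illustrates the idea behind PLSR for a single response variable (PLS1, in a NIPALS-style formulation). The function names `fit_pls1` and `predict_pls1` and the toy data are hypothetical, and cross validation is left out for brevity.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_components):
    """Fit a PLS1 model by extracting latent components one at a time."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centred X
    f = [v - y_mean for v in y]                              # centred y
    weights, loadings, coefs = [], [], []
    for _ in range(n_components):
        w = [_dot(col, f) for col in zip(*E)]   # covariance of bands with y
        norm = math.sqrt(_dot(w, w))
        if norm == 0.0:
            break                               # nothing left to explain
        w = [v / norm for v in w]
        t = [_dot(row, w) for row in E]         # scores
        tt = _dot(t, t)
        p_load = [_dot(col, t) / tt for col in zip(*E)]  # X loadings
        q = _dot(f, t) / tt                     # y loading
        # Deflate X and y before extracting the next component.
        E = [[v - ti * pj for v, pj in zip(row, p_load)]
             for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        weights.append(w)
        loadings.append(p_load)
        coefs.append(q)
    return {"x_mean": x_mean, "y_mean": y_mean,
            "weights": weights, "loadings": loadings, "coefs": coefs}


def predict_pls1(model, x):
    """Predict the response for one new spectrum by sequential deflation."""
    e = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, p_load, q in zip(model["weights"], model["loadings"],
                            model["coefs"]):
        t = _dot(e, w)
        y_hat += q * t
        e = [v - t * pj for v, pj in zip(e, p_load)]
    return y_hat


# Toy data: the response is an exact linear function of two "bands".
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in X]
model = fit_pls1(X, y, n_components=2)
prediction = predict_pls1(model, [6.0, 4.0])
print(round(prediction, 6))
```

With as many components as independent predictors, PLS1 reproduces ordinary least squares, which is why the toy prediction matches the generating equation. For real spectra with hundreds of collinear bands, far fewer components than bands would be retained, with the number typically chosen by cross validation.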

Splitting Samples into Calibration, Validation, and Test Sets.
For the Anjeni watershed, the number of samples (302) was sufficient to split into calibration and validation sets. Before this split, 20% of the samples (61 samples) were selected randomly and set aside as the test set. The remaining 80% of the samples (241 samples) were then split randomly into a calibration set (145 samples) and a validation set (96 samples). This approach of separating calibration, validation, and test sets is recommended as a robust procedure by Varmuza and Filzmoser [21]. The test sets for all other sites were separated following the same procedure. The selection of calibration, validation, and test sets for a given site was based on the size of the sample set available per study area. The statistical appropriateness of the sample split was scrutinized using the two-sample t-test for mean homogeneity, the Wilcoxon rank-sum test for distribution homogeneity, and Levene's test for variance homogeneity. Furthermore, descriptive statistics were calculated for the total samples and for the calibration, validation, and test sets, as presented in Table 2.
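The random three-way split for Anjeni can be sketched as follows. The 60/40 ratio between calibration and validation sets is not stated in the text; it is inferred here from the reported counts (145 and 96 out of 241), and the rounding rules, seed, and function name are likewise assumptions chosen to reproduce those counts.

```python
import math
import random


def split_samples(sample_ids, test_frac=0.20, cal_frac=0.60, seed=42):
    """Randomly split sample IDs into calibration, validation, and test sets.

    The test set is drawn first; the remainder is then divided into
    calibration and validation sets (cal_frac is an inferred value).
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_test = math.ceil(test_frac * len(ids))
    test, rest = ids[:n_test], ids[n_test:]
    n_cal = round(cal_frac * len(rest))
    return rest[:n_cal], rest[n_cal:], test


cal, val, test = split_samples(range(302))
print(len(cal), len(val), len(test))
```

For 302 samples this yields 145 calibration, 96 validation, and 61 test samples, matching the counts reported for Anjeni.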
The statistical parameters (R², RMSE, and RPD) were calculated as indicated below.
The coefficient of determination (R²) is as follows:

$$R^2 = \frac{\left[\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2 \, \sum_{i=1}^{N} (Y_i - \bar{Y})^2},$$

where $X$ is the laboratory measured soil organic carbon (%) for each observation and $\bar{X}$ is the mean of the laboratory measured values, while $Y$ is the predicted soil organic carbon (%) and $\bar{Y}$ its mean value. The root mean square error (RMSE) is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - X_i)^2},$$

where $Y$ is the predicted (fitted) value, $X$ the measured value, and $N$ is the number of observations. The ratio of performance deviation (RPD) [22] is as follows:

$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSE}},$$

where SD is the standard deviation of the measured samples. Finally, calibration models were evaluated based on the coefficient of determination (R²), root mean square error (RMSE), and ratio of performance deviation (RPD), as discussed later in the results part of this paper.

Figure 2: Spectral preprocessing from raw spectral data to first and second derivatives for Bale samples.

The sample splits given in Table 2 have been tested for statistical conformity. For all splits, the tests did not reject the null hypothesis, which assumes homogeneity of the parameters tested.

Soil Organic Carbon Distributions in the Study Sites.
As shown in Table 2, soil organic carbon (from wet chemistry analysis) was highly variable between locations and within each location. The highest variability of soil organic carbon content was observed at Anjeni, with values ranging from 0.2% (soil samples from agricultural fields) to 13.68% (samples collected from the topsoil of an old forest belonging to the Orthodox Church). The latter value is also the maximum SOC value recorded for the entire sample set used for this study. Such high SOC content results from a combination of high rainfall, favourable temperature, and regular turnover of biomass without forest clearing or biomass removal, based on the Ethiopian Orthodox Church's strict rules and regulations for its own forest management. The church's contribution to and role in forest protection and reforestation have been studied in detail by Wassie Eshete [23].
Tesfahunegn et al. [28] reported SOC contents within the range of the results given in Table 2. For the southern parts of the country, Solomon et al. [29], Lemenih et al. [30], Teklay et al. [31], and Ashagrie et al. [32] found similar carbon contents. Yimer et al. [33] and Yimer et al. [34] reported comparable values for the Bale Mountains, while Spaccini et al. [35] and Tulema et al. [36] reported similar results for the central and northeastern highlands. Based on these findings, we conclude that the results presented in Table 2 adequately represent the respective areas in terms of the SOC range covered.

Spectral Data Analysis Using PCA.
As described previously, the spectral readings from all study sites were analysed using PCA in order to detect outliers and patterns in the data; this was also done to detect clusters or subgroups for aggregated model building by comparing the sites to each other. Given the high variability listed in Table 1, PCA is a promising and powerful tool for reducing redundancy and dimensionality in the data. Therefore, PC1 and PC2 were calculated from the raw spectra of selected study sites and the scores were displayed in scatterplots (Figure 3). The cumulative variance explained by the first two PCs was higher than 90% in all cases. Figure 3 displays the score plots for the study sites of Kersa, Kola tembien, and Maybar, indicating that these watersheds are very different; only little similarity exists between the scores, even if the 95% data coverage ellipses might suggest a small overlap. Similarly, the Bale, Kola tembien, and Megech study sites were projected into the PC space and show well-defined regions with few overlapping samples. Samples from Megech were also separated from samples of Wondogenet and Bale. While the Megech samples cover a rather compact region in the plot, the samples from Kola tembien spread less uniformly over a much larger area, which might indicate various levels of complexity and variability within the watershed. In Figure 3, the Maybar samples cover a compact region, with five distinct samples far away from the centre. Still, the Maybar samples were separated from the Wondogenet and Bale Mountain samples with only little overlap. Figure 3 also displays samples from Megech together with Bale and Kersa. This panel clearly shows a certain degree of similarity between the Bale and Kersa watersheds, indicating the potential for developing a combined model.
These results are not very surprising considering the highly diverse environmental conditions present in the different catchments, and the uneven spread of samples in the PCA space of Kola tembien and Bale even suggests building further subset models. Similar results and conclusions have been reported by McDowell et al. [17] and Stenberg et al. [16], stating that subset models score higher in terms of precision and accuracy. The downside of subset models is, of course, their lack of stability and robustness. Thus, using PCA, it was possible to reveal the data structure in the samples relating to their origin, with varying levels of overlap. The method is fast because it requires no wet chemistry soil analysis. Odlare et al. [37] have suggested that field characteristics must be significantly different to be distinguished from each other. Similarly, PCA has been used effectively to separate organic and nonorganic wine products [38], soil colors [18], soluble and less soluble elements [39], and sediment sources [40]. PCA has also been reported to separate soil samples according to management practices such as tillage [41] and composting [42].
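To make the score plots discussed in this section more concrete, the sketch below computes PC1 and PC2 scores and the share of variance each explains. It is an illustration only, not the software used in the study: it assumes a plain power iteration with deflation on the covariance matrix, uses hypothetical function names and toy data, and would be slow for full spectra, where a linear algebra library would be the practical choice.

```python
import math


def center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centred data."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector by power iteration."""
    vec = [float(i + 1) for i in range(len(matrix))]  # arbitrary start
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca(rows, n_components=2):
    """Return per-sample scores and the explained variance ratios."""
    centered = center(rows)
    cov = covariance(centered)
    size = len(cov)
    total_var = sum(cov[i][i] for i in range(size))
    scores, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        scores.append([sum(a * b for a, b in zip(row, vec))
                       for row in centered])
        explained.append(value / total_var)
        # Deflate: remove the component just found from the covariance.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(size)]
               for i in range(size)]
    return list(zip(*scores)), explained


# Toy data: five samples measured in two strongly correlated bands.
rows = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
scores, explained = pca(rows)
print([round(e, 3) for e in explained])
```

Plotting the first column of `scores` against the second, with one colour per watershed, would give a score plot of the kind shown in Figure 3.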

Model Calibration, Validation, and Testing.
Using PLSR, the results show in general that visible/near-infrared spectroscopy performed well across the different agroecological conditions, soil types, and land management practices of Ethiopia. As shown in Table 3, all values of the coefficient of determination (R²) were in a range of very good to good with reference to Stenberg and Rossel [43] and Chang et al. [44].
For all sites except Kola tembien and Kersa, the ratio of performance deviation (RPD) was above 2.5 for validation, a value that is considered very adequate for developing stable prediction models for soil organic carbon [45]. The value of RPD for the present study ranged between 2.79 (Maybar) and 6.04 (Benishangul) for calibration and between 1.92 (Kola tembien) and 3.39 (Benishangul) for the validation set. The comparatively poor performance of the calibration model for Kola tembien may be related to the low levels of SOC, which are related to the high level of soil degradation [46], and to the high carbonate content. This reflects the findings of other authors who reported unstable calibration models for areas characterised by low SOC and high carbonate levels [16]. Benishangul is covered by a single soil type (Nitisols), which developed on gneisses and basalts. The area is further dominated by uniform land use. On the other hand, Anjeni, a watershed in Gojam, provided very acceptable model results despite the high range of SOC (Table 2), high variability of soil types (Table 1), and land uses [47].
The root mean square error (%) varied between 0.11 (Megech) and 0.53 (Maybar) for calibration and between 0.15 (Megech) and 0.57 (Maybar) for validation. Rossel et al. [45] reported a range of root mean square error (RMSE) comparable to the present results. Rossel and Behrens [48] achieved maximum precision with 0.75% RMSE and 89% R² for soil organic carbon, using an artificial neural network (ANN) with wavelet coefficients; this is comparable to the present results (Table 3). They reported higher RMSE and lower R² with multiple linear regression (MLR), PLSR, multivariate adaptive regression splines (MARS), support vector machines (SVM), random forests (RF), and boosted trees (BT) than with ANN, indicating that the present results are fairly reliable. After calibration and validation, the developed models were tested with ∼20% of the samples from each study area; the results are presented in Figure 4.
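The validation statistics reported in this section can be recomputed from paired measured and predicted SOC values. The sketch below is a minimal illustration under two assumptions: R² is taken as the squared correlation between measured and predicted values (in line with its definition in terms of both means), and SD is the sample standard deviation (n - 1 denominator) of the measured values. The numbers are made up for the example.

```python
import math
import statistics


def r_squared(measured, predicted):
    """Squared correlation between measured and predicted values."""
    mx = statistics.fmean(measured)
    my = statistics.fmean(predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy ** 2 / (sxx * syy)


def rmse(measured, predicted):
    """Root mean square error of prediction."""
    n = len(measured)
    return math.sqrt(sum((y - x) ** 2
                         for x, y in zip(measured, predicted)) / n)


def rpd(measured, predicted):
    """Ratio of performance deviation: SD of measured values over RMSE."""
    return statistics.stdev(measured) / rmse(measured, predicted)


# Hypothetical SOC values (%), for illustration only.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.5, 3.5, 3.5, 5.5]
print(round(r_squared(measured, predicted), 3),
      round(rmse(measured, predicted), 3),
      round(rpd(measured, predicted), 3))
```

Because RPD divides the spread of the measured values by the prediction error, sites with a narrow SOC range can show a low RPD even when their RMSE is small.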

Conclusions and Recommendations
From the results of the study, it is possible to conclude that soil spectroscopy is an effective method for predicting soil organic carbon in the highlands of Ethiopia. There are considerable variations in the performance of the local models, which are mostly linked to the variability of the environmental settings prevailing within the watersheds. Models for the study sites in the moist southern and western parts of the country seem to perform slightly better. In general terms, the agreement between predicted and measured values of soil organic carbon was high for all sites, despite the high variability of soil organic carbon within watersheds like Anjeni. The RPD values for validation ranged from 1.9 (Kola tembien) to 3.4 (Benishangul). This indicates fairly stable models, and thus the models developed from these data could be used to build local spectral libraries with sufficient predictive power for local land use and management-related applications. Nevertheless, there is potential for further improvement of the presented local models, which could be achieved by enlarging the range of the sample set and adding more samples from the respective sites; this would also allow more robust calibration models to be developed. Considering the results from the PCA, there is potential for research into the performance of aggregated models and other modelling strategies, along with the use of nonparametric calibration methods as suggested by many authors. While the present study focussed only on soil organic carbon, the ability of soil spectroscopy to predict other chemical, biological, and physical soil parameters in Ethiopia remains to be explored.