Use of Machine Learning Models for Prediction of Organic Carbon and Nitrogen in Soil from Hyperspectral Imagery in Laboratory



Introduction
Precision agriculture is the science that develops techniques that facilitate the process of obtaining results in the field. These techniques include technologies such as Remote Sensing (RS) using satellites, in-field sensors using unmanned aerial vehicles (UAVs), and laboratory sensors [1]. Sensors can use hyperspectral imaging (HSI), which provides information through image pixels to identify the materials that make up the soil [2]. HSI images capture the portion of the electromagnetic spectrum corresponding to the visible region (400-800 nm) and a portion of the near infrared (NIR) and mid-infrared (MWIR) regions (800-2500 nm).
Radiation absorbed by chemical bonds containing carbon or other nonmetals (C-H, N-H, S-H, C=O, and O-H) is concentrated in the NIR spectral region; therefore, HSI image data corresponding to this region provide information about the chemical composition of the sample [3]. When HSI images are acquired, they provide information in the form of a three-dimensional hypercube, usually with a large amount of data and multicollinearity among the bands [4]. However, the processing and extraction of such information is complex and requires the application of algorithms and multivariate transformations that are not widely used in general statistics. Nevertheless, HSI imaging offers several advantages, such as high speed and ease of data acquisition, and several machine learning algorithms are available to calibrate this technique [5].
Organic carbon (OC) and nitrogen (N) play a key role in plant nutrition, and the levels of these nutrients are synonymous with soil fertility [6]; however, many farmers are unaware of the OC and N content of their agricultural soils and consequently overapply amendments and fertilizers or, on the contrary, do not supply the soil with the nutrients it needs.
Recently, many authors have attempted to develop equations using sensors and machine learning to calibrate OC [7][8][9] and N [10][11][12] in soil. The information provided by such techniques has enabled researchers to determine and interpret various soil properties at both field and regional scales; detailed information can be obtained that allows quantitative analyses of soil constituents [1].
Machine learning models are components of a branch of artificial intelligence and can learn routines on their own. Supervised learning is the process of training a machine learning algorithm on labelled examples (inputs paired with known answers) to make a prediction. These machine learning algorithms can be classified as classification or regression algorithms. One of the most used models is the Random Forest (RF), which is a series of decision trees that act as a set of classifiers; it can be used to solve both regression and classification problems [13]. Each of the decision trees in the RF model is constructed using a different bootstrap sample of the data. One set of data is used for calibration and another for testing. At the end of the RF analysis, the regression prediction is calculated by averaging the predictions of the individual trees; for classification, the predicted class is decided by a majority vote among the trees. Others are Support Vector Machine (SVM) models, which are based on solving a convex quadratic optimization problem to obtain a globally optimal solution, which avoids the local-extremum problem of other machine learning techniques; SVM is a nonparametric model and is considered a classification model capable of dealing with high-dimensional data [14]. The training and evaluation of multivariate models allows the evaluation of variables with high dimensionality, where the variables have been masked and subjected to different transformations; however, these processes must be performed iteratively due to the large number of models that can be generated.
Transformations are mathematical equations or formulas that are applied to spectral data to reduce noise. Transformations can improve assumptions when applying statistical models and allow for easier comparisons among the data being analysed. Examples of transformations include Absorbance, Savitzky-Golay (SG), Detrending, Standard Normal Variate (SNV), and Multiplicative Scatter Correction (MSC). Model fit or performance factors are the mathematical criteria evaluated after the statistical model is applied to determine its acceptability. One of the fit factors for machine learning models is the coefficient of determination or R² (equation (1)), which relates the sum of error squares to the sum of total squares and indicates the proportion of variance in the response variable that is explained by the predictor variables:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \tag{1} \]

where \(y_i\) is the observed value, \(\hat{y}_i\) is the predicted value, \(\bar{y}\) is the mean of the observed values, and \(n\) is the number of samples. Another factor is the root mean square error of prediction or RMSEP (equation (2)), which indicates the difference between the predicted and observed values:

\[ \mathrm{RMSEP} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}. \tag{2} \]

The ratio of performance to deviation or RPD (equation (3)) is given by the standard deviation (sd) of the observed data over the RMSEP:

\[ \mathrm{RPD} = \frac{\mathrm{sd}}{\mathrm{RMSEP}}. \tag{3} \]

RPD values greater than 3 are considered excellent in agricultural applications; values greater than 2 indicate good model performance [15]. Authors such as Wadoux et al. [16] also consider an RPD >2 to indicate a good model in soil applications.
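As an illustration of the three fit factors described above, the following minimal Python sketch computes R², RMSEP, and RPD from observed and predicted values. The function names are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) for RPD is an assumption, since the text does not state which form was used.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - (sum of error squares / sum of total squares)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_obs - y_pred) ** 2)
    sst = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - sse / sst)

def rmsep(y_obs, y_pred):
    """Root mean square error of prediction."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)))

def rpd(y_obs, y_pred):
    """Ratio of performance to deviation: sd of observed values over RMSEP.

    Assumption: the sample standard deviation (ddof=1) is used.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    return float(np.std(y_obs, ddof=1) / rmsep(y_obs, y_pred))
```

With these definitions, a model whose RMSEP is small relative to the spread of the observed data yields a high RPD, which is how the thresholds of 2 and 3 cited above are interpreted.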

Materials and Methods
Study Area.
Soil sampling was carried out in the department of Antioquia, Colombia, specifically in different subregions and on farms growing flowers, cacao, and pastures for beef and dairy cattle (Figure 1). Within the department, there are different thermal floors (altitudinal climate zones) classified as high, medium, and low tropic, resulting in soils with highly variable physical and chemical characteristics. Sample processing and image acquisition were carried out at the Faculty of Agricultural Sciences of the University of Antioquia. A total of 1998 soil samples were collected at a depth of 15 cm between 2020 and 2023.

Chemistry Data.
Each of the samples collected contained two bags of soil. This material was mixed and homogenized to ensure sample uniformity. Half of each sample was processed (dried, sieved to 2 mm, and stored) in the laboratory. Drying was performed in a forced air oven at a temperature of 40 °C for 48 h. The other half of each soil sample was sent to a wet chemistry laboratory where all soil nutrients were analysed by the conventional method. Results were obtained for the soil chemical variables OC and N, which were analysed using the Walkley-Black and Kjeldahl techniques, respectively. These results were used as reference values to calibrate the HSI data.

Hyperspectral Image Data Acquisition.
Dry soil samples with a particle size of 2 mm were placed in a 10 and 20 cm³ tray. Reflectance values were corrected using a Zenith Lite™ 50% R SG31XX diffuse reflectance target. This target was placed at the front of the tray so that the cameras captured it at the beginning of the procedure. Two cameras were used to capture the images: a Hyspex® Baldur V-1024 N (VNIR) with a spectral resolution of 5.4 nm, a spatial resolution of 3289 × 1024 pixels, and coverage of the spectral range from 485 to 955 nm, and a Hyspex® Baldur S-384 N (SWIR) with a spectral resolution of 5.45 nm, a spatial resolution of 1216 × 384 pixels, and coverage of the spectral range from 951 to 2517 nm.

Data Preprocessing.
Image preprocessing was performed using the Python 3.8.2 programming language [17] and the SpectralPy, Spectral, and NumPy libraries. The region of interest (ROI) was selected by coordinates within the image. A region was selected in the centre of the image where the edges of the tray were not included and where the sample was homogeneous. An average of the pixels of each band included in the ROI was calculated. Pixels with reflectance less than 0.10 or greater than 0.90 were masked to eliminate shadows and saturated pixels. The overlapping bands between the two cameras were determined, and the transition zone corresponding to the 951 nm band was eliminated. The change between the 955 and 957 nm bands was analysed, and spectra with a change greater than 0.097 were eliminated.
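The ROI averaging and masking steps described above can be sketched in NumPy as follows. This is an illustrative sketch, not the code used in the study: the function names, the (rows, columns, bands) array layout, and the choice to mask each pixel value band by band are assumptions. The thresholds 0.10, 0.90, and 0.097 are taken from the text.

```python
import numpy as np

def roi_mean_spectrum(cube, rows, cols, low=0.10, high=0.90):
    """Average the ROI pixels of each band, ignoring masked values.

    cube: reflectance array of shape (rows, cols, bands) (assumed layout).
    rows, cols: (start, stop) pixel coordinates of the ROI.
    Values below `low` (shadows) or above `high` (saturation) are masked.
    """
    roi = np.asarray(cube[rows[0]:rows[1], cols[0]:cols[1], :], dtype=float)
    valid = (roi >= low) & (roi <= high)
    total = np.where(valid, roi, 0.0).sum(axis=(0, 1))
    count = valid.sum(axis=(0, 1))
    mean = np.full(roi.shape[2], np.nan)   # NaN where no valid pixel remains
    np.divide(total, count, out=mean, where=count > 0)
    return mean

def has_sensor_jump(spectrum, idx_left, idx_right, threshold=0.097):
    """Flag a spectrum whose change between two adjacent bands exceeds the threshold."""
    return bool(abs(spectrum[idx_right] - spectrum[idx_left]) > threshold)
```

Here `idx_left` and `idx_right` would be the indices of the bands on either side of the junction between the two cameras; spectra for which `has_sensor_jump` returns True would be discarded.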

Training and Test of Statistical Models
The raw data of the spectra were reflectance values. The spectral signature of the soil samples is shown in Figure 2.
The OC and N variables were transformed by the square root (√OC and √N), a transformation that has a moderate effect and is weaker than other transformations; it is used to reduce right skewness. The spectral data were transformed into absorbance values; other transformations were then applied, including SNV, MSC, the first derivative of SG, and detrend. The Mahalanobis distance was applied to the spectral data to detect outliers. No outliers were found, so all data were retained. For the RF model, 500 and 800 trees were used; for the SVM model, radial and linear methods were used. In total, 75% of the data was used for training and 25% for testing the statistical models.
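The spectral transformations named above can be written compactly in NumPy. The sketch below is illustrative only and is not the implementation used in the study: the function names, the SG window and polynomial order, the second-degree detrending polynomial, and the use of the mean spectrum as the MSC reference are all assumptions. Spectra are assumed to be stored as rows of a (samples × bands) array.

```python
import numpy as np

def absorbance(R):
    """Convert reflectance to apparent absorbance, log10(1/R)."""
    return np.log10(1.0 / np.asarray(R, dtype=float))

def snv(X):
    """Standard Normal Variate: centre and scale each spectrum (row)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)  # row ~ intercept + slope * ref
        out[i] = (row - intercept) / slope
    return out

def sg_first_derivative(X, window=11, polyorder=2, delta=1.0):
    """Savitzky-Golay first derivative along the band axis (window must be odd).

    Edge bands are dropped, so the output has `window - 1` fewer bands.
    """
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    A = np.vander(offsets, polyorder + 1, increasing=True)
    kernel = np.linalg.pinv(A)[1] / delta  # least-squares weights for the slope term
    X = np.asarray(X, dtype=float)
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel[::-1], mode="valid"), 1, X)

def detrend(X, wavelengths, degree=2):
    """Subtract a polynomial baseline fitted to each spectrum."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(wavelengths, dtype=float)
    x = 2.0 * (w - w.min()) / (w.max() - w.min()) - 1.0  # rescale for numerical stability
    coefs = np.polynomial.polynomial.polyfit(x, X.T, degree)
    return X - np.polynomial.polynomial.polyval(x, coefs)
```

A typical chain under these assumptions would be `sg_first_derivative(absorbance(R))`, with the soil variable transformed separately as `np.sqrt(y)` and predictions squared to return to the original units.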
The models were run using the statistical software R (version 4.2.2). The randomForest and caret libraries [3] were used to run the RF model, and the e1071 library [18] was used for the SVM model. The performance of the models was evaluated based on the R², RMSEP, and RPD metrics and the absence of overfitting between training and test data. Figure 3 shows the methodology applied to the soil samples and the spectral information.

Characterization of the Variables Used in the Study.
According to the descriptive statistics applied to the data, the mean was found to be 2.92% ± 2.72 and 0.31% ± 0.23 for OC and N, respectively. For the transformed data, the mean and median values are more similar, and the standard deviation of the data is significantly reduced. The square root transformation of the soil variables is expected to significantly improve the performance of the statistical models. The results are shown in Table 1.
Neither variable's data show a normal distribution. Most of the data are on the far left of the histogram. The variables represent nonsymmetric data (Figure 4).
The average OC value may be ideal for soils in warm climates but low for soils in cold climates. This research included soils belonging to all thermal floors; therefore, the OC values must be analysed according to the area studied to determine whether they are high, medium, or low. The maximum OC values found are associated with the high tropic zones of the department, where lower temperatures slow the mineralization of organic matter. The average value of N corresponds to overfertilized soil, since the normal range for this nutrient is 0.1-0.2%. The analysis of the data by subregions showed that the N values are high in some areas of the high tropics of the department. This result may be related to the high use of nitrogenous fertilizers in dairy cattle production.

Statistical Models
In total, 96 statistical models were obtained: 48 RF models and 48 SVM models for the two soil variables. This number results from combining the two types of models with the different combinations of transformations and methods for the two variables. For the RF models, 500 and 800 trees were used; in addition, the internal validation method "cv" (cross-validation) was used, which is a method to verify the effectiveness of a machine learning model. Its function is to set aside a part of the dataset that is not used to train the model, to be used later as test data. For the SVM models, the linear and radial methods were used. Only the models with the highest performance for each of the soil variables are shown. Table 2 shows the results of the RF and SVM models for the OC variable. In general, high fit values were obtained with all transformations and RF models. In all models, a better fit was obtained when the soil OC variable was transformed. In addition, better performance was obtained for all models and transformations using 800 trees. The model that showed the best performance combined the absorbance transformation and √OC: an R² of 0.87 was obtained for the test data group, the RMSEP was 0.10, which is one of the lowest values obtained in the present study, and the RPD was 6.74, which was the highest value for the models studied. In addition, the model did not show overfitting, as the R² of the test data was the same as that of the training data. Although the coefficients of determination are lower than those of the SVM models, excellent fits were obtained for RMSEP and RPD. The RF model performed better than the SVM model for the OC variable. None of the models showed overfitting for the validation data. The best performing RF model was the one that used the first derivative SG transformation of the spectral data and the √OC transformation.
Based on a literature review, Vargas et al. [19] concluded that the RF and SVM algorithms are useful for determining the OC in soil. These algorithms have also been studied by other authors. Pouladi et al. [20] used RF models to predict soil organic matter, which can be directly related to the OC content through a conversion factor. They found an R² of 0.89 and an RMSEP of 4.20. Their relatively large error may be because the study was conducted with relatively few samples. The RMSEP values found by these authors are much higher than those found in the present investigation. Yang et al. [21] have also conducted studies using RF models to determine the OC in harsh climates, where the maximum fit of the model was 0.71 and the RMSEP was 0.48, which are still close to the fit obtained in the present work. Hong et al. [5], who used HSI images in conjunction with RF models to determine OC in soil, obtained an R² value of 0.79 and an RMSEP of 0.18, similar to the values obtained in the present investigation. The research carried out by Nawar and Mouazen [22] shows that RF models are an excellent method for calculating OC and N in soil; these authors found fits as high as 0.97 using cross-validation of the algorithm, an RPD of 5.58, and an RMSEP of 0.01; these values were found using a set of 528 data points distributed over several European countries.
Table 3 shows the results obtained for the N variable when the RF and SVM models were applied. Although the RF models had a lower R², the RPD obtained was the best among all models and the RMSEP was the lowest among the models. Therefore, the RF model showed the highest performance, and its fit factors were excellent when used with the first derivative SG transformation. The results obtained for this model were an R² of 0.79 for the training and test data, an RMSEP of 0.03, and an RPD of 5.44.
For the soil variable N, a better performance of the SVM model was obtained using a combination of the first derivative SG transformation and √N with the radial method. The SVM algorithm gave the best results for the determination of OC and N when combined with different transformations. For this model, Datta et al. [23] obtained a good fit when using the bands with the highest correlation in the spectrum for the OC variable, obtaining an R² of 0.90, which is similar to the R² value obtained in the present study. Likewise, Aldana et al. [24] fitted the SVM model and obtained an R² of 0.95 and an RMSEP of 0.21 for OC, which confirms our results for the same variable. Meng et al. [25] also applied SVM models and obtained an R² of 0.80, an RMSEP of 3.20, and an RPD of 1.71. Although their coefficient of determination is similar to that found in the present study, the other fit values differ significantly from those in the present work, possibly due to the difference in the number of samples between the studies. Authors such as Vargas et al. [19], through a systematic review, concluded that SVM models are the most suitable machine learning algorithms to determine variables such as organic matter and N in soils because they achieve better performance than other multivariate models.
Figure 5 shows all predicted and fitted data obtained using the RF algorithm for the two variables of interest. These plots correspond to the models with the spectral data transformed using the first derivative SG transformation and the square root of the soil variables. That is, they refer to the best RF model observed for each variable.

Correlation of Spectral Bands.
After applying the model, we performed a correlation analysis between the spectral bands and the OC and N variables (Table 4). The correlation analysis was applied to the transformed and untransformed databases, and the bands that gave a better result, with correlations above 0.60 or below −0.60, were selected. A small number of bands were found to correlate with nutrients. The detrend transformation resulted in a greater number of band ranges for OC and N. For OC and N, a strong correlation was observed between the band ranges from 500 to 900 nm, which includes portions of the visible and NIR regions, and from 1300 to 1950 nm, which is in the NIR region.
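The band selection described above can be sketched as a per-band Pearson correlation followed by a threshold on its absolute value. The text does not state which correlation coefficient was used, so Pearson is an assumption here, as are the function name and the (samples × bands) layout.

```python
import numpy as np

def correlated_bands(X, y, wavelengths, threshold=0.60):
    """Return wavelengths whose correlation with y is above +threshold or below -threshold.

    X: spectra of shape (samples, bands); y: soil variable (e.g. OC or N).
    Assumption: Pearson correlation, computed band by band.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    wavelengths = np.asarray(wavelengths)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
            (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    keep = np.abs(r) > threshold   # constant bands give NaN and are dropped
    return wavelengths[keep], r[keep]
```

Applying the same function to each transformed version of the spectra (for example, detrended data) would give the band ranges per transformation that are compared in the text.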
The correlation between the spectral bands and the OC content in the soil is related to the presence of carbon and other elements. In their study carried out to determine the OC in soil using hyperspectral images, Aichi et al. [26] found a high correlation of OC with the range of bands between 400 and 680 nm. In addition, they correlated the concave spectral signature of the soil with a high OC content between the bands at 400 and 950 nm, which was corroborated by the present study because the set of all spectral signatures of the soil resulted in this behaviour. Meng et al. [25] studied the behaviour of soil OC and found that the bands most sensitive to the presence of carbon are in the visible region of the spectrum, which confirms the results of the present investigation, where most of the correlated bands were also found in the visible region. The presence of OC in the visible region of the spectrum can also lead to strong correlations because of the relationship between the color of the soil (dark) and its presence in large amounts [20,27]. Several authors have found a significant relationship between wavelength and the OC of soil. Strong correlations were found in the visible region: in the bands from 550 to 700 nm [28] and between the bands at 526 and 587 nm [29]. These findings support our results, as we found medium and high correlations in the reflectance data in the 566-852 nm spectral range. A high correlation was also observed between OC and reflectance near 490 nm [30]. This band showed a high correlation in our research; however, it was detected only when the detrend transformation was applied to the spectral data.
Regarding the bands of the spectrum correlated with the variable N, authors such as Patel et al. [31] observed strong absorption peaks near 1400, 1900, 2200, and 2350 nm. In the present study, correlated bands were found between 1412 and 1420 nm, in addition to some bands near 1900 nm. Tahmasbian et al. [32] also found bands highly correlated with the N content, such as the bands between 400 and 900 nm. These bands were also found to be correlated in our research.

Conclusions
The results of this study show that the RF and SVM machine learning models can be useful for predicting soil OC and N variables. The SVM model behaves better than the RF model, as indicated by the better R², RMSEP, and RPD values of the SVM fits. Using spectral transformations, in this case absorbance and the first derivative of SG, in combination with the machine learning models can result in a better fit and more accurate prediction of the OC and N data. Few spectral bands with high correlation with the study variables were observed; however, we found certain bands where the correlation is high. These band ranges should allow researchers to work with specific areas of the spectrum in relation to different soil nutrients. The use of HSI can help reduce the use of conventional techniques, which currently have numerous drawbacks.

Figure 2: Spectral signature of the collected soil samples in reflectance values.

Figure 3: Sequential chart of the methodology used in this research.

Figure 4: Histogram of the behaviour of the data of the variables OC and total N.

Table 1: Descriptive statistics of the investigated soil variables.

Table 2: Results of adjustment factors for the SVM and RF models for the soil OC variable.

Table 3: Results of adjustment factors for the SVM and RF models for the soil total N variable.