Modeling Soil Temperature for Different Days Using Novel Quadruplet Loss-Guided LSTM

Soil temperature (Ts), a key variable in geoscience, has generated growing interest among researchers. Many factors affect the spatiotemporal variation of Ts, which poses immense challenges for Ts estimation. To enrich the information processed by the loss function and achieve better estimation performance, this paper designed a new long short-term memory model that uses a quadruplet loss function as an intelligent tool for data processing (QL-LSTM). The model combines the traditional squared-error loss function with distance metric learning between sample features, so that distances between samples are analyzed accurately and estimation accuracy is improved. We applied meteorological data from the Laegern and Fluehli stations at 5, 10, and 15 cm depths on the 1st, 5th, and 15th days to verify the performance of the proposed soil temperature estimation model. The input variables of the proposed model include radiation, air temperature, vapor pressure deficit, wind speed, air pressure, and past Ts data. The performance of the model was tested with several error evaluation indices, including root mean square error (RMSE), mean absolute error (MAE), Nash-Sutcliffe model efficiency coefficient (NS), Willmott index of agreement (WI), and Legates and McCabe index (LMI). As the test results at different soil depths show, our model generally outperformed four existing advanced estimation models, namely backpropagation neural networks, extreme learning machines, support vector regression, and LSTM. Furthermore, the proposed model achieved its best performance at the 15 cm depth on the 1st day at the Laegern station, with higher WI (0.998), NS (0.995), and LMI (0.938) values and lower RMSE (0.312) and MAE (0.239) values. Consequently, the QL-LSTM model is recommended for estimating daily Ts profiles on the 1st, 5th, and 15th days.


Introduction
Soil temperature (Ts) is a main physical variable of the land surface and has a direct influence on the atmosphere [1]. It has drawn attention from researchers in fields including geoscience and forestry [2,3]. In principle, all interactions in terrestrial ecosystems are accompanied by Ts variations, since they involve energy exchanges. Ts is an essential factor for growing crops: it can facilitate the development of the root system by affecting microbial activity, soil decomposition, and the mobility of soil water [4]. In addition, the death of animals and plants produces abundant carbon substrates and a high volume of greenhouse gases in the soil; this raises Ts and thus expedites carbon dioxide emission to the atmosphere [5]. Therefore, accurate Ts monitoring is crucial for agricultural management and atmospheric forecasting. However, Ts in most areas is still measured with traditional sensors, and Ts data cannot be collected at different depths [6]. The study of Ts estimation can therefore help solve problems in many fields. The essential environmental factors have a great influence on the accuracy of Ts estimation. At present, Ts is mainly predicted by methods based on physical models and by data-driven methods.
The physical method estimates Ts based on a heat conduction model [7]; however, it is greatly affected by the related physical parameters and by scale problems [8]. Data-driven methods instead explore the internal relationship between Ts and the surrounding environmental factors for Ts estimation. At present, several predictive models based on machine learning are used to estimate Ts [9][10][11][12][13][14]. For example, an artificial neural network (ANN) is composed of a complex network structure that imitates the structure and function of the brain's neural network, and it has powerful data processing capabilities. Bilgili applied a multilayer perceptron (MLP) model to adequately describe the Ts distribution at a monthly temporal scale from meteorological data [15]. Kisi et al. used three machine learning models to estimate monthly Ts at soil depths of 5 cm and 10 cm and verified that radial basis neural networks outperformed generalized regression neural networks and multilayer perceptron models [9], although generalized regression neural networks performed better at greater depths (50 cm and 100 cm). Kisi et al. also applied ANN-based models to predict long-term Ts at a monthly temporal scale and found that genetic programming performed best with the meteorological data [16]. Zeynoddin et al. applied an MLP model to describe the daily Ts distribution at three soil depths (5, 10, and 20 cm) from past measurements of Ts [17]. Samadianfard et al. processed meteorological data such as Ta, W, RH, Rs, sunshine hours (Sh), and air pressure (Ap) and integrated ANN-based models separately to predict Ts at a monthly temporal scale [18]. Mehdizadeh et al. noted that machine learning models combined with time series models outperformed predictive models based on a single machine learning method or a single time series method for predicting daily Ts [19]. Moazenzadeh et al. proposed an SVR with krill herd algorithm (SVR-KHA) method for Ts estimation at different depths (5, 10, 20, 30, 50, and 100 cm), which achieved the best performance compared to SVR and SVR with the firefly algorithm (SVR-FA) [11]. Delbari et al. proposed an SVR-based model to compute daily Ts at three depths (5, 30, and 100 cm) in Iran [12]. The ELM network, featuring a single hidden layer, can improve the learning speed and accuracy of the algorithm and can model Ts accurately. Nahvi et al. used an improved ELM model for daily Ts estimation based on a self-adaptive evolutionary method and verified that the improved model can estimate Ts adequately [20]. Sanikhani et al. tested data from the Mersin station, and the results showed that ELM had better predictive performance than the other predictive models [14]. Feng et al. tested data from the Loess Plateau with ELM, random forests (RF), and ANN-based models, and showed that ELM performed better for half-hourly Ts estimation at different soil depths [13].
As a recurrent neural network composed of complex gated modules, LSTM is used in this paper to handle long-term dependence problems; it effectively alleviates vanishing gradients because its gate units extract the required features. The LSTM network [21] can learn both long-term and short-term behaviors, and it has seen application in vast areas. By integrating LSTM and SVR, Guo et al. significantly improved the prediction accuracy of abnormal passenger-flow fluctuations [22]. In hydrology, Zhang et al. designed a novel LSTM model with a dropout scheme to estimate the depth of the water table [23]. In the atmospheric field, Qing et al. estimated solar irradiance with an LSTM network [24]; the results showed that the method could avoid model overfitting. Li et al. designed a new GANs-LSTM model and noted that it could serve as an alternative method for Ts estimation [25].
This article focuses on the following issues. First, we select the environmental factors that affect Ts estimation. Ts memory can help the predictive model "remember" a warm or cold condition even after the anomaly is forgotten by the atmospheric forcing. In addition, recent literature reviews reveal that the input for prediction models is either past measurements of Ts or other meteorological information (Ta, W, RH, Rs, Sh, and Ap). If the prediction models are instead constructed using input combinations of past Ts and other meteorological information, how does the prediction model perform?
The second question is about the construction of the loss function in LSTM. Ts estimation is a regression problem that involves predicting a real-valued quantity. The loss function is crucial for optimizing the predictive model: it expresses the degree of difference between predicted and observed Ts, and it drives the weight updates. Most previous studies of regression loss functions have focused on distance metric learning between predicted and real values [26][27][28][29]. However, distance metric learning between sample features (environmental factors), which has already been successfully applied to image processing [30][31][32], is usually ignored. To enrich the information processed by the loss function and further improve estimation performance, how can we construct a novel loss function that combines these forms of distance metric learning? The last question is about the timescale of evaluation for Ts estimation. In previous studies, evaluations of short-term Ts estimation (half-hourly, hourly, and daily timescales) do not consider the timeliness of long-term Ts estimation, while evaluations of long-term Ts estimation (monthly timescale) do not include Ts information at smaller timescales. An ideal decision-support tool for Ts estimation should provide a multifarious decision-making basis. How can we design a prediction scheme at a single timescale of evaluation that provides both a short-term and a long-term decision-making basis? This paper proposes a novel quadruplet loss function based on the LSTM network, called QL-LSTM, which combines the traditional squared-error loss function with distance metric learning between sample features. The traditional squared-error loss function is usually applied to predictive tasks with great accuracy.
A limitation of this loss function, however, is the peculiar variation of Ts under different predictors. As shown in Figure 1, we assigned labels according to Ts values: Ts data in the same range receive the same label (Ts in the range 8-12°C is labeled "1", Ts in the range 12-16°C is labeled "2", and Ts in the range 16-22°C is labeled "3"). Ta data are labeled in the same way as the Ts data. In Figure 1(a), we notice that Ts data with the same label lie almost within a stable range. However, in the red ellipse in Figure 1(b), we observe that similar Ta values may have different labels (Ts data with similar Ta values may vary considerably). Data with this characteristic prevent predictive models from accurately discovering the internal relationship between Ts and the surrounding environmental factors. To address this problem, the idea of triplet loss [33] is considered in this paper. Triplet loss optimization pulls the anchor and positive points together and pushes the negative points away, realizing a similarity calculation over the samples. This approach can enrich the information processed by the loss function, overcome the disadvantage of the traditional squared-error loss function, and further improve estimation performance.
The three main contributions of this paper are summarized as follows: (1) To our knowledge, the proposed method, which combines the traditional squared-error loss function with distance metric learning between sample features, is a new approach to Ts estimation. (2) A daily-scale prediction scheme was designed to provide a multifarious decision-making basis and was used to estimate Ts on the next 1st, 5th, and 15th days; to this end, we input meteorological and past Ts data into the estimation model. (3) Results showed that our QL-LSTM method outperformed existing advanced methods in most cases.

The Framework of Soil Temperature Estimation
The corresponding meteorological data and Ts, obtained from FLUXNET, serve as the input of our QL-LSTM model. Several other advanced data-driven models (SVR, BPNN, ELM, and LSTM) were also considered for Ts estimation. The traditional squared-error loss function and distance metric learning between sample features were integrated into our model for accurate exploration of the internal relationship between Ts and the surrounding environmental factors. Finally, the comparison of model performance is reflected by five evaluation indicators (RMSE, MAE, NS, WI, and LMI). Figure 2 shows the flow chart of soil temperature estimation.

Long Short-Term Memory (LSTM) Network
LSTM can process and learn long-term dependence problems. Because of these characteristics, we use it to explore the internal relationship between Ts and the surrounding predictors. LSTM controls the transmission state through gating: it remembers what needs to be remembered and forgets unimportant information. Figure 3 shows the internal structure of an LSTM cell. The LSTM is computed as follows (reconstructed from the standard formulation, since the original display equations were lost in extraction):

i(t) = σ(W_i [x(t), h(t−1)] + b_i),
f(t) = σ(W_f [x(t), h(t−1)] + b_f),
o(t) = σ(W_o [x(t), h(t−1)] + b_o),
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ tanh(W_c [x(t), h(t−1)] + b_c),
h(t) = o(t) ⊙ tanh(c(t)),

where x(t) is the input data and y(t) = h(t) is the output; i(t), f(t), and o(t) denote the input gate, forget gate, and output gate; c(t) represents the unit state at the current moment; h(t) is the current output value; σ(·) and tanh(·) are the activation functions; and W and b denote the weight matrices and bias terms.
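As a concrete illustration of the gate equations above, the following minimal NumPy sketch (not the paper's implementation; shapes and weight names are illustrative) computes one LSTM step:

```python
import numpy as np

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the standard gate equations.

    W: dict of weight matrices for the input/forget/output gates and the
    candidate state; b: matching bias vectors.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = np.concatenate([x_t, h_prev])      # stack input and previous output
    i = sigmoid(W["i"] @ z + b["i"])       # input gate i(t)
    f = sigmoid(W["f"] @ z + b["f"])       # forget gate f(t)
    o = sigmoid(W["o"] @ z + b["o"])       # output gate o(t)
    g = np.tanh(W["c"] @ z + b["c"])       # candidate cell state
    c = f * c_prev + i * g                 # cell state c(t)
    h = o * np.tanh(c)                     # hidden output h(t)
    return h, c
```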

Triplet Loss.
Triplet loss is a significant learning criterion for optimizing predictive models and adjusting their weight parameters; it operates on an anchor example, a positive example, and a negative example. Similarity between samples is learned through the triplet loss, which makes the anchor-to-positive distance smaller than the anchor-to-negative distance. Figure 4 gives a visual representation of triplet loss. Equation (2), reconstructed here in its standard form, expresses the objective function of triplet loss:

L_triplet = Σ_i [ ||f(x_i^a) − f(x_i^p)||² − ||f(x_i^a) − f(x_i^n)||² + α ]_+ ,   (2)

where f(x_i^a), f(x_i^p), and f(x_i^n) are the corresponding feature expressions obtained by training the parameters on the triplet; α represents the minimum interval between the anchor-to-positive distance and the anchor-to-negative distance; and the value of [·]_+ defines the degree of loss.
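The hinge form of equation (2) can be sketched for a single triplet as follows (a minimal illustration; function and variable names are not from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge form of the triplet objective: the anchor-positive distance
    must be smaller than the anchor-negative distance by at least alpha."""
    d_ap = np.sum((np.asarray(anchor) - np.asarray(positive)) ** 2)
    d_an = np.sum((np.asarray(anchor) - np.asarray(negative)) ** 2)
    return max(d_ap - d_an + alpha, 0.0)
```

When the negative is already far enough away, the hinge clips the loss to zero, so only violating triplets contribute gradients.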

QL-LSTM Model.
Previous analysis shows that LSTM with the traditional squared-error loss function cannot accurately discover the special relationship between Ts and the surrounding predictors. To address this problem, inspired by the study of triplet loss, we combine a predictive model with distance metric learning between sample features. To our knowledge, methods based on distance metric learning between sample features have not previously been used to estimate Ts. Distance metric learning between sample features was first proposed in the field of image processing; however, there is no existing description of sample similarity for Ts estimation. In this paper, a clustering method is used to label the samples, so that distance metric learning between sample features can be applied to Ts estimation. The framework of our QL-LSTM is shown in Figure 5. First, because of its efficiency and scalability in clustering data, the Ts data were quantized by the Birch clustering method [34]. In the quantization step, Ts data quantized to the same label are defined as similar samples (positives), whereas Ts data quantized to different labels are defined as dissimilar samples (negatives). Note that the number of labels should be neither too large nor too small [35]; hence, the Calinski-Harabasz score (CH) and the silhouette score (S) are used to evaluate cluster quality [36]. The larger the value of CH or S, the better the quality of the clustering result. Second, the labeled data are input into the predictive model (the LSTM network). Finally, the weights of the predictive model are updated based on our quadruplet loss function to reduce the loss. We take the pairs (x_i, l_i) as the input data, where l_i represents Ts labeled as "i" and x_i represents the labeled environmental factors.
Assume C is the total number of labels. We project an instance x_i onto the Ts estimate by f_LSTM(·; θ). Let x_i^c be the environmental factors of the i-th sample with label c, and let N_c represent the total number of samples. We evaluate the similarity between samples through cluster analysis and expect the output of the model to be close to the true value.
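The labeling step can be made concrete with the sketch below, which quantizes a 1-D temperature series with Birch and scores candidate label counts with the Calinski-Harabasz and silhouette scores (the series here is synthetic stand-in data, since the FLUXNET records are not reproduced in this paper):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for a Ts series with three temperature regimes
ts = np.concatenate([rng.normal(m, 0.5, 200) for m in (5.0, 12.0, 19.0)])
X = ts.reshape(-1, 1)

scores = {}
for c in (2, 3, 5):                      # candidate numbers of labels C
    labels = Birch(n_clusters=c).fit_predict(X)
    scores[c] = (calinski_harabasz_score(X, labels),
                 silhouette_score(X, labels))

best_c = max(scores, key=lambda c: scores[c][1])  # larger score = better clustering
```

The same "larger is better" rule is what the paper uses to pick C = 25 on the real data.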

Hard Sample Mining.
Hard sample mining generally refers to hard negative mining. Adding negative sample sets to model training can improve the effectiveness of learning, and hard negatives should be mined as far as possible [37,38]. For each fixed anchor sample, the farthest positive sample and the nearest negative sample in a training batch are used to train the network, enhancing its generalization ability so that it learns better representations.
Inspired by TriHard loss, we first define x_i^c as the test sample. (x_i^c, P*_{c,i}, N*_{c,i}) is the quadruplet data set we define, where P*_{c,i} is the positive set, N*_{c,i} is the negative set, and |P*_{c,i}| and |N*_{c,i}| are the numbers of positive and negative sample pairs; these tuples form the training sample pairs. The query sample is x_i^c; when S_ij satisfies formula (3), (x_i, x_j) is a pair that we need.
S_ij represents the similarity between two samples, where ⟨·, ·⟩ denotes the calculation of an n × n similarity matrix; S_ij is the element of S at (x_i, x_j). The hyperparameter μ controls the number of hard positive samples included in the quadruplet. The condition for selecting a hard negative pair is analogous.
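One possible reading of the selection rules in formulas (3) and (4) is sketched below, assuming cosine similarity and a threshold controlled by μ (the paper's exact similarity definition and thresholds are not reproduced here, so this is illustrative):

```python
import numpy as np

def mine_hard_pairs(feats, labels, anchor_idx, mu=0.1):
    """Select hard positives (same label, low similarity to the anchor) and
    hard negatives (different label, high similarity to the anchor)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f[anchor_idx]                   # cosine similarity to the anchor
    same = labels == labels[anchor_idx]
    same[anchor_idx] = False                   # the anchor is not its own pair
    hard_pos = np.where(same & (sims < 1.0 - mu))[0]   # same label, far in feature space
    hard_neg = np.where(~same & (sims > 1.0 - mu))[0]  # different label, near in feature space
    hard_neg = hard_neg[hard_neg != anchor_idx]
    return hard_pos, hard_neg
```

Raising μ admits more pairs into the quadruplet; lowering it keeps only the hardest ones, matching the tuning behavior discussed later for Figure 8.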

Optimization Objective.
For each test sample x_i^c, we use the margin m to keep it as close to the positive set P_{c,i} as possible and as far from the negative set N_{c,i} as possible. All nontrivial positive points in P_{c,i} are pulled together by minimizing a hinge term on the anchor-to-positive distances with margin m (equation (5)). Similarly, all nontrivial negative points in N_{c,i} are pushed out of the boundary τ by minimizing a hinge term on the anchor-to-negative distances (equation (6)). Meanwhile, we apply the squared-error loss function to the LSTM model for Ts estimation. In QL-LSTM, these three minimization objectives are put into the model and optimized at the same time.
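Combining the three terms, a minimal sketch of the joint objective might look like the following (the relative weighting of the terms is an assumption of this sketch; the paper optimizes the three terms simultaneously but the original equations were lost in extraction):

```python
import numpy as np

def quadruplet_objective(y_true, y_pred, d_pos, d_neg, m=5e-5, tau=1e-3,
                         w_pos=1.0, w_neg=1.0):
    """Sum of the three minimization terms: squared error on the prediction,
    a pull term on hard-positive distances (margin m), and a push term on
    hard-negative distances (boundary tau). Weights w_pos/w_neg are assumed."""
    l_mse = np.mean((y_true - y_pred) ** 2)        # squared-error term
    l_pull = np.sum(np.maximum(d_pos - m, 0.0))    # pull positives inside margin m
    l_push = np.sum(np.maximum(tau - d_neg, 0.0))  # push negatives beyond boundary tau
    return l_mse + w_pos * l_pull + w_neg * l_push
```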

We incorporate stochastic gradient descent with minibatches into QL-LSTM to optimize the estimation model. x_i^c is a sample of the minibatch, obtained by randomly sampling the labels of the training samples, and serves as an anchor. The QL-LSTM loss of each minibatch is the average of the per-anchor losses, where N denotes the batch size. Figure 6 shows the learning procedure of our QL-LSTM model.
The evaluation indices (RMSE, MAE, NS, WI, and LMI) are computed from their standard definitions, where N is the number of data points, y_i denotes the observed value, ŷ_i is the predicted value, and ȳ is the average of the observed values.
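The five indices can be computed as follows (a sketch following their standard definitions; variable names are illustrative):

```python
import numpy as np

def evaluate(obs, pred):
    """RMSE, MAE, Nash-Sutcliffe (NS), Willmott index of agreement (WI),
    and Legates-McCabe index (LMI), in their standard forms."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = obs - pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ns = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)       # Nash-Sutcliffe
    wi = 1 - np.sum(err ** 2) / np.sum(
        (np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)  # Willmott
    lmi = 1 - np.sum(np.abs(err)) / np.sum(np.abs(obs - obs.mean()))  # Legates-McCabe
    return {"RMSE": rmse, "MAE": mae, "NS": ns, "WI": wi, "LMI": lmi}
```

A perfect prediction gives RMSE = MAE = 0 and NS = WI = LMI = 1, which is why higher NS/WI/LMI and lower RMSE/MAE indicate better models throughout the tables.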

Experiments.
The half-hourly data are obtained from two meteorological stations in an ecological nature reserve in Switzerland, namely Laegern and Fluehli. The corresponding meteorological data (Ta, W, Ap, Rs, VPD, and Ts) and past Ts data were input into the models. The input variables are normalized to eliminate the dimensional influence between indicators, using

x' = (x − x_min) / (x_max − x_min),

where x_min is the minimum value of the sample data and x_max is the maximum value. Moreover, we studied the influence of the surrounding environmental factors on model prediction and found that the Rs values at the two stations are low and close to a normal distribution.
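The min-max scaling above is, in code:

```python
import numpy as np

def minmax_normalize(x):
    """Scale each value to [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)
```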
We conducted a statistical analysis of the data from the two stations. Table 1 lists the details of the variables (minimum value (x_min), maximum value (x_max), average value (x_mean), standard deviation (z_sd), skewness (z_s), and variation coefficient (z_v)). We used daily data, built from the half-hourly records, to verify the performance of the model. Table 1 shows that Ap had the highest negative skewness and presents a normal distribution at 5 cm depth, with similar characteristics at both stations, while z_v showed the biggest difference between the two stations. Ts at the 5 cm, 10 cm, and 15 cm depths ranges over −1.888-26.876°C, −0.181-22.193°C, and 0.16-20.826°C, respectively. In summary, the values of z_sd, z_s, and z_v change only slightly.

Results and Discussion
To test the performance of our QL-LSTM model for Ts estimation, implemented with scikit-learn, we compared our test results with those of other advanced models (SVR, BPNN, ELM, and LSTM).
We chose default parameters for the SVR model. For the BPNN model, the squared error is used as the loss function and Adam as the optimizer; the batch size is 400, the number of iterations is 500, the learning rate is 5.0e-4, and the number of nodes is 128. The elm function was used to build the ELM model, the sigmoid served as the activation function in the hidden layer, and the number of nodes matched that of BPNN. Furthermore, we set the hyperparameters of LSTM to be the same as those of QL-LSTM. As can be seen from Tables 2 and 3, different hyperparameter values generate different predictive results. With the batch size set to 400, the number of iterations to 500, num_QL-LSTM to 128, and the learning rate to 1.0e-3, the QL-LSTM model has the best performance.

Evaluation of the Hyperparameters in the Quadruplet Loss Function
The quadruplet loss function has four main hyperparameters: the total number of labels C, the hyperparameter μ in equations (3) and (4), and τ and m in equations (5) and (6). When evaluating these hyperparameters, we set num_QL-LSTM to 128, the learning rate to 1.0e-3, the number of iterations to 500, and the batch size to 400. We first select the best C based on the Calinski-Harabasz score and the silhouette score. Figure 7 shows both scores for different numbers of labels; both achieve the best result when C is 25. Then we gradually tune the hyperparameters μ, τ, and m. Figure 8 shows the results of the estimation model with different μ, τ, and m at the Laegern meteorological station. When we set μ to 5.0e-3, τ to 1.0e-3, and m to 5.0e-5, our QL-LSTM model achieves the best estimation performance (RMSE = 0.789, MAE = 0.605, NS = 0.977, WI = 0.994, and LMI = 0.865). This is probably because the smaller the hyperparameters, the fewer hard samples are computed, while larger hyperparameters cause more redundant samples to be computed.

The Impact of Different Inputs on the Performance of the Predictive Model.
In this part, we analyze the environmental factors that may affect our QL-LSTM model for Ts estimation. Considering that interactions between different environmental factors affect Ts estimation, we combine the meteorological variables accordingly and input them into a set of submodels. Then, considering that past Ts continues to influence future Ts estimation, we apply lag processing to the past Ts over different numbers of days. We input the combinations specified above into QL-LSTM to predict Ts(d) at the 5 cm depth of the Laegern station. For our model, we first set the hyperparameters μ to 5.0e-3, τ to 1.0e-3, m to 5.0e-5, C to 25, num_QL-LSTM to 128, the learning rate to 1.0e-3, the number of iterations to 500, and the batch size to 400; the results are presented in Table 4. The methods QL-LSTM(I3) and QL-LSTM(I8) are clearly better than the others. Meanwhile, we conclude that the lagged inputs up to Ts(d − 4) and Ts(d − 5) all influence the performance of the predictive model. In addition, by comparing the estimation results between meteorological-variable input and past-Ts input, we found that our model with past Ts achieves greater accuracy than the one with meteorological variables; the reason may be that the predictive model with past-Ts input has a stronger memory of the Ts variable. The estimation of future Ts should make the best use of its continuity; in this way, we can make a reliable Ts estimation that not only continues the historical tendency but also conforms to the actual behavior.
Hence, we construct the predictive model QL-LSTM(I11) by combining the environmental factors with the past Ts, which is also used to estimate Ts(d) at the 5 cm depth of the Laegern station. Experimental results show that it achieves the best estimation performance (RMSE = 0.789, LMI = 0.865, WI = 0.994, NS = 0.977, and MAE = 0.605). Hence, the final input for the predictive models is the environmental factors (Ta(d − 1), Rs(d − 1), VPD(d − 1)) and the past Ts. The three methods (QL-LSTM(I3), QL-LSTM(I8), and QL-LSTM(I11)) are used to test the data of the Laegern station. Figure 9 shows the linear relationship between the predicted and observed values; the QL-LSTM(I11) model has the best predictive performance, with y = 0.9899x + 0.3022 and a higher R² (0.9792) than the others. In the frequency diagram of the three models (Figure 10), QL-LSTM(I11) also shows a higher frequency (91%) than the others. Therefore, we conclude that a predictive model combining the environmental factors and the past Ts normally outperforms the other two (using either past measurements of Ts or meteorological information alone) for Ts estimation.

Comparison with Different Models.
In this part, our QL-LSTM model is compared with several advanced models, including SVR, BPNN, ELM, and LSTM. The Ta, Rs, and VPD data on day "d − 1" and Ts data on different days served as input to the predictive models, and the output was the predicted Ts on days "d", "d + 5", and "d + 15"; time steps were in days. The testing results of the five models at 5, 10, and 15 cm depth on the 1st, 5th, and 15th days at the Laegern station are shown in Table 5. Our QL-LSTM model performs better than the existing advanced models at the 5 cm depth on the 1st day: its RMSE of 0.789 is reduced by 13%, 22%, 28%, and 22% relative to LSTM, ELM, BPNN, and SVR, respectively, and its MAE of 0.605 compares with 0.813 (SVR), 0.872 (BPNN), 0.824 (ELM), and 0.821 (LSTM). Meanwhile, the QL-LSTM model achieved higher NS, WI, and LMI values, so it clearly had the best performance in this case. For the results at 5 cm depth on the 15th day, LSTM achieved a higher WI (0.892) than the other models, but it is close to the WI (0.891) of our model. For the 10 and 15 cm depth results on the 1st, 5th, and 15th days, the performance of our QL-LSTM model remains stable, although it has slightly lower WI values (0.952 and 0.933) than the LSTM model (0.954 and 0.934) in individual cases. The predictive performance improves with soil depth (from 5 cm to 15 cm) but decreases as the horizon lengthens (from the 1st to the 15th day); this phenomenon is caused by systematic errors in long-term estimation [39]. The same strategy was applied at the Fluehli station to further verify the performance of the models, with the results shown in Table 6.

The compared models did not perform well in some cases, probably because the weights of the LSTM model are randomly initialized, which can yield non-optimal solutions. Meanwhile, our novel quadruplet loss function is built on the LSTM model, so it improves estimation performance over LSTM only to a certain extent. All in all, our tests on data from different regions show that the QL-LSTM model usually performs better for Ts prediction at different depths and days.

Conclusions
Soil temperature (Ts) is a main physical variable of the land surface, which affects many things, such as the growth and yield of crops; accurate Ts prediction is therefore very important. This paper proposed the QL-LSTM model and compared it with state-of-the-art predictive models, using the meteorological data and past Ts of the Laegern and Fluehli stations (Switzerland) for daily Ts estimation at 5, 10, and 15 cm depth on the 1st, 5th, and 15th days.
The experimental results showed that the QL-LSTM model performed better than the existing advanced models for Ts estimation in multifarious cases.
In addition, to enrich the information processed by the loss function and further improve estimation performance, we designed a novel quadruplet loss function that combines the traditional squared-error loss function with distance metric learning between sample features: similar samples are pulled together and dissimilar samples are pushed apart. This combination improves the estimation performance to a certain extent. However, the many hyperparameters in our method may cause sensitivity issues in estimation and may lead to poor generalization to other estimation tasks. In future work, a parametric adaptive method will be explored for the new loss function.
Data Availability
The experimental data are available without any restriction.

Conflicts of Interest
The authors declare that there are no conflicts of interest.