A Novel Methodology for Credit Spread Prediction: Depth-Gated Recurrent Neural Network with Self-Attention Mechanism

This paper develops a depth-gated recurrent neural network (DGRNN) with a self-attention mechanism (SAM), built on long short-term memory (LSTM), gated recurrent unit (GRU), and Just Another NETwork (JANET) neural networks, to improve the accuracy of credit spread prediction. The empirical results for the U.S. bond market indicate that the DGRNN model is more effective than traditional machine learning methods. In addition, we find that the Depth-JANET model with one gated unit performs better than the Depth-GRU and Depth-LSTM models, which have more gated units. Furthermore, comparative analyses reveal that SAM significantly improves DGRNN's prediction performance. The results show that the Depth-JANET neural network with SAM outperforms most other methods in credit spread prediction.


Introduction
Credit spread is the risk premium demanded by investors in credit bonds over the yield of risk-free bonds of the same maturity, and it is the basis of credit bond pricing and risk management. By grasping the future trend of the credit spread, stakeholders in the bond market can make decisions more scientifically. For instance, investors can improve the accuracy of transactions, financiers can choose financing times more scientifically, and regulators can properly prevent and control financial risks. In addition, the credit spread can be used to monitor the macroeconomy and warn governments. However, because the bond market is usually regarded as a complex system [1], existing techniques do not predict the credit spread accurately. Therefore, it is necessary to discuss, theoretically and empirically, how to improve the accuracy of credit spread prediction.
As a representative technology of artificial intelligence, deep learning methods have developed rapidly in recent years [2-4]. Deep neural networks have become the most advanced forecasting method in finance due to their outstanding performance in time-series prediction [5, 6]. They have been widely used to predict indicators such as stock prices, exchange rates, gold prices, and housing prices [5, 7-9].
Many studies show that deep neural networks can effectively fit complex nonlinear relationships between input variables with a higher degree of fit, reducing the overfitting and local-extremum problems of shallow models. In addition, deep neural networks place no restrictions on the form of input variables, so all relevant information can be included. In particular, deep neural networks can perform generalized learning based on data characteristics, weakening irrelevant information while learning heterogeneous information.
Existing literature on credit spread prediction is mainly based on linear models [6, 10]. Although deep learning methods can help improve the accuracy of credit spread prediction, it is not clear which algorithm has the best prediction performance, in line with the "No Free Lunch Theorem" proposed by Wolpert. Thus, the performance of deep learning algorithms in credit spread prediction merits in-depth investigation [11].
This paper aims to construct a depth-gated recurrent neural network with self-attention mechanism (SAM-DGRNN) to predict credit spreads in the U.S. corporate bond market. The main contributions are as follows: (1) We apply the XGBoost algorithm to integrate the selected credit spread determinants and extract the feature variables with the highest importance for prediction. (2) We construct depth-gated recurrent neural networks based on LSTM/GRU/JANET and compare them with three traditional nonlinear machine learning models (support vector regression, multilayer perceptron, and random forest) and a linear model (the vector autoregressive model). The comparative analysis of their prediction effects supports the superiority of deep learning methods in predicting credit spreads. (3) We construct a depth-gated recurrent neural network with self-attention mechanism (SAM-DGRNN) to explore the effectiveness of SAM in credit spread prediction. The remainder of this paper is organized as follows. Section 2 provides a brief background. Section 3 reviews the literature related to the deep learning field. Section 4 introduces the theoretical methods and the methodology for constructing the credit spread prediction model. Section 5 presents the experimental results. Section 6 concludes the paper.

Long-Short-Term Memory (LSTM) Neural Network.
The LSTM neural network has three gated units: the input gate, the forget gate, and the output gate. The gated units allow information to selectively affect the recurrent neural network at each moment. Each gate outputs a value between 0 and 1 that indicates how much information can pass (0 means "no information can pass" and 1 means "all information is allowed to pass"). The forget gate controls what information is discarded from or kept in the cell state, the input gate controls how much new information is added to the cell state, and the output gate controls which part of the cell state will be output. The schematic diagram of the LSTM neural network structure is shown in Figure 1. The update rules are shown in equations (1) to (6).
First, the forget gate discards useless historical information. Second, the input gate updates the state with input data and historical information. Third, the output gate outputs current information.
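For reference, the three steps above correspond to the standard LSTM update rules, sketched below. This is the usual textbook form, which equations (1) to (6) are assumed to follow; the symbol names ($i_t$, $o_t$, $\tilde{c}_t$, $W_{ix}$, and so on) are chosen by analogy with the notation this paper uses for GRU and JANET and are not taken from the original equations.

```latex
\begin{align}
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Here $\odot$ denotes element-wise multiplication, $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $c_t$ is the cell state, and $h_t$ is the hidden state.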

Gated Recurrent Unit (GRU) Neural Network.
The GRU neural network consists of two gated units. The update gate controls the degree to which previous state information is brought into the current state. The smaller its value is, the less previous information it brings in and the smaller its impact on the current hidden layer. The reset gate controls how much state information from the previous moment is ignored. The larger its value is, the less information is ignored. The GRU neural network merges the input gate and the forget gate of the LSTM neural network into a single update gate and combines the cell state and the hidden state. These features not only retain the advantages of LSTM in solving long-term dependency problems but also lead to a simpler structure, with fewer parameters and higher training efficiency. The schematic diagram of the GRU neural network structure is shown in Figure 2. The update rules are shown in equations (7) to (10).
First, the reset gate determines the degree to which the candidate state $\tilde{h}_t$ depends on the previous state $h_{t-1}$. Second, the update gate determines the weights of the historical information inherited from the previous state $h_{t-1}$ and of the new information accepted from the current candidate state. In these update rules, $x_t$ is the input vector at time $t$, $h_{t-1}$ is the output vector at time $t-1$, $b_r$, $b_h$, $b_z$ are the bias vectors, $W_{rx}$, $W_{rh}$, $W_{hx}$, $W_{hh}$, $W_{zx}$, $W_{zh}$ are the weight matrices, $\sigma$ is the logistic sigmoid function, $\tanh$ is the hyperbolic tangent activation function, and $r_t$, $z_t$ are the output states of the reset gate and update gate at time $t$, respectively.
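Using the symbols defined above, the GRU update rules can be sketched in their standard form as follows. This is an assumed reconstruction of equations (7) to (10), not a copy of them. The convention in the last line, where $z_t$ weights the previous state, is chosen to match the description that a smaller update-gate value brings in less previous-state information; some references swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{align}
r_t &= \sigma(W_{rx} x_t + W_{rh} h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_{hx} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h\bigr) \\
z_t &= \sigma(W_{zx} x_t + W_{zh} h_{t-1} + b_z) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}
```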

Just Another NETwork (JANET) Neural Network.
The JANET neural network, with dramatically less training time, performs better than the LSTM neural network on multiple benchmark data sets. The schematic diagram of the JANET neural network structure is shown in Figure 3. The update rules are shown in equations (11) to (14).
In these update rules, $x_t$ is the input vector at time $t$, $h_{t-1}$ is the output vector at time $t-1$, $W_{fx}$, $W_{fh}$, $W_{cx}$, $W_{ch}$ are the weight matrices, $b_f$, $b_c$ are the bias vectors, $\sigma$ is the logistic sigmoid function, $\tanh$ is the hyperbolic tangent activation function, $f_t$ is the output state of the forget gate at time $t$, and $c_t$ is the state of the memory unit at time $t$.
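A single JANET time step can be sketched in plain Python as below. The sketch assumes the simplified forget-gate-only form ($f_t$ from a sigmoid, a tanh candidate state, $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$, and $h_t = c_t$), which may differ in detail from equations (11) to (14). The function names `janet_step` and `affine` and the `params` dictionary layout are illustrative choices, not part of the original work.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_x, x, W_h, h, b):
    """Return W_x @ x + W_h @ h + b for list-of-lists matrices."""
    return [
        sum(wx * xi for wx, xi in zip(row_x, x))
        + sum(wh * hi for wh, hi in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def janet_step(x_t, h_prev, c_prev, params):
    """One JANET step (assumed simplified form with a single forget gate)."""
    # Forget gate f_t: values in (0, 1).
    f_t = [sigmoid(a) for a in affine(
        params["W_fx"], x_t, params["W_fh"], h_prev, params["b_f"])]
    # Candidate memory state.
    c_tilde = [math.tanh(a) for a in affine(
        params["W_cx"], x_t, params["W_ch"], h_prev, params["b_c"])]
    # The forget gate also plays the role of the input gate via (1 - f_t).
    c_t = [f * c + (1.0 - f) * ct for f, c, ct in zip(f_t, c_prev, c_tilde)]
    # There is no output gate: the hidden state is the memory state.
    h_t = list(c_t)
    return h_t, c_t
```

Because the input gate is tied to $1 - f_t$ and the output gate is dropped, the cell needs only the two weight groups ($W_{f\cdot}$ and $W_{c\cdot}$) listed above.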

Self-Attention Mechanism (SAM).
In the self-attention mechanism, $Q$, $K$, and $V$ are the query, key, and value vector sequences, respectively, and $W_Q$, $W_K$, $W_V$ are the learnable parameters. In the output vector, $(K, V) = [(k_1, v_1), (k_2, v_2), \ldots, (k_N, v_N)]$ is the sequence of key-value pairs representing the input information, $i, j \in [1, N]$ are positions in the input and output vector sequences, and the connection weight $\alpha_{ij}$ is dynamically generated by the attention mechanism. The softmax function ensures that all weights sum to 1.
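The mechanism described above can be illustrated with a small, dependency-free sketch. It assumes scaled dot-product scoring (dividing by the square root of the key dimension), which is a common choice but is not confirmed by the text; the function names and the row-per-position matrix layout are also illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the returned weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def self_attention(X, W_Q, W_K, W_V):
    """Self-attention over N input vectors (rows of X).

    Returns the output vectors and the weight matrix alpha, where
    alpha[i][j] is the weight that output position i puts on input j.
    """
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    scale = math.sqrt(len(K[0]))  # assumed scaled dot-product scoring
    outputs, alpha = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        alpha.append(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, V))
                        for d in range(len(V[0]))])
    return outputs, alpha
```

Each row of `alpha` sums to 1, so every output vector is a weighted average of the value vectors.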

Related Work
Deep learning methods have been widely used in many fields. As one of the classic deep learning models, the long short-term memory (LSTM) neural network has a great advantage in mining long-term dependencies in sequence data. It was first proposed by Hochreiter and Schmidhuber to solve long-term memory problems in recurrent neural networks by introducing "gated units" [12]. Wang [...] combined LSTM with convolutional neural networks (CNN) and proposed the CNN-LSTM model. Through CNN, signal features are transmitted to LSTM to realize dynamic memory [15]. Yu et al. applied LSTM to a nonlinear system model and proposed an improved depth LSTM that combines the strengths of LSTM and the multilayer perceptron. The stability of the training method is verified by the Lyapunov function, and the model outperforms other existing models on nonlinear systems [16].
Figure 3: Structure of JANET neural network.
Mathematical Problems in Engineering
Because of the many parameters involved, the LSTM neural network has lower training efficiency. To address this drawback, Cho et al. proposed a simplified gated recurrent unit (GRU) neural network based on the LSTM neural network and showed that the prediction performance of the GRU neural network is better than that of the standard LSTM neural network [17]. In particular, GRU can significantly simplify the structure of LSTM, reduce the number of parameters, and greatly shorten the training time. Liu et al. used GRU to replace the LSTM in the neural programmer-interpreter to change its core structure [18]. Comparing against the classification results of LSTM and the LSTM fully convolutional network (LSTM-FCN), Elsayed et al. found that GRU has higher classification accuracy and a simpler hardware implementation in time-series classification problems, with a smaller architecture and less computation [19]. Wu et al. combined GRU with CNN to propose a GRU-CNN hybrid neural network model, in which GRU extracts the feature vector of time-series data and CNN extracts the feature vector of high-dimensional data [20]. Pan et al. applied the GRU-CNN combined model to water level prediction for the Yangtze River.
Through a comparative analysis on 30 years of Yangtze River water level data, they confirmed that the model is superior to the wavelet neural network, LSTM, and the autoregressive integrated moving average (ARIMA) statistical model [21]. Given the excellent performance of the GRU neural network after eliminating redundant gates, Westhuizen and Lasenby further explored whether the three gated units in the LSTM neural network are necessary, with the aim of building more efficient models [22]. They proposed JANET (Just Another NETwork), which has only a forget gate and chronologically initialized bias terms.
Attention mechanisms are widely studied in neuroscience and computational neuroscience. This common mechanism comes from the fact that many animals focus only on specific parts of their visual field in order to respond adequately. Therefore, many neural computing studies have concluded that only the most relevant information, rather than all information, is needed for further neural processing. In recent years, this mechanism has also been widely used in deep learning research, such as image re-rolling and voice recognition. Recent studies have found that incorporating the self-attention mechanism in deep learning can effectively extract the information most critical to the current task and thereby enhance predictive power. The attention mechanism has become one of the most important topics in the deep learning literature following the research by Vaswani et al. [23]. Zhao et al. designed a long short-term memory (LSTM) neural network model with an attention mechanism based on dynamic sequences in the internet financial market [24].
The empirical results showed that their attention mechanism model outperformed others. Chen and Ge applied an LSTM neural network with an attention mechanism to predict the stock price trend in Hong Kong and achieved satisfactory prediction results [25].
In the training of deep neural network models, gradient vanishing and overfitting often lead to unsatisfactory learning. Studies have shown that the batch-normalization (B.N.) method can alleviate gradient vanishing by pulling the data back to a standard normal distribution with a mean of 0 and a variance of 1 [26]. Furthermore, Dropout can prevent overfitting to a certain extent by preventing neuronal coadaptation during the training phase [27]. However, improper use of the two methods together can have the opposite effect. Li et al. found that placing Dropout after all B.N. layers or modifying Dropout's formula to reduce its sensitivity to variance could improve the coordination between B.N. and Dropout [28]. Luo et al. suggested that, by adopting differentiable learning, the switchable-normalization (S.N.) method could determine the appropriate normalization operation for each normalization layer in a deep network [29]. As a result, it is more advantageous than B.N. in avoiding gradient vanishing. Therefore, in our deep neural network design, we add an S.N. layer and a Gaussian Dropout layer to optimize its structure. A reasonable combination of the S.N. layer and the Dropout layer will improve the performance of the neural networks.

Depth-Gated Recurrent Neural Network with Self-Attention Mechanism (SAM-DGRNN).
We add an S.N. layer and a Gaussian Dropout layer to optimize the structure of the deep neural network. Specifically, the main structure of the depth-gated recurrent neural network constructed in this paper includes an attention mechanism layer, three LSTM/GRU/JANET layers (with 128, 64, and 32 neurons, respectively), and two fully connected layers (with 32 neurons and one neuron, respectively). An S.N. layer is added in front of each LSTM/GRU/JANET layer. A Gaussian Dropout layer is added after each LSTM/GRU/JANET layer, with the drop rate set to 0.2. The structure of the deep LSTM/GRU/JANET neural network is shown in Figure 4, where the neural network structure is enclosed in the dotted box.
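To make the Gaussian Dropout layer concrete, the sketch below applies multiplicative Gaussian noise with mean 1 and standard deviation $\sqrt{p/(1-p)}$ for drop rate $p$, which is the convention used by common deep learning libraries. Whether the original implementation uses exactly this convention is an assumption, and the function name `gaussian_dropout` is illustrative.

```python
import math
import random


def gaussian_dropout(values, rate=0.2, rng=None, training=True):
    """Multiply each activation by noise drawn from N(1, rate / (1 - rate)).

    At inference time (training=False) the activations pass through
    unchanged, because the noise has mean 1.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    std = math.sqrt(rate / (1.0 - rate))
    return [v * rng.gauss(1.0, std) for v in values]
```

With the drop rate of 0.2 stated above, the noise standard deviation is $\sqrt{0.2/0.8} = 0.5$.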

Training Method, Loss Function, and Optimizer Selection.
We apply the mini-batch gradient descent method to train the deep neural network. To predict future credit spreads, the mean square error (MSE) is selected as the loss function. We choose the Adam optimizer (adaptive moment estimation) for training. Compared with other adaptive learning rate algorithms, the Adam algorithm is more robust to the choice of hyperparameters, has higher training efficiency, and can generate more effective results [30]. The experimental environment of this paper is shown in Table 1.
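The training recipe above (mini-batches, MSE loss, Adam updates) can be illustrated on a toy one-parameter model. The sketch below uses the standard Adam update with its usual default decay rates; the toy model $y = w x$, the batch size, the learning rate, and the function names are assumptions made for illustration and do not reflect the settings used in the experiments.

```python
import math


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds t, m, and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def fit_slope(xs, ys, batch_size=2, epochs=200, lr=0.05):
    """Fit y = w * x by mini-batch gradient descent on MSE with Adam."""
    w = 0.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the batch MSE with respect to w.
            grad = sum(2.0 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w = adam_step(w, grad, state, lr=lr)
    return w
```

In the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` has magnitude close to 1, so the parameter moves by roughly one learning rate regardless of the gradient scale.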

Control Models.
In the depth-gated recurrent neural network model, we employ the rolling-window prediction method and use the data of n trading days to predict the credit spread on the next day.
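The rolling-window setup can be sketched as follows: each sample consists of the feature vectors of the previous `look_back` trading days, and the label is the credit spread on the following day. The function name `make_windows` and the parameter name `look_back` (standing for the n trading days) are illustrative.

```python
def make_windows(features, target, look_back):
    """Build rolling-window samples.

    features: list of per-day feature vectors, in chronological order.
    target:   list of per-day credit spreads, aligned with features.
    Returns (X, y), where X[k] holds look_back consecutive days of
    features and y[k] is the credit spread on the day right after them.
    """
    X, y = [], []
    for end in range(look_back, len(features)):
        X.append(features[end - look_back:end])
        y.append(target[end])
    return X, y
```

A series of T trading days yields T minus `look_back` samples, since the first `look_back` days have no complete history.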
This paper first integrates the indicators with the XGBoost (extreme gradient boosting) algorithm, extracting the predictor variables with the highest importance ranking. XGBoost builds an ensemble of gradient-boosted trees and borrows feature subsampling from the random forest algorithm, which can further reduce computational complexity. When dealing with a large amount of data, XGBoost can operate in parallel and divide the data according to different characteristics to form a sequence of trees. This algorithm is simple and effective, and it can transform complex data into an orderly and concise arrangement.
Then the feature variables with a higher importance ranking are selected as the model input.
To comprehensively evaluate the prediction effect of the depth-gated recurrent neural network, one benchmark deep learning model (RNN) and three traditional machine learning models used in financial prediction (support vector regression (SVR), multilayer perceptron (MLP), and random forest (RF)) are selected as nonlinear control models. Following [6], VAR is used as the linear control model. The deep RNN has the same structure as the depth-gated recurrent neural network. The parameter combination in SVR is set as "radial basis function (RBF) kernel, penalty parameter C = 1, gamma = auto". We also select the classic MLP neural network with three layers. The prediction methods of RNN, SVR, MLP, and RF are consistent with that of the depth-gated recurrent neural network. The prediction procedure of the VAR model is as follows. First, the stationarity of all sequences is judged jointly by the ADF test, the KPSS test, and the PP test, and the test results determine whether to difference the sequences to the corresponding order to obtain stationary sequences. Second, the VAR model is established, and its order is determined by combining the AIC and BIC information criteria.
Third, the VAR model is estimated, and the model's stability is tested. Finally, the credit spread sequence is predicted based on the stable VAR model. The prediction flowchart is shown in Figure 5.

Variable Selection.
We collect daily closing data from 2009 to 2019. The 2517 trading days during this period are divided into a training set (the first 85% of trading days) and a test set (the remaining 15% of trading days). Table 2 shows variables that have been verified in the literature as significant credit spread determinants. The credit spread sequence is the forecast indicator, and the additional variables are used as features to predict credit spreads. The detailed indicators are discussed as follows.
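Because the data form a time series, the split is chronological rather than shuffled. A minimal sketch is shown below; truncating the cut point to an integer is an assumption, since the exact rounding used for the 2517 trading days is not stated, and the function name is illustrative.

```python
def chronological_split(samples, train_fraction=0.85):
    """Split time-ordered samples into a training part and a test part.

    The earliest train_fraction of samples form the training set and the
    remainder form the test set, so no future data leaks into training.
    """
    cut = int(len(samples) * train_fraction)  # assumed: truncate to integer
    return samples[:cut], samples[cut:]
```

Under this rounding assumption, 2517 trading days give 2139 training days and 378 test days.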
Risk-free interest rate term structure: the risk-free interest rate is an important variable in the structural model. The information contained in the shape of the riskless yield curve can improve the prediction performance of credit spreads [6].
Credit spread term structure: the credit spread curve's level, slope, and curvature are the principal variables for predicting the future credit spread [6].
Fama-French factor returns: credit spreads indicate the extra compensation for holding risky assets, analogous to stock risk premiums. Therefore, financial markets would transfer the explanatory power of stock returns, represented by Fama-French factor returns, to the bond market [31].
Return on stock index: stocks are also yield-producing securities. The equity market is the most plausible alternative to the fixed-income market, and equity market indexes measure capital market investment levels. Therefore, the return on the stock index could be relevant to corporate bond credit spreads [32].
Volatility of stock index: the VIX Index, often referred to as the market's "fear gauge," can be correlated with credit spreads, which capture the future probability of default as a common forward-looking risk metric. As a result, stock market volatility is a significant variable for explaining credit spread changes [33].
Exchange rate: prevailing economic theory, such as uncovered interest rate parity, suggests that there should be an empirical relationship between exchange rates and interest rates. Given exchange rate fluctuations, foreign investors will be attracted to invest in U.S. corporate bonds. The foreign exchange rate is a heretofore overlooked variable for explaining credit spread changes [33].
Oil prices: energy prices, as a cost of economic activities, are captured by oil prices to study their influence on credit spreads.
TED spread: the TED spread captures additional macroeconomic and interest rate information from international fixed-income markets. LIBOR and U.S. treasury yields are often used to price complex financial derivative products, and the difference between them is an important predictive variable [34].
Swap spread: the swap spread is highly correlated with credit spreads because it is a proxy for credit rate. Since the swap market is better developed and more liquid than the corporate bond market, swap rates may provide a forward indication of credit spreads. Credit spreads will increase with swap spreads [35].
Commodity Price Index: it is widely used to analyze price fluctuations in commodity markets and the macroeconomy. The CPI index is a better indicator of inflation [6].
The importance score based on the XGBoost algorithm is shown in Figure 6. The y-axis represents the feature, the x-axis represents the importance score, and the score is between 0 and 1. Figure 6 shows the mutual information of the selected features. To avoid disturbance from insignificant features, with 0.01 as the cut-off point of the importance score, we select the ten features with the highest mutual information from the raw feature set. Features with higher mutual information are more helpful in determining future spreads.

Evaluation Indexes for Prediction Results.
In this paper, we apply three indicators to evaluate prediction accuracy: MAE (mean absolute error), MAPE (mean absolute percentage error), and RSR (root mean square error (RMSE) divided by the standard deviation). The smaller the value is, the more accurate the prediction. Following the classification of RSR values by Moriasi et al. (2007), the prediction performance is excellent when RSR ≤ 0.5, good when 0.5 < RSR ≤ 0.6, average when 0.6 < RSR ≤ 0.7, and poor when RSR > 0.7. SDAPE (standard deviation of mean absolute percentage error) is used to evaluate prediction stability. The smaller the SDAPE value is, the more stable the prediction. The evaluation indicators are calculated with formulas (17)-(21), where $y_i$ and $\hat{y}_i$ represent the actual value and the predicted value of credit spreads, respectively, STD is the standard deviation of the actual values, and $N$ is the sample size.
Although these indicators are widely used to compare prediction accuracy, their values alone are insufficient to determine models' prediction ability. We also conduct D.M. statistical tests with these indicators as the basic loss function to analyze statistical significance [36]. The idea of the D.M. statistical test is as follows. For a set of actual time series $\{y_t\}_{t=1}^{T}$, the estimated values for the two models are $\{\hat{y}_{it}\}_{t=1}^{T}$ and $\{\hat{y}_{jt}\}_{t=1}^{T}$, whose error sequences are $\{e_{it}\}_{t=1}^{T}$ and $\{e_{jt}\}_{t=1}^{T}$, and whose loss functions are $g(y_t, \hat{y}_{it}) \equiv g(e_{it})$ and $g(y_t, \hat{y}_{jt}) \equiv g(e_{jt})$, respectively. As a result, their relative loss function can be expressed as $d_t = g(e_{it}) - g(e_{jt})$. The null hypothesis is that the two models' prediction abilities are not different, expressed as $E(d_t) = 0$.
If the loss-differential series $\{d_t\}_{t=1}^{T}$ is covariance stationary and short memory, then standard results may be used to deduce the asymptotic distribution of the sample mean loss differential.
We have $\sqrt{T}(\bar{d} - \mu) \xrightarrow{d} N(0, 2\pi f_d(0))$, where $\bar{d} = \frac{1}{T}\sum_{t=1}^{T} d_t$ is the sample mean loss differential, $f_d(0) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty}\gamma_d(\tau)$ is the spectral density of the loss differential at frequency 0, $\gamma_d(\tau) = E[(d_t - \mu)(d_{t-\tau} - \mu)]$ is the autocovariance of the loss differential at displacement $\tau$, and $\mu$ is the population mean loss differential. The formula for $f_d(0)$ shows that the correction for serial correlation can be substantial, even if the loss differential is only weakly serially correlated, due to the accumulation of the autocovariance terms.
Because in large samples the sample mean loss differential $\bar{d}$ is approximately normally distributed with mean $\mu$ and variance $2\pi f_d(0)/T$, the obvious large-sample $N(0, 1)$ statistic for testing the null hypothesis of equal forecast accuracy is $DM = \bar{d} / \sqrt{2\pi \hat{f}_d(0)/T}$, where $\hat{f}_d(0)$ is a consistent estimate of $f_d(0)$. If the absolute value of the D.M. statistic is greater than the critical value, the null hypothesis is rejected, indicating that the two models' predictive abilities are significantly different.
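The evaluation indicators and the D.M. statistic described in this section can be sketched in plain Python as below. Several details are assumptions rather than the paper's exact definitions: the standard deviations use the population form (dividing by N), SDAPE is computed as the standard deviation of the individual absolute percentage errors, and the long-run variance is estimated with a rectangular window over `max_lag` sample autocovariances. All function names are illustrative.

```python
import math


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


def ape(y, y_hat):
    """Absolute percentage errors (actual values must be nonzero)."""
    return [abs((a - p) / a) for a, p in zip(y, y_hat)]


def mape(y, y_hat):
    """Mean absolute percentage error."""
    errors = ape(y, y_hat)
    return sum(errors) / len(errors)


def sdape(y, y_hat):
    """Standard deviation of the absolute percentage errors (assumed form)."""
    errors = ape(y, y_hat)
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))


def rsr(y, y_hat):
    """RMSE divided by the standard deviation of the actual values."""
    n = len(y)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / n)
    mean_y = sum(y) / n
    std = math.sqrt(sum((a - mean_y) ** 2 for a in y) / n)
    return rmse / std


def dm_statistic(loss_i, loss_j, max_lag=0):
    """D.M. statistic for two per-period loss series.

    max_lag=0 suits one-step-ahead forecasts. With max_lag > 0 the
    rectangular-window variance estimate can be nonpositive in small
    samples, in which case the square root below raises ValueError.
    """
    d = [a - b for a, b in zip(loss_i, loss_j)]
    T = len(d)
    d_bar = sum(d) / T

    def gamma(k):
        # Sample autocovariance of the loss differential at lag k.
        return sum((d[t] - d_bar) * (d[t - k] - d_bar)
                   for t in range(k, T)) / T

    # Estimate of 2 * pi * f_d(0).
    long_run_var = gamma(0) + 2.0 * sum(gamma(k) for k in range(1, max_lag + 1))
    return d_bar / math.sqrt(long_run_var / T)
```

The resulting statistic is compared with standard normal critical values, as described above.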

Prediction Performance of Depth-Gated Recurrent Neural Network (DGRNN).
Repeated experiments yield the following findings. (a) The deep learning models are sensitive to the number of traversals (the value of the hyperparameter epochs). The prediction effectiveness of the same model has a U-shaped relationship with the epochs value, and the deep learning models perform best in the experiments when epochs = 100 ± 10. The machine learning models are not sensitive to the hyperparameter epochs. In this paper, the number of traversals is 100 (epochs = 100). (b) Several representative values (1, 5, 20, 60, 120, 180, and 250) were selected to test the sensitivity of the hyperparameter look_back, which determines the number of trading days used to predict the credit spread of the next day. We find that the prediction effect of the same model follows a W-shaped pattern in the value of look_back. All models perform best when look_back = 5, indicating that the historical data of the previous five trading days already contain enough information. Too few trading days provide insufficient information, while too many bring extra noise. Therefore, we set the look_back parameter to 5 in the subsequent analysis. The prediction results with the parameter combination [epochs, look_back] set to [100, 5] are shown in Tables 3 and 4.
As can be seen from Table 3, all D.M. test results reject the null hypothesis, indicating that the predictive power of these models is significantly different. The depth-gated recurrent neural networks (DJANET/DGRU/DLSTM) are superior to the nongated deep recurrent neural network (DRNN), the traditional machine learning models (SVR/MLP/RF), and the linear prediction model (VAR) in credit spread prediction for the U.S. bond market in terms of accuracy and stability. Furthermore, the RSR value of the DGRNN model is less than 0.6, suggesting that the DGRNN model also performs well in absolute terms according to the classification standard for RSR values by Moriasi et al. [37]. In addition, the D.M. test results show that the null hypothesis is rejected at least at the 10% significance level, indicating that the predictive power of the DJANET model is significantly different from that of the other models. Among the deep learning models with the gated unit mechanism, the DJANET model with one gated unit performs better than the DGRU and DLSTM models, which have more gated units. Furthermore, the DGRU model with two gated units is better than the DLSTM model with three gated units.
Note: *, **, and *** indicate that D.M. statistics are significant at the 10%, 5%, and 1% levels, respectively; DJANET is the benchmark in the D.M. test; boldface represents the optimal value under different evaluation criteria; the following tables are the same.
We compared the performance of the DJANET, DGRU, DLSTM, and DRNN models at different lengths of time steps in Table 5, where SDAPE (standard deviation of mean absolute percentage error) is used to evaluate the prediction stability.
The smaller the SDAPE value is, the more stable the prediction. As we can see in Table 5, the values of DJANET are larger when the length of time steps is from 5 to 120. However, the value of DJANET is lower when the length of time steps is 180. DJANET has only a forget gate and chronologically initialized bias terms, so it is effective for short-term prediction but loses some long-term information. Credit spreads are cyclical, so when the ultralong term is 180 days, the results are similar to those in the short term. At the same time, DGRU and DLSTM behave differently. For example, the values of DGRU are lower when the length of time steps is from 20 to 120, and the values of DLSTM are lower when the length of time steps is from 20 to 180. DGRU is a variant of DLSTM, and their performance is equal in many tasks. DLSTM can learn the characteristics of long-term trend series, so its prediction efficiency improves as the time length increases. We find that, regardless of the length of time steps, the Depth-JANET model with one gated unit performs better than the Depth-GRU and Depth-LSTM models, which have more gated units. However, as the length of time steps increases, the prediction accuracy decreases. Table 6 reports the prediction performance of the depth-gated recurrent neural network with SAM. It can be seen from Table 6 that the four evaluation indicators of the SAM-DGRNN model are all smaller than those of the DGRNN model, and the D.M. test results also show that the null hypothesis is rejected at least at the 5% significance level, indicating that the depth-gated recurrent neural network with SAM performs better than the models without the mechanism. It also suggests that SAM can improve the performance of the depth-gated recurrent neural network in predicting credit spreads. Figure 7 further shows the fitting curves of the DGRNN models on the prediction set of U.S. credit spreads.
It can be seen that the SAM-DJANET curve fits best.

Effectiveness of SAM.
Li argued that simply treating statistical data as random without testing its determinism and randomness leads to large deviations between the predicted results and the actual values [38]. In other words, the prediction outcome and accuracy largely depend on a reasonable prediction model and on the randomness of the original data of the predicted variable. If the original data exhibit a certain logical pattern and are less random, adopting an appropriate prediction model will improve prediction performance and yield higher accuracy. Therefore, following the method of Tang et al. [39], this paper applies the Ljung-Box statistic to conduct a random independence test on the original credit spread data. The test results show that the Ljung-Box statistic is 3519.1 and the p-value is 0.00, so the null hypothesis is rejected at the 1% significance level.
The results indicate that the original data are not randomly independent, laying a statistical foundation for exploring the best fitting model. A proper model extrapolates well, so its prediction results on the test set should be very close to the actual values. Moreover, such a model extracts the information in the original data, so its residual sequence is a white noise sequence that satisfies random independence.
To further test the extrapolation ability of the model, we perform a Ljung-Box test on the residual sequence of the prediction set in the SAM-DJANET model. The Ljung-Box statistic is 3.07 and the p-value is 0.19, indicating that the null hypothesis cannot be rejected even at the 10% significance level. The results suggest that the residual sequence is a white noise sequence, confirming that the depth-gated recurrent neural network with self-attention mechanism (SAM-DGRNN) has good predictive extrapolation ability and rationality.
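The Ljung-Box statistic used in both tests above can be computed with the standard formula $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$, sketched below. The number of lags $h$ used in the paper is not stated, so `lags` is left as a parameter, and the function name is illustrative.

```python
def ljung_box_q(series, lags):
    """Ljung-Box Q statistic over the first `lags` autocorrelations.

    Under the null hypothesis of random independence, Q approximately
    follows a chi-square distribution with `lags` degrees of freedom.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q
```

A large Q (small p-value) indicates serial dependence, as found for the original credit spread data, while a small Q is consistent with white noise, as found for the SAM-DJANET residuals.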

Robustness Test.
This paper conducts a robustness test from the following two aspects: (1) setting the split between the training set and the prediction set to 3:1, with the results shown in Table 7; (2) shortening the sample interval to 2015-2018, with the results shown in Table 8. According to the evidence in Tables 7 and 8, the empirical results of this paper are robust.

Conclusions
Traditional prediction data and technologies are insufficient to forecast the credit spread accurately, particularly when using big data. Considering the nonlinear changes in credit spreads, this paper introduces a deep learning algorithm to build a depth-gated recurrent neural network with a self-attention mechanism. Additionally, it compares various prediction methods. We use multiple evaluation indicators and randomness tests on the original data of the predicted variable to conduct a comparative analysis. The conclusions are as follows.
First, intelligent algorithms such as machine learning and deep learning can capture nonlinear relationships better than linear algorithms. The credit spread prediction results indicate that deep learning models (LSTM, GRU, and JANET) and traditional machine learning models (SVR, MLP, and RF) are better than the VAR model. Second, deep learning models with gated unit mechanisms are extremely advantageous in mining the long-term dependence of sequence data. The results show that the deep learning models with gated unit mechanisms have better prediction accuracy and higher stability than those without gated units.
Third, when predicting credit spreads, JANET, the latest deep learning model that has only one gated unit, excels in prediction efficiency, accuracy, and stability compared with the earlier models that have more gated units, such as LSTM and GRU. The GRU model with two gated units is superior to the LSTM model with three gated units. Fourth, deep learning models with SAM can efficiently filter out the information critical to the current task. The prediction of credit spreads in the U.S. bond market shows that the deep learning model based on the attention mechanism has better prediction performance than that without the mechanism. In summary, by comparing each model's prediction results and robustness tests through statistical performance indicators, we confirm that the depth-gated recurrent neural network (DGRNN) is an effective prediction method for the U.S. bond credit spread. The SAM-DGRNN model can further improve prediction performance. Among the three gated recurrent neural network models, SAM-DJANET has the highest prediction accuracy, stability, and efficiency. The prediction results can provide a reference for the decision-making of market participants and regulatory authorities in the U.S. bond market.
However, we consider only some of the factors that may affect the credit spread, and there may be other influencing factors. Future work can therefore explore adding more relevant variables to improve the forecasting effect. It can also seek a more accurate model for specific types of credit spread according to maturity, rating, and industry. In addition, these principles and forecasting methods can be extended to related problems in financial time series.

Data Availability
The data used to support the findings of the study were obtained from https://fred.stlouisfed.org.

Conflicts of Interest
The authors declare that they have no conflicts of interest.