A Novel Method for Sea Surface Temperature Prediction Based on Deep Learning

Sea surface temperature (SST) forecasting is the task of predicting future values of an SST sequence from historical SST data, which is beneficial for observing and studying hydroclimatic variability. Most previous studies ignore spatial information in SST prediction, and existing forecasting models are limited in their ability to process large-scale SST data. This paper proposes a novel SST prediction model that integrates a deep Gated Recurrent Unit and a Convolutional Neural Network (DGCnetwork). The DGCnetwork has a compact structure and focuses on learning deep long-term dependencies in SST time series. Both temporal and spatial information are included in the procedure. The Differential Evolution algorithm is applied to configure the optimum architecture of the DGCnetwork. Optimum Interpolation Sea Surface Temperature (OISST) data, which has good temporal homogeneity and feature resolution, is used for the experiments. The experiments demonstrate that the DGCnetwork obtains excellent forecasting results, predicting SST over different lengths flexibly and accurately. On the East China Sea dataset and the Yellow Sea dataset, the accuracy of the prediction results is above 98% on the whole, and all mean absolute error (MAE) values are lower than 0.33°C. Compared with the other models, the root mean square error (RMSE), root mean square percentage error (RMSPE), and mean absolute percentage error (MAPE) of the proposed approach are reduced by at least 0.1154, 0.2594, and 0.3938, respectively. The experiments on SST time series show that the DGCnetwork model maintains good prediction results, better performance, and stronger stability, reaching an internationally leading level.


Introduction
Analyzing sea surface temperature (SST), an essential parameter for studying the marine ecosystem and global climate, can help us explore ocean conditions and understand climate dynamics. SST has long played a role in different fields of science, such as providing significant predictive information about hydroclimatic variability [1][2][3], supplying a basis for revealing the spatial distribution of biological environmental factors [4], and serving as an indicator for observing and monitoring marine disasters [5,6]. Because of large variations in heat flux, radiation, and diurnal wind near the sea surface, SST prediction has always been a highly uncertain problem.
In recent years, many methods have been developed for SST prediction. There are primarily two types of forecasting strategies: physical techniques and statistical techniques [7]. The former target the physical properties of the ocean, using a series of differential equations to describe the SST data. Statistical models, including linear regression [8], orthogonal functions [9], support vector machines (SVM) [10,11], and artificial neural networks (ANN) [7], are widely used time series-based approaches for SST prediction. These models are designed to predict SST time series by establishing a relationship between historical values and a predictor. Previous studies found that the SST prediction result is often unstable. Traditional methods also have disadvantages when processing large-scale SST data, such as slow speed, difficulty in fitting, high memory usage, and long computing time.
Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) [12,13] and Gated Recurrent Unit (GRU) [14][15][16], have been shown to achieve state-of-the-art results in many applications with time series or sequential data. RNNs have several useful properties, such as strong prediction performance and the ability to capture long-term temporal dependencies and variable-length observations. LSTM and GRU introduce a gate mechanism to overcome the vanishing and exploding gradient problems that traditional RNNs face when learning long-term dependencies. The GRU network has a simpler structure and trains faster than LSTM, and it performs well in sequence learning tasks [6,17]. Recently, SST prediction has progressed further with the advent of deep learning [15] and neural network methods. Zhang et al. [13] adopted LSTM to predict SST and obtained good prediction results. However, the existing studies have three problems. Firstly, a model with a single network layer is limited in how much information it can mine from a time series. Secondly, current studies do not consider the temporal and spatial characteristics of SST time series simultaneously. In other words, the isolated prediction of each point ignores the interaction between the SSTs of different points. Thirdly, previous approaches do not take into account an optimization strategy for the parameters of the prediction model.
In our work, an innovative approach is constructed for SST prediction: the Deep Gated Recurrent Unit and Convolutional Neural Network (DGCnetwork). The DGCnetwork model combines a deep GRU with a CNN. The deep GRU layers and the convolutional layer are used to extract the deep hidden temporal features and the spatial characteristics of SST data, respectively. We apply one fully connected layer to combine all features into global features and map the output of the previous layer to a final prediction. Increasing the depth of a neural network is an effective way to improve the overall performance [15]. Because the proposed model has a more compact representation than a single-layer network, it is expected to generalize and perform better when applied to the prediction of SST data. Besides, both temporal and spatial information are included in our procedure. Research shows that the SST of a specific point interacts with the SST of its surrounding points [4,18]. Therefore, when we predict the SST of a certain location, the proposed approach also uses the historical SST information of its nearby locations. The efficiency of the DGCnetwork depends on several hyperparameters, namely, the number of neurons in every layer and the number of epochs. Inappropriate network parameters slow down training and leave the network prone to becoming trapped in a nearby local minimum. Because the initial values of hyperparameters play a vital role in the training outputs of the neural network [19,20], we adopt the Differential Evolution (DE) algorithm to select optimal hyperparameters for the proposed model. DE, which has been widely applied [21,22], leverages the local information of individuals and the global information of the population to search for the optimal solution. The remainder of the paper is organized as follows. The procedures of the DGCnetwork predicting model are explained in detail in Section 2.
Section 3 provides the experimental results and discussions. Finally, Section 4 summarizes the conclusions.

Methodology
2.1. The DGCnetwork. To solve the task of SST time series prediction, this paper proposes the DGCnetwork model, a deep learning model built from a deep GRU and a CNN.
The DGCnetwork architecture, which includes multiple GRU layers, one CNN layer, and one fully connected layer, can adapt to the nonlinearity and complexity of SST time series data through learning. After the prediction point is selected, we express the SST time series of the prediction point and its nearest points in matrix form as the input to the model. In the model, each GRU layer operates at a different time scale and the CNN layer captures spatial features. The fully connected layer combines all features into global features and maps the output of the previous layer to a final prediction. Each layer processes a certain part of the prediction task. The output of the previous GRU layer is the input of the next GRU layer. The output of the last GRU layer is the input of the CNN layer, and the fully connected layer finally generates the prediction result. As such, the model is an end-to-end prediction network. Stacking GRU layers adds, to the recurrent connections between the units within a layer, feed-forward connections between the units of one GRU layer and the GRU layer above, which is helpful for modeling large-scale SST time series. This improves learning by allowing more sophisticated conditional distributions of SST time series data to be represented. It also enables hierarchical processing of difficult temporal tasks and captures the deep features of data sequences more naturally. The hyperparameters in the network layers are chosen by the DE algorithm.
As shown in Figure 1, the DGCnetwork architecture has three GRU layers, one CNN layer, and one fully connected layer. We define the SST time series as $X = (x_1, x_2, \ldots, x_t, \ldots, x_n)$, where $x_t$ represents the SST value at time $t$ and $n$ is the length of the SST time series. Multiple time series constitute the input matrix $M = (X_1, X_2, \ldots, X_9)$, where $X_1$ is the predicting point and $X_2, X_3, \ldots, X_9$ are the eight surrounding points. In the DGCnetwork architecture, the input at time $t$, $M_t$, is introduced to the first GRU layer along with the previous hidden state $h_{t-1}^{(1)}$, where the superscript (1) denotes the first GRU layer. The hidden state at time $t$, $h_t^{(1)}$, is computed as shown in Section 2.2. $h_t^{(1)}$ goes forward to time $t + 1$ and also moves forward to the second GRU layer. $h_t^{(2)}$ in GRU layer 2 is computed from $h_t^{(1)}$ and $h_{t-1}^{(2)}$; it goes forward to time $t + 1$ and also moves forward to the third GRU layer in the same way. The output of the third GRU layer is the input of the CNN layer, whose feature maps $c_i^{l}$ and $c_j^{l}$ are computed as shown in Section 2.3. The output of the CNN layer is the input of the fully connected layer. Finally, the predicted value $y_t$ is obtained by the fully connected layer.
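The construction of the input matrix from a gridded SST field can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `neighborhood_matrix`, the array layout (days, latitude, longitude), the ordering of the eight neighbors, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def neighborhood_matrix(sst, row, col):
    """Stack the series of one grid point and its 8 neighbors.

    sst: array of shape (n_days, n_lat, n_lon).
    Returns M of shape (n_days, 9). Column 0 is the predicting point (X1);
    columns 1..8 are the surrounding points (X2..X9) in an assumed order.
    """
    n_lat, n_lon = sst.shape[1:]
    if not (1 <= row < n_lat - 1 and 1 <= col < n_lon - 1):
        raise ValueError("the point needs all 8 neighbors inside the grid")
    # Centre first, then the 8 surrounding offsets.
    offsets = [(0, 0)] + [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
    return np.stack([sst[:, row + dr, col + dc] for dr, dc in offsets], axis=1)

# Synthetic stand-in for a daily SST grid (not real OISST values).
rng = np.random.default_rng(0)
sst = 20.0 + rng.normal(size=(30, 5, 5))
M = neighborhood_matrix(sst, row=2, col=2)
print(M.shape)  # (30, 9)
```

Each row of `M` then plays the role of the input $M_t$ at one time step.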
Our proposed DGCnetwork model has three advantages. Firstly, each layer processes some part of the prediction task: the GRU layers pass their result on to the CNN layer, until finally the last fully connected layer provides the predicted SST value. Secondly, the hidden state at each level of the model is allowed to operate at a different time scale, which helps mine the deep spatial-temporal features of the data. Thirdly, the optimal hyperparameters of the model are selected directly by the DE algorithm. These three advantages are of great benefit when handling the prediction of large-scale SST time series data.

2.2. Temporal Feature Extraction by Gated Recurrent Unit.
This paper adopts GRU to capture the temporal relationship among SST time series data. GRU was first proposed by Bahdanau et al. [16]; it is more accurate than conventional RNNs and simpler than LSTM. In the topological structure of GRU, the forget gate and the input gate are integrated into an update gate. GRU merges the cell state with the hidden state, and the information flow inside it is modulated by the reset gate and the update gate. As illustrated in Figure 2, $r_t$ and $z_t$ are the reset gate and update gate, respectively, and $h_t$ and $\tilde{h}_t$ represent the activity value and the candidate activity value, respectively. The gate mechanism extracts the temporal relationship among time series data. The reset gate $r_t$ controls the influence of the information in the last hidden state $h_{t-1}$ on the current information $x_t$, which determines how much past information is forgotten. If the value of $r_t$ approaches 0, the information of the previous hidden state is discarded. The update gate $z_t$ controls the importance of the past hidden state $h_{t-1}$ for the present state $h_t$. If the value of $z_t$ is always approximately 1, the information of $h_{t-1}$ is always saved through time and passed to $h_t$. This allows the gradient to propagate backward effectively, alleviating the vanishing gradient problem of RNNs. The whole computation can be defined by a series of equations as follows:

$$
\begin{aligned}
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right), \\
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \\
\tilde{h}_t &= \tanh\left(W_h \cdot [r_t * h_{t-1}, x_t]\right), \\
h_t &= z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t,
\end{aligned} \tag{1}
$$

where $\sigma$ denotes the sigmoid function; $W_r$, $W_z$, and $W_h$ are the recurrent weight matrices; $[\,]$ denotes the concatenation of two vectors; and $*$ is element-wise multiplication. The feature values must be fed in chronological order when GRU networks process the SST time series. Both the sigmoid function $\sigma$ and the hyperbolic tangent function tanh are adopted as activation functions in the structure. During the training process, the loss of the objective function on the training set is minimized.
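The gate mechanism described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: bias terms are omitted because the text only mentions weight matrices, the update rule is written so that $z_t$ close to 1 preserves $h_{t-1}$ (matching the description of the update gate), and the layer sizes and random weights are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU update. Each W has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)   # reset gate
    z_t = sigmoid(W_z @ concat)   # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    # z_t near 1 keeps the previous state; z_t near 0 takes the candidate.
    return z_t * h_prev + (1.0 - z_t) * h_cand

def gru_layer(X, W_r, W_z, W_h):
    """Run a GRU over X of shape (T, input); return all states (T, hidden)."""
    h = np.zeros(W_r.shape[0])
    states = []
    for x_t in X:                 # inputs are consumed in chronological order
        h = gru_step(x_t, h, W_r, W_z, W_h)
        states.append(h)
    return np.stack(states)

# Three stacked layers: the states of one layer are the inputs of the next.
rng = np.random.default_rng(0)
T, n_points, hidden = 7, 9, 16    # illustrative sizes only
out = rng.normal(size=(T, n_points))
for in_dim in (n_points, hidden, hidden):
    W_r, W_z, W_h = (rng.normal(scale=0.1, size=(hidden, hidden + in_dim))
                     for _ in range(3))
    out = gru_layer(out, W_r, W_z, W_h)
print(out.shape)  # (7, 16)
```

Because each new state is a convex combination of the previous state and a tanh output, every hidden value stays inside (-1, 1) when the initial state is zero.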

2.3. Spatial Feature Extraction by Convolutional Neural Network. CNN is a special structure of ANN that is able to deal with high-dimensional data. It is generally utilized in image recognition, recommender systems, and natural language processing [23]. Since the SSTs of adjacent positions interact, this paper combines the historical SST information of the prediction point and its surrounding points to forecast the target point. In the proposed model, we apply the CNN layer as a module to mine the spatial information of SST time series (Figure 3). After the matrix $M$ is processed by the GRU layers, the resulting matrix $M'$ is input into the CNN layer. To begin with, multiple two-dimensional matrices at different time periods are stacked into three-dimensional matrix blocks. Then, spatial features are extracted by sliding the convolution kernels over the input. Afterwards, the outputs of the convolution operation are passed to the pooling process. The role of the pooling layer is to lower the computational burden and improve operation efficiency by compressing the feature map. Finally, the abstract feature set is flattened to a one-dimensional vector and connected to the fully connected layer. CNN has the advantages of local perception, sparse interactions, and parameter sharing. Its weight-sharing network structure makes it more similar to a biological neural network, and it has achieved good results in time series research [24]. The output of the CNN layer can be written as follows:

$$
c_j^{l} = f\left(\sum_{i \in Z_j} c_i^{l-1} * k_{ij}^{l} + b_j^{l}\right), \tag{2}
$$

where $Z_j$ is the collection of input maps, $c_i^{l-1}$ is input map $i$, $k_{ij}^{l}$ is the kernel connecting input map $i$ to output map $j$, $f$ is the activation function, and $*$ here denotes convolution. Each output map is given an additive bias $b$; however, for a particular output map, the input maps are convolved with distinct kernels. That is, when output maps $j$ and $k$ both sum over input map $i$, the kernels applied to map $i$ are different for output maps $j$ and $k$.
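The convolution, pooling, and flattening steps can be sketched in NumPy as below. This is a minimal sketch under assumptions: the kernel size, the ReLU activation, the 2 x 2 max pooling, and the input sizes are illustrative choices that the text does not specify, and the "convolution" is implemented as cross-correlation, as is common in deep learning libraries.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Slide one kernel over one 2D map (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def conv_layer(inputs, kernels, biases):
    """inputs: (n_in, H, W); kernels: (n_in, n_out, kh, kw); biases: (n_out,).

    Output map j sums the input maps, each convolved with its own kernel,
    then adds one bias per output map and applies the activation.
    """
    n_in, n_out = kernels.shape[:2]
    maps = []
    for j in range(n_out):
        total = sum(conv2d_valid(inputs[i], kernels[i, j]) for i in range(n_in))
        maps.append(np.maximum(total + biases[j], 0.0))  # ReLU: an assumption
    return np.stack(maps)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns are dropped."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(2, 7, 9))        # two stacked input maps
kernels = rng.normal(size=(2, 3, 3, 3))    # 2 inputs -> 3 outputs, 3x3 kernels
maps = conv_layer(inputs, kernels, np.zeros(3))
pooled = np.stack([max_pool2d(m) for m in maps])
features = pooled.reshape(-1)              # flattened vector for the dense layer
print(maps.shape, pooled.shape, features.shape)  # (3, 5, 7) (3, 2, 3) (18,)
```

Pooling shrinks each 5 x 7 map to 2 x 3, which is the compression of the feature map that reduces the computational burden of the following layer.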

2.4. Optimization of Network Parameters by Differential Evolution Algorithm.
There are some decision parameters to be optimized for the DGCnetwork's training. This paper applies the DE algorithm to select the optimal value of each hyperparameter in the predicting model, including the number of neurons in the GRU layers and the number of epochs. This optimization strategy makes it convenient to find the model structure that minimizes the difference between the predicted and actual values. The DE algorithm is a simple, population-based, direct-search algorithm for optimizing multimodal functions [25]. DE is reliable due to its ability to reach global optimum values and its rapid convergence with few control parameters. Previous research states that DE outperforms several other well-known optimization algorithms in terms of convergence speed and stability [26]. The standard DE consists of four main operations: initialization, mutation, crossover, and selection. Together, these four operations evolve the population toward higher fitness in order to reach the optimal solution.
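The four DE operations can be sketched as follows, using the common DE/rand/1/bin variant. This is a sketch under assumptions, not the configuration used in the paper: the variant, the control parameters `F` and `CR`, the population size, the search bounds, and the toy fitness function `toy_validation_error` (a stand-in for the validation error of a trained network) are all hypothetical.

```python
import numpy as np

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=50, seed=0):
    """Minimize `fitness` over box `bounds` with DE/rand/1/bin."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    dim = len(bounds)
    # 1. Initialization: uniform random population inside the bounds.
    pop = low + rng.random((pop_size, dim)) * (high - low)
    scores = np.array([fitness(ind) for ind in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # 2. Mutation: combine three distinct individuals other than i.
            others = [k for k in range(pop_size) if k != i]
            a, b, c = pop[rng.choice(others, size=3, replace=False)]
            mutant = np.clip(a + F * (b - c), low, high)
            # 3. Crossover: mix mutant and parent; keep at least one mutant gene.
            mask = rng.random(dim) < CR
            mask[rng.integers(dim)] = True
            trial = np.where(mask, mutant, pop[i])
            # 4. Selection: the trial replaces the parent if it is not worse.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])

def toy_validation_error(params):
    """Hypothetical error surface with its minimum at 15 neurons, 100 epochs."""
    neurons, epochs = params
    return (neurons - 15.0) ** 2 / 100.0 + (epochs - 100.0) ** 2 / 10000.0

best_params, best_score = differential_evolution(
    toy_validation_error, bounds=[(5, 50), (10, 200)])
print(np.round(best_params), best_score)
```

In a real search, each fitness evaluation would train the network with the candidate hyperparameters (rounded to integers) and return its validation error, which makes the population size and the number of generations the main cost drivers.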

Data and Software.
The data used in our research is the Optimum Interpolation Sea Surface Temperature (OISST), an optimally interpolated SST product from the National Oceanic and Atmospheric Administration (NOAA). Because OISST has good temporal homogeneity and feature resolution [27], it is applied to the analysis and prediction of time series in our work; studying the OISST data is also helpful for research on oceanic features. The data used in the paper is global grid data with a spatial resolution of 1° × 1° and a daily time resolution. We choose the East China Sea and the Yellow Sea as the experimental regions (see Figure 4). This paper creates two SST datasets, the East China Sea dataset and the Yellow Sea dataset. Six points are randomly selected on the two datasets. The time span is from January 1, 2001, to July 15, 2017 (6,040 days).
SST data preprocessing and handling are conducted in Python 3.6, relying on the packages NumPy and pandas. The deep learning GRU and CNN networks are implemented with Keras, a high-level neural networks API written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

Evaluation Standard.
In this study, five different indexes are used to evaluate the forecasting precision, error, and performance of the prediction task [28].
Root mean squared error (RMSE):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pre}}\right)^2}, \tag{3}
$$

where $y_i^{\mathrm{obs}}$ and $y_i^{\mathrm{pre}}$ represent the true value and its predicted value, respectively. The degree of freedom in RMSE is $N - L + 1 - i$, where $N$ is the number of samples, $L$ is the length of observations, and $i$ (here distinct from the sample index) represents the number of independent variables. In this paper, $i = 2$. A smaller RMSE together with a larger degree of freedom indicates that the model is more effective and more general [29,30]. An important property of RMSPE, MAPE, and MAE is that values closer to 0 imply higher accuracy of the predicting model. The range of ACC is [0, 1], and a value closer to 1 corresponds to better performance of the forecasting model. It has been widely demonstrated in the previous literature that these five measures are appropriate tools to assess the performance of a forecasting model [31].
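The error measures can be sketched with their standard textbook definitions. This is a hedged sketch: expressing RMSPE and MAPE as percentages (multiplying by 100) is an assumption, and ACC is left out because its exact formula is not given in the text above. The sample values are invented for illustration.

```python
import numpy as np

def rmse(y_obs, y_pre):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_obs - y_pre) ** 2)))

def mae(y_obs, y_pre):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_obs - y_pre)))

def mape(y_obs, y_pre):
    """Mean absolute percentage error, in percent (assumed scaling)."""
    return float(100.0 * np.mean(np.abs((y_obs - y_pre) / y_obs)))

def rmspe(y_obs, y_pre):
    """Root mean squared percentage error, in percent (assumed scaling)."""
    return float(100.0 * np.sqrt(np.mean(((y_obs - y_pre) / y_obs) ** 2)))

# Invented example values in degrees Celsius.
y_obs = np.array([20.0, 25.0, 10.0])
y_pre = np.array([21.0, 24.0, 10.0])
print(rmse(y_obs, y_pre), mae(y_obs, y_pre),
      mape(y_obs, y_pre), rmspe(y_obs, y_pre))
```

The percentage-based measures divide by the observed value, so they are only well defined when no observation is zero, which holds for the SST ranges of the two study regions when expressed in °C.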

Results and Analysis.
There are some important settings in the DGCnetwork model to be determined beforehand. Firstly, we use early stopping as a further mechanism to prevent overfitting, with the maximum early stopping duration set to 15. Secondly, the data is split into training, testing, and validation sets in the ratio 3 : 1 : 1. The training set is used for training, and the reported results are obtained on the validation set. Furthermore, we set the batch size to 40 in the experiments.
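The split and the early stopping rule can be sketched as below. Two points are assumptions rather than statements from the text: the split is taken to be chronological (earliest samples for training), and the early stopping duration of 15 is interpreted as a patience of 15 epochs without improvement of the monitored loss. The loss values are invented.

```python
import numpy as np

def chronological_split(n_samples, ratios=(3, 1, 1)):
    """Split sample indices into three consecutive blocks by `ratios`."""
    total = sum(ratios)
    first = n_samples * ratios[0] // total
    second = n_samples * (ratios[0] + ratios[1]) // total
    idx = np.arange(n_samples)
    return idx[:first], idx[first:second], idx[second:]

def early_stopping_epoch(losses, patience=15):
    """Return the 0-based epoch at which training would stop."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch
    return len(losses) - 1            # never triggered: ran to the end

train_idx, test_idx, val_idx = chronological_split(6040)
print(len(train_idx), len(test_idx), len(val_idx))  # 3624 1208 1208

losses = [1.0, 0.9, 0.8] + [0.85] * 20   # invented: best value at epoch 2
print(early_stopping_epoch(losses))      # 17
```

With 6,040 daily samples, the 3 : 1 : 1 ratio gives blocks of 3,624, 1,208, and 1,208 days.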
We now present the quantitative and visual results of the proposed DGCnetwork. The results shown in all tables and figures indicate the performance of the model on the validation set. This follows the widely accepted principle that a genuine evaluation of forecasting performance should be based on unseen data, not on the historical (training and testing) data that the model has already seen [31].
In the experiment, we use different lengths of historical observations to predict the future SST value. The length of historical observations is denoted as H. In general, if H is too small, there may not be sufficient sequence information to predict future SST values. Conversely, as H increases, there may be more noise in the training samples [10,32]. With the length of historical observations ranging from 1 day to 60 days, we apply the DGCnetwork to predict the SST one day ahead. Figures 5(a) and 6(a) show the forecasting accuracy on the East China Sea dataset and the Yellow Sea dataset, respectively. The accuracy at the six points on the two datasets is above 98% for all values of H. The experiments show that the length of historical observations has little effect on the prediction accuracy when the predicting length is one day. Then, this paper adopts the DGCnetwork to predict the SST one week ahead with the length of historical observations ranging from 7 days to 60 days (as shown in Figures 5(b) and 6(b)). That is to say, SST data from the past H days are used to forecast the value for the seventh day in the future. Because of the problem of insufficient information, our experiment does not cover the case where H is less than the predicting length. It is worth mentioning that, as H increases, the forecasting accuracy tends to rise. This could be attributed to the fact that more sequence information is needed when we predict further ahead. Overall, whether forecasting the SST value of the first day or the seventh day in the future with different H, the predictions on the two datasets achieve satisfactory accuracy (98%∼99%). Moreover, it is interesting that the accuracy of p1 is better than that of p2 and p3 on the East China Sea dataset. It is well known that temperature changes in the open sea are relatively stable, while the fluctuations in coastal water temperature are greater. By observing the location of the three points on the map, we can see that p1 is farther away from the coast than p2 and p3. This suggests that the temperature changes at p1 are relatively stable; therefore, the forecast performance at p1 is better than at p2 and p3. On the Yellow Sea dataset, we obtain the same finding. The forecast accuracy of p5 is better than that of p4 and p6, which are near the land.
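How the history length H and the predicting length define the training samples can be sketched as follows. The function name `make_windows`, the array layout, and the convention that the target is the value at the predicting point (column 0) `horizon` days after the last day of the window are assumptions made for illustration.

```python
import numpy as np

def make_windows(M, H, horizon):
    """Build (history, target) pairs from M of shape (n_days, n_points).

    X[s] holds days s .. s+H-1 for all points; y[s] is the value at the
    predicting point (column 0) on day s+H+horizon-1, that is, `horizon`
    days after the last day of the window.
    """
    if H < 1 or horizon < 1:
        raise ValueError("H and horizon must be positive")
    n_samples = M.shape[0] - H - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this H and horizon")
    X = np.stack([M[s:s + H] for s in range(n_samples)])
    first_target = H + horizon - 1
    y = M[first_target:first_target + n_samples, 0]
    return X, y

# Toy matrix: 20 days, 9 points; entry (d, p) equals 9*d + p.
M = np.arange(20 * 9, dtype=float).reshape(20, 9)
X, y = make_windows(M, H=7, horizon=7)   # one week of history, seventh day ahead
print(X.shape, y.shape)  # (7, 7, 9) (7,)
```

Increasing H or the horizon reduces the number of available samples by the same amount, which is one practical cost of using longer histories.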
Since the DGCnetwork contains a DE algorithm module, the value of each hyperparameter is selected optimally. This paper analyzes the structure of the best model and the prediction results for different predicting lengths. On the two SST datasets, the optimal model is used to forecast SST with 30 days of historical observations as an example, and the predicting length is set to 3 days, 5 days, 1 week, 2 weeks, and 1 month, respectively. The DE algorithm in our predicting model makes it convenient to adjust the deep network to its optimal state when the prediction range changes, avoiding manual parameter tuning. Table 1 lists the predicting results of p1 and p4 on the two SST datasets. The number of neurons in the hidden layers clusters between 10 and 20 and grows as the predicting length increases. The number of neurons in the neural network determines the number of features it can pass on, and too few neurons can cause part of the data to be lost. The numbers of epochs in the optimal models are clustered around 100. The forecast results clearly improve as the predicting length decreases; for example, ACC is near 0.99 on both SST datasets when the SST three days ahead is forecast. The error of the model remains small when we forecast the SST one month ahead (RMSE is 0.6729 on the East China Sea dataset and 0.5681 on the Yellow Sea dataset). The experimental process also indicates that the DGCnetwork optimized by DE may be a good choice for SST time series forecasting. This paper adopts the GRU network for a comparative analysis of the prediction errors against the proposed method. Figures 7 and 8 depict the prediction results of the two methods when the length of historical observations is 7 days and the predicting length is 1 day. On both datasets, the prediction results at the six points reflect the same problem.
The prediction errors obtained by GRU are larger near the maximum SST value. However, the DGCnetwork model always maintains small prediction errors, and its prediction results are very close to the true SST value. After examining previous SST prediction studies [13,32,33], we find that, in [13], the SST predicting results also have larger errors near the maximum SST value. So far, however, there has been little discussion about the reason for this phenomenon. This paper analyzes the issue from two aspects: data and method. First of all, SST time series present an obvious periodic tendency.
That is to say, SST generally reaches its maximum in summer each year. Some studies have shown that, over the last two decades, SST has been warming in the coastal areas of China and the intensity of extreme high temperatures has been significantly enhanced, especially in spring and summer [18,34]. Secondly, a shallow architecture, i.e., a single-layer neural network, cannot efficiently represent the complex features of time series data, particularly when processing highly nonlinear and long-interval time series datasets [35,36]. On the whole, it is difficult for the single-layer GRU network to capture the trend of SST data in summer. The proposed method in our research has higher prediction accuracy because it uses a deep network structure, which can [...] [13], GRU-SVM [14], WNN [37], and CEEMDAN-LSTM [12].
The results of the experiment on the two datasets, which predict the SST value one day ahead with a historical observation length of one week (7 days), are shown in Tables 2 and 3. The evaluation indicators show that the method in this paper is more effective than traditional methods and other existing predicting models. The DGCnetwork model has the advantages of higher forecasting precision, better performance, and stronger stability.

Conclusions
In this study, we propose the DGCnetwork, a network based on a deep GRU and a CNN, to model the spatiotemporal relationship of SST and predict its future value. The DE algorithm is adopted to select optimal hyperparameters for the model. The contributions of this paper are fourfold.
(1) The DGCnetwork has a compact structure and focuses on learning deep long-term dependencies in SST time series. Each layer in the DGCnetwork model processes a part of the prediction task.
(2) Apart from temporal information, spatial information is incorporated in our work to forecast the SST data.
(3) We randomly select points on the East China Sea and the Yellow Sea datasets for the experiments. The results show that the DGCnetwork overcomes the disadvantage of the GRU network, which has larger prediction errors near the maximum SST value. We have conducted comprehensive experiments and compared the model with leading time series predicting models. The experiments demonstrate that the DGCnetwork model achieves state-of-the-art performance and outperforms many existing predicting models.
(4) The model can be applied to other time series data.
Finally, our future studies will also analyze other types of SST data, such as Group for High Resolution Sea Surface Temperature (GHRSST) data.
Data Availability
The data used in our research is an open dataset, the Optimum Interpolation Sea Surface Temperature (OISST), from the National Oceanic and Atmospheric Administration (NOAA).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.