Decomposition-Based Multistep Sea Wind Speed Forecasting Using Stacked Gated Recurrent Unit Improved by Residual Connections

Sea wind speed forecast is important for meteorological navigation system to keep ships in safe areas. .e high volatility and uncertainty of wind make it difficult to accurately forecast multistep wind speed..is paper proposes a new decomposition-based model to forecast hourly sea wind speeds. Because mode mixing affects the accuracy of the empirical mode decomposition(EMD-) based models, this model uses the variational mode decomposition (VMD) to alleviate this problem. To improve the accuracy of predicting subseries with high nonlinearity, this model uses stacked gate recurrent units (GRU) networks. To alleviate the degradation effect of stacked GRU, this model modifies them by adding residual connections to the deep layers. .is model decomposes the nonlinear wind speed data into four subseries with different frequencies adaptively. Each stacked GRU predictor has four layers and the residual connections are added to the last two layers. .e predictors have 24 inputs and 3 outputs, and the forecast is an ensemble of five predictors’ outputs..e proposedmodel can predict wind speed in the next 3 hours according to the past 24 hours’ wind speed data. .e experiment results on three different sea areas show that the performance of this model surpasses those of a state-of-the-art model, several benchmarks, and decomposition-based models.


Introduction
Sea wind always threatens the safe navigation of ships. According to the Marine Casualties and Incidents Reports published by the International Maritime Organization (IMO), there were 1561 well-documented ship accidents in the first 10 years of the 21st century, of which 755 were caused by strong winds and large waves caused by strong winds, and accidents caused by strong winds accounted for 48.3% of total accidents. In addition, when the wind wave and swell appear at the same time, the danger of navigation will be greatly increased [1]. erefore, the accurate wind speed forecasting is of great significance for routes optimization and navigation risk management.
Wind speed forecasts are divided into 4 categories, super-short-term [2,3], short-term [4][5][6][7], medium-term [8,9], and long-term [10], ranging from a few seconds to 30 minutes, from 30 minutes to 6 hours, from 6 hours to 24 hours, and from 24 hours to a week or more, respectively [11]. According to the principle of the wind speed forecasts models, they are classified into physical models and statistical models. e second class includes time series models and machine learning models [11,12]. e physical methods, like the numerical weather prediction (NWP), construct differential equations about physical factors such as wind speed, wind direction, air temperature, and pressure. Solving meteorological equations requires a large number of computing resources and time. e result belongs to longterm forecasting of a large area. e time series methods model the relationship between current wind speeds and historical wind speeds, which are suitable for short-term and medium-term forecast. Most time series methods, such as the autoregressive integrated moving average (ARIMA) [2,13] and autoregressive moving average with exogenous variables (ARMAX) [14], assume that there is a linear relationship between current data and past data or errors. e construction, order identification of these models is easy to understand, but their linear assumptions lead to poor forecast performance on nonlinear data. Machine learning methods are suitable for short-term or super-short-term forecasting. ey take each time point of past series as an input feature and that of predicted series as an output feature and construct nonlinear relationship. Many complex machine learning models, such as the long short-term memory (LSTM) [15] and gated recurrent unit (GRU) [4,16], are able to learn temporal correlation and often outperform time series models. Among them, GRU not only alleviates the risk of gradient explosion and vanishing but also is faster than LSTM.
Decomposition-based methods have attracted much attention recently. ese methods decompose the original wind speed into several subseries and use a group of same or different individual prediction models to learn each subseries [17]. Usually, time series or machine learning models are selected as predictor. e decomposition-based methods reduce the complexity of original data and make the predictors easier to learn. In addition, multimodel ensemble decreases the risk of getting stuck in local optima in the training process [18]. erefore, decomposition-based forecasting is more accurate than direct forecasting via individual model. In the field of wind speed forecasting, wavelet transform (WT) methods and empirical mode decomposition (EMD) methods are the most used algorithms [19]. Usually, the repeated WT and ARIMA are used to predict super-short-term wind speed of a 10 min scale [20]. Because LSTM is more effective than ARIMA in nonlinear system, the WT and LSTM are combined to predict hourly wind speed, and feature selection based on mutual information is executed between decomposition and predictors [6]. e characteristics of linear and nonlinear are different; then the ARIMA and multilayer perceptron (MLP) are used to predict linear and nonlinear subseries which are classified based on the EMD [21]. ere are also some decompositionbased methods that are used to deal with nonlinearity. Moving average (MA) filter [22] and ARIMA filter [23] are used to separate linear components, and MLP is used to predict the nonlinear parts [24]. Besides the linear and nonlinear predictors, a predictor is used to predict the residual of EMD, since it includes some information [25]. For short-term forecast, a permutation entropy (PE) method is used to predict a 3-step hourly forecast. e subseries is reorganized into several new series according to their PE values. Because it is difficult to capture the nonlinear features, this method uses MLP to predict each component [26].
According to above references, the decomposition-based methods have many advantages. However, they have some problems that have not been widely solved. Firstly, the decomposition algorithms have some defects. Although the WT and EMD are used in wind speed forecast, there are some defects that decrease the forecast accuracy. It is difficult to use WT to analyse local low-frequency changes [27] and the decomposition behaviour depends on wavelet basis functions [15]; different wavelet basis functions bring different decomposition results. e EMD decomposes a time series into subseries with different frequency domain bandwidths and the frequency bands have no overlap ideally. When there is a frequency band overlap in the subseries, multiple modes are mixed, and it is not suitable for further processing [28]. In [29], the ensemble empirical mode decomposition was used to predict short-term wind speed. In [30], the complementary ensemble empirical mode decomposition was used to alleviate mode mixing. ese two methods add multiple white noises to the original data and then integrate the results of multiple EMDs. Variational mode decomposition (VMD) is proposed to solve the mode mixing and does not depend on fixed basis. Different from the EMD-based algorithms, it avoids mode mixing as much as possible by solving specific intrinsic mode functions (IMFs) [31]. In recent studies, the VMD, MLP, and autoregressive moving average (ARMA) are used to predict wind speed with 10-minute interval [32] and 30-minute interval [19].
Secondly, the subseries' predictors can be improved by stacking. Although the decomposed subseries are simplified in frequency, they still have relatively high nonlinearity. Many prediction studies used the support vector machine regression (SVR) and MLP as nonlinear predictors [21,33]. e neural networks are good at nonlinear modelling, so complex neural networks, such as LSTM [6,34] and GRU [4], are helpful to improve the accuracy of forecast. A hybrid predictor that includes the VMD and a single-layer GRU is used to predict the wind power interval [3]. Ideally, stacking more models will significantly improve the ability of nonlinear modelling. However, the actual performance of a stacked network often becomes worse when there are more layers. It is difficult to train deeper layers to fit an identity mapping and it leads to the degradation of stacked models. e residual connections solve this degradation phenomenon by building linear paths between deep layers [35]. e stacked LSTM with residual connections shows superior accuracy in machine translation and sentiment intensity prediction [36,37], but this improvement has not been applied to wind speed forecasting. In [36], two 8-layer LSTMs are added with residual connections every 2 layers. In [37], an 8-layer LSTM is added with residual connections every 1 and 2 layers, respectively, and two types have their own advantages. In wind speed forecasting field, these two types remain to be verified by experiments.
is paper proposes a VMD-Stacked GRU model with residual connections to forecast the short-term global sea wind speed with multiple steps. e decomposition and predictor are designed based on analysis and experiments. Original wind speed data is complex and the VMD is used to decompose the wind speed data; it makes an adaptive decomposition that overcomes the defect caused by mode mixing in EMD-based models. A modified GRU is used as subseries predictor to improve its nonlinear modelling ability. e performances of the stacked GRU are improved by adding the residual connections between the last two layers. is improvement by adding residual connection is very novel in the field of wind speed forecasting. In addition, a lot of experiments are carried out on the European Reanalysis (ERA5) dataset. It has been proved to surpass the 2 Complexity performances of several benchmark and baseline models. Different from most studies based on wind farm observation data, it supports the study of sea wind speeds. e experiment results show that the performance of the proposed model is better than those of some benchmark and baseline models. is paper is organized as follows. Section 2 describes three methods involved in the proposed model. Section 3 details the proposed model's architecture and evaluation criteria. Section 4 details the experiments and analysis that are used to obtain the best forecast performances. Section 5 provides discussion and Section 6 summarises the conclusions. An acronyms list is shown in Table 1.

Variational Mode Decomposition.
Wind speed series are nonlinear nonstationary signals which contain a variety of period characteristics. For example, the Fourier transform for hourly wind speed shows that it is not a 24-hour period, but there are many significant periods. It means that multiperiod wind speed cannot be represented by an instantaneous frequency. When forecasting wind speed directly, the complex periodicity will be disadvantageous to model learning. To understand the signals with complex periodic patterns, an effective strategy is to use IMFs, which are ideal functions with fixed instantaneous frequency. Since there is no complex periodicity, it is relatively easy to predict the IMFs.
To extract IMFs from the original series, the EMD adopts a completely different iteration method to deal with the original data adaptively [27]. But, in practice, there are some imperfections such as overshoots, undershoots, asymmetric wave forms, and ends swing in the results of EMD, which make them not the ideal IMFs [38]. In order to alleviate the above problems, the VMD is proposed to calculate the IMFs more accurately. By constructing and solving a constrained variational problem, the VMD obtains all modal components nonrecursively and improves the decomposition robustness to noise. Under the constraint that the summation over all modes is equal to the original signal, the sum of the all estimated bandwidths of modes is minimized, and the following optimization problem is constructed [31]: where u k and w k are the k-th modal component and the center frequency after decomposition, respectively.
f is the original time series, and K is its mode decomposition number. D t is the estimated bandwidth of each modal component: where 2 2 is the squared L 2 -norm of the gradient and * represents the convolution operation. z t is the partial derivative operation, and δ(t) is the Dirac distribution.
By using quadratic penalty factor α and Lagrangian multipliers λ, the lowest point of this variational constraint problem is transformed into saddle point of augmented Lagrange equation defined as follows. e augmented Lagrange equation is shown as follows [31]. e equation can be iteratively calculated by the Alternating Direction Multiplier Algorithm.

Stacked Gate Recurrent
Unit. e conventional machine learning methods deal with time series problem; each moment of a sample is regarded as a different independent random variable, and it is given into the regression model or neural network for training. However, these models assume that the data at different moments are independent of each other, and their sequence in time is not considered. e recurrent neural network (RNN) is proposed to capture this temporal correlation by using the machine learning. e GRU is a modified RNN based on the LSTM. When error signals propagate backwards through time in the conventional RNN, the signals tend to vanish or blow up, and both of the cases lead to the failure of the network to learn from data [28]. e GRU not only retains the ability to prevent the previously mentioned problems but also reduces the complexity of the structure without losing the efficient learning ability [39]. e structure of the GRU at each step is the GRU cell, which is shown in Figure 1. In this figure, the reset gate and the update gate are fully connected layers with sigmoid activation, which are used to control the memory. e previous hidden state preserves the past memory, the reset gate controls how to combine the input with the past memory to become a candidate hidden state, and the update gate controls how to add the candidate hidden state into the hidden state [39]. Finally, the candidate hidden state, previous hidden state, and output of the update gate constitute the current hidden state and output. e GRU cell can be expressed as follows: where h t−1 is the hidden state at t − 1 and x t , z t , r t , h t , and h t are the input of the GRU cell, output of the update gate, output of the reset gate, candidate hidden state, and hidden state at t, respectively. W and U are the weight matrices of the fully connected layer, and b is the bias vector. σ and tanh are sigmoid and tanh activation function, respectively. ⊙ represents the element-wise product between two matrices of the same size. e GRU cell is shown in Figure 1.
To make the GRU work, its current hidden state is connected to the next hidden state input. In order to improve the learning ability, multiple GRU cells can be stacked along the input-output direction, and the output of the GRU cell at each step can be used as the input of the next GRU cell at corresponding step. Compared with single-layer GRU, stacked GRU has multiple hidden layers, which can improve the ability to learn time series. e structure of stacked GRU along the time axis is shown in Figure 2.

Residual Connections.
With the appearance of normalization and dropout, the vanishing and exploding gradients problem of the stacked neural network is greatly alleviated, which makes the training of deep network no longer difficult. In theory, the learning ability of the stacked neural network increases with the number of layers, and its error also decreases until it remains unchanged. But actually, when the number of layers increases, the network's performance will degrade rapidly. At present, stacked RNN, LSTM, or GRU generally has no more than four recurrent layers [36].
e latest research pointed out that overfitting is not the cause of stack network degradation. e assumption that the performance of a deep network is not lower than that of a shallow network is based on the ability of the deep part of the network identity mapping its input, in other words, the ability of the deep part of the network fitting f(x) � x [35]. However, artificial neural network has been proved to be difficult to apply in learning linear relationship [33]. In order to give the network layer this ability, the residual connections as shown in Figure 3 are proposed [35]. e red part in Figure 3 is the added residual connections, also known as skip connections or shortcut connections. After adding residual connections, the input of the network layer is directly superimposed with the output, and the layer is transformed from fitting is the approximate identity mapping of x; the network layer is changed to learning the nonlinear residual of identity mapping. It is much easier for neural network to learn a group of nonlinear data close to zero compared to linear data. e residual connections shown in Figure 3 were first used to solve the degradation problem of the deep convolutional neural network in image recognition, but they can also be applied to any stacked network. Figure 4 shows the stacked GRU structure with residual connections, which is the same as Google's stacked LSTM in its machine translation model [36]. Different from Figure 3, the residual connections in Figure 4 skip one GRU layer instead of two, and Add is set before the activation function. e GRU network layer inputs the second GRU layer after the element-wise addition of the output and input at each step. Each GRU layer with residual connections constitutes a residual block, which can be defined as follows: where the function is composed of equations (6)-(8), representing the i -th GRU layer. e residual connections can significantly improve the flow of gradients between network layers. In theory, the network with any number of layers can be trained after

Model
Architecture. e proposed model architecture is shown in Figure 5. It contains three parts: data split, data decomposition, and components prediction. e process of the proposed model is summarized as follows: ( 1)      e number of subseries determines the number of predictors and total training time. In order to make a trade-off between the training time and forecasting accuracy, the wind speed series are decomposed into four subseries in this part under termination conditions � 1e-7.
(3) In the components prediction part, since the subseries have different frequency characteristics, five stacked GRU models with residual connections are used to predict the subseries and a decomposition residual, respectively. e final forecast values are obtained by integrating the forecast outputs of all predictors. Since this paper is a short-term hourly forecasting; the data from the past 24 hours are highly related to the forecast values. erefore, the data from 24 hours are used to make a 3-hours-ahead wind speed forecasting.
According to the above section, the residual connections should be set in the deep layer of the network, so we design a stacked GRU with four layers, and GRU layer 1 and GRU layer 2 are independent, while the residual connections are set at the input of GRU layer 3 and the output of GRU layer 4. e output of GRU layer 2 will be fed to the output of GRU layer 3 and added to it, and then the sum of them will be fed to the output of GRU layer 4 and added to it. In order to match the output of the stacked GRU with the desired output step size, we use a flatten layer to reshape the output into a one-dimensional vector, and then a dense layer with linear activation function is used for linear conversion. A detailed parameters determination is described in the Parametric Study section.

Evaluation Criteria.
Time series forecasting can be converted into a supervised regression problem, so we use three regression metrics to evaluate the forecasting performance.
ese regression metrics are the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). ey are defined as follows: MAPE � 100 n n i�1 where x i represents the model inputs, f(x i ) is the forecast value, y i is the corresponding actual value, and n is the number of actual values.
RMSE and MAE have the same physical dimensions as the original data and range from 0 to ∞. A lower RMSE or MAE means the model has a higher forecasting accuracy. MAE uses the absolute value to describe the gap between two curves, and the error of each prediction point has the same weight in the final error. erefore, the MAE is less than the RMSE on the same data. Since the square term of the RMSE magnifies the error between two points, the large gap between the RMSE and MAE can indicate that some prediction points contribute significantly to the final error of the prediction curve. e MAPE is a dimensionless metric ranging from 0 to ∞, and a lower MAPE means the model has a higher forecasting accuracy. We use this metric because it considers the proportion of error in the total data and is able to evaluate performance of different models on the same dataset. In addition, 5-fold cross-validation strategy is used in the Result of Multistep Wind Speed Forecasting section.

Datasets.
Marine meteorological datasets are collected, sorted, and released by scientific research institutions in various countries. e use of the datasets varies greatly depending on the observation method, observation area, observation period, and observation elements. Selecting a high-quality, long-term, and high-resolution marine meteorological dataset is the premise of modelling. Reanalysis datasets are produced from the buoy and satellite observation data by determining the optimal estimation of the system state. e reanalysis datasets can be regarded as the real global ocean data and are currently used as the data source for the NWP. erefore, the ERA5 is selected in the case study. e ERA5 is the latest global meteorological dataset released by the European Centre for Medium-Range Weather Forecasts (ECMWF). e ERA5 provides hourly wind speed data in 137 levels from the surface up to a height of 80 km, covering the global land and ocean with 30 km grids [40]. e wind speeds are decomposed into u-component and v-component. e positive u-component of wind is eastward wind speed, and the negative counterpart is westward wind speed. e positive v-component of wind is northward wind speed, and the negative counterpart is southward wind speed.
In order to verify the applicability of the proposed model in global ocean, we use hourly wind speed u-component from 1 January 2016, 00 : 00, to 31 December 2017, 23 : 00, which includes 17544 hours to make forecast experiments. e forecast areas are located in the Pacific, Indian, and Atlantic Oceans, respectively. Figure 6 shows the coordinates of three forecast areas and their surrounding areas on the map, as well as the heat map of wind speed in January 2016. e map in Figure 6 is drawn based on the National Oceanic and Atmospheric Administration (NOAA) Panoply software. e statistical indices of three areas in two years are shown in Table 2.
To further analyse the characteristics of the dataset, Figure 7 shows the original time series, as well as [0, 1] normalized trends and seasonality of the first month with 24 Complexity 7 hours as period. e trendy series present great changes in a month. e seasonal series do not show repetitive patterns, which indicates that the wind speed data contain multiple seasonalities. erefore, it is necessary to use decomposition methods that support multiple frequencies.
In Table 2, the positive average wind speed means eastward wind speed, and the negative means westward wind speed. It can be seen from Table 2 that the maximum westward wind speeds in areas 2 and 3 are significantly higher than the maximum eastward wind speed, while they are similar in area 1. Area 2 has the highest average wind speed and the largest standard deviation. In addition, although the maximum wind speed in area 3 is just 9.3708 m·s −1 , its minimum value is 0.1613 m·s −1 , much higher than other areas. According to the statistical indices of trends and seasonality, area 1 and area 2 have similar seasonal standard deviations, and the higher volatility of area 2 is attributed to its trendy part. Area 3 has the lowest trendy and seasonal standard deviations among the three areas.

Comparison between Decomposition-Based Models.
In order to prove that the VMD is superior to other decomposition methods and that the performance of the stacked GRU is improved by the residual connections as a component predictor, the following experiments are carried out. All of data are normalized when passed to the model for training.
First, a group of experiments are carried out in area 1 to prove that the proposed combination of the VMD and the stacked GRU is superior to the combinations of the other decomposition and prediction models. In the experiment, three decomposition methods and four prediction models are cross-combined. e WT, EMD, and VMD are selected. Among them, the decomposition level of the WT is 4, which means that the wind speed sequence will be decomposed into an approximate component, four detail components, and a decomposition residual sequence. e EMD processes the sequence adaptively, so the wind data in area 1 is decomposed into 9 to 11 subseries. erefore, the highest frequent subseries will be discarded until the number of all subseries does not exceed 9. e VMD decomposes wind data into four subseries under termination conditions � 1e-7.
e prediction methods include the LSTM, GRU, stacked LSTM, and stacked GRU. e LSTM and GRU are designed as a single-layer structure with 512 neural units. e stacked LSTM and stacked GRU are designed as four layers with 512, 32, 32, and 32 neural units in each layer, respectively. e batch size is 25 and Adam optimizer's learning rate is 0.001. e results are shown in Table 3. e three metrics of the EMD are slightly lower than those of WT. ere are some exceptions in Figure 3. For   e RMSE of GRU is lower than LSTM under the three decomposition methods, and the RMSE of stacked GRU is lower than the stacked LSTM. is shows that GRU shows better performance than the LSTM in both single layer and multiple layers. However, the metrics of stacked GRU are higher than GRU, and the metrics of stacked LSTM are higher than LSTM. For example, the 1-step RMSE of VMD-Stacked GRU is 0.1436, and the 1-step RMSE of VMD-GRU is 0.1385.
is shows that the stacked GRU and stacked LSTM are degraded when combined with the VMD. is degradation is found in the models based on all three decomposition methods.

Improvement of VMD-Stacked GRU by Residual
Connections. After the above analysis, the VMD-GRU is determined as the best combination of decomposition-prediction methods, and the VMD-Stacked GRU is determined as the second best combination. In order to confirm that residual connections improved VMD-Stacked GRU to make it surpass the VMD-GRU method, a comparative experiment was carried out. In the experiment, two kinds of residual connections were used. Residual connections (a) represent the structure shown in Figure 4 and equation (8), and residual connections (b) change the single-layer skipping to double-layer skipping. e results are shown in Table 4. e VMD-Stacked GRU with residual connections (a) outperforms that with residual connections (b) in most metrics.
e VMD-stacked GRU with residual connections (a) perform slightly worst than that with residual connections (b) only on the 1st and 3rd steps of Area 1 and the 3rd step of Area 3. erefore, it can be considered that residual connections (a) are more suitable than residual connections (b) for wind speed prediction tasks. e VMD-Stacked GRU with residual connections (a) outperforms VMD-GRU in most metrics. It is illustrated that residual connections (a) solve the degradation of VMD-Stacked GRU and make it surpass VMD-GRU.  To determine the optimal parameters of the proposed model, a detailed parametric study is carried out. e parameters to be determined are the network parameters and training parameters of the stacked GRU. e network parameters include the number of hidden neural units of each layer. e training parameters include batch size, maximum epochs, and the learning rate of the optimizer. e training parameters are first determined, followed by the network parameters. When determining the training parameters, the network parameters are set in advance according to the latest research. In the article in [4], the optimal two-layer GRU is determined with 512 units in the first layer and 32 units in the second layer. Since the Adam optimizer outperforms classical optimizers such as RMSProp, SGD, and Adagrad [41], it is adopted in the experiment, and its learning rate is searched in {0.1, 0.001, 0.0001}. Batch size is searched in {8, 16, 25, 32, 64} since a study in [42] showed that a smaller batch helps to model training. We set maximum epochs � 50 and use TensorFlow 2.3.1's callback function [43] to monitor the lowest loss value of the validation set in epochs iteration. e parametric study is carried out on area 1, and results are average of three time steps. e RMSE results are shown in Figure 8.
e above results show that the best configuration of training parameters is batch size � 24 and learning rate-� 0.001. e stacked GRU network parameters are, respectively, marked as (a * , a * ), (a, a * , a * ), and (a, b, b * , b * ) according to different layers. For example, (a * , a * ) represents two-layer GRU, and the units number is a; and * means that there are residual connections in this layer. Units search is firstly conducted in {10, 100, 200, . . ., 600} and then a more accurate search is conducted in the best interval. e RMSE results are shown in Figure 9.
It can be seen that (500, 50, 50 * , 50 * ) are the optimal network parameters. Above all, the best parametric configuration set is shown in Table 5.

Result of Multistep Wind Speed Forecasting.
To prove the superiority of the proposed model VMD-Stacked GRU with residual connections, we choose seven time series and machine learning models, as well as a published EMD-based model [26], as baseline models. ese models directly learn wind speed series without composition.
rough Autocorrelation Function and Partial Autocorrelation Function diagrams, the ARIMA parameters p � 2, d � 1, and q � 12 are determined. e support vector machine regression (SVR) uses RBF kernel and establishes the relationship between past information and each forecast time step. e structure of MLP is a four-layer structure with 400, 32, 32, and 32 units in each layer, respectively. e EMD-PE-ANN reconstructs IMFs into IMF 1 , IMF 2 , and 9 i�3 IMF i . 5-Fold cross-validation strategy is used to obtain the results in Tables 6-8. It can be seen from Tables 6-8 that, compared with predicting wind speed directly, the proposed model has lower error metrics at three time steps in three areas. Most of the error metrics of GRU are lower than those of LSTM, especially in area 2. Most of the error metrics of stacked GRU are lower than those of stacked LSTM, and the RMSE is lower than stacked LSTM only on the 3rd step of area 1 and the first step of area 3. When directly predicting wind speed, the stacked model also has a less obvious degradation effect. For example, compared with the GRU, the RMSE of the stacked GRU shows degradation in the 3 steps of area 1 and the 1st step of area 2 and area 3. In addition, although not surpassing the proposed model, the two classic methods, ARIMA and SVR, have relatively good metrics which are close to GRU.
To further illustrate the superiority of the proposed model, the following figures show the comparison curves of models in area 1. It can be seen from Figure 10 that the fitting effect of the prediction curve (red line) of the proposed model is significantly higher than those of the other models. e values at the last input time step (Persistence) and the overall mean values of inputs (Average) are also added in the figure as benchmarks, and the experiment results show that the proposed model surpasses the benchmarks.

Discussion
e case study concludes that, compared with other direct prediction or decomposition-based prediction models, the proposed VMD-Stacked GRU model with residual connections is more accurate in multistep forecasting. e proposed model performs well on the ERA5 sea surface wind Table 4: Comparison of VMD-based models with and without residual connections.
Step VMD-GRU (1) e VMD is an excellent decomposition method, and its error in combination with LSTM, GRU, stacked LSTM, and stacked GRU is lower than the combination of WT or EMD and these methods. (2) e GRU is a better forecasting model than LSTM. In the cases of direct prediction, direct prediction after stacking, and decomposition-based prediction after stacking, the GRU's error metrics are lower than LSTM. (3) e residual connections can alleviate the degradation of stacked GRU and improve its learning ability. e overall error metrics of VMD-Stacked GRU with residual connections are lower than those of VMD-GRU and VMD-Stacked GRU without residual connections.

Conclusions
e sea wind speed forecasting is a key part to guarantee safety of sailing ships. To solve the problems of hourly short-term wind speed forecasting, an ensemble model based on the VMD and stacked GRU is proposed, and the residual connections are used to improve stacked GRU. e model uses VMD to decompose the wind speed series and then uses the stacked GRU model with residual connections to predict each component. In order to prove the performance of the proposed model, three cases from the Pacific, Indian, and Atlantic Oceans are studied. In the experiment, three error metrics, RMSE, MAE, and MAPE, are used to evaluate each time step. rough the case studies, the following conclusions can be illustrated: (1) Separately predicting the decomposed wind speed sequence and then superimposing it as the final result can improve the prediction effect, and VMD is the most effective one of the various decomposition methods. (2) e forecast error metrics of VMD-Stacked GRU with residual connections are generally lower than those of ARIMA, SVR, MLP, LSTM, GRU, stacked LSTM, and stacked GRU models at the 1st, 2nd, and 3rd steps.
Data Availability e wind speed data used to support the findings of this study are freely available and supplied by ECMWF. Requests for access to these data should be made through ECMWF website: https://cds.climate.copernicus.eu/cdsapp.