Data Normalization to Accelerate Training for Linear Neural Net to Predict Tropical Cyclone Tracks

When a pure linear neural network (PLNN) is used to predict tropical cyclone tracks (TCTs) in the South China Sea, whether the data are normalized or not greatly affects the training process. In this paper, the min.-max. method and the normal distribution method, instead of the standard normal distribution, are applied to TCT data before modeling. We propose experimental schemes in which, with the min.-max. method, the min.-max. value pair of each variable is mapped to (−1, 1) and (0, 1); with the normal distribution method, each variable's mean and standard deviation pair is set to (0, 1) and (100, 1). We present the following results: (1) data scaled to similar intervals have similar effects, regardless of whether the min.-max. or the normal distribution method is used; (2) mapping data to around 0 gives much faster training than mapping them to intervals far away from 0 or using unnormalized raw data, although the training error curves indicate that all of them can approach the same low error level after enough steps. This could be useful for choosing a data normalization method when a PLNN is used individually.


Introduction
Numerous prediction models have been proposed in the past decades to raise the forecasting precision of tropical cyclone tracks (TCTs) in the South China Sea (SCS) and so reduce the losses from these disasters. TCT forecasting is a time series problem. Many improved time series techniques could be applied to this field [1][2][3][4]. The approaches are becoming more and more complicated, including ensembles [5][6][7][8]. The results presented here are part of a wider project on committee machines, which aims to obtain higher performance by combining multiple simple experts. These experts require reasonable accuracy and diversity [9]. We use pure linear neural networks (PLNNs) as the experts and then combine their results by means of fuzzy logic. The project therefore relies on a comprehensive understanding of the PLNNs, especially of PLNNs on the particular data set [10].
One of the problems we face is whether the data should be normalized before modeling to predict TCTs in the SCS, and what the effect is. It has been reported in the literature that normalization can improve performance. Sola and Sevilla [11] pointed out the importance of data normalization prior to neural network training to speed up the calculations and obtain good results in a nuclear power plant application. Jayalakshmi and Santhakumaran [12] concluded that various statistical normalization techniques enhance the reliability of trained feedforward backpropagation neural networks and that the performance of the diabetes data classification model using the neural networks depends on the normalization method. Zhang and Sun [13] also provided a normalization method considered suitable for a particular data set from the UCI repository. So data normalization is important, but the reported results are by no means universally applicable to a particular data set. In predicting TCTs, we usually use raw environmental data or simply map the raw data linearly to the 0-1 interval. But we are not certain whether that is beneficial.
In this paper, we try two commonly used normalization methods, that is, the linear min.-max. method and the normal distribution method, propose experimental schemes that map the raw data variables to intervals near 0 as well as far away from 0, and then apply the normalized data to a standalone PLNN model. We intend to study whether data normalization really works and how it affects the PLNN model with TCT data, so that we can be more aware of how to pretreat the data and make the combining system more controllable once the PLNNs are used as ensemble members.
We demonstrate that proper data normalization speeds up convergence remarkably with the TCT data. The trends of the error curves show that low levels of training error can be achieved in all situations as long as enough training iterations are provided. However, the interval to which the data are mapped matters more than which of the two methods, min.-max. or normal distribution, is used to map them.

Pure Linear Neural Network
The pure linear neural network in this paper is a single-layer structure model named ADALINE (adaptive linear neuron), first proposed by Widrow and Hoff in 1960, as shown in Figure 1 [14]. Each node in the structure represents a neuron; it has its own weights $\mathbf{w}$ and a bias $b$ and gets the same input vector $\mathbf{p}$. The weighted inputs are summed with the bias to form the following net output $n$:

$$n = \mathbf{w}^{T}\mathbf{p} + b, \tag{1}$$

where $\mathbf{w} = [w_1, w_2, \ldots, w_R]^{T}$ and $\mathbf{p} = [p_1, p_2, \ldots, p_R]^{T}$. Equation (1) could also be expressed as

$$n = \mathbf{x}^{T}\mathbf{z}$$

for simplicity, where $\mathbf{x} = [\mathbf{w}^{T}, b]^{T}$ and $\mathbf{z} = [\mathbf{p}^{T}, 1]^{T}$. The neuron uses a pure linear transfer function, which takes the net output as its input and forms the neuron output $a = n$. The neuron is shown in Figure 2.
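The learning rule could be derived by minimizing the mean square error of the training instance

$$F(\mathbf{x}) = e^{2} = (t - a)^{2},$$

where $e$ is the difference between the target (or observed value) $t$ and the network output (or prediction) $a$. The rule is written as

$$\mathbf{w}(k+1) = \mathbf{w}(k) + 2\alpha\, e(k)\, \mathbf{p}(k), \qquad b(k+1) = b(k) + 2\alpha\, e(k),$$

or

$$\mathbf{x}(k+1) = \mathbf{x}(k) + 2\alpha\, e(k)\, \mathbf{z}(k) \tag{6}$$

for simplicity [15], where $\alpha$ is a learning rate.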
In this paper, the learning rate is obtained from a MATLAB function introduced in the Neural Network Toolbox. This function takes (6) as the learning rule. We use the traditional incremental training strategy, in which the weights and biases are updated after each single instance, whereas in batch training the weights and biases are updated based on more than one instance [16][17][18].
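The incremental training strategy described above can be sketched in Python with NumPy. This is a minimal illustration, not the toolbox routine used in the paper: the function name `train_adaline`, the fixed learning rate, and the synthetic data are assumptions, and the factor 2 follows a common textbook form of the Widrow-Hoff rule (it can equally be folded into the learning rate).

```python
import numpy as np

def train_adaline(P, t, x, lr, n_max):
    """Incremental Widrow-Hoff (LMS) training of a single linear neuron.

    P: (N, R) inputs, t: (N,) targets, x: (R + 1,) weights with the bias last.
    """
    x = x.copy()
    Z = np.hstack([P, np.ones((len(P), 1))])  # augmented inputs z = [p, 1]
    for _ in range(n_max):                     # one iteration = one pass over the set
        for z_k, t_k in zip(Z, t):
            e_k = t_k - x @ z_k                # error of the linear output a = x^T z
            x += 2.0 * lr * e_k * z_k          # weights updated after every instance
    return x

# Synthetic, noise-free example (not TCT data): t = 2*p1 - p2 + 0.5
rng = np.random.default_rng(0)
P = rng.uniform(-1.0, 1.0, size=(50, 2))
t = 2.0 * P[:, 0] - P[:, 1] + 0.5
x = train_adaline(P, t, np.zeros(3), lr=0.05, n_max=200)
print(x)
```

Because the synthetic targets are exactly linear in the inputs, the trained vector should approach the generating weights and bias.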

Experiments and Results
3.1. Data Set. The TCT data set used in our experiment is published by the China Meteorological Administration. The objects are the TCs that form in or move into the SCS area and last for at least 48 h. The TCTs were sampled every 12 h, starting from the moment when a TC moved into the sea area or developed in the area. We use the 750 instances sampled since 1960. Our objective is to predict the longitude and latitude of the TC center in the next 24 h [19].
A TCT and its changes are associated with the TC's intensity, the accumulation and replenishment of energy, and various nonlinear changes in its environmental flow field, which are referred to as variables in this paper. The variables used in our experiment include the climatology and persistence (CLIPER) factors representing changes of the TC itself, such as changes in the latitude, longitude, and intensity of a TC at 12 and 24 h before prediction time. Table 1 lists the variables initially selected by using the CLIPER method with a significance level of 0.01 [19,20].
For our case study we adopt the first 720 instances for training and the following 30 as independent instances for testing. Eight and 11 common predictors among July, August, and September are carefully picked from the different numbers of variables shown in Table 1 for longitude and latitude, respectively. Some statistical information about the data set is listed in Table 2.

Experiment Setup.
Two types of normalization methods, the min.-max. method and the normal distribution method, are used in our experiment. The min.-max. method linearly maps the original values to the new interval determined by the assigned min.-max. values (see Figure 3). The original minimum value $x_{\min}$ and maximum value $x_{\max}$ can be obtained from the statistical information of the raw data, and the new minimum value $y_{\min}$ and maximum value $y_{\max}$ are assigned by us, so the new value $y$ could be calculated from the original value $x$ by

$$y = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\,(y_{\max} - y_{\min}) + y_{\min},$$

while the normal distribution method maps the original values to the new interval according to the new mean and deviation (see Figure 4). Four normalized data sets originating from the raw data set, as well as the raw one, are used in our experiment and are described in Table 3.
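The two normalization methods, and the inverse mapping needed to denormalize results, can be sketched as follows. The function names and the sample values are hypothetical, and the population standard deviation (NumPy's default) is assumed for the normal distribution method; the four target settings follow the configurations described in this paper.

```python
import numpy as np

def minmax_map(x, y_min, y_max):
    """Linearly map x from [x.min(), x.max()] to [y_min, y_max]."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min) * (y_max - y_min) + y_min

def normal_map(x, y_mean, y_std):
    """Shift and scale x so that it has mean y_mean and standard deviation y_std."""
    return (x - x.mean()) / x.std() * y_std + y_mean

def minmax_unmap(y, x_min, x_max, y_min, y_max):
    """Inverse of minmax_map, used to denormalize values."""
    return (y - y_min) / (y_max - y_min) * (x_max - x_min) + x_min

raw = np.array([100.0, 110.0, 125.0, 140.0])  # made-up values, not TCT data
conf1 = minmax_map(raw, -1.0, 1.0)            # min.-max. pair mapped to (-1, 1)
conf2 = minmax_map(raw, 0.0, 1.0)             # min.-max. pair mapped to (0, 1)
conf3 = normal_map(raw, 0.0, 1.0)             # mean 0, standard deviation 1
conf4 = normal_map(raw, 100.0, 1.0)           # mean 100, standard deviation 1
restored = minmax_unmap(conf1, raw.min(), raw.max(), -1.0, 1.0)
```

In practice the statistics (minimum, maximum, mean, deviation) would be taken per variable from the training data and reused for the independent instances.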
The training process on each training data set is summarized in Algorithm 1 [21], in which we have the following.
(1) $\mathbf{p} \in P_{\mathrm{trn}}$: each instance in the training set is applied to the training process one by one in turn.
(2) "Calculations of instance" indicates the necessary calculations, such as forward calculation of the network output, the errors, and so on, as required.
The experimental process is summarized in Algorithm 2, in which Algorithm 1 is used twice, and we have the following.
(1) Network weights are initialized with uniformly distributed random real values ranging between 0 and 1, and the experiments on the five data sets use the same initial weights.
(2) Training set is initialized with the first 720 of the total 750 instances (including both longitude and latitude data; the same applies below).
(3) Testing set is initialized with the last 30 instances.
(4) During the pretraining, the maximum number of iterations ($N_{\max}$) in Algorithm 1 is set to 1000.
(5) During the repeat statement, $N_{\max}$ is set to 20.
In Algorithm 2, we arranged a pretraining process in which the maximum number of iterations ($N_{\max}^{\mathrm{pretrain}}$) is much larger than that in the training process with respect to each independent instance ($N_{\max}^{\mathrm{train}}$). This arrangement is made because we want the prediction of each independent instance to be made by an adequately trained model. If $N_{\max}^{\mathrm{pretrain}}$ is set too small, the model will be inadequately trained when it is applied to predict the independent instances. To compensate for this, $N_{\max}^{\mathrm{train}}$ would have to be set larger, so that the initial training set, and not only the newly added independent instance, is trained more. This is reasonable for the first few independent instances. However, the purpose of $N_{\max}^{\mathrm{train}}$ is to train the newly added independent instance on top of a model already adequately trained on the former training set. For the later independent instances, only one new instance is added to the training set each time, so the training load is not heavy, and a larger $N_{\max}^{\mathrm{train}}$ wastes resources. As a result, $N_{\max}^{\mathrm{pretrain}}$ is set larger and $N_{\max}^{\mathrm{train}}$ is set smaller. Another characteristic of Algorithm 2 is that only one instance is predicted each time, and after that the predicted instance is added to the training set, so the model is retrained before the next prediction. This is in accordance with actual practice. We use the 24-hour model to predict the track center in the next 24 hours. Once the observation becomes available, this instance is valuable and should be added to the training set to prepare for the next 24 hours.
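The pretrain-then-retrain procedure described above can be sketched as a walk-forward loop. This is an illustrative reading of Algorithm 2 under stated assumptions: the helper names `train` and `walk_forward`, the learning rate, and the small synthetic data set are hypothetical, and the number of pretraining passes is reduced from the paper's 1000 to keep the example quick.

```python
import numpy as np

def train(P, t, x, lr, n_max):
    """Incremental LMS passes over (P, t); returns updated weights (bias last)."""
    x = x.copy()
    Z = np.hstack([P, np.ones((len(P), 1))])
    for _ in range(n_max):
        for z_k, t_k in zip(Z, t):
            x += 2.0 * lr * (t_k - x @ z_k) * z_k
    return x

def walk_forward(P_trn, t_trn, P_tst, t_tst, x0, lr, n_pretrain, n_train):
    """Pretrain once, then predict one test instance at a time and absorb it."""
    x = train(P_trn, t_trn, x0, lr, n_pretrain)      # pretraining
    preds = []
    for p_new, t_new in zip(P_tst, t_tst):
        x = train(P_trn, t_trn, x, lr, n_train)      # short training on current set
        preds.append(x[:-1] @ p_new + x[-1])         # test on the first test instance
        P_trn = np.vstack([P_trn, p_new])            # move it into the training set
        t_trn = np.append(t_trn, t_new)
    return np.array(preds), len(P_trn)

# Synthetic, noise-free data (not TCT data): 40 training and 5 test instances
rng = np.random.default_rng(1)
P = rng.uniform(-1.0, 1.0, size=(45, 2))
t = P @ np.array([1.5, -0.5]) + 0.2
x0 = rng.uniform(0.0, 1.0, size=3)                   # initial weights in [0, 1]
preds, n_final = walk_forward(P[:40], t[:40], P[40:], t[40:], x0,
                              lr=0.05, n_pretrain=100, n_train=20)
```

Each test instance is predicted before its target is seen, and only then joins the training set, which mirrors the operational setting described in the text.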

Result and Analysis.
Figure 5 shows the convergence curves during the pretraining process for the five normalization configurations. The left half of the figure is for the longitude and the right one for the latitude. Both halves share the same x-coordinate, namely, iterations, labeled on the top edge and numbered along the bottom edge. The y-coordinate is the mean absolute error (MAE) for all the instances in the training set after the given number of training iterations. The scales of the y-coordinate are different for longitude and latitude and are numbered along the left edge of the figure. The five colored thick curves with various line styles reflect the training precision and speed. The smaller the y-value, the higher the precision, and the steeper the curve, the faster the convergence. We should make it clear that all the results, including the MAEs, are denormalized before being drawn in the figures or listed in the tables; the same applies below. As Algorithm 1 shows, 1 iteration of training means that each of the instances in the training set is applied to the learning rule to update the weights once.

Algorithm 2: The experimental process.
begin
    Initialization
    begin
        Network weights x
        Training set P_trn
        Testing set P_tst
    end
    Pretraining on the initial training set P_trn using Algorithm 1
    repeat
        Training on the current training set P_trn using Algorithm 1
        Test on first(P_tst)
        Move first(P_tst) from P_tst to P_trn
    until P_tst = ∅
end
The curves of Conf. 1 and Conf. 2 both use the min.-max. method of normalization. The difference is that Conf. 1 places the normalized data around zero with radius 1, while Conf. 2 places them around 0.5 with radius 0.5. This minor difference makes the curve of Conf. 1 lower and steeper than that of Conf. 2 during the pretraining, for both longitude and latitude.
What Conf. 1 and Conf. 3 have in common is that their means are 0 and their radii are about 1. They differ in that Conf. 1 uses a uniform distribution and Conf. 3 a normal distribution. The convergence curves of Conf. 1 and Conf. 3 are very close, for both longitude and latitude data. For longitude data, the MAE values on the curve of Conf. 1 are smaller than those on the curve of Conf. 3, while for latitude data the opposite holds. So it is hard to tell which configuration is superior.
Conf. 3 and Conf. 4 both use the normal distribution, but the mean of Conf. 3 is 0, while that of Conf. 4 is shifted from 0 to 100. As a result, the MAE of Conf. 4 during convergence is so large that its curve falls outside the axis limits in Figure 5.
Conf. 5 uses the raw data, located in different intervals ranging from −45 to 525 for different variables (see Table 2). The curve of Conf. 5 does not deviate as much as the curve of Conf. 4 does, although it is flatter than the curves of Conf. 1, Conf. 2, and Conf. 3. This implies that initial weights ranging between 0 and 1 are possibly not so decisive for large-valued data, although the data of Conf. 4 could be influenced to some extent by sharing the same initial weights with the other configurations.
Table 4 lists the iterations required for the PLNN to reach the same levels of training error during the pretraining process with both longitude and latitude data. It illustrates quantitatively that mapping the data to intervals around 0 (Conf. 1 and Conf. 3) reaches a smaller training error much faster, and that even a slight shift of the mean from 0 to 0.5 (Conf. 2) lengthens this process, let alone a giant shift from 0 to 100 (Conf. 4).
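The effect of shifting the data center can be reproduced qualitatively on synthetic data. The sketch below is not the paper's experiment: the data are random rather than TCT records, only the inputs are shifted, the initial weights are zero, and the step size is an assumed heuristic (a small fraction of the reciprocal of the largest eigenvalue of the input correlation matrix) standing in for the toolbox function used in the paper.

```python
import numpy as np

def mae_after_training(P, t, n_max, frac=0.02):
    """Train a linear neuron by incremental LMS and return the training MAE."""
    Z = np.hstack([P, np.ones((len(P), 1))])
    # Assumed heuristic: step size proportional to 1 / (largest eigenvalue
    # of the input correlation matrix), so that both runs are stable.
    lr = frac / np.linalg.eigvalsh(Z.T @ Z / len(Z)).max()
    x = np.zeros(Z.shape[1])
    for _ in range(n_max):
        for z_k, t_k in zip(Z, t):
            x += 2.0 * lr * (t_k - x @ z_k) * z_k
    return np.mean(np.abs(t - Z @ x))

rng = np.random.default_rng(0)
eps = rng.standard_normal((200, 3))
t = eps @ np.array([1.0, -2.0, 0.5]) + 0.3              # same targets for both runs

mae_near = mae_after_training(eps, t, n_max=20)          # inputs with mean 0, std 1
mae_far = mae_after_training(eps + 100.0, t, n_max=20)   # inputs with mean 100, std 1
print(mae_near, mae_far)
```

After the same number of passes, the run with inputs centered at 0 should be close to zero error, while the run centered at 100 is still far from it, because the large offset forces a much smaller stable step size.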
Figure 6 illustrates the fitting curves of longitude and latitude on the independent instances. The left and right halves of the figure are for longitude and latitude, respectively. Both halves share the same x-coordinate, indicating the indexes of the independent instances, labeled on the top edge and numbered along the bottom edge. The y-coordinate is the real east longitude or north latitude value. The five prediction curves, as well as the observed one (the targets, black solid line), are drawn for comparison. Each curve connects the adjacent fitting points. It is worth noting that the training set corresponding to each independent instance includes the original training set (720 instances) and the previously used independent instances. For example, if the index is 5, then the current training set is 720 plus 4, 724 instances in total. For the y-value, the closer it is to the target point, the more accurate the prediction. For the y-value of east longitude, a value larger than the target implies that the prediction is east of the target position, and a smaller value implies west.
For the y-value of north latitude, larger and smaller values imply north and south of the target values, respectively. The curves of Conf. 1, Conf. 2, and Conf. 3 are very consistent, and there is almost no difference in the figure. Conf. 5 uses the unnormalized data, and its predictions are similar to those of Conf. 1, Conf. 2, and Conf. 3. For the curve of Conf. 4, the deviations from the targets are sometimes large and sometimes small, showing more instability. However, when combining systems, this instability could cancel out errors. So the direction and magnitude of these unstable deviations are also worth further study. From the qualitative point of view, the prediction trends of all five configurations are consistent with the target. The distance errors of Conf. 1 and Conf. 3, whose means are zero, differ by only 0.3 km. The errors of Conf. 1, Conf. 2, and Conf. 3, whose means are near zero, have a maximum difference of only 7.8 km. For such a large-scale motion as a TC, these errors could be regarded as being on the same level, while the distance error of Conf. 4 differs by up to 56.8 km from the minimum error, that of Conf. 2. […] the raw data, in which some values are around zero and some are not, and some are positive while some are negative. They could also cancel out the errors to some degree.
The value ranges probed for data normalization in our experiments are not systematic, for either the min.-max. method or the normal distribution method. The experiments could be improved by using more data configurations to confirm the model trends, with more data centers and more ranges. For example, the data centers could be 0, 1, 50, 100, 500, 1000, and so forth, and, for each data center, several radii on different orders of magnitude, depending on the center value, could be considered.
In our experiments, only one normalization method is used on all variables of the same data set. In other words, in one experiment all the variables in the data set use the same normalization method. However, this differs from real situations, where different variables may call for different methods. For example, the longitude variable should be scaled to an evenly distributed range, because the earth is round and the longitudes are evenly distributed on its surface, while the surface wind speed should use the normal distribution method because of its possible values.
Despite the slower convergence and larger mean errors of Conf. 4, individual points on its fitting curve of independent instances have higher precision, while others have lower precision, showing more instability. However, when combining subsystems in committee machines, which require the experts to be reasonably accurate (better than guessing) and diverse (with uncorrelated errors), some instability may be beneficial. The curves for Conf. 1, Conf. 2, Conf. 3, and Conf. 5 in Figure 6 clearly show little diversity. So, in the study of committee machines, how to use normalization methods to control the model outputs, so that they show some diversity while remaining limited to some degree, is worthy of further exploration.

Conclusions
In this work, we have proposed experimental schemes to map the TCT data to different ranges, near to as well as far away from 0, using the min.-max. and normal distribution methods. The normalized data were then applied to the PLNN model to predict TC centers in the SCS. It has been shown that the min.-max. and normal distribution methods produce similar results, as long as they map the data to similar value ranges. We also demonstrated that mapping the data to around 0 could remarkably accelerate training, although, provided enough training iterations, reaching a smaller error is possible in all the given situations. Future work will be devoted to the initialization of the network weights, which is also expected to greatly affect the results. By being more aware of how data normalization and the initial weights of the networks influence the training process, and with the theoretically optimal learning rate provided by the MATLAB Neural Network Toolbox, we can gain a more comprehensive understanding of the individual PLNN and can better design and control the combining system.

Figure 4: Normal distribution method of normalization.

Figure 6: Predictions of the independent instances for 5 normalized data and the observations.

Table 1: Numbers of CLIPER variables and physical variables initially gathered for July, August, and September.

Table 2: Variable information of the data set.

Table 4: Required iterations to reach decreasing MAE stages during pretraining.

Algorithm 1: Training process on the training set $P_{\mathrm{trn}}$.
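Table 5 lists the mean absolute errors of the testing set, from the quantitative point of view. The MAE is

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| t_i - a_i \right|,$$

where $t_i$ and $a_i$ are the target and the predicted output of the $i$th independent instance and $N$ is the number of independent instances. The distance error (Dst., as row header in Table 5) is calculated by the commonly used equation

$$\mathrm{Dst} = 110 \times \sqrt{\mathrm{MAE}_{\mathrm{lon.}}^{2} + \mathrm{MAE}_{\mathrm{lat.}}^{2}}.$$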
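The error measures behind Table 5 can be sketched in a few lines. The function names are hypothetical, the numbers in the usage lines are made up rather than taken from Table 5, and the MAEs are assumed to be in degrees, so that the factor 110 acts as an approximate kilometers-per-degree conversion.

```python
import math

def mean_absolute_error(targets, predictions):
    """Mean of |t_i - a_i| over the independent instances."""
    return sum(abs(t - a) for t, a in zip(targets, predictions)) / len(targets)

def distance_error_km(mae_lon, mae_lat, km_per_degree=110.0):
    """Combine longitude and latitude MAEs (assumed in degrees) into a distance."""
    return km_per_degree * math.hypot(mae_lon, mae_lat)

# Made-up values for illustration only
mae_lon = mean_absolute_error([110.0, 112.0], [110.5, 111.9])
dst = distance_error_km(0.3, 0.4)
```

With longitude and latitude MAEs of 0.3 and 0.4 degrees, the combined distance error is about 55 km.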

Table 5: Mean absolute errors of the testing set.