When a pure linear neural network (PLNN) is used to predict tropical cyclone tracks (TCTs) in the South China Sea, whether or not the data are normalized greatly affects the training process. In this paper, the min-max method and the normal distribution method, instead of the standard normal distribution, are applied to TCT data before modeling. We propose experimental schemes in which, with the min-max method, the (min, max) value pair of each variable is mapped to (−1, 1) and (0, 1), and, with the normal distribution method, each variable's (mean, standard deviation) pair is set to (0, 1) and (100, 1). We present the following results: (1) data scaled to similar intervals have similar effects, whether the min-max or the normal distribution method is used; (2) mapping data to around 0 yields much faster training than mapping them to intervals far from 0 or using the unnormalized raw data, although the training error curves show that all of them approach the same low level after a certain number of steps. These findings can help in choosing a data normalization method when a PLNN is used individually.
Numerous prediction models have been proposed over the past decades to improve the forecasting precision of tropical cyclone tracks (TCTs) in the South China Sea (SCS) and to reduce the losses from these disasters. TCT forecasting is a time series problem, and many improved time series techniques could be applied to this field [
One of the problems we face when predicting TCTs in the SCS is whether the data should be normalized before modeling and what effect normalization has. It has been reported in the literature that normalization can improve performance. Sola and Sevilla [
In this paper, we try two commonly used normalization methods, namely, the linear min-max method and the normal distribution method, and propose experimental schemes to map the raw data variables to intervals near
The pure linear neural network in this paper is a single-layer model named ADALINE (adaptive linear neuron), first proposed by Widrow and Hoff in 1960, as shown in Figure
Structure of pure linear neural network.
Equation
The neuron uses the pure linear (identity) transfer function
Pure linear neuron.
The learning rule could be derived by minimizing the mean square error of the training instance
In this paper, we use a MATLAB function
We use the traditional incremental training strategy, in which the weights and biases are updated after each single instance, whereas with batch training the weights and biases are updated based on more than one instance [
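The incremental Widrow-Hoff (LMS) rule described above can be sketched in a few lines. This is a minimal Python sketch of the procedure the paper runs with the MATLAB Neural Network Toolbox; the function name, toy data, learning rate, and epoch count are illustrative, not the paper's own.

```python
import numpy as np

def train_adaline_incremental(X, y, lr=0.01, epochs=50):
    """Train a pure linear neuron a = w.x + b with the incremental
    Widrow-Hoff (LMS) rule: update after every single instance."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mae_history = []                      # training error curve, one point per epoch
    for _ in range(epochs):
        for x_i, t_i in zip(X, y):
            e = t_i - (w @ x_i + b)       # instantaneous error on this instance
            w += lr * e * x_i             # delta rule: gradient step on squared error
            b += lr * e
        mae_history.append(float(np.mean(np.abs(y - (X @ w + b)))))
    return w, b, mae_history

# toy usage: recover y = 2*x1 - x2 + 0.5 from noise-free samples
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5
w, b, hist = train_adaline_incremental(X, y, lr=0.05, epochs=30)
```

With batch training, by contrast, the error would be accumulated over many instances before a single weight update is applied.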
The TCT data set used in our experiment is published by the China Meteorological Administration. The objects are the TCs that form in or move into the SCS area and last for at least 48 h. The TCTs were sampled every 12 h, starting from the moment a TC moved into the sea area or developed in it. We use the 750 instances sampled since 1960. Our objective is to predict the longitude and latitude of the TC center in the next 24 h [
A TCT and its changes are associated with the TC's intensity, the accumulation and replenishment of energy, and various nonlinear changes in its environmental flow field, which are referred to as variables in this paper. The variables used in our experiment include the climatology and persistence (CLIPER) factors representing changes of the TC itself, such as changes in the latitude, longitude, and intensity of the TC at 12 and 24 h before the prediction time. Table
Numbers of CLIPER variables and physical variables initially gathered for July, August, and September.
Position    July   August   September   Commonly used and chosen

Longitude   46     57       40          8
Latitude    100    62       75          11
For our case study we use the first 720 instances for training and the following 30 as independent instances for testing. Eight and 11 predictors common to July, August, and September are carefully picked from the different numbers of variables shown in Table
Variable information of the data set.
Variable  Min.  Max.  Mean  Std.  Meaning 

Lon. v1  105.3  123.0  114.7  4.2  Initial lon. (E) 
Lon. v2  −39.8  23.1  −9.1  10.3  Zonal motion at −12 h 
Lon. v3  −36.1  20.8  −9.3  9.7  Zonal motion at −24 h 
Lon. v4  106  130  116.8  4.9  Lon. (E) at −12 h 
Lon. v5  −43.9  20.8  −9.6  10.3  Zonal motion from −24 to −12 h 
Lon. v6  −4.5  2.7  −1.0  1.2  Lon. difference between 0 and −12 h 
Lon. v7  106.0  124.6  115.2  4.3  Lon. (E) at −6 h 
Lon. v8  106.0  128.3  116.3  4.7  Lon. (E) at −24 h 
Lon. t  101.4  128.3  112.8  4.7  Current lon. (E) 


Lat. v1  10.8  23.5  18.7  2.4  Initial lat. (N) 
Lat. v2  −39.8  23.1  −9.1  10.3  Zonal motion at −12 h 
Lat. v3  −17.4  26.6  4.2  6.5  Meridional motion at −12 h 
Lat. v4  −36.1  20.8  −9.3  9.7  Zonal motion at −24 h 
Lat. v5  0.0  525.2  50.2  62.0  Squared zonal motion at −24 h 
Lat. v6  10.0  23.7  18.2  2.4  Lat. (N) at −12 h 
Lat. v7  −4.5  2.7  −1.0  1.2  Lon. difference between 0 and −12 h 
Lat. v8  −8.2  4.8  −2.1  2.2  Lon. difference between 0 and −24 h 
Lat. v9  10.3  23.8  18.5  2.4  Lat. (N) at −6 h 
Lat. v10  9.6  24.0  18.0  2.5  Lat. (N) at −24 h 
Lat. v11  10  60  23.5  9.9  Max. surface wind at −6 h 
Lat. t  12.2  30.0  19.8  2.6  Current lat. (N) 
Two types of normalization methods, the min-max method and the normal distribution method, are used in our experiment. The min-max method linearly maps the original values to a new interval determined by the assigned minimum and maximum values (see Figure
Min-max method of normalization.
while the normal distribution method maps the original values to a new interval according to the new mean and standard deviation (see Figure
Normal distribution method of normalization.
Similarly, the new value can be calculated by x′ = ((x − μ)/σ) σ_new + μ_new, where μ and σ are the original mean and standard deviation and μ_new and σ_new are the assigned ones.
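Both mappings amount to one line of arithmetic each. The following is a minimal Python sketch (the paper itself works in MATLAB); the function names are ours, and the sample values are taken from the longitude row of the variable table only for illustration.

```python
import numpy as np

def minmax_normalize(x, new_min, new_max):
    """Linearly map [min(x), max(x)] onto the assigned [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def normal_normalize(x, new_mean, new_std):
    """Shift and scale x so its mean and std become new_mean and new_std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std() * new_std + new_mean

# e.g. Conf. 1 maps each variable onto (-1, 1); Conf. 4 sets (mean, std) = (100, 1)
lon = np.array([105.3, 114.7, 123.0])
conf1 = minmax_normalize(lon, -1.0, 1.0)
conf4 = normal_normalize(lon, 100.0, 1.0)
```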
Four normalized data sets derived from the raw data set, together with the raw one, are used in our experiment and are described in Table
Normalization configuration.
Conf. number   Method                Parameters

Conf. 1        Min-max               (min, max) mapped to (−1, 1)
Conf. 2        Min-max               (min, max) mapped to (0, 1)
Conf. 3        Normal distribution   (mean, std.) set to (0, 1)
Conf. 4        Normal distribution   (mean, std.) set to (100, 1)
Conf. 5        Raw data set          —
The training process on each training data set is summarized in Algorithm
The experimental process is summarized in Algorithm
During the pretraining,
During the
In Algorithm
Figure
Mean absolute errors (MAEs) of the training set during pretraining.
Both Conf. 1 and Conf. 2 use the min-max method of normalization. The difference is that Conf. 1 places the normalized data around zero with radius 1, while Conf. 2 places it around 0.5 with radius 0.5. This minor difference makes the error curve of Conf. 1 lower and steeper than that of Conf. 2 during pretraining, for both longitude and latitude.
What Conf. 1 and Conf. 3 have in common is that their means are 0 and their radii are about 1. They differ in distribution: uniform for Conf. 1 and normal for Conf. 3. The convergence curves of Conf. 1 and Conf. 3 are very close, for both longitude and latitude data. For longitude data, the MAE values on the curve of Conf. 1 are smaller than those on the curve of Conf. 3, while for latitude data the opposite holds. So it is hard to tell which configuration is superior to the other.
Conf. 3 and Conf. 4 both use the normal distribution method, but the mean of Conf. 3 is 0, while that of Conf. 4 deviates from 0 to 100. As a result, the convergence MAE for Conf. 4 is too large and falls outside the axis limits in Figure
Conf. 5 uses the raw data, whose variables lie in different intervals ranging from −45 to 525 (see Table
Table
Required iterations to reach decreasing MAE stages during pretraining.
Conf. number   MAE of longitude reaching        MAE of latitude reaching
               1.2    1.175  1.15   1.125       1.1    1.0    0.9

Conf. 1        4      4      5      409         26     46     91
Conf. 2        107    119    144    10^3+       169    302    593
Conf. 3        7      8      11     10^3+       5      9      20
Conf. 4        10^3+  10^3+  10^3+  10^3+       10^3+  10^3+  10^3+
Conf. 5        811    10^3+  10^3+  10^3+       722    909    10^3+
Figure
Predictions of the independent instances for 5 normalized data and the observations.
The curves of Conf. 1, Conf. 2, and Conf. 3 are very consistent; there is almost no difference visible in the figure. Conf. 5 uses the unnormalized data, and its predictions are similar to those of Conf. 1, Conf. 2, and Conf. 3. The curve of Conf. 4 sometimes deviates from the targets by a large amount and sometimes by a small amount, showing more instability. However, when combining systems, this instability could cancel out errors, so the direction and magnitude of these unstable deviations are also worth further study. From a qualitative point of view, the prediction trends of all five configurations are consistent with the target. Table
Mean absolute errors of the testing set.
            Conf. 1   Conf. 2   Conf. 3   Conf. 4   Conf. 5

Lon. (°)    1.016     0.960     1.017     1.529     0.983
Lat. (°)    0.566     0.570     0.568     0.618     0.666
Dst. (km)   127.9     122.8     128.2     181.4     130.6
The MAEs for longitude and latitude are calculated by MAE = (1/n) Σᵢ |pᵢ − oᵢ|, where pᵢ and oᵢ are the predicted and observed values and n is the number of testing instances.
The distance errors of Conf. 1 and Conf. 3, whose means are zero, differ by only 0.3 km. Among the errors of Conf. 1, Conf. 2, and Conf. 3, whose means are near zero, the maximum difference is only 7.8 km. For such a large-scale motion as a TC, these errors can be regarded as being on the same level, whereas the distance error of Conf. 4 exceeds the minimum error, that of Conf. 2, by up to 56.8 km.
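For reference, the per-coordinate MAEs and a distance error can be computed as below. The paper does not show its exact distance formula, so this Python sketch assumes a haversine great-circle distance with a mean Earth radius of 6371 km; the sample predicted and observed TC centers are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    R = 6371.0  # mean Earth radius, km (assumption)
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2.0 * R * math.asin(math.sqrt(a))

# hypothetical predicted vs. observed TC centers: (lat, lon) pairs
pred = [(19.5, 114.0), (20.1, 113.2)]
obs = [(19.8, 114.5), (20.0, 113.0)]

lat_mae = sum(abs(p[0] - o[0]) for p, o in zip(pred, obs)) / len(pred)
lon_mae = sum(abs(p[1] - o[1]) for p, o in zip(pred, obs)) / len(pred)
dst_mae = sum(haversine_km(p[0], p[1], o[0], o[1]) for p, o in zip(pred, obs)) / len(pred)
```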
The experiments show that data normalization does affect performance, as we expected. Training is more likely to converge if we use data scaled to around zero, whether by the min-max method or the normal distribution method. However, it is not absolutely necessary that the data center lie exactly at zero. In our experiments, testing errors using the data scaled to
The raw data are made up of all kinds of meteorological variables, so it is difficult to ensure that they lie around zero, which could affect performance. In our experiments, although values far from zero exist in the raw data, ranging from as low as −45 to as high as 525, the errors are not very large and the convergence is not as slow as for Conf. 4. This is probably because of the diversity of the raw data: some variables are around zero and some are not, some are positive and some are negative. They may also cancel out errors to some degree.
The probing value ranges for data normalization in our experiments are not systematic, for either the min-max method or the normal distribution method. The experiments could be improved by using more data configurations to confirm the model trends, with more data centers and more ranges. For example, the data centers could be 0, 1, 50, 100, 500, 1000, and so forth, and for each center several radii of different orders of magnitude, depending on the center value, could be considered.
In our experiments, a single normalization method is applied to all variables in a data set; that is, within one experiment every variable uses the same normalization method. However, this differs from real situations, where different variables may call for different methods. For example, longitude should be scaled to an evenly distributed range, because the earth is round and longitudes are evenly distributed on its surface, while the surface wind speed should use the normal distribution method because of the distribution of its possible values.
Despite the slower convergence and larger mean errors of Conf. 4, individual points on its fitting curve of independent instances show higher precision as well as lower precision, that is, more instability. However, when combining subsystems in committee machines, which require the experts to be reasonably accurate (better than guessing) and diverse (errors uncorrelated), some instability may be beneficial. The diversity of the curves for Conf. 1, Conf. 2, Conf. 3, and Conf. 5 in Figure
In this work, we have proposed experimental schemes to map the TCT data to different ranges, near to as well as far away from
Future work will be devoted to the initialization of the network weights, which is also expected to greatly affect the results. By better understanding how data normalization and the initial weights of the networks influence the training process, together with the theoretically optimal learning rate provided by the MATLAB Neural Network Toolbox, we can gain a more comprehensive understanding of an individual PLNN and can better design and control the combining system.
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (Grant no. 61203301), Major Special Funds for Guangxi Natural Science Foundation (no. 2011GXNSFE018006), and the National Natural Science Foundation of China (no. 41065002). Ming Li acknowledges support in part by the National Natural Science Foundation of China under Project Grants nos. 61272402, 61070214, and 60873264. Ming Li also thanks the Science and Technology Commission of Shanghai Municipality for its support under Research Grant no. 14DZ2260800. The authors are grateful to XiaoYan Huang for providing and refining the variable information of the data set.