A new methodology that combines data mining technology with statistical methods is proposed for predicting the characteristic factors of tropical cyclones (TCs), namely, latitude, longitude, the lowest center pressure, and wind speed. The method uses the best track datasets for the years 1949~2012. With it, effective criteria are formed to judge whether TCs land on Hainan Island or not; the highest probability of accurate judgment exceeds 79%. For TCs that are judged to land on Hainan Island, prediction equations are established to predict their characteristic factors. Results show that the average distance error is improved compared with that of the National Meteorological Centre of China.
A typhoon is a kind of tropical cyclone (TC) whose center sustained wind speed reaches level 12 to level 13 (typhoons are not distinguished from TCs in this paper unless specially emphasized). Hainan Island (108°37′E~111°05′E, 18°10′N~20°10′N) in China is well known as a “typhoon corridor.” According to the analysis of historical data on TCs landing on Hainan Island, the yearly and the monthly statistical results are shown in Figures
The yearly statistical result.
The monthly statistical result.
Traditional TC forecasting relies mainly on statistical methods and dynamic methods, most of which involve complicated processes or offer lower precision. Statistical methods use historical TC positions, intensities, and so on to predict a TC’s characteristic factors; examples include the fuzzy multicriteria decision support model [
This paper proposes a new TC forecast method based on historical data that combines data mining technology with statistical methods. First, the region where typhoon centers were located 48 hours (or 72 hours, as a comparative experiment) before landing on Hainan Island is divided into five (or a number of
The data used in this research are the TC best track datasets for the years 1949~2012 in the northwestern Pacific (including the South China Sea, north of the equator and west of 180°E) [
Set the moments of current observation point and previous two observation points as
As mentioned in the Introduction, a landing TC is defined as a TC for which the minimum distance between the typhoon center and Hainan Island is no more than the preset influencing radius. Hence, to distinguish landing TCs from TCs not landing on Hainan Island, the minimum distance between each TC’s track and the outer boundary of Hainan Island needs to be calculated from the CMA best track datasets. Because TC tracks vary widely, applying general curve fitting directly does not give a good result. Therefore, polynomial fitting [
The curve fitting of the outer boundary of Hainan Island.
The curve fitting of an arbitrary TC.
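The landing test described above can be sketched as follows. This is a minimal illustration, not the fitting procedure of this paper: instead of polynomial curves, the track and the island boundary are represented by sampled points, and the distance is the haversine great-circle distance. The function names, the Earth radius constant, the track points, and the radius values are all assumptions made for illustration; the four boundary points are simply the corners of the bounding box of Hainan Island quoted in the Introduction.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_distance_km(track, boundary):
    """Smallest distance between any track point and any boundary point."""
    return min(
        great_circle_km(t_lat, t_lon, b_lat, b_lon)
        for t_lat, t_lon in track
        for b_lat, b_lon in boundary
    )


def is_landing(track, boundary, radius_km):
    """A TC counts as landing if it comes within the preset influencing radius."""
    return min_distance_km(track, boundary) <= radius_km


# Corners of the bounding box 108°37′E~111°05′E, 18°10′N~20°10′N as (lat, lon).
hainan_box = [(18.1667, 108.6167), (18.1667, 111.0833),
              (20.1667, 108.6167), (20.1667, 111.0833)]
# Hypothetical track points (lat, lon), for illustration only.
track = [(15.0, 118.0), (17.0, 114.0), (18.5, 111.5)]

print(round(min_distance_km(track, hainan_box), 1))
print(is_landing(track, hainan_box, radius_km=100.0))
```

With densely sampled curves in place of the few points used here, the same minimum over point pairs approximates the minimum distance between the two fitted curves.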
The time intervals used to forecast TCs by three main prediction centers (NHC, JMA, and NMCC) are 24, 48, and 72 hours. In order to forecast TCs in a timely manner and compare forecast accuracy among different methods, here the region where TCs’ centers were located 48 hours before they landed on Hainan Island (shown in Figure
The region selected as research object.
For each of the five areas, all the observation points of both landing and not landing TCs which entered this area are selected. With strength grade, latitude, longitude, air pressure, wind speed, latitude migration velocity, and longitude migration velocity as classification properties, the TC landing criterion of each area is formed by using the CART algorithm [
Flow diagram of forming landing criteria.
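To illustrate how a CART-style criterion is formed, the sketch below finds the single best threshold on one classification property by minimizing the weighted Gini impurity, which is the basic step that CART repeats recursively over all properties. It is a simplified illustration under stated assumptions: only one property is used, the labels are binary (1 for landing, 0 for not landing), and the latitude values are hypothetical rather than taken from the best track datasets.

```python
def gini(labels):
    """Gini impurity of a binary label list (1 = landing, 0 = not landing)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of landing samples
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split at all is the baseline
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical latitudes of observation points and their landing labels.
latitudes = [14.0, 15.0, 16.0, 19.0, 20.0, 21.0]
landed = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(latitudes, landed)
print(threshold, impurity)
```

A full criterion would apply the same search to each of the seven classification properties, keep the property with the lowest impurity, and then repeat the procedure within each resulting subset.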
For TCs which are judged to be landing on Hainan Island, PRESS and its progressive optimal algorithm and MLRM can be used to forecast the TCs’ characteristic factors (latitude, longitude, the lowest center pressure, and wind speed). Forecasts in this paper comprise a landing prediction pattern and a dynamic prediction pattern. The landing prediction pattern uses the information of the observation point at which a TC first enters any area to predict the characteristic factors when the TC lands. The dynamic prediction pattern produces 24-hour and 48-hour predictions from an observation point that has entered any area. The flow diagrams of the landing forecast pattern and the dynamic forecast pattern are shown in Figures
Flow diagram of landing prediction pattern.
Flow diagram of dynamic prediction pattern.
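As a rough illustration of the PRESS statistic (commonly, the prediction sum of squares), the sketch below computes it by explicit leave-one-out refitting of a one-predictor linear regression and uses it to choose between two candidate predictors. This is only a sketch under stated assumptions: the progressive optimal algorithm of this paper is not reproduced, a single predictor stands in for the multiple regression, and the data values and predictor names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(xs, ys):
    """Sum of squared errors when each sample is predicted without itself."""
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total


# Hypothetical wind speed at landing and two hypothetical candidate predictors.
ws_at_landing = [20.0, 25.0, 30.0, 35.0, 40.0]
candidates = {
    "current_ws": [18.0, 24.0, 29.0, 36.0, 41.0],
    "unrelated": [3.0, 1.0, 4.0, 1.0, 5.0],
}

scores = {name: press(xs, ws_at_landing) for name, xs in candidates.items()}
best_predictor = min(scores, key=scores.get)
print(best_predictor)
```

A lower PRESS indicates that the equation predicts unseen samples better, which is why it is a natural criterion for selecting predictors before the regression equations are established.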
This section first examines the situation in which the research object is the region where TCs’ centers were located 48 hours before they landed and this research object is divided into five areas. Other situations of a comparative experiment, in which the research object may be the region where TCs’ centers were located 72 hours before they landed and the number of areas into which the research object is divided may be any number of
The geometric centers and scopes of five areas.
Area number  Longitude  Latitude  Radius 

1  118.4511  15.2845  2.7902 
2  113.1435  14.9623  3.1706 
3  116.9517  20.8023  3.3048 
4  123.9859  14.4951  2.7098 
5  122.6830  18.5854  2.8346 
The number of OPs in five areas.
Area number  Number of landing  Number of not landing  The ratio of landing 

1  537  1031  34.25% 
2  855  1389  38.10% 
3  788  1514  34.23% 
4  378  1070  26.10% 
5  340  1419  19.33% 
The positions of OPs for landing and not landing TCs which entered into each area.
According to the CART algorithm, the landing criteria in five areas are shown in Figure
Area number 




1  74.68%  17.94%  39.48% 
2  75.31%  15.55%  39.53% 
3  77.50%  11.76%  43.15% 
4  66.23%  36.54%  25.93% 
5  79.76%  16.49%  35.88% 
The landing criteria in five areas.
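The per-area results can be summarized with a short arithmetic check. The sketch below takes the first numeric column of the table above to be the probability of accurate judgment for each area, which is an assumption consistent with the 74.70% average and 79.76% maximum quoted in the conclusions; the variable names are illustrative only.

```python
# First numeric column of the five-area table, in percent (assumed to be the
# probability of accurate judgment for areas 1 to 5).
accuracy_by_area = {1: 74.68, 2: 75.31, 3: 77.50, 4: 66.23, 5: 79.76}

mean_accuracy = sum(accuracy_by_area.values()) / len(accuracy_by_area)
best_area = max(accuracy_by_area, key=accuracy_by_area.get)

print(f"mean accuracy: {mean_accuracy:.2f}%")
print(f"best area: {best_area} ({accuracy_by_area[best_area]:.2f}%)")
```

Note that this is a simple unweighted mean over areas; it does not weight each area by its number of observation points.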
In Sections
To distinguish the different situations, their labels are given in Table
The labels of different situations.
Ti  Nu  Label of corresponding situation 

48  1  FE1 
48  3  FE3 
48  5  FE5 
48  7  FE7 
72  1  ST1 
72  3  ST3 
72  5  ST5 
72  7  ST7 
It can be seen from Table
The
Area number 




1  0.7884  0.0696  0.5279 
Area number 




1  0.7454  0.1139  0.5191 
2  0.7636  0.1875  0.3054 
3  0.7829  0.0446  0.7814 
Area number 




1  0.8522  0.0154  0.8414 
2  0.7486  0.1991  0.3959 
3  0.7302  0.1808  0.4176 
4  0.7719  0.1165  0.4398 
5  0.7236  0.0740  0.6735 
6  0.7834  0.1512  0.2975 
7  0.7631  0  1 
Area number 




1  0.8164  0.0412  0.6998 
Area number 




1  0.8320  0.0677  0.5877 
2  0.7396  0.2062  0.3462 
3  0.8270  0  1 
Area number 




1  0.8284  0  1 
2  0.7613  0  1 
3  0.7566  0.1590  0.3564 
4  0.8596  0  1 
5  0.8281  0.0678  0.5813 
Area number 




1  0.7874  0  1 
2  0.7020  0.1664  0.4723 
3  0.8799  0.0118  0.8777 
4  0.8358  0  1 
5  0.8537  0  1 
6  0.7761  0.0955  0.5047 
7  0.7407  0.1028  0.6720 
In order to select the best of these eight situations in Table
Clearly, the higher the Index is, the better the overall result of the landing criteria is. The Index for each of the eight situations is shown in Table
The Index for each of eight situations.
Label of corresponding situation  Index 

FE1  0.6750 
FE3  0.6659 
FE5  0.7026 
FE7  0.6625 
ST1  0.6297 
ST3  0.6572 
ST5  0.6391 
ST7  0.6231 
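Since a higher Index indicates a better overall result, selecting the best situation amounts to taking the maximum over the table above. The short sketch below does this and decodes the chosen label using the labeling table given earlier (FE for 48 hours, ST for 72 hours, followed by the number of areas); the helper name `decode` is an assumption introduced for illustration.

```python
# Index values copied from the table above.
index_by_situation = {
    "FE1": 0.6750, "FE3": 0.6659, "FE5": 0.7026, "FE7": 0.6625,
    "ST1": 0.6297, "ST3": 0.6572, "ST5": 0.6391, "ST7": 0.6231,
}


def decode(label):
    """Split a situation label into (hours before landing, number of areas)."""
    hours = {"FE": 48, "ST": 72}[label[:2]]
    return hours, int(label[2:])


best_label = max(index_by_situation, key=index_by_situation.get)
print(best_label, decode(best_label))
```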
The landing forecast pattern is defined as follows: the observation point information (seven predictors) obtained when a landing TC’s center first enters any area is used to forecast the characteristic factors (LAT, LON, AP, and WS) when the TC lands on Hainan Island. The flow diagram is shown in Figure
The OPs of historical landing TCs’ centers when they first entered area 1 (LP: landing position).
The tracks of historical landing TCs which passed through area 1.
The historical landing TCs passing through each area are divided into two groups of equal size: one group is used to establish the prediction equations and the other group is used to test the accuracy of these equations. The test results of these prediction equations for TCs which passed through each area are shown in Table
Prediction results in landing prediction pattern.
Area number  The mean/SD of LAT deviation  The mean/SD of LON deviation  The mean/SD of AP deviation  The mean/SD of WS deviation 

1  0.9849/0.8671  0.7264/0.6404  9.2889/6.7882  5.9540/4.3839 
2  0.5268/0.6009  0.8677/0.8973  7.9946/6.5339  5.3036/3.8997 
3  0.9710/0.7805  0.6340/0.7711  10.6941/8.3758  7.0915/5.2637 
4  1.1537/0.8640  0.7150/0.6162  10.3463/6.1761  6.8151/4.7498 
5  1.1152/0.6941  0.6728/0.7103  12.1138/8.5744  8.0403/5.3759 
The mean of GCD for each area in landing prediction pattern.
The standard deviation of GCD for each area in landing prediction pattern.
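The split-and-test procedure used for the landing prediction pattern can be sketched as below: half of the samples establish a multiple linear regression equation and the other half test it, with the mean and standard deviation of the absolute deviation reported. This is a minimal sketch under stated assumptions and not the equations of this paper: only two predictors (LAT and LON at first entry) are used instead of seven, the samples are synthetic and generated from an assumed exact linear relation, the alternating split is one possible way to form two equal groups, and the population standard deviation is an arbitrary choice.

```python
import statistics


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, ys):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Synthetic (LAT, LON) at first entry and an assumed linear landing latitude.
samples = [(15.0, 118.0), (14.5, 119.5), (16.0, 117.0), (15.5, 120.0),
           (14.0, 118.5), (16.5, 119.0), (15.2, 117.5), (14.8, 120.5)]
landing_lat = [2.0 + 0.9 * lat + 0.02 * lon for lat, lon in samples]

# Two groups of equal size: one to establish the equation, one to test it.
train_x, test_x = samples[0::2], samples[1::2]
train_y, test_y = landing_lat[0::2], landing_lat[1::2]

coef = fit_mlr(train_x, train_y)
deviations = [abs(predict(coef, x) - y) for x, y in zip(test_x, test_y)]
mean_dev = statistics.mean(deviations)
sd_dev = statistics.pstdev(deviations)
print(mean_dev, sd_dev)
```

Because the synthetic data follow the assumed relation exactly, the test deviations here are essentially zero; with real observation points the same two statistics correspond to the mean/SD entries reported for each characteristic factor.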
The dynamic prediction pattern uses the current observation point information to produce 24-hour and 48-hour forecasts, as also illustrated in Figure
In actual prediction, for TCs which are judged to be landing on Hainan Island, the observation point information when their centers enter any area for the first time is used with the equations established in forecast model 1 to conduct dynamic prediction. Furthermore, the observation point information when TCs’ centers are in any area (not necessarily entering this area for the first time) can be used with the equations established in forecast model 2 to conduct dynamic prediction.
Similar to Section
The results of testing forecast model 1 and forecast model 2.
Forecast hour  Area number  Forecast model  The mean/SD of LAT deviation  The mean/SD of LON deviation  The mean/SD of AP deviation  The mean/SD of WS deviation 

24 h  1  1  0.8128/0.6515  1.0295/0.7600  4.9278/3.9733  3.7787/2.8858 
24 h  1  2  0.7004/0.5153  0.8802/0.7011  5.4114/4.3187  3.9637/2.9843 
24 h  2  1  0.9131/0.5832  0.9864/0.7794  6.3412/4.4247  3.8268/2.7857 
24 h  2  2  0.7538/0.6538  1.0050/0.8670  4.9175/4.9673  3.8487/3.0261 
24 h  3  1  0.7555/0.5945  1.0528/0.9093  7.8149/6.9316  4.7342/4.1782 
24 h  3  2  0.7727/0.6183  0.8949/0.7255  7.3489/6.6512  5.1718/4.3368 
24 h  4  1  0.6582/0.5365  0.8797/0.6251  6.6373/5.5333  3.6128/2.6486 
24 h  4  2  0.6404/0.4911  0.9500/0.6746  6.2280/5.6617  3.7521/2.5645 
24 h  5  1  0.6899/0.4989  0.8618/0.6124  6.2218/6.3820  2.9503/2.5823 
24 h  5  2  0.6996/0.5634  1.0017/0.7915  6.3203/6.1131  4.1867/3.2785 
48 h  1  1  1.2198/0.8343  1.7735/1.4304  9.7782/5.7246  6.8402/4.0289 
48 h  1  2  1.2598/0.8949  1.6941/1.2488  9.4557/7.2439  6.5030/4.4361 
48 h  2  1  1.2250/0.9604  2.1960/1.6746  10.9775/8.4764  7.9032/5.9307 
48 h  2  2  1.2857/1.0131  2.0330/1.5552  8.1434/6.3813  6.5293/4.3273 
48 h  3  1  1.2407/0.7253  2.1655/1.7786  9.6580/6.9204  7.5541/5.2989 
48 h  3  2  1.3250/0.9565  1.8414/1.5666  8.1024/7.2309  6.2452/5.3772 
48 h  4  1  0.9162/0.8120  1.7618/1.3047  7.9009/5.5968  5.7899/4.5371 
48 h  4  2  0.9896/0.7848  1.5020/1.1732  7.0978/4.8208  5.2826/3.7708 
48 h  5  1  0.9650/0.7833  1.6960/1.4327  9.1107/7.4706  8.7411/6.4006 
48 h  5  2  1.1315/1.0845  1.9731/1.5516  9.2449/6.8041  6.8057/4.8502 
The results of statistical significance tests for each equation used to forecast corresponding characteristic factor in forecast model 1 and forecast model 2.
Forecast hour  Area number  Forecast model (1 or 2) 





24 h  1  1  0.5883/9.52 × 10^{−11}  0.6011/7.91 × 10^{−9}  0.6222/4.46 × 10^{−10}  0.7615/5.47 × 10^{−15} 
24 h  1  2  0.4493/2.26 × 10^{−32}  0.5809/5.93 × 10^{−50}  0.7030/1.31 × 10^{−68}  0.7731/1.01 × 10^{−81} 
24 h  2  1  0.7177/1.86 × 10^{−14}  0.4391/3.66 × 10^{−5}  0.7556/5.03 × 10^{−16}  0.7995/3.57 × 10^{−18} 
24 h  2  2  0.6808/5.17 × 10^{−101}  0.6753/1.77 × 10^{−99}  0.7481/2.13 × 10^{−122}  0.7277/3.71 × 10^{−114} 
24 h  3  1  0.6471/6.36 × 10^{−11}  0.6559/3.48 × 10^{−11}  0.6310/1.83 × 10^{−10}  0.6495/3.24 × 10^{−10} 
24 h  3  2  0.6802/6.46 × 10^{−94}  0.7506/1.19 × 10^{−114}  0.5509/1.40 × 10^{−62}  0.5573/9.84 × 10^{−65} 
24 h  4  1  0.7572/1.05 × 10^{−12}  0.5348/3.32 × 10^{−8}  0.7594/1.15 × 10^{−13}  0.8174/2.47 × 10^{−15} 
24 h  4  2  0.6405/4.83 × 10^{−42}  0.6266/1.64 × 10^{−40}  0.6903/1.01 × 10^{−44}  0.7652/9.25 × 10^{−57} 
24 h  5  1  0.5031/2.40 × 10^{−6}  0.5154/1.51 × 10^{−6}  0.8732/2.73 × 10^{−14}  0.8524/5.03 × 10^{−15} 
24 h  5  2  0.6239/2.84 × 10^{−35}  0.6464/1.67 × 10^{−37}  0.7835/6.98 × 10^{−52}  0.7550/1.52 × 10^{−48} 
48 h  1  1  0.1337/0.0240  0.3943/3.91 × 10^{−5}  0.3208/1.77 × 10^{−4}  0.4725/5.34 × 10^{−6} 
48 h  1  2  0.2177/0.0049  0.0399/0.3398  0.2532/0.0016  0.4275/7.92 × 10^{−6} 
48 h  2  1  0.4515/9.05 × 10^{−6}  0.2668/0.0218  0.6720/1.72 × 10^{−10}  0.4622/5.99 × 10^{−6} 
48 h  2  2  0.4022/1.64 × 10^{−36}  0.3095/6.51 × 10^{−26}  0.2960/5.05 × 10^{−23}  0.3332/6.27 × 10^{−27} 
48 h  3  1  0.2251/0.0218  0.3935/4.57 × 10^{−5}  0.0026/0.7270  0.1362/0.1596 
48 h  3  2  0.3073/6.18 × 10^{−25}  0.4369/2.13 × 10^{−39}  0.1768/2.51 × 10^{−12}  0.2479/8.91 × 10^{−18} 
48 h  4  1  0.4992/9.60 × 10^{−7}  0.3055/2.74 × 10^{−4}  0.4303/1.53 × 10^{−5}  0.6249/1.84 × 10^{−9} 
48 h  4  2  0.3323/3.77 × 10^{−16}  0.3270/1.02 × 10^{−16}  0.2471/1.70 × 10^{−9}  0.3580/3.65 × 10^{−16} 
48 h  5  1  0.4546/6.19 × 10^{−5}  0.2554/0.0043  0.3705/1.91 × 10^{−4}  0.3535/0.0012 
48 h  5  2  0.4219/1.09 × 10^{−18}  0.3825/3.44 × 10^{−17}  0.2120/2.18 × 10^{−7}  0.3322/2.46 × 10^{−12} 
The mean/standard deviation of GCD for 24 hours’ forecast of two forecast models.
The mean of GCD for 24 hours’ forecast of two forecast models
The standard deviation of GCD for 24 hours’ forecast of two forecast models
The mean/standard deviation of GCD for 48 hours’ forecast of two forecast models.
The mean of GCD for 48 hours’ forecast of two forecast models
The standard deviation of GCD for 48 hours’ forecast of two forecast models
In this paper, the CMA best track datasets from 1949 to 2012 are used, in combination with data mining technology and statistical methods, to put forward a new methodology to forecast TCs’ characteristic factors. This methodology can judge whether TCs land on Hainan Island or not and forecast their characteristic factors (longitude, latitude, the lowest center pressure, and wind speed). The average probability of accurate judgment for the landing criteria is 74.70%, and the highest accuracy reaches 79.76%. For the forecast of landing TCs’ characteristic factors, a landing prediction pattern and a dynamic prediction pattern are proposed, which not only forecast the characteristic factors when TCs land but also provide dynamic 24-hour and 48-hour forecasts. The landing prediction pattern performs better: its mean GCD is 144.6382 km, compared with 222.6 km for the current 48-hour forecast in the South China Sea. Although the mean GCD of the dynamic prediction pattern is no smaller than those of the three main prediction centers (NHC, JMA, and NMCC), it is much smaller than that of the numerical prediction model in [
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported in part by the National Science-Technology Support Plan in the domain of advanced energy technology. Hainan Power Grid Corporation provided support for this research through the “Integrated Demonstration Project of Regional Smart Grid” no. 2013BAA01B03. Dr. Xinhong Huang, a Research Engineer in the Department of Electrical & Computer Engineering at the University of Western Ontario, also provided valuable revision comments on this paper.