Prediction of Tropical Cyclones’ Characteristic Factors on Hainan Island Using Data Mining Technology

A new methodology combining data mining technology with statistical methods is proposed for the prediction of tropical cyclones’ (TCs’) characteristic factors, namely, latitude, longitude, the lowest center pressure, and wind speed. The method uses the best track datasets for the years 1949∼2012. With this method, effective criteria are formed to judge whether TCs land on Hainan Island or not. The highest probability of accurate judgment exceeds 79%. For TCs which are judged to land on Hainan Island, prediction equations are established to predict their characteristic factors. Results show that the average distance error is improved compared with that of the National Meteorological Centre of China.


Introduction
A typhoon is a kind of tropical cyclone (TC) whose sustained wind speed near the center reaches level 12 to level 13 (typhoons are not distinguished from TCs in this paper unless specially emphasized). Hainan Island (108°37′E∼111°05′E, 18°10′N∼20°10′N) in China is well known as a "typhoon corridor." According to the analysis of historical data on TCs landing on Hainan Island, the yearly and monthly statistical results are shown in Figures 1 and 2, respectively. (Note: in this paper a typhoon is considered to land if the minimum distance between the typhoon center and Hainan Island is no more than the preset influencing radius, which is 300 km herein.) The frequency of TCs landing on Hainan Island is thus very high. Moreover, typhoons rank at the top among all kinds of disasters on Hainan Island. Taking typhoon "Damrey" in 2005 as an example, it destroyed 18 cities of Hainan and affected up to 6.305 million people, of whom 21 were killed. The direct economic loss reached 12.1 billion RMB [1]. Therefore, timely and accurate forecasting of TCs is very important for disaster prevention on Hainan Island and can effectively reduce the damage and loss caused by a TC when it happens.
The traditional TC forecast methods are mainly statistical methods and dynamic methods, most of which involve complicated processes or have low precision. The statistical methods use historical TC positions, intensity, and so on to predict TCs' characteristic factors; examples include a fuzzy multicriteria decision support model [2], conditional nonlinear optimal perturbation, first singular vector, and ensemble transform Kalman filter [3], a back propagation neural network [4], an adaptive neural network classifier using a two-layer feature selector [5], and a support vector machine using data reduction methods [6]. Dynamic methods are mainly based on numerical forecasting, such as a simplified dynamical system based on a logistic growth equation (LGE) [7], a regional coupled atmosphere-ocean model [8], the PSU-NCAR Mesoscale Model version 5 [9], and the GFDL 25-km-Resolution Global Atmospheric Model [10]. Taking the three main prediction centers as examples, the average distance error of the 24/48 hours' forecast is 106/187 km for the USA National Hurricane Center (NHC), 125/243 km for the Japan Meteorological Agency (JMA), and 120/215 km for the National Meteorological Centre of China (NMCC) [11]. Zhang et al. compared the monitoring data from the HY-2 and QuikSCAT satellite scatterometers with actual typhoon data from ground observation. The result shows that the deviations of typhoon path and intensity are large and their standard deviations are also very large [12]. Therefore, although many typhoon forecast methods are in use at present, their precision still cannot meet the need for real-time typhoon warning.
By using data mining technology in combination with statistical methods, a new TC forecast method based on historical data is proposed in this paper. Firstly, the region where typhoon centers were located 48 (or 72 as a comparative experiment) hours before landing on Hainan Island is divided into five (or a number from {1, 3, 7} as a comparative experiment) areas using the K-means clustering algorithm. Then the TC landing criterion of each area is formed by classification and regression trees (CART). Further, the prediction sum of squares (PRESS) algorithm and its progressive optimal algorithm are applied to optimize the forecast factor sets. Finally, part of the historical data is used to establish prediction equations with a multiple linear regression model (MLRM), and the accuracy of these equations is examined with the remaining historical data. The results show that this methodology is more accurate than existing forecast methods.

Data and Methodology
2.1. Data. The data used in this research are the TC best track datasets for the years 1949∼2012 in the northwestern Pacific waters (including the South China Sea, north of the equator and west of 180°E) [13], which are obtained from the TC information center of the China Meteorological Administration (CMA) (http://tcdata.typhoon.gov.cn/). The CMA best track datasets contain 2172 TCs with a total of 62663 observation points. Every observation point may provide the following information: the observation time, strength grade, latitude, longitude, the lowest center pressure (hereinafter referred to as air pressure), the 2-minute average maximum wind speed near the center (hereinafter referred to as wind speed), and the average wind speed in 2 minutes. Because the average wind speed in 2 minutes cannot be obtained for most observation points, strength grade (SG), latitude (LAT), longitude (LON), air pressure (AP), wind speed (WS), latitude migration velocity (LATMV), and longitude migration velocity (LONMV) are selected as seven predictors (hereinafter referred to as observation point information). The current LATMV and LONMV are calculated as follows.
Let the positions of the current observation point and the previous two observation points be $(x_i, y_i)$, $(x_{i-1}, y_{i-1})$, and $(x_{i-2}, y_{i-2})$, respectively, where $x$ stands for longitude and $y$ stands for latitude. The LONMV and LATMV at the current observation time, denoted $\mathrm{LONMV}_i$ and $\mathrm{LATMV}_i$, are calculated by formulas (1) and (2), in which $R$ is the mean radius of the Earth (6370.856 km) and $\mathrm{sgn}(\cdot)$ is the sign function; the unit of $\mathrm{LONMV}_i$ and $\mathrm{LATMV}_i$ is km/h.
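As a rough illustration of the two velocity predictors, the sketch below computes a signed eastward and northward speed in km/h between two fixes. It is only one plausible reading and is not claimed to match formulas (1) and (2): the function name `migration_velocity`, the use of the arc along the current parallel for the zonal displacement, and the explicit `hours` argument are all assumptions made for illustration.

```python
import math

R_EARTH_KM = 6370.856  # mean Earth radius quoted in the text


def migration_velocity(lon_prev, lat_prev, lon_cur, lat_cur, hours):
    """Return (LONMV, LATMV) in km/h between two fixes given in degrees.

    Assumption: displacement is measured as an arc length on a sphere,
    positive eastward for longitude and positive northward for latitude.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    dlat = math.radians(lat_cur - lat_prev)
    dlon = math.radians(lon_cur - lon_prev)
    # Meridional arc length is R * dlat.
    latmv = R_EARTH_KM * dlat / hours
    # Zonal arc length along the current parallel is R * cos(lat) * dlon.
    lonmv = R_EARTH_KM * math.cos(math.radians(lat_cur)) * dlon / hours
    return lonmv, latmv
```

For example, a center that moves one degree due north in 6 hours gives a LATMV of about 18.5 km/h and a LONMV of zero.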

Methodology.
As mentioned in the Introduction, a landing TC is defined as a TC for which the minimum distance between the typhoon center and Hainan Island is no more than the preset influencing radius. Hence, in order to distinguish TCs that land on Hainan Island from those that do not, the minimum distance between each TC's track and the outer boundary of Hainan Island needs to be calculated from the CMA best track datasets. Because TC tracks vary widely, applying general curve fitting directly does not give a good result. Therefore, polynomial fitting [14] is applied in this paper, in which an intermediate variable is introduced and the latitude and the longitude are each fitted against it. Taking an arbitrary TC as an example, the fitting results are shown in Figures 3 and 4. Using the fitted polynomials of the latitude and longitude of each TC's track and of the outer boundary of Hainan Island with respect to the corresponding intermediate variables, the distance between any point of each TC's track and any point on the outer boundary of Hainan Island can be calculated, from which the minimum distance is selected. The great circle distance (GCD), which is the shortest distance between two points on the surface of the Earth, is calculated using formula (3). For any two points $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$ on the Earth, the GCD is

$$\mathrm{GCD} = R \arccos\left[\sin y_1 \sin y_2 + \cos y_1 \cos y_2 \cos(x_1 - x_2)\right], \quad (3)$$

where $x$ is longitude, $y$ is latitude, and $R$ is the mean radius of the Earth.

The time intervals used to forecast TCs by the three main prediction centers (NHC, JMA, and NMCC) are 24, 48, and 72 hours. In order to forecast TCs in a timely manner and to compare forecast accuracy among different methods, the region where TCs' centers were located 48 hours before they landed on Hainan Island (shown in Figure 5) is selected as the research object. In order to narrow the research scope, the K-means clustering algorithm [15,16] is applied to divide this region into five areas. In this section, the situation in which the research object is the region where TCs' centers were located 48 hours before they landed, divided into five areas, is taken as the example for convenience of statement. Other situations are examined as a comparative experiment in Section 3.3.
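The landing test described above can be sketched as follows, under stated assumptions. The great circle distance uses the spherical law of cosines with the mean Earth radius quoted in Section 2.1. Instead of the fitted polynomials, the sketch simply takes the minimum over discretely sampled track and boundary points; the function names and the `radius_km` default of 300 km (the influencing radius given in the Introduction) are illustrative choices, not the paper's code.

```python
import math

R_EARTH_KM = 6370.856  # mean Earth radius quoted in the text


def great_circle_km(lon1, lat1, lon2, lat2):
    """Great circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    c = math.sin(p1) * math.sin(p2) + math.cos(p1) * math.cos(p2) * math.cos(dlon)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return R_EARTH_KM * math.acos(max(-1.0, min(1.0, c)))


def min_distance_km(track, boundary):
    """Smallest distance between any sampled track point and boundary point."""
    return min(
        great_circle_km(tx, ty, bx, by)
        for tx, ty in track
        for bx, by in boundary
    )


def lands(track, boundary, radius_km=300.0):
    """A TC counts as landing if it comes within radius_km of the boundary."""
    return min_distance_km(track, boundary) <= radius_km
```

Here `track` and `boundary` are lists of (longitude, latitude) pairs. Sampling the fitted curves densely before calling `min_distance_km` would approximate the continuous minimum described in the text.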
For each of the five areas, all the observation points of both landing and not landing TCs which entered this area are selected. With strength grade, latitude, longitude, air pressure, wind speed, latitude migration velocity, and longitude migration velocity as classification attributes, the TC landing criterion of each area is formed by using the CART algorithm [17,18]. The flow diagram for forming the landing criteria is shown in Figure 6.
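To make the CART step concrete, the sketch below shows the core operation of a classification tree: searching every predictor and every midpoint threshold for the binary split with the lowest weighted Gini impurity. It covers a single node only (a full tree applies it recursively and then prunes), and the names `gini` and `best_split` and the 0/1 label encoding (1 for landing, 0 for not landing) are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split.

    rows is a list of predictor lists; labels is a list of 0/1 values.
    Returns None when no feature has two distinct values.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best
```

A rule such as "predictor 0 is at most 22" read off the returned pair is the kind of threshold test that a landing criterion is composed of.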
For TCs which are judged to be landing on Hainan Island, PRESS with its progressive optimal algorithm and MLRM are used to forecast the TCs' characteristic factors (latitude, longitude, the lowest center pressure, and wind speed). Forecasts in this paper comprise a landing prediction pattern and a dynamic prediction pattern. The landing prediction pattern employs the observation point information of the point at which a TC first enters any area to predict the characteristic factors when the TC lands. The dynamic prediction pattern is the 24 hours' and 48 hours' prediction with respect to an observation point which enters any area. The flow diagrams of the landing prediction pattern and the dynamic prediction pattern are shown in Figures 7 and 8, respectively. Here PRESS [19] and its progressive optimal algorithm [20,21] are used to select, from the seven predictors, the best forecast factor set for each characteristic factor. MLRM [22] is used to establish the corresponding forecast equations. MLRM is expressed as [23]

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad (4)$$

where $y_i$ is the estimated value, $\beta_0 \sim \beta_p$ are the regression coefficients, $\varepsilon_i$ is the random error, and $x_{i1} \sim x_{ip}$ are the forecast factors of the observation point.
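The interplay of PRESS and MLRM can be sketched in plain Python as below. The regression is fitted through the normal equations, PRESS is computed by leave-one-out refitting, and the predictor subset with the smallest PRESS is chosen. Note the assumptions: the exhaustive search in `best_subset` is a simple stand-in for the progressive optimal algorithm of [20,21], whose details are not given here, and all function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit(xs, ys):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, x):
    """Evaluate b0 + b1*x1 + ... for one observation."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def press(xs, ys, cols):
    """Leave-one-out prediction sum of squares for the predictor subset cols."""
    sub = [[x[c] for c in cols] for x in xs]
    total = 0.0
    for i in range(len(sub)):
        beta = fit(sub[:i] + sub[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(beta, sub[i])) ** 2
    return total


def best_subset(xs, ys):
    """Exhaustive search (assumed stand-in) for the subset minimizing PRESS."""
    p = len(xs[0])
    candidates = (c for k in range(1, p + 1) for c in combinations(range(p), k))
    return min(candidates, key=lambda c: press(xs, ys, list(c)))
```

With seven predictors the exhaustive search visits 127 subsets, which is cheap; it does assume `ys` is a list and that each subset leaves enough non-collinear observations for the fit to be solvable.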

Results and Discussions
In this section, the situation in which the research object is the region where TCs' centers were located 48 hours before they landed, divided into five areas, is studied first. Other situations, in which the research object may be the region where TCs' centers were located 72 hours before they landed and the number of areas may be any of {1, 3, 5, 7}, are discussed as a comparative experiment at the end of the section.

Dividing the Research Region into Five Areas.
The K-means clustering algorithm is used to divide the research region into five areas as described in Section 2.2; their geometric centers and scopes are shown in Table 1. For each area, all the observation points of both landing and not landing TCs which entered this area are selected. The positions of these observation points are shown in Figure 9 and are used to form the TC landing criteria. The numbers of these observation points (OPs) for both landing and not landing TCs are shown in Table 2. The division into five areas further narrows the research scope and makes the selection of observation points more pertinent so as to form effective landing criteria, which is illustrated further in Section 3.3.
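The area division can be illustrated with a minimal version of Lloyd's K-means iteration on (longitude, latitude) pairs. This is a sketch under assumptions: it uses squared Euclidean distance directly in degree space and random initial centers drawn with a fixed seed, neither of which is stated in the text, and the function name `kmeans` is hypothetical.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Cluster (lon, lat) tuples into k groups; return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans(points, 5)` on the positions of TC centers 48 hours before landing would yield five centers comparable in role to the geometric centers listed in Table 1, although the exact values depend on the initialization.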

The Formation of Landing Criteria in Five Areas.
According to the CART algorithm, the landing criteria in the five areas are shown in Figure 10 (refer to Section 2.1 for the meaning of the seven predictors). The corresponding probability of accurate judgment ($P_{AJ}$), probability of false alarm ($P_{FA}$), and probability of false dismissal ($P_{FD}$) are shown in Table 3. Let the numbers of OPs for landing and not landing TCs in any area be $N_1$ and $N_2$, respectively; the number of OPs which are judged to be landing according to the landing criteria and which truly landed is denoted as $n_1$, and the number of OPs which are judged to be not landing and which truly did not land is denoted as $n_2$. Then $P_{AJ}$, $P_{FA}$, and $P_{FD}$ of this area are calculated according to formula (5).

The different situations are compared with each other. Finally, we select the situation which produces the best result.

Advances in Meteorology
In order to distinguish the different situations, their labels are listed in Table 4, where the parameter Ti is used to … It can be seen from Table 4 that FE5 is the situation which has been researched in Sections 3.1 and 3.2. The remaining seven situations are researched as follows, using methods identical to those used for FE5.
The $P_{AJ}$, $P_{FA}$, and $P_{FD}$ of each area for each of the remaining seven situations are calculated according to formula (5), and the results are shown in Table 5.
In order to select the best of the eight situations in Table 4, an evaluation method is introduced in which an Index is calculated for each situation according to formula (6). For any situation, $m$ denotes the number of areas into which the research object is divided, and $P_{AJ}^{(i)}$, $P_{FA}^{(i)}$, and $P_{FD}^{(i)}$ denote the $P_{AJ}$, $P_{FA}$, and $P_{FD}$ of the $i$th area, respectively. The higher the Index, the better the overall result of the landing criteria. The Index for each of the eight situations is shown in Table 6. Situation FE5 shows the best result, which also indicates that the research scheme in Sections 3.1 and 3.2 is better than the other situations. Finally, situation FE5 is selected to form the landing criteria.

Landing Prediction Pattern.
The landing prediction pattern is defined as follows: the observation point information (seven predictors) obtained when a landing TC's center first enters any area is used to forecast the characteristic factors (LAT, LON, AP, and WS) when the TC lands on Hainan Island. The flow diagram is shown in Figure 7. Taking area 1 as an example, the OPs of historical landing TCs' centers when they first entered area 1 are shown in Figure 11, and the tracks of historical landing TCs which passed through area 1 are shown in Figure 12.
The historical landing TCs passing through each area are divided into two groups of equal size: one group is used to establish the prediction equations and the other group is used to test the accuracy of these equations. The test results of the prediction equations for TCs which passed through each area are shown in Table 7. Using the actual and predicted longitude and latitude of TCs' centers together with formula (3), the mean and standard deviation of GCD in the landing prediction pattern are calculated and shown in Figures 13 and 14, respectively. Averaged over the five areas, the mean/standard deviation (SD) of GCD is 144.6382/97.8740 km. In [24], Yu et al. analyze the average GCD error of 48 hours' forecast in the South China Sea, which is 222.6 km. Therefore, the landing prediction pattern proposed in this paper shows good prediction accuracy. For TCs which are judged to be landing on Hainan Island, as long as the observation point information when their centers first enter any area is obtained, the corresponding forecast equations can be used to predict the characteristic factors when they land.
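The evaluation described above (split the historical TCs into two equal groups, then summarize the position error of the test group) can be sketched as follows. The assumptions are that the split is a simple first-half and second-half cut, that the standard deviation is the sample standard deviation, and that positions are (longitude, latitude) pairs in degrees; the function names are hypothetical.

```python
import math
import statistics

R_EARTH_KM = 6370.856  # mean Earth radius quoted in the text


def gcd_km(lon1, lat1, lon2, lat2):
    """Great circle distance in km (spherical law of cosines)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    c = math.sin(p1) * math.sin(p2) + math.cos(p1) * math.cos(p2) * math.cos(dlon)
    return R_EARTH_KM * math.acos(max(-1.0, min(1.0, c)))


def split_half(items):
    """Return (fitting group, testing group); assumed first/second half cut."""
    mid = len(items) // 2
    return items[:mid], items[mid:]


def error_stats(actual, predicted):
    """Mean and sample SD of the GCD between actual and predicted centers.

    Needs at least two position pairs, because a sample SD is undefined for one.
    """
    errors = [
        gcd_km(a_lon, a_lat, p_lon, p_lat)
        for (a_lon, a_lat), (p_lon, p_lat) in zip(actual, predicted)
    ]
    return statistics.mean(errors), statistics.stdev(errors)
```

Averaging the per-area means and SDs returned by `error_stats` gives summary figures of the same kind as the 144.6382/97.8740 km reported above.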

Dynamic Prediction Pattern.
The dynamic prediction pattern uses the current observation point information to conduct 24 hours' and 48 hours' forecasts, as illustrated in Figure 8. There are two different forecast models in the dynamic prediction pattern, described as follows.
Forecast Model 1. The current observation point information (seven predictors) is obtained when a landing TC's center enters any area for the first time and is used to conduct 24 hours' and 48 hours' forecasts.
Forecast Model 2. The current observation point information (seven predictors) is obtained when a landing TC's center is in any area (not necessarily entering the area for the first time) and is used to conduct 24 hours' and 48 hours' forecasts.
In actual prediction, for TCs which are judged to be landing on Hainan Island, the observation point information when their centers enter any area for the first time and the equations established in forecast model 1 are used to conduct dynamic prediction. Furthermore, the observation point information when TCs' centers are in any area (not necessarily entering this area for the first time) and the equations established in forecast model 2 can be used to conduct dynamic prediction.
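The difference between the two forecast models lies only in which observation points feed the equations, which the following sketch makes explicit. It assumes, purely for illustration, that an area can be represented as a longitude/latitude bounding box `(lon_min, lon_max, lat_min, lat_max)` and that a track is a time-ordered list of (longitude, latitude) pairs; the function names are hypothetical.

```python
def in_area(point, area):
    """True if a (lon, lat) point lies inside the assumed bounding box."""
    lon, lat = point
    lon_min, lon_max, lat_min, lat_max = area
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def model1_point(track, area):
    """Forecast model 1: the first observation point inside the area, or None."""
    return next((p for p in track if in_area(p, area)), None)


def model2_points(track, area):
    """Forecast model 2: every observation point inside the area."""
    return [p for p in track if in_area(p, area)]
```

Model 1 therefore issues one forecast per TC and area, while model 2 can be re-run at every observation as long as the center remains inside the area.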
Similar to Section 3.4.1, the historical observation points that meet the corresponding requirements in corresponding

Summary
In this paper, the CMA best track datasets from 1949 to 2012 are used, in combination with data mining technology and statistical methods, to put forward a new methodology for forecasting TCs' characteristic factors. This methodology can judge whether TCs land on Hainan Island or not and forecast their characteristic factors (longitude, latitude, the lowest center pressure, and wind speed). The average probability of accurate judgment for the landing criteria is 74.70% and the highest accuracy reaches 79.76%. For the forecast of landing TCs' characteristic factors, a landing prediction pattern and a dynamic prediction pattern are proposed, which can forecast the characteristic factors when TCs land and also provide dynamic 24 hours' and 48 hours' forecasts. The landing prediction pattern performs better: its mean GCD is 144.6382 km, compared with 222.6 km for the current 48 hours' forecast in the South China Sea. Even though the mean GCD in the dynamic prediction pattern is no less than that of the three main prediction centers (NHC, JMA, and NMCC), it is much less than that of the numerical prediction model in [25] and of the method using satellite scatterometer monitoring data in [12]. The forecast methodology proposed in this paper provides a new method for typhoon warning on Hainan Island that does not require extensive meteorological knowledge, which simplifies the implementation of the prediction process while maintaining the accuracy of prediction.

Figure 3: The curve fitting of the outer boundary of Hainan Island.

Figure 4: The curve fitting of an arbitrary TC.

Figure 9: The positions of OPs for landing and not landing TCs which entered into each area.

Figure 10: The landing criteria in five areas.

Figure 11: The OPs of historical landing TCs' centers when they first entered into area 1 (LP: landing position).

Figure 12: The tracks of historical landing TCs which passed through area 1.

Figure 13: The mean of GCD for each area in landing prediction pattern.

Figure 15: The mean/standard deviation of GCD for 24 hours' forecast of two forecast models. (a) The mean of GCD (km) for 24 hours' forecast in each area using forecast models 1 and 2. (b) The standard deviation of GCD (km) for 24 hours' forecast in each area using forecast models 1 and 2.

Table 1: The geometric centers and scopes of five areas.

Table 2: The number of OPs in five areas.

Table 3: $P_{AJ}$, $P_{FA}$, and $P_{FD}$ of each criterion.

Table 4: The labels of different situations.

3.3. Other Situations as a Comparative Experiment. In Sections 3.1 and 3.2, the region where TCs' centers were located 48 hours before they landed is selected as the research object, and the research object is divided into five areas. In this section, other situations are researched as a comparative experiment.

Table 5: $P_{AJ}$, $P_{FA}$, and $P_{FD}$ of each area for each of the remaining seven situations.

Table 6: The Index for each of eight situations.

Table 7: Prediction results in landing prediction pattern.

Table 8: The results of testing forecast model 1 and forecast model 2.

Table 9
, which show that  value is much less than