Applications of Cluster Analysis and Pattern Recognition for Typhoon Hourly Rainfall Forecast

Based on meteorological and topographical factors, it is assumed that certain patterns exist in the spatial and temporal rainfall distribution of a watershed. A typhoon rainfall forecasting model is developed under this assumption. If rainfall patterns can be analyzed and recognized in terms of the topography of an individual watershed, only the spatial rainfall distribution prior to a specific moment is needed to forecast the rainfall in the coming hours. No other meteorological or climatological conditions are required. In addition, techniques for supplementing missing rain gage data are considered in order to build an all-purpose forecast model. By integrating techniques of cluster analysis and pattern recognition, the proposed rainfall forecasting model is tested using historical data of the Tamsui River Basin in Northern Taiwan. Good performance is validated by checking the coefficient of correlation and the coefficient of efficiency.


Introduction
Typhoon rainfall forecasting is extremely important since it is the basic requirement in flood routing simulation using a hydrologic model, allowing an extension of the lead time of river flow forecasting computations. It is particularly needed in small- and medium-sized mountainous basins [1]. In Taiwan, due to the high mountains and steep river slopes, heavy rainfall, especially during typhoon events, has frequently led to serious disasters such as flooding, landslides, and debris flows. In order to reduce loss of life and major economic impacts, the government has invested a great deal of manpower and budget to build disaster warning systems in which rainfall forecasting plays a key role. It provides the rainfall input data used to forecast the surface runoff outflow of a watershed. This outflow, or the gaged water depth at the outlet of the watershed, is also needed as the upstream boundary condition of unsteady river flow computations [2][3][4]. Quite often during a typhoon, gaged rainfall data fail to be transmitted to the database system for further computational use. This lack of immediate rainfall data may affect the accuracy of real-time flood forecasting or other systems. In order to deal with such a situation, the authority should not only assure the stability of the observation system and its transmission instruments but also build an all-purpose rainfall forecast model that can manage lost data at any moment and provide reasonably accurate and efficient forecast data.
Traditionally, rainfall forecasting is based mainly on numerical fluid dynamic models [5]. This classical approach attempts to model the fluid and thermal dynamic systems for grid-point time series prediction based on boundary meteorological data. The simulation often requires intensive computations involving complex differential equations and computational algorithms. Besides, the accuracy is bounded by certain constraints such as the adoption of incomplete boundary conditions, model assumptions, grid resolutions, and numerical instabilities. Furthermore, because of its high variability in space and time, typhoon rainfall is one of the most difficult elements of the hydrologic cycle to forecast. The highly nonlinear and extremely complex physical process of typhoon rainfall also makes it difficult to construct a physically based mathematical model [6].

Advances in Meteorology
Radar data and satellite images have also been used to forecast rainfall [7,8]. Unfortunately, the relationship between rainfall and the outputs from satellite and radar images was not clear, and the outputs did not allow a satisfactory assessment of rain intensities [1]. Another reason was that, due to ground occultation and altitude effects, radar detection was particularly difficult in mountainous regions [9,10].
In recent decades, research using artificial intelligence has gained scientific attention. The Artificial Neural Network (ANN) is one of the most representative of these achievements, and studies using ANNs have been reported successively. Luk et al. [11] assumed that the spatial rainfall distribution at a specific moment is related to the records of the relevant rain gages in the most recent time interval. By using a back-propagation neural network (BP-ANN), they successfully built a model for forecasting the rainfall pattern in the next 15 minutes. The same concept was used to build another rainfall forecasting model by applying other kinds of neural networks such as the feedforward neural network, partial recurrent network, and time delayed neural network [12]. Toth et al. [1] compared the accuracy of the short-term rainfall forecasts obtained with three time series analysis techniques: linear stochastic autoregressive moving average (ARMA) models, artificial neural networks (ANNs), and the nonparametric nearest-neighbors method. Chang et al. [13] compared and discussed three types of multistep-ahead (MSA) methods using previous rainfall and river stage for flood forecasting. Lin et al. [14] used support vector machines (SVMs) to construct typhoon rainfall forecasting models, which they applied with and without typhoon characteristics. Because all the rainfall or flood forecasting models mentioned above take the gaged rainfall records of the last period of time as input data, these models might not work properly when data gaps occur. Such a model cannot carry on further computations unless the lost data can be estimated correctly.
When a storm or frontal surface is approaching, the rainfall patterns in the windward area may be quite different from those in the leeward area due to topographical effects. As the storm or frontal surface moves during the typhoon period, the rainfall patterns may alter drastically at a specific gage location. This implies that the spatial and temporal distributions of rainfall are influenced by meteorology and topography. Because the topography does not change with time and storms or frontal surfaces usually move along certain paths, the trends of the spatial-temporal rainfall distribution could be bounded within some specific patterns. Based on the consideration of these meteorological and climatological factors, it is assumed in this paper that certain patterns exist in the spatial and temporal rainfall distribution of a particular river basin. An unsupervised pattern recognition method, which has a strong capacity for fault tolerance, is applied. The clustered construction can identify the corresponding data from the remaining data even if the input data are incomplete or have gaps. The results from the recognized patterns are the model outputs, which are used as the input for river runoff or elevation forecasting at the outlet of the basin. This paper applies pattern recognition and statistical cluster analysis to classify the rainfall distribution in space and time from historical data of similar meteorological and climatological conditions. This study intends to build an all-purpose model with good accuracy and reliability for typhoon hourly rainfall forecasting. The model retains its design function even when the rainfall data contain gaps.

Methodology
2.1. Cluster Analysis. Cluster analysis is the general logic, formulated as a procedure, by which we objectively group entities together on the basis of their similarities and differences [15]. The objective of data clustering is to employ certain clustering algorithms to identify clusters consisting of similar data within a dataset. The original dataset is thus decomposed into disjoint clusters, with each cluster having a center to represent it. We can use the cluster centers to represent the original dataset to achieve two goals, namely, data compression and computation reduction. In general, clustering algorithms can be divided into two types: (1) hierarchical clustering and (2) nonhierarchical clustering (also called partition clustering). Hierarchical clustering is either agglomerative or divisive. For agglomerative hierarchical clustering, every datum starts as its own cluster, and the number of clusters is decreased from the size of the dataset by successive merging until the desired number of clusters is reached. On the other hand, for divisive hierarchical clustering, the whole dataset starts as a single cluster, and the number of clusters is increased from one by successive splitting until the desired number of clusters is reached. For nonhierarchical clustering approaches, the number of clusters is fixed in advance, and then a number of iterations are performed to identify the best clusters with their cluster centers [16].
Many empirical results indicate that nonhierarchical clustering started from nonrandomly selected initial points performs better than hierarchical clustering [17]. However, in nonhierarchical clustering the number of clusters must be predetermined, and starting from a random initial partition may lead to a local optimum. Therefore, algorithms such as two-stage or two-step clustering were developed by combining the algorithms above to exploit their advantages and reduce their shortcomings. The Statistical Product and Service Solutions (SPSS) two-step cluster is used in this paper, and the description below is mainly drawn from the SPSS support documentation and the IBM Knowledge Center [18,19], for completeness. The SPSS two-step clustering component is a scalable cluster analysis algorithm designed to handle very large datasets and has become well known in recent years. The clustering procedure is divided into two steps. In the first step, the records are preclustered into many small subclusters by a sequential clustering approach: the records are scanned one by one, and a distance criterion decides whether the current record should merge with a previously formed cluster or start a new cluster. This step is implemented with a modified cluster feature (CF) tree, which consists of levels of nodes. In the second step, the subclusters resulting from the first step are taken as input and grouped into the desired number of clusters by the agglomerative hierarchical clustering method.
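The two steps described above can be illustrated with a minimal sketch. It is not the SPSS implementation: it assumes plain Euclidean distance instead of the log-likelihood measure, replaces the CF tree with simple lists, and the names `precluster`, `agglomerate`, `two_step`, and the `threshold` parameter are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def precluster(records, threshold):
    """Step 1 (assumed simplification): scan records once; each record joins
    the nearest subcluster if it lies within `threshold`, else starts a new one."""
    subclusters = []
    for rec in records:
        best, best_d = None, threshold
        for sc in subclusters:
            d = dist(rec, centroid(sc))
            if d <= best_d:
                best, best_d = sc, d
        if best is None:
            subclusters.append([rec])
        else:
            best.append(rec)
    return subclusters

def agglomerate(subclusters, k):
    """Step 2: merge the two subclusters with the closest centroids
    until only k clusters remain."""
    clusters = [list(sc) for sc in subclusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

def two_step(records, threshold, k):
    return agglomerate(precluster(records, threshold), k)

# Toy data, invented for illustration only.
records = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.3], [9.0, 0.0]]
clusters = two_step(records, threshold=1.0, k=2)
centers = [centroid(c) for c in clusters]
```

The cluster centers produced in the second step play the role of the representative patterns used in the recognition stage.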

Pattern Recognition.
The concept of "recognition" comes from the main theory of artificial neural networks. When new input data arrive, one can immediately determine the category and the output corresponding to that category. The network structure requires a strong capacity for fault tolerance. A clustered construction, even if the input data are incomplete, can still identify the corresponding data from the remaining data and show which category they belong to. The key point of pattern recognition in this study is the winner-take-all (WTA) network. For a group of artificial neurons, the neurons compete with each other. A weight of 1 is given to the winner neuron, the one that is closest to the input data, and 0 to all others. This process is known as winner-take-all.
In this paper, a "pattern" is a multivariable time-space series. The rainfall records of several gages over the most recent time interval at a specific moment are combined into an input vector, and the dataset collected from numerous storm events is divided into specific groups. In this way, not only the spatial characteristics of rainfall, such as topography (windward, leeward, altitude, etc.), but also the "behavior" of how they change over time can be captured. With these procedures, a model of typhoon rainfall forecasting can be established. The so-called "pattern" refers to the rainfall distribution in time and space with respect to a certain typhoon category, and "recognition" refers to the information made available by the corresponding classification categories.
Assume that there is a group of statistical samples. Each sample is composed of $n$ values and expressed as a mathematical vector of $n$ components:

$$\vec{x}_p = (x_{p,1}, x_{p,2}, \ldots, x_{p,n}), \tag{1}$$

where $p$ is the serial number of a specific sample. Firstly, the cluster analysis is performed. In order to divide these samples into several certain patterns, the neural network structure of winner-take-all (WTA) is employed to describe the distribution of samples. The pattern to which any specific sample belongs can be expressed as

$$C(\vec{x}_p) = k \quad \text{if } \|\vec{x}_p - \vec{w}_k\| = \min_{1 \le j \le m} \|\vec{x}_p - \vec{w}_j\|, \tag{2}$$

where $C(\vec{x}_p)$ is a natural number that expresses the pattern to which the $p$th sample belongs, $m$ denotes the number of classifications, and

$$\|\vec{x}_p - \vec{w}_k\| = \sqrt{\sum_{i=1}^{n} (x_{p,i} - w_{k,i})^2}, \tag{3}$$

in which $x_{p,i}$ denotes the $i$th component of $\vec{x}_p$ and $\vec{w}_k$ represents the center of the $k$th cluster, which resulted from the approach of two-step clustering (see Section 2.1):

$$\vec{w}_k = (w_{k,1}, w_{k,2}, \ldots, w_{k,n}). \tag{4}$$

After completion of classifying the statistical samples, for any new input, $\vec{x}$, one can find which pattern it belongs to by checking with this formula:

$$C(\vec{x}) = k \quad \text{if } \|\vec{x} - \vec{w}_k\| = \min_{1 \le j \le m} \|\vec{x} - \vec{w}_j\|. \tag{5}$$

Furthermore, a relation between the input data and output data needs to be constructed. Consider an output data $\vec{y}_p$ which is composed of $n_{\text{out}}$ values; each $\vec{y}_p$ corresponds to a specific $\vec{x}_p$. The output vector is expressed as

$$\vec{y}_p = (y_{p,1}, y_{p,2}, \ldots, y_{p,n_{\text{out}}}). \tag{6}$$

Here $p$ is the serial number of a specific sample as previously defined. After all the samples have been clustered, one can find the $j$th component of the output vector $\vec{y}$ corresponding to an input vector $\vec{x}$ by the following formula:

$$y_j = \hat{y}_{k,j}, \quad k = C(\vec{x}), \tag{7}$$

where $\hat{y}_{k,j}$ represents the $j$th component of the $k$th output pattern. If each sample belongs to a certain cluster and the distances among them are very small, one can determine $\hat{y}_{k,j}$ by using the average value to represent the whole values of output data:

$$\hat{y}_{k,j} = \frac{\sum_{p=1}^{N} \delta_{k,C(\vec{x}_p)}\, y_{p,j}}{\sum_{p=1}^{N} \delta_{k,C(\vec{x}_p)}}, \tag{8}$$

where $N$ is the total number of samples and $\delta_{k,C(\vec{x}_p)}$ equals 1 if the $p$th sample belongs to the $k$th pattern and 0 otherwise. When new data are added, one can find the cluster centers as described previously, identify to which pattern this sample belongs, and use the relationship between input and output to predict the corresponding output.
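The recognition procedure above (assign each sample to its nearest cluster center, average the outputs of each cluster, then look up the averaged output for a new input) can be sketched as follows. This is only an illustration under stated assumptions: the cluster centers are taken as given, Euclidean distance is used, and the function names and toy numbers are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def winner(x, centers):
    """Winner-take-all: index of the cluster center closest to x."""
    return min(range(len(centers)), key=lambda k: distance(x, centers[k]))

def output_patterns(inputs, outputs, centers):
    """Average the output vectors of all samples assigned to each cluster.
    A cluster that receives no sample gets None as its pattern."""
    n_out = len(outputs[0])
    sums = [[0.0] * n_out for _ in centers]
    counts = [0] * len(centers)
    for x, y in zip(inputs, outputs):
        k = winner(x, centers)
        counts[k] += 1
        for j, value in enumerate(y):
            sums[k][j] += value
    return [[s / c for s in row] if c else None
            for row, c in zip(sums, counts)]

def predict(x, centers, patterns):
    """Forecast for a new input: the output pattern of its winning cluster."""
    return patterns[winner(x, centers)]

# Toy data, invented for illustration only.
centers = [[0.0, 0.0], [10.0, 10.0]]
inputs = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [11.0, 10.0]]
outputs = [[2.0], [4.0], [20.0], [30.0]]
patterns = output_patterns(inputs, outputs, centers)
forecast = predict([8.0, 9.0], centers, patterns)
```

In this toy case the new input lies nearest the second center, so the forecast is the mean output of the two samples in that cluster.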

Model Setup.
In practice, the input data, $\vec{x}_p$, are composed of spatial and temporal information and can be expressed as follows:

$$\vec{x}_p = \left(R_1(t), R_1(t-1), \ldots, R_1(t-n_t), R_2(t), \ldots, R_{n_g}(t-n_t)\right), \tag{9}$$

where $p$ is the serial number of a specific sample and $n_t$ is the number of time steps considered in the input pattern. The subscript of $R$ (i.e., $1, 2, \ldots, n_g$) is the serial number of the rain gage, while $R_1(t)$ denotes the rainfall data at time $t$ in rain gage number 1 and $R_1(t-n_t)$ denotes the rainfall data at the previous $n_t$ time steps in the same rain gage. The values of $R_2(t), R_2(t-1), \ldots, R_2(t-n_t), R_3(t-1), \ldots, R_{n_g}(t-n_t)$ are all defined in a similar way. The output data, $\vec{y}_p$, can be expressed as follows:

$$\vec{y}_p = \left(R_1(t+1), R_2(t+1), \ldots, R_{n_g}(t+1)\right). \tag{10}$$

In this paper, rainfall records of the last (specifically, $n_t = 1$) and present hour are used as the input data to forecast gaged rainfall in the next hour. So, $\vec{x}_p$ contains $2 \times n_g$ components and $\vec{y}_p$ contains $n_g$ components.
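As an illustration of this setup, the sketch below builds input and output vectors from hourly gage records with a sliding window. The layout `rain[g][t]` (gage index first, hour second), the function name `build_samples`, and the toy numbers are assumptions made for the example; the ordering of components follows the description above (for each gage, the present hour first, then the previous hours).

```python
def build_samples(rain, n_t=1):
    """Build (input, output) vector pairs from hourly rainfall records.

    rain[g][t] is the rainfall at gage g and hour t (assumed layout).
    Each input holds, per gage, the present hour and the n_t previous
    hours; each output holds the next-hour rainfall at every gage.
    """
    n_hours = len(rain[0])
    inputs, outputs = [], []
    for t in range(n_t, n_hours - 1):
        x = [series[t - lag] for series in rain for lag in range(n_t + 1)]
        y = [series[t + 1] for series in rain]
        inputs.append(x)
        outputs.append(y)
    return inputs, outputs

# Two hypothetical gages with five hours of records (mm/h).
rain = [
    [0.0, 1.0, 3.0, 6.0, 2.0],
    [0.5, 2.0, 4.0, 5.0, 1.0],
]
inputs, outputs = build_samples(rain, n_t=1)
```

With two gages and one previous time step, each input vector has four components and each output vector has two, mirroring the 2 × (number of gages) input size stated above.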

Study Area.
In this paper, the feasibility of this method is tested on rainfall forecasting in the Tamsui River Basin in Northern Taiwan. The Tamsui River runs through Taipei, the capital city of Taiwan, and has a total drainage area of approximately 2726 km². Due to the peculiar topography, the three main tributaries, Keelung River, Dahan Stream, and Sintain Stream, converge in the Taipei Basin, where severe damage usually occurs during storms and typhoons. Because of the concentration of population (6.5 × 10⁶) and the developed urban and suburban areas, the government has invested a great deal of manpower and budget to build the flood warning system, so abundant historical observations of rainfall data are available. However, when a typhoon occurs, the transmission system degrades, resulting in missing rainfall data. This lack of immediate rainfall data may affect the accuracy of flood forecasting. In order to deal with this situation, one should not only ensure the stability of the observation system and transmission instruments but also build an all-purpose forecast model that can manage lost data at any moment.
There are many rain gages in the Tamsui River Basin. Some of them, belonging to the Water Resources Agency, are operationally stable and experience fewer situations of lost data. Therefore, in this paper the hourly rainfall data of these rain gages are used to forecast the gaged rainfall in the next hour. In total, 16 rain gages in the Tamsui River Basin belong to the Water Resources Agency. Three of them were set up after 2001; the other 13 gages have more than 20 years of historical data. The locations of the Tamsui River Basin and these 13 rain gages are shown in Figure 1. Frequency diagrams and information on the hourly rainfall of the rain gages in the Tamsui River Basin during typhoon events are shown in Figure 2.

Calibration and Validation of Dataset.
After removing the events with incomplete data, a total of 32 typhoon events that occurred during 1995-2015 were analyzed for this study. Each of them caused intense rainfall in the Tamsui River Basin when the typhoon center passed the vicinity of Northern Taiwan. The collected events are separated into two sets of data, calibration and validation, as listed in Table 1. There are 2175 samples for calibration. The dataset was entered into SPSS and classified with the two-step clustering component, with the log-likelihood option chosen as the distance measure. The number of clusters is determined by the number of available samples and the minimum number of samples required in each cluster. Although choosing more categories may produce more accurate results, overfitting could also happen if the dataset is divided into too many clusters. Requiring each cluster to contain at least 10 samples while using as many clusters as possible, the dataset is divided into 22 clusters. Table 2 shows the result of the classification. In this paper, the hourly rainfall data of Typhoon Soudelor (2015) and Typhoon Dujuan (2015) were chosen as the validation dataset.

Supplement of Missed Rainfall Data.
The model has the capacity to automatically fill in any missing data at the gages. The basic concept of the supplement is to arrange the data of several gages in the catchment over sequential hours into a mathematical vector, with the historical rainfall records divided into m clusters. The winner-take-all neural network is again used to build the relationship between the samples and the m clusters.
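When part of the input vector data is missing, one can still use the remaining information to determine the cluster. Figure 3 shows the flow chart of the vector transform when data are lost. Due to the lack of data at some stations, the $n$-component vector is transformed to an $n'$-component vector and the computation proceeds with the remaining $n'$ components. By using the procedure of Figure 3, the pattern to which the incomplete vector belongs is determined from the remaining components only:

$$C(\vec{x}) = k \quad \text{if } \sqrt{\sum_{i \in A} (x_i - w_{k,i})^2} = \min_{1 \le j \le m} \sqrt{\sum_{i \in A} (x_i - w_{j,i})^2}, \tag{11}$$

where $A$ is the set of the $n'$ components that remain available. Then the pattern of the lost data can be supplemented by the following formula:

$$x_i = w_{k,i}, \quad i \notin A, \tag{12}$$

where subscript $k$ is the pattern number and $i$ denotes the $i$th component of $\vec{x}$ and $\vec{w}_k$. Note that this $\vec{x}$ is a new input vector when the rainfall forecast is applied. Another typhoon event, Aere in 2004, with a large amount of missing data at Sanxia, was chosen to demonstrate the performance with incomplete input data.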
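The supplement procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: missing components are marked with `None`, the distance is Euclidean over the available components only, the missing components are filled with the corresponding components of the winning cluster center, and the function names and numbers are hypothetical.

```python
import math

def classify_partial(x, centers):
    """Find the nearest cluster center using only the components of x
    that are present (entries equal to None are treated as missing)."""
    available = [i for i, value in enumerate(x) if value is not None]
    if not available:
        raise ValueError("at least one component must be observed")

    def partial_distance(center):
        return math.sqrt(sum((x[i] - center[i]) ** 2 for i in available))

    return min(range(len(centers)), key=lambda k: partial_distance(centers[k]))

def fill_missing(x, centers):
    """Return (pattern index, completed vector): missing components are
    replaced by the matching components of the winning cluster center."""
    k = classify_partial(x, centers)
    filled = [centers[k][i] if value is None else value
              for i, value in enumerate(x)]
    return k, filled

# Hypothetical centers for two gages (present and previous hour each).
centers = [[1.0, 2.0, 1.5, 2.5], [12.0, 15.0, 10.0, 14.0]]
x = [11.0, None, 9.0, None]  # second and fourth components were lost
k, filled = fill_missing(x, centers)
```

Once the vector is completed, it can be passed to the ordinary recognition step as if no data had been lost.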

Validation and Performance Measures.
To evaluate the model performance, two indices which are commonly used are employed here.
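Coefficient of correlation (CC) is as follows:

$$\mathrm{CC} = \frac{\sum_{i=1}^{n_e} (O_i - \bar{O})(F_i - \bar{F})}{\sqrt{\sum_{i=1}^{n_e} (O_i - \bar{O})^2 \, \sum_{i=1}^{n_e} (F_i - \bar{F})^2}}. \tag{13}$$

Coefficient of efficiency (CE) is as follows:

$$\mathrm{CE} = 1 - \frac{\sum_{i=1}^{n_e} (O_i - F_i)^2}{\sum_{i=1}^{n_e} (O_i - \bar{O})^2}. \tag{14}$$

In (13) and (14), $F$ is the forecast value and $O$ is the observation value. $i$ is the serial number of the sample, and $n_e$ is the number of samples included in a certain event. The overbar indicates the average quantities:

$$\bar{O} = \frac{1}{n_e} \sum_{i=1}^{n_e} O_i, \qquad \bar{F} = \frac{1}{n_e} \sum_{i=1}^{n_e} F_i. \tag{15}$$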
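Both indices are straightforward to compute. The sketch below assumes the usual definitions: CC as the Pearson correlation between observed and forecast series, and CE as one minus the ratio of the squared forecast error to the variance of the observations about their mean. The function names and the sample series are invented for illustration.

```python
import math

def coefficient_of_correlation(observed, forecast):
    """Pearson correlation between observed and forecast values."""
    n = len(observed)
    o_bar = sum(observed) / n
    f_bar = sum(forecast) / n
    cov = sum((o - o_bar) * (f - f_bar) for o, f in zip(observed, forecast))
    var_o = sum((o - o_bar) ** 2 for o in observed)
    var_f = sum((f - f_bar) ** 2 for f in forecast)
    return cov / math.sqrt(var_o * var_f)

def coefficient_of_efficiency(observed, forecast):
    """1 minus (sum of squared errors / sum of squared deviations of the
    observations from their mean). A perfect forecast gives 1."""
    o_bar = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    sst = sum((o - o_bar) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical hourly rainfall (mm/h) at one gage during one event.
observed = [0.0, 2.0, 4.0, 6.0, 8.0]
forecast = [1.0, 2.0, 4.0, 6.0, 7.0]
cc = coefficient_of_correlation(observed, forecast)
ce = coefficient_of_efficiency(observed, forecast)
```

CC measures only how well the forecast follows the shape of the observed series, whereas CE also penalizes bias and amplitude errors, which is why CE values are typically lower than CC values for the same forecast.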

Result and Discussion
Table 3 lists the coefficients of correlation and efficiency for the calibration dataset. The results are consistent among the 13 rain gages. The coefficient of correlation exceeds 0.66 (the lowest at Shimen), while the highest is up to 0.84. The highest coefficient of efficiency is 0.65 (Fushan) and the average is 0.55. Typhoon Soudelor was the most intense tropical cyclone to develop in the Northern Hemisphere in 2015 (category 5 super typhoon on the SSHWS). When it passed through Taiwan, torrential rains and destructive winds caused widespread damage and disruptions, especially in the north. According to the Central Emergency Operation Center, at least eight people were killed and four were missing in Taiwan, in addition to 437 injured. Agricultural losses across the island were estimated at NT$2.2 billion (US$66.7 million) by August 11. A record-breaking 4.29 million households lost power on the island. Figure 4 shows the observed hourly rainfall data and the simulation results during Typhoon Soudelor (2015). The observed hourly rainfall reached 87 mm at the Zhuzihu rain gage. The model output showed good agreement between simulated and observed data. The coefficients of correlation and efficiency are shown in Table 4. The coefficient of correlation exceeds 0.68 (the lowest at Shimen), while the highest is up to 0.89 (Zhuzihu). The highest coefficient of efficiency is 0.74 (Shiding and Ruifang) and the average is 0.62.
Typhoon Dujuan was the second most intense tropical cyclone of the Northwest Pacific Ocean in 2015 (category 4 typhoon on the SSHWS). Three people were killed and 376 were injured in Taiwan. Figure 5 shows the observed hourly rainfall data and the simulation results during Typhoon Dujuan (2015). It also indicates that the observed data and the simulated data were quite close. The coefficients of correlation and efficiency are shown in Table 5. The coefficient of correlation exceeds 0.69 (the lowest at Shimen), while the highest is up to 0.88 (Jhongjheng Bridge). The highest coefficient of efficiency is 0.77 (Jhongjheng Bridge) and the average is 0.65. Figure 6 shows the observed data and the simulation results during Typhoon Aere (2004). Because of missing data, there are no observed rainfall data at Sanxia station during Typhoon Aere. With the supplement procedure, the data of the remaining stations could still be simulated and compared to the observed data. The coefficients of correlation and efficiency are shown in Table 6. The coefficient of correlation exceeds 0.61 (the lowest at Ruifang), while the highest is up to 0.89 (Fushan). The highest coefficient of efficiency is 0.76 (Fushan), while the average is 0.55.
Among previous studies, Luk et al. [11] used a BP-ANN and successfully built a model for forecasting the rainfall pattern in the next 15 minutes. Normalized mean squared error (NMSE) was chosen as the performance indicator and was about 0.63 to 0.65. Luk et al. [12] used the same concept to build another rainfall forecasting model by applying other kinds of neural networks such as the multilayer feedforward neural network, partial recurrent network, and time delayed neural network. The normalized mean squared error was about 0.63 to 0.67 when forecasting the rainfall pattern in the next 15 minutes. Lin et al. [14] used SVM-based models with and without typhoon characteristics to forecast the rainfall. The coefficients of efficiency were 0.44 and 0.43, respectively. In this study the average coefficient of efficiency is 0.64 and reasonably good results have been observed. We thus consider the improvement to be effective.
Note that the dimension of the input consists of two factors: the number of previous time steps and the number of rain gages. These are also related to the cluster sizes, while the sizes and number of clusters depend on how many samples are available. Since the number of samples is limited, the clusters would become too small if too many dimensions were used, which would cause inaccuracy through overfitting. Because the rainfall forecasting model is developed for flood mitigation around the capital city of Taiwan, as many rain gages in the entire watershed as possible should be used. That is also the reason the system needs to keep working even if data are missing. When developing a similar rainfall forecast system in another area with a larger sample size, one would have more options to test the effect of the number of dimensions and the cluster size. In this paper, rainfall records of the last and present hour are used as the input data to forecast gaged rainfall in the next hour. In total, 13 rain gages were used, since the data at the remaining 3 rain gages are insufficient. The input data therefore have 26 components in total.

Conclusions
By integrating the techniques of cluster analysis and pattern recognition, an unsupervised method is adopted in this paper to provide reasonably accurate and effective typhoon hourly rainfall forecasts. Missing data can be supplemented, and only the rainfall data in space and time from the previous time steps are needed to forecast the hourly rainfall intensity for the next time step. The proposed forecast model is tested using historical rainfall data in the Tamsui River Basin. Among the 32 typhoon events for which complete rainfall records could be obtained, 30 are used for calibration. The data are clustered into 22 patterns for the network construction. After the framework is built, the remaining two typhoon events, Soudelor (2015) and Dujuan (2015), are used to validate the model. Additionally, another typhoon event, Aere (2004), during which the rainfall data were lost at one of the 13 gages, is used to illustrate how the model works when its input is incomplete. The performance is verified by the coefficient of correlation and the coefficient of efficiency. Reasonably good results have been observed in these cases. This shows that the proposed forecast model is well suited for predicting hourly rainfall during typhoons in Northern Taiwan.

Figure 1: Location of Tamsui River Basin and the 13 rain gages.

Figure 2: Frequency diagrams and information of hourly rainfall of the rain gages in Tamsui River Basin during typhoon events.

Figure 3: Flow chart of vector transform when data missing occurs.

Table 1: The list of typhoon events collected in this study for the model establishment.

Table 2: Results of the classification of rainfall samples in Tamsui River Basin.

Table 3: Coefficients of correlation and efficiency in the dataset of calibration.

Table 6: Coefficients of correlation and efficiency during Typhoon Aere (2004) (data at Sanxia gage were missing).