Prediction of Frost Occurrences Using Statistical Modeling Approaches

We developed frost prediction models for spring in Korea using logistic regression and decision tree techniques. Hit Rate (HR), Probability of Detection (POD), and False Alarm Rate (FAR) were calculated for both models and compared. Threshold values for the logistic regression models were selected to maximize HR and POD and minimize FAR for each station, and splitting in the decision tree models was stopped when the change in entropy became relatively small. For the logistic regression and decision tree models, respectively, average HR values were 0.92 and 0.91, average POD values were 0.78 and 0.80, and average FAR values were 0.22 and 0.28. The average numbers of selected explanatory variables were 5.7 and 2.3, respectively. Models with fewer explanatory variables can be more appropriate for operational use in providing timely warnings to prevent frost damage to agricultural crops. We concluded that the decision tree model can be more useful for a timely warning system. It is recommended that the models be improved to reflect local topological features.


Introduction
It is widely known that many perennial crops in South Korea, such as fruit trees, are prone to damage from late-spring frost events. When a frost event leads to a freeze event, dehydration resulting from extracellular ice formation causes permanent tissue damage in crops [1,2]. Frost events can be divided into two categories: radiation frosts and advective frosts [3]. The former tend to occur under clear skies, no wind, and a low dewpoint temperature. The latter typically occur under cloudy skies, moderate to strong winds, no temperature inversion, and low humidity.
There have been many studies on meteorological conditions during frost events [4,5]. Kwon et al. [4] analyzed the following eight meteorological variables at each station in South Korea when frost events occurred from 1973 to 2007: minimum temperature (denoted as TMIN), grass minimum temperature (denoted as GMINT), dewpoint temperature (denoted as Dewpoint), and wind speed (denoted as Wind) on frost occurrence days; mean relative humidity (denoted as $RH_{\mathrm{mean}}$), minimum relative humidity (denoted as $RH_{\mathrm{min}}$), and cloud amount (denoted as Cloud) on the day before the frost occurrence days; and the difference between the maximum temperature on the day before the frost occurrence days and the minimum temperature on the frost occurrence days (denoted as $T_{\mathrm{diff}}$). These meteorological variables have been used to estimate the frost probability. For example, Floor [6] used wind speed, total cloud amount, minimum temperature, and grass minimum temperature to estimate frost events at Eelde (Netherlands).
Frost warning systems have been developed based on meteorological variables. The National Weather Service (NWS) currently issues three levels of frost warnings based on air temperature and wind speed [7]. The criteria for the three levels (frost warning, frost/freeze warning, and frost warning) are 0 °C for air temperature and 16 km h⁻¹ for wind speed. Chevalier et al. [3] developed a web-based fuzzy expert system for frost event warnings based on predicted air and dewpoint temperatures and observed current wind speeds. An expert system has also been used to predict frost occurrence on roads and bridges [8]. That system used observed maximum and minimum temperatures from the previous day and estimates of air temperature, dewpoint temperature, cloud cover, precipitation, and average wind speed.
Frost occurrence can be described as a binary variable, since days can be divided into two groups: days when frost occurs and days when it does not. The logistic regression and decision tree techniques can be used to predict such occurrences. These techniques have been used in various fields including agronomy [9,10], meteorology [11], and medicine [12]. However, to the best of our knowledge, few studies have applied them to frost prediction. These techniques are much simpler than fuzzy logic techniques, and such simple methods can be more useful for operational activities to prevent frost damage to agricultural crops. The objective of this study was to develop prediction models of frost occurrence using the logistic regression and decision tree techniques. We identified the meteorological variables most frequently selected by the two techniques and compared the resulting frost prediction models.

Study Site and Data Collection.
For this study, six stations (Chuncheon: 101, Suwon: 119, Seosan: 129, Cheongju: 131, Gwangju: 156, and Jinju: 192) were selected, as shown in Figure 1. The number after each station is the station number assigned by the Korea Meteorological Administration (KMA). Based on the study by Kwon et al. [4], eight meteorological variables were selected for this study. The frost occurrences and the eight meteorological variables from 1973 to 2014 were collected from the KMA. Previously developed frost prediction and warning systems used from two [7] to seven [8] meteorological variables. This study focused on the spring season, defined as March to May, because late frost in spring frequently causes serious damage to overall crop growth.
The meteorological characteristics and their statistics on the frost occurrence days at the six stations from 1973 to 2014 are summarized in Table 1. The most frost events occurred at the Seosan station (909 days), while the fewest were observed at the Gwangju station (510 days). Frost in spring occurred when the minimum temperature was approximately −2 °C at most stations, with the exception of the Gwangju station (−0.2 °C). Frost events at the Chuncheon, Suwon, Seosan, and Cheongju stations were observed when the grass minimum temperature was between −6.1 and −7.4 °C, while frost events at the Gwangju and Jinju stations occurred when the grass minimum temperature was between −4.1 and −4.9 °C. For wind speed, the range when frost occurred was 1. 6

Logistic Regression.
The eight meteorological variables identified by Kwon et al. [4] as the most influential on frost events were used as the explanatory variables of the prediction models. In logistic regression modeling, frost occurrence ($Y$ in (2)) is defined as a binary variable (0 for no frost event and 1 for a frost event). For the frost occurrence probability ($p$), the logistic regression equation can be given as

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i} \beta_i x_i, \tag{1}$$

where $\beta_0$ is the intercept of the linear term and $\beta_i$ is the coefficient corresponding to explanatory variable $x_i$. The probability that a frost event occurs can be defined as

$$P(Y=1) = p = \frac{\exp\left(\beta_0 + \sum_{i} \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_{i} \beta_i x_i\right)}. \tag{2}$$

On the other hand, the probability that a frost event does not occur can be calculated as

$$P(Y=0) = 1 - p = \frac{1}{1 + \exp\left(\beta_0 + \sum_{i} \beta_i x_i\right)}. \tag{3}$$

The Hosmer-Lemeshow significance test [13] with a significance level of 0.05 was used as the criterion to select explanatory variables. A backward elimination method was used for the selection. First, the model was fitted with all eight meteorological variables, and the explanatory variable with the highest p value above the selection criterion (i.e., insignificant at the significance level of 0.05) was removed. Second, the model was refitted with the remaining explanatory variables, and the variable with the highest p value among those insignificant at the 0.05 level was eliminated. This elimination was repeated until all remaining explanatory variables were significant at the significance level of 0.05.
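The backward elimination procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: `p_values_for` is a hypothetical callback that would refit the logistic model on the current variables and return one p value per variable, and the stub used here returns made-up numbers only to exercise the loop.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    variables: List[str],
    p_values_for: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant variable until all p values are <= alpha."""
    selected = list(variables)
    while selected:
        p_values = p_values_for(selected)           # refit on the current set
        worst = max(selected, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:                # every variable is significant
            break
        selected.remove(worst)                      # eliminate, then refit
    return selected


# Hypothetical stand-in for a real model fit: fixed, made-up p values.
def stub_p_values(current: List[str]) -> Dict[str, float]:
    made_up = {"TMIN": 0.001, "Wind": 0.01, "Dewpoint": 0.03,
               "Cloud": 0.20, "RHmin": 0.60}
    return {v: made_up[v] for v in current}


kept = backward_eliminate(
    ["TMIN", "Wind", "Dewpoint", "Cloud", "RHmin"], stub_p_values
)
print(kept)  # ['TMIN', 'Wind', 'Dewpoint']
```

In a real analysis the p values change after every refit, which is why the callback is invoked inside the loop instead of once up front.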

Decision Tree.
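Frost occurrence can be described as a binary feature (i.e., a "Yes-No" response). For this study, a binary decision tree model was used as the frost prediction model. Entropy was used as the separation criterion for the split. The entropy of the parent node ($E$) was calculated by (4) as follows:

$$E = -\sum_{j} \frac{N_{+j}}{N_{++}} \log_2 \frac{N_{+j}}{N_{++}}, \tag{4}$$

where $N_{ij}$ is the number of data in node $i$ and class $j$, and a $+$ subscript denotes summation over that index. A decision tree model consists of decision nodes and leaf nodes. Each decision node has exactly two branches in a binary tree. The topmost decision node (i.e., the root node) of a tree was the explanatory variable (i.e., the best predictor variable) with the maximum entropy reduction. The frost occurrences were split to maximize the difference between $E_L$ and $E_R$. The entropies of the two branches were then calculated using (5) and (6):

$$E_L = -\sum_{j} \frac{N_{Lj}}{N_{L+}} \log_2 \frac{N_{Lj}}{N_{L+}}, \tag{5}$$

$$E_R = -\sum_{j} \frac{N_{Rj}}{N_{R+}} \log_2 \frac{N_{Rj}}{N_{R+}}. \tag{6}$$

The explanatory variable with the largest $\Delta E$ by (7) was selected as a decision node:

$$\Delta E = E - \left(\frac{N_{L+}}{N_{++}} E_L + \frac{N_{R+}}{N_{++}} E_R\right). \tag{7}$$

To avoid overfitting when building the decision trees, we stopped splitting where the change in $\Delta E$ was relatively small.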
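The entropy-based split can be illustrated with a short Python sketch. It assumes that the entropy difference is the parent entropy minus the size-weighted entropy of the two branches (the usual information-gain form) and that candidate thresholds are midpoints between sorted values; both are assumptions of this sketch, and the TMIN values and labels are made up.

```python
import math
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a sequence of 0/1 frost labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in (0, 1):
        share = sum(1 for y in labels if y == cls) / n
        if share > 0:
            result -= share * math.log2(share)
    return result


def entropy_drop(values: Sequence[float], labels: Sequence[int],
                 threshold: float) -> float:
    """Parent entropy minus the size-weighted entropy of the two branches."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float]:
    """Try midpoints between sorted distinct values; return (threshold, drop)."""
    points = sorted(set(values))
    candidates: List[float] = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return max(((t, entropy_drop(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up data: minimum temperature (deg C) and frost occurrence (1 = frost).
tmin = [-3.0, -1.5, -0.5, 0.5, 2.0, 4.0]
frost = [1, 1, 1, 0, 0, 0]
print(best_threshold(tmin, frost))  # (0.0, 1.0)
```

A stopping rule like the one in the text would compare the returned drop against a small tolerance and turn the node into a leaf when the drop falls below it.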

Logistic Regression versus Decision Tree.
Logistic regression and decision tree techniques were used to develop prediction models for frost occurrence in spring, and their predictive skill was compared. The predictions from the developed models were summarized in a 2 × 2 contingency table (Table 2). Using these tables, we calculated the Hit Rate (HR), Probability of Detection (POD), and False Alarm Rate (FAR) as skill scores for both models and compared them. These skill scores can be calculated by

$$\mathrm{HR} = \frac{a + d}{a + b + c + d}, \qquad \mathrm{POD} = \frac{a}{a + c}, \qquad \mathrm{FAR} = \frac{b}{a + b}, \tag{8}$$

where $a$, $b$, $c$, and $d$ are the cell counts defined in Table 2 (hits, false alarms, misses, and correct negatives, respectively).
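As a small worked illustration of these skill scores, the sketch below computes HR, POD, and FAR from the four cells of a 2 × 2 contingency table. It assumes the common verification convention in which a = hits, b = false alarms, c = misses, and d = correct negatives; the counts are invented for the example and are not taken from the study.

```python
from typing import Dict


def skill_scores(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """HR, POD, and FAR from contingency counts.

    Assumed convention: a = hits, b = false alarms,
    c = misses, d = correct negatives.
    """
    return {
        "HR": (a + d) / (a + b + c + d),   # share of all days predicted correctly
        "POD": a / (a + c),                # share of observed frost days detected
        "FAR": b / (a + b),                # share of frost warnings that were wrong
    }


# Invented counts for 1000 days, 100 of them with observed frost.
scores = skill_scores(a=78, b=22, c=22, d=878)
print(scores)
```

With rare events, d dominates the total, so HR stays high even when a sizable share of frost days is missed, which is why POD and FAR are reported alongside it.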
The probability of frost occurrence can be calculated using (2). When the probability is greater than a threshold value (default = 0.5), the model predicts that a frost event occurs. In contrast, when the probability is smaller than the threshold value, the model predicts that a frost event does not occur. The values of $a$, $b$, $c$, and $d$ in (8) therefore change with the threshold value, and so do the three skill scores (HR, POD, and FAR). We selected threshold values for the logistic regression models to maximize HR and POD and minimize FAR for each station. Leave-one-out cross-validation was used to estimate the performance of the models developed with both techniques.
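The threshold selection can be sketched as a simple grid scan. The text does not say how the three scores were combined, so the single objective HR + POD − FAR used here is an assumption, as are the coefficient values, the TMIN data, and the threshold grid; the contingency cells follow the common convention of a = hits, b = false alarms, c = misses, and d = correct negatives.

```python
import math
from typing import Sequence, Tuple


def frost_probability(intercept: float, coefs: Sequence[float],
                      x: Sequence[float]) -> float:
    """Logistic probability of frost for one day."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def counts(probs: Sequence[float], observed: Sequence[int],
           threshold: float) -> Tuple[int, int, int, int]:
    """Contingency cells (a, b, c, d); frost is predicted when p > threshold."""
    a = b = c = d = 0
    for p, y in zip(probs, observed):
        predicted = p > threshold
        if predicted and y:
            a += 1          # hit
        elif predicted:
            b += 1          # false alarm
        elif y:
            c += 1          # miss
        else:
            d += 1          # correct negative
    return a, b, c, d


def objective(a: int, b: int, c: int, d: int) -> float:
    """Assumed combined score: HR + POD - FAR (higher is better)."""
    hr = (a + d) / (a + b + c + d)
    pod = a / (a + c) if a + c else 0.0
    far = b / (a + b) if a + b else 0.0
    return hr + pod - far


def best_threshold(probs: Sequence[float], observed: Sequence[int],
                   grid: Sequence[float]) -> float:
    return max(grid, key=lambda t: objective(*counts(probs, observed, t)))


# Hypothetical one-variable model and made-up data.
tmin = [-4.0, -2.0, -1.0, 0.0, 1.0, 3.0]
observed = [1, 1, 1, 1, 0, 0]
probs = [frost_probability(-0.5, [-0.9], [t]) for t in tmin]
print(best_threshold(probs, observed, [0.1, 0.3, 0.5, 0.7]))  # 0.3
```

In this toy data set the default threshold of 0.5 misses the frost day at 0 °C, whereas 0.3 detects it without adding a false alarm, which mirrors why station-specific thresholds below 0.5 can be preferable.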
Since frost events are very local and many meteorological variables may not always be available at all local sites, a limited number of explanatory variables might be more suitable for operational frost occurrence prediction models. We compared the models developed with the logistic regression and decision tree techniques and proposed the better technique for operational use based on their performance and the number of selected explanatory variables.

Results and Discussion
Logistic Regression.
The prediction models for frost occurrence were developed using the logistic regression technique. As shown in Table 3, the threshold values determined to maximize HR and POD and to minimize FAR for the logistic regression models ranged from 0.42 (Jinju) to 0.49 (Suwon). The HR, POD, and FAR values resulting from these threshold values were approximately 0.9, 0.8, and 0.2, respectively (Table 4). Three variables (TMIN, Dewpoint, and Wind) were selected at all six stations, while $RH_{\mathrm{mean}}$ and Cloud were each used at five stations (all except the Suwon and Cheongju stations, respectively). In particular, TMIN and Wind have been commonly used for the prediction of frost events [3,6–8]. Seven of the eight meteorological variables (all except $T_{\mathrm{diff}}$) were selected for the prediction models using the logistic regression technique. These seven variables have frequently appeared in previous studies [3,6,8], whereas $T_{\mathrm{diff}}$ has not been used in those studies for frost predictions or warnings. GMINT was selected at four stations, and $RH_{\mathrm{min}}$ was the least selected explanatory variable. These results implied that minimum and grass minimum temperatures are negatively correlated with the probability of frost occurrence. In contrast, Dewpoint and $RH_{\mathrm{mean}}$ are positively correlated with it. The intercepts of the equations (Table 3) at the six stations ranged from −2.554 (Cheongju) to −0.19 (Suwon). The highest coefficient of TMIN was −0.262 at Gwangju and the lowest was −0.92 at Jinju. Although these two stations are located at the same latitude, the odds ratio of TMIN at Gwangju is approximately twice as high as that at Jinju. While the coefficients of TMIN varied widely over the six stations, the coefficients of Wind and Dewpoint varied comparatively little. The average coefficients of Wind and Dewpoint were about 0.20 and −0.42, respectively.
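The odds-ratio comparison for TMIN can be checked with a short calculation. In a logistic regression the odds ratio for a one-unit increase in a variable is the exponential of its coefficient; the sketch below applies this to the two TMIN coefficients quoted above (the variable names are illustrative only).

```python
import math

coef_gwangju = -0.262   # TMIN coefficient reported for Gwangju
coef_jinju = -0.92      # TMIN coefficient reported for Jinju

# Odds ratio per 1 deg C increase in minimum temperature.
or_gwangju = math.exp(coef_gwangju)
or_jinju = math.exp(coef_jinju)
ratio = or_gwangju / or_jinju

print(round(or_gwangju, 2), round(or_jinju, 2), round(ratio, 2))  # 0.77 0.4 1.93
```

A 1 °C rise in minimum temperature thus multiplies the odds of frost by about 0.77 at Gwangju and about 0.40 at Jinju, and the ratio of roughly 1.93 corresponds to the "approximately twice" stated in the text.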
Figure 2 depicts the observed and predicted numbers of daily frost events in spring during the study period. The results from the leave-one-out cross-validation method are also shown in Figure 2. The models overestimated the frost events in early March at all stations, while they slightly underestimated them in April at all stations.
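Leave-one-out cross-validation, as used for Figure 2, can be sketched generically: each observation is held out once, the model is refitted on the rest, and the held-out case is predicted. The toy model below (a TMIN cutoff midway between the class means) is only a stand-in for the logistic regression and decision tree models, and the data are made up.

```python
from typing import Callable, List


def loocv_predictions(
    xs: List[float],
    ys: List[int],
    fit: Callable[[List[float], List[int]], float],
    predict: Callable[[float, float], int],
) -> List[int]:
    """Predict each observation with a model fitted on all the others."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]      # leave observation i out
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        preds.append(predict(model, xs[i]))
    return preds


# Toy stand-in model: cutoff halfway between mean TMIN of frost and no-frost days.
def fit_midpoint(train_x: List[float], train_y: List[int]) -> float:
    frost = [x for x, y in zip(train_x, train_y) if y == 1]
    clear = [x for x, y in zip(train_x, train_y) if y == 0]
    return (sum(frost) / len(frost) + sum(clear) / len(clear)) / 2


def predict_midpoint(cutoff: float, x: float) -> int:
    return 1 if x < cutoff else 0


tmin = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]   # made-up minimum temperatures (deg C)
frost = [1, 1, 1, 0, 0, 0]                 # made-up frost occurrences
preds = loocv_predictions(tmin, frost, fit_midpoint, predict_midpoint)
accuracy = sum(p == y for p, y in zip(preds, frost)) / len(frost)
print(preds, accuracy)  # [1, 1, 1, 0, 0, 0] 1.0
```

Passing `fit` and `predict` as arguments keeps the validation loop independent of the model, so the same loop could wrap either technique.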

Decision Tree.
Figure 3 displays the results of the decision tree technique. For example, the tree in Figure 3(a) predicts that frost will occur when TMIN < −0.35 °C and Wind > 2.45 m s⁻¹. As with TMIN in Figure 3(a), the same meteorological variable may be used more than once as a separation criterion within a tree. The terminal nodes classified as nonfrost occurrences (denoted as "N" in Figure 3) had different probabilities of frost occurrence. For example, the tree model in Figure 3(a) predicts no frost as long as TMIN ≥ −0.35 °C. However, the probability of no frost depends on whether the minimum temperature falls between −0.35 and 1.05 °C. In this case (Figure 3(a)), the probability of no frost was 0.98 when TMIN ≥ 1.05 °C and 0.61 when −0.35 °C ≤ TMIN < 1.05 °C. The average Wind split value was 3.1 m s⁻¹. This value is slightly lower than the wind speed criterion (4.4 m s⁻¹) used by Perry [7], slightly higher than that (2 m s⁻¹) of Floor [6], and between the two criteria proposed by Chevalier et al. [3], who proposed three categories based on two wind speeds (less than 2.2 m s⁻¹, between 2.2 m s⁻¹ and 4.4 m s⁻¹, and greater than 4.4 m s⁻¹).

Two to three variables were used in the decision tree models for frost prediction at the study stations. For example, TMIN and Wind were selected at the Chuncheon station, and GMINT, Dewpoint, and Wind were selected at the Cheongju station. Unlike the results of the logistic regression technique, TMIN was selected at only three stations, while GMINT was selected at the same stations as in the logistic regression models (i.e., the Suwon, Seosan, Cheongju, and Gwangju stations). Overall, four explanatory variables (TMIN, GMINT, Wind, and Dewpoint) out of the eight meteorological variables were selected at the study stations with the decision tree technique. Wind was used at five stations, which implies that wind speed is the most important separation criterion for the construction of the trees.
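The branches of the Figure 3(a) tree that are described in the text can be written as a small rule function. This sketch encodes only what the text states; the label "Y" for a predicted frost, the function name, and the handling of the branch that the text does not describe (TMIN < −0.35 °C with Wind ≤ 2.45 m s⁻¹) are assumptions made for illustration.

```python
from typing import Optional, Tuple


def figure_3a_rule(tmin_c: float, wind_ms: float) -> Tuple[str, Optional[float]]:
    """Return (label, probability of no frost) for the branches given in the text.

    "N" = no frost predicted, "Y" = frost predicted (label assumed here).
    The probability is None where the text does not report it.
    """
    if tmin_c >= 1.05:
        return ("N", 0.98)
    if tmin_c >= -0.35:
        return ("N", 0.61)
    if wind_ms > 2.45:
        return ("Y", None)
    # This branch is not described in the text, so no prediction is encoded.
    return ("unspecified", None)


print(figure_3a_rule(2.0, 1.0))    # ('N', 0.98)
print(figure_3a_rule(0.0, 5.0))    # ('N', 0.61)
print(figure_3a_rule(-1.0, 3.0))   # ('Y', None)
```

Because TMIN appears at two levels of the tree, both "N" leaves give the same prediction but carry different confidence, which is the point made in the text about terminal nodes.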
The frost events predicted by the decision tree technique were compared with the observed frost events for the study period (Figure 4). Unlike the results of the logistic regression technique, there was no apparent pattern in early March. However, the decision tree model overestimated frost events at the Cheongju station in March and underestimated them in April.

Logistic Regression versus Decision Tree.
The average numbers of selected explanatory variables were 5.7 and 2.3 for the logistic regression and decision tree models, respectively. The number for the logistic regression models is slightly smaller than the seven variables used by Takle [8]. The number for the decision tree models is slightly higher than the two variables used by Perry [7] but slightly smaller than the four and three variables used by Floor [6] and Chevalier et al. [3], respectively. TMIN, Wind, and Dewpoint were selected at all six stations by the logistic regression technique, while Wind (five stations) was the most frequently selected variable by the decision tree technique, followed by GMINT (four stations) and TMIN (three stations). Notably, GMINT was selected at the same four stations (Suwon, Seosan, Cheongju, and Gwangju) by both techniques. This result seems consistent with the finding of Kwon et al. [4] that GMINT was more influential than TMIN.
Skill scores calculated using the 2 × 2 contingency table for the two models are summarized in Table 4. For the logistic regression models, the HR values ranged from 0.899 (Chuncheon) to 0.935 (Jinju) and the POD values ranged from 0.767 (Chuncheon) to 0.816 (Cheongju) (Table 4). The HR values for the decision tree models covered a range similar to that for the logistic regression models. The POD values for the decision tree models were slightly higher than those for the logistic regression models, but so were the FAR values. The skill scores therefore did not consistently favor one technique over the other. However, POD would be the more appropriate skill score when frost events do not occur very often, since a main purpose of the frost prediction model is a timely warning to prevent frost damage. Furthermore, fewer explanatory variables were selected for the decision tree models, suggesting that these models can be more appropriate for operational activities. These results imply that the decision tree model for frost occurrence prediction can be more useful for providing a timely warning to prevent frost damage to agricultural farms. It is concluded that the proposed technique may help farmers by enabling adequate strategies to reduce frost damage through timely warnings.

Conclusions
To develop prediction models for frost events at six stations on the Korean Peninsula, we applied the logistic regression and decision tree techniques to eight meteorological variables. Although the three skill scores (HR, POD, and FAR) from the two techniques were not consistent, the decision tree models were selected for potential operational use because they used fewer explanatory variables and had higher POD values than the logistic regression models. It is concluded that the prediction models of frost occurrence in spring can be used to prevent frost damage to agricultural crops by enabling timely interventions at agricultural farms and facilities.
To forecast frost events, predicted meteorological variables should be used as model inputs. The predictability of frost occurrence might differ with forecast lead time. However, this study does not address this question, and it is recommended that it be addressed in a further study. In addition, since frost events are very local and are affected by local topological characteristics, it is recommended that the models be improved to reflect these characteristics.

Table 1: Meteorological characteristics at frost occurrences at the study stations in spring. The temperature difference is the difference between the maximum temperature (one day before the frost occurrence day) and the minimum temperature (on the frost occurrence day).

Table 2: Two-dimensional (2 × 2) contingency table.
In (4)-(7), $N_{++}$ is the number of data in the parent node and $N_{ij}$ is the number of data in the $i$th node and $j$th class. $E$ is the entropy of the parent node, $E_L$ is the entropy of the left branch, $E_R$ is the entropy of the right branch, and $\Delta E$ is the entropy difference between the decision node and its branches.

Table 3: Logistic regression models and threshold values for stations.

Table 4: Hit Rate (HR), Probability of Detection (POD), and False Alarm Rate (FAR) by the logistic regression and decision tree models.