Harmonization and Verification of Three National European Icing Forecast Models Using Pilot Reports

The Single European Sky Air Traffic Management Research (SESAR) program aims at modernizing and harmonizing the European airspace, which currently has a strongly fragmented character. Besides turbulence and convection, in-flight icing is part of SESAR and can be seen as one of the most important meteorological phenomena, which may lead to hazardous flight conditions for aircraft. In this study, several methods with varying complexities are analyzed for combining three individual in-flight icing forecasts based on numerical weather prediction models from Deutscher Wetterdienst, M´et´eo-France, and Met Office. The optimal method will then be used to operate one single harmonized in-flight icing forecast over Europe. As verification data, pilot reports (PIREPs) are used, which provide information about hazardous weather and are currently the only direct regular measure of in-flight icing events available. In order to assess the individual icing forecasts and the resulting combinations, the probability of detection skill score is calculated based on multicategory contingency tables for the forecast icing intensities. The scores are merged into a singleskill score to give an overview of the quality of the icing forecast and enable comparison of the different model combination approaches. The concluding results show that the most complex combination approach, which uses iteratively optimized weighting factors for each model, provides the best forecast quality according to the PIREPs. The combination of the three icing forecasts results in a harmonized icing forecast that exceeds the skill of each individual icing forecast, thus providing an improvement to in-flight icing forecasts over Europe.


Introduction
e European airspace has a strongly fragmented character. At the continental scale, there are multiple air navigation service providers each responsible for a portion of the European airspace. Furthermore, air tra c control in Europe is very complex because the airspace is one of the busiest in the world with over 30,000 ights daily on average (2018, before COVID-19 pandemic) and a very high airport density. In the year 2000, the Single European Sky initiative was created to structure airspace and air navigation services in Europe to overcome the fragmentation and increase ight capacity and air tra c management e ciency. e Single European Sky Air Tra c Management Research (SESAR) program aims at modernizing and harmonizing the European airspace through the development and implementation of a new set of operational procedures and systems among European aviation stakeholders [1,2].
In order to achieve the goals of SESAR, aeronautical and meteorological information need to be harmonized and consistent over Europe. Several national meteorological service providers operate their own Numerical Weather Prediction (NWP) systems on various grid scales (global to regional) and time scales resulting in several weather forecasts available over Europe at any given time. Additionally, harmonized global aviation weather forecasts provided by the World Area Forecast Centre's are available. However, for the highly fragmented and high-capacity European airspace, there is a need for more detailed weather forecast information (high-resolution aeronautical meteorological (MET) information for short-distance flights, terminal areas, and airports). e SESAR Deployment project [3] 2015_068_AF5 "European Harmonised Forecasts of Adverse Weather (Icing, Turbulence, Convection and Winter Weather)" aims at fulfilling this need by providing European harmonized MET products of aviation weather hazards in high resolution and of high accuracy. Alongside turbulence and convection, in-flight icing is seen as one of the most important meteorological phenomena, which may lead to hazardous flight conditions for aircraft. In-flight icing typically occurs in clouds with supercooled water droplets. e temperature within these clouds is usually between 0 and -20°C [4] and can be, depending on the surrounding conditions (such as the absence of ice nuclei), as cold as -40°C [5]. An aircraft flying through such a cloud may collect some of the supercooled droplets on its wings, propellers, or turbines, measuring instruments and other structures of the aircraft. e droplets freeze instantaneously in contact with the aircraft (large droplets may remain in a semi-liquid state for up to one second [5]) and can accrete to form an ice layer of up to several centimeters [6]. ese ice layers can have a strong impact on the maneuverability of the aircraft and may lead to hazardous flight conditions in such a way that measuring instruments and the aircraft aerodynamics are compromised. In the latter case, icing causes an increased drag and a decreased lift, which results in control problems [7]. Green [8] gives a long-term overview  of accidents and incidents related to in-flight icing. Further descriptions of some individual incidents are discussed by Bernstein et al. [9].
Due to the hazardous nature of in-flight icing, aircraft are equipped with anti-icing and de-icing systems. e two most commonly used systems are heated surfaces and pneumatic rubber boots on the leading edge of the wings and at other locations where ice accretion may lead to flight problems. However, their utilization leads to a significant increase in fuel consumption, and therefore, avoidance of icing conditions is preferred. Consequently, there is a demand for more reliable icing forecasts to reduce costs as well as increase flight safety.
In this study, a harmonized icing forecast over Europe is produced using forecast data from Deutscher Wetterdienst (DWD), Météo-France (MF), and Met Office (MO). e three icing forecast products, each based upon a different NWP model, are ADWICE from DWD [5] (postprocessing based on model ICON-EU [10]), ARPEGE from MF [11], and Unified Model (UM) from MO [12]. e harmonized product, hereafter called by its working title "HarmonICE," is produced by combining the different icing indices created by each of the individual NWP models.
e primary goal of this study is to analyze different methods combining the icing indices, such as the minimum, maximum, or mean of all data and, as a more complex approach, the method suggested by Pepe and ompson [13] described in detail below. e approach with the highest forecast skill will be deployed operationally to provide a harmonized European icing forecast with a high accuracy and consistency across national borders and among airspace users and aviation stakeholders.
Pilot reports (PIREPs) are used to verify the forecasts as supported by previous studies, for example, Brown et al. [14] who describe in detail the advantages and disadvantages of using PIREPs for NWP verification. PIREPs provide information about hazardous weather and are currently the only direct regular measure of in-flight icing events. An overview of the PIREPs used in this study and the methods for the combination of the individual icing forecasts are described in the next section, followed by a detailed description of the results and an outlook.

Data and Methods
PIREPs from the winter season 2017/18 (October 2017 to April 2018) are used in order to determine the best approach for combining the three icing forecasts based on single NWP models and to verify the final harmonized icing forecast, HarmonICE. Icing events over Europe typically occur more frequently during this period of the year than in other months. However, compared to the large amount of forecast data available in the time period, there are few observations of the icing phenomenon.

Observation
Data. An icing event reported by a PIREP provides a point observation of the existence and severity of icing (categorized as none, trace, light, moderate, or severe) at the location of the aircraft. Alongside the five-level icing intensity, the geographic location, altitude, and observation time are reported. PIREPs have been used as icing observation data in several other studies such as Brown et al. [14], Taffner et al. [15], and Kalinka et al [5].
In total, 4268 PIREPs were collected during the winter season. e geographical distribution of "no icing" and "icing" events is shown in Figure 1. e majority of the positive icing events were reported over land in the middle of Europe and close to large cities. Aircraft sending these PIREPs were often located at middle flight levels during approach or landing. is can be seen in the altitude distribution of the PIREPs in Figure 2. In contrast, negative reports were mostly sent from over the sea in the western part of the analyzed area ( Figure 1) and from higher flight levels ( Figure 2). As a result, a large number of "no icing" reports are at heights that are above the maximum forecast level (FL 360) where supercooled liquid water rarely exists due to very low temperatures. e frequency distributions of the PIREPs separated into four icing intensities can be seen in Figure 3. Trace icing observations are included in the light icing intensity category as there are few trace icing reports compared to light icing reports. By far, most PIREPs report moderate icing. A relatively small number of negative reports are available (662) compared to the number of positive icing events (3606). is distribution can be explained by the fact that pilots are expected to avoid areas with severe icing intensities and that light icing intensities may not give rise to a level of 2 Advances in Meteorology concern for the pilot to consider making a report. Furthermore, icing intensity is subjective since no quantitative de nition of severity exists. A detailed discussion about the low quality and quantity of PIREPs is given by Brown et al. [14]. ey conclude that PIREPs are challenging to use in an absolute sense for NWP veri cation. However, in a relative sense, they can be useful, for example, for comparing the detection rates of di erent icing forecasts or the combination of them as presented in this study.

Forecast
Data. e three icing forecasts used for the harmonized product from DWD, MF, and MO are based on di ered NWP systems. ADWICE from DWD provides hourly data output with a horizontal grid spacing of 0.0625°a nd 32 vertical layers up to 225 hPa. e icing intensities provided by ADWICE match the four categories used for the PIREP values in the study. e icing index based on ARPEGE from MF has a lower horizontal grid resolution of 0.1°and 14 layers up to 400 hPa vertically. e data output is hourly up to forecast hour 12 and then every three hours with an icing intensity range between 1 and 10 in steps of one. e UM icing index from MO provides hourly data with 0.140626°horizontal grid spacing in east-west direction and 0.09375°in north-south direction with 16 vertical grid layers extending up to 250 hPa. e icing intensity is continuous ranging between 0 and 1. All models are run every six hours (0, 6, 12, and 18 UTC) and simulate up to 36 hours or longer. e di erent grid spacings and time resolutions have to be considered in the nal harmonized product. e approach used for HarmonICE uses the nest of the aforementioned horizontal grid resolutions of 0.0625°and 29 vertical grid layers (up to 225 hPa or ight level 360). is is done so that in the nal product, there is no loss of icing intensity information resulting from extrapolation to a Flight Level Figure 2: Distribution of the PIREPs by altitude (left: positive icing, right: negative icing). Note that di erent ranges on the abscissa are shown due to the relatively small number of no icing reports. One report may be counted multiple times because the report identi es multiple levels.
coarser grid. Furthermore, customers who currently use the ner resolved single product could have reservations against the harmonized product if it contained less information. Any reduction of information in the vertical dimension would result in less ight levels available in the forecasts, reducing the icing intensity information available when compared to existing single model forecasts, which is undesirable for customers. is may give the impression that the harmonized product is not as good as the higher resolved single product and result in reluctance to use HarmonICE.
Data output of HarmonICE is hourly up to forecast hour 36. e nal output is the categorical icing intensity forecast utilizing the categories recommended in ICAO Annex 3 [16], which are the same categories used for PIREPs. e domain of HarmonICE goes from 23.5°West to 62.5°East and from 29.5°to 70.5°North (see Figure 1).
In order to combine the three di erent centers' forecasts, each individual icing forecast is re-gridded to the Har-monICE grid using linear interpolation. Time interpolation is performed with the ARPEGE data, and the icing indices of ARPEGE and UM are transformed to the four-step categorical range of HarmonICE by applying the thresholds listed in Table 1. is transformation is necessary because unlike ADWICE, which produces categorical icing intensity forecasts, there are no internal calculations to a categorical icing intensity within ARPEGE and UM. e thresholds listed in Table 1 were calculated during an earlier phase of the project using PIREPs from another winter season. ey are currently xed at the given values but will be veri ed and if necessary recalculated in future.
Due to storage problems, some gaps in the forecast data exist for the analyzed winter season 2017/18. All three datasets cover approx. 70% of the complete winter time period so not all PIREPs gathered could be used for comparison with the forecast data. In total, 3010 PIREP forecast data comparisons could be performed.

Analyzing Methods.
is section describes the data processing required to enable evaluation and veri cation of the forecasts. First, the forecast data corresponding to each PIREP are extracted from the four-dimensional data les. Second, the individual icing forecasts are combined in order to produce the harmonized icing intensity.

Pairing Data.
For each PIREP of the examined winter season, the position, observation time, and icing intensity were compared with the forecast data. e point in time for each PIREP was shifted to the nearest full hour so that a maximum time di erence between an icing forecast validity time and the PIREP time was 30 minutes. Within this halfhour window, the detected icing area (cloud) could be advected horizontally some distance according to the mean wind at this altitude leading to some spatial deviation in the location of the icing forecast. Assuming typical mean wind speed of around 15 m/s in a middle altitude of the troposphere, the advection would lead to a distance of around 30 km between the measuring point and the potential new position of the measured cloud half an hour earlier or later. Another source of uncertainty lies in the reported information in the PIREP (e.g., rounding errors of the longitude and latitude values or a time o set during the observation resulting from aircraft travel). In order to account for these uncertainties in time and space, a box was placed around each PIREP and the maximum forecast icing intensity in the box was used to compare to the observed icing intensity (similar to other studies, e.g., Kalinka et al. [5]). e sensitivity of the results to the size of such a box was tested by using four di erent box sizes. e smallest box is simply a single grid length in all dimensions, which results in a pointto-point comparison. All other boxes are increased vertically by one grid point up and down (totaling three grid points in height), and by one, two, and three grid points, respectively, in all four horizontal directions. us, the largest box has dimensions 3 × 7 × 7, totaling 147 grid points, that is, three vertical grid points and seven horizontal grid points in North-South and East-West directions (denoted as 3v7h), corresponding to approximately 600 m vertically and 50 km in both horizontal directions.

Merging Data.
For merging the individual icing forecasts, di erent approaches were tested. Besides the classical method to calculate the mean of the three forecasts, the combination of the maximum and minimum was calculated. Additionally, the method proposed by Pepe and ompson [13] is used. is method combines the forecast values p1,  Advances in Meteorology p2, p3 by choosing a weighting factor λ ∈ [-1,1] in such a way that the combinations λpi + pj (with i, j ∈ {1,2,3}) result in an optimal value for the chosen skill score for each pair of possible combinations (pi, pj). e pair with the highest resulting skill score is then combined into a new intermediate combined icing forecast value pij. is process is then repeated for the remaining pair of icing forecasts pk, pij (with k ∈ {1,2,3}) in order to calculate a second weighting factor for the nal product, which has the following form: pijk pi + λ1 pj + λ2 pk.

Evaluating Data.
Former studies often used a binary yes/no icing analyzing method, for example, Brown et al. [14] and Kalinka et al. [5]. A contingency table enables the calculation of POD (probability of detection) for yes and no icing as well as calculation of the FAR (false alarm rate). e AUC (area under the curve) can also be calculated. ese skill scores provide an overview of the forecast quality and enable comparisons of di erent NWP model performance to be made. However, this binary method distinguishes only between an icing and a no icing event with no consideration to the intensity. For this reason and the fact that only a small number of negative icing reports are available (they are needed for the FAR calculation), a more complex multicategory contingency table is used to analyze the data (e.g., Murphy [17], Brooks [18]). A detailed description of the method applied in this study for calculating skill scores from the multicategory contingency tables is given in the next section alongside an example.

Results and Discussion
In this section, the results are presented for combining individual in-ight icing forecasts using simple and more sophisticated methods. e skill of each method for combining the individual in-ight icing forecasts as well as of each individual in-ight icing forecast is shown. Based on these results, the recommended approach for combining the in-ight icing forecasts for HarmonICE is presented.

Example of a Combination of Individual Forecasts.
One of the simplest ways of combining the icing forecasts is by calculating the average icing intensity (mean method). An example of such an icing intensity eld is shown in Figure 4. A pressure level of 600 hPa on 11   icing) and in the areas where icing occurs (ARPEGE has no icing in the northeastern and northwestern part of the domain). However, the position and spatial extension of the frontal zone agree well between all models. In the Har-monICE eld, the mean method is shown to remove icing areas with light intensities predicted by only one of the three forecast products and typically reduces the higher icing intensity areas since the highest intensity predicted by all products appears only in few areas. e two PIREPs available at this ight level and time, shown as colored dots in Figure 4, agree well with the forecast icing intensities of HarmonICE. In contrast, not all of the individual forecasts are a good t to the PIREPs, further highlighting the bene t of intelligently combining the individual in-ight icing forecasts into HarmonICE. Analysis of all data during the winter season for this meanmethod case is shown in Figure 5. e multicategory contingency table belongs to the point comparison (smallest model box size around each PIREP) and can be read as follows: on the green diagonal, the forecast values agree perfectly with the observed data. An optimal forecast would generate nonzero values only along this diagonal; all other entries would be zero. e blue cells show under-forecast cases where the predicted icing intensity is smaller than the corresponding PIREP icing intensity. e red cells show the over-forecast cases. e darker the red and blue colors, the larger the di erences between the forecast icing intensities and each corresponding PIREP. e percentage of the three categories (perfect forecast, under-forecast, and over-forecast) is mentioned below the table. e gray cells give the sum of the corresponding rows and columns. As mentioned previously, 3010 PIREPs were compared with a forecast value. is is less than the total amount of PIREPs available in the winter season (see section "Data and Methods") since not all forecasts were not available over the entire time period and some PIREPs were reported from an altitude above the model domain height.
In this example of averaging the individual model forecasts, the results show a tendency to underestimate the icing intensities. Nearly half of the compared values of the mean-method data are smaller than the corresponding PIREP. A perfect forecast, where the predicted icing intensity and the corresponding observed icing intensity are identical, is observed in 41.7% of all cases. e multicategory contingency table provides a variety of di erent information about the analyzed forecast product. However, it does not provide one single skill score, which would allow an overall evaluation of the product and a simple comparison with multicategory contingency tables derived from other forecast products. In order to create a single skill score to use to compare the individual icing forecasts, the PODs of each icing intensity, as well as the total hit rate, the FARs of each icing intensity as well as the total false-forecast rate and under-and over-forecasting rates were calculated. For the latter four values, the total amount of observed PIREPs (3010) was used as the divisor. To calculate the icing intensity hit rates (no, light, moderate, and severe), the number of correctly forecast icing intensities (green values on the diagonal of the table in Figure 5) was divided by the amount of PIREPs with the corresponding icing intensity (gray values in the last line of the table in Figure 5). For example, the hit rate of no icing was calculated by dividing 236 by 247, which results in 0.96. Figure 6 shows the above-listed skill scores calculated from the multicategory contingency table ( Figure 5). e very high hit rate for no icing reports (0.96) highlights the problems mentioned in section "Observation Data" section with PIREPs of no icing. Most of the no icing PIREPs are at altitudes where no icing clouds occur and thus these statistics look very good. e reason for taking both the single hit rates of each icing intensity (hit no, hit light, hit moderate, hit severe) and the total hit rate (hit all) into account is that in the total hit rate, each value has the same in uence on the probability, while the single hit rates are di erently in uenced by one value because their total numbers are di erent.
In order to determine one single skill score from the values shown in Figure 6, they were merged into three di erent groups of skill scores: the hit rates for the single icing intensities (hit no, hit light, hit moderate, and hit severe), the total hit rate, which is equal to 1-false all and the under-and over-forecast rates. ese three groups of scores contain all relevant information in the multicategory contingency table ( Figure 5). By averaging the single scores within the groups and then averaging the three groups, one multicategory average skill score (MCASS) is obtained, which contains all relevant characteristics of the multicategory contingency table. e calculation described above can be mathematically formulated as with the rst term representing the average of the single hits, the second term representing all hits, and the third term giving consideration to under-and over-forecast events. e under-forecast and over-forecast values are subtracted from one in order to get a value that is best when it is equal to 1. e authors are aware that there are other scores or combinations of scores that can be derived from the multicategory contingency table such as the Heidke skill score (HSS) [19]. However, the MCASS (equation (1)) provides a good and su cient indication of the icing forecast skills investigated in this study. is was also manually checked by comparing a set of calculated MCASS with the corresponding multicategory contingency tables. One additional advantage to the formulation of the MCASS over skill scores such as the HSS is that it can be modi ed by di erent weighting factors for each term in equation (1) in order to consider certain single terms more than others. is might be the case if customer requirements are modi ed in a way that, for example, over-forecasting must have a stronger impact than under-forecasting or the hit rate of severe must be considered stronger than the other hit rates. In the current study, however, the single terms of equation (1) are weighted homogeneously. More detailed analysis of the di erences as well as the advantages and disadvantages of the MCASS in comparison with other skill scores will be investigated in future studies. e MCASS was computed for each individual icing forecast and each individual combination method discussed in the "Merging Data" section (results for the mean method presented in Figures 5 and 6).

Overview of Various Combination Methods.
Following the example for the multicategory contingency table in the previous section, the performance of each individual icing forecast and combination method is now assessed. A nal value of MCASS is shown for the recommended combination method for producing HarmonICE.
For the Pepe and ompson (PT) method, the best values for λ1 and λ2 were iteratively calculated as 0.6 and -0.2, respectively. e nal optimized combination results in ADWICE + λ1UM + λ2ARPEGE. In Figure 7, all results are summarized for the three forecast products as well as for the four combinations (min, mean, max, and PT). e box sizes used around the PIREPs are indicated as di erent colors of the bars. Figure 7 shows that when considering the individual forecasts, the box size only has a signi cant in uence on ADWICE. e larger the box size, the better the results indicated by the MCASS.
is can be explained by the relatively high grid resolution of ADWICE compared to the other two forecasts. e higher grid resolution leads to higher spatial uctuations of the icing intensity within small areas, and hence, larger icing intensity values are more likely to be encountered by increasing the box size. Furthermore, ADWICE tends to underestimate the icing intensity more than the other models for the smallest box size (not directly shown here but re ected by the relatively small blue bar of ADWICE in Figure 7), which also leads to better results for an increasing box size. e box size dependency of ADWICE can also be seen in the combined PT method since ADWICE is included with the largest factor of one. Overall, the PT method leads to the best results of all forecast combinations and is better than each individual forecast product. is statement is independent of the box size around the PIREPs. For each box size (bar color), the PT method provides the largest MCASS value.
In order to give a better insight into the improvement of HarmonICE provided by the PT method, Figure 8 shows the multicategory contingency table, which belongs to the largest bar in Figure 7 box 3v7h of PT method red bar). e table in Figure 8 can be compared with that of Figure 5, which was obtained by using the mean method with the smallest box size (box 1v1h of the mean method in Figure 7-blue bar). e MCASS di erence between these two examples is approximately 0.12. is is less than the di erence between the PT method and most of the individual forecasts. e perfect forecast using the PT method is 63.1%, more than 20 percentage points larger than with the mean method (41.3%, Advances in Meteorology 7 see Figure 5), and the under-forecasting is reduced to 29.6%. e main reason for the higher skill using the PT method is that a large number of the light icing forecasts in the other methods are increased to moderate icing. is leads to a high hit rate of moderate icing, which is the group with the most icing reports and consequently has a large impact on the results. Despite this increase, the hit rate of the no icing events is still very good with the PT method (234 vs. 236 with the mean method). is shows that the in uence on the hit rate of no icing reports is small.

Conclusion and Outlook
In this study, a method for harmonizing in-ight icing forecasts over Europe is introduced. Di erent approaches were tested in order to nd the best combination of three inight icing forecast indices based on three di erent NWP models. PIREPs are used as veri cation data, although they have some disadvantages for this purpose. An example on a single day shows the di erences in the individual forecast products and the combined forecast product using the mean method. e PIREPs reported during the single day event t reasonably well to the combined data set.
In order to be able to compare the di erent combination methods, the MCASS was derived from each multicategory contingency table generated for various forecast combinations and the individual forecast products.
e MCASS contains all relevant information given by the multicategory contingency table, that is, the hit rates of the four icing intensities, over-and under-forecasting, and the total hit rate. It allows quantitative comparisons of the forecast skill from each icing forecast product as well as from the combination methods presented. e nal comparison of the MCASS of the di erent forecast combinations and the individual forecast products shows that the PT method provides the best results including showing an improvement compared to each single icing intensity forecast. e sensitivity study of box sizes around the PIREP positions in which the forecast data were extracted shows that the results are independent of the chosen box size; however, it does have an in uence on the MCASS. In most cases, an increased box size leads to a larger MCASS, but for certain forecast products or combinations, the results were una ected by box size or showed a decrease in MCASS.
Within the SESAR Deployment project, initial tests of the preoperational version of HarmonICE have run since the middle of 2019. Since 2020, the data have been provided as a six-hourly icing forecast over Europe. Veri cation to inform the revision of the best forecast combination method-and if  8 Advances in Meteorology necessary, a new calculation of the weighting factors-is and will be performed annually with the PIREPs from the previous winter season. Within that recalculation, it will also be possible to adapt MCASS to user requests. e calculation method allows for a different weighting of the single skill score terms such as over-forecasting or a certain intensity hit rate (e.g., severe icing intensity). If one of these single skill score terms must be considered more strongly than the others, it can be implemented in the average skill score calculation. PIREPs are used for icing verification since better alternatives have previously been unavailable. One alternative for determining observed icing intensities, or at least whether icing occurs or not, is the use of a satellite inferred icing potential [20]. Stretton et al. [21] showed that satellite data can be used to detect the "no icing" regions in cloud-free areas. Furthermore, cloud top heights and the corresponding icing potential at this height level can be derived from the satellite. Overall, satellite data cannot provide a three-dimensional dataset of icing intensities, but it can help to improve the forecast of the areas of icing conditions and accuracy of the verification dataset together with PIREPs. e use of satellite data as a verification tool alongside PIREPs will be explored in future work.
Data Availability e underlying data have been produced for aviation purposes. ey can be provided for aviation use and for research activities. Contact the author or VL.Kunden-service@dwd.de.