The results of a deterministic calibration for the nonhydrostatic convection-permitting LAM-EPS AEMET-
The need to calibrate meteorological models has been known for many years. Whether by the use of classical statistical methods or by more modern and advanced techniques [
In this work, the nonhydrostatic convection-permitting LAM-EPS AEMET-
For the calibration, a deterministic approach with different machine learning (ML) methods was chosen; that is, each of the 20 members was calibrated as if it were a deterministic model. Five airports representing different climatic conditions of Spain were selected: Madrid-Adolfo Suárez-Barajas, Barcelona-El Prat, Vigo-Peinador, Palma de Mallorca-Son San Juan, and Málaga-Costa del Sol. Madrid airport lies in the middle of the Iberian Peninsula, with a continental and dry climate. Two airports are close to the coast (Barcelona and Palma de Mallorca), and one of them (Palma de Mallorca) is on an island with a Mediterranean climate. The other two airports are on the wet Atlantic facade of Spain (Vigo) and in the hot region of Andalucía in the south of Spain (Málaga).
Three variables with a clear impact on sensible weather at the surface were calibrated: temperature, wind speed, and precipitation in 24 hours. The sophistication of the calibration increased roughly from temperature to precipitation, owing to the inherent difficulties associated with these variables. As mentioned, machine learning (ML) tools were used, a range of powerful statistical methods that are growing in popularity due to their success. A brief overview of ML methods is given in the next section.
Some efforts have been made in the past to use ML as a calibration tool. There are new and promising results using ML, for instance, for the nowcasting of precipitation [
Machine learning methods are a wide range of statistical tools that allow meaning to be extracted from data. Indeed, statistics and machine learning can be regarded as near synonyms: both are concerned with learning from data. Simplifying perhaps too much, one could say that statistics puts its effort into formal inference for low-dimensional problems while machine learning deals with high-dimensional problems [
Machine learning can be divided into three broad paradigms: reinforcement, supervised, and unsupervised learning. Reinforcement learning applies when a system learns while it evolves by interacting with an environment. Learning is supervised or unsupervised depending on whether there is a function or a set of labels that guides the learning. In this work, the supervised paradigm is used, with the observations acting as the target that the calibrated model output should match. Within the supervised paradigm, problems are classification or regression problems depending on whether the target variables are discrete or continuous, respectively. As the variables in this work are continuous, regression is the technique used.
Among the many techniques present in the ML literature, the following methods were chosen: ridge regression, lasso, elastic net, Bayesian ridge, random forest regression, gradient boosting, XGBoost, AdaBoost, polynomial regression, support vector regression (SVR), and feedforward neural networks (FNNs). They are briefly described in the next sections.
These methods are sophisticated versions of the classical linear regression solved by Carl Gauss 200 years ago. They minimize a squared error function as in linear regression but with the peculiarity of adding an extra term to prevent
In the case of ridge, the penalty term (also known as a
Lasso is very similar to ridge, but it uses a
A main feature of lasso is that it can shrink some coefficients exactly to zero, so the regression retains only the predictors that contribute most to fitting the observations.
Elastic net combines the ridge and lasso penalties. Bayesian ridge takes a Bayesian view, assigning a Gaussian probability distribution to the parameters of the model and then estimating them during the regression; this results in an approach very similar to ridge [
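To make the contrast between the two penalties concrete, the following minimal Python sketch fits a regression with a single predictor and no intercept, once with a squared (ridge, L2) penalty and once with an absolute-value (lasso, L1) penalty. The closed-form expressions, the function names, and the toy data are illustrative assumptions, not the configuration used in this work.

```python
def ridge_coef(x, y, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for one predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizer of 0.5 * sum((y - w*x)^2) + lam * |w| for one predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    # Soft-thresholding: the coefficient is exactly zero when |sxy| <= lam.
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy > 0 else -shrunk) / sxx


# Hypothetical toy data: one informative target and one that is almost pure noise.
x = [1.0, 2.0, 3.0, 4.0]
y_strong = [2.1, 3.9, 6.2, 7.8]
y_weak = [0.1, -0.1, 0.05, -0.02]

ridge_strong = ridge_coef(x, y_strong, lam=1.0)
lasso_strong = lasso_coef(x, y_strong, lam=1.0)
ridge_weak = ridge_coef(x, y_weak, lam=1.0)  # small but nonzero
lasso_weak = lasso_coef(x, y_weak, lam=1.0)  # exactly zero
```

In this sketch, ridge only shrinks the weak coefficient, whereas lasso removes it altogether, which is the predictor-selection behaviour described above.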
A random forest consists of an ensemble of decision trees whose estimated values are averaged. A decision tree is a model similar to a flowchart consisting of branches and nodes. Each node applies a test to the data, with the split chosen according to a criterion such as the mean squared error (MSE) or an information-related metric (the Akaike information criterion, for instance). Branches are the different outcomes of the operations performed in the nodes. At the end of the tree are the leaves, that is, the final outcome of the different operations. When working with random forests, it is crucial to set an adequate depth for the trees, that is, the number of levels in the flowchart.
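The idea of averaging many trees can be sketched with the simplest possible case: depth-1 trees ("stumps") fitted on bootstrap samples of a single predictor. This is a bagged ensemble, a simplified relative of a random forest, and every name and value below is a hypothetical choice made for illustration.

```python
import random
import statistics


def fit_stump(x, y):
    """Depth-1 regression tree: the threshold on x that minimizes the squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [b for a, b in zip(x, y) if a <= t]
        right = [b for a, b in zip(x, y) if a > t]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((b - ml) ** 2 for b in left) + sum((b - mr) ** 2 for b in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x identical: predict the mean everywhere
        m = statistics.fmean(y)
        return (x[0], m, m)
    return best[1:]


def predict_stump(stump, value):
    """Follow the single branch of the stump down to a leaf value."""
    t, ml, mr = stump
    return ml if value <= t else mr


def fit_forest(x, y, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, value):
    """Ensemble prediction: the mean of the individual tree predictions."""
    return statistics.fmean(predict_stump(s, value) for s in forest)


# Hypothetical step-like data: low values for x < 5, high values otherwise.
x = [float(i) for i in range(10)]
y = [0.0 if v < 5 else 10.0 for v in x]
forest = fit_forest(x, y)
low = predict_forest(forest, 0.0)
high = predict_forest(forest, 9.0)
```

Deeper trees would add further levels of tests below each branch; the depth-1 limit here only keeps the sketch short.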
Boosting methods combine an ensemble of decision trees, as in the random forest approach, with the minimization of an error function such as the MSE. They use the gradient descent technique for this minimization. Since the gradient gives the direction of steepest increase of a function, moving in the opposite direction leads toward a minimum of that function:
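In its usual form, the gradient descent update replaces a parameter w by w minus a step size times the gradient of the error at w. The step size, the symbol names, and the toy data in the following minimal Python sketch are assumptions made for illustration; the sketch minimizes the MSE of a one-parameter linear fit.

```python
def gradient_descent(x, y, lr=0.01, steps=2000):
    """Fit y ~ w * x by minimizing the MSE with plain gradient descent."""
    w = 0.0
    n = len(x)
    for _ in range(steps):
        # Gradient of MSE(w) = mean((w*x - y)^2) with respect to w.
        grad = sum(2.0 * (w * a - b) * a for a, b in zip(x, y)) / n
        # Step against the gradient, i.e. in the direction of decreasing MSE.
        w -= lr * grad
    return w


# Hypothetical data generated by y = 2 * x, so the minimum is at w = 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
w_fit = gradient_descent(x, y)
```

Gradient boosting applies the same idea to the predictions of the ensemble rather than to a single coefficient: each new tree is fitted to the negative gradient of the error with respect to the current predictions.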
From a broader perspective, random forests can be seen as algorithms that reduce the variance of a model, while boosting algorithms can be seen as reducing its bias. Both techniques lead to a reduction of the MSE since we know that
These are important and currently fashionable techniques that have shown considerable power when dealing with big and complex datasets (especially neural networks). The details of their implementation and inner workings are quite involved, and only a very brief summary of their features is given here. Interested readers can consult [
SVR (support vector regression) is a technique based on mapping the data to a higher dimensional space in order to convert a nonlinear problem into a linear one. Working in higher dimensional spaces has a cost, the so-called
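The relation usually invoked at this point is the bias-variance decomposition of the expected squared error. In a standard textbook form, with symbols that are assumptions introduced here for illustration ($\hat{f}$ the fitted model, $f$ the true function, and $\sigma^2$ the variance of the irreducible noise), it reads:

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2
  + \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]
  + \sigma^2
```

The first term is the squared bias and the second is the variance, so lowering either one lowers the MSE as long as the other does not grow by more.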
A feedforward neural network (FNN) is an artificial imitation of a human brain, with hierarchical layers of neurons that receive a set of inputs and compute nonlinear functions or
The training dataset ran from November 14, 2016, to January 22, 2018, roughly one year and two months. The observations of the 2-meter temperature came from the METAR reports of the 5 airports. The 4 model grid points closest to the coordinates of each observation station were chosen, covering a grid area of 2.5 × 2.5 km². These 4 points were chosen because the closest point does not always provide the best information and because more information can be gained by adding other points. In Figure
Quality of the ridge regression for different number of points (temperature).
Quality of the ridge regression for the 4 closest points (temperature).
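Selecting the model grid points closest to a station can be sketched as follows. The haversine distance, the function names, and the toy coordinates are assumptions made for illustration; they are not taken from the software used in this work.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_points(station, grid, k=4):
    """Return the k grid points (lat, lon) closest to the station, nearest first."""
    return sorted(
        grid,
        key=lambda p: haversine_km(station[0], station[1], p[0], p[1]),
    )[:k]


# Hypothetical station and a small regular grid around it (not real model coordinates).
station = (40.49, -3.58)
grid = [(40.45 + 0.025 * i, -3.62 + 0.025 * j) for i in range(5) for j in range(5)]
closest4 = nearest_points(station, grid, k=4)
```

Each of the selected points then contributes one forecast column to the regression; changing `k` changes the number of predictors.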
At some airports, such as Madrid, the 4 closest points are all land points, so no special measure needs to be taken. But in cases like Barcelona or Palma, some of the 4 closest neighbours could be (and were) points over the sea. It is known that the diurnal cycles of the temperature over the ocean and over the land are different. A legitimate approach would be to include the 4 points in the regression regardless of whether they are land or sea points; the elimination of systematic errors (biases), such as the differences in temperature between land and sea, is something that ML algorithms are especially good at. However, it was decided to give the algorithms some extra “help” by filtering out the points over the sea. In the ML literature, this is called
Data from the model from
After these quality controls (and other basic checks such as deleting repeated values for the same hour), the 1 column of observations and the 4 columns of forecasts were matched and prepared for the calibration or training with the different ML methods. This joining and preparation of the dataset was done using the very useful
The results of the training for the 5 airports chosen with one of the members of the LAM-EPS AEMET-
The graphs for the 5 selected airports are shown for member 019. The vertical axis shows the MSE and the horizontal axis the different ML methods. For each method, the average performance is shown, together with its standard deviation calculated with cross validation. Each bar extends from the MSE minus the standard deviation at its lowest part to the MSE plus the standard deviation at its top, to give an idea of the range of variability. As previously stated, the red line is the model output without postprocessing for the minimum MSE of the 4 points. The green line is the MSE of the closest point to the observation, so it is the appropriate reference for comparing the model with the ML methods. The calibrations of the 5 airports are shown in Figures
T2m calibration plot for Madrid airport. Horizontal axis, from left to right: ridge, lasso, elastic net, Bayesian ridge, random forest regression, gradient boosting regression, XGBoost regression, AdaBoost regression, polynomial of order 2, and feedforward neural network.
T2m calibration plot for Barcelona airport. Horizontal axis as in Figure
T2m calibration plot for Palma airport. Horizontal axis as in Figure
T2m calibration plot for Vigo airport. Horizontal axis as in Figure
T2m calibration plot for Málaga airport. Horizontal axis as in Figure
As an extra, a scatter plot between the point with the minimum MSE and the observations is shown (Figure
T2m scatter plot for Madrid airport,
It is necessary to comment that the same weights and biases have been used in the algorithms for all hours from
As can be seen, the classical statistical and linear methods perform very well, and the ridge method seems to stand out. Comparison with the green line (the closest point to the observation) shows improvements, large in some cases and more modest in others. For two airports (Vigo and Málaga), there is no clear improvement, but the ridge method there performs similarly to the model, so the calibration does not degrade the forecast either.
As with the temperature, the dataset was from November 14, 2016, to January 22, 2018. The main ideas implemented for the case of the temperature are also applied for the wind speed at 10 meters. However, some caveats need to be considered. First of all, as the LAM-EPS AEMET-
The wind components
It is important to note that the training was done for the wind speed only, that is, the scalar value, not for the wind vector with its magnitude and direction. This was a deliberate choice. It was found that, when training on the wind as a vector, part of the learning went into the direction and part into the magnitude. A wind vector whose direction differs from the METAR by a few degrees was judged not very relevant, whereas a difference of a couple of knots (or m/s) is more substantial and more useful to forecasters. So the modulus of the wind vector was calculated from the
It is also relevant to note that the only quality control performed on the observations and the model was a basic check to remove gross outliers, in a similar spirit to the case of the temperature: balancing the need to avoid gross outliers that would ruin the learning against the need to penalize the model for bad values. The threshold was set at 100 m/s. Unlike with the temperature, it is quite difficult to perform a quality control on the wind that does not discard numerous valid measurements. Also, unlike with the temperature, it is unrealistic to expect a regular pattern in the evolution of the wind that would allow valid comparisons between values some hours earlier and later.
The 4 grid points closest to the observation point were identified, and a multivariable regression was performed. Not all of them were land points. For the wind, measured at 10 meters, there is also a difference between land and sea, but this difference was considered less important than for the temperature (with its strong diurnal cycle over land points), and other factors were considered more important (the type of terrain, for instance). As in the case of the temperature, it is possible to see in the example of Figure
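The two steps described above, computing the scalar wind speed and discarding gross outliers, can be sketched in a few lines of Python. The 100 m/s threshold comes from the text; the component names `u` and `v`, the function names, the lower bound of zero, and the sample values are assumptions made for illustration.

```python
import math

MAX_WIND_MS = 100.0  # gross-outlier threshold quoted in the text


def wind_speed(u, v):
    """Modulus of the horizontal wind vector from its two components (m/s)."""
    return math.hypot(u, v)


def drop_gross_outliers(pairs, limit=MAX_WIND_MS):
    """Keep (forecast, observation) pairs whose speeds both lie in [0, limit].

    The lower bound of zero is an assumption: a speed cannot be negative.
    """
    return [(f, o) for f, o in pairs if 0.0 <= f <= limit and 0.0 <= o <= limit]


# Hypothetical (u, v) forecasts and observed speeds, all in m/s.
forecast_uv = [(3.0, 4.0), (-6.0, 8.0), (300.0, 400.0)]
observed = [5.5, 9.0, 7.0]
pairs = [(wind_speed(u, v), o) for (u, v), o in zip(forecast_uv, observed)]
clean = drop_gross_outliers(pairs)
```

Only the third pair is removed; forecasts that are merely poor stay in the dataset, so the training still penalizes them.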
Quality of the ridge regression for the 4 closest points (wind speed).
Quality of the ridge regression for different number of points (wind speed).
The scatter plot is also shown (Figure
10 meters total wind scatter plot for Madrid airport, LAM-EPS AEMET-
For each MSE, the MSE plus the standard deviation is represented as the top of the bar and the MSE minus the standard deviation as the lowest part of the bar, to give an idea of the range of variability. The red line is the model output without postprocessing for the minimum MSE of the 4 points. The green line is the MSE of the closest point to the observation, so it is the appropriate reference for comparing the model with the ML methods. As shown in the calibration graphs (Figures
10 meters total wind calibration plot for Madrid airport. Horizontal axis, from left to right: ridge, lasso, elastic net, support vector regression, Bayesian ridge, random forest regression, gradient boosting regression, XGBoost regression, polynomial of order 2, and feedforward neural network.
10 meters total wind calibration plot for Barcelona airport. Horizontal axis as in Figure
10 meters total wind calibration plot for Palma airport. Horizontal axis as in Figure
10 meters total wind calibration plot for Vigo airport. Horizontal axis as in Figure
10 meters total wind calibration plot for Málaga airport. Horizontal axis as in Figure
As for the wind and the temperature, the same dataset was used, from November 14, 2016, to January 22, 2018. Calibrating the precipitation is a very subtle issue. It is well known that precipitation does not follow a Gaussian distribution. It is also known that, when calibrating the precipitation, it is necessary to take into account that, besides the numerical quantities, the structure of the precipitation is also important. That is why a different approach was followed. Unlike for the wind and the temperature, the points used were the 12 closest neighbouring grid points of the model, not the 4 closest ones. It was thought that this number of points accounts for the high spatial uncertainty that affects the precipitation. With this number of points, the features of a precipitation structure are captured without giving up the high-resolution properties of the LAM-EPS AEMET-
Irregular octagon representing the 12 points considered to calibrate the precipitation. The blue points are the grid points from the model, and the red point is the observation point.
The temperature at 2 meters and the
As in the case of the wind field, the quality control consisted of discarding gross outliers (if any) in both the model and the observations. In the case of the observations, quality controls are carried out before any data are incorporated into the Spanish climatological database. For the model, as in the case of the wind speed, it is possible to discard only gross outliers that clearly indicate that something went wrong when computing or storing the data; these are outliers due to mechanical or operational issues, not related to the model design and performance. Except for gross outliers, bad values from the model were included and act as a penalization in the training. The threshold was set at 2000 millimetres in 24 hours for the precipitation, at 100 m/s for the wind speed, and at ±80 degrees for the temperature, as before. As a safety check, rows with negative values were deleted: this can happen when transforming data (for instance, in the Spanish climatological database, values are stored as tenths of millimetres and, for this work, they were converted to millimetres); this phenomenon is called
When dealing with this type of regression, the possibility of standardizing the dataset was considered.
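A minimal Python sketch of standardization follows, assuming that it means rescaling each column to zero mean and unit standard deviation (z-scores). The function name and the toy precipitation values are hypothetical, and the exact scaling used in this work may differ.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation.

    Returns the scaled values together with the mean and standard deviation,
    so the same transformation can be reused on new data.
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in column], mean, std
    return [(v - mean) / std for v in column], mean, std


# Hypothetical 24-hour precipitation forecasts at one grid point, in mm.
precip = [0.0, 0.0, 2.0, 10.0]
scaled, mean, std = standardize(precip)
```

In a cross-validation setting, the mean and standard deviation are normally computed on the training folds only and then applied unchanged to the validation fold.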
In the graphs, for each MSE, the MSE plus the standard deviation is represented as the top of the bar and the MSE minus the standard deviation is represented as the lowest part of the bar, to give an idea of the range of variability. The red line is the model output without postprocessing for the minimum MSE of the 12 points. The green line is the MSE of the closest point to the observation, so it is really the point (or the line) that should be used to do the comparison between the model and the ML methods. The results are shown for the precipitation without standardization (Figures
Precipitation in 24 hours, calibration plot for Madrid airport. Horizontal axis, from left to right: ridge, lasso, elastic net, support vector regression, Bayesian ridge, random forest regression, gradient boosting regression, XGBoost regression, AdaBoost regression, and feedforward neural network.
Precipitation in 24 hours, calibration plot for Barcelona airport. Horizontal axis as in Figure
Precipitation in 24 hours, calibration plot for Palma airport. Horizontal axis as in Figure
Precipitation in 24 hours, calibration plot for Vigo airport. Horizontal axis as in Figure
Precipitation in 24 hours, calibration plot for Málaga airport. Horizontal axis as in Figure
Precipitation in 24 hours, standardized calibration plot for Madrid airport. Horizontal axis as in Figure
Precipitation in 24 hours, standardized calibration plot for Barcelona airport. Horizontal axis as in Figure
Precipitation in 24 hours, standardized calibration plot for Palma airport. Horizontal axis as in Figure
Precipitation in 24 hours, standardized calibration plot for Vigo airport. Horizontal axis as in Figure
Precipitation in 24 hours, standardized calibration plot for Málaga airport. Horizontal axis as in Figure
Note that the blue bars that denote the standard deviation sometimes reach negative values. Of course, this does not mean that the MSE is a negative magnitude. It simply reflects the fact that the cross-validation technique has shown a wide variability of the MSE, which varies a lot depending on which slice is the validation set and which slices are the training set. Blue bars are by definition symmetric around the average of the MSE, that is, the top of a bar is the average of the MSE plus the standard deviation and the lowest part of a bar is the average of the MSE minus the standard deviation. So, a bar that extends into negative values simply indicates an MSE whose standard deviation across slices is larger than its mean.
As can be seen, precipitation is a very subtle variable to calibrate. For the precipitation, each point has its peculiarities to an even greater degree than for the wind speed or the temperature. What can be said is that standardization helps (although perhaps not always). For the precipitation, the most sophisticated methods, such as support vector regression and neural networks, begin to show their strength, although reasonable results are still achieved with ridge.
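The effect can be reproduced with a small Python sketch of k-fold cross validation. The one-parameter linear correction, the fold assignment, and the toy data (mostly dry days plus one badly missed heavy event) are hypothetical choices made for illustration.

```python
import statistics


def kfold_mse(forecast, observed, k=5):
    """MSE of a one-parameter linear correction on each of k validation folds."""
    n = len(forecast)
    scores = []
    for fold in range(k):
        val = set(range(fold, n, k))  # every k-th sample is held out
        train = [i for i in range(n) if i not in val]
        # Least-squares slope through the origin, fitted on the training folds.
        sxy = sum(forecast[i] * observed[i] for i in train)
        sxx = sum(forecast[i] ** 2 for i in train)
        w = sxy / sxx if sxx else 0.0
        scores.append(
            statistics.fmean((w * forecast[i] - observed[i]) ** 2 for i in val)
        )
    return scores


# Hypothetical 24-hour precipitation (mm): one heavy event is badly underforecast.
forecast = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 3.0, 0.0]
observed = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 40.0, 0.0]

scores = kfold_mse(forecast, observed)
mean_mse = statistics.fmean(scores)
std_mse = statistics.pstdev(scores)
bar_bottom = mean_mse - std_mse  # can be negative although every MSE is >= 0
bar_top = mean_mse + std_mse
```

Every fold has a nonnegative MSE, but the fold that contains the heavy event dominates the spread, so the lower end of the bar falls below zero.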
As has been shown, ML methods are a valuable tool for the calibration of meteorological models. Classical linear regression, with the added help of regularization, works very well for the temperature and the wind speed. In the case of the precipitation, there is no preferred method and things seem to depend on the point and on the nature of the dataset. This is not surprising, because it is known that no ML method is universally valid for all datasets [
It is legitimate to ask oneself what these ML methods are really doing (at least, what they are
Why do some methods perform better than others? In most cases, when doing ML, it is hardly known a priori which method will be the right one. It is trial and error that finally determines which method has the best performance. However, from purely physical considerations, for the wind speed and the temperature, the success of relatively simple methods like ridge, elastic net, lasso, or Bayesian ridge, which are basically extensions of a linear regression, is probably linked to the facts mentioned in the previous paragraph: the correction of mainly systematic errors due to relatively few and controlled sources of error for these variables. In the case of the precipitation, with all the uncertainties and complexities involved, more sophisticated methods like the FNN, which are capable of discerning more subtle signals in the data, begin to give better results. The FNN and the rest of the sophisticated methods are harder to train, with a tendency to overfit among other subtleties; these methods are not geared to relatively better determined problems like the forecast of the wind speed or the temperature.
It is important to note that the calibration works well when the ML methods deal with values that lie between the minimum and maximum values in the dataset, in other words, values within the range of what the algorithm has “seen.” When a calibrated algorithm faces a value that is outside the trained range, anything can happen. Depending on their nature, some algorithms will perform a linear extrapolation and others could fit the value to some complex, high-order polynomial curve. To avoid this behaviour, it is possible to set a flag or similar warning that deactivates the algorithm for such a value, letting the direct (uncalibrated) output of the model be the definitive value. In any case, the extreme value is added to the dataset and will be part of a future training process.
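The safeguard described above can be sketched in Python as a thin wrapper around any calibration function. The function names, the single-predictor interface, the linear calibration, and the training range are hypothetical and serve only to illustrate the idea.

```python
def calibrate_with_guard(raw, calibrate, train_min, train_max):
    """Apply the calibration only inside the range seen during training.

    Returns (value, used_calibration).
    """
    if train_min <= raw <= train_max:
        return calibrate(raw), True
    # Outside the trained range: fall back to the direct model output.
    return raw, False


def linear_calibration(t):
    """Hypothetical calibration learned elsewhere, for illustration only."""
    return 0.9 * t + 1.0


# Hypothetical training range for the 2-meter temperature, in degrees Celsius.
inside = calibrate_with_guard(20.0, linear_calibration, -5.0, 40.0)
outside = calibrate_with_guard(45.0, linear_calibration, -5.0, 40.0)
```

The returned flag makes it easy to log the out-of-range cases, so that they can be added to the dataset for a later retraining.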
With respect to the calibration with ML, there are many lines of research that can be explored in the future. It is possible to dive deeper in the realm of ML methods, searching for instance how
Huge amounts of data have been used for this work, and parts of them could be released (although we cannot guarantee it) if needed by contacting the corresponding author via
A previous version of this article appeared in a Spanish book about different strategies regarding weather forecasting. It was a summary of what has been shown here, and entire sections, such as the analysis of the precipitation, were omitted. The authors did not earn any money from the publication of the book.
The authors declare that there are no conflicts of interest.
The authors thank the Spanish weather service, AEMET, for its funding and support, both the headquarters office and the local office of the Canary Islands. They also thank the ECMWF, the Canadian CMC, the French Météo‐France, the Japanese JMA, and the North American NOAA for their kindness providing the boundary conditions for the LAM-EPS‐AEMET‐γSREPS, and also they thank the North-American NOAA, NCEP, NCAR, NWS, and related communities of WRF-ARW and NMMB models and the Harmonie community. The authors also thank all the team of the LAM‐EPS AEMET‐γSREPS ensemble, for its kind support and help in many topics, especially to José Antonio García‐Moya Zapata: he has been very kind, showing the path when David Quintero Plaza was a total beginner; more than a team leader he has been a mentor. Special thanks go to José Luis Casado Rubio, for his design of a very great software library. We also want to thank Álvaro Subías Díaz‐Blanco and Alfons Callado Pallarès for their help and useful comments. We thank the community of the data science and Machine Learning in Python for developing really great tools.