An Improved Ensemble Method for Completely Automatic Optimization of Spectral Interval Selection in Multivariate Calibration

In our recent work, Monte Carlo cross validation stacked regression (MCCVSR) was proposed to achieve automatic optimization of spectral interval selection in multivariate calibration. Although MCCVSR performs well under normal conditions, it needs improvement for more general applications. According to the well-known principle of “garbage in, garbage out” (GIGO), MCCVSR, as a precise ensemble method, might be influenced by outlying and very bad submodels. In this paper, a statistical test is designed to exclude such ruinous submodels from the ensemble learning process, so that the combination process becomes more reliable. Though completely automated, the proposed method is adjustable according to the nature of the data analyzed, including the size of the training set, the resolution of the spectra, and the quantitative potential of the submodels. The effectiveness of submodel refining is demonstrated on a real standard data set.


Introduction
Multivariate spectroscopic calibration is an old and yet ever-growing research field in chemometrics. Multivariate calibration is a comprehensive technique, and its successful application requires experience and expertise from practitioners. Multivariate calibration modeling involves many steps, such as outlier diagnosis, selection of representative training samples, data preprocessing, model optimization, and validation [1]. Because of the complexity and uncertainty of the data analyzed, each of these steps strongly affects the success of calibration and should therefore be performed properly. Moreover, with the increasing need to quickly quantify target components in complicated chemical systems across different fields, automatic optimization of multivariate calibration modeling will undoubtedly boost the application of chemometrics to analytical chemistry.
Modern spectroscopic instruments can provide a spectrum measured at hundreds or even thousands of wavelengths in a few seconds. An important step in multivariate calibration is wavelength selection. Taking the most popular method, partial least squares (PLS), as an example, wavelength selection and model optimization are usually performed simultaneously. The model complexity of PLS should be determined on the best subset of the measured wavelengths. Moreover, both practical experience [2][3][4][5] and theoretical research [6,7] support the view that proper wavelength selection is necessary for multivariate spectroscopic calibration. Much of the literature has been devoted to this problem; for comprehensive reviews, see [8,9].
The present paper focuses on interval selection. Firstly, for spectral data such as near-infrared (NIR) spectra, an important feature of the analytical channels is their continuity [10,11]. Spectral continuity in calibration means that when a certain wavelength contains useful quantitative information or is contaminated, its neighboring wavelengths very likely do or are as well. Therefore, different spectral intervals have different data structures; that is, different optimized interval PLS models are very likely to have different model complexity. Secondly, for spectral data with hundreds or even thousands of wavelengths, treating the wavelengths as intervals simplifies the selection procedure, because the number of intervals is much smaller than the total number of wavelengths. Two pioneering methods exist for interval selection: interval PLS (iPLS) [10,11] builds models on evenly split spectral intervals, while moving-window PLS (MWPLS) [12] builds interval PLS models on a spectral window moving along the total spectral range. Both methods give a graphical view of the quantitative potential and complexity of local intervals and provide a straightforward tool for interval selection and model optimization. The original iPLS and MWPLS select the intervals with low errors and low model complexity. This strategy is reasonable and intuitive, but selecting the intervals to include in iPLS or determining the interval borders in MWPLS still depends heavily on experience. Some researchers have contributed to improving and optimizing iPLS or MWPLS [13][14][15]; however, many of these methods are computationally expensive or do not achieve the optimum models. Considering the local data structure, putting all the seemingly "good" intervals into one PLS model might not be the best choice. For these reasons, combining and optimizing suitable small-interval models by ensemble learning methods seems very attractive.
In our recent work, an improved ensemble learning method, Monte Carlo cross validation (MCCV) [16] stacked regression (MCCVSR) [17], was used to optimize interval selection. Unlike other common ensemble methods, which combine models by averaging, taking a median, and so on, MCCVSR has its own optimization objective, namely the lowest root mean squared error of MCCV (RMSEMCCV). Moreover, MCCVSR combines the MCCV of models on small spectral intervals with nonnegative least squares (NNLS), which is computationally very economical. Interval selection is optimized by weighting the submodels according to the criterion of lowest RMSEMCCV.
MCCVSR performs very well when the submodels are reasonable or not very bad. Moreover, it can exclude poor models by giving them zero weights in NNLS. However, a concern for general use is that, when the method is applied to data sets with more uncertainty, very bad submodels might spoil the prediction results. According to the well-known "garbage in, garbage out" principle, just one outlying submodel with a nonzero weight in the combination can lead to poor predictions by the final ensemble model. Moreover, if many outlying or very poor submodels exist, they can mask each other and receive nonzero weights in the ensemble model. Therefore, to obtain an automatic and, more importantly, generally reliable algorithm, it is necessary to preselect the submodels before combination in MCCVSR. In this work, a statistical test is designed to preselect interval models, yielding a completely automatic algorithm for interval selection and model optimization.

MCCVSR
Stacked regression (SR) [18,19] is an interesting ensemble method that combines submodels without suffering from the correlation among them. Because a large number of combination coefficients can increase the model's degrees of freedom and lead to overfitting, MCCV [16] is introduced into SR to improve it. Because MCCV allows a large number of samplings and a high percentage of left-out samples, it can effectively reduce the risk of overfitting in both the submodels and the combination. MCCVSR optimizes the combination model as follows:

$$\min_{\mathbf{w}} \left\| \mathbf{y}_{\mathrm{MCCV}} - \sum_{i=1}^{K} w_i \, \mathbf{y}_{\mathrm{MCCV},i} \right\|^2 \quad \text{subject to } w_i \ge 0, \; i = 1, 2, \ldots, K, \tag{1}$$

where the column vector $\mathbf{y}_{\mathrm{MCCV}}$ contains the reference concentration values of the left-out samples during MCCV sampling and $\mathbf{y}_{\mathrm{MCCV},i}$ contains the corresponding values predicted by submodel $i$ ($i = 1, 2, \ldots, K$). The $K \times 1$ vector $\mathbf{w}$ contains the model combination coefficients $w_i$, and $K$ is the number of submodels. The combination coefficient vector $\mathbf{w}$ in (1) is readily computed by NNLS, which has been shown to be more suitable for combination than ordinary least squares because it avoids excessively large weights for some submodels [19]. The prediction by the combined model can be expressed as

$$\mathbf{y}_{\mathrm{un}} = \sum_{i=1}^{K} w_i \, \mathbf{y}_{\mathrm{un},i}, \tag{2}$$

where $\mathbf{y}_{\mathrm{un},i}$ contains the concentrations of the unknown samples predicted by submodel $i$ ($i = 1, 2, \ldots, K$). More details of MCCVSR can be found in [17].
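The nonnegative combination in (1) and the combined prediction can be illustrated with a small dependency-free sketch. This is not the solver used in [17]: a plain cyclic coordinate-descent scheme is swapped in for a dedicated NNLS routine (in practice a library routine such as `scipy.optimize.nnls` would normally be preferred), and the function names and toy numbers are hypothetical.

```python
def nnls_coordinate_descent(columns, y, n_sweeps=500, tol=1e-12):
    """Minimize ||y - sum_i w_i * columns[i]||^2 subject to w_i >= 0.

    columns[i] plays the role of the MCCV predictions of submodel i,
    y the role of the reference values of the left-out samples.
    """
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - sum_i w_i * columns[i] while w is all zeros
    norms = [sum(v * v for v in col) for col in columns]
    for _ in range(n_sweeps):
        largest_change = 0.0
        for i, col in enumerate(columns):
            if norms[i] == 0.0:
                continue  # an all-zero column cannot reduce the error
            # Exact minimizer along coordinate i, clipped at zero.
            step = sum(c * r for c, r in zip(col, residual)) / norms[i]
            new_w = max(0.0, w[i] + step)
            delta = new_w - w[i]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[i] = new_w
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:
            break
    return w


def combine(weights, predictions):
    """Weighted sum of the submodel predictions (the combined prediction)."""
    n = len(predictions[0])
    return [sum(w * p[j] for w, p in zip(weights, predictions)) for j in range(n)]


# Toy data (hypothetical): one reasonable submodel and one anti-correlated one.
y_mccv = [1.0, 2.0, 3.0, 4.0]
preds = [[1.1, 1.9, 3.2, 3.8], [4.0, 3.0, 2.0, 1.0]]
w = nnls_coordinate_descent(preds, y_mccv)
y_hat = combine(w, preds)
```

In this toy case the unconstrained least-squares weight of the second submodel would be negative, so the nonnegativity constraint pins it at zero and only the first submodel contributes to `y_hat`.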

Refining Submodels by Statistical Tests
Here, a statistical method is introduced to test the significance of the correlation coefficient, $r$, between $\mathbf{y}_{\mathrm{MCCV},i}$ ($i = 1, 2, \ldots, K$) and the corresponding reference values, $\mathbf{y}_{\mathrm{MCCV}}$. Only submodels with a significantly high correlation coefficient are included in the combination. Because the sampling distribution of correlation coefficients is much more complex than that of means or mean differences, Fisher's approximately normal transformation [20,21] of $r$ to $Z$ is used:

$$Z = \frac{1}{2} \ln\!\left( \frac{1 + r}{1 - r} \right). \tag{3}$$

The new, approximately normal statistic $Z$ has an expected standard deviation $\sigma_Z$ close to $1/\sqrt{n - 3}$, where $n$ is the length of the sampling vector. The resulting test value is referred to a normal distribution to test whether $r$ is significantly larger than a threshold value. Considering the nature of different data sets, both the significance level and the threshold value of this one-sided test should be adjustable. For instance, at the frequently used significance level of 0.05, one can adopt a higher threshold value when the spectral intervals are very effective for quantitative analysis, and a lower one otherwise. In this paper, the default threshold value of the correlation coefficient is 0.80.

Optimizing Interval Selection by
In MCCVSR, PLS submodels built on small spectral intervals [17] are combined. MCCVSR optimizes interval selection by weighting the submodels to achieve the lowest RMSEMCCV value among all combined models with nonnegative constraints. It is only necessary to perform MCCV on the small interval models and combine them by NNLS, which is computationally very economical.
To achieve more precision in interval selection, the idea of a moving window is introduced into MCCVSR. The step by which the interval models evolve can be adjusted according to the resolution of the spectral data. For example, the wavelength step can be 1, 2, 3, 4, 5, or any other positive integer. A default wavelength step of 5 and a window width of 30 are adopted in this paper. Of course, for spectral data with very high resolution, it is wise to use a larger evolving step to save computation time.
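The moving-window scheme with the default width of 30 and step of 5 can be sketched as follows. The function name is hypothetical, and the sketch assumes that only full-width windows are kept, so that any incomplete window at the end of the spectrum is dropped; the text does not state how the end of the range is handled.

```python
def moving_windows(n_channels, width=30, step=5):
    """Return (start, stop) channel index pairs of full-width windows.

    Indices are zero-based and stop is exclusive. Incomplete trailing
    windows are dropped (an assumption of this sketch).
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive integers")
    return [(s, s + width) for s in range(0, n_channels - width + 1, step)]


# A spectrum recorded from 580 to 1091 nm in 1 nm increments has 512 channels.
windows = moving_windows(512)
n_windows = len(windows)
```

Under these assumptions, 512 channels give 97 windows, which agrees with the number of interval models reported for the temperature data.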

Data Descriptions
To test the performance of the proposed method, a standard real data set is investigated.
Temperature data [22]. Spectra of 19 mixtures of ethanol, water, and isopropanol, together with the spectra of the pure compounds, were recorded on an HP 8453 UV-VIS spectrometer. The spectra, covering 580 to 1091 nm in 1 nm increments, were measured at 30, 40, 50, 60, and 70 degrees Celsius. Representative samples measured at the five temperatures are selected to form a training set for predicting the concentrations of the three components.

Results and Discussions
The data set has 19 mixtures of 3 components (ethanol, water, and isopropanol), together with the 3 pure components, measured at 5 different temperatures, giving 110 samples in total. To develop global calibration models for predicting the percentages of the 3 components, the DUPLEX method [23] is used at each temperature to uniformly select 16 samples for training and 6 samples for testing. This gives a training set of 80 samples and a test set of 30 samples. Some of the original training spectra are plotted in Figure 1.
For each component, a PLS model with the total spectral range, an MCCVSR model, and an improved MCCVSR model with the refining step are built. The complexity of the full-range PLS model and of the PLS interval models is determined by MCCV with 50 samplings, where 50 percent of the training samples are left out for prediction in each sampling. The number of latent variables is chosen so that the RMSEMCCV value is minimized. The root mean squared error of calibration (RMSEC) and the root mean squared error of prediction (RMSEP) are used to evaluate the quality of the models. The results of the PLS models with the total spectral range are listed in Table 1. It can be seen from Table 1 that the numbers of PLS latent variables in these models are much larger than 3, indicating the high complexity of the data. Essentially, influenced by temperature variations and other factors, the spectra are far from those expected of a common 3-component system. Moreover, it should be noted that the RMSEP values are much higher than the RMSEC values. Clearly, some spectral intervals are complicated and the global models contain many variations that are not correlated with concentration; therefore, wavelength selection is necessary.
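The MCCV procedure described above (50 samplings, 50 percent of the training samples left out each time) can be sketched as below. This is only an illustration under stated assumptions: a univariate straight-line fit stands in for the PLS submodel, RMSEMCCV is taken as the RMSE over the pooled left-out predictions of all samplings, and all function names and the toy data are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mccv_predictions(x, y, fit, predict, n_samplings=50, leave_out=0.5, seed=0):
    """Pool reference and predicted values of the left-out samples over all samplings."""
    rng = random.Random(seed)
    n = len(y)
    n_out = max(1, round(leave_out * n))
    y_ref, y_hat = [], []
    for _ in range(n_samplings):
        out = set(rng.sample(range(n), n_out))  # randomly chosen left-out samples
        train = [i for i in range(n) if i not in out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in sorted(out):
            y_ref.append(y[i])
            y_hat.append(predict(model, x[i]))
    return y_ref, y_hat


def fit_line(xs, ys):
    """Stand-in model (NOT PLS): ordinary least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predict_line(model, x_new):
    slope, intercept = model
    return slope * x_new + intercept


# Toy data (hypothetical): a noisy straight line with 20 samples.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, v in enumerate(x)]
y_ref, y_hat = mccv_predictions(x, y, fit_line, predict_line)
rmse_mccv = rmse(y_ref, y_hat)
```

For a PLS submodel, the same loop would be repeated for each candidate number of latent variables, and the number giving the lowest `rmse_mccv` would be retained. The pooled vectors `y_ref` and `y_hat` correspond to the MCCV reference and prediction vectors that enter the combination step.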
For the MCCVSR models, PLS submodels are built on a spectral interval moving along the spectral range. The interval contains 30 wavelengths and its step is set to 5 wavelengths, giving 97 interval models in all. The complexity of all the interval models is determined by the MCCV described above. For each interval model, the number of PLS latent variables is chosen to obtain the lowest RMSEMCCV value. The submodels are then combined by $\mathbf{w}$, as in (1). As an example, Figure 2 presents the optimized complexity of the interval models and their RMSEMCCV values for the prediction of ethanol. As shown in Figure 2, the local data structures are very complicated: some interval models with lower complexity have higher RMSEMCCV values, while many interval models with higher complexity show better quantitative potential. Therefore, an intuitive selection of intervals based on low complexity and low error is not easy, and automation of this procedure is necessary. The combination coefficients of MCCVSR and of MCCVSR with submodel refining for the prediction of ethanol are plotted in Figure 3. Here, the significance level of the test is set to 0.05 and the threshold value of the correlation coefficient is 0.80. From Figure 3, it can be seen that, with submodel refining, some interval models are excluded from the final combination, including two submodels that have nonzero weights in MCCVSR. This change might seem trivial but should not be overlooked. Owing to the nature of NNLS, when most submodels (predictors) are very accurate, MCCVSR is very resistant to bad submodels, as is the case here. However, when the spectral intervals generally have poor quantitative potential, MCCVSR is prone to include very bad models.
The calibration results for the three components obtained by the combination models are listed in Table 2. Compared with the PLS models using the total spectral range, the combination models show improved training and prediction performance in terms of RMSEC and RMSEP. With the interval models refined, the number of submodels in the combination is reduced but the precision is maintained.
Some parameters involved in MCCVSR should be discussed. When performing MCCV, two important parameters are the percentage of left-out samples and the number of samplings. Generally speaking, as long as outliers have been removed and the computation time permits, a larger number of samplings and a higher percentage of left-out samples help to reduce the risk of overfitting in both the single submodels and the combination. On the other hand, the percentage of left-out samples can be adjusted according to the size of the training set so that enough representative samples remain for modeling. The size of the spectral interval and the evolving step are also adjustable. Firstly, an interval should contain enough wavelengths (at least 20 channels) to build a stable calibration model. Secondly, the evolving step can be larger to save time when the spectral resolution is high. When performing the statistical test at a significance level of 0.05, the threshold value of the correlation coefficient can be adjusted according to the quantitative potential of the submodels so that enough models remain for combination. An empirical value of 0.80 is recommended, which is enough to eliminate the outlying models.
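The submodel refining test with a significance level of 0.05 and a correlation threshold of 0.80 can be sketched as follows. This is one standard way to carry out a one-sided test on a correlation coefficient with Fisher's transformation and is offered as an assumption about the details, not as the authors' exact implementation; the function name and the example values of r and n are hypothetical.

```python
import math
from statistics import NormalDist


def passes_correlation_test(r, n, r0=0.80, alpha=0.05):
    """One-sided test of H0: rho <= r0 against H1: rho > r0.

    r  : correlation between a submodel's MCCV predictions and the references
    n  : length of the pooled MCCV vector (must exceed 3)
    r0 : threshold correlation, alpha : significance level
    """
    if n <= 3:
        raise ValueError("n must be larger than 3")
    # Keep r strictly inside (-1, 1) so that atanh stays finite.
    r = max(-1.0 + 1e-12, min(1.0 - 1e-12, r))
    # math.atanh(r) equals 0.5 * ln((1 + r) / (1 - r)), i.e. Fisher's Z.
    z = (math.atanh(r) - math.atanh(r0)) * math.sqrt(n - 3)
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


# Hypothetical submodels, each evaluated on n = 500 pooled MCCV predictions.
keep_good = passes_correlation_test(0.95, 500)        # clearly above the threshold
keep_borderline = passes_correlation_test(0.82, 500)  # above 0.80, but not significantly
keep_bad = passes_correlation_test(0.50, 500)         # below the threshold
```

Only submodels for which the function returns `True` would be passed on to the NNLS combination. Raising `r0` makes the refining stricter, while lowering it keeps more submodels when the intervals generally have weaker quantitative potential.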

Conclusions
In our recent work, MCCVSR was shown to be a computationally economical and effective method for wavelength selection. To make the MCCVSR algorithm more reliable and completely automated for wavelength selection, a statistical test is designed to exclude outlying submodels from the final ensemble learning with little or no degradation of model precision. On a real data set, the improved MCCVSR method performs almost as well as the original algorithm in terms of training and prediction. Moreover, with fewer, refined submodels, the final combination is expected to be more reliable. In addition, the algorithm is completely automated and adjustable according to the nature of the data analyzed. Although only the problem of wavelength selection is tackled here, the idea of refining the submodels before ensemble combination should be generally beneficial to multivariate calibration with ensemble methods such as bagging [24]. Finally, the proposed method can perform reliable wavelength selection automatically and is robust against poor interval models, but not against outliers in the reference concentrations (y) or spectra (X). It is therefore not a robust multivariate calibration method like robust principal component regression [25,26] or robust PLS [27], and outliers should be removed before calibration.