Moving-Window-Improved Monte Carlo Uninformative Variable Elimination Combining Successive Projections Algorithm for Near-Infrared Spectroscopy (NIRS)

The MC-UVE-SPA method is commonly proposed as a variable selection approach for multivariate calibration. However, the SPA tends to select wavelength variables that are sparsely distributed over the wavelength ranges of the variables selected by the MC-UVE algorithm, and the MC-UVE-SPA cascade cannot remedy this wavelength point discontinuity. This paper addresses the problem by proposing a moving-window- (MW-) improved MC-UVE-SPA wavelength selection algorithm. The proposed algorithm improves the continuity of the selected wavelength variables and thereby better exploits the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are applied to wavelength variable selection for the NIR spectral absorbance data of corn, diesel fuel, and ethylene. Here, partial least squares regression (PLSR) models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration are established after conducting wavelength selection using the MC-UVE algorithm, and corresponding multiple linear regression (MLR) models are established after conducting wavelength selection using the MC-UVE-SPA and MC-UVE-SPA-MW algorithms. Experimental results demonstrate that the progressive elimination of uncorrelated and collinear variables generates increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables, while the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.


Introduction
Because it is simple, rapid, and noninvasive and requires no sample pretreatment, near-infrared (NIR) spectroscopy [1] has been adopted as a popular analytical tool for both qualitative and quantitative analyses in various fields [2][3][4][5].
The quantitative analysis of NIR spectral data is generally conducted through the construction of regression models, such as those based on principal component analysis (PCA) [6], partial least squares (PLS) regression [7], and multiple linear regression (MLR) [8], which take the characteristic wavelengths of the spectral data as input variables. However, the development of modern analytical instruments has made it possible to acquire NIR spectral data that can easily contain hundreds to tens of thousands of individual wavelengths [9]. When the full-band spectral data are adopted for modeling, the model contains a large amount of redundant information, which results in inefficiency [10]. In addition, spectral data usually contain noise, interference, and/or mixed spectral components that can greatly detract from the prediction accuracy of full-spectrum models developed for spectral data analysis [11]. Yun et al. pointed out that there are three ways to address these problems, namely, regularization, dimension reduction, and variable selection [12]. Among these three approaches, variable selection has become the dominant method of interest in recent years for the development of NIR spectral analysis technology and chemometrics [11][12][13][14]. The goal of wavelength selection is to identify the most informative wavelengths for use as variables in partial-spectrum regression models. Here, uninformative wavelength variables have either no effect or a negative effect on the modeling performance. The wavelength selection process fulfils three purposes: (1) providing models with greater predictive capability, (2) obtaining wavelength variables that provide greater modeling efficiency, and (3) providing simpler models with improved interpretability [9]. The most commonly employed wavelength selection algorithms developed thus far include uninformative variable elimination (UVE) and the successive projections algorithm (SPA).
The goal of UVE, first proposed by Centner et al. [15], is not to select variables directly, but to effectively eliminate uninformative variables in the spectral data, such that only informative wavelength variables remain. The SPA employs simple projection to select variables with a minimum of collinearity, but variables selected by SPA may make little contribution to multivariate calibration, which can affect model prediction [16]. A significant development in recent years has been the combined use of different algorithms through a cascade strategy, where the results of one wavelength selection algorithm are used as the inputs of the next selection algorithm in a stepwise manner.
This can combine the advantages of various wavelength selection algorithms in a complementary way and thereby obtain better and more effective prediction results. Combining a common variable selection method with the SPA can greatly simplify the model and improve the prediction accuracy. This strategy has been effectively used in many studies to address the problem associated with the application of the SPA to NIR spectral data by first reducing the dimension of the spectral data with an initial algorithm such as UVE, MC-UVE, particle swarm optimization (PSO), or genetic algorithm (GA) optimization [16][17][18][19][20]. Among them, UVE and MC-UVE are commonly used as the first-stage wavelength selection algorithms ahead of the SPA. For example, Ye et al. proposed the combination of UVE and SPA to integrate the strengths of each and successfully applied it to variable selection in the NIR spectroscopic analysis of nicotine in tobacco lamina and active pharmaceutical ingredients in intact tablets. UVE was employed to select informative variables, and SPA then selected, from those informative variables, the variables with minimum redundant information [20]. Li et al. proposed a new combination of MC-UVE and SPA, in which MC-UVE was employed to select informative variables in the full spectrum and SPA was employed for further characteristic variable selection [18].
Nonetheless, most of the informative wavelengths in a molecular NIR spectrum typically exhibit some continuity, where wavelength points adjacent to an informative wavelength point also represent informative wavelengths [21]. However, the MC-UVE algorithm and the SPA are both wavelength selection algorithms based on optimal wavelength points, which are most likely isolated points along the full NIR spectrum. The MC-UVE-SPA cascade cannot remedy this wavelength point discontinuity, so it may yield the smallest number of selected wavelength variables without yielding the best model performance. Fan et al. constructed a model for visible/NIR spectral data reflecting the lycopene content based on wavelength variable selection obtained using UVE, SPA, and CARS individually and in various two-stage cascaded combinations [22]. The UVE-SPA combination was found to retain the smallest number of wavelength variables of all the selection algorithms considered, but the prediction accuracy of the model constructed using this wavelength variable set was the worst among the models obtained with the wavelength selection algorithms considered. Sun et al. showed that the prediction results of the model constructed by the cascaded wavelength selection algorithm were not always the most accurate, and the prediction results of the improved cascaded wavelength selection algorithm were better than those of the direct two-stage cascaded strategy [23].
Few studies have considered improving the continuity of the wavelengths selected by wavelength point selection algorithms. Therefore, this paper considers the continuity of the wavelengths selected by the MC-UVE-SPA. This study proposes a moving-window-improved cascade strategy for wavelength selection, herein denoted as the MC-UVE-SPA-MW algorithm. First, uninformative variables are eliminated by MC-UVE; next, collinear variables are eliminated by the SPA; finally, wavelength variables are selected by extending outward from the optimal wavelength points chosen by the MC-UVE-SPA using a moving window. This reduces the number of isolated wavelength variables, preserves the continuity between informative wavelength points in an NIR spectrum, and is expected to improve the accuracy of the established prediction model.
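The window-expansion step can be illustrated with a minimal sketch. It assumes that a width-3 window adds one neighbor on each side of a selected point and that a width-2 window adds only the left or only the right neighbor; the function name `expand_with_window`, the mode names, and the example indices are hypothetical and are not taken from the authors' MATLAB code.

```python
def expand_with_window(selected, n_points, mode="center"):
    """Expand each selected wavelength index into a small window of neighbors.

    Assumed window layouts (illustrative only):
      mode="center": width 3 -> index - 1, index, index + 1
      mode="left":   width 2 -> index - 1, index
      mode="right":  width 2 -> index, index + 1
    """
    offsets = {"center": (-1, 0, 1), "left": (-1, 0), "right": (0, 1)}[mode]
    expanded = set()
    for idx in selected:
        for off in offsets:
            j = idx + off
            if 0 <= j < n_points:  # clip windows at the spectrum edges
                expanded.add(j)
    # Overlapping windows are merged because a set keeps each index once.
    return sorted(expanded)


# Hypothetical indices standing in for points chosen by a point-selection stage.
points = [0, 5, 6, 20]
print(expand_with_window(points, n_points=21, mode="center"))
# -> [0, 1, 4, 5, 6, 7, 19, 20]
```

Because overlapping windows are merged, the expanded set can contain fewer than three times (or twice) the number of starting points.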

Experiments and Data.
Experiments based on the NIR spectral absorbance data of corn, diesel fuel, and ethylene were employed for verifying the wavelength variable selection performance of the proposed MC-UVE-SPA-MW algorithm and were conducted using the libPLS toolkit [24], while the remaining code was written and executed in the MATLAB R2017b environment.

Corn Spectral Data.
The NIR spectral absorbance data for corn were provided by Eigenvector Research, Inc. (http://www.eigenvector.com/data/Corn/index.html). The m5 spectra of the corn data set consist of 80 corn samples measured over a wavelength range of 1100∼2498 nm in 2 nm intervals. Accordingly, the data set includes a total of 700 wavelength points. It also contains four component reference values of moisture, oil, protein, and starch contents determined by chemical methods for each sample. Table 1 shows the maximum, minimum, and average values of the relative concentrations of moisture, oil, protein, and starch in the 80 corn samples.

Diesel Fuel Spectral Data.
The NIR spectral absorbance data for diesel fuel were provided by the Southwest Research Institute (SWRI) (http://www.eigenvector.com/data/SWRI/index.html). The data set comprises unprocessed spectra derived from 784 diesel fuel samples measured over a wavelength range of 750∼1550 nm in 2 nm intervals. Accordingly, the data set includes a total of 401 wavelength points. The data set also contains various properties including the boiling point, cetane number, density, freezing point, total aromatic hydrocarbon content, and viscosity. Some samples have missing property values (NaN), and these samples are eliminated during the experiment. Table 2 shows the maximum, minimum, and average values of the boiling point of diesel fuel.
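As an illustration of the missing-value step, the sketch below drops every sample whose reference value is NaN before modeling. The helper name `drop_missing` and the toy numbers are hypothetical; the original experiments were run in MATLAB, and this only mirrors the idea.

```python
import math


def drop_missing(spectra, reference):
    """Keep only the samples whose reference value is not NaN."""
    kept = [(x, y) for x, y in zip(spectra, reference) if not math.isnan(y)]
    spectra_kept = [x for x, _ in kept]
    reference_kept = [y for _, y in kept]
    return spectra_kept, reference_kept


# Toy data: three "spectra" with two channels each, one missing boiling point.
spectra = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
boiling_point = [250.0, float("nan"), 262.5]

X, y = drop_missing(spectra, boiling_point)
print(len(X), y)
```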

Ethylene Gas Spectral Data.
Ethylene gas samples were prepared within a closed cell filled with nitrogen gas at a pressure of 1 atm and a temperature of 296 K by distributing C₂H₄ gas into the cell to form samples with 72 known C₂H₄ concentrations ranging from 60.15 ppm to 200.5 ppm in 2.005 ppm intervals. The C₂H₄ gas distribution device adopted a gas distribution platform, shown in Figure 1, independently developed by the Hefei Material Science Research Institute of the Chinese Academy of Sciences.
Through visual control software, the gas distribution proportion was set as required, the volume ratio of the auxiliary gas (nitrogen) to the gas to be distributed was adjusted by the high-precision gas distribution platform, and standard gas of the required concentration was prepared. Fourier transform infrared (FTIR) spectroscopy was applied to capture the spectral absorbance intensity of the gas in a sealed sample cell. The optical path length of the cell was 10 m, and the range of the measured wavenumbers was 400∼5000 cm⁻¹ with a resolution of 1 cm⁻¹. The apodization function used a Hamming window, the number of scans was 16, and a total of 96 spectra at different concentrations were collected.
Accordingly, the data set includes a total of 4601 wavelength points. The absorption spectrum of C₂H₄ gas obtained from the HITRAN database (http://hitran.iao.ru/) over a wavenumber range of 400∼5000 cm⁻¹ is shown in Figure 2. Figure 3 presents the background spectral intensity measured after the closed cell was filled with nitrogen gas at room temperature. Figure 4 presents the measured absorption spectral intensity of the cell after adding various concentrations of C₂H₄ gas. A comparison of Figures 3 and 4 indicates that the spectral intensities in the two regions of 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹ are drastically different due to the spectral absorption characteristics of the added C₂H₄ gas.

Evaluation Indices.
The NIR spectral absorbance data are first preprocessed to generate normalized data for facilitating consistent analyses. The normalized data are then divided by the Kennard-Stone method (3 : 1) into a calibration data set and a prediction data set, which are respectively applied for establishing the various regression models and for testing the established models. The extent of information provided by the selected wavelength variables is generally difficult to evaluate directly. Therefore, indirect evaluation methods are usually adopted. Typically, the information value of wavelength variables is evaluated according to the prediction accuracy of the model constructed with the selected wavelengths. The indices for evaluating the prediction accuracy of regression models are the root mean square error of cross validation (RMSECV) for the calibration set and the root mean square error of prediction (RMSEP), the correlation coefficient (r), and the relative percent deviation (RPD) for the prediction set. These indices are defined in equation (1), where $n$ is the number of samples in the calibration set or the prediction set, $y_k$ is the measured value and $\hat{y}_k$ is the predicted value of sample $k$ in the calibration set, $y_i$ is the measured value and $\hat{y}_i$ is the predicted value of sample $i$ in the prediction set, and $y_{\mathrm{AVE}}$ and $\hat{y}_{\mathrm{AVE}}$ are the respective average measured value and average predicted value of all samples in the prediction set.
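For reference, the commonly used textbook forms of these four indices are sketched below with the symbols defined above. This is a reconstruction under stated assumptions rather than the authors' equation (1): in particular, the use of $n-1$ in the standard deviation that enters the RPD is an assumption.

```latex
\begin{aligned}
\mathrm{RMSECV} &= \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k-\hat{y}_k\right)^{2}}, &
\mathrm{RMSEP} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}},\\[4pt]
r &= \frac{\sum_{i=1}^{n}\left(y_i-y_{\mathrm{AVE}}\right)\left(\hat{y}_i-\hat{y}_{\mathrm{AVE}}\right)}
          {\sqrt{\sum_{i=1}^{n}\left(y_i-y_{\mathrm{AVE}}\right)^{2}\,\sum_{i=1}^{n}\left(\hat{y}_i-\hat{y}_{\mathrm{AVE}}\right)^{2}}}, &
\mathrm{RPD} &= \frac{1}{\mathrm{RMSEP}}\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(y_i-y_{\mathrm{AVE}}\right)^{2}}.
\end{aligned}
```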
We note that the evaluated prediction performance improves with decreasing RMSE and increasing r and RPD. The RMSE is denoted as the RMSECV when referring to the value associated with the calibration data set and as the RMSEP when referring to the value associated with the prediction data set.
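A minimal sketch of how such indices might be computed for one set of measured and predicted values is given below. It assumes the standard definitions (RMSE over n samples, Pearson correlation, and RPD as the sample standard deviation of the measured values divided by the RMSE); the function names and the four toy values are hypothetical.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    mean_y = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(measured, predicted))
    var_y = sum((y - mean_y) ** 2 for y in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_y * var_p)


def rpd(measured, predicted):
    """Sample standard deviation of the measured values divided by the RMSE.

    The n - 1 denominator is an assumption; some authors use n instead.
    """
    n = len(measured)
    mean_y = sum(measured) / n
    sd = math.sqrt(sum((y - mean_y) ** 2 for y in measured) / (n - 1))
    return sd / rmse(measured, predicted)


measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(rmse(measured, predicted), 4))       # 0.1581
print(round(pearson_r(measured, predicted), 4))  # 0.9908
print(round(rpd(measured, predicted), 3))        # 8.165
```

Applied to cross-validation predictions on the calibration set, `rmse` would give the RMSECV; applied to the prediction set, it would give the RMSEP.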

MC-UVE-SPA Method.
The fundamental basis of UVE is to use the stability of the regression coefficient vector of a constructed PLS multiple regression model as a measure of the significance of a given wavelength. However, the UVE tends to suffer from model overfitting [25]. This was addressed by the development of Monte Carlo (MC) UVE (MC-UVE), proposed by Cai et al. [26], which replaces the leave-one-out cross-validation (LOOCV) process for calculating the regression coefficient matrix $\beta = [\beta_1, \beta_2, \ldots, \beta_p]$ in conventional UVE with the MC cross-validation (MCCV) process. The reliability of each variable $j$ can be quantitatively measured by its stability, $s_j = \mathrm{mean}(\beta_j)/\mathrm{std}(\beta_j)$, where $\mathrm{mean}(\beta_j)$ and $\mathrm{std}(\beta_j)$ are the mean and standard deviation of the regression coefficients of variable $j$. The greater the absolute value of the stability, the more important the corresponding variable. Variables whose absolute stability falls below a threshold are treated as uninformative and eliminated.
The SPA, first proposed by Bregman [27], is a forward-cycling variable selection method. For spectral data analysis, each cycle of the process calculates the projection of a selected wavelength on an unselected wavelength and includes the unselected wavelength with the largest projection vector in the set of selected wavelengths [28]. This process is repeated for each selected wavelength as it is added to the set until the selected wavelength set includes a specified number of wavelengths [16]. More detailed information on the steps of SPA can be found in the literature [16,29]. Each newly selected wavelength therefore has the lowest correlation with the previously selected one. As a result, SPA can effectively eliminate collinear wavelength variables and reduce the number of dimensions of the sample spectrum, which accordingly reduces the calculation burden of the model. [30]. Although the effect of UVE-SPA is better than that of using UVE or SPA alone, there is still room for improvement.
In this paper, the UVE-SPA is improved by exploiting the continuity of informative wavelengths, and the effectiveness of the improvement is verified by experiments.
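The stability criterion of MC-UVE can be illustrated as follows. The sketch assumes that the PLS regression coefficients from each Monte Carlo run are already available as rows of a matrix (the PLS fitting itself is not shown), uses the sample standard deviation, and ranks variables by absolute stability. The function names and the toy coefficients are hypothetical.

```python
import statistics


def stability(coefficients):
    """Stability of each variable: mean / standard deviation across MC runs.

    coefficients[m][j] is assumed to hold the regression coefficient of
    variable j obtained in Monte Carlo run m.
    """
    n_vars = len(coefficients[0])
    values = []
    for j in range(n_vars):
        column = [run[j] for run in coefficients]
        values.append(statistics.mean(column) / statistics.stdev(column))
    return values


def keep_most_stable(stab, n_keep):
    """Indices of the n_keep variables with the largest absolute stability."""
    ranked = sorted(range(len(stab)), key=lambda j: abs(stab[j]), reverse=True)
    return sorted(ranked[:n_keep])


# Toy coefficients from three hypothetical Monte Carlo runs, three variables.
coefs = [
    [0.50, 0.02, -0.40],
    [0.55, -0.03, -0.35],
    [0.45, 0.01, -0.45],
]
s = stability(coefs)
print(keep_most_stable(s, n_keep=2))  # -> [0, 2]
```

The second variable changes sign between runs, so its mean coefficient is near zero and its stability is low; it is the one discarded here.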
The proposed wavelength selection algorithm first applies MC-UVE to the calibration data set to construct a PLS regression model. The threshold of the MC-UVE process is set to provide the number of wavelength variables that minimizes the RMSECV of the constructed PLS regression model. The largest number of principal components (PCs) was set to 10, and the optimal number of PCs was determined based on the minimum RMSECV value. Subsequently, the wavelength variables retained by the MC-UVE algorithm are applied as the input of the SPA. Here, an MLR model is constructed based on the wavelength variables selected by the SPA for conducting cross-validation analysis, where the number of selected wavelength variables is determined according to the minimum of the RMSECV of the constructed MLR model. In order to reduce the number of isolated wavelength variables and maintain the continuity of adjacent information wavelength points of near-infrared

Corn Spectral Data Experiments.
The wavelength variable stability distribution map of the PLS regression model reflecting the oil concentration in corn constructed for the calibration set using the MC-UVE algorithm is presented in Figure 6. Here, all wavelengths greater than the threshold value shown by the horizontal red line in the figure are selected for use in the model. This threshold was selected to provide the number of wavelength variables corresponding to the minimum RMSECV of the constructed PLS regression model. This is illustrated in Figure 7, where the RMSECV of the constructed PLS regression model is plotted with respect to the number of selected wavelength variables. It can be seen from Figure 7 that the RMSECV is relatively large when the number of wavelength variables is small, and the RMSECV drops sharply as the number of selected variables increases.
This is because an overly small number of wavelength variables excludes useful information, and the prediction accuracy of the model is therefore improved as an increasing amount of useful information is incorporated into the model. A minimum value of RMSECV = 0.0289 is obtained when the number of selected wavelength variables is 106, and the RMSECV increases again when the number of variables exceeds 106. This increase results from the impact of selecting an increasing number of uninformative variables on the prediction accuracy of the model. We also note that the RMSECV changes very little when the number of wavelength variables exceeds 300. Thus, the MC-UVE algorithm eliminates a large number of wavelengths that are not related to the oil concentration of corn, where the final number of selected wavelength variables is just 15.1% of the full-spectrum value of 700. The optimal set of 106 wavelength variables selected by MC-UVE is then used as the input of the SPA, which iteratively generates wavelength variable combinations using each wavelength as a starting point and applies them for constructing an MLR model. The wavelength combination corresponding to the minimum RMSECV of the MLR model is then taken as the optimal wavelength combination. The relationship between the number of selected wavelength variables and the RMSECV of the MLR model constructed from variables selected by the MC-UVE-SPA is shown in Figure 8, where we note that the minimum RMSECV is obtained when the number of selected variables is 37. Thus, the SPA further reduces the number of informative wavelengths mainly by eliminating collinear variables in the MLR model, where the final number of selected wavelength variables is reduced to just 5.3% of the full-spectrum value of 700.
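The projection step of the SPA for a single starting wavelength can be sketched as follows. This is a generic illustration of successive projections on column vectors, not the authors' implementation: it omits the outer loop over all starting wavelengths and the MLR cross-validation used to choose the final subset, and the function names and toy columns are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def spa_chain(columns, start, n_select, tol=1e-12):
    """Select columns by successive projections, starting from column `start`.

    columns[j] is the spectrum column (all samples) of wavelength j.
    At each step, every unselected column is projected onto the subspace
    orthogonal to the most recently selected column, and the column with the
    largest remaining norm is selected next.
    """
    work = [list(col) for col in columns]  # working copies, deflated step by step
    selected = [start]
    while len(selected) < n_select:
        ref = work[selected[-1]]
        ref_norm2 = dot(ref, ref)
        best, best_norm2 = None, -1.0
        for j, col in enumerate(work):
            if j in selected:
                continue
            scale = dot(col, ref) / ref_norm2
            work[j] = [c - scale * r for c, r in zip(col, ref)]
            norm2 = dot(work[j], work[j])
            if norm2 > best_norm2:
                best, best_norm2 = j, norm2
        if best is None or best_norm2 <= tol:
            break  # nothing left that is linearly independent of the selection
        selected.append(best)
    return selected


# Toy data: column 1 is nearly collinear with column 0 and is chosen last.
cols = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
]
print(spa_chain(cols, start=0, n_select=3))  # -> [0, 2, 3]
```

In a full workflow, one such chain would be generated per starting wavelength, and the chain and length giving the lowest RMSECV of the MLR model would be kept.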
In the original spectrum, the optimal wavelength point selected by the MC-UVE-SPA is used as the starting point or center of a moving window of width w = 2 (Left), 2 (Right), or 3. The results of the PLS or MLR model constructed using the wavelength variables selected by different algorithms are shown in Figure 9, and the details are listed in Table 3 along with the results obtained for different models. In Table 3, the optimal number of PLS principal components was 10. As shown in Table 3, there were 37 characteristic wavelengths selected by the MC-UVE-SPA, accounting for only 5.3% of the total number of wavelengths, and the accuracy of the algorithm is better than that of the MC-UVE algorithm, which is due to the elimination of wavelength collinearity. Figure 10. The results in Figure 10 are derived from the fact that oil is a complex organic molecule with infrared and NIR spectral absorption that occupies a wide wavenumber band ranging over 3900∼12000 cm⁻¹ (833∼2564 nm).
This is mainly caused by the frequency doubling and frequency combinations of the stretching and vibrational energy level transitions of hydrogen-containing groups. From the results of Figure 10, we note that the wavelength variables selected by the MC-UVE, MC-UVE-SPA, and proposed MC-UVE-SPA-MW algorithms are mainly distributed between 1662∼1790, 2222∼2268, 2288∼2316, 2390∼2428, and 2476∼2498 nm, which is exactly the range of the spectral absorption peaks generated by the first and second frequency doubling of the -C-H stretching vibrations of the -CH2, -CH3, and -CH-CH- functional groups of oil [31].
We note from Figure 10 that the moving window employed by the MC-UVE-SPA-MW algorithm expands the wavelength variables selected by the MC-UVE-SPA, resulting in a greater number of wavelength variables than that obtained by the MC-UVE-SPA, and the improved continuity of the wavelength variables selected by the MC-UVE-SPA-MW algorithm is very apparent in Figure 10 compared with the wavelength variables selected by the MC-UVE-SPA. We can also note from Table 3 that the full-spectrum model was relatively complicated, and its prediction accuracy was the worst of all models considered due to the impact of the large number of uninformative wavelength variables included within the model. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained. We also note from the table that, of the five wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelengths and the MC-UVE-SPA-MW (w = 2L) algorithm provided a model with the greatest prediction accuracy.
Journal of Spectroscopy

Diesel Spectral Data Experiments.
The numbers of wavelength variables selected from the NIR spectral data of diesel fuel reflecting the boiling point by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 262, 30, 83, 58, and 59, as shown in Table 4. These respectively represent 65.3%, 7.5%, 20.7%, 14.5%, and 14.7% of the 401 wavelength variables included in the full spectrum. The prediction results of the PLS or MLR models constructed from the selected wavelength variables are shown in Figure 11, and the details are listed in Table 4 along with the results obtained for a full-spectrum PLS model. We note from Table 4 and Figure 11 that the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 2L, 2R, 3) algorithms are greatly simplified compared with the full-spectrum model. MC-UVE retains 262 wavelength points, and its prediction accuracy is the worst of all the models considered, which may be due to wavelength collinearity. When the SPA is used to further screen the wavelength points selected by MC-UVE, only 30 wavelength points are retained, while the prediction accuracy of the model is greatly improved: the RMSEP is reduced to 8.8676, the r value is increased to 0.9341, and the RPD value is increased to 2.4650. We note from Figure 12 that the moving window employed by the MC-UVE-SPA-MW expands the wavelength variables selected by the MC-UVE-SPA and improves the continuity of the selected wavelength variables. When the window width is w = 2 (Left), 2 (Right), or 3, the accuracy of all three models obtained by the MC-UVE-SPA-MW is improved. When w = 3, the MC-UVE-SPA-MW expands the 30 wavelength variables selected by the MC-UVE-SPA to 83. At this point, the RMSEP is reduced to 5.9694, the r value is increased to 0.9752, the RPD value is increased to 3.9994, and the model is optimal.
We can also note from Table 4 and Figure 11 that of the five wavelength selection algorithms, the MC-UVE-SPA selected the least number of wavelengths and the MC-UVE-SPA-MW (w � 3) algorithm provided a model with the greatest prediction accuracy.
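The retained fractions quoted for the diesel data follow directly from the counts and the 401-point full spectrum. The short check below reproduces that arithmetic; the dictionary labels are shorthand chosen here for illustration.

```python
FULL_SPECTRUM = 401  # wavelength points in the diesel fuel spectra

# Numbers of retained wavelength variables reported for each algorithm.
retained = {
    "MC-UVE": 262,
    "MC-UVE-SPA": 30,
    "MC-UVE-SPA-MW (w=3)": 83,
    "MC-UVE-SPA-MW (w=2L)": 58,
    "MC-UVE-SPA-MW (w=2R)": 59,
}

# Share of the full spectrum, in percent, rounded to one decimal place.
share = {name: round(100 * n / FULL_SPECTRUM, 1) for name, n in retained.items()}
for name, pct in share.items():
    print(f"{name}: {pct}%")
```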

Ethylene Gas Spectral Data Experiments.
The numbers of wavelength variables selected from the spectral data reflecting the C₂H₄ concentration by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW (w = 3, 2L, 2R) algorithms were, respectively, 214, 17, 48, 34, and 34, as shown in Figure 13. These respectively represent 4.7%, 0.37%, 1.0%, 0.74%, and 0.74% of the 4601 wavelength variables included in the full spectrum. It can be determined from Figure 13 that more than half of the selected wavelength variables fall within the strong absorption regions in the wavenumber ranges 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹. These results can be explained according to the description given on the HITRAN web page, which states that the absorption spectral band of C₂H₄ gas is in the range of 614∼3242 cm⁻¹ and that the two isotopologues H₂¹²C¹²CH₂ and H₂¹²C¹³CH₂ of C₂H₄ present strong absorption bands in the wavenumber ranges of 794∼1105 cm⁻¹ and 2917∼3242 cm⁻¹, respectively. From Figure 4, it can be seen that in some regions outside the C₂H₄ absorption bands, the spectral intensity has a significant linear relationship with the C₂H₄ content, which may be due to interference from the background spectrum that varies with the C₂H₄ concentration. Consequently, wavelength points are also selected in some regions outside the C₂H₄ absorption bands. The details regarding the prediction results of the PLS or MLR models constructed from the selected wavelength variables are listed in Table 5 along with the results obtained for a full-spectrum PLS model. We again note from the table that the full-spectrum model is more complicated, and its prediction accuracy was the worst of all models considered. In comparison, the models established with spectral data selected by the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms are all greatly simplified, and better model prediction accuracies are uniformly obtained.
Of the five wavelength selection algorithms, we again note that the MC-UVE-SPA selected the least number of wavelengths and the MC-UVE-SPA-MW (w � 3) algorithm provided a model with the greatest prediction accuracy.

Conclusions
The present study addressed the sparsity of wavelength variables selected by the cascaded MC-UVE-SPA through the application of a moving window, which improved the continuity of the selected wavelength variables and thereby better exploited the advantages of the MC-UVE algorithm and the SPA to obtain regression models with high prediction accuracy. The advantages of the proposed MC-UVE-SPA-MW were demonstrated by applying the MC-UVE, MC-UVE-SPA, and MC-UVE-SPA-MW algorithms to the selection of wavelength variables from the NIR spectral absorbance data of corn, diesel fuel, and ethylene, and PLS and MLR models reflecting the oil content of corn, the boiling point of diesel fuel, and the ethylene concentration were thereby established and tested.
The experimental results demonstrated that the progressive elimination of uncorrelated and collinear variables generated increasingly simplified partial-spectrum models with greater prediction accuracy than the full-spectrum model. Among the three wavelength selection algorithms, the MC-UVE-SPA selected the fewest wavelength variables and the proposed MC-UVE-SPA-MW algorithm provided models with the greatest prediction accuracy.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.