A New Local Modelling Approach Based on Predicted Errors for Near-Infrared Spectral Analysis

Over the last decade, near-infrared spectroscopy, together with the use of chemometrics models, has been widely employed as an analytical tool in several industries. However, most chemical processes or analytes are multivariate and nonlinear in nature. To solve this problem, local errors regression method is presented in order to build an accurate calibration model in this paper, where a calibration subset is selected by a new similarity criterion which takes the full information of spectra, chemical property, and predicted errors. After the selection of calibration subset, the partial least squares regression is applied to build calibration model. The performance of the proposed method is demonstrated through a near-infrared spectroscopy dataset of pharmaceutical tablets. Compared with other local strategies with different similarity criterions, it has been shown that the proposed local errors regression can result in a significant improvement in terms of both prediction ability and calculation speed.


Introduction
Near-infrared (NIR) spectroscopy plays an important role in the analysis of complex samples or chemical process due to its simplicity, rapidity, and nondestructive measurements [1][2][3][4]. As a result, multivariate calibration methods that relate property ( ) and spectra ( ) have been extensively used in the quantitative analysis of NIR spectroscopy. Many authors have stated that the choice of the appropriate calibration method is one of the key factors that influence the performance in prediction of the property of query samples [5].
The multivariate calibration methods, such as multiple linear regression (MLR), principal component regression (PCR) [6], and partial least squares (PLS) [7] regression, have been adopted for the first time [2,8]. However, these methods are based on statistical linear models which are not always met in real-life situations and therefore are not able to efficiently model the nonlinear relationship between and . To solve this problem, different authors have demonstrated that nonlinear algorithms such as Artificial Neural Network (ANN) [9] and Least Squares Support Vector Machines (LS-SVM) [10] can produce better results than traditional linear methods especially used together with large NIR spectral libraries [11].
The approach presented in this paper is local method, which has been widely used in near-infrared spectroscopy analysis because of its advantages in terms of the simplicity of the model constructed and the ability to cope with nonlinearities [12][13][14]. The essential idea of local method is to develop specific calibration subsets spectrally similar to each query sample whose properties are to be predicted and build a calibration model on the selected relevant samples for the query sample.
The key issue in local learning is how the similarity criterion should be constructed. In general, the most commonly used similarity checks in NIR spectroscopy are the Euclidean distance (ED) [15], the Mahalanobis distance (MD) [16], and spectral angle mapper (SAM) [17] distance on the spectral space or principal component space [18]. Moreover, 2 Journal of Analytical Methods in Chemistry the computation of the principal components-Mahalanobis (PC-M) [19] distance has become the standard procedure for NIR distance measurements. However, the samples are usually multivariate and influenced by several compositional attributes, which are expressed as highly overlapped and nonspecific NIR absorption or reflectance [20]. For this reason, the samples that are very close in space are frequently not close (or similar) in terms of space. Therefore, traditional similarity criterion uses only information to select relevant samples, which may result in a waste of information and inaccurate sample selection [21]. In order to overcome such problem, some supervised or semisupervised methods have been developed, which take account of information of both and . Recently, another reliable similarity estimation called supervised locality preserving projection (SLPP) [22] has been successfully used in local approach to enhance the similarity measurement accuracy [21,23].
As a simple linear approximation technique, local method could address the nonlinearity through the locally linear models. Therefore, the selection step of local method is aimed at finding some samples whose and meet linear relationship. Unfortunately, similarity criterion mentioned above can only represent the closeness but not the true linear relationship between samples in spectra and property spaces.
The objective of the work described in this paper is to develop a high-performance local method for modelling NIR spectral data. The local errors regression utilizing SLPP and similar errors strategy in finding the optimized calibration subset for each query sample is described. The main contribution of this paper is to take into account the prediction errors from global method during the selection step of local method. This step ensures that the relationships between and are highly linear in both calibration subset and query sample. After such a selection, PLS without cross-validation is adopted for local modelling as it can accelerate the prediction speed and reduce the computational complexity. With pharmaceutical tablets datasets of NIR spectra, the effectiveness and accuracy of the proposed method are investigated and compared.

Methodology
The local errors regression consisted of two steps: the first step is the selection of relevant samples and the second step is to build a calibration model for each query sample. In this section, the details of the proposed method and other algorithms for comparison are described.

Calibration Subset Selection.
The main goal of this step is to discover which samples in calibration set "resemble" the query samples to be predicted. In the proposed method, the search process of calibration subset is carried out by using similarity of prediction errors which indicates how linear the spectra and property are to the samples for prediction. In this context, the parameter of error ranges need to be determined to ensure that there are sufficient selected samples to obtain reliable models. It should be noted that there might not be enough samples that are linear with the query sample to give an acceptable prediction. In this case, the definition of similarity, considering that information of both and has been adopted to construct calibration subset. Although the proposed similarity criterion should be better than traditional method, it cannot be used directly to select the relevant samples for query samples as query samples only contain the spectra information. In this paper, an SLPP technique has been employed to find the nearest sample, whose properties are approximate and interchangeable with query sample.
Since the spectra and property have different dimensions, it is not appropriate to find the nearest sample by the linear combination of EDs in spectral space and property space. Thus, the SLPP is used in this study, which not only selects most similar sample considering both spectra and property information but also reduces the computational load.
The rationale behind this approach is based on the assumption that the samples that are most close in the property space are very similar in terms of spectra space. Given a set of -dimensional calibration spectral data Here is × matrix containing spectral responses of samples; is × prediction matrix containing spectral data response of samples; is × matrix; and × matrix̂is the predicted properties of prediction set with global PLS regression, being the number of components. Let spectra set ( + )× = { ∪ }. The algorithmic procedure for the calibration subset selection is stated as follows: (1) Constructing a neighborhood graph as follows: K Nearest Neighbors. The th and the th samples are connected by an edge if the th sample is among nearest neighbors of the th sample or the th sample is among nearest neighbors of the th sample. Here the distance between the samples is calculated in the property space.
(2) Computing the Weights. is a sparse symmetric ( + ) × ( + ) matrix with having the weight of the edge joining the th and the th samples, and = 0 if there is no such edge, and = 1 if and only if the th and the th samples are connected by an edge.
(3) Finding the Basis Vectors of the Subspace. This step aims at finding a transformation matrix × to project spectra set to a low-dimensional set

Suppose that there exists a linear transformation =
, where is the basis vector. The basis vector is computed by solving the following minimization problem: Journal of Analytical Methods in Chemistry 3 It is observed that the minimization problem can be calculated by solving the following generalized eigenvalue problem: where is a diagonal matrix whose entries are column sums of and = ∑ ; = − is the Laplacian matrix. Let the column vectors 1 , . . . , be the solutions to (2) which are ordered according to their eigenvalues, 1 < ⋅ ⋅ ⋅ < . Thus, transformation matrix and low-dimensional matrix can be calculated as follows: = [ 1 , . . . , ], = , and = .
Finally, for each query sample, the most similar sample will be selected according to the ED in the low-dimensional space .

Partial Least Squares Regression.
The PLS regression has been extensively employed to obtain a quantitative model for prediction of analytes based on spectral data. In this paper, the PLS is used in the procedure of local selection and the establishment of regression model. In addition, the critical step is the determination of the number of factors for achieving the best prediction. Generally, the optimum PLS factors can be determined by minimizing the prediction error of cross-validation groups. However, the cross-validation is time-consuming and does not consider the information from the query sample. For these reasons it would not be a desirable choice to local method. In this paper, the selection of calibration subset is based on the prediction error with global PLS, which are expressed as highly linear between samples in subset. Consequently, the PLS factor would have little effect on the performance of calibration models. Here the PLS factor fixed at a constant value by minimizing the root mean squared error of prediction (RMSEP) of validation dataset and the experimental verification is presented in Section 4.1.
The performance of the final calibration models was evaluated in terms of the RMSEP, Residual Predictive Deviation (RPD), and correlation coefficient ( 2 ) in prediction set. It should be noted that RPD is the ratio between the standard deviation of the reference data and RMSEP of prediction.

Other Algorithms.
In order to compare the predictive performance of our local errors method with other local approaches, the following similarity criterions were used: ED, angle distance, PC-M distance, and supervised or semisupervised methods. A brief description of these criterions is given as follows.

Euclidean Distance.
The ED between query sample and calibration sample is given by where is the transpose operation, ∈ , and ∈ .

Angle Distance.
In this method, the Cosine is used to evaluate the similarity between query data and in the spectral space, and the angle distance is defined as . (4)

Mahalanobis Distance in the Principal Component.
The PC-M distance is obtained through computing the Mahalanobis distance (MD) between samples in spectral principal component space. In this case, the appropriate number of principal components is determined by minimizing the root mean square of the compositional differences between property and predicted valuêof samples in calibration set. The Mahalanobis distance between query sample and calibration sample is defined as where is the covariance matrix in spectral principal component space of calibration set and PC and PC are spectra of query sample and calibration sample in spectral principal component space. For details, see [20].

Supervised or Semisupervised Methods (1) Euclidean Distance Based on Both Spectra and Property.
In this method, both spectra and property are used to evaluate the similarity between query data and in the calibration dataset, and the similarity value is defined: where is a weight parameter to balance the importance of spectra and property , 0 ≤ ≤ 1, and is the Euclidean distance between and .
(2) Euclidean Distance in the Low-Dimensional Space. Here, the similarity measurement considers both and information and utilizes SLPP technique to select relevant samples. For details, see [21].
In summary, the detailed implementation of local errors regression is described as follows.
Step 1. Compute the predicted properties of prediction setâ nd predicted properties of calibration set̂with global PLS regression.
Step 2. For each query sample, select the most relevant sample from calibration set with SLPP.
Step 3. Predefine parameters such as PLS factors and error ranges by minimizing the RMSEP of validation dataset. Here PLS factors are set to 2, 3, 4, or 5 and error range is specified as 1.5. A much detailed description will be given in Section 4.1.
Step 4. For each query sample , calibration subset is selected based on predicted error similarity criteria, and the where is the property of the th sample in calibration set.
Step 5. Build PLS calibration model on the selected relevant samples and predict each query sample.
It should be noted that the number of nearest neighbors and dimension of transformation matrix had little effect on the selection of the most similar sample for each query sample. Therefore, in Step 2, parameters and are set as 50 and 20, respectively. Such a selection is based on the authors' experience without further discussion.

Materials and Experiment
The dataset was obtained from the web of http://software .eigenvector.com/Data/tablets/index.html. The spectra were recorded by Instrument I of Foss NIR Systems 6500 Spectrometer. It contains 655 transmittance NIR spectra of pharmaceutical tablets, measured between 600 and 1898 nm with a 2 nm sampling step. The initial dataset was split into a calibration set of 460 samples and prediction set of 155 samples and validation set of 40 samples. Table 1 showed the statistical results of pharmaceutical tablets. All sets have similar averages and standard deviations which indicate that all the datasets can be used to represent the main variability of pharmaceutical tablets. A microcomputer (Lenovo) with an Intel Core 2 processor was used for all the calculations. All the algorithms were implemented using MATLAB 2012b. The raw NIR spectra of pharmaceutical tablets were presented in Figure 1.

Parameter Setting.
The determination of the PLS factors and error ranges is an essential step in this study, which decides the accuracy of the model. Here these parameters were determined with the validation set so that the RMSEP was minimized. Therefore, the variation of the RMSEP of the validation set with PLS factors and error ranges is investigated. Figure 2 shows the RMSEP obtained with PLS factors from 2 to 12 and a step of 1. For each query sample, the calibration subset is selected based on error ranges = 1, 1.5, 2, and 2.5, respectively. Different from the above narrative in Section 2.1, for each query sample, the most relevant sample is used instead of the real sample whose reference property is available in validation set. Moreover, when there are no  enough samples with close similarity to the query sample, we assume that the predicted value is equal to the real reference property. The purpose of this assumption is to reflect the accurate relationship between the prediction performance and PLS factors with a fixed range. As shown in Figure 2, the smaller error range is employed, the smaller RMSEP can be achieved. Furthermore, when the PLS factors are 2 to 5, the RMSEPs change gradually with a little fluctuation. Other big PLS factors (i.e., greater than 7) are usually not necessary, and they do not lead to essential improvement of the final result. The probability of insufficient selected samples with the PLS factors and error ranges for the validation set is shown in Figure 3. It can be seen that the probability of insufficient selected samples for each query sample decreases along with error range, especially when the PLS factor is set as 2, 3, 4, and 5. In this study, considering both minimized RMSEP and the adequate size of the calibration subset, the error range is set as 1.5 and the PLS factors is set as 2, 3, 4, and 5, respectively. This means that the PLS factor will be set as 3, 4, and 5 in turn if there is no enough samples close to the query sample with PLS factor = 2.

Size of Calibration Subset.
The final results of the local errors strategy for each query sample in prediction set are summarized in Table 2. It can be seen that each number of selected samples in calibration subset is smaller than 210, which is less than the half size of calibration set. Moreover, 78.7% sizes of calibration subsets are between 13 and 50. That means that the computational complexity of the prediction model can be reduced a lot compared with global method. It is noted that here the minimum size of the subset is set  ED: Euclidean distance; PC-M: Principal components-Mahalanobis distance; + + ED: Euclidean distance considering both spectra and property ; + + SLPP: Euclidean distance in the low-dimensional space obtained with supervised locality preserving projection method; errors + ED: Euclidean distance between predicted errors; RMSEP: root mean squared error of prediction; 2 : correlation coefficient in prediction set; RPD: residual prediction deviation; PC factors: Principal component factors; symbol : a trade-off parameter to balance the importance of spectra and property ; : dimension of transformation matrix; and s: second. to 13 so as to avoid insufficient selected samples for building calibration model. When the number of selected samples is not reaching 13, the calibration subset has been made up of 50 samples which are the spectral similarity to query sample. Figure 4 is the scatter plot that shows a correlation between reference values and NIR prediction values in the prediction set using global PLS method and proposed method. According to the scatter plots, the predicted values associated with local errors regression best match the reference ones. It can thus be concluded that the proposed method maintains good precision compared with global PLS method. Furthermore, the method proposed here was compared with other local methods.

Comparison.
The performance of local errors regression, global PLS, and other local methods with different similarity criterions was listed in Table 3. In each local-PLS model, two parameters have to be optimized: size of calibration subset and PLS factors. In this study, the number of calibration subset samples included was evaluated at 30, 50, 100, 150, 200, and 250 samples and checked so as to find which value for this parameter is better. The optimum number of PLS factors was identified by minimizing the prediction error of crossvalidation groups. Best prediction results from local-PLS approaches with different similarity criterions are listed in the table. On the contrary, the PLS factors used in the proposed method, as discussed in the previous section, was fixed at either 2 or 3 or 4 or 5.
In terms of RMSEP, with the same spectra data the prediction performance of the local methods is better than that of global method, showing lower RMSEP values. Moreover, the RMSEP obtained with the proposed method is the lowest compared to the best values obtained with other local methods. Furthermore, Table 3 also shows that the difference of 2 with different methods is not significant and all the values of 2 are approximated to 1. The RPD is proportional to the RMSEP and with a similar variation tendency. Another important issue is the prediction speed. In this context all the local methods were slower than global method, showing higher time consumption. As shown in Table 3, global method requires around 4.4 s to calculate all the samples in prediction set and local methods are much more time-consuming, from 53.5 s to 233.9 s depending on subset size. The reason is that the local methods will select different calibration subset and perform new calibration models specifically to different query samples. Thus, in the proposed method, instead of cross-validation, the fixed PLS factor strategy is adopted, which resulted in a noticeable improvement of calculation speed, showing a lower time consumption about 9.5 s.

Conclusion
In this paper, a new local method for nonlinear spectra is proposed. The main contributions include a new similarity measurement utilizing ED of predicted errors, a selection method extended from SLPP aiming to search the most relevant sample for each query sample and fixed PLS factors prediction model without cross-validation. Also, a comparison has been carried out in this study, to compare the predictive performance of the proposed method, global PLS, and other local strategies with different similarity criterions such as ED, Cosine, PC-M, + + ED, and + + SLPP by using the pharmaceutical tablets dataset. The prediction results have demonstrated that the proposed method could bring higher calibration performances and lower time consumption.