Improving the Classification Accuracy for Near-Infrared Spectroscopy of Chinese Salvia miltiorrhiza Using Local Variable Selection

In order to improve the classification accuracy of Chinese Salvia miltiorrhiza using near-infrared spectroscopy, a novel local variable selection strategy is thus proposed. Combining the strengths of the local algorithm and interval partial least squares, the spectra data have firstly been divided into several pairs of classes in sample direction and equidistant subintervals in variable direction. Then, a local classification model has been built, and the most proper spectral region has been selected based on the new evaluation criterion considering both classification error rate and best predictive ability under the leave-one-out cross validation scheme for each pair of classes. Finally, each observation can be assigned to belong to the class according to the statistical analysis of classification results of the local classification model built on selected variables. The performance of the proposed method was demonstrated through near-infrared spectra of cultivated or wild Salvia miltiorrhiza, which are collected from 8 geographical origins in 5 provinces of China. For comparison, soft independent modelling of class analogy and partial least squares discriminant analysis methods are, respectively, employed as the classification model. Experimental results showed that classification performance of the classification model with local variable selection was obvious better than that without variable selection.


Introduction
Salvia miltiorrhiza (S. miltiorrhiza), named "Danshen" in China, has been considered as an important component of traditional Chinese medicines (TCMs) [1]. e most important and frequent clinical application of S. miltiorrhiza is mainly employed for treatment of various cardiovascular diseases, including coronary artery disease, hyperlipidemia, hypertension, arrhythmias, and stroke by either alone or in combination with other herbal ingredients [2]. As is known, the e cacy and quality of S. miltiorrhiza can vary considerably according to the di erent origins with different climate, soil, and planting conditions [3,4]. So, the discrimination of the geographical origins and planting conditions is a crucial issue. Unfortunately, in order to make storage and distribution easier, S. miltiorrhiza is typically preserved in the powder form. It is di cult to identify and discriminate the S. miltiorrhiza powder without expert training, which is becoming an obstacle to the modernization and internationalization of S. miltiorrhiza [5]. With the growing demand for S. miltiorrhiza, the discrimination of the cultivation area with easy, reliable, and rapid analytical approaches has been paid more and more attention.
Due to the complexity of herbal medicines, spectra su er from a high grade of band overlapping. For this problem, variable selection was used in quantitative analyses [21][22][23]. Nevertheless, most of the variable selection methods applied in qualitative analyses are just a simple choice of waveband according to the spectral variation or absorption feature [4,17,24]. In this way, the selected variables are not proper for every observation, and the improvement of classi cation accuracy is not obvious.
In this paper, a novel local variable selection (L-VS) strategy was proposed for discrimination of Chinese S. miltiorrhiza according to its geographical origins and planting conditions. With this method, the multiclassication was transformed into several two-classi cation problems. Firstly, the most proper variables for each pair of classes were selected based on a new evaluation criterion considering both classi cation error rate and best predictive ability under the leave-one-out cross validation scheme. en, applying the new observation into each local classi cation model which was built on selected variables, a series of class labels have been supplied. Finally, the classi cation decision of the new observation is identi ed by the statistical analysis of class labels. As comparison, the SIMCA and PLS-DA models combined with L-VS and SIMCA and PLS-DA based on the full spectrum were also established, respectively.

Sample Preparation.
In this work, a total of 94 S. miltiorrhiza samples grown in wild or cultivated conditions were collected from 8 geographical origins in ve provinces of China (Shandong, Henan, Sichuan, Hebei, and Yunnan) and used to perform the analysis. e origins, planting conditions, and numbers of all observations are summarized in Table 1. All samples were identi ed by experts from the China Institute of Traditional Chinese Medicine. Each sample of S. miltiorrhiza was dried and grounded into powder (200 meshes) before NIR measurement. It is worth pointing out that raw spectral data have been divided into 13 classes in the vertical direction (sample's direction) and 57 intervals in the horizontal direction (wavelength's direction).
2.2. NIR Spectra. NIR spectra of sample powders were acquired by a di use re ectance mode with a Spectrum One NTS Fourier-transform near-infrared spectrometer (PerkinElmer, USA). Spectra were recorded from 10000 cm −1 to 4000 cm −1 at 2 cm −1 resolution by coadding 32 scans using an integrating sphere. To reduce noise, each sample was measured 5 times, and the mean spectrum for each sample was used in the next analysis. Meanwhile, the temperature was kept around 25°C and at a steady humidity level in the laboratory.

Spectral
Processing. All the computations were carried out in MATLAB software, version R2012b. e partial least squares (PLS) and principal component analysis (PCA) model calculations were performed with MATLAB functions that came from Stats' toolbox. And homemade routine was also employed for algorithms of SIMCA and PLS-DA. e raw spectra contain interference, such as high-frequency noise, baseline drift, background, and so on. us, before the classi cation stage, some preprocesses must be taken to weaken interference in spectra.
e Savitzky-Golay (SG) method and the 2nd derivative spectra were adopted in this work. In the Savitzky-Golay smoothing process, polynomials of orders 1 and 21 on both sides were performed. And the wavelength interval used to calculate the 2nd derivative spectra was xed at 40 cm −1 . e raw NIR spectra and the 2nd derivation preprocessed spectra of the 94 S. miltiorrhiza observations are shown in Figure 1. Figure 1(a) displays the raw NIR spectra of the S. miltiorrhiza observations in the region between 10000 and 4000 cm −1 . From Figure 1(a), it can be appreciated that spectra of the S. miltiorrhiza observations show a very high similarity, and signi cant shifts were observed. However, by implementing the 2nd derivative, baseline drifts are eliminated and variations in  Journal of Analytical Methods in Chemistry the raw spectra are enhanced. e 2nd derivative spectra which are calculated after the SG smoothing are shown in Figure 1(b). As shown, it is still di cult to distinguish the spectral di erences based on cultivation area and planting conditions, so the qualitative classi cation method is necessary.

Local Variable Selection Strategy.
Classi cation method based on full-spectrum data range ignores directions in the variable space which are spanned by irrelevant, noisy variables. From a classi cation perspective, the large number of irrelevant variables may yield incorrect classi cation results. In order to minimize the in uence of such irrelevant variables, variable selection is usually necessary.
In this paper, a novel L-VS strategy is proposed. According to the fundamental concept of local strategy, the spectral data have been divided into several pairs of classes, reducing the complexity by transforming the multiclassi cation into integration of several two-classi cation problems. Here, let N 1 be the total number of classes and the calculation of the number of the class pair N 2 be the mathematic combination problem, which is given by e main idea of the proposed L-VS is that, for each pair of classes, a speci c set of variables has been selected from the full-spectrum data on the basis of performance of the local classi cation model between these two classes.
For the purpose of decreasing computation, the L-VS strategy also shares conceptual similarities with interval PLS (iPLS) [25], which develops local classi cation models on equidistant subintervals of the full-spectrum region, thereby focusing on the spectral region where information is optimally preserved. e critical issue that is important for the successful implementation of the L-VS is the size of the interval. Obviously, improper size of the interval destroys the model. Too many variables lose the e ects of smaller peaks. On the contrary, too few variables may eliminate some useful redundancy and amplify the in uence of each variable in the nal model. In either case, the false classication will be caused. ese two respects may be somewhat contradictory, so a reasonable compromise solution is applied in this paper, where the size of the subinterval is a relative big number and the resolution has a relative small number. For example, the spectral data of S. miltiorrhiza which has 3000 variables were divided into 57 intervals with a variable resolution of 50, and each interval will be made by 200 variables. is means that the rst subinterval ranges between the 1st and 200th variables, the second subinterval between the 51st and 250th variables, and so on.
In addition, a novel evaluation criterion which incorporates both classi cation error rate and predictive ability of the classi cation model has been employed in the process of variable selection. And the optimized spectral region has been selected along the direction of the minimum incorrect classi cation rate and best predictive ability under the leave-one-out cross validation scheme for each pair of classes.
It should be noted that the proposed method is simply a "univariate" approach that does not consider combinations between spectral regions. Concomitant with the combination is a global optimization problem which corresponds to a much longer computation time even in small datasets. erefore, it is proposed to select the optimal spectral region using the "univariate" approach in this paper. Subsequent experimental results have also proved that the use of the "univariate" approach makes L-VS more e cient and fast for large datasets, while obtaining an acceptable classi cation. In addition, di erent classi cation models have different ways to represent the predictive ability, and deciding which to choose is described in the following section.

Classi cation Decision.
For each pair Class (i, j), the local classi cation model based on selected spectra is utilized to assign the new observation to Class i or Class j . Traverse through all class pairs, the classi cation result of each new observation is a series of class labels. To make objective classi cation decisions using the class labels, there is a need to develop some decision rules for this purpose. Here, the classi cation decision is based upon the statistic occurrence frequency, and each observation can be assigned to belong to the class with the highest occurrence frequency of class labels.
2.6. SIMCA. SIMCA is one of the most commonly used class-modelling techniques for multivariate classi cation. In SIMCA classi cation, the residual of PCA modelling performed for each class is utilized to identify whether an unknown observation belongs to that class or not. During the step of local variable selection, the best classi cation performance for SIMCA is obtained when there are both lowest classi cation error rate and largest distance between classes.
where IC is the classi cation error rate, N c is the number of correct classi cations, and N ic is the number of incorrect classi cations.
where D 2 ij is the distance between Class i and Class j , S 2 ij is the residual variance of the PCR model which is developed for tting the Class j and based on Class i , and S 2 i is the residual variance of the PCR model based on Class i with leave-oneout cross validation.

PLS-DA.
In PLS-DA, all observations are assigned a class-speci c numerical value y based on the geographical origins and planting conditions. For the two-class case, the y value should be as close to 1 as possible to belong to a certain class or close to −1 otherwise, and a simple 0 threshold has been employed to determine class belonging. e lowest classi cation error rate and root mean square error of prediction (RMSEP) in cross validation during the leaveone-out procedure are the criteria for guiding the variable selection to the optimum. Here, the classi cation error rate, as mentioned above, is calculated by (2), and RMSEP is given by where y j is the class-speci c numerical value of the jth sample in the training set (union of two classes), y (j) is the predicted value with PLSR in cross validation without the jth sample, and N is the total number of samples in the training set.

Results and Discussion
3.1. Data Exploration Using PCA. PCA was utilized initially to examine the qualitative di erence of all the S. miltiorrhiza samples in the principal component (PC) space. As the explained variance of per principal component decreases with increasing number of PCs, the rst two PCs encompass almost all variance, whereas the higher PCs are dominated by noise. In this paper, therefore, the PCA model was built from the total samples using 2 PCs explaining the variance in the data after applying the preprocessing described in the experimental section (S-G ltering, 2nd derivative). In Figure 2, two obtained score plots are depicted. It can be seen that the rst PC (PC1) explained 68.7% and the second one (PC2) 14.4%. However, the S. miltiorrhiza samples (with di erent cultivation areas or planting conditions) usually have same compositional attributes, which are expressed as highly overlapped, and similar NIR absorption or reectance. Due to this, the rst two PCs, although representing the main part of the data variance, might not be powerful enough to achieve class separation. As shown in Figure 2, the distribution of PC1 and PC2 score plots was partly overlapped between di erent classes, especially Class 6, 7, 8, 11, and 13, which has no or only a slight clustering tendency. is is also proved by the results of the classication model based on entire spectra in Section 3.3. e correct identi ed rates of Class 6, 7, 8, and 11 are obviously lower than those of other classes. As can be seen, it is di cult to di erentiate all the S. miltiorrhiza samples using the entire NIR spectra, particularly when the overlapping of PC1 and PC2 score plots for classes is observed, which motivates the use of the variable selection method.

Variable Selection.
In order to improve the discrimination performance of the classi cation model, selecting the speci c wave bands for each class is necessary. Firstly, the full-spectrum region was divided into 57 intervals with a variable resolution of 50, and each interval will be made by 200 variables. So, the ith interval related to the wavenumber from 10000 − 2 × (50 × i + 150) cm −1 to 10000 − 2 × (50 × i − 49) cm −1 .
en, SIMCA and PLS-DA have been applied as the local classi cation model, respectively, to select the most proper region for each class pair.

Local Variable Selection Based on SIMCA.
Here, the classi cation performance of SIMCA is characterized by the classi cation error rate and distance between classes. e optimal interval for each pair of classes, which provides the maximum distance between classes with the minimum classi cation error rate, has been selected for further study. Figure 3 illustrates this concept using the Class 1 and 2 sample NIR data. In this plot, the classi cation error rate and distance between Class 1 and 2 are calculated by the SIMCA classi cation model with the spectra data in each interval. By using the principle of the minimum classication error rate, the 25th-30th intervals have been selected as the feasible region of the optimal interval. Furthermore, compared with the distances between Class 1 and 2 in the feasible region, the 26th interval with the maximum distance has been determined as the optimal interval for Class 1 and 2. e comparison of Class 1 and 2 in the spectral space and principal component space after L-VS (based on SIMCA) is depicted in Figure 4. As shown in Figure 4(a), the spectral di erence of Class 1 and 2 between 7100 cm −1 and 7498 cm −1 is more apparent as compared to that in the full spectrum. Further principal component analysis shows that the rst two principal components together explained 84.7% of the variation, and clustering according to origins was clearly observed in Figure 4(b). As discussed, the result further indicates that the range from 7598 cm −1 to 7200 cm −1 is the most proper region which determines the cultivation area or planting conditions of the S. miltiorrhiza samples in Class 1 and 2. e 2nd derivative spectra of all the S. miltiorrhiza samples are presented in Figure 5. At the lower end, the curves show a great variety, while at the higher end, the curves vary less. And the distribution of the selected wavenumber for each pair of classes based on the SIMCA classi cation method and the tting curve of the frequency of selection (blue solid line) are also shown in Figure 5. It can be seen that the optimal selected variables basically focus on the region with larger spectral di erences which are mainly concentrated nearby 8200 cm −1 , 7200∼6800 cm −1 , and 6000∼4200 cm −1 . It is worth to note that 1&2, 7 represents two pairs of classes: Class 1 and 2 and Class 1 and 7, and they have the same optimal variables between 7498 cm −1 and 7100 cm −1 .

Local Variable Selection Based on PLS-DA.
e same analysis was performed using PLS-DA. In this case, the PLS-DA was used as the classi cation model for local variable selection. In the L-VS process, the performance of PLS-DA built upon di erent regions was evaluated by both RMSEP and classi cation error rate via leave-one-out cross validation, the results of which are summarized in Figure 6. Based on the minimum classi cation error rate, two spectral regions have been selected: the rst one is between the 25th and 31st intervals (7598∼6600 cm −1 ) and the second one includes the 52nd and 57th intervals (4898∼4038 cm −1 ). Furthermore, the RMSEP of the PLSR model between each pair of classes on di erent regions has been considered, and the 25th interval (7598∼6600 cm −1 ) was regarded as the most desirable region to generate the favorable classi cation performance.
As shown in Figure 7, the di erence of Class 1 and 2 between 7598 cm −1 and 7200 cm −1 is also more apparent as compared to that in full spectrum both in the spectral space and principal component space. is means that the rst two principal components based on the range from 7598 to 7200 cm −1 explained high variance and contained su cient information to achieve complete Class 1 and 2 separation. erefore, we conclude that the narrow range from 7598 to 7200 cm −1 was best suited to discriminate the S. miltiorrhiza samples in Class 1 and 2, when the PLS-DA was applied as the classi cation method. Figure 8 displays the distribution of the selected regions by L-VS based on the PLS-DA classi cation method. It is obvious that the regions with larger spectral di erences are nearly selected as proper regions for class pairs. Additionally, along with the frequency of selection curve, some variables located in the region with lower spectral di erences (such as 8500∼8000 cm −1 ) are also selected by L-VS.
Generally speaking, the selected regions with the L-VS method based on SIMCA and PLS-DA have some similarities. And spectral features contained in these selected regions and corresponding assignments are listed in Table 2. Basically, the main spectral contributions are derived from stretching vibrations of C-H, O-H, and N-H bonds. e region between 4400 and 4200 cm −1 is dominated by the combination bands of C-H. e region 4700∼4600 cm −1 is characteristic for the N-H stretching vibration and combination bands of C�O. e two bands at 5200∼5000 cm −1 and 7200∼7000 cm −1 correspond to the combination and the rst overtone of the O-H stretching vibration. Near 8200 cm −1 , the second overtone of C-H in CH 2 can be found.

Comparative Performance of Various Methods.
As comparison, the correct identi ed rates of all the S. miltiorrhiza observations obtained by SIMCA and PLS-DA   Table 3. Limited by the amount of the data, the classi cation procedure is similar to cross validation, in that one sample is left out from the training set and the remaining observations are used to discriminate the leftout observation.
It is generally believed that PLS-DA typically outperforms SIMCA in the correct classi cation rates, because the PLS-DA is a supervised method, which takes account of information of both spectra and class properties. So, the classi cation rates from PLS-DA based on full spectrum are generally larger than those from SIMCA in Table 3. However, there are still many class with lower correct identi ed rates, such as Class 6, 7, and 8, whose score plots are seriously overlapped. With the traditional variable selection, only the 7400 to 3000 cm −1 range was used since most sharp spectral features appeared to be located in this   Journal of Analytical Methods in Chemistry region. But, in fact, the performance of the classi cation model with these selected variables showed no improvement. Compared with full-spectrum and traditional variable selection, the L-VS has the remarkable improvement on the correct identi ed rates of both SIMCA and PLS-DA. Especially adopting PLS-DA combined with L-VS, all the S. miltiorrhiza observations are perfectly classi ed according to their geographical origins and planting conditions. erefore, we conclude that the proposed L-VS strategy is a powerful tool for improving the classi cation accuracy of Chinese Salvia miltiorrhiza.

Conclusions
roughout this paper, a local variable selection strategy has been proposed to improve the classi cation accuracy for discrimination of Chinese S. miltiorrhiza. And a novel evaluation criterion, considering both classi cation error rate and predictive ability of the classi cation model, has been employed in the process of variable selection. For comparison, SIMCA and PLS-DA have been used as the classi cation method, respectively. With the NIR spectral data of S. miltiorrhiza collected from 8 geographical origins in ve provinces of China, it has been proved that the optimized variables containing su cient clustering information can be selected by L-VS for each pair of classes. Frequency of selection

Conflicts of Interest
e authors declare that they have no con icts of interest.