Rapid Recognition of Geoherbalism and Authenticity of a Chinese Herb by Data Fusion of Near-Infrared Spectroscopy (NIR) and Mid-Infrared (MIR) Spectroscopy Combined with Chemometrics

The Modernization Engineering Technology Research Center of Ethnic Minority Medicine of Hubei Province, College of Pharmacy, South-Central University for Nationalities, Wuhan 430074, China
College of Material and Chemical Engineering, Tongren University, Tongren 554300, Guizhou, China
State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310032, China


Introduction
Herbal medicines offer effective pharmacological functions, low toxicity, and few side effects in the human body, so they have been widely used all over the world [1][2][3]. However, herbal medicines with different geographical origins have different chemical compositions and pharmacological activities [4,5]. In addition, the processing of herbal medicines often removes the morphological properties of the species, and some high-cost herbal medicines are often the subject of fraudulent practices in which they are replaced with low-cost ones [6,7], which may lead to unfair competition in the pharmaceutical market and harm the interests of consumers. Thus, a quality analysis method that can distinguish the origins of herbal medicines is an important concern for consumers [8][9][10]. Traditional methods such as high-performance liquid chromatography and mass spectrometry are time-consuming, expensive, and laborious and have to be performed by highly trained technicians [11,12]. Therefore, a rapid, more accurate, and sensitive identification method is required for herbal medicines.
Most studies have focused on specific pharmacological ingredients in herbal medicines; however, the pharmacological activity of herbs is the result of the interaction of all ingredients rather than of specific ingredients. Therefore, specific ingredients cannot be used as a proper criterion for characterizing the overall quality of the herbs [13,14]. Fourier transform mid-infrared (MIR) [15] and near-infrared (NIR) [16,17] techniques are efficient tools for food and pharmaceutical quality control because they are fast and nondestructive. For example, Zhu et al. used FT-IR and 2DCOS-IR methods to discriminate cultivated Codonopsis lanceolata of different ages [18], and Gayo and Hale applied near-infrared spectroscopy to detect and quantify species authenticity in crabmeat [19]. By studying the characteristic information of the spectra, different types of samples can be accurately distinguished. Nonetheless, the information obtained from NIR spectra may be difficult to interpret directly because the spectra are highly overlapped. Although MIR spectra show some significant differences in spectral peaks, they do not give as much chemical and structural information about the samples as NIR spectra. Therefore, establishing effective and robust chemometric methods has attracted extensive attention [20,21]. For example, Woo's team used Mahalanobis distance and discriminant PLS2 combined with NIR spectroscopy to discriminate herbal medicines according to geographical origin, but only two classes from different geographical origins were considered [22]. Frizon et al. used PLS to determine total phenolic compounds in yerba mate and predicted total phenolics with associated errors of 12% [23]. Liu et al. studied the differentiation of the roots of various ginsengs by FT-IR and two-dimensional correlation IR spectroscopy, and the cluster analysis demonstrated that the three kinds of ginseng could be clearly distinguished from each other, with one exception [24].
PCA is a multivariate statistical technique that reduces the dimensionality of data while minimizing information loss [25]. LDA establishes linear transformations to find the best boundary and achieves maximum separation between classes by constructing discriminant functions [26]. PLSDA, a powerful pattern recognition method, has been successfully applied to classification problems in many scientific fields [27,28]. Furthermore, like other variable selection methods, moving window partial least-squares (MWPLS) [29,30] and its discriminant variant MWPLSDA have been successfully applied to spectral interval selection for calibration problems, with desirable results [31]. Only a subset of the wavelengths is used to develop the calibration model: wavelengths carrying serious heteroscedastic noise, and especially spectral ranges contaminated by external factors, are excluded, and wavelength ranges sensitive only to the chemical composition of the samples are selected to develop a simplified yet stable calibration model.
Sometimes, it is difficult to discriminate the origins of herbal medicines through pattern recognition on NIR or MIR spectra alone [32] combined with chemometrics, and it is necessary to extract information from the data fusion of NIR and MIR spectroscopy [33]. There is abundant information in combined MIR and NIR spectroscopy coupled with chemometrics for the quality control of herbal medicines.
In this study, different pattern recognition algorithms, including principal component analysis (PCA), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLSDA), were applied to raw NIR spectra to discriminate five different geographical origins of Angelica dahurica. Moreover, moving window partial least-squares discriminant analysis (MWPLSDA) and the fused spectral variables were used to evaluate the authenticity and adulteration of Corydalis yanhusuo W. T. Wang. The results show that the PLSDA model performs better than PCA and LDA in identifying the geographical origins of herbal medicines. In addition, the full spectral information fused from NIR and MIR combined with MWPLSDA showed the best ability in determining the authenticity of herbal medicines.
This method provides pattern recognition models that can be applied to geographical origin discrimination and to authenticity and adulteration recognition at the same time and can further be widely used for various herbal medicines.

Collection of Raw Materials.
A total of 50 Angelica dahurica samples from five geographical origins (Hebei, Anhui, Yunnan, Zhejiang, and Sichuan) were purchased from the Derentang pharmacy, with 10 batches from each region. In addition, two kinds of authentic Corydalis yanhusuo W. T. Wang (Zhejiang) were purchased from the Derentang pharmacy and the Kangderuiqi flagship store, while three kinds of adulterants, Corydalis decumbens (Thunb.) Pers., Typhonium flagelliforme (Lodd.) Blume, and Dioscorea opposita (Thunb.), were collected from Anhui, Jiangsu, and Fujian, respectively. Each of these five samples for the identification of adulteration was collected in 10 batches.

Methods of Sample Measurement and Data Preprocessing by NIR and MIR.
All samples used in NIR were crushed with a grinder, sieved into fine powders through a 200-mesh sieve, vacuum-dried at 60 °C for 24 hours, and stored in a desiccator until use. The sample powder was placed directly into the quartz cup, and the air background was subtracted. Spectra were collected by integrating-sphere diffuse reflectance over the region 4000-10000 cm⁻¹ at a resolution of 8 cm⁻¹. Data processing was performed using the average of the five measured spectra for each sample. In total, 250 spectra from different geographic origins (5 origins × 10 batches × 5 measurements) were obtained. Likewise, 250 spectra were used to discriminate the authenticity and adulteration of Corydalis yanhusuo W. T. Wang.

The principle of moving window partial least-squares discriminant analysis (MWPLSDA) is that a suitable window moves along the full spectral interval, as described in our past studies [34,35]. In MWPLSDA, a window of width H is constructed and moved along the entire spectrum to select useful wavelength intervals, and the selected spectral interval is then used to construct the PLSDA model. At position i, the window contains the variables from the ith wavelength to the (i + H − 1)th wavelength. A series of submatrices is obtained by moving the window continuously, and a PLS submodel is constructed from the variables in each window.
Then, according to the principle of the least sum of squared residuals (SSR), the interval of the measurement matrix with the smaller classification error and fewer latent variables is selected for the final MWPLSDA model.
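The window search described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: an ordinary least-squares fit stands in for the PLS submodel in each window, the function name `moving_window_ssr` is hypothetical, and the data are synthetic, with two informative variables planted at columns 20 and 24.

```python
import numpy as np

def moving_window_ssr(X, Y, width):
    """Return the SSR of a linear fit for every window of `width` adjacent variables."""
    n_samples, n_vars = X.shape
    n_windows = n_vars - width + 1
    ssr = np.empty(n_windows)
    for i in range(n_windows):
        # Submatrix holding variables i .. i + width - 1, plus an intercept column.
        A = np.hstack([np.ones((n_samples, 1)), X[:, i:i + width]])
        # Stand-in for the PLS submodel (assumption): ordinary least squares.
        coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
        ssr[i] = np.sum((Y - A @ coef) ** 2)
    return ssr

# Synthetic demo: three classes, 20 samples each, 50 noisy variables.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
Y = np.eye(3)[labels]                 # dummy-coded class matrix
X = rng.normal(size=(60, 50))
X[labels == 0, 20] += 5.0             # variable 20 separates class 0
X[labels == 1, 24] += 5.0             # variable 24 separates class 1

ssr = moving_window_ssr(X, Y, width=5)
best_start = int(np.argmin(ssr))      # window with the least SSR
print(best_start)
```

Only the window starting at column 20 contains both informative variables, so it yields the smallest SSR. In a fuller version the least-squares fit would be replaced by a PLS submodel, and the number of latent variables would also enter the selection.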

Geographical Origin Discrimination of Angelica dahurica by NIR.
In order to analyze the five different samples more effectively, NIR, a classical, rapid, and nondestructive analytical technique, was used for the measurement. The average NIR spectra of each group are overlaid in Figure 1. The peaks located at 8319 cm⁻¹ might be associated with the second overtone of the C-H, O-H, and N-H stretching modes, and those around 6780 cm⁻¹ were caused by the C-H deformation vibration of CH3. The bands at 5164 cm⁻¹ arise from the second overtone of the C=O stretching vibration, and the C-H combination and second overtone can be seen at 4200-4300 cm⁻¹. However, owing to the overlaps and the systematic noise in the NIR spectra, chemometric methods were required to extract useful information for the recognition of Angelica dahurica samples. Herein, three classical chemical pattern recognition methods, principal component analysis (PCA), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLSDA), were used to relate the original NIR spectral variables of the different sample sets to dummy codes.
The 250 sample spectra of five different Angelica dahurica samples were randomly divided into a training set and a prediction set (Table 1). The model was built using the training set, the number of latent variables (LVs) was determined to be 5 by eightfold cross-validation using the prediction set, and the discrimination results were analyzed for comparison.
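As a rough illustration of eightfold cross-validation, the sketch below generates the fold indices only. The helper names `kfold_indices` and `cv_splits` are hypothetical, the set size of 160 spectra is an arbitrary placeholder rather than a value from Table 1, and which set the folds are drawn from is left to the caller.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=8, seed=0):
    """Shuffle the sample indices and cut them into n_folds nearly equal folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return np.array_split(order, n_folds)

def cv_splits(n_samples, n_folds=8, seed=0):
    """Yield (kept, held_out) index arrays, one pair per fold."""
    folds = kfold_indices(n_samples, n_folds, seed)
    for k in range(n_folds):
        held_out = folds[k]
        kept = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield kept, held_out

# Placeholder set size (assumption): 160 spectra.
splits = list(cv_splits(n_samples=160, n_folds=8))
for kept, held_out in splits:
    print(len(kept), len(held_out))
```

For each candidate number of latent variables, a model would be fitted on `kept` and scored on `held_out`, and the number giving the lowest average error across the eight folds would be retained.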
Firstly, principal component analysis (PCA), a common chemical pattern recognition method that is mainly used for classification in the identification of Chinese herbs, is one of the most classic methods for high-dimensional data. It reduces the high-dimensional FT-NIR data by converting the 1557 raw variables into fewer new principal components. PCA uses these few principal components to represent the original features of the samples by decomposing the sample matrices of the training and prediction sets. The PCA score vectors of the training and prediction sets are shown in Figure 1(b). The samples from the five geographic origins in the training and prediction sets could not be clearly distinguished, although their score distributions had the same shape.
This phenomenon could be attributed to the small differences in chemical properties associated with geographical origin. The results demonstrated that the PCA method can effectively reduce the original high-dimensional data to fewer new variables, but the dimensionality reduction also loses some information that is useful for sample differentiation.
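A minimal sketch of the PCA step is given below, assuming mean-centred spectra decomposed by singular value decomposition. Random numbers stand in for the real 1557-variable FT-NIR spectra, and the function names and set sizes are illustrative.

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Fit PCA on the training spectra by SVD of the mean-centred matrix."""
    mean = X_train.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    loadings = Vt[:n_components].T                        # variables x components
    explained = s[:n_components] ** 2 / np.sum(s ** 2)    # variance fractions
    return mean, loadings, explained

def pca_transform(X, mean, loadings):
    """Project spectra onto the principal components found on the training set."""
    return (X - mean) @ loadings

# Placeholder data (assumption): 40 training and 10 prediction spectra.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 1557))
X_pred = rng.normal(size=(10, 1557))

mean, loadings, explained = pca_fit(X_train, n_components=2)
train_scores = pca_transform(X_train, mean, loadings)
pred_scores = pca_transform(X_pred, mean, loadings)
print(train_scores.shape, pred_scores.shape)
```

Plotting the two score columns against each other gives the kind of score plot shown in Figure 1(b). The prediction spectra are projected with the mean and loadings learned from the training set only.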
Unlike PCA, which looks for the vector space that best describes the original data, linear discriminant analysis (LDA) builds linear discriminant functions from the input response variables to find linear transformations and reduce dimensionality. The axes found by LDA maximize the distinction between classes, projecting the feature space (the multidimensional samples in the dataset) onto a smaller k-dimensional subspace while retaining the information that distinguishes the categories. Figure 1(c) shows the score vectors of the first two latent variables of the LDA model for the training and prediction sets. The model clearly distinguished samples from different geographical origins in the training set, whereas those in the prediction set were not clearly distinguished. This result may be due to some special requirements of the LDA model, which needs at least one of the scatter matrices to be nonsingular. In addition, when a so-called outlier class dominates the estimation of the scatter matrix, the LDA model cannot guarantee that the optimal subspace is found [36].
Furthermore, PLSDA can reduce the effects of multicollinearity between variables, and it simultaneously decomposes the measurement matrix and the class matrix to extract factors and arranges them according to the correlation between them. The five geographical sources of Angelica dahurica are identified from the position of the maximum dummy code predicted from the NIR spectral data. In order to optimize the predictive power of the PLSDA model and reduce its complexity, we set the number of latent variables (LVs) to 5 by 8-fold cross-validation. Figure 1(d) shows the plots of the dummy codes of the training and prediction sets for the five groups of samples of different geographic origins. Table 1 shows the dummy code assignments for the training and prediction sets of the original spectra in the PLSDA model. We encode the five sets of samples as f1 (1, 0, 0, 0, 0), f2 (0, 1, 0, 0, 0), f3 (0, 0, 1, 0, 0), f4 (0, 0, 0, 1, 0), and f5 (0, 0, 0, 0, 1), respectively, and assign each sample according to the position of its largest dummy code. As shown in Figure 1(d), all training and prediction samples belonging to all groups of Angelica dahurica were identified accurately by the original NIR spectra combined with PLSDA, with a perfect recognition rate of 100%. This demonstrated that the PLSDA model successfully discriminates Angelica dahurica samples of different geographic origins. It further revealed that NIR spectroscopy combined with the PLSDA method can be used to identify herbal medicines more rapidly, effectively, and reliably than the traditional methods.
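The dummy coding and the assignment by the largest code can be illustrated with a short sketch. The function names and the predicted values below are made up for illustration and are not results from the study.

```python
import numpy as np

def encode_dummy(labels, n_classes):
    """Encode 1-based class numbers (f1 -> 1, ..., f5 -> 5) as dummy vectors."""
    labels = np.asarray(labels)
    codes = np.zeros((len(labels), n_classes))
    codes[np.arange(len(labels)), labels - 1] = 1.0
    return codes

def decode_dummy(predicted):
    """Assign each sample to the class whose predicted dummy code is largest."""
    return np.argmax(predicted, axis=1) + 1

codes = encode_dummy([1, 3, 5], n_classes=5)

# Hypothetical model outputs for three samples (illustrative values only).
predicted = np.array([
    [0.8, 0.1, 0.0, 0.1, 0.0],
    [0.1, 0.2, 0.6, 0.0, 0.1],
    [0.0, 0.3, 0.1, 0.1, 0.5],
])
assigned = decode_dummy(predicted)
print(codes)
print(assigned)
```

The predicted codes need not be exactly 0 or 1. Only the position of the largest element decides the class, which matches the assignment rule described above.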

Authenticity and Adulteration Discrimination of Corydalis yanhusuo W. T. Wang by NIR and Combinatory of NIR.
Herbal medicine processing often removes the morphological properties of the species, which makes it impossible to distinguish one type from another. For this reason, NIR spectra were used to discriminate the authenticity and adulteration of Corydalis yanhusuo W. T. Wang. As shown in Figure 2(a), the peaks around 6826 cm⁻¹ were due to the C-H deformation vibration of CH3. Bands at 5800 and 5600 cm⁻¹ were observed because of the C-H first overtone of the -CH2- groups, and the bands at 5172 cm⁻¹ were the second overtone of the C=O stretching vibration. Furthermore, the C-H combination and second overtone can be seen at 4200-4300 cm⁻¹.
The seriously overlapped raw spectra hardly reflect the differences between samples. Thus, PCA, LDA, and PLSDA models were used to relate the dummy code to the full original and preprocessed spectral variables. The 250 sample spectra of the two kinds of authentic samples and three kinds of adulterants were divided into a training set and a prediction set (Table 2). However, both PCA and the LDA model failed to give correct results in the prediction sets for the five groups by NIR (not shown here). Thus, PLSDA was adopted for the identification of authentic Corydalis yanhusuo W. T. Wang. In our work, all training and prediction samples were correctly identified except for two samples in the training set (34th and 88th) and two samples in the prediction set (35th and 82nd). The 34th sample in the training set, belonging to f2, is incorrectly discriminated as f1, and the 84th sample, belonging to f5, is erroneously classified as f2. Furthermore, the 35th sample in the prediction set, belonging to f2, is incorrectly assigned to f3, and the 82nd sample in the prediction set, belonging to f5, is incorrectly classified as f2. These errors may be attributable to uninformative spectral variables. The total correct rate was 97.94% on the test set for the PLSDA models.
On the other hand, MIR spectroscopy provides more specific and distinct absorption bands than NIR spectroscopy. As shown in Figure 2(b), the band centered at 2931 cm⁻¹ is due to a stretching vibration of aliphatic C-H in terminal CH3 groups. The strong single peak of the C=O stretching vibration of ketone groups is observed at about 1635 cm⁻¹, whereas the band centered at 1250 cm⁻¹ is due to the antisymmetric stretching vibrations of =C-O-C.
In order to better identify the origin of Chinese herbal medicines, we combined the mid-infrared spectrum with the near-infrared spectrum to obtain fusion spectra with more abundant sample information (Figure 3). PLSDA was also applied to relate the dummy code to the full fused spectral variables.
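One common way to build such fusion spectra is low-level fusion, in which the NIR and MIR variables of each sample are concatenated into a single row. The sketch below assumes this scheme, since the text does not state how the blocks were joined or scaled. The MIR variable count of 900 and the function name `fuse_spectra` are hypothetical.

```python
import numpy as np

def fuse_spectra(nir, mir):
    """Concatenate NIR and MIR variables sample by sample (low-level fusion)."""
    if nir.shape[0] != mir.shape[0]:
        raise ValueError("NIR and MIR blocks must contain the same samples")
    return np.hstack([nir, mir])

# Placeholder data (assumption): 250 samples, 1557 NIR and 900 MIR variables.
rng = np.random.default_rng(2)
nir = rng.normal(size=(250, 1557))
mir = rng.normal(size=(250, 900))

fused = fuse_spectra(nir, mir)
print(fused.shape)
```

The rows of the two blocks must refer to the same samples in the same order. If the two instruments produce very different intensity ranges, scaling each block before concatenation may be worth considering, although that step is not described in the text.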
As shown in Figure 3(a), only the 17th sample in the prediction set, belonging to f1, was misclassified as f2 with the fusion spectra (Table 3). This suggests that the fusion spectra of NIR and MIR spectroscopy combined with PLSDA are better suited to the authenticity and adulteration discrimination of Corydalis yanhusuo W. T. Wang. However, this approach still failed to reach 100% predictive accuracy.
In MWPLSDA, an appropriate window of width H is constructed, and the useful wavelength range is selected by moving the window along the whole spectrum. Then, the selected windows are used to construct the PLSDA model. Finally, according to the minimum SSR principle of the MWPLSDA algorithm, the feature differences among the five samples are extracted. As shown in Figure 3

Conclusions
Supervised pattern recognition methods based on the PLSDA and MWPLSDA algorithms, applied to NIR spectra and to the data fusion of NIR and MIR spectra, have been established to study Angelica dahurica and to identify the authenticity of Corydalis yanhusuo W. T. Wang. The results clarified that, unlike PCA and LDA, which performed well only on the training sets, the PLSDA model shows good performance in the identification of Angelica dahurica and Corydalis yanhusuo W. T. Wang and can be employed in the analysis of the geographical origins of Angelica dahurica and the authenticity or adulteration of Corydalis yanhusuo W. T. Wang. Furthermore, the full spectral information of NIR and MIR spectroscopy combined with MWPLSDA performed much better than the single NIR spectra or the PLSDA model and showed the best ability in herbal medicine discrimination. This new recognition method provides a promising and widely applicable approach for the identification of herbal medicines.

Figure 1: The raw NIR spectra of five different origins of Angelica dahurica (a) and the results by PCA (b), LDA (c), and PLSDA (d).

Figure 3: Data fusion of NIR and MIR spectroscopy for authenticity or adulteration discrimination of Corydalis yanhusuo W. T. Wang (a) and the residue line obtained by MWPLSDA for the training sets (b).
2.4. Method of Chemometrics.
The PCA, LDA, PLSDA, and MWPLSDA methods were written and run in MATLAB 2010a (MathWorks, Natick, MA, USA). All of these chemometric methods were applied to the original spectra only. PLSDA is based on the simultaneous decomposition of the response matrix and the class matrix to extract factors. The extracted factors are arranged in order of their correlation, and dummy vectors are encoded to represent the different classes: in the dummy vector fj for the jth class, the jth element is 1 and the other elements are 0. Each column of the response matrix is then associated with the class matrix.
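To make the PLSDA description concrete, the sketch below fits a PLS2 regression of a dummy-coded class matrix on the spectra and assigns each sample to the class with the largest predicted code. It is written in Python rather than the authors' MATLAB and is not their code. It assumes mean-centred data, takes the weight vector of each factor from the leading singular vector of the cross-product matrix, and uses synthetic three-class spectra. All function names are hypothetical.

```python
import numpy as np

def plsda_fit(X, Y, n_components):
    """PLS2 regression of the dummy-coded class matrix Y on the spectra X."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xk, Yk = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with Y.
        u, _, _ = np.linalg.svd(Xk.T @ Yk, full_matrices=False)
        w = u[:, 0]
        t = Xk @ w                      # scores of this factor
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        q = Yk.T @ t / tt               # Y loadings
        Xk = Xk - np.outer(t, p)        # deflate both blocks
        Yk = Yk - np.outer(t, q)
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q).T
    B = W @ np.linalg.inv(P.T @ W) @ Q.T   # regression coefficients
    return B, x_mean, y_mean

def plsda_predict(X, B, x_mean, y_mean):
    """Return 0-based class indices from the largest predicted dummy code."""
    return np.argmax((X - x_mean) @ B + y_mean, axis=1)

# Synthetic spectra (assumption): each class has a band at a different position.
rng = np.random.default_rng(3)
axis = np.arange(50)
templates = np.array([np.exp(-0.5 * ((axis - c) / 3.0) ** 2) for c in (10, 25, 40)])
labels = np.repeat(np.arange(3), 20)
X = templates[labels] + rng.normal(scale=0.01, size=(60, 50))
Y = np.eye(3)[labels]                   # dummy vectors f1, f2, f3

B, x_mean, y_mean = plsda_fit(X, Y, n_components=2)
accuracy = np.mean(plsda_predict(X, B, x_mean, y_mean) == labels)
print(accuracy)
```

Two latent variables suffice here because three well-separated class means span a two-dimensional subspace after centring. Real spectra would usually need the number of latent variables to be chosen by cross-validation.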

Table 1: A detailed list of the training set and the prediction set for the five different kinds of Angelica dahurica samples.