To investigate the feasibility of rapid identification and quality evaluation of Chinese medicinal centipedes using NIR spectroscopy, the qualitative and quantitative analysis models were explored. A PCA-SVC model was optimized to differentiate five species of the genus
Animals of the genus
Previously, medicinal centipedes were mostly identified by morphological description, but closely related species can share similar characteristics. Damaged or powdered samples are difficult to identify, which makes confusion and misuse hard to avoid. Presently, molecular methods are gradually applied to identify
Near-infrared (NIR) spectroscopy combined with chemometrics is a fast, nondestructive, and environmentally friendly analysis technique that can realize multicomponent analysis. Nowadays, it is widely used in agriculture and medicine [
The nitrogen content of samples was determined with the DK 20 Heating Digester (VELP, Italy) and UDK 149 Automatic Distillation Unit (VELP, Italy). Spectra were collected with an MPA FT-NIR spectrometer (Bruker Optics Co., Ltd., Germany) and analyzed using the OPUS 7.5 spectrum analysis software (Bruker), MATLAB R2014a data analysis software (MathWorks, Inc., USA), and Unscrambler 9.7 data analysis software (CAMO Software AS, Norway).
A total of 64 samples from 28 batches have been collected from field surveys or commercial markets in China since 2015. All samples were assigned to five nominal species according to the characteristics recorded by Siriwut et al. [
Sample information of medicinal centipedes.
Number | Species | Batch no. | Nitrogen content (%) | Origin |
---|---|---|---|---|
1 | | WG 002-1 | 10.09 | Suizhou, Hubei |
2 | | WG 003-1 | 11.36 | Suizhou, Hubei |
3 | | WG 004-1 | 11.82 | Jingmen, Hubei |
4 | | WG 004-2 | 10.60 | Jingmen, Hubei |
5 | | WG 005-1 | 10.77 | Xiangyang, Hubei |
6 | | WG 005-2 | 10.49 | Xiangyang, Hubei |
7 | | WG 006-1 | 9.47 | Yichang, Hubei |
8 | | WG 013-1 | 10.43 | Suizhou, Hubei |
9 | | WG 014-1 | 9.25 | Jinshan, Hubei |
10 | | WG 014-2 | 10.22 | Jinshan, Hubei |
11 | | WG 016-1 | 10.10 | Suizhou, Hubei |
12 | | WG 016-2 | 9.83 | Suizhou, Hubei |
13 | | WG 017-1 | 8.20 | Anlu, Hubei |
14 | | WG 017-2 | 10.15 | Anlu, Hubei |
15 | | WG 018-1 | 11.02 | Yichang, Hubei |
16 | | WG 019-1 | 10.05 | Nanzhang, Hubei |
17 | | WG 019-2 | 10.16 | Nanzhang, Hubei |
18 | | WG 020-1 | 9.06 | Anhui |
19 | | WG 020-2 | 11.74 | Anhui |
20 | | WG 027-1 | 10.55 | Henan |
21 | | WG 027-2 | 11.10 | Henan |
22 | | WG 032-1 | 9.96 | Machang, Hubei |
23 | | WG 032-2 | 10.01 | Machang, Hubei |
24 | | WG 045-1 | 9.68 | Machang, Hubei |
25 | | WG 045-2 | 9.68 | Machang, Hubei |
26 | | WG 012-1 | 11.27 | Yulin, Guangxi |
27 | | WG 012-2 | 10.83 | Yulin, Guangxi |
28 | | WG 012-3 | 10.54 | Yulin, Guangxi |
29 | | WG 012-4 | 11.74 | Yulin, Guangxi |
30 | | WG 012-5 | 11.33 | Yulin, Guangxi |
31 | | WG 012-6 | 9.77 | Yulin, Guangxi |
32 | | WG 012-7 | 11.27 | Yulin, Guangxi |
33 | | WG 012-8 | 11.94 | Yulin, Guangxi |
34 | | WG 021-1 | 9.85 | Guangxi |
35 | | Wg039-1 | 11.92 | Guangxi |
36 | | Wg039-2 | 10.98 | Guangxi |
37 | | Wg040-1 | 10.25 | Mengzi, Yunnan |
38 | | Wg040-2 | 9.81 | Mengzi, Yunnan |
39 | | WG 028-1 | 12.31 | Yunnan |
40 | | WG 038-1 | 12.62 | Yunnan |
41 | | WG 038-2 | 12.17 | Yunnan |
42 | | WG 038-3 | 12.58 | Yunnan |
43 | | WG 038-4 | 12.30 | Yunnan |
44 | | WG 038-5 | 12.36 | Yunnan |
45 | | WG 038-6 | 11.67 | Yunnan |
46 | | WG 038-7 | 12.46 | Yunnan |
47 | | WG 038-8 | 11.93 | Yunnan |
48 | | WG 007-1 | 8.37 | Mojiang, Yunnan |
49 | | WG 007-2 | 8.04 | Mojiang, Yunnan |
50 | | WG 007-3 | 8.16 | Mojiang, Yunnan |
51 | | WG 007-4 | 7.47 | Mojiang, Yunnan |
52 | | WG 007-5 | 8.18 | Mojiang, Yunnan |
53 | | WG 008-1 | 8.92 | Mojiang, Yunnan |
54 | | WG 008-2 | 9.06 | Mojiang, Yunnan |
55 | | WG 041-2 | 8.54 | Bixi, Yunnan |
56 | | WG 041-3 | 8.37 | Bixi, Yunnan |
57 | | WG 022-1 | 10.48 | Suizhou, Hubei |
58 | | WG 022-2 | 10.07 | Suizhou, Hubei |
59 | | WG 022-3 | 10.52 | Suizhou, Hubei |
60 | | WG 022-4 | 10.65 | Suizhou, Hubei |
61 | | WG 022-5 | 11.45 | Suizhou, Hubei |
62 | | WG 015-1 | 10.86 | Chaohu, Anhui |
63 | | WG 015-2 | 11.71 | Chaohu, Anhui |
64 | | WG 015-3 | 10.82 | Chaohu, Anhui |
After the NIR scan, the nitrogen content of 50 mg of powder from each sample was determined by the semimicro quantitative nitrogen determination method, following the guideline of ChP 2015. The samples were digested in the DK 20 Heating Digester with the following program: 200°C for 5 min, 260°C for 5 min, 340°C for 5 min, and 420°C for 40 min, followed by cooling to 200°C. The digested solution was then measured on the UDK 149 Automatic Distillation Unit with the following program: 50 ml H2O and 20 ml 40% NaOH were added to the digested solution, 20 ml 2% H3PO4 was used as the receiving solution, the steam quantity was 50%, and the distillation time was 4 min. The distillate was then titrated with 0.025 mol/L H2SO4 standard solution (Metrological Testing Technology Research Institute of Shanghai; Batch number 150901).
After the samples were crushed and dried at 55°C for 24 h, 2 g of powder from each individual was scanned with the MPA FT-NIR spectrometer using a diffuse-reflection integrating sphere. The spectra were acquired over the range 12000–4000 cm−1 by co-adding 32 scans at a resolution of 8 cm−1. Each sample was scanned three times, and the average of the three spectra was used for analysis. The spectra diagram is shown in Figure
NIR spectra diagram of samples.
Usually, the raw spectrum contains a lot of irrelevant information and noise, which lead to baseline drift and instability. Therefore, spectrum pretreatment is a critical step in spectral analysis. There are many pretreatment methods, each with its own advantages for improving model performance. For instance, vector normalization (VN) can be used to eliminate the influence of optical path changes on the spectrum. The derivative methods, including the first derivative (FD) and second derivative (SD), are commonly employed to eliminate spectral difference from baseline [
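The pretreatments named above can be illustrated with a minimal Python sketch. It assumes one common definition of VN (mean-centre, then scale to unit Euclidean length) and uses plain finite differences for FD and SD; spectroscopy software typically computes derivatives with smoothing, so these functions are simplified stand-ins rather than the authors' implementation. The function names and the toy spectrum are hypothetical.

```python
import math

def vector_normalize(spectrum):
    """Mean-centre a spectrum, then scale it to unit Euclidean length."""
    mean = sum(spectrum) / len(spectrum)
    centered = [v - mean for v in spectrum]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else centered

def first_derivative(spectrum):
    """First difference between adjacent points (no smoothing)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

def second_derivative(spectrum):
    """First difference applied twice."""
    return first_derivative(first_derivative(spectrum))

# Toy spectrum (hypothetical): a quadratic curve on a constant offset of 5.0
toy = [5.0 + 0.5 * i * i for i in range(6)]
fd = first_derivative(toy)    # the constant offset disappears
sd = second_derivative(toy)   # the linear trend disappears as well
vn = vector_normalize(toy)    # zero mean, unit length
```

In this toy case the first difference removes the constant offset and the second difference also removes the linear trend, which is the baseline-suppressing effect the derivative methods are used for.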
PCA is a commonly used method for data compression. It reduces the dimensionality of a high-dimensional dataset while retaining as much of its variation as possible. This method transforms a number of possibly correlated variables (the original data matrix) into one or a few important variables (principal components (PCs)) to reveal the internal structure. Each PC is a linear combination of the original variables. The new variables are uncorrelated with each other, which eliminates overlapping information. Moreover, these new variables capture the most informative dimensions of the original variables without losing too much information. Commonly, the number of PCs is determined by the cumulative contribution rate. When the cumulative contribution rate is more than 85%, the principal components can represent most of the information provided by the original variables [
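The 85% rule can be sketched as follows. The function takes the variance explained by each PC (in descending order) and returns the smallest number of leading PCs whose cumulative contribution rate exceeds the threshold. The function name and the eigenvalues are hypothetical and are not taken from the study's data.

```python
def n_components_for(explained_variances, threshold=0.85):
    """Return the smallest number of leading PCs whose cumulative
    contribution rate exceeds `threshold`, plus the cumulative rates."""
    total = sum(explained_variances)
    cumulative, running = [], 0.0
    for variance in explained_variances:
        running += variance
        cumulative.append(running / total)
    for k, rate in enumerate(cumulative, start=1):
        if rate > threshold:
            return k, cumulative
    return len(cumulative), cumulative

# Hypothetical variances of PC1, PC2, ... (illustration only)
eigenvalues = [6.0, 3.0, 0.6, 0.3, 0.1]
k, rates = n_components_for(eigenvalues)
```

With these illustrative values the first two PCs together carry 90% of the variance, so two components pass the threshold.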
PLS is a multivariate statistical analysis method. It recombines the original variables (mainly continuous variables) into a group of new independent comprehensive variables and extracts a few of them to reflect as much of the information in the original variables as possible. The extracted variables have good interpretation ability for the dependent variables. During modeling, it considers not only the independent variable matrix (spectral matrix) but also the “response” matrix (content matrix). The component scores extracted by dimension reduction are used as input variables to avoid multicollinearity, improve stability, and simplify the model. PLS therefore offers a simple model, quick calculation, and strong prediction ability, and as one of the most classical data processing tools in multiple correlation regression, it is widely applied in NIR spectroscopy quantitative analysis [
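To make the idea concrete, the following is a minimal sketch of single-response PLS (PLS1) using the NIPALS deflation scheme, written in plain Python. It is an illustration under stated assumptions, not the routine used by the chemometrics software in this study; the function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, n_components):
    """Fit PLS1 (one response variable) with NIPALS deflation."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the residual y
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm == 0:
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]  # loadings
        q = dot(f, t) / tt                                     # y regressed on scores
        # Deflate X and y before extracting the next component
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components

def pls1_predict(model, x_new):
    """Predict the response for one new sample by repeating the deflation."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x_new, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat

# Toy data (hypothetical): the response is an exact linear function of X
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [2 * a + 3 * b + 1 for a, b in X]
model = pls1_fit(X, y, 2)
prediction = pls1_predict(model, [6.0, 4.0])
```

Because the toy response is exactly linear and both available components are used, the prediction reproduces the underlying relationship. With real spectra, far fewer components than variables would be retained.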
SVM is a powerful supervised learning algorithm that was first proposed by Vapnik [
RBF is a commonly used kernel function in the SVM algorithm. It has a strong ability to deal with nonlinear problems. It can be expressed as follows:
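In the notation commonly used by SVM software, the RBF kernel between two input vectors is written as shown here. The symbols $x_i$, $x_j$ (input vectors) and $g$ (kernel width parameter) are assumed notation and may differ from the symbols used elsewhere in this paper.

$$K(x_i, x_j) = \exp\left(-g\,\lVert x_i - x_j \rVert^{2}\right), \qquad g > 0$$

A larger $g$ makes the kernel decay faster with distance, so each training sample influences a smaller neighbourhood.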
RBF has two important parameters in the SVM algorithm, i.e., penalty factor “
During modeling, the SVM algorithm maps the input data to a higher-dimensional space to perform classification or regression fitting. The data should therefore first be pretreated and compressed.
In the PCA-SVC qualitative model, the model performance was evaluated by 3-fold cross-validation (3-CV) of the calibration set. The internal parameters
During modeling, model performance was validated by internal 6-fold cross-validation on the calibration set, and the root mean square error of internal cross-validation (RMSECV), coefficient of determination (
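The evaluation statistics can be sketched in a few lines of Python. The definitions below are common textbook forms and are assumptions here: the root mean square error is the same formula whether it is computed on cross-validation, calibration, or prediction residuals, and RPD is taken as the standard deviation of the reference values divided by that error. Vendor software may use slightly different (for example bias-corrected) variants. The data are hypothetical.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)

def r_squared(reference, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

def rpd(reference, predicted):
    """Ratio of performance to deviation: SD of reference values / RMSE."""
    n = len(reference)
    mean_ref = sum(reference) / n
    sd = math.sqrt(sum((r - mean_ref) ** 2 for r in reference) / (n - 1))
    return sd / rmse(reference, predicted)

# Hypothetical nitrogen contents (%) and model predictions (illustration only)
reference = [8.0, 9.0, 10.0, 11.0, 12.0]
predicted = [8.5, 9.0, 10.0, 11.0, 11.5]
error = rmse(reference, predicted)
r2 = r_squared(reference, predicted)
ratio = rpd(reference, predicted)
```

A lower error and higher coefficient of determination and RPD indicate a better model.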
The nitrogen content of 64 samples was measured. Samples used in the analysis are as follows: 25 specimens of
The NIR spectra of samples were scanned in the range of 12000–4000 cm−1; the spectra diagram is shown in Figure
The spectra of 64 samples were randomly divided into calibration and prediction sets in a proportion of approximately 2 : 1. In total, the 42 samples of the calibration set were used for model establishment, and the 22 samples of the prediction set were used for model evaluation. The species were represented with category label numbers 1 to 5. The classification information is shown in Table
Classified information of the qualitative model of medicinal centipedes.
Sample set | | | | | | Total |
---|---|---|---|---|---|---|
Calibration set | 8 | 6 | 5 | 6 | 17 | 42 |
Prediction set | 5 | 3 | 3 | 3 | 8 | 22 |
Label value | 1 | 2 | 3 | 4 | 5 | — |
In this qualitative analysis, the three methods VN, FD, and SD were used to pretreat the raw spectra. PCA was used to reduce the dimensions of the raw spectra and the three sets of pretreated spectra, and the cumulative contribution rates of the PCs were calculated. The result showed that the cumulative contribution rate of the first two PCs (PC1 and PC2) was more than 85%, which can represent most of the spectrum information [
To further investigate the influence of different pretreatments, a group of PCA-SVC models was established using the scores of the first 2 PCs as input variables and category labels as output variables. The model performance was evaluated by 3-fold cross-validation (3-CV) of the calibration set. The internal parameters of the SVC algorithm were optimized with the GS method. The values of best
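A grid search of this kind can be sketched as a coarse pass over powers of two followed by a finer pass around the best coarse point. The sketch below makes several assumptions: the two SVC parameters are called `c` and `g`, the grids and step sizes are illustrative, and `toy_cv_accuracy` is a hypothetical stand-in for "train an SVC with these parameters and return its 3-fold cross-validation accuracy".

```python
import math

def grid_search(score, exponents_c, exponents_g):
    """Return (best_score, c, g) over the grid c = 2**a, g = 2**b.
    `score(c, g)` must return a value where higher is better."""
    best = None
    for a in exponents_c:
        for b in exponents_g:
            c, g = 2.0 ** a, 2.0 ** b
            s = score(c, g)
            if best is None or s > best[0]:
                best = (s, c, g)
    return best

def frange(start, stop, step):
    """Inclusive list of evenly spaced floats."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]

def toy_cv_accuracy(c, g):
    # Hypothetical smooth surface peaking at c = 2**12.5, g = 2**-3.5.
    # A real run would fit and cross-validate a classifier here.
    return 100.0 - (math.log2(c) - 12.5) ** 2 - (math.log2(g) + 3.5) ** 2

# Coarse pass: integer exponents
coarse = grid_search(toy_cv_accuracy, range(-5, 21), range(-10, 11))

# Fine pass: half-integer steps around the coarse optimum
a0, b0 = math.log2(coarse[1]), math.log2(coarse[2])
fine = grid_search(toy_cv_accuracy,
                   frange(a0 - 1, a0 + 1, 0.5),
                   frange(b0 - 1, b0 + 1, 0.5))
```

The two-stage design keeps the number of model fits manageable: the coarse pass locates the promising region cheaply, and only that region is sampled densely.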
Different spectral pretreatments of PCA-SVC models.
Model number | Pretreatment | NPC | | | Accuracy rate (%), 3-fold cross-validation | Accuracy rate (%), calibration set | Accuracy rate (%), prediction set |
---|---|---|---|---|---|---|---|
SVC-1 | Raw | 2 | 524288 | 0.03125 | 54.7619 | 90.4762 | 59.0909 |
SVC-2 | VN | 2 | 6.71089 | 0.0078125 | 64.2857 | 66.6667 | 63.6364 |
SVC-3 | FD | 2 | 16 | 32768 | 59.5238 | 64.2857 | 63.6364 |
SVC-4 | SD | 2 | 3.35544 | 32768 | 66.6667 | 71.4286 | 68.1818 |
Although the SD was determined as the optimal pretreatment in a preliminary investigation, the accuracy of the model built on the scores of the first 2 PCs was only about 70%, which did not meet the requirement of discrimination. Hence, the NPC still needed to be optimized. Following the modeling procedure and SD pretreatment described above, 10 PCA-SVC models (SVC-5 to SVC-14) were established using the scores of the first 1, 2, 3, …, 10 PCs of the calibration set as input variables. As shown in Table
Comparison on PCA-SVC models established with different NPCs.
Model number | NPC | | | Accuracy rate (%), 3-fold cross-validation | Accuracy rate (%), calibration set | Accuracy rate (%), prediction set |
---|---|---|---|---|---|---|
SVC-5 | 1 | 64 | 4.29497 | 59.5238 | 80.9524 | 63.6364 |
SVC-6 | 2 | 3.35544 | 32768 | 66.6667 | 71.4286 | 68.1818 |
SVC-7 | 3 | 4.1943 | 262144 | 66.6667 | 85.7143 | 81.8182 |
SVC-8 | 4 | 2521.38 | 1.27148 | 71.4286 | 90.4762 | 77.2727 |
SVC-9 | 5 | 23170.5 | 3.65135 | 73.8095 | 97.6190 | 72.7273 |
SVC-10 | 6 | 26615.9 | 794672 | 73.8095 | 90.4762 | 77.2727 |
SVC-11 | 7 | 26615.9 | 1.2045 | 78.5714 | 97.6190 | 81.8182 |
SVC-12 | 8 | 5.93164 | 5792.62 | 83.3333 | 100 | 81.8182 |
SVC-13 | 9 | 131072 | 262144 | 80.9524 | 100 | 81.8182 |
SVC-14 | 10 | 1.04858 | 65536 | 80.9524 | 100 | 77.2727 |
According to the research above, SVC-12 was determined as the best qualitative analysis model. After the full spectrum was pretreated with the SD and the dimension was reduced with PCA, the model was established using the scores of the first 8 PCs as input variables and category labels as output variables. The internal parameters of best
Optimization of internal parameters with the grid search of the PCA-SVC model. (a) Initial grid search. (b) Fine search.
Validation results of the PCA-SVC model for medicinal centipedes: (a) calibration set; (b) prediction set. The red points represent validation results, and the blue points represent reference values.
In this quantitative analysis, the Kennard–Stone (K-S) algorithm was used in the MATLAB R2014a software to divide the 64 sample spectra into a calibration set and a prediction set in a proportion of 2 : 1. The 42 samples of the calibration set were used for model building and internal validation, while the 22 samples of the prediction set were used for prediction.
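The Kennard–Stone rule can be sketched in plain Python as follows. The sketch assumes Euclidean distance between samples and starts from the two most distant samples, then repeatedly adds the candidate whose nearest already-selected neighbour is farthest away. The function name and the one-dimensional toy data are hypothetical; in practice each sample would be a spectrum or a vector of scores.

```python
import math

def kennard_stone(samples, n_select):
    """Return the indices of `n_select` samples chosen by the Kennard-Stone rule."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    # Start from the two samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda ij: dist[ij[0]][ij[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select and remaining:
        # Add the candidate whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Six hypothetical samples described by a single feature
toy = [(0.0,), (1.0,), (2.0,), (5.0,), (9.0,), (10.0,)]
calibration = kennard_stone(toy, 4)
prediction = [k for k in range(len(toy)) if k not in calibration]
```

The selected calibration samples span the extremes and then fill the largest gaps, so the samples left for prediction lie inside the range covered by the calibration set.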
The partial least-squares regression (PLSR) model is a type of multiple linear regression (MLR) model; it can readily establish a linear relationship between input variables (spectral information) and output variables (ingredient contents) after the high-dimensional data are compressed by PLS. PLSR can analyze data with strongly collinear (correlated) and noisy independent variables and can also model several response variables simultaneously; it has become a standard tool in chemometrics [
The full spectral data (12000–4000 cm−1) were used for modeling. To reduce noise and other interfering factors, the spectra were first pretreated. The pretreatments applied were Raw (no pretreatment), VN, FD, FD + VN, MSC, and FD + MSC. After the dimensions were reduced with PLS, the treated spectral data were used as input variables with nitrogen content as the output reference, and a series of PLSR models were established with the Unscrambler 9.7 software.
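Of the pretreatments listed, MSC is the one that depends on the whole set of spectra. A minimal sketch, assuming the usual formulation in which each spectrum is regressed on the mean spectrum and then corrected with the fitted intercept and slope, is given here; the function name and toy spectra are hypothetical.

```python
def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum."""
    n_points = len(spectra[0])
    reference = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(reference) / n_points
    ref_var = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        # Least-squares fit: s ~ intercept + slope * reference
        slope = sum((r - ref_mean) * (v - s_mean)
                    for r, v in zip(reference, s)) / ref_var
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected, reference

# Two toy spectra: the same shape with different additive and multiplicative effects
base = [1.0, 2.0, 4.0, 3.0]
toy = [[0.5 + 0.8 * v for v in base], [-0.5 + 1.2 * v for v in base]]
corrected, reference = msc(toy)
```

In this toy case both corrected spectra collapse onto the mean spectrum, because their only differences were an offset and a scale factor.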
During the process, the model was validated and evaluated. As shown in Table
Validation and predictive results of PLSR models with different pretreatment methods.
Model number | Pretreatment | RMSECV (%) | Coefficient of determination, 6-fold cross-validation (%) | RPD, 6-fold cross-validation | RMSEP (%) | Coefficient of determination, external validation (%) | RPD, external validation | RMSEE (%) | NPC |
---|---|---|---|---|---|---|---|---|---|
PLSR-1 | Raw | 0.42 | 90.50 | 2.85 | 0.51 | 80.78 | 2.30 | 0.27 | 9 |
PLSR-2 | VN | 0.47 | 87.22 | 2.51 | 0.43 | 84.3 | 2.51 | 0.34 | 5 |
PLSR-3 | FD | 0.46 | 88.14 | 2.53 | 0.46 | 84.41 | 2.39 | 0.31 | 6 |
PLSR-4 | FD + VN | 0.41 | 90.71 | 3.04 | 0.43 | 85.84 | 2.53 | 0.32 | 5 |
PLSR-5 | MSC | 0.44 | 89.63 | 2.69 | 0.50 | 81.72 | 2.39 | 0.26 | 8 |
PLSR-6 | FD + MSC | 0.40 | 90.95 | 2.96 | 0.44 | 85.61 | 2.51 | 0.31 | 5 |
The optimization of the NPC is an important step during modeling. It can be obtained from the RMSECV-NPC diagram. For instance, in the PLSR-6 model, with the change of the NPC, the RMSECV had different values; when the NPC was 5, the RMSECV had a minimum value, and the model had the best performance. Therefore, the optimal NPC was determined as 5. The optimization is shown in Figure
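The selection procedure can be sketched generically: compute the cross-validation error for each candidate number of components and keep the one with the smallest error. In the sketch below, `fit` and `predict` are hypothetical callbacks standing for any regression model, a straight-line fit is used only as a stand-in (it is not PLSR), and the RMSECV values per NPC are invented for illustration.

```python
import math

def k_fold_indices(n_samples, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(start, n_samples, k)) for start in range(k)]

def rmsecv(x, y, k, fit, predict):
    """RMSE of k-fold cross-validation for a model given by
    `fit(x_train, y_train)` and `predict(model, x_new)`."""
    squared_errors = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            squared_errors.append((predict(model, x[i]) - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Stand-in model: straight-line least squares on one variable
def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

def predict_line(model, x_new):
    slope, intercept = model
    return intercept + slope * x_new

xs = [float(i) for i in range(12)]
ys = [2.0 * v + 1.0 for v in xs]          # noise-free toy data
cv_error = rmsecv(xs, ys, 6, fit_line, predict_line)

# Hypothetical RMSECV values for several NPCs (illustration only)
rmsecv_by_npc = {3: 0.52, 4: 0.45, 5: 0.40, 6: 0.42}
best_npc = min(rmsecv_by_npc, key=rmsecv_by_npc.get)
```

Choosing the NPC at the minimum of the cross-validation error guards against both underfitting (too few components) and overfitting (too many).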
RMSECV-NPC diagram of the PLSR-6 model.
As described above, the best PLSR model was finally determined, the optimized pretreatment was determined as FD + MSC, and the NPC was 5. During modeling, 6-fold cross-validation was used as internal validation to validate the performance, and the predictive ability was evaluated with external validation using the prediction set. The predictive results are shown in Figure
Predictive results in the calibration set (a) and prediction set (b) of the PLSR-6 model.
Besides ingredient information, NIR spectra also contain much other physical and chemical information, which often causes serious overlap of spectral bands. In most cases, the relationship between sample spectra and content is in fact nonlinear. With the growing application of chemometrics, modern intelligent algorithms have attracted more attention in NIR spectroscopy analysis for their strong nonlinear fitting ability and have seen preliminary exploration and application. The SVM algorithm is grounded in statistical learning theory, which allows it to achieve a good fitting effect and a stable structure. As a result, it has become a commonly used nonlinear regression algorithm. Compared with the ANN algorithm which is suitable for solving problems of complex mapping and large sample size [
In this study, an SVR algorithm combined with PLS dimension reduction was used to establish a nonlinear regression model. The settings determined for the PLSR model (FD + MSC pretreatment and PLS dimension reduction with an NPC of 5) were carried over to the SVM algorithm, and the SVR models were built in the MATLAB R2014a software. The GS and GA were adopted to optimize the internal parameters (
Validation and evaluation results of SVR models.
Model number | Optimization method | | | RMSECV (%) | Coefficient of determination, 6-fold cross-validation (%) | RPD, 6-fold cross-validation | RMSEP (%) | Coefficient of determination, external validation (%) | RPD, external validation | RMSEE (%) |
---|---|---|---|---|---|---|---|---|---|---|
PLS-SVR-1 | GA | 99.99 | 997.03 | 0.4 | 91.54 | 2.91 | 0.41 | 85.89 | 2.55 | 0.34 |
PLS-SVR-2 | GS | 512 | 1024 | 0.34 | 93.29 | 3.72 | 0.43 | 85.5 | 2.54 | 0.32 |
Parameter optimization (a) and predictive results (b) of the PLS-SVR-2 model.
In this study, the linear regression model of PLSR and nonlinear regression model of PLS-SVR were successfully established. As shown in Tables
However, the PLSR model is built on a linear regression algorithm and is characterized by fast fitting and simple calculation, so it can be widely used when the analysis requirements are not too high. In contrast, the SVR model is based on a nonlinear regression algorithm and has strong nonlinear fitting ability. It was shown from Tables
This study was carried out to explore the feasibility of using the NIR spectroscopy method to rapidly differentiate species and evaluate the quality of Chinese medicinal centipedes. In the qualitative analysis, after spectra were pretreated with the SD, dimensions were reduced with PCA, and internal parameters were optimized with the GS algorithm, a PCA-SVC model was set up using the scores of the first 8 PCs as input variables and category labels as output variables. The optimal model (SVC-12) identified the five species of medicinal centipedes with an accuracy of 100% (42/42) in the calibration set and 81.82% (18/22) in the prediction set. It could be accepted as an objective, rapid, and auxiliary method for identifying the species of medicinal centipedes. With the spectra pretreated with FD + MSC, the data dimensions reduced with PLS, and the NPC set to 5, the two best quantitative models, PLSR and PLS-SVR, were also successfully established. During the process of modeling, the RMSECV,
Meanwhile, the pretreatment methods were also optimized in this paper: the SD was selected for the qualitative model, whereas MSC or its combinations were applied to pretreat the spectra in the quantitative models. MSC has the advantage of weakening or eliminating interference caused by the uneven grain size of solid powder in the diffuse reflection spectrum [
This study indicated that NIR spectroscopy combined with chemometric algorithms could be successfully used to differentiate species and evaluate the quality of medicinal centipedes in China, and that the approach is rapid, nondestructive, and environmentally friendly. However, this study represents only preliminary exploratory research; although 64 individuals from 28 batches were analyzed, the sample size was still limited. In the future, more samples will be used to improve the prediction ability, and other algorithms will also be considered to simplify the model and improve performance. This study also provides a reference for rapid identification and quality analysis of other animal medicinal materials using NIR spectroscopy.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by a grant from the major drug discovery projects of the National Ministry of Science and Technology of China (no. 2014ZX09304307001).