Identification of Chinese Herbal Medicines from Zingiberaceae Family Using Feature Extraction and Cascade Classifier Based on Response Signals from E-Nose

Identification of Chinese herbal medicines (CHMs) by human experience is often inaccurate because individual ability and external factors may influence the outcome. An electronic nose (E-nose) is a promising alternative for identifying them. This paper presents a rapid and reliable method for identifying ten different species of CHMs from the Zingiberaceae family based on their E-nose response signals. Ten Zingiberaceae CHMs were measured and their maximum response values were analyzed by principal component analysis (PCA). The results show that E Zhu (Curcuma phaeocaulis Val.) and Yi Zhi (Alpinia oxyphylla Miq.) could not be distinguished completely by PCA. Two solutions were proposed: (i) using the BestFirst+CfsSubsetEval (BC) method to extract more discriminative features, selecting the sensors with higher contribution rates and removing redundant signals; (ii) employing a novel two-stage cascade classifier to enhance the distinguishing-positive rate (DPR). Based on these strategies, six features were extracted and used in different stages of the cascade classifier, which yielded higher DPRs.


Introduction
Chinese herbal medicines (CHMs) are attracting increasing international attention owing to their in vivo antitumor activities [1], their use as an alternative treatment for menopausal hot flushes [2], their inhibition of cancer-related inflammation [3], and so on. Accurate medication with CHMs is an important issue as their global use is increasing rapidly. Moreover, incorrect, inferior, or fake CHMs may result in poor clinical effects or even poisoning [4]. Therefore, it is urgent and necessary to establish rapid, accurate, and convenient methods for CHMs identification.
Classically, CHMs are authenticated by macroscopic identification, in which an experienced human panel assesses the morphological characteristics (shape, color, etc.) and the organoleptic properties (odor, taste, etc.) by observing, touching, smelling, and tasting. In the Pharmacopoeia of the People's Republic of China, macroscopic identification is still an important aspect of CHMs quality evaluation. For instance, Sha Ren (dried fructus of Amomum villosum Lour.) with large size and strong odor is considered superior in quality. However, this method is expensive, time-consuming, and inaccurate because of its low sensitivity and lack of quantitative information. The electronic nose (E-nose) is a rapid, reliable, and robust technology that could serve as an easy-to-use and cost-efficient alternative.
In the past decades, the E-nose has been applied in many fields [5]. Many studies on the application of the E-nose to quality control of CHMs and food are also available in the literature [6,7].
Evidence-Based Complementary and Alternative Medicine
Yu and Wang [8] considered the volume of vial and the headspace generated time relating to the identification … [10] are reported. In this paper, an orthogonal test is designed to determine the optimum experimental procedure, and two new solutions are proposed to improve the distinguishing-positive rate (DPR) for classifying different CHMs from the Zingiberaceae family. As is well known, CHMs from the Zingiberaceae family share some chemical components, such as eucalyptol, zingiberene, and camphor. Hence some of them have quite similar odors. Nevertheless, subtle differences among the odors of different CHMs exist, arising from their unique chemical components and the different proportions of their chemical composition. These differences are very small, and their effects on the odors can be distinguished only by well-trained people. An E-nose with an optimum experimental procedure and suitable classifiers could be useful for identifying different Zingiberaceae CHMs.

Subjects.
The research was carried out using ten different kinds of Zingiberaceae CHMs, which were obtained from Beijing Tongrentang Co., Ltd. (Beijing, China) and authenticated by Professor Yonghong Yan of Beijing University of Chinese Medicine (Beijing, China). As illustrated in … The α-FOX3000 E-nose consists of air generator equipment, a sampling apparatus, an HS-100 autosampler, a detector unit, and a computer for data recording. The detector unit is composed of 12 metal oxide sensors (MOSs), shown in Table 2. The sensor response represents the conductance ratio ($G_0/G$).
After being ground into small particles, a certain amount of each sample was accurately weighed into a 10 mL septa-sealed bottle and loaded into the autosampler tray with a rotation rate of 250 rpm. After incubation, 500 μL of headspace air was automatically injected into the detector unit through a syringe at a constant rate of 500 μL/s. The conductance ratio of each sensor changed during the measurement process and was recorded by a computer program. The measurement phase lasted 200 s, which was enough for all the sensors to reach stable values and return to the baseline. The data acquisition cycle was 1 s, and the responses were stored on the computer for later use.

Optimization of Experimental Conditions by Orthogonal Test Design.
The quantity of each sample was selected from 0.1 g, 0.2 g, 0.3 g, 0.4 g, 0.5 g, and 1 g. Based on the selected quantity, an orthogonal array ($L_9(3^3)$) was constructed to evaluate the effects of the following factors: particle size (A), incubation temperature (B), and incubation time (C). The factors are displayed in Table 3. Each experiment was performed in triplicate and the data were analyzed using variance analysis (ANOVA). Based on the results, the optimum experimental procedure was determined. Twelve repeated samples were prepared for each kind of CHM, and a total of 120 measurements were performed using the optimum experimental procedure. The E-nose response values of the CHMs' samples were recorded and classifiers were established to identify them. All the samples were measured by dynamic headspace sampling.
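To illustrate the orthogonal design described above, the following minimal sketch builds a nine-run array for three factors at three levels and estimates main effects from level means. It is an assumption-laden illustration, not the authors' procedure: the function names (`l9_array`, `is_orthogonal`, `level_means`) are hypothetical, the levels are coded 0 to 2 because the actual levels are given only in Table 3, and the full ANOVA is not reproduced.

```python
from collections import Counter
from itertools import product


def l9_array():
    """Return 9 runs for three 3-level factors (levels coded 0, 1, 2).

    Columns stand for A (particle size), B (incubation temperature),
    and C (incubation time). The third column is (a + b) mod 3, a
    standard way to obtain a balanced third factor in nine runs.
    """
    return [(a, b, (a + b) % 3) for a, b in product(range(3), repeat=2)]


def is_orthogonal(runs):
    """True if every pair of columns holds each of the 9 level pairs once."""
    n_cols = len(runs[0])
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            counts = Counter((run[i], run[j]) for run in runs)
            if len(counts) != 9 or set(counts.values()) != {1}:
                return False
    return True


def level_means(runs, responses, factor):
    """Mean response at each level of one factor (main-effect estimate)."""
    means = []
    for level in range(3):
        vals = [y for run, y in zip(runs, responses) if run[factor] == level]
        means.append(sum(vals) / len(vals))
    return means
```

Because each level of a factor appears together with every level of the other factors equally often, the level means of one factor are not biased by the others, which is what allows nine runs to stand in for the 27 runs of a full factorial design.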

Principal Component Analysis, Radial Basis Function, and Random Forests.
The data obtained by the procedure described above were analyzed by different pattern recognition techniques, including principal component analysis (PCA), radial basis function (RBF), and random forests (RF), to construct different classifiers.
Tenfold cross validation and external test set validation were used to evaluate the performance of different classifiers with DPR. The classification results should not be accepted if the DPR was lower than 80%.
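The validation scheme can be sketched as follows. This is a minimal illustration under stated assumptions: DPR is taken here to mean the fraction of samples of each class that are labeled correctly (the text does not define it formally), and the helper names `stratified_folds`, `dpr_per_class`, and `accepted` are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    # Deal samples out round-robin so each class is spread over the folds.
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def dpr_per_class(true_labels, predicted_labels):
    """Assumed DPR: share of each class's samples labeled correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}


def accepted(dpr, threshold=0.80):
    """Apply the 80% acceptance rule stated in the text."""
    return all(rate >= threshold for rate in dpr.values())
```

With 120 measurements (ten classes, twelve samples each), each of the ten folds holds twelve samples; every fold serves once as the held-out set while the classifier is trained on the other nine.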
PCA is a projection method that provides easy visualization of the information contained in a dataset [11]. In addition, PCA helps to show which samples differ from the others and which principal components, extracted from the original variables, contribute most to this difference.
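As a minimal sketch of the projection idea, the code below centers a data matrix, forms its covariance matrix, and finds the first principal component by power iteration. It is an illustration only: the software used in the study is not specified in this section, the function names are hypothetical, and only the first component is computed.

```python
import math


def center(rows):
    """Subtract the column mean from every column."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return v, eigenvalue


def pc1_scores(rows):
    """Project samples on PC1; also return PC1's share of total variance."""
    centered = center(rows)
    cov = covariance(centered)
    v, eigenvalue = first_component(cov)
    total = sum(cov[i][i] for i in range(len(cov)))
    scores = [sum(x * vi for x, vi in zip(r, v)) for r in centered]
    return scores, eigenvalue / total
```

In the setting of this paper each row would hold the 12 maximum sensor responses of one sample, and plotting the scores of the first two components gives the kind of scatter plot in which overlapping classes become visible.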
An RBF network is a kind of artificial neural network with a simple topological structure and a clear learning procedure. RBF has unique advantages in pattern recognition applications owing to its quick convergence and freedom from local minima [12]. After training, RBF can provide a discriminative model for the analysis of blind samples.
The RF technique is a novel classifier containing multiple decision trees for the analysis of high-dimensional data, and it is able to exploit dependencies and structures contained within spatially varying input data [13]. The self-learning ability of RF helps to avoid misjudgment and improve accuracy.
While screening for suitable neural networks or decision trees in the classifier system, the system processes the inputs and compares its resulting outputs to the desired outputs. Errors are propagated back to the system so that it adjusts its inner structures to improve the distinguishing-positive rate. During training, the same dataset is processed many times as the connection weights are refined.

Feature Extraction by BestFirst+CfsSubsetEval and Cascade Classifier Analysis.
Feature extraction analysis aims to screen all the variables, selecting the factors that carry more valuable information for the final target and eliminating those with redundant information. This reduces the dimensionality of the data and helps the classifier obtain more accurate results. BestFirst+CfsSubsetEval (BC) is one such feature extraction technique. It can screen out the characteristic parameter vectors with high relevance to the classification. Based on BC, an optimum set of sensors can be obtained for the final identification.
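The idea behind correlation-based feature subset selection can be sketched as below. This is a simplified stand-in rather than the BC method itself: the merit formula is the usual CFS heuristic, but absolute Pearson correlation against a numerically coded class is assumed in place of the correlation measure of the original tool, and a plain greedy forward search replaces the best-first search with backtracking. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, target):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean feature-class correlation and r_ff the mean
    feature-feature correlation within the subset.
    """
    k = len(subset)
    r_cf = sum(pearson(columns[f], target) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(pearson(columns[a], columns[b]) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(columns, target):
    """Greedily add the feature that most improves the merit.

    Stops as soon as no remaining feature raises the merit.
    """
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        merit, feature = max(
            (cfs_merit(selected + [f], columns, target), f) for f in remaining
        )
        if merit <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best
```

The merit rises when a sensor correlates with the class and falls when it duplicates sensors already chosen, which is how redundant sensor signals are excluded from the final set.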
Cascade classifier contains more than one kind of pattern recognition algorithms, which can improve the identification ability of the system and increase the DPR [14]. This paper presents a two-stage cascade classifier with RBF and RF for distinguishing those different CHMs from Zingiberaceae family. Figure 1 shows the scheme of the work flow.
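A two-stage cascade of this kind can be sketched as follows. The routing rule is an assumption, since the text does not spell it out: stage I assigns one of the ten classes, and only when it outputs one of the hard-to-separate classes is the sample passed to stage II, which chooses among the two hard classes and a merged "other" class. A nearest-centroid rule stands in for both the RBF and RF classifiers, and all names are hypothetical.

```python
def nearest_centroid(train_rows, train_labels):
    """Return a predict(row) function; a stand-in for RBF or RF."""
    sums, counts = {}, {}
    for row, label in zip(train_rows, train_labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(row):
        def squared_distance(lab):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[lab]))
        return min(centroids, key=squared_distance)

    return predict


def relabel_for_stage2(labels, hard_classes, other="other"):
    """Stage II labels: keep the hard classes, merge the rest into one."""
    return [lab if lab in hard_classes else other for lab in labels]


def cascade_predict(row1, row2, stage1, stage2, hard_classes, other="other"):
    """Stage I decides first; hard-class outputs are re-checked by stage II.

    row1 and row2 are the feature vectors used by stage I and stage II.
    """
    first = stage1(row1)
    if first not in hard_classes:
        return first
    second = stage2(row2)
    # If stage II sees neither hard class, fall back on the stage I label.
    return first if second == other else second
```

In use, stage I would be trained on the ten-class labels and stage II on the output of `relabel_for_stage2(labels, {"E Zhu", "Yi Zhi"})`, each with its own selected feature subset.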

E-Nose Responses to CHMs Aroma.
When estimating the sensor response to a given sample, the response values were calculated as $R = G_0/G$, where $R$ was the response, $G_0$ was the conductance of a sensor in the reference air, and $G$ was the conductance of the sensor in the sample gas. Figure 2 shows typical responses of twelve MOSs measuring a Sha Ren sample. One line represents one MOS's signals measuring the Sha Ren sample. The horizontal axis is the timeline, a total of 200 seconds; the vertical axis is the response values of the MOS. The curves represent the resistance value of each sensor against time due to the electrovalve action when the volatile compounds reached the detection chamber. In the initial period, the response value of each MOS is low and then increases continuously and finally stabilizes after a few seconds. In this study, 12 maximum response values of each CHM sample from 12 MOSs were extracted and analyzed individually.
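The feature extraction step can be sketched as below: each sensor's conductance readings are turned into a response curve, and the maximum of each curve becomes one feature. This is a minimal sketch; the function names and sample values are hypothetical, and it assumes that "maximum response" means the largest value of the ratio over the 200 s window.

```python
def response_curve(g0, conductances):
    """Response G0 / G at each time step for one sensor."""
    return [g0 / g for g in conductances]


def max_response_features(curves):
    """One feature per sensor: the maximum of its response curve."""
    return [max(curve) for curve in curves]


# Hypothetical readings for two sensors (baseline conductance, then samples).
curves = [
    response_curve(2.0, [2.0, 1.0, 0.5, 1.0]),
    response_curve(1.0, [1.0, 0.8, 0.5, 1.0]),
]
features = max_response_features(curves)
```

With 12 sensors, `features` would hold 12 values per sample, giving the 12-dimensional vectors that are passed to PCA and the classifiers.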
The repeatability of the established method was evaluated with nine parallel tests of CHMs' samples. The relative standard deviation (RSD, n = 6) values of the 12 MOSs were calculated. The results are all less than 3%, indicating high repeatability of the MOS responses.
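The repeatability check amounts to the following calculation, shown here as a minimal sketch with hypothetical replicate values; RSD is computed as the sample standard deviation divided by the mean, expressed in percent.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation: 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


def all_sensors_repeatable(replicates_by_sensor, limit=3.0):
    """True if every sensor's RSD over its replicates is below the limit."""
    return all(rsd_percent(vals) < limit for vals in replicates_by_sensor)


# Hypothetical maximum responses of one sensor over six replicate runs.
replicates = [0.98, 1.00, 1.02, 1.00, 0.99, 1.01]
rsd = rsd_percent(replicates)
```

For these illustrative values the RSD is about 1.4%, which would pass the 3% criterion used in the text.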

Variance Analysis (ANOVA).
The amount of a sample is important for the detection of CHMs aroma and affects the volume and the concentration of the headspace gas because …
The relative standard deviation (RSD) values of different incubation temperatures, with six repeated samples for each experiment, show that an incubation temperature of 45 °C has the lowest and most stable RSD values (all of them less than 2%). In order to obtain more stable responses, the CHMs' samples were measured after incubating at 45 °C. The RSD analyses of the other two factors show that a particle size of 850 ± 29 μm and an incubation time of 600 seconds are the optimum experimental conditions.
… (PCA, RBF, and RF).
The dataset obtained from ten kinds of CHMs' samples was analyzed by PCA. The PCA plot is shown in Figure 5. Most of the samples could be classified by PCA. These CHMs share the same characteristic that they all have one or more diagnostic chemical compounds such as …
… BC feature extraction analysis. The result shows that the DPR in the test set by RBF increases from 80% to 85%. In this study, a two-stage cascade was constructed focusing on the identification of E Zhu and Yi Zhi. Dataset-1 was established with each kind of CHM as one category and was used in the stage I classifier (RBF). Dataset-2 was established with E Zhu, Yi Zhi, and the remaining eight kinds of CHMs as three different categories and was used in the stage II classifier (RF). Detailed results are as follows: by tenfold cross validation in stage II, the DPR rose from 70% to 90% for E Zhu and from 70% to 80% for Yi Zhi. All the CHMs' samples could be distinguished with high DPRs by this two-stage cascade classifier.

Conclusions
The results of ANOVA and multivariance analysis showed that the suitable quantity of sample for measurement is 0.4 g and that the particle size and incubation temperature mainly influence the responses of most MOSs (MOS 2 to MOS 12). Based on the orthogonal test design, the optimum experimental conditions were determined: particle size of 850 ± 29 μm, incubation temperature of 45 °C, and incubation time of 600 seconds.
The responses of E Zhu and Yi Zhi partly overlap, and thus these two CHMs' samples could not be identified completely by PCA. Consistent with human sensory experience, the aromas of Sha Ren, Bai Dou Kou, and Cao Dou Kou differ more clearly from one another, as detected by both the human nose and the E-nose.
The classification results of RBF and RF analysis were superior to those of PCA. All the DPRs were above 80% and the CHMs' samples from the Zingiberaceae family could be classified. However, some samples of E Zhu were misjudged as Yi Zhi owing to their similar chemical compositions and odors.
Two solutions, BC feature extraction and two-stage cascade classifier, were proposed to improve the identification ability of the discriminative model, via removing redundant information to reduce the data dimensions and to separate