Recognition of MIR Data of Semen Armeniacae Amarum and Semen Persicae Using Discrete Wavelet Transformation and BP-Artificial Neural Network

Abstract. Horizontal attenuated total reflectance-Fourier transform infrared spectroscopy (HATR-FT-IR) is used to measure the mid-infrared (MIR) spectra of semen armeniacae amarum and its easily confused variety, semen persicae. To highlight the differences between the two, discrete wavelet transformation (DWT) is used to decompose their MIR spectra. Two main scales are selected as the feature-extraction space in the DWT domain. According to the distribution of the MIR spectra of semen armeniacae amarum and semen persicae, five feature regions are determined at each of the two selected scales. Thus, ten feature parameters form the feature vector. The feature vector is input to a back-propagation artificial neural network (BP-ANN), which is trained to classify semen armeniacae amarum and semen persicae accurately. 100 pairs of MIR spectra are used to train and test the proposed method: 50 pairs serve as training samples and the other 50 pairs as test samples. Experimental results show that the proposed method achieves an average recognition rate of 99% between semen armeniacae amarum and semen persicae.


Introduction
Traditional Chinese medicine (TCM) has made great contributions to the flourishing and prosperity of the Chinese nation. It also represents an important chapter in the annals of oriental civilization, and it has fascinated more and more people around the world [1,2].
Semen armeniacae amarum (bitter apricot kernel) is a TCM that has long been used in the treatment of disease. It is used for treating a variety of coughs and dyspnea, including cough caused by wind heat, and for treating dryness syndrome of the intestines with constipation [3]. In the Chinese medicine market, it can be difficult to determine the authenticity of a TCM, which calls for a scientific system of TCM quality assessment. In the search for such a system, more and more importance has been attached to modern instrumental analysis. Its application is regarded as a sign that TCM quality assessment is being modernized, because it offers more integrated information on quality than the analysis of a single component.
Fourier transform infrared spectroscopy (FT-IR) can capture nearly all of the material information about complex systems, so it has become (arguably) the most sophisticated analytical tool in spectral analysis and is indispensable and definitive for many analyses [4][5][6][7]. TCM is a complicated system of mixtures, so its spectra are difficult to interpret [8]. FT-IR analysis has been successful in distinguishing samples from different families and genera. When two TCM samples come from sibling species, however, they have similar chemical compositions, and the result is not ideal when MIR analysis alone is adopted [9]. Analytical chemists have therefore long sought ways to use the large amount of absorption-spectrum data from complex systems for fast and effective qualitative and quantitative analysis, and to display visually both the information buried in overlapping MIR bands and the differences between infrared absorption spectra, so that similar and complicated spectra can be identified. Chemometrics, which combines numerical computing with the information provided by instruments, has been widely used in many fields.
Wavelet transformation is a more effective signal-processing method than the Fourier transform. The transformed results (wavelet coefficients) of the discrete wavelet transformation (DWT) contain more valuable information, which makes DWT a relatively effective analysis method in chemometrics. Wavelet transformation has been used in chemistry and related domains in recent years. Ehrentreich [10] pointed out that wavelet transformation had been established alongside the Fourier transform as a data-processing method in analytical chemistry. Most existing methods in chemistry are based on the discrete wavelet transformation. For example, L. M. Shao et al. [11] introduced wavelet transformation and its applications to photoacoustic spectroscopy, EXAFS spectra, NMR analysis, and Raman spectra. In recent years, some researchers have also combined wavelet transformation with other intelligent techniques to analyze chemical signals. For example, R. Tabaraki et al. [12] developed a wavelet neural network (WNN) model in a quantitative structure-property relationship (QSPR) study for predicting the solubility of 25 anthraquinone dyes in supercritical carbon dioxide over a wide range of pressures (70-770 bar) and temperatures (291-423 K).
An artificial neural network (ANN) can learn from training samples so that it acquires memory and identification capabilities similar to those of the human brain and can carry out various information-processing functions. It has good self-learning, adaptive, associative-memory, parallel-processing, and nonlinear-conversion capabilities, which avoid complicated mathematical derivation. Even when samples are deficient or parameters drift, the output remains stable, which facilitates theoretical analysis. Recently, DWT and the back-propagation artificial neural network (BP-ANN) have been successfully applied to MIR spectroscopy analysis [13], but few studies have reported the application of MIR-DWT-ANN to the recognition of TCM. Therefore, HATR-FT-IR spectroscopy combined with DWT and an ANN discrimination method is proposed in this study for the rapid and simple classification of semen armeniacae amarum and semen persicae.

DWT
In numerical analysis and functional analysis, DWT is a wavelet transformation for which the wavelets are discretely sampled. As with other wavelet transformations, a key advantage it has over Fourier transformations is temporal resolution: it captures both frequency and location information. Because of this advantage, DWT has a huge number of applications in science, engineering, mathematics, and computer science. Most notably, it is used for signal coding to represent a discrete signal in a more redundant form, often as a preconditioning for data compression. DWT originates from the discretization of the continuous wavelet transformation (CWT), and the common discretization is dyadic. The CWT of a function or signal $f(t)$ can be defined as

$$W_f(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt, \tag{1}$$

where $\psi(t)$ denotes the mother wavelet function and $\psi^{*}$ its complex conjugate. The scale parameter $a$ and the translation parameter $b$ control, respectively, the dilation and the position of the mother function.
After the dyadic discretization, the DWT is accordingly expressed as

$$W_f(j,k) = 2^{-j/2} \int_{-\infty}^{+\infty} f(t)\, \psi^{*}\!\left(2^{-j} t - k\right) dt, \tag{2}$$

where $a$ and $b$ are replaced by $2^{j}$ and $2^{j}k$. An efficient way to implement this scheme using filters was developed in 1989 by Mallat. The original signal $f(t)$ passes through two complementary filters and emerges as a low-frequency signal (the approximation) and a high-frequency signal (the detail). The decomposition process can be iterated, with successive approximations being decomposed in turn, so that a signal can be broken down into many lower-resolution components [14].
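The iterated filter-bank decomposition can be illustrated with a minimal sketch. It assumes the Haar filter pair, which is the simplest orthogonal wavelet and is used here only for illustration; it is not the wavelet used in this study. The function names `haar_step` and `haar_dwt` and the toy input are hypothetical.

```python
import math


def haar_step(signal):
    """One level of Mallat's scheme with the Haar filter pair (illustrative choice).

    Returns (approximation, detail), each half the length of the input.
    """
    if len(signal) % 2:
        # Pad odd-length input by repeating the last sample.
        signal = signal + [signal[-1]]
    s = 1.0 / math.sqrt(2.0)
    # Low-pass branch: scaled pairwise sums (low-frequency content).
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High-pass branch: scaled pairwise differences (high-frequency content).
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Decompose successive approximations in turn.

    Returns (final_approximation, [detail_level_1, ..., detail_level_n]).
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Toy data, not a measured spectrum.
spectrum = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_dwt(spectrum, levels=3)
```

Each level halves the number of coefficients, so `details` holds progressively coarser high-frequency components while `approx` keeps the remaining low-frequency content.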

BP Algorithm
Artificial neural networks come in many models. Based on network structure, they can be divided into feed-forward and feedback networks. One of the main applications of feed-forward networks is identification and classification. In a feedback network there is no strict distinction between the input and output layers, and after learning the network can extract the important characteristics of the data through energy minimization [15]. When all nodes of a feed-forward neural network use the sigmoid function, one hidden layer is sufficient for arbitrary classification. Figure 1 shows the calculation process of the feed-forward artificial neural network. Figure 1(a) shows the first stage, which includes choosing the network model and learning rules, presenting the input data and the target output data, and training the network to obtain its node weights and node thresholds. The weights and thresholds are adjusted repeatedly by comparing the network output with the target output until the error falls within the allowable range.
The second stage is shown in Figure 1(b). The output is generated by feeding the test data into the network model with the weights and thresholds determined in the first stage.
We use the DWT feature vectors of the FT-IR spectra of the TCM samples as the data for training the network. The BP algorithm, which is mature for multilayer feed-forward networks in both theory and application, is used.
After data preprocessing, the two congeners are mapped to the two nodes of the output layer, and the ten wavelet-domain FT-IR feature values are normalized to the range 0 to 1.
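The preprocessing step can be sketched as follows. The text does not state which normalization was used, so this sketch assumes per-feature min-max scaling; the function name `min_max_normalize`, the class labels in `TARGETS`, and the toy values are hypothetical.

```python
def min_max_normalize(samples):
    """Scale each feature (column) of `samples` to the range [0, 1].

    `samples` is a list of equal-length feature vectors.
    Assumption: per-feature min-max scaling; a constant feature maps to 0.0.
    """
    n_features = len(samples[0])
    lows = [min(row[j] for row in samples) for j in range(n_features)]
    highs = [max(row[j] for row in samples) for j in range(n_features)]
    normalized = []
    for row in samples:
        normalized.append([
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ])
    return normalized


# Hypothetical mapping of the two classes onto the two output nodes.
TARGETS = {
    "semen_armeniacae_amarum": [1.0, 0.0],
    "semen_persicae": [0.0, 1.0],
}

# Toy data with two features per sample (the study uses ten).
raw = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
scaled = min_max_normalize(raw)
```

In practice the minima and maxima would be taken from the training set and reused unchanged for the test set, so that both sets share the same scale.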
The classification algorithm has two steps. The first step is training the BP network, and the second step is classifying the different categories with the trained BP network (see Figure 2).
Step 1 is network training. There are three layers: the input layer, the hidden layer, and the output layer. The input layer receives the ten normalized feature values from the five regions at each of the two scales. The output layer is used for classification, with one node for each kind of semen, two in this case. The hidden layer lies between the input layer and the output layer, and its number of nodes has to be decided carefully. Too few hidden nodes make the network more likely to be trapped in local minima and give poor fault tolerance, whereas too many hidden nodes lengthen the training time without necessarily giving the best classification result. It is therefore necessary to test repeatedly and choose the best number of hidden nodes.
Step 2 is classification. The network model obtained in step 1 is used to test the ten normalized feature values extracted from the five regions at each scale.
The sigmoid function is used as the activation function. To minimize the least-squares error for each input sample $p$, the thresholds and weights are learned and adjusted. The least-squares error function can be written as

$$E_p = \frac{1}{2} \sum_j \left(t_{pj} - o_{pj}\right)^2, \tag{3}$$

where $t_{pj}$ is the target output value of sample $p$ at node $j$ of the output layer, that is, the type of the plant, and $o_{pj}$ is the actual output value of sample $p$ at node $j$ of the output layer. The actual output is calculated forward from the input layer to the output layer, while the error and the weight adjustments are propagated backward from the output layer to the input layer.

Figure 2: BP algorithm flow chart.
(1) The output value $o_{pj}$ of node $j$ (which equals the input value when node $j$ is a node of the input layer) is calculated as

$$o_{pj} = \frac{1}{1 + \exp\!\left[-\left(\sum_i w_{ji}\, o_{pi} + \theta_j\right)\right]}. \tag{4}$$
Here $w_{ji}$ is the weight connecting nodes $i$ and $j$, and $\theta_j$ is the threshold of node $j$. A threshold can be regarded as a weight that connects a node with constant output 1 to node $j$, so it is adjusted in the same way as $w_{ji}$. The weight $w_{ji}$ is adjusted as follows.
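(2) When $j$ is a node of the output layer, the weight $w_{ji}$ connecting node $i$ of the hidden layer to node $j$ of the output layer is amended by

$$\Delta w_{ji}(n+1) = \eta\, \delta_{pj}\, o_{pi} + \alpha\, \Delta w_{ji}(n), \tag{5}$$

where $\eta$ is the learning rate, $\alpha$ is the momentum coefficient, $n$ indexes the training step, and $\delta_{pj}$ is the error signal of node $j$ of the output layer, calculated as

$$\delta_{pj} = \left(t_{pj} - o_{pj}\right) o_{pj} \left(1 - o_{pj}\right). \tag{6}$$

(3) When $j$ is not a node of the output layer, the same weight amendment (5) is applied to the weight connecting node $i$ of the preceding layer to node $j$ of the hidden layer, but the calculation of $\delta_{pj}$ becomes

$$\delta_{pj} = o_{pj} \left(1 - o_{pj}\right) \sum_k \delta_{pk}\, w_{kj}, \tag{7}$$

where $\delta_{pk}$ is the error signal of node $k$, which receives the output of node $j$ as input, and $w_{kj}$ is the weight connecting nodes $j$ and $k$.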
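The forward pass and the weight amendments with momentum can be put together in a minimal sketch of a three-layer sigmoid network. This is an illustration under stated assumptions, not the implementation used in this study: the class name `BPNetwork`, the random initial weight range, the toy two-feature data, the three hidden nodes, and the demo values of the learning rate and epoch count are all hypothetical and chosen only so that the small example trains quickly.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    """Three-layer feed-forward network trained by back-propagation with momentum."""

    def __init__(self, n_in, n_hidden, n_out, eta, alpha, seed=0):
        rng = random.Random(seed)
        self.eta, self.alpha = eta, alpha
        # Each weight row has one extra entry: the threshold, treated as a
        # weight on a constant input of 1.
        self.w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                      for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]
        # Previous weight changes, needed for the momentum term.
        self.dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_out = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    def forward(self, x):
        xin = list(x) + [1.0]
        hid = [sigmoid(sum(w * v for w, v in zip(row, xin))) for row in self.w_hid]
        hin = hid + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hin))) for row in self.w_out]
        return xin, hin, out

    def train_sample(self, x, target):
        """One online update; returns the squared error before the update."""
        xin, hin, out = self.forward(x)
        # Output-layer error signal: (t - o) * o * (1 - o).
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
        # Hidden-layer error signal: o * (1 - o) * sum_k delta_k * w_kj,
        # computed with the output weights before they are changed.
        d_hid = []
        for j in range(len(self.w_hid)):
            back = sum(d_out[k] * self.w_out[k][j] for k in range(len(self.w_out)))
            d_hid.append(hin[j] * (1.0 - hin[j]) * back)
        self._update(self.w_out, self.dw_out, d_out, hin)
        self._update(self.w_hid, self.dw_hid, d_hid, xin)
        return 0.5 * sum((t - o) ** 2 for t, o in zip(target, out))

    def _update(self, weights, prev, deltas, inputs):
        # New change = eta * delta * input + alpha * previous change.
        for j, d in enumerate(deltas):
            for i, v in enumerate(inputs):
                step = self.eta * d * v + self.alpha * prev[j][i]
                weights[j][i] += step
                prev[j][i] = step


def classify(net, x):
    """Index of the output node with the largest activation."""
    out = net.forward(x)[2]
    return out.index(max(out))


# Toy, linearly separable data (not measured spectra): two classes, two features.
samples = [([0.1, 0.2], [1.0, 0.0]), ([0.2, 0.1], [1.0, 0.0]),
           ([0.8, 0.9], [0.0, 1.0]), ([0.9, 0.8], [0.0, 1.0])]
net = BPNetwork(n_in=2, n_hidden=3, n_out=2, eta=0.2, alpha=0.8)
epoch_errors = []
for _ in range(2000):
    epoch_errors.append(sum(net.train_sample(x, t) for x, t in samples))
```

Training would normally stop once the summed error drops below a chosen tolerance rather than after a fixed number of epochs; the fixed loop here only keeps the sketch short.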

Materials
Semen armeniacae amarum is the kernel of Prunus armeniaca L. (family Rosaceae

Spectral Measurements
The HATR-FT-IR spectra were collected using a Thermo-Electron (Madison, WI, USA) Nexus 670 FT-IR spectrometer with a room-temperature deuterated triglycine sulfate (DTGS) detector and a single-bounce HATR (Ge) accessory. The spectral range was 4000-650 cm⁻¹, the resolution was 2 cm⁻¹, and 64 scans were accumulated per spectrum. For each measurement, 8.0 mg of pretreated sample was placed directly on the center of the Ge crystal of the HATR accessory, covering an area of about 3.14 mm². To ensure good contact with the Ge crystal surface, all powder samples were pressed using a pressure tower, which applied the same mechanical pressure to all samples. All spectra were automatically baseline corrected. No other sample preparation was required. Each sample was measured three times, and the averaged spectrum was used for further analysis.

Data Analysis
The HATR-FT-IR spectra of all the samples are obtained by measurement. Based on the absorbance values at the characteristic absorption peaks, principal component analysis is applied to data taken from different wavebands. Matlab software is then used to analyze the data further by wavelet transformation. Using the Morlet wavelet, which has a good capability for detecting signal singularities, as the analysis wavelet, a one-dimensional DWT is applied to the FT-IR spectra of the samples at different scales. The differences between the HATR-FT-IR spectra of the samples at the various scales are then compared. Representative scales are chosen to extract the features of the samples, and BP-ANN is used to identify them. In the experiment, the HATR-FT-IR spectra of the samples are decomposed into 5 levels by one-dimensional DWT, and two scales (3 and 4) are chosen for extracting the feature vector.
Figure 3 shows typical HATR-FT-IR spectra of semen armeniacae amarum and semen persicae. The absorption peaks of the two are similar because they are kernels of sibling plants. They contain similar chemical components, such as the hydroxyl groups of cellulose (seed coat), starch, and the plant hormone β-sitosterol, so their FT-IR absorptions are quite similar. Both samples generate a large number of sharp peaks in the FT-IR region (4000-650 cm⁻¹), which indicates that the seeds have a rich chemical composition. Several absorption regions were identified, and the band assignments are labeled in Figure 3. Absorption bands located around 3400 cm⁻¹ correspond to O-H and N-H stretching vibrations that mainly arise from proteins and carbohydrates. The bands around 3010 cm⁻¹ represent unsaturated C-H stretching vibrations that are mainly caused by unsaturated compounds and unsaturated fatty acid esters. The bands around 2923 and 2854 cm⁻¹ represent C-H stretching vibrations that are mainly caused by lipids and carbohydrates. Absorption arising from C-H bending modes is located between about 1200 cm⁻¹ and 1500 cm⁻¹, but it overlaps with other absorption bands within this region. Three absorption bands located around 1656 cm⁻¹ (mainly C=O stretching), 1463 cm⁻¹ (N-H bending), and 1250 cm⁻¹ (C-N stretching) are largely due to the amide I, II, and III modes, respectively, of the proteins and lipids. Absorption bands around 1745 cm⁻¹ correspond to the isolated carbonyl group (COOR), indicating ester-containing compounds commonly found in membrane lipids and cell wall pectin. Bands around 1060 cm⁻¹, 1100 cm⁻¹, and 1160 cm⁻¹ in the "fingerprint" region indicate several modes, such as C-H bending vibrations or C-O, C-C, or P-O stretching vibrations.

FT-IR Analysis
As semen armeniacae amarum and semen persicae come from sibling species, they have similar chemical compositions. The FT-IR spectra of the two kinds of plant kernel show very close absorbance and are difficult to distinguish by visual inspection, so other methods are used for further classification.

Principal Component Analysis (PCA)
Although PCA itself cannot be used as a classification tool, it can reveal trends in the data when they are visualized in a low-dimensional space. In this paper, 6 samples of each species are randomly selected for PCA. We select 20 absorption peaks in the range of 2000-650 cm⁻¹ and apply PCA to them, but the results are not satisfactory. The three-dimensional PCA plot of the MIR spectra of the two species is shown in Figure 4. The plot does not clearly reflect the real relationship among the 12 samples and does not agree with our expectations. To achieve the desired results, the one-dimensional DWT is introduced into our study.

Feature Extraction of FT-IR in DWT Domain
We use the DWT to detect the singularities of the spectral curve, so the wavelet basis function should be chosen properly: it should have a shape similar to the signal to be analyzed, short compact support, and a large number of vanishing moments. Representative wavelet basis functions include Coiflet, Symlets, Daubechies, Morlet, Mexican hat, and Meyer. In this paper, we choose the Daubechies wavelet as the analyzing wavelet. A five-scale wavelet decomposition is performed on the MIR spectral data. Figure 5 shows the DWT coefficients at the five scales. The approximation holds the low-frequency components and the detail holds the high-frequency components. Even the 5th-level approximation looks very similar to the original MIR spectrum, but it is smoother because noise has been removed. We choose two representative levels (scales 3 and 4) for feature extraction. The characteristic variable is defined as the energy of the spectrum at scales 3 and 4 in the DWT domain.
According to Figure 5, the differences between the DWT coefficients of semen armeniacae amarum and semen persicae are obvious in five regions. To extract representative characteristics effectively within the two scales of the DWT, the spectrum at each scale is divided into five representative regions. Figure 6 shows the division of the feature regions. The spectral energies in the resulting ten feature regions of the two scales in the DWT domain form the feature vector.
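The construction of the ten-element feature vector can be sketched as follows. The sketch assumes that the energy of a region is the sum of the squared DWT coefficients inside it, which is a common definition that the text does not spell out. The function names, the region boundaries, and the toy coefficients are hypothetical; the actual boundaries are those drawn in Figure 6.

```python
def region_energy(coeffs, start, stop):
    """Energy of the coefficients with index in [start, stop).

    Assumption: energy is the sum of squared coefficients.
    """
    return sum(c * c for c in coeffs[start:stop])


def feature_vector(coeffs_by_scale, regions_by_scale):
    """Concatenate the region energies of every selected scale."""
    features = []
    for coeffs, regions in zip(coeffs_by_scale, regions_by_scale):
        for start, stop in regions:
            features.append(region_energy(coeffs, start, stop))
    return features


# Toy coefficients standing in for the scale-3 and scale-4 DWT outputs.
scale3 = [1.0, -1.0, 2.0, 0.0, 0.5, 0.5, 3.0, -2.0, 1.0, 1.0]
scale4 = [2.0, -1.0, 0.0, 1.5, 3.0]
# Hypothetical region boundaries: five index ranges per scale.
regions3 = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
regions4 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

features = feature_vector([scale3, scale4], [regions3, regions4])
```

Five regions at each of two scales give ten values per spectrum, which matches the ten input nodes of the network.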

Identified Network and Application of the Results
After testing, we define the structure of the BP network as ten nodes in the input layer, 12 nodes in the hidden layer, and two nodes in the output layer; the target error is 0.05, α is 0.8, and η is 0.02. For training, the ten normalized feature values are fed to the ten input-layer nodes. The output-layer nodes correspond to category (1), semen armeniacae amarum, and category (2), semen persicae. The trained network is then used to verify the 200 different sample data. The input data are the feature vectors extracted from the wavelet transformation of the original FT-IR spectra. The results are shown in Table 1.
Based on the results in Table 1, the two types of kernel (semen armeniacae amarum and semen persicae) are, for the most part, correctly identified. The feature vectors extracted from the wavelet transformation of the MIR spectra differ significantly between the two kernels, so a high classification accuracy can be achieved.

Conclusion
(1) Using the kernel as the material for MIR analysis is an effective way to classify plants, because the kernel, as a reproductive organ, has more stable characteristics than the vegetative organs.
(2) The HATR technique measures samples directly, does not damage them, and has better repeatability than traditional methods such as solvent extraction and direct compression. HATR is therefore well suited to the measurement of plant kernel samples.
(3) Discrete wavelet transformation, as an important chemometric technique, preserves the information in the spectra while reducing the number of variables and simplifying data processing, and it is a great help in distinguishing similar samples.
By combining BP-ANN with the DWT features of the FT-IR spectra of the samples, the proposed method achieves a high recognition rate for the MIR data of semen armeniacae amarum and its easily confused variety, semen persicae.