An Active Feature Selection Strategy for DWT in Artificial Taste

A discrete wavelet transform (DWT) extracts meaningful information in the time-frequency domain and is a favorable feature extraction approach for pulse-like responses in large pulse voltammetry (LAPV) electronic tongues (e-tongues). A regular DWT generates many coefficients to describe signal details and approximations at different scales. Thus, coefficient selection is necessary to reduce the feature size. However, common DWT-based feature selection follows a passive mode: manipulation through human experience or exhaustive trials. It is subjective, time consuming, and barely works in nonlaboratory conditions. In this paper, we present an active feature selection strategy consisting of a dispersion ratio computation and an optimal search. To evaluate the performance of the proposed method, we prepared several beverage samples and performed experiments with a LAPV e-tongue. The features of the raw response, the peak-inflection point method, a reference DWT method, and our proposed method are compared to indicate the effect of the refined features. Furthermore, we utilized several classifiers, namely the k-nearest neighbor (k-NN), support vector machine (SVM), and random forest (RF), to evaluate the improvement in recognition brought by the refined features. Compared with other regular feature extraction methods, the proposed method can automatically explore high-quality features with an acceptable feature size. Moreover, the proposed method achieved the highest average accuracy for most classifiers. It offers an alternative feature extraction approach for a LAPV e-tongue that requires no manual manipulation in real applications.


Introduction
An artificial taste system named electronic tongue (e-tongue) has become a potential approach for liquid-phase evaluations [1,2]. A sensor array and a proper pattern recognition algorithm are the two main parts of an e-tongue. The sensor array imitates a human's taste cells to sense substances, while the pattern recognition algorithm functions as the human brain to handle judgments. Compared with traditional chemical devices, e-tongues have evident advantages including lower cost, lower latency, and simpler operation. Wide applications, such as honey identification [3], rice discrimination [4], and beverage classification [5,6], have attracted attention in recent years. In beverage classification, scholars mainly focus on substances with specific aromatic flavors such as tea and liquor [7,8], since e-tongue identifications are more objective and reproducible than human judgments [9].
The pattern recognition part, composed of classification and feature extraction, is an important section of an e-tongue system. Various classifiers have attracted attention in e-tongue studies. Principal component analysis (PCA) [10,11], support vector machine (SVM) [12,13], the autoregressive modeling technique [14], linear discriminant analysis [6,15], and back-propagation neural network [13,16] have been used to discriminate different analytes. On the other hand, feature extraction derives meaningful features from sensor responses for classifiers. There are regular feature extraction methods for large pulse voltammetry (LAPV) [17-19], a common e-tongue type with pulse-like responses from sensors. Features of e-tongues fall into two categories: geometric and time-frequency approaches. For the geometric method, PCA can obtain valuable features according to the correlation of the sensor responses [20]. Furthermore, some scholars used peak and inflection values of raw response curves as features to express signal characteristics [10,21]. Sometimes, even the original sensor response was used directly as geometric features for classifiers [22]. Meanwhile, time-frequency analysis is another popular means for e-tongue feature extraction. The discrete wavelet transform (DWT) is a promising technology due to its capacity for multiscale analysis [23,24]. Yin et al. introduced the power ratio to DWT for feature refining and optimization [25]. Considering that the e-tongue system utilized in our experiments follows the LAPV mode and generates a series of multifrequency pulse-like responses, we pay more attention to feature extraction by the DWT approach. Normally, DWT produces a great many coefficients (features) as the decomposition level increases; thus, a feature selection process is needed to remove the redundant and meaningless DWT coefficients. However, conventional DWT feature selection depends on human manipulation. The favorable features are obtained according to either human experience or exhaustive trials. Subjective decisions and massive computation would hinder e-tongue systems from handling real industrial applications. Accordingly, we need to explore an objective and simple strategy to guide DWT feature selection under unmanned conditions.
In this study, we propose a DWT feature extraction method with an active feature selection strategy (ASF-DWT) to obtain suitable features from DWT coefficients. Both a dispersion ratio computation and an optimal searching stage are introduced in ASF-DWT. Meanwhile, we performed beverage experiments with a LAPV e-tongue to classify red wine, white spirit, beer, black tea, oolong tea, maofeng tea, and pu'er tea samples. Recognition models such as the k-nearest neighbor (k-NN) [26], support vector machine (SVM) [27], and random forest (RF) [28] were adopted in the evaluations to explore the performance of the feature extraction methodologies. The experimental results show that ASF-DWT is suited to unmanned working conditions and shows similar performance to the best of the reference methods. This paper is composed of five sections: Section 1 introduces the literature and background of this study. Section 2 describes the hardware of our LAPV e-tongue. The recognition and feature extraction methods used in this study are presented in Section 3. Section 4 shows the experimental results and analysis. Finally, some conclusions are summarized in the last section.

e-Tongue System
As shown in Figure 1, we designed a LAPV e-tongue composed of an electrolytic cell, an electrode sensor array, a LAPV transformation board, a control unit, and a higher-level computer. Eight metal electrodes fixed on the top of the electrolytic cell are selected to form a three-electrode system. The details of the electrode sensor array are described in Table 1. The LAPV transformation board has two functions: one is generating excitation signals to the auxiliary and reference electrode sensors synchronously, and the other is transforming the LAPV responses of the working electrode sensors from microcurrents to proper voltages. The control unit receives the transformed LAPV responses and handles response sampling, data storage, data transfer, and excitation signal control via an onboard 32-bit microcontroller. The sampling circuit consists of a 6-channel analog-to-digital converter with 16-bit resolution for the corresponding working electrode sensors. We save the digitized data on a TF card for further processing on a higher-level computer. In particular, the original excitation signal is generated by a digital-to-analog converter integrated in the microcontroller; as a result, the frequency, amplitude, and time interval of the excitation signals are programmable.
When the LAPV e-tongue is working, we use a ribbon cable to transfer transformed LAPV responses and excitation signals between the LAPV transformation board and the control unit. Shielded cables and bayonet-nut connectors link the electrode sensor array and the LAPV transformation board to reduce signal interference. Algorithm implementation and evaluation are ordinarily performed on a higher-level computer (desktop or laptop). We also designed external memories on the control unit for the microcontroller to execute lightweight algorithms if necessary.

Materials and Methods
3.1. Principal Component Analysis. PCA is widely used in dimensionality reduction and data visualization [29]. It performs an orthogonal transformation that converts a group of correlated signals into uncorrelated linear components.
In practice, eigenvalue decomposition is utilized to compute the principal components (PCs) and their loadings. Here, we assume that the PCs denote the uncorrelated linear components while the loadings denote the amplitudes of the PCs. Thus, it is possible to choose a small number of PCs to approximately express high-dimensional data in a low-dimensional space. The PCs with the largest two or three loadings are customarily selected to visualize the original data in 2D or 3D space, respectively.
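As a minimal sketch of the eigenvalue-decomposition route described above, the following Python code projects a feature matrix onto its leading PCs and reports their contribution rates. The function name `pca_project`, the use of NumPy, and the random toy matrix are assumptions made for illustration; they are not the implementation used in this study.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center every feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components]  # coordinates in PC space
    contribution = eigvals / eigvals.sum()   # share of variance per PC
    return scores, contribution[:n_components]

# Toy data only: 63 samples with 12 features each (random, not measured).
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 12))
scores, contribution = pca_project(X, n_components=2)
```

For very high-dimensional inputs with few samples, forming the full covariance matrix is wasteful; a singular value decomposition of the centered data would be the usual alternative.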

Peak-Inflection Point Method.
Considering the geometric shapes of signals, the LAPV e-tongue responses regularly consist of a series of peaks and troughs. Therefore, it is reasonable to take some representative points from electrode responses as characteristics. A typical choice is to pick four points in each pulse cycle as response features: one peak point, one valley point, and two inflection points [19]. We name this feature extraction approach the peak-inflection point method (PIPM). This geometric approach to feature extraction is very convenient, and the selected points intuitively contain distinct characteristics of e-tongue responses.

Journal of Sensors
Discrete Wavelet Transform.
DWT handles time-frequency analysis for discrete signals. It is widely used in multiscale analysis of digital data. This method divides the original signal into approximation and detail parts through low-frequency and high-frequency filters, respectively. The resolution of the approximation and detail parts depends on the DWT decomposition level, while their lengths are half those of the parent coefficients. As shown in Figure 2, the DWT is implemented using Mallat's pyramidal algorithm [30]. CA1 and CD1 denote the approximation and detail DWT coefficients of the first decomposition level. In the second decomposition level, CAA2 and CAD2 are the approximation and detail DWT coefficients from CA1, while CDA2 and CDD2 are the ones from CD1.
Considering that the pulse-like responses of the LAPV mode have rich frequency components, the DWT method fits such a case owing to its ability to analyze signals at varied time scales.
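The decomposition tree of Figure 2 can be illustrated with a short pure-Python sketch based on the Haar wavelet. The function names, the `(level, index)` keys, the ordering of nodes within a level, and the padding of odd-length inputs are assumptions chosen for this illustration only.

```python
import math

def haar_step(signal):
    """One Haar analysis step: return (approximation, detail), each half as long."""
    signal = list(signal)
    if len(signal) % 2:
        signal.append(signal[-1])  # assumption: pad odd lengths by repeating the last sample
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[k] + signal[k + 1]) * s for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) * s for k in range(0, len(signal), 2)]
    return approx, detail

def decompose(signal, max_level):
    """Return {(level, index): coefficients}; index starts at 1 within each level."""
    tree = {(0, 1): list(signal)}
    for level in range(1, max_level + 1):
        for parent_index in range(1, 2 ** (level - 1) + 1):
            approx, detail = haar_step(tree[(level - 1, parent_index)])
            tree[(level, 2 * parent_index - 1)] = approx  # e.g. CA1, CAA2, CDA2
            tree[(level, 2 * parent_index)] = detail      # e.g. CD1, CAD2, CDD2
    return tree

# Toy signal with 8 samples: level 1 holds 4 coefficients per node, level 2 holds 2.
tree = decompose([4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0], max_level=2)
```

With this indexing, the parent of node `(i, j)` is `(i - 1, (j + 1) // 2)`, which is convenient when children are later compared with their parents.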

Relative Power Ratio of Discrete Wavelet Transform.
A relative power ratio (RPR) can be calculated from each DWT coefficient according to a previous study [25]; we call this feature extraction strategy RPR-DWT. Assume that $C_{ij}$ represents the jth DWT coefficient in the ith decomposition level, and let $e_{ij}$ and $r_{ij}$ denote the power and the RPR of $C_{ij}$, respectively, computed as in [25]. In feature extraction, the maximum RPR of each decomposition level is calculated as a candidate feature, and the best candidate is then selected as the feature. We should note that, although RPR-DWT refines the DWT coefficients, the coefficient selection process still relies on a passive mode: human designation based on either experience or exhaustive calculation.
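The following sketch shows one plausible way to compute a power and a relative power ratio for the coefficient sets of a single decomposition level. It rests on two assumptions that are not taken from [25]: the power is the sum of squared coefficient values, and the RPR is that power divided by the total power of all coefficient sets at the same level. The function names are hypothetical.

```python
def power(coeffs):
    """Assumed power of one coefficient set: the sum of squared values."""
    return sum(c * c for c in coeffs)

def relative_power_ratios(level_coeffs):
    """Assumed RPR: power of each set divided by the total power of its level.

    level_coeffs is a list of coefficient sets belonging to one decomposition level.
    """
    powers = [power(c) for c in level_coeffs]
    total = sum(powers)
    if total == 0.0:
        return [0.0 for _ in powers]  # avoid dividing by zero for an all-zero level
    return [p / total for p in powers]

# Toy level with two coefficient sets (values are illustrative only).
ca1 = [3.0, 4.0]   # power 25
cd1 = [1.0, 2.0]   # power 5
rprs = relative_power_ratios([ca1, cd1])
```

Under these assumptions the ratios of one level sum to 1, so a single dominant coefficient set shows up as a value close to 1.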

Active Feature Selection Strategy for Discrete Wavelet Transform.
When DWT is executed as a feature extraction method, many wavelet coefficients are generated as the decomposition level increases. Therefore, choosing the necessary features from DWT coefficients becomes crucial for the effectiveness and conciseness of the subsequent classification. The current strategy to solve this issue depends on manual selection, which is detrimental to the practicability of DWT. In this part, we introduce an active feature selection (ASF) strategy for DWT coefficients. The ASF process consists of two phases.
In the first phase, a dispersion ratio is computed for each DWT coefficient. Let $r_{ij}^{p}$ and $r_{ij}^{q}$ denote the RPRs of the pth and qth samples in the corresponding categories, respectively. The dispersion ratio $S_{ij}$ of $C_{ij}$ is defined from these RPRs, where a smaller $S_{ij}$ means that it is easier to separate the sample space with the DWT coefficient $C_{ij}$, and vice versa.
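To make the idea of a dispersion ratio concrete, the sketch below uses a stand-in definition: the mean absolute RPR difference between samples of the same class divided by the mean absolute RPR difference between samples of different classes. This particular ratio is an assumption chosen only because it agrees with the statement that smaller values indicate easier separation; it should not be read as the definition used in this study. The function name and the toy values are hypothetical.

```python
from itertools import combinations

def dispersion_ratio(rpr_by_class):
    """Assumed dispersion ratio: mean intraclass distance / mean interclass distance.

    rpr_by_class maps a class label to the RPR values of one DWT coefficient.
    """
    intra = [abs(p - q)
             for values in rpr_by_class.values()
             for p, q in combinations(values, 2)]
    inter = [abs(p - q)
             for (_, a), (_, b) in combinations(rpr_by_class.items(), 2)
             for p in a for q in b]
    if not intra or not inter:
        raise ValueError("need at least two classes with two samples each")
    mean_inter = sum(inter) / len(inter)
    if mean_inter == 0.0:
        return float("inf")  # classes coincide, so nothing can be separated
    return (sum(intra) / len(intra)) / mean_inter

# Toy RPR values: compact, distant classes versus heavily overlapping classes.
separated = {"beer": [0.10, 0.11, 0.12], "red wine": [0.80, 0.81, 0.82]}
mixed = {"beer": [0.10, 0.80, 0.45], "red wine": [0.12, 0.81, 0.50]}
s_separated = dispersion_ratio(separated)
s_mixed = dispersion_ratio(mixed)
```

In this toy case the well-separated classes yield a much smaller ratio than the overlapping ones, matching the qualitative behavior described in the text.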
3.5.2. Phase Two: Optimal Feature Searching. In this stage, we aim to find the optimal DWT coefficients for classification with an automatic search method, summarized in Algorithm 1, which ensures that the outcomes of the whole searching process are the most useful features for the subsequent classification. Through the above phases, the original DWT features are refined and selected according to their RPR values. Concise features are finally obtained in an active manner with no human designation.
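The search of Algorithm 1 can be sketched in Python as follows. The sketch assumes that the comparison is made on dispersion ratios (as in the discussion of the beverage data set), that node `(i, j)` has the parent `(i - 1, (j + 1) // 2)`, and that comparisons start at the 2nd level because the 1st-level nodes have no parent RPR. The function name and the dictionary layout are hypothetical.

```python
def asf_select(dispersion, max_level):
    """Select (level, index) nodes by comparing children with their parents.

    dispersion maps (level, index) to a dispersion ratio; indices start at 1.
    """
    selected = {(1, 1), (1, 2)}               # step (2): start from r_11 and r_12
    for level in range(2, max_level + 1):     # step (3): walk down the levels
        added = False
        for index in range(1, 2 ** level + 1):
            node = (level, index)
            parent = (level - 1, (index + 1) // 2)
            if node not in dispersion or parent not in dispersion:
                continue                      # ratios not available, skip the node
            if dispersion[node] < dispersion[parent]:
                selected.add(node)            # child separates the classes better
                selected.discard(parent)      # so it replaces its parent
                added = True
        if not added:
            break                             # no improvement at this level: stop
    return sorted(selected)                   # step (4): output the selection

# Toy ratios where no 2nd-level node improves on its parent.
flat = {(1, 1): 0.5, (1, 2): 0.8, (2, 1): 0.6, (2, 2): 0.7, (2, 3): 0.9, (2, 4): 0.85}
# Toy ratios where node (2, 3) improves on its parent (1, 2).
better = dict(flat)
better[(2, 3)] = 0.4
result_flat = asf_select(flat, max_level=3)
result_better = asf_select(better, max_level=3)
```

The first toy case stops at the 2nd level and keeps both 1st-level nodes; the second replaces node (1, 2) with its better child.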

Results and Discussion
4.1. Sample Preparation. We chose seven kinds of drink as test objects: red wine, white spirit, beer, oolong tea, black tea, maofeng tea, and pu'er tea. For each kind of tea, we measured 2 g of dried tea leaves with an electronic microbalance and then soaked the leaves in 200 ml of boiling water for 5 minutes. After that, we filtered out the liquid as the original solution. For liquor objects, such as red wine, white spirit, and beer, we regarded the liquors themselves as the original solutions. In the experiments, we formulated three different concentrations for each drink using the original solution and distilled water. Low, medium, and high concentrations were made up by adjusting the ratio of the original solution to 14%, 25%, and 100%, respectively. The experiments for each concentration were performed three times to increase the reliability of the sampled data. Thus, we collected a total of 63 (7 kinds × 3 concentrations × 3 times) samples in this study.

Electronic Tongue Settings.
As aforementioned, the electrode sensor array of the designed e-tongue consists of six working electrode sensors, one reference sensor, and one auxiliary electrode sensor. We adopted the multifrequency LAPV (MLAPV) method [19] to generate a multifrequency excitation signal on the reference electrode sensor, while the auxiliary and working electrode sensors form six electron loops via the test solutions during the working process. The excitation signal of a MLAPV includes several frequency segments in one time cycle to stimulate different transient pulse-like responses. Consequently, the fingerprints of substances can be obtained in the form of a series of pulses. As shown in Figure 3, we set three segments at frequencies of around 0.2 Hz, 1 Hz, and 2 Hz. To avoid interference between adjacent segments, we added blank areas between them. In each segment, five pulses are arranged at the voltages of 3.3 V, 3.1 V, 2.9 V, 2.7 V, and 2.5 V relative to a reference voltage of 2.3 V (the DC voltage on the reference electrode sensor). In other words, the actual pulse amplitudes are 1 V, 0.8 V, 0.6 V, 0.4 V, and 0.2 V. Thus, a total of 15 pulses are generated on the reference electrode in one signal time cycle. It should be noted that the duration of each pulse remains constant so that the excitation time is the same for all the working electrode sensors.
Figure 3(b) shows a typical reaction of one working electrode sensor in the experiment. We set the sampling rate at 200 Hz for the sensor array to reduce the distortion according to the highest frequency of the excitation signal. Thus, we sampled a total of 12,300 points for the 6 working electrode sensors in one time cycle of 61.5 s, and the response size of one sensor was 12,300/6 = 2050 in one test.
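The sample counts above follow directly from the stated rate and cycle length. The per-sensor rate in the last line is an inference from these numbers, assuming the 200 Hz rate is shared equally among the six working sensors; it is not stated explicitly in the text.

\[
N_{\mathrm{total}} = 200\ \mathrm{Hz} \times 61.5\ \mathrm{s} = 12{,}300, \qquad
N_{\mathrm{sensor}} = \frac{12{,}300}{6} = 2050, \qquad
f_{\mathrm{sensor}} = \frac{2050}{61.5\ \mathrm{s}} \approx 33.3\ \mathrm{Hz}
\]

Under that assumption, the per-sensor rate is still well above twice the highest segment frequency of about 2 Hz.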

Feature Visualization and Distribution.
In this part, we demonstrate the features extracted by different methods, including no feature extraction (NFE), PIPM, RPR-DWT, and the proposed ASF-DWT. Here we adopted NFE to indicate the data distributions before the feature extraction process. For the two DWT-based feature extraction methods, we chose the Daubechies 1 (Haar) wavelet as the wavelet basis due to its simplicity and wide application in DWT analysis. Considering the high dimensionalities of the raw responses and extracted features, we apply PCA transformation to the results of the feature extraction methods mentioned above to visualize their distributions [31,32]. We used red markers to denote the beverages belonging to each liquor type: beer (block), spirit (asterisk), and red wine (pentagram), while the sample distributions of black tea, maofeng tea, pu'er tea, and oolong tea are shown in black by a triangle, asterisk, block, and circle, respectively. Figures 4-7 show the feature distributions of NFE, PIPM, RPR-DWT, and ASF-DWT, respectively, in PCA spaces.
As shown in Figure 8, we arranged all the responses of the 6 working sensors one by one to form a raw sample for direct PCA transformation. Given the 12,300 sampled points, we organized each raw sample as a 12,300-dimensional vector.
As for PIPM, we took four points from each pulse. With 15 pulses in one time cycle, 15 × 4 = 60 points are extracted from each sensor response, giving a total of 60 × 6 = 360 points for the six sensors; thus, the feature vector of PIPM has 360 dimensions. From Figures 5(a) and 5(b), we found larger contribution rates for the first three PCs (66.79%, 7.69%, and 4.78%) of PIPM compared with those of NFE. This means that PIPM features have improved the data quality by reducing the deviation of homogeneous samples. On the other hand, the distributions of different classes still overlap; even the red wine samples are no longer separated, owing to a distribution similar to that of the oolong tea samples.
In terms of RPR-DWT, considering that the 1st coefficient of each decomposition level empirically has the greatest power among the coefficients belonging to the same decomposition level, we restricted the selection scope to the 1st coefficients from the 1st to the 6th decomposition levels. For each working electrode sensor, we extracted one RPR value from its response. Thus, the feature size of RPR-DWT is 6, corresponding to the six working electrode sensors. From Figures 6(a)-6(f), we find that the contribution rates (more than 85%) of the first two PCs with RPR-DWT are higher than the ones with the former feature extraction methods. We believe the power calculation process in RPR has decreased the interference caused by tiny disturbances in signals and condensed the useful information from raw sample vectors. As Figure 6(a) demonstrates, both spirit and red wine samples are clearly located in separate areas with RPR features of the 1st DWT coefficients, compared with Figures 6(b)-6(f). It seems that the 1st RPR value in decomposition layer 1 is more effective for recognition. In short, RPR-DWT is a capable feature extraction method for LAPV e-tongues that enhances the quality of extracted features with a small feature size of 6. However, suitable DWT coefficients for RPR features are mainly selected by human experience or a lot of trials. The selection therefore lacks objectivity and has high costs.
Figures 7(a) and 7(b) demonstrate 2D and 3D PCA plots of ASF-DWT features, respectively. According to Algorithm 1 in Section 3.5, feature exploration starts from the 1st decomposition layer. For the beverage data set, we found that the dispersion ratios of RPRs in the 2nd decomposition layer are not smaller than the ones of their parents in the 1st decomposition layer. Thus, the RPR and dispersion ratio computation can be stopped in the 2nd decomposition layer. At the same time, the 1st and 2nd RPRs in the 1st decomposition layer were chosen as ASF-DWT features for one sensor. A total of 12 feature values were extracted for one test. Considering that the selected features are based on the coefficients in the 1st decomposition layer, it is reasonable that the sample distributions on ASF-DWT features (shown in Figure 7(a)) are similar to the ones of RPR-DWT with the 1st RPR in the 1st decomposition level (shown in Figure 6(a)). Although the feature size of ASF-DWT is a little larger than that of RPR-DWT, it can work automatically in unmanned situations with limited calculation.
Generally speaking, PIPM, RPR-DWT, and ASF-DWT are effective at improving the poor quality of raw sensor responses.
RF is a powerful ensemble classification method proposed by Breiman in 2001, based on decision trees and the bagging strategy [28]. The excellent robustness and recognition ability of RF [13] have led scholars to improve and apply this classifier. In the following evaluation, the number of decision trees in a random forest was set to 200, and the final category of a sample is determined by the voting results of the 200 decision trees.
The basic idea of SVM is to divide samples in accordance with structural and empirical risk minimization. We chose two kinds of kernel functions for the SVM learner: the linear kernel and the radial basis function (RBF) kernel. The linear kernel divides samples in the original space, while the RBF kernel maps samples into a nonlinear high-dimensional space. We denote SVM with the linear and RBF kernels as SVM1 and SVM2, respectively, in the following sections. In the SVM model, a penalty coefficient C is introduced to adjust the tolerance for classification errors. A larger C value represents a smaller tolerance for classification errors. We scanned the C value from 0.1 to 1 in steps of 0.1 and found that C = 0.5 is a suitable choice overall. Additionally, we set σ = 1 for the RBF kernel and used a one-versus-one strategy to execute multiclass identification.
k-NN is a typical classifier based on sample density in a local area. It determines the class of an unknown sample according to the k labeled samples with the nearest k distances. In the following discussion, we set k = 1 and took the Euclidean distance as the distance metric of k-NN.
Considering a total of 63 selected samples, we use a leave-one-out strategy [33] to perform validation. We adopted this strategy to index the effects of feature extraction methods theoretically. With regard to the label balance of training samples, equal sizes of samples were followed for each class in training and validation.
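As a minimal illustration of leave-one-out validation with a 1-NN classifier and the Euclidean distance, the following pure-Python sketch holds out each sample in turn and classifies it by its nearest remaining neighbor. The function names and the six toy feature vectors are assumptions made for illustration; they are not the measured e-tongue data.

```python
import math

def nearest_neighbor_label(query, train_x, train_y):
    """Return the label of the training sample closest to query (k = 1)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(query, train_x[i]))
    return train_y[best]

def leave_one_out_accuracy(features, labels):
    """Hold out each sample once and classify it with all remaining samples."""
    hits = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_label(features[i], train_x, train_y) == labels[i]:
            hits += 1
    return hits / len(features)

# Toy two-class data with two features per sample (illustrative values only).
features = [[0.10, 0.20], [0.12, 0.19], [0.11, 0.22],
            [0.80, 0.70], [0.82, 0.69], [0.79, 0.73]]
labels = ["beer"] * 3 + ["red wine"] * 3
accuracy = leave_one_out_accuracy(features, labels)
```

Because every sample serves once as the validation sample, the returned accuracy is the fraction of held-out samples whose nearest neighbor shares their label.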

Evaluation Results.
Each feature extraction method is combined with one of the classifiers mentioned in the last part to compute the corresponding recognition rates. Considering that the size of the validation set of a certain beverage is 9, we obtain 9 percentage values for each combination of feature extraction method and classifier.
According to the validation results shown in Table 2, ASF-DWT achieves the highest average recognition rate in most cases compared with the other feature extraction methods. Regarding SVM1, the two DWT-based methods achieved the same average recognition rate and clearly exceeded NFE and PIPM, while RPR-DWT obtained the best rate with SVM2. As to the rates with RF and k-NN, the DWT-based methods performed much better than the others; here, the rate of ASF-DWT is slightly higher than that of RPR-DWT. It may have benefited from the features actively searched by the ASF strategy, which have favorable dispersion ratios. Considering that the effects of the two DWT-based approaches are very close, we performed a t-test on the recognition rates of the same classifier and collected the results in Table 3. We set the significance level of the t-test to 0.05. Result "1" denotes that a significant difference exists between the two cases, so that the comparison follows the average recognition rates in Table 2, while "0" means that the recognition performance of the two methodologies is statistically equal. The results of the t-test are all 0, which implies that the recognition performance of RPR-DWT and ASF-DWT is statistically equal. Concerning the details of DWT coefficient selection, we use x : y to denote the yth DWT coefficient in the xth decomposition level. We selected the 3 : 1 RPR values by traversing the first RPR of the 1st-10th decomposition layers for RPR-DWT, while the 2 : 1, 6 : 1, and 8 : 1 RPR values for ASF-DWT were explored automatically. Thus, the feature sizes are 6 and 18 for RPR-DWT and ASF-DWT, respectively. We consider it worthwhile to accept this small increase in dimensionality in exchange for automatic operation.
With regard to the classifiers, all of them performed poorly on both raw sample vectors and PIPM features, except RF on NFE. It is apparent that excellent results can hardly be achieved on data with a large dispersion within the same class. For the two DWT-based feature extraction methods, we found that the recognition rates become gradually higher in the order of SVM1, RF, SVM2, and k-NN. It is reasonable that SVM1 provides poor results owing to its linear kernel, which divides samples linearly, while RF and SVM2 obtain much higher recognition rates by implementing nonlinear classification. We notice that k-NN reaches the highest rates of 82.54% and 84.13% with RPR-DWT and ASF-DWT features, respectively, among the four feature extraction modes. However, in the traditional view, the k-NN classifier suffers interference from the local sample distribution while SVM can achieve globally optimal results. We believe this contradiction can be explained by the sample distributions shown in Figures 6 and 7. Taking Figure 7 as an example, the black tea (black triangle), maofeng tea (black pentagram), and pu'er tea (black square) samples overlap in the plot. It is difficult to distinguish them entirely with separating hyperplanes, even in nonlinear spaces. Under this restriction, recognition that depends on neighbors (k-NN) seems more effective and feasible.

Conclusions
In this study, we propose ASF-DWT to automatically extract features from raw responses of a LAPV e-tongue and thereby avoid the inconvenience and heavy computation required by manual judgments. Dispersion ratio calculation and optimal feature search are combined to obtain favorable features. Furthermore, we used a LAPV e-tongue to collect beverage samples and identify categories. The experimental results show that ASF-DWT is a very helpful feature extraction tool for LAPV e-tongue responses and that it performs as well as or better than conventional feature extraction methods in most cases with an acceptable feature size.
Future work should concentrate on the optimal rules of ASF-DWT for various applications and its compatibility with various classification methods. At the same time, the bandwidth of the excitation signal can be further increased to obtain more useful information from transient responses.

Figure 2: Structure and relationship of DWT coefficients.
Input: all dispersion ratios of DWT coefficients
Output: selected RPRs of DWT coefficients
Procedure:
(1) Initialize the maximum decomposition level of DWT and set i = 1
(2) Initialize the optimal RPRs and set O = {$r_{11}$, $r_{12}$}
(3) While i ≤ maximum decomposition level do
      For each RPR in the ith decomposition level, compare its dispersion ratio with that of its parent DWT coefficient.
        If the current dispersion ratio < the parent's dispersion ratio
          Save the current RPR to O;
          Delete the parent's RPR from O;
      If no new RPR is added in the ith decomposition level
        Stop and go to Step 4;
      i = i + 1;
(4) Output the elements of O as the selected features
Algorithm 1: ASF for DWT coefficients.

Table 1: Details of an electrode sensor array.

Table 2: Average recognition rates with leave-one-out validation.

Table 3: t-Test results for the average recognition rates of RPR-DWT and ASF-DWT.