Feature Selection Using Maximum Feature Tree Embedded with Mutual Information and Coefficient of Variation for Bird Sound Classification

The classification of bird sounds is important in ecological monitoring. Although extracting features from multiple perspectives helps to describe the target information fully, the enormous dimensionality of the resulting feature set brings on the curse of dimensionality. Thus, feature selection is necessary. This paper proposes a feature scoring method named MICV (Mutual Information and Coefficient of Variation), which uses the coefficient of variation and mutual information to evaluate each feature's contribution to classification. Then, a feature optimization method named ERMFT (Eliminating Redundancy Based on Maximum Feature Tree), which eliminates redundancy based on two neighborhoods, is explored. These two methods are combined as the MICV-ERMFT method to select the optimal features. Experiments compare eight different feature selection methods on two sound datasets, of birds and cranes. Results show that the MICV-ERMFT method outperforms the other feature selection methods in classification accuracy and is less time-consuming.


Introduction
Birds are sensitive to changes in habitats and surroundings, and they are a good indicator of biodiversity and the ecosystem [1]. Because birds generally have a wide range of movement and cannot be observed promptly, bird sounds are one of the important ways to identify them [2].
Bird sounds are a class of environmental sounds. Well-known feature extraction methods in audio signal processing include Mel-Frequency Cepstral Coefficients (MFCC) [3] in the frequency domain and the Short-Time Fourier Transform (STFT) [4] and Wavelet Transform (WT) in the time domain [5]. Furthermore, Tsau et al. [6] suggested a method that extracts features from Code Excited Linear Prediction (CELP) bit streams. Researchers have been extracting features from multiple aspects to retrieve enough information to describe the target. However, the curse of dimensionality occurs as the numbers of features and samples grow: it increases the time cost of analyzing data, harms the models' generalization, and reduces the effectiveness of solving problems [7]. To avoid the curse of dimensionality, selecting a subset of features from the feature pool is necessary. The feature selection process in pattern recognition is composed of feature scoring and feature optimization. Feature scoring, the key to feature selection, finds the most distinguishable features in the classification space. Generally, feature scoring methods can be grouped into four classes: similarity-based, information-theory-based, statistics-based, and sparse-learning-based [8]. So far, researchers have proposed many different feature scoring methods [9]. For example, in unsupervised feature selection, the Nonnegative Laplacian is used to estimate the feature contribution [10]. Constraint Score is applied to feature scoring in environmental sound classification [11]. The ReliefF-based feature selection algorithm is employed to select features in automatic bird species identification [12]. PCA is used as a feature reduction technique to realize automatic recognition of bird sounds [13].
Meanwhile, feature optimization, the second phase of feature selection, selects a subset of features, characterized by low redundancy and high contribution to classification, from the feature sequence ordered by scores. Filter, Wrapper, and Embedded are the three types of methods used to select a subset of features, and many studies have proposed feature optimization algorithms based on them; the Binary Dragonfly Optimization Algorithm, PSO (Particle Swarm Optimization), and Artificial Bee Colony are some examples. Specifically, S-shaped and V-shaped transfer functions can be used to map a continuous search space to a discrete one [14]. Mutual information can be combined with PSO to eliminate redundant features [15]. In some research, the gradient-enhanced Decision Tree [16] is used to evaluate feature contribution, and Artificial Bee Colony is applied to optimize the features [17]. The Pearson correlation coefficient is a common evaluation metric in the literature, which evaluates the correlation between features and is followed by Artificial Ant Colony to select high-quality features [18].
Most feature scoring methods, such as Constraint Score and Laplacian, are based on the correlations and differences among spatial distances between features. Although these algorithms have low time complexity, the diversity of the features is neglected; in particular, the units of the features usually differ. Some algorithms calculate the mutual information between the feature samples and the label from a probabilistic and statistical perspective [15]. However, the label is generally a discrete variable, while features are continuous variables. In recent years, many studies have regarded feature selection as an optimization process and combined it with intelligent search methods [9,19-22]. The multiobjective optimization of a large dataset has high time and space complexity, and a reduction in the feature dimensions usually decreases the classification model's sensitivity and generalization.
Regarding the issues mentioned above, from an information theory perspective, this paper proposes a feature scoring method MICV (Mutual Information and Coefficient of Variation). MICV utilizes the characteristics of mutual information and coefficient of variation and aims to minimize intraclass distance and maximize interclass distance. A feature optimization method, ERMFT (Eliminating Redundancy Based on Maximum Feature Tree), is suggested based on a minimum spanning tree concept. Experiment results show that the MICV-ERMFT method can effectively reduce the data dimension and improve the classification model's performance. Compared with eight feature evaluation methods, the MICV-ERMFT method has significant improvement in the performance on the same dataset in this paper.

Materials and Methods
In bird sound recognition, there exists a variety of methods to extract features and classify the sounds. For example, Human Factor Cepstral Coefficients are used to extract bird sound features, and classification and recognition are performed by the maximum likelihood method [23]. Zottesso et al. [24] suggest a method that extracts bird song features based on the spectrogram and texture descriptors and uses a dissimilarity framework for classification and recognition. In this paper, the classification process of bird sounds is divided into three stages: feature extraction, feature selection, and classification recognition, with feature selection as the research focus. The proposed classification process of bird sounds based on MICV-ERMFT is shown in Figure 1.
Stage 1. Preprocess the bird sound audio data (remove noise and convert the channels), use MFCC and CELP to extract features from the preprocessed data, and construct dataset D_M&C (the dataset formed by merging the MFCC and CELP features).
Stage 2. Apply the MICV method to D_M&C to evaluate the contribution of and score each feature. Sort the feature sequence in ascending order, denoted F, calculate the Pearson correlation coefficient between the features, and build a maximum feature tree T. Then, apply the ERMFT method to eliminate redundant features and construct a new dataset D_M&C′.
Stage 3. Build a classification model on D_M&C′ and analyze the classification results.

Feature Extraction.
Birds make sounds in the same way as humans do [25,26]. The frequency of human language used for daily communication ranges from 180 Hz to 6 kHz, and the most used frequency range for bird calls is from 0.5 to 6 kHz [25,27]. Under this assumption, we process bird sounds in a way similar to processing human language. MFCC (Mel-Frequency Cepstral Coefficient) and CELP (Code Excited Linear Prediction) are applied to the raw bird sound data to extract features in this paper.

MFCC.
MFCC [3] is a human-hearing-based, nonlinear feature extraction method. The process is shown in Figure 2.
Step 1. A single-frame, short-term signal x_w(i, n) is obtained by separating the original audio signal x(n) into frames and applying a window function. Windowing reduces frequency spectrum leakage. This paper selects 20 ms as the frame length and uses the Hamming window.
Step 2. To observe the distribution of x_w(i, n) in the frequency domain, the FFT (fast Fourier transform) is used to transform the signal from the time domain to the frequency domain:

X(i, k) = \sum_{n=0}^{N-1} x_w(i, n) e^{-j 2\pi k n / N}. (1)

Step 3. Calculate the energy of the spectral line per frame:

E(i, k) = |X(i, k)|^2. (2)

Step 4. Calculate the energy of E(i, k) through the Mel filter:

S(i, m) = \sum_{k} E(i, k) H_m(k), (3)

where i is the i-th frame, k is the k-th spectral line in the spectrum, and H_m(k) is the m-th Mel filter.

Step 5. Take the logarithm of the Mel filter energy and calculate the DCT (Discrete Cosine Transform):

MFCC(i, n) = \sqrt{2/M} \sum_{m=1}^{M} \log S(i, m) \cos\left(\frac{\pi n (2m - 1)}{2M}\right), (4)

where m is the m-th Mel filter, M is the number of filters, i is the i-th frame, and n is the index of the spectral line after the DCT. In this paper, MFCC uses 13-dimensional static coefficients (a 1-dimensional log energy coefficient and 12-dimensional DCT coefficients) as extraction parameters [3,28]. The resulting sample has 13 features.
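The five steps above can be sketched in Python with NumPy and SciPy. This is a minimal illustration, not the paper's implementation; the FFT length and number of Mel filters (n_fft, n_mels) are assumptions, while the 20 ms frame, Hamming window, and 13 output dimensions follow the text.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(x, sr=16000, frame_ms=20, n_fft=512, n_mels=26, n_ceps=12):
    """Sketch of Steps 1-5: framing + Hamming window, FFT, spectral
    energy, Mel filter bank, log, and DCT. n_fft/n_mels are assumed."""
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len // 2
    window = np.hamming(frame_len)

    # Step 1: split into overlapping frames and apply the Hamming window.
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])

    # Steps 2-3: FFT per frame, then spectral line energy |X(i,k)|^2.
    spectrum = np.fft.rfft(frames, n_fft)
    energy = np.abs(spectrum) ** 2

    # Step 4: triangular Mel filter bank applied to the power spectrum.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = energy @ fbank.T

    # Step 5: log of the filter-bank energy, then DCT; keep 12 DCT
    # coefficients and prepend the per-frame log energy (13 features).
    ceps = dct(np.log(mel_energy + 1e-10), type=2, axis=1,
               norm='ortho')[:, 1:n_ceps + 1]
    log_e = np.log(energy.sum(axis=1) + 1e-10)
    return np.hstack([log_e[:, None], ceps])
```

A 1 s signal at 16 kHz with a 20 ms frame and 50% hop yields 99 frames, each described by 13 coefficients.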

CELP.
The CELP feature extraction method is derived from LPC (Linear Predictive Coding) and is based on the G.723.1 compression coding standard. The LPC coefficients are extracted from the 0th to 23rd bits of the bit coding in each frame, forming the 10-dimensional LPC feature. Another 2-dimensional feature, the pitch lag, is extracted from the 24th to the 42nd bit of the bit stream in each frame. The extraction of CELP is shown in Figure 3.
Endpoint detection is performed after the original audio file is preprocessed. Then each audio file is divided into several sound segments, and each sound segment is considered one sample in the experiment. For each frame, features are extracted using MFCC (13 dimensions) and CELP (12 dimensions). The sampling rate is 16 kHz, and the audio is single-channel. Each sample contains several frames. For each detection segment (comprising many frames), the mean, median, and variance of each feature are calculated to obtain 75-dimensional data. The feature extraction process is shown in Figure 4.
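A minimal sketch of this aggregation step, assuming the per-frame features arrive as a NumPy array of shape (n_frames, 25), i.e., 13 MFCC plus 12 CELP dimensions per frame:

```python
import numpy as np

def segment_features(frame_feats):
    """Collapse a (n_frames, 25) array of per-frame features into one
    75-dimensional sample: the mean, median, and variance of each of
    the 25 dimensions, as described in the text."""
    return np.concatenate([frame_feats.mean(axis=0),
                           np.median(frame_feats, axis=0),
                           frame_feats.var(axis=0)])
```

Each detection segment, regardless of how many frames it spans, is thus reduced to one fixed-length 75-dimensional vector, which is what the feature selection stage operates on.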

Feature Scoring Method MICV.
Based on the principle of small intraclass distance and large interclass distance, features that are easy to distinguish are selected. To calculate the degree of feature differentiation, mutual information MIEC (Mutual Information for Interclass) is used to measure the interclass distance, and the coefficient of variation CVAC (Coefficient of Variation for Intraclass) is used to measure the intraclass distance. The MIEC and CVAC measures are combined to calculate the classification contribution degree of each feature:

micv_f = \lambda \cdot miec_f + (1 - \lambda) \cdot cvac_f. (5)

Because the intraclass distance and interclass distance have different weights, the coefficient \lambda (0 < \lambda < 1) is introduced to adjust the weights.
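The weighted combination can be sketched as follows. The linear form of equation (5) is reconstructed from the surrounding text and should be read as an assumption; since features are later sorted in ascending order of score, smaller is better. The miec and cvac inputs are the per-feature scores defined in the next two subsections.

```python
import numpy as np

def micv_scores(miec, cvac, lam=0.1):
    """Combine the interclass term (miec) and intraclass term (cvac)
    into one score per feature with weight lambda, as in equation (5).
    Smaller scores indicate more distinguishable features."""
    miec, cvac = np.asarray(miec), np.asarray(cvac)
    return lam * miec + (1 - lam) * cvac
```

The default lam=0.1 matches the value the paper settles on after the Table 3 experiment.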

MIEC.
Mutual information measures the correlation or dependency between two variables. For two discrete random variables X and Y, the mutual information I(X; Y) is calculated as

I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}, (6)

where p(x, y) is the joint probability density function of x and y, and p(x) and p(y) are the marginal probability density functions of x and y.
Generally, when mutual information is used to select features, the variables X and Y represent the feature vector and the label vector. In this paper, X and Y represent two vectors of different classes under the same feature. Given feature space F and classification space C, the interclass mutual information of the f-th feature, miec_f, is calculated as

miec_f = \sum_{i \neq j} I(f_i; f_j), (7)

where f_i and f_j (i \neq j) are the samples of the f-th feature in the i-th class and the j-th class, and miec_f is the interclass mutual information of the f-th feature in F.
The interclass difference of feature f is greater when miec_f is smaller, and vice versa.
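Equations (6) and (7) can be sketched as follows. The paper does not specify a mutual-information estimator for continuous feature values, so the histogram estimate, the truncation of class vectors to a common length, and the averaging over class pairs are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y, bins=8):
    """Histogram estimate of I(X; Y) for two equally long continuous
    vectors (equation (6); the binning scheme is an assumption)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def miec(feature, labels):
    """Pairwise mutual information between the samples of one feature
    taken from different classes (equation (7) sketch). Class vectors
    are truncated to a common length so they can be paired."""
    classes = np.unique(labels)
    groups = [feature[labels == c] for c in classes]
    vals = []
    for a, b in combinations(groups, 2):
        n = min(len(a), len(b))
        vals.append(mutual_info(a[:n], b[:n]))
    return float(np.mean(vals))
```

A vector paired with itself yields a high estimate, while two independent vectors yield one near zero, which is the sense in which a small miec_f signals well-separated classes.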

CVAC.
In statistics, the coefficient of variation (CV) measures the variation among two or more samples, or the dispersion within them:

Cv = \frac{\sigma}{\mu}, (8)

where \mu and \sigma are the mean and standard deviation of the samples. Given feature space F and classification space C, the intraclass coefficient of variation of feature f, cvac_f, is calculated as

cvac_f = \frac{1}{|C|} \sum_{i \in C} Cv_i, (9)

where Cv_i represents the CV of the samples of feature f in class i. The feature f has higher cohesion when cvac_f is smaller.
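A sketch of equations (8) and (9), with averaging of the per-class CVs taken as the aggregation (an assumption consistent with the reconstruction above):

```python
import numpy as np

def cvac(feature, labels):
    """Intraclass coefficient of variation of one feature: the CV
    (sigma/mu) of the samples in each class, averaged over classes.
    A tiny epsilon guards against division by a zero mean."""
    cvs = []
    for c in np.unique(labels):
        x = feature[labels == c]
        cvs.append(np.std(x) / (np.abs(np.mean(x)) + 1e-12))
    return float(np.mean(cvs))
```

A feature that is nearly constant within each class (even if the classes sit at very different levels) receives a small cvac, i.e., high cohesion.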

Feature Selection Method MICV-ERMFT.
After the features are scored using the MICV method, high-quality features are selected, and ERMFT is used to eliminate redundant features from the feature array sorted by scores. The process is shown in Algorithm 1.

Build Maximum Feature Tree.
The maximum feature tree is derived from the minimum spanning tree. For an undirected graph G(V, E) in which each edge has a weight w, a minimum spanning tree is a subset of edges E′ that connects all the vertices V with no cycle such that the total weight of the edges in E′ is minimum. In a maximum feature tree, features are represented as vertices, and the weights of the edges are decided by the Pearson correlation coefficient. P(F_r, F_c) represents the correlation coefficient between features F_r and F_c, which is calculated as

P(F_r, F_c) = \frac{\sum_i (F_{ri} - \bar{F}_r)(F_{ci} - \bar{F}_c)}{\sqrt{\sum_i (F_{ri} - \bar{F}_r)^2 \sum_i (F_{ci} - \bar{F}_c)^2}}, (10)

I(F_r, F_c) = |P(F_r, F_c)|, (11)

where F_{ri} represents the i-th sample of feature r, \bar{F}_r is the mean value of feature r over all samples, and I(F_r, F_c) is the correlation coefficient between features r and c. Algorithm BMFT (building the max feature tree) uses equations (10) and (11) to calculate the correlation coefficient matrix and construct the maximum feature tree. Details are described in Algorithm 2.
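The BMFT construction can be sketched with a Prim-style greedy loop; using the absolute Pearson correlation as the edge weight and starting from vertex 0 are assumptions for illustration, not details taken from the paper's listing.

```python
import numpy as np

def build_max_feature_tree(X):
    """Sketch of BMFT: weight the edge (r, c) by the absolute Pearson
    correlation between feature columns r and c, then grow a maximum
    spanning tree by repeatedly attaching the outside vertex with the
    heaviest edge into the tree (Prim-style greedy loop)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # features are columns
    n = corr.shape[0]
    in_tree = {0}
    edges = []  # (parent, child) pairs of the maximum feature tree
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or corr[u, v] > best[2]):
                    best = (u, v, corr[u, v])
        edges.append((best[0], best[1]))
        in_tree.add(best[1])
    return edges

def adjacency(edges, n):
    """Neighbor sets of the tree, used by the two-neighborhood step."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj
```

Because the tree keeps only the heaviest correlations, a feature's tree neighbors are exactly its most redundant companions, which is what the elimination step exploits.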

Remove Redundant Features Based on Two Neighborhoods.
ERFTN (Eliminate Redundant Features Based on Two Neighborhoods) eliminates redundancy using the concept of two neighborhoods. One example, with a maximum feature tree T and a feature sequence F sorted with the MICV method, is demonstrated in Figure 5. As shown in Figure 5, given the maximum feature tree T and a sequence of 10 features sorted with the MICV method in ascending order, the steps of the ERFTN algorithm are listed in Algorithm 3, which yields the final feature subset.
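The elimination walk can be sketched as follows, assuming (per step (2) of Algorithm 3) that the neighborhood of a feature x is its set of adjacent vertices in the maximum feature tree:

```python
def erftn(tree_adj, sorted_features):
    """Sketch of ERFTN: walk the MICV-sorted feature list; keep each
    visited feature and drop its neighbors in the maximum feature tree,
    since those are its most correlated (redundant) companions.
    tree_adj maps each feature index to its set of tree neighbors."""
    remaining = list(sorted_features)
    kept = []
    while remaining:
        x = remaining.pop(0)                 # steps (1)/(4): next element
        kept.append(x)
        neighbors = tree_adj.get(x, set())   # step (2): neighborhood V
        remaining = [f for f in remaining if f not in neighbors]  # step (3)
    return kept
```

For a chain-shaped tree 0-1-2-3-4 visited in order, every second feature survives: the best-scored feature is kept and its immediate redundant neighbors are discarded.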

Experiments and Results Analysis
Experimental Dataset. Currently, there are many websites dedicated to sharing bird sounds from around the world, such as Avibase [29] and Xeno-Canto [30]. Recordings of bird sounds are collected and annotated on these websites. The recordings include various types of voice expressions (multiple calls and songs) of various individuals recorded in their natural environment. The dataset used for this paper comes from Avibase and is a collection of MP3 or WAV audio files. These audio files are unified to a 16 kHz sampling rate and a single channel. Since the audio files are not all bird sounds, the bird sounds in the audio are separated through voice activity detection (VAD) [25,31], and then the MFCC and CELP features are extracted according to the process shown in Figure 4. The experiments use two datasets, containing bird sounds and crane sounds. We selected six bird species from different genera for the bird sound dataset, which contains 433 samples. The crane sound dataset includes 343 samples from seven species of Grus. The dataset information is shown in Tables 1 and 2. The feature scoring method is compared in the experiments with ConstraintScore (CS) [11] and six other feature scoring methods provided by Weka [32]: Correlation (Cor), GainRatio (GR), InfoGain (IG), One-R (OR), ReliefF (RF), and SymmetricalUncert (SU).

Classifier Performance Evaluation.
Kappa, the F1 score, and accuracy were used as evaluation indicators.
(1) Kappa. Cohen's Kappa coefficient is a statistical measure of interrater reliability (and also intrarater reliability) for qualitative (categorical) items:

Kappa = \frac{p_o - p_e}{1 - p_e}, (12)

where p_o is the overall classification accuracy, calculated as the number of correctly classified samples divided by the total number of samples. Based on the confusion matrix, assume the numbers of real samples in each class are a_1, a_2, \ldots, a_n and the numbers of predicted samples are b_1, b_2, \ldots, b_n; p_e is then calculated as

p_e = \frac{a_1 \times b_1 + a_2 \times b_2 + \cdots + a_n \times b_n}{N \times N}, (13)

where N is the total number of samples.
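The definitions above translate directly into a small implementation:

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Kappa per equations (12)-(13): p_o is the overall accuracy and
    p_e the chance agreement computed from the per-class counts of real
    and predicted samples over the squared total sample count."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    p_o = np.mean(y_true == y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_e = sum((y_true == c).sum() * (y_pred == c).sum()
              for c in classes) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement gives Kappa = 1, while a prediction no better than the class-frequency baseline gives Kappa = 0.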
(2) F1 Score. The F1 score is an index used in statistics to measure the accuracy of classification models, taking into account both the precision and the recall of the model. As shown in equation (14), precision represents the precision rate and recall represents the recall rate:

F1 = \frac{2 \times precision \times recall}{precision + recall}. (14)

[Figure 4: Extraction process of bird sound features. Each detection fragment contains n frames; the mean, variance, and median of the n values are calculated for each feature.]

Mathematical Problems in Engineering
(3) Accuracy. The accuracy is calculated as

Accuracy = \frac{n}{M}, (15)

where n represents the number of correctly classified samples and M represents the total number of samples.
Each dataset is divided into a 70% training set and a 30% test set. Each experiment is repeated 10 times to average out biased results. The effect of the weight λ is evaluated and recorded in Table 3, where a lower ratio indicates better performance. Table 3 shows that better results are obtained when λ is set to 0.1, 0.2, or 0.3. In the following experiments in this paper, λ is set to 0.1.

Compare MIEC, CVAC, and MICV.
The selected feature set has a decisive effect on the classification model. Features with higher scores normally lead to better classification performance. The experiments sort the feature sequence in ascending order according to the feature scores obtained from MIEC, CVAC, and MICV, respectively. In Figure 6, in most cases, the red curves ascend more stably, which shows that, as features are gradually added, the classification model's performance improves. To sum up, combining MIEC and CVAC works better than using them alone.

Algorithm 1: MICV-ERMFT.
Input: dataset D.
Step:
(1) Calculate MICV using equation (5) for each feature in D.
(2) Sort the MICV feature sequence in ascending order to obtain F. According to F, select data in D, gradually adding one feature at a time, and use the base classifier to score. Delete any feature that leads to a decline of the index, obtain the feature sequence F*, and map F* to D to get dataset D*.
(3) Calculate the Pearson correlation coefficient matrix P for the feature vectors of D*.
(4) Apply algorithm BMFT (Algorithm 2) to construct a maximum feature tree T from P.

Algorithm 2.
Name: BMFT (building max feature tree).
Input: Correlation coefficient matrix P_{n×n} # n is the number of features.
Step:
(1) Initialize root T = {1}.

Experiment of MICV Results and Analysis.
In this section, the proposed MICV is tested on the Birds dataset and the Crane dataset.
The results of the experiments in Figures 7 and 8 show that, for the same number of selected features, the Kappa value of the MICV method is generally higher than that of the other methods. As the number of features increases, the Kappa value of the MICV method converges earlier and remains relatively stable compared with the other methods. MICV is thus more effective than the other feature evaluation methods.
Tables 4 and 5 record the best classification results (Kappa, accuracy, and F1 scores) for each feature scoring sequence, as well as the number of features used to obtain those values. In each row, a bold value on the left side of "|" indicates that the method uses the fewest features among all methods, and a bold value on the right indicates that the method has the highest evaluation indicator score. Table 4 shows that, on the bird dataset, the MICV method has the highest Kappa value under all four classifiers. With the J48, NB, and RFs classifiers, the MICV method uses the fewest features and achieves the highest evaluation indicator scores in most cases. As shown in Table 5, the performance of MICV with the J48, NB, and RFs classifiers is significant.
In summary, the MICV method is more effective in selecting optimal features than the other seven methods, and it achieves a good modeling effect with a lower dimension.

Experiment of MICV-ERMFT Feature Selection.
In the second part of the experiment, features are evaluated using CS and the other scoring methods.

Algorithm 3.
Name: ERFTN (Eliminate Redundant Features Based on Two Neighborhoods).
Input: T: max feature tree built by Algorithm BMFT; F: features sorted with the MICV method.
Step:
(1) Get the first element x in F.
(2) V = {y | y ∈ T, y is an adjacent vertex of x}.
(3) Update F by deleting all vertices in V, that is, F = F\V.
(4) Choose the next unvisited element as x.

Procedure of Experiment.
The procedure is demonstrated in Figure 9. Eight different methods (MICV and the seven other methods mentioned above) are used to evaluate each feature's classification contribution and score the features. After the features are sorted in ascending order by score, the ERMFT method is used to eliminate redundant features, resulting in a feature subset F′. F′ is then mapped to the dataset, resulting in Dataset′. J48, SVM, BayesNet (NB), and Random Forests (RFs) are the classifiers in the experiment. Each dataset is divided into a 70% training set and a 30% test set. Each experiment is repeated ten times, and the average Kappa is calculated. In addition, the DRR (Dimensionality Reduction Rate) is introduced as an evaluation indicator.
DRR = 1 - \frac{F_n'}{F_n}, (16)

where F_n′ is the number of selected features and F_n is the number of all features of each dataset. The larger the DRR value, the stronger the ability to reduce dimensions. In Figure 10(a), it can be clearly observed that the MICV-ERMFT method has a slightly higher Kappa than the other methods, as does the J48 classifier in Figure 10. In conclusion, compared with the other seven methods, the MICV-ERMFT method demonstrates good abilities in dimensionality reduction and feature interpretation.
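As a worked example of equation (16), where the feature count of 32 is purely illustrative:

```python
def drr(n_selected, n_total):
    """Dimensionality Reduction Rate from equation (16)."""
    return 1 - n_selected / n_total

# Keeping a hypothetical 32 of the 75 extracted features:
print(round(drr(32, 75), 2))  # → 0.57
```

A DRR of 0.57 means 57% of the original feature dimensions were eliminated.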

Experiment of MICV-ERMFT Results and Analysis.
Combining Figures 8(b) and 8(d) with Table 6, it is evident that the MICV-ERMFT method has a significant dimensionality reduction effect and improves model performance on both the Birds dataset and the Crane dataset. In Table 6, the Kappa value and DRR are very good for the J48, NB, and SVM classifiers on the Birds dataset. Particularly for the NB classifier, the Kappa values of the seven comparison methods do not exceed that of the original data (ORI), while the Kappa of the MICV-ERMFT method exceeds 0.4. On the Crane dataset, MICV-ERMFT outperforms the other methods. Table 7 shows that the MICV-ERMFT method remains excellent for the most part and is more stable than the other methods, although other methods surpass it with some classifiers. Besides, the MICV-ERMFT method improves the Kappa value compared to the original data. Although the improvement is minimal in some cases, the MICV-ERMFT method uses only about half of the features of the original data.
In conclusion, MICV-ERMFT performs better in both dimensionality reduction and model improvement.

Conclusion
Feature selection is an important preprocessing step in data mining and classification. In recent years, researchers have focused on feature contribution evaluation and redundancy reduction, and different optimization algorithms have been proposed to address this problem. In this paper, we measure the contribution of features to classification from a probabilistic perspective and, combining this with a maximum feature tree to remove redundancy, propose the MICV-ERMFT method to select optimal features, applying it to the automatic recognition of bird sounds.
To verify the MICV-ERMFT method's effectiveness in automatic bird sound recognition, two datasets are used in the experiments: data from different genera (Birds dataset) and data from the same genus (Crane dataset). The results show that the Kappa indicator on the Birds dataset reaches 0.93 with a dimensionality reduction rate of 57%, and the Kappa value on the Crane dataset reaches 0.88 with a dimensionality reduction rate of 53%; good results are obtained in both cases.
This study shows that the proposed MICV-ERMFT feature selection method is effective. The bird audio selected in this paper was noise-filtered, and further research should test the method's performance on noisy recordings combined with a denoising method. We will also continue to explore the performance of MICV-ERMFT on datasets with larger numbers of features and instances.

Data Availability
All the data included in this study are available upon request from the corresponding author.
Disclosure
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Conflicts of Interest
The authors declare no conflicts of interest.