A Feature Selection Algorithm Integrating Maximum Classification Information and Minimum Interaction Feature Dependency Information

Feature selection is a key step in the analysis of high-dimensional small sample data. Its core is to analyse and quantify the correlation between features and class labels and the redundancy between features. However, most existing feature selection algorithms only consider the classification contribution of individual features and ignore the influence of interfeature redundancy and correlation. Therefore, this paper proposes a feature selection algorithm based on nonlinear dynamic conditional relevance (NDCRFS), developed through the study and analysis of existing feature selection ideas and methods. Firstly, redundancy and relevance between features and between features and class labels are discriminated by mutual information, conditional mutual information, and interactive mutual information. Secondly, the selected features and candidate features are dynamically weighted using information gain factors. Finally, to evaluate its performance, NDCRFS was validated against 6 other feature selection algorithms on three classifiers, using 12 different data sets, comparing both the variability and the classification metrics of the algorithms. The experimental results show that the NDCRFS method can improve the quality of the feature subsets and obtain better classification results.


Introduction
In the era of big data, the dimensionality of small sample data has increased dramatically, leading to the curse of dimensionality. In the preprocessing stage, irrelevant and redundant features need to be removed using dimensionality reduction techniques. High-dimensional data contain many irrelevant and redundant features, which not only increase computational complexity but also reduce the accuracy and efficiency of classification methods. Feature selection [1][2][3][4][5] differs from other dimensionality reduction techniques (e.g., feature extraction) [6] in that it focuses on analysing the relevance and redundancy in high-dimensional data, removing as many irrelevant and redundant features as possible while retaining the relevant original physical features.
This approach not only improves data quality and classification performance but also reduces model training time and improves interpretability [7][8][9].
Feature selection methods can be classified into three types: filter methods [10, 11], wrapper methods [12], and embedded methods [13]. Owing to their high computational efficiency and generality, filter methods are easily applied even to ultra-high-dimensional data sets; this paper therefore uses a filter feature selection method. Filter methods can be further classified, according to their metrics, into rough set [14], statistics-based [15], and information-based [16] approaches. Among these, information-theoretic feature selection algorithms are currently the most popular research direction for filter methods. Feature selection algorithms in information theory are usually further divided into mutual information metrics [17, 18], conditional mutual information metrics [1, 19], interactive mutual information metrics [20][21][22], and so on. These methods, however, only determine whether features are redundant or relevant under a single condition, so the optimal feature subset cannot be obtained. The main differences between feature extraction in deep learning and information-theoretic filter feature selection are twofold: (1) from a business perspective, feature selection algorithms can analyse features, whereas feature extraction can only perform pattern mapping and not correlation analysis; (2) from an efficiency perspective, feature extraction requires more computational resources and longer training time, whereas feature selection can run on a low-performance server.
In a high-dimensional small sample environment, dynamically searching for redundant and correlated features becomes a pressing problem given the diversity and high dimensionality of the data. This paper proposes a feature selection algorithm based on nonlinear dynamic conditional relevance (NDCRFS). The innovations and contributions of this paper are as follows: (1) Firstly, the correlation between individual features and class labels is calculated by mutual information. Secondly, the correlation between the candidate features and the selected features under the class label is calculated using conditional mutual information. Finally, the correlation and redundancy between features are judged by the interaction information. This method solves the problem of how to measure the relevance and redundancy between selected features and candidate features.
(2) The interaction information is normalized by an information gain factor to maintain a dynamic balance of the interaction information values. (3) Experimental comparisons on 12 benchmark data sets with the k-nearest neighbour (KNN), decision tree (C4.5), and support vector machine (SVM) classifiers show that the NDCRFS algorithm outperforms other feature selection algorithms (Mutual Information Maximization (MIM) [23], Interaction Gain-Recursive Feature Elimination (IG-RFE) [24], Interaction Weight Feature Selection (IWFS) [21], Conditional Mutual Information Maximization (CMIM) [25], Dynamic Weighting-based Feature Selection (DWFS) [26], and Conditional Infomax Feature Extraction (CIFE) [23]). The experimental results demonstrate that the proposed NDCRFS algorithm provides an effective criterion for selecting feature subsets with good classification performance. The rest of the paper is organised as follows. In Section 2, related work is presented. Section 3 discusses mutual information and conditional mutual information. In Section 4, the development of filter feature selection algorithms is introduced and summarised, together with a discussion of how to define independent feature relevance and redundancy, new classification information relevance, and interaction feature dependency relevance and redundancy. In Section 5, the process and implementation details of the NDCRFS algorithm are described. In Section 6, the effectiveness of the NDCRFS algorithm is validated through a comprehensive evaluation on 12 data sets from ASU and UCI, with a related discussion. In Section 7, the paper is summarised and the shortcomings and future development of the NDCRFS algorithm are pointed out.

Mutual Information and Conditional Mutual Information
Let X, Y, and Z be three discrete random variables [27]. The mutual information between X and Y is defined as follows:

I(X; Y) = Σ_i Σ_j p(x_i, y_j) log( p(x_i, y_j) / (p(x_i) p(y_j)) )

In the above equation, p(x_i, y_j) refers to the joint distribution, and p(x_i) and p(y_j) refer to the marginal distributions.
Similarly, the conditional mutual information of X and Y given Z is defined as follows:

I(X; Y | Z) = Σ_i Σ_j Σ_k p(x_i, y_j, z_k) log( p(x_i, y_j | z_k) / (p(x_i | z_k) p(y_j | z_k)) )
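To make these definitions concrete, the following sketch estimates both quantities from empirical frequency counts. The function names, the bit-based logarithm, and the list-of-samples interface are illustrative choices, not from the paper:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X; Y) in bits from two aligned lists of samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # p(x,y) / (p(x) p(y)) simplifies to c_xy * n / (c_x * c_y) with raw counts.
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def conditional_mutual_information(xs, ys, zs):
    """Empirical I(X; Y | Z) = sum over z of p(z) * I(X; Y | Z = z)."""
    n = len(zs)
    total = 0.0
    for z in set(zs):
        idx = [i for i in range(n) if zs[i] == z]
        total += len(idx) / n * mutual_information(
            [xs[i] for i in idx], [ys[i] for i in idx])
    return total
```

For instance, two identical balanced binary variables share exactly one bit of information, while two independent ones share none.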

Related Work
A large number of filter feature selection algorithms have been proposed; they mainly use forward search to find the optimal feature subset by evaluating, with their respective criteria, the relevance between features and class labels and the redundancy between features. Let F be the original feature set, let S ⊂ F be the selected feature subset, let J(·) represent the evaluation criterion, let f_k denote a candidate feature, and let f_select denote a selected feature. Lewis et al. proposed the MIM algorithm, which selects the k features from F most relevant to the class labels. The MIM evaluation criterion is as follows:

J_MIM(f_k) = I(f_k; C)

Lin et al. studied the limitations of the MIM algorithm and proposed the CIFE algorithm, which is evaluated with the following criterion:

J_CIFE(f_k) = I(f_k; C) − Σ_{f_i ∈ S} [ I(f_k; f_i) − I(f_k; f_i | C) ]

In J_CIFE, in addition to the redundancy I(f_k; f_i) between features, the within-class redundancy I(f_k; f_i | C) is also measured.
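The two criteria above can be sketched as scoring functions over discretized feature columns. The helpers `mi` and `cmi` are compact empirical estimators introduced here for illustration; the dict-of-lists data layout is an assumption, not the paper's implementation:

```python
import math
from collections import Counter

def mi(xs, ys):
    # Empirical mutual information I(X; Y) in bits.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(xs, ys, zs):
    # Empirical conditional mutual information I(X; Y | Z).
    n = len(zs)
    return sum(zs.count(z) / n * mi([xs[i] for i in range(n) if zs[i] == z],
                                    [ys[i] for i in range(n) if zs[i] == z])
               for z in set(zs))

def j_mim(k, features, C):
    # MIM criterion: J(f_k) = I(f_k; C).
    return mi(features[k], C)

def j_cife(k, S, features, C):
    # CIFE criterion: I(f_k; C) - sum over f_i in S of [I(f_k; f_i) - I(f_k; f_i | C)].
    return mi(features[k], C) - sum(
        mi(features[k], features[i]) - cmi(features[k], features[i], C)
        for i in S)
```

A feature identical to the class label scores 1 bit under J_MIM, while an independent one scores 0.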
Yang et al. [28] proposed the Joint Mutual Information (JMI) algorithm, which is evaluated with the following criterion:

J_JMI(f_k) = I(f_k; C) − (1/|S|) Σ_{f_i ∈ S} [ I(f_k; f_i) − I(f_k; f_i | C) ]

J_JMI(f_k) differs from J_CIFE(f_k) only by the weighting factor 1/|S|, where |S| is the number of selected features. Fleuret et al. proposed the CMIM algorithm according to the maximum-minimum criterion, which is evaluated as follows:

J_CMIM(f_k) = min_{f_i ∈ S} I(f_k; C | f_i)

Sun et al. considered a nonlinear criterion with low computational cost and proposed DWFS; in the W_DWFS(f_k) criterion, I(f_k; C | f_select) > I(f_k; C) indicates relevance and I(f_k; C | f_select) < I(f_k; C) indicates redundancy. Hu et al. [29] proposed the Dynamic Relevance and Joint Mutual Information Maximization (DRJMIM) algorithm based on the DWFS and JMIM algorithms; it mainly addresses the definition of feature relevance, that is, how to distinguish the relevance of candidate features from that of selected features. Xiao et al. [30] argued that exploiting redundancy between features can further improve classification accuracy and, on this basis, proposed the Dynamic Weights Using Redundancy (DWUR) algorithm. In summary, analysis of the above criteria shows that existing feature selection algorithms suffer from some of the following problems: (1) redundant and irrelevant features are not completely eliminated; (2) interdependent features are often removed as redundant because they are highly correlated with each other, since these algorithms ignore judgements about the relevance and redundancy of interdependent features; (3) the dependency relevance and redundancy of interacting features can in fact be judged by the difference between conditional mutual information and mutual information.
Therefore, the study of better feature selection criteria is an urgent problem to be solved.
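As a concrete example of one of the criteria above, the CMIM greedy forward search can be sketched as follows. The estimators, the empty-set fallback, and the tie-breaking behaviour of `max` are illustrative assumptions, not the original implementation:

```python
import math
from collections import Counter

def mi(xs, ys):
    # Empirical mutual information I(X; Y) in bits.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(xs, ys, zs):
    # Empirical conditional mutual information I(X; Y | Z).
    n = len(zs)
    return sum(zs.count(z) / n * mi([xs[i] for i in range(n) if zs[i] == z],
                                    [ys[i] for i in range(n) if zs[i] == z])
               for z in set(zs))

def j_cmim(k, S, features, C):
    # CMIM criterion: min over selected f_i of I(f_k; C | f_i);
    # falls back to plain I(f_k; C) while S is still empty.
    if not S:
        return mi(features[k], C)
    return min(cmi(features[k], C, features[i]) for i in S)

def cmim_select(features, C, n_select):
    # Greedy forward search maximizing J_CMIM at every step.
    S, remaining = [], list(features)
    while remaining and len(S) < n_select:
        best = max(remaining, key=lambda k: j_cmim(k, S, features, C))
        S.append(best)
        remaining.remove(best)
    return S
```

The max-min structure penalizes a candidate as soon as any already-selected feature makes it uninformative.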

Evaluation Basis for Feature Selection
Bennasar et al. [31] argued that a feature f is considered useful if it is related to the class label C; otherwise, f is considered useless. This assumption treats features as completely independent of each other. In reality, the correlation between feature f and label C varies as different features are added: there are interdependencies between features, and the relevance and redundancy between f and C change dynamically. In this section, the relevance and redundancy of independent and dependent features are analysed and discussed.

Independent Feature Relevance and Redundancy Analysis.
Mutual information I(f_i; C) is often used to assess the correlation between feature f_i and the class label C. The stronger the correlation between f_i and C, the closer the (normalized) I(f_i; C) value is to 1; conversely, the weaker the correlation, the closer the value is to 0. If I(f_i; C) > I(f_j; C), the correlation between f_i and C is stronger than that between f_j and C; if I(f_i; C) < I(f_j; C), it is weaker. Similarly, the mutual information I(f_i; f_j) is often used to assess the correlation between features f_i and f_j. If the correlation between f_i and f_j is high, the redundancy between the features is strong; conversely, it is weak. When I(f_i; f_j) = 0, features f_i and f_j are independent of each other. When I(f_i; f_j) = 1, features f_i and f_j are fully redundant, and one of f_i and f_j can be deleted.

Relevance Analysis of New Classification Information.
If I(f_i; C | f_select) > 0, the candidate feature f_i can provide additional classification information. If I(f_i; C | f_select) = 0, the candidate feature f_i cannot provide any new classification information, and f_i and the class label C are conditionally independent given f_select.
If I(f_i; C | f_select) > I(f_j; C | f_select), feature f_i provides more classification information than feature f_j.

Relevance and Redundancy of Interaction Feature
Dependencies. According to the literature [6, 18, 29], if I(f_i; f_select | C) > I(f_select; C), the relevance of the selected feature f_select to the class label C becomes stronger after the candidate feature f_i is added, indicating that f_i can provide more classification information. If I(f_i; f_select | C) < I(f_select; C), the correlation between f_select and C weakens after f_i is added, indicating that f_i and f_select are redundant with each other.
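This test can be expressed as a small predicate over discretized feature columns. The function name and the string labels are hypothetical; the comparison follows the condition stated above, using the same compact estimators as before:

```python
import math
from collections import Counter

def mi(xs, ys):
    # Empirical mutual information I(X; Y) in bits.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(xs, ys, zs):
    # Empirical conditional mutual information I(X; Y | Z).
    n = len(zs)
    return sum(zs.count(z) / n * mi([xs[i] for i in range(n) if zs[i] == z],
                                    [ys[i] for i in range(n) if zs[i] == z])
               for z in set(zs))

def interaction_type(f_i, f_select, C):
    # The paper's test: compare I(f_i; f_select | C) against I(f_select; C).
    lhs, rhs = cmi(f_i, f_select, C), mi(f_select, C)
    if lhs > rhs:
        return "complementary"  # f_i strengthens f_select's relevance to C
    if lhs < rhs:
        return "redundant"      # f_i weakens f_select's relevance to C
    return "neutral"
```

An XOR-style label, where neither feature alone predicts C but the pair does, is the classic complementary case.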

NDCRFS Algorithm Description and Pseudocode Implementation

The feature selection algorithm seeks feature sets that are closely related to the class labels. To measure this relevance more accurately, the NDCRFS algorithm evaluates the relevance and redundancy of features in three ways: measuring the correlation I(f_k; C) between a candidate feature and the class label, measuring the new classification information I(f_k; C | f_select) contributed by the candidate feature, and measuring the interaction correlation and redundancy between f_k and f_select under the class label C through I(f_k; f_select | C) − I(f_select; C). The evaluation criterion J_NDCRFS(f_k) of equation (10) combines these three measures, normalizing the interaction term with the information gain factor CU(f_select, f_k). Based on this criterion, the algorithm first selects the minimally redundant features according to the correlation analysis between the selected features f_select and the candidate features f_k; it then iteratively adds the most relevant features to the optimal feature subset S. The pseudocode is given in Algorithm 1.
In Algorithm 1, line 1 initializes the set S and the counter k. In lines 2 to 7, the mutual information of each feature in F is calculated. In lines 8 to 10, the selected optimal feature f_k is removed from F and added to S; at this point, the candidate feature f_k becomes a selected feature f_select. In lines 11 to 18, the values of I(f_k; C | f_select), I(f_k; f_select | C), and I(f_select; C) are calculated. The NDCRFS algorithm consists of two "for" loops and one "while" loop; therefore, its time complexity is O(Tmn), where T is the number of selected features, n is the number of all features, and m is the number of samples, with T ≪ n. The complexity of NDCRFS is higher than that of the MIM, IWFS, CMIM, DWFS, and CIFE algorithms but lower than that of the IG-RFE algorithm, mainly because NDCRFS must also compute CU(f_select, f_k), I(f_k; f_select | C) − I(f_select; C), and I(f_k; C | f_select).
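The overall greedy structure of Algorithm 1 can be sketched as follows. Since equation (10) is not reproduced here, the scoring function is passed in as a parameter rather than hard-coded; `ndcrfs_select` and `score` are hypothetical names standing in for the paper's J_NDCRFS criterion:

```python
import math
from collections import Counter

def mi(xs, ys):
    # Empirical mutual information I(X; Y) in bits.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def ndcrfs_select(features, C, n_select, score):
    """Greedy forward search mirroring Algorithm 1's structure.

    `score(k, S, features, C)` stands in for the paper's J_NDCRFS
    criterion (equation (10)), which is not reproduced here."""
    # Lines 2-7: compute I(f; C) for every feature in F.
    mi_cache = {k: mi(f, C) for k, f in features.items()}
    # Lines 8-10: seed S with the feature most relevant to C.
    S = [max(mi_cache, key=mi_cache.get)]
    remaining = [k for k in features if k not in S]
    # Lines 11-18: repeatedly add the candidate maximizing the criterion.
    while remaining and len(S) < n_select:
        best = max(remaining, key=lambda k: score(k, S, features, C))
        S.append(best)
        remaining.remove(best)
    return S
```

The O(Tmn) complexity is visible here: T iterations of the while loop, each scoring up to n candidates over m samples.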

Introduction to the Data Set.
To verify the effectiveness of the NDCRFS algorithm, a total of 12 data sets were used in the experiments. The data sets were selected from the internationally renowned UCI [3] and ASU [14] repositories and are described in detail in Table 1. As Table 1 shows, the sample sizes range from 60 to 7494, the feature counts from 16 to 19,993, and the numbers of class labels from 2 to 20. The data sets cover biomedical data (Lymphography, Dermatology, Lung, Cardiotocography, Lymphoma, Nci9, SMK-CAN-187, and Carcinom), face image data (COIL20 and Pixraw10P), and text data (PCMAC and Pendigits).

Experimental Environment Setup.
NDCRFS was compared with six feature selection algorithms, MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE, to verify its effectiveness. The experiments were conducted with the KNN, SVM, and C4.5 classifiers on the same feature subsets. The size of the feature subset was set as K; for example, K = 10 for Lymphography and Pendigits and K = 30 for the remaining data sets. The experimental environment was an Intel i7 processor with 8 GB RAM, and the simulation software was Python 2.7. A 5-fold cross-validation method was used to obtain each classifier's average classification accuracy for each feature selection algorithm. Incomplete samples were deleted, and, following Kuarga [32], the class attribute dependence maximization method was used to discretize continuous data.
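A minimal version of the 5-fold protocol, with a hand-rolled 1-nearest-neighbour classifier standing in for the KNN/SVM/C4.5 classifiers used in the paper, might look like this (the stride-based fold assignment and function names are illustrative simplifications):

```python
def five_fold_accuracy(X, y, classify):
    """Average accuracy over 5 folds.

    `classify(train_X, train_y, x)` must return a predicted label."""
    n = len(X)
    folds = [list(range(i, n, 5)) for i in range(5)]  # stride-based folds
    accuracies = []
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        correct = sum(
            classify([X[i] for i in train], [y[i] for i in train], X[j]) == y[j]
            for j in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

def one_nn(train_X, train_y, x):
    # Minimal 1-nearest-neighbour classifier (squared Euclidean distance).
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]
```

In practice, a library routine with stratified folds would be preferable; this sketch only makes the averaging explicit.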

Comparison of Algorithm Variability.
This paper measures the difference between two selected feature subsets using the Jaccard method, where S_1 ⊂ F, S_2 ⊂ F, and S_1 ≠ S_2; S_1 is the feature subset selected by the NDCRFS algorithm, and S_2 is the feature subset selected by another feature selection algorithm. The specific formula (11) is as follows:

Diff(S_1, S_2) = 1 − |S_1 ∩ S_2| / |S_1 ∪ S_2|
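Assuming the difference measure is the standard Jaccard distance between the two selected index sets, it can be computed directly:

```python
def jaccard_difference(S1, S2):
    # Difference between two selected feature subsets:
    # Diff(S1, S2) = 1 - |S1 ∩ S2| / |S1 ∪ S2|.
    S1, S2 = set(S1), set(S2)
    return 1 - len(S1 & S2) / len(S1 | S2)
```

Identical subsets give a difference of 0, disjoint subsets a difference of 1.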

Computational Intelligence and Neuroscience
As can be seen in Table 2, the mean differences between NDCRFS and MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE are 0.355, 0.389, 0.261, 0.222, 0.286, and 0.166, respectively, indicating that NDCRFS, which takes the relationships between features into account when ranking them, selects feature subsets that differ significantly from those of the other feature selection algorithms. Tables 3 to 5 show the average classification accuracy on the 12 data sets using KNN, C4.5, and SVM; bold indicates the highest accuracy for each data set. Tables 3-5 show that the NDCRFS algorithm achieved the highest average classification accuracy at 88.734%, 81.574%, and 79.213%, respectively. "Wins/Ties/Losses" gives the number of wins, ties, and losses of NDCRFS against MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE.
From Table 5, the NDCRFS algorithm is superior to the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms on the majority of data sets, winning on 10, 12, 12, 11, 10, and 11 of them, respectively. In Figure 3(a), the classification accuracy of the NDCRFS algorithm is the highest (87.964%, with 28 required features), exceeding the six comparison algorithms by 36.966%, 62.936%, 37.517%, 36.419%, 32.191%, and 67.049%, respectively. In Figure 3(b), NDCRFS again achieves the highest accuracy (85.589%, with 20 required features), exceeding the others by 0.001%, 0.102%, 3.394%, 0.255%, 0.206%, and 5.194%, respectively. In Figure 3(c), NDCRFS achieves the highest accuracy (92%, with 5 required features), exceeding each of the others by 1%. In Figure 3(d), NDCRFS achieves the highest accuracy (68.352%, with 24 required features), exceeding the others by 4.466%, 6.285%, 15.528%, 12.419%, 19.714%, and 27.447%, respectively.

Runtime Analysis of the Algorithm.
Calculating the running time of a feature selection algorithm is also one criterion for assessing its practicality. The running times of the NDCRFS, MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms are compared in Table 6; these are the final runtimes obtained from ranking all features of the 12 data sets.
The results of the 5-fold cross-validation experiments on the ASU and UCI data sets show that the proposed NDCRFS algorithm can select a feature subset with better classification performance, further improving the discriminative ability of the data set under dimensionality compression.

Conclusion
Feature selection is an important tool in the data preprocessing phase for high-dimensional small sample data. The main objective of feature selection is to select the optimal feature subset while maintaining high classification accuracy. Therefore, this paper proposes a nonlinear dynamic conditional relevance feature selection algorithm. The algorithm first uses mutual information, conditional mutual information, and interactive mutual information to identify the relevance and redundancy of independent and dependent features. Secondly, the "max-min" principle is used to iteratively eliminate redundant and irrelevant features from the original feature set. Finally, the effectiveness of the algorithm is verified through experiments, which demonstrate that NDCRFS significantly outperforms the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE feature selection algorithms on most of the data sets.
However, the NDCRFS algorithm still selects unsatisfactory feature subsets on some data sets. In the future, NDCRFS will need to be optimized, and the proposed method should be validated in further research fields.

Data Availability
The experimental data sets were selected from the world-famous UCI general data sets (https://archive.ics.uci.edu/ml/datasets.html) and the world-famous ASU general data sets (http://featureselection.asu.edu/datasets.php).

Conflicts of Interest
The author declares that he has no conflicts of interest.

Authors' Contributions
The author wrote, read, and approved the final manuscript.