A Constrained Feature Selection Approach Based on Feature Clustering and Hypothesis Margin Maximization

In this paper, we propose a semisupervised feature selection approach that is based on feature clustering and hypothesis margin maximization. The aim is to improve classification accuracy by choosing the right feature subset and to allow building more interpretable models. Our approach handles the two core aspects of feature selection, i.e., relevance and redundancy, and is divided into three steps. First, the similarity weights between features are represented by a sparse graph where each feature can be reconstructed from the sparse linear combination of the others. Second, features are hierarchically clustered, identifying groups of the most similar ones. Finally, a semisupervised margin-based objective function is optimized to select the most discriminative feature from within each cluster, hence maximizing relevance while minimizing redundancy among features. Finally, we empirically validate our proposed approach on multiple well-known UCI benchmark datasets in terms of classification accuracy and representation entropy, where it outperforms four other semisupervised and unsupervised methods and competes with two widely used supervised ones.


Introduction
In many machine learning and pattern recognition applications, the data comes with a very large feature space describing it [1]. This space can be composed of the following four groups of features [2,3]: (a) completely irrelevant, (b) redundant and weakly relevant, (c) nonredundant but weakly relevant, and (d) strongly relevant features. The first two groups can significantly degrade the performance of learning algorithms (classification, regression, and clustering), decrease their computational efficiency, increase their likelihood of overfitting, and undermine their generalization capability [4-6].
Thus, a good dimensionality reduction technique is usually applied with the aim of identifying features from groups (c) and (d).
Feature selection and feature extraction are two well-known dimensionality reduction tools [2]. In this work, we were particularly interested in feature selection due to its straightforwardness [7]. Unlike feature extraction, it simply obtains a subset of the original features that can best describe a dataset according to some objective function. This is done without applying any changes or transformations to the original feature space [8-10]. Feature selection can be categorized into different groups according to the availability of supervision information: it is unsupervised when label information is not available [5,11-13], supervised when the data is fully labeled [14-17], and semisupervised when only a few data points are labeled and many are unlabeled [18-21].
In fact, many real-world applications present the latter case, in which neither supervised nor unsupervised feature selection algorithms can fully take advantage of all data points. Thus, we were interested in semisupervised methods that exploit both labeled and unlabeled points. Compared to class labels, pairwise constraints are another type of supervision information that can be acquired more easily [22]. These constraints simply specify whether a pair of data points belongs to the same class (must-link constraint) or to different classes (cannot-link constraint) without specifying the classes themselves [23,24]. Some constraint scores use these two notions to rank features; however, they neglect the information provided by unconstrained and unlabeled data [25].
On the other hand, although existing work in the domain of feature selection has resulted in many powerful techniques, most methods unintentionally ignore important information regarding the level of feature correlation during their selection process [26]. Typical examples are iterative methods in which a single individual feature is included or excluded one at a time from a candidate subset of features.
This loss of important information might lead to having redundant features in the final selected subset. To be precise, multiple research works applied feature clustering with the aim of selecting a nonredundant and relevant feature subset. For that, they tended to build the similarity matrices between features using information-theoretic measures such as mutual information [27], conditional mutual information [15], and maximal information coefficient [28]. It was also considered easy to transfer traditional data clustering methods to work for feature clustering; however, the challenge was in finding a suitable and meaningful definition of the similarity notion between features [29]. Therefore, in this paper, we propose a semisupervised feature selection method that combines feature clustering with hypothesis margin maximization to obtain a nonredundant feature subset where features are ranked in the order of their relevance to a target concept. This method consists of (1) a constrained margin-based feature selection algorithm (Relief-Sc) that utilizes pairwise cannot-link constraints and benefits from both the local unlabeled neighborhood of the data points as well as the provided constraints and (2) a feature clustering method that combines sparse graph representation of the feature space with margin maximization.
We first benefit from the characteristics of nonparametrized sparse representation, where it is possible to reconstruct each feature by a sparse linear combination of the others [30]. This is done by solving an L1-minimization problem. In fact, we treat these sparse coefficients as similarity weights to build our feature similarity matrix, on top of which we apply clustering. Specifically, we adopt an agglomerative single-link hierarchical clustering method. Upon obtaining the clustering solution, the features within each cluster (or group) play important roles in reconstructing each other and are thus assumed to be redundant in our scenario. Thus, we select a representative feature from each cluster [28,31,32]. For that, we utilize Relief-Sc to find the feature that best maximizes a pairwise constraint-relevance margin-based objective function. This maximization is quantified by assigning bigger weights to features that best contribute to enlarging a semisupervised distance metric called the constrained hypothesis margin. In fact, as cannot-link constraints are considered more important than must-link constraints from the margin's point of view [33], our constrained hypothesis margin particularly utilizes them.
Finally, the core contribution of our work lies in the overall approach, called FCRSC (Feature Clustering Relief-Sc), which aims at maximizing relevance while minimizing redundancy; to the best of our knowledge, no previous work has done that by combining feature clustering upon sparse representation with the constrained hypothesis margin. To be precise, there exists another work [23] that is also based on the constrained margin concept but deals differently with redundancy; it is included in our experimental comparisons. The rest of this paper is organized as follows. In Section 2, we present the proposed approach and detail its main building blocks. In Section 3, we present the experimental results that validate the proposed approach; experiments are conducted on multiple well-known UCI machine learning datasets. Finally, the discussion and conclusion are provided in Section 4.

Feature Selection by Hierarchical Clustering and Hypothesis Margin Maximization
In this section, we detail each step of the proposed approach. In Section 2.1, we explain how the relationships between features are represented for clustering. In Section 2.2, we briefly explain hierarchical clustering in our context, and in Section 2.3, we give a detailed explanation of the hypothesis margin and its concepts. Finally, in Section 2.4, we present the overall semisupervised feature selection approach that combines feature clustering and hypothesis margin maximization.

Feature Space Sparse Graph Construction.
Sparse graph representation has received a great deal of attention in recent years [1,34,35]; this is due to its ability to find the most compact representation of the original data and to preserve its underlying discriminative information [36]. In fact, the sparse representation model generally aims at representing a data point using as few other data points as possible within the same dataset (overcomplete dictionary). Conventionally, some recent work utilized sparse theory to build the similarity matrix between data points (instances) by assuming that each point can be reconstructed by the sparse linear combination of other points [30,37,38]. On the contrary, in this paper, the similarity graph adjacency structure and the corresponding graph weights are built simultaneously among features instead of data points [39,40]. While computing the sparse linear coefficients by solving an L1-norm regularized least squares loss problem, the most similar features, as well as their estimated similarity weights to the reconstructed feature, are identified. Hence, we obtain the feature-wise sparse similarity matrix that will be used in grouping features. It is important to note that the main advantages of using the L1-graph are the following:

(i) It can lead to a sparse representation, which can enhance the efficiency and the robustness to noise of learning algorithms [36].

(ii) While many clustering algorithms [41,42] are very sensitive to some parameters when building their similarity graphs (e.g., the performance of traditional spectral clustering is heavily related to the choice of sigma in the Heat Kernel), our graph construction is parameter-free.

(iii) It obtains both the graph adjacency structure and the corresponding similarity weights by a single L1-optimization, while L2-graphs usually separate them into two steps.

To mathematically formalize the problem, we consider a data matrix $X = [x_1, \ldots, x_n, \ldots, x_N]^T \in \mathbb{R}^{N \times F}$ including all the features in its columns. To be clear, we also consider X from the features' point of view; thus, it can be expressed as $X = [A_1, \ldots, A_i, \ldots, A_F] \in \mathbb{R}^{N \times F}$. As our aim is to find the sparse linear reconstruction coefficients of each feature, we use the second representation of X in this section. Therefore, to reconstruct each feature (attribute) $A_i$ using as few entries of X as possible, we solve an L0-norm optimization problem as follows:

$$\min_{s_i} \|s_i\|_0 \quad \text{subject to} \quad A_i = X s_i, \tag{1}$$

where $\|\cdot\|_0$ denotes the L0-norm, which is equal to the number of nonzero components in $s_i$. Note that solving (1) is NP-hard. Thus, a sparse vector $s_i$ can be approximately estimated by the following L1-minimization problem [38]:

$$\min_{s_i} \|s_i\|_1 \quad \text{subject to} \quad A_i = X s_i, \quad \mathbf{1}^T s_i = 1, \tag{2}$$

where $\|\cdot\|_1$ denotes the L1-norm and $\mathbf{1} \in \mathbb{R}^F$ is a vector of all ones. In fact, due to the presence of noise, the constraint $A_i = X s_i$ in (2) does not always hold. Thus, in [1], Liu and Zhang mentioned a modified robust extension (invariant to translation and rotation) to mitigate this problem. It can be defined as follows:

$$\min_{s_i} \|s_i\|_1 \quad \text{subject to} \quad \|A_i - X s_i\|_2 < \xi, \quad \mathbf{1}^T s_i = 1, \tag{3}$$

where $\xi$ represents a given error tolerance. The sparse vector $s_i$ is computed for each feature $A_i$. The optimal solution of (3) for each feature $A_i$ is a sparse vector $s_i$; these vectors allow building the sparse reconstructive similarity matrix $S = (s_{i,j})_{F \times F}$, whose ith row is given by $s_i$, i.e., $S_{i,j} = s_{i,j}$, the jth coefficient of $s_i$. The L1-minimization problem can be solved in polynomial time by standard linear programming methods [30] using publicly available packages such as the SLEP package [43].
As the vector $s_i$ is sparse (many of its components are zero and only a few are nonzero), features that are far from each other in the dataset will have very small (zero or near-zero) coefficients.
This solution can reflect the intrinsic geometric properties of the feature space. Algorithm 1 summarizes the graph construction.
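To make the construction concrete, the following is a minimal sketch of this step in Python, assuming scikit-learn's Lasso as a stand-in for the SLEP solver; the regularization value alpha is hypothetical, and the sum-to-one and error-tolerance constraints of (2) and (3) are dropped for simplicity:

```python
# Sketch of feature-wise sparse graph construction (cf. Algorithm 1).
# Assumption: Lasso replaces the SLEP L1 solver; `alpha` is illustrative.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_feature_graph(X, alpha=0.01):
    """X: (N, F) data matrix whose columns A_1..A_F are the features.
    Returns an (F, F) matrix S of sparse reconstruction coefficients,
    used as feature-to-feature similarity weights."""
    N, F = X.shape
    S = np.zeros((F, F))
    for i in range(F):
        mask = np.arange(F) != i              # exclude A_i from its own dictionary
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        model.fit(X[:, mask], X[:, i])        # A_i ~ X s_i with an L1 penalty
        S[i, mask] = model.coef_
    S = np.abs(S)                             # keep coefficient magnitudes
    return np.maximum(S, S.T)                 # symmetrize for clustering
```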

Agglomerative Hierarchical Feature Clustering.
As we mentioned before, a good feature selection algorithm is expected to find features that are most relevant in terms of discriminating data points between different classes while being least correlated to each other. The latter is similar to the general assumption of clustering, where the data is partitioned such that points within the same cluster are as similar as possible to each other and as different as possible from points in other clusters [27]. Focusing on the idea of finding the least redundant (most diverse) features brought up clustering the features themselves instead of clustering data points [15,44]. This minimized redundancy among features can be obtained by dividing them into different groups according to a similarity criterion and then choosing one or more features to represent each group. Among the four primary clustering categories, namely, hierarchical, density-based, statistical, and centroid-based [45], we were interested in hierarchical clustering. It is useful when the structure of the dataset can hold nested clusters, and it does not require a predefined number of clusters, since a hierarchical clustering algorithm outputs a tree diagram called a dendrogram. The dendrogram, which records the sequence of mergers of clusters (features) into larger clusters, presents a multilevel grouping of these features [31].
Hence, depending on the cutoff level of the dendrogram, the number of obtained clusters can vary between 1 and F. Intuitively, at lower levels of the dendrogram, we have the clusters of most redundant features that are first to be grouped.
Thus, cutting at low levels results in a higher number of clusters and therefore more cluster representatives. This yields a larger output feature subset. However, cutting the dendrogram at higher levels results in a smaller number of clusters and thus fewer cluster representatives, i.e., a smaller output feature subset. Hence, although choosing a high cutoff level ensures eliminating more redundancy, it could still cause more information loss.
As a result, good clustering quality is closely related to a problem-adequate choice of the cutoff level. Therefore, we place the cutoff where merging distances become large enough to create a second-level hierarchy, i.e., where clusters of features start being merged together instead of individual features. This is explained by our goal of reducing redundancy among features to a certain level without excessive compression that might lead to information loss.
In summary, hierarchical clustering can use multiple methods for computing the distance between clusters (Ward, complete, median, centroid, single, and others); however, as we are working with feature clustering and wish to merge clusters based on their most similar features (not their average, nor the farthest two features within a cluster as in complete linkage), agglomerative single-linkage hierarchical clustering was applied in this paper. It takes as input the $F \times F$ feature-wise similarity matrix S obtained in Section 2.1. The algorithm initially assigns each feature to its own cluster and then finds the largest element $s_{ij}$ in S; the two corresponding most similar clusters (or features) are then merged. After each merging step, the similarity matrix is updated by replacing the two grouped clusters (or features) with the newly formed cluster in S. This update can be expressed as

$$s_{e,ij} = \max(s_{ei}, s_{ej}),$$

where $s_{e,ij}$ represents the similarity between the newly obtained cluster (formed by merging two clusters $C_i$ and $C_j$, or features $A_i$ and $A_j$) and any other existing cluster $C_e$ (or existing feature $A_e$); $s_{ei}$ and $s_{ej}$ are the respective similarities between the cluster (or feature) pairs.
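A possible implementation of this step, assuming SciPy's single-linkage routine and converting similarities to distances via d = 1 − S/max(S) (the conversion is our choice; any monotone decreasing map preserves single-linkage merge order):

```python
# Sketch of single-linkage feature clustering on the similarity matrix S.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(S, cutoff):
    D = 1.0 - S / S.max()                      # similarity -> distance (assumed map)
    np.fill_diagonal(D, 0.0)                   # self-distance must be zero
    Z = linkage(squareform(D, checks=False), method='single')
    # Cut the dendrogram at the given merge distance; `cutoff` stands in
    # for the automatic second-level-hierarchy rule described above.
    return fcluster(Z, t=cutoff, criterion='distance')
```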

Hypothesis Margin in the Semisupervised Constrained Context.
In this section, we present a concise summary of our margin-based feature selection algorithm called Relief-Sc [33]. The power of Relief-Sc lies in its ability to solve a simple convex problem in closed form (obtaining a unique solution) while utilizing a highly nonlinear classifier to evaluate its margin-based objective function. This unique solution is a set of features ranked in the order of their relevance with respect to a specified problem (e.g., the classification of a newly arriving data point). Relief-Sc ranks features by their ability to maximize a hypothesis-margin-based objective function using a given set of cannot-link constraints. As a reminder, these constraints are a cheaper kind of supervision information, specifying only that two data points should belong to different groups without specifying the groups themselves [23,46]. However, a drawback of Relief-Sc, shared with its basic supervised precursor, the Relief algorithm [47], is that it cannot deal with redundancy among features. Nevertheless, it is well known that eliminating redundant features is a very important aspect of feature selection.
By definition, the hypothesis margin is the largest distance a data point can travel in its feature space without affecting the labeling structure of the dataset and thus without altering the label prediction of a newly arriving point. Therefore, a large margin provides high classifier confidence during prediction. The hypothesis margin is built upon two important notions that were initially suggested in a supervised context [47,48]. These notions, namely, the near-hit and the near-miss of a data point, were defined as its nearest point within the same class and its nearest point from a different class, respectively. Thus, consider $X = [x_1, \ldots, x_n, \ldots, x_N]^T \in \mathbb{R}^{N \times F}$, where N is the number of data points, F is the number of features, and $x_n = (x_{n1}, \ldots, x_{ni}, \ldots, x_{nF})^T$ is the nth data point characterized by F features.
Original definition: In a supervised context, the near-hit of a data point x n is its nearest point within the same class denoted by H(x n ) and the near-miss of a data point x n is its nearest point from a different class denoted by M(x n ).
However, in this paper, we work in a semisupervised context where the only supervision information available is in the form of pairwise cannot-link constraints. Hence, a modification of the near-hit and near-miss notions must be applied.

Definition 1.
In our considered semisupervised context, let $(x_n, x_m)$ be one cannot-link constraint in the set of constraints $C = \{(x_n, x_m)\}$. Then, the nearest point to $x_m$, i.e., its near-hit, denoted by $H(x_m)$, now represents the near-miss of $x_n$. On the other hand, $H(x_n)$ represents the near-hit of $x_n$. The number of constraints |C| is a user-predefined value. An example is illustrated in Figure 1.
As a side note, we would like to explicitly mention how Relief-Sc is considered semisupervised compared to other constrained scores that utilize only the constraints themselves in ranking features [25]. In fact, Relief-Sc depends not only on the cannot-link constraint $(x_n, x_m) \in C$ but also on its unlabeled local neighborhood, as can be seen in Figure 1.
These unlabeled data points, presented as black dots and denoted by $H(x_n)$ and $H(x_m)$, are considered in the margin calculation, making the context semisupervised.

Definition 2.
The constrained hypothesis margin is calculated as the difference between the distance from the data point $x_n$ to the near-hit of the data point $x_m$, i.e., $H(x_m)$, and the distance from $x_n$ to its own near-hit, i.e., $H(x_n)$.
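For concreteness, Definition 2 can be written in the standard hypothesis-margin form of [47,48], with distances measured by the Manhattan metric used throughout this work; the 1/2 scaling factor is carried over from the original supervised definition and is our assumption here:

```latex
\theta(x_n, x_m) \;=\; \frac{1}{2}\Bigl(\lVert x_n - H(x_m)\rVert_1 \;-\; \lVert x_n - H(x_n)\rVert_1\Bigr)
```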
The $\Delta(A_i, p_1, p_2)$ function, defined based on the L1-norm (Manhattan distance) in our work, calculates the distance between any two data points $p_1$ and $p_2$ on a particular feature $A_i$. For quantitative features, $\Delta$ is calculated as follows:

$$\Delta(A_i, p_1, p_2) = \frac{|value(A_i, p_1) - value(A_i, p_2)|}{\max(A_i) - \min(A_i)},$$

and for qualitative features, $\Delta$ is calculated as follows:

$$\Delta(A_i, p_1, p_2) = \begin{cases} 0, & value(A_i, p_1) = value(A_i, p_2), \\ 1, & \text{otherwise}, \end{cases}$$

where $value(A_i, p)$ is the value of the data point p over the ith feature. The minimum and maximum values of a particular feature $A_i$, denoted by $\min(A_i)$ and $\max(A_i)$, are evaluated over the whole set of data points. This normalization ensures that all weight updates range between 0 and 1 for both quantitative and qualitative features.
As the ability of a feature to discriminate between data points can be evaluated by how much it contributes to the margin's maximization, we represent and quantify this contribution by a weight vector $w = (w_1, \ldots, w_i, \ldots, w_F)^T$ spanning the F features. Thus, each feature is assigned a weight equal to the value of the hypothesis margin it induces.

Definition 3.
The constrained weighted hypothesis margin for a particular constraint $(x_n, x_m)$ is presented as

$$\theta_w(x_n, x_m) = \sum_{i=1}^{F} w_i \left[ \Delta(A_i, x_n, H(x_m)) - \Delta(A_i, x_n, H(x_n)) \right]. \tag{8}$$

Definition 4.
The constrained weighted margin-based objective function to be optimized by Relief-Sc over the whole constraint set C can be written as follows:

$$\max_{w} \sum_{(x_n, x_m) \in C} \theta_w(x_n, x_m) \quad \text{subject to} \quad \|w\|_2^2 = 1, \; w \geq 0, \tag{9}$$

where the weight vector $w \geq 0$ holds positive values for relevant features, since it acts as a distance metric, and the constraint $\|w\|_2^2 = 1$ prevents the vector from being maximized without bounds [49]. From (8) and (9), we denote by $z = (z_1, \ldots, z_i, \ldots, z_F)^T$ the margin vector summed over all cannot-link constraints in C, where each element $z_i$ of z corresponds to the margin induced by a specific feature $A_i$. $z_i$ can be calculated as follows:

$$z_i = \sum_{(x_n, x_m) \in C} \left[ \Delta(A_i, x_n, H(x_m)) - \Delta(A_i, x_n, H(x_n)) \right]. \tag{10}$$

Accordingly, the optimization problem can be formulated as follows:

$$\max_{w} \; z^T w \quad \text{subject to} \quad \|w\|_2^2 = 1, \; w \geq 0. \tag{11}$$

Equation (11) shows that the features that participate the most in the overall maximization of the margin will be assigned higher weights and will consequently be selected.
Then, the optimum solution can be obtained in closed form as follows:

$$w = \frac{(z)^+}{\|(z)^+\|_2}, \tag{12}$$

where $(z)^+ = \max(z, 0)$ keeps only the positive part of the margin vector. With the aim of making Relief-Sc more robust, the hypothesis margin can be calculated over a group of K-nearest neighbors (KNN) for each pair of cannot-link constraints. For instance, instead of calculating the margin over only the nearest hit and nearest miss of the points $x_n$ and $x_m$, we can evaluate the margin over the K-nearest hits and K-nearest misses, represented by $KH(x_n)$ and $KH(x_m)$, where K is a user-predefined parameter specifying how many of the closest points to $x_n$ and $x_m$ are to be considered. In fact, this means that the margin is averaged over a larger neighborhood and is thus less vulnerable to noisy data. Consequently, the ith element (corresponding to the ith feature) of the margin vector z is given by

$$z_i = \sum_{(x_n, x_m) \in C} \frac{1}{K} \sum_{k=1}^{K} \left[ \Delta(A_i, x_n, H_k(x_m)) - \Delta(A_i, x_n, H_k(x_n)) \right], \tag{13}$$

where $H_k(\cdot)$ denotes the kth nearest hit. Finally, in the context of a robust constrained hypothesis margin over KNN, Relief-Sc utilizes a cannot-link constraint set to evaluate the averaged margin z in order to optimize w directly, as shown in step 3 of Algorithm 2.
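The following is a compact sketch of this computation under our reading of Definitions 2-4, assuming X is min-max normalized (so the per-feature Manhattan distance needs no rescaling) and constraints is a list of cannot-link index pairs; both interfaces are ours, not the authors' exact ones:

```python
# Sketch of Relief-Sc: averaged constrained margin + closed-form weights.
import numpy as np

def relief_sc(X, constraints, K=10):
    def k_nearest(idx):
        d = np.abs(X - X[idx]).sum(axis=1)     # L1 distance to every point
        order = np.argsort(d)
        return order[order != idx][:K]          # drop the point itself

    z = np.zeros(X.shape[1])
    for n, m in constraints:
        hits_n = k_nearest(n)                   # K near-hits of x_n
        hits_m = k_nearest(m)                   # K near-hits of x_m = near-misses of x_n
        # Per-feature margin, averaged over the K neighbors (cf. (13)).
        z += (np.abs(X[n] - X[hits_m]).mean(axis=0)
              - np.abs(X[n] - X[hits_n]).mean(axis=0))
    z_plus = np.maximum(z, 0.0)                 # positive part of the margin vector
    norm = np.linalg.norm(z_plus)
    return z_plus / norm if norm > 0 else z_plus  # closed form (12)
```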

Proposed Feature Selection Approach.
Our proposed approach, which combines feature clustering with Relief-Sc and is called FCRSC, is a filter-type feature selection method (it does not depend on the performance of any learning algorithm while obtaining its ranked feature subset). In addition, one very important advantage of our method is that it is nonparametric, meaning its performance is not tied to any tuned parameter. Moreover, we do not specify the number of clusters to be obtained from hierarchical clustering; instead, we state a mechanism that chooses the cutoff automatically such that no predefined number of clusters is required (see Section 2.2). To sum up, we first build the feature similarity graph on top of which we apply agglomerative hierarchical clustering. This graph is obtained through sparse coding, where the assigned similarity weights between features are in fact sparse coefficients indicating how much each feature contributes to the reconstruction of the others.
It is very important to find a clustering solution C where feature compression is neither exaggerated (obtaining very few clusters) nor underestimated (obtaining the trivial solution of each feature in its own cluster). Meanwhile, a weight is also assigned to each feature by Relief-Sc as it maximizes the semisupervised margin-based objective function.
Thus, the significance of this approach lies in the last but most important algorithm, called FCRSC. It starts with the two available ingredients, i.e., the clustering solution C obtained by hierarchical clustering and the weight vector w obtained by Relief-Sc. Then, for each of the clusters $c_l$ in C, the number of features within $c_l$ is evaluated and denoted by $|c_l|$. When a cluster has one and only one feature, i.e., $|c_l| = 1$ (considered not redundant at all), that feature is directly added to the chosen feature subset $F_s$. However, when more than one feature is assigned to $c_l$, the features within $c_l$ are sorted in descending order of their corresponding margin weights given in w and stored in a variable named Sorted; then, the feature with the highest weight (the most relevant, obtained as the first-ranked feature in Sorted) is added to the feature set $F_s$, and the rest are eliminated as they are judged to be redundant. After obtaining the representative feature from each cluster, these features are sorted again in descending order of w (stored in $rankedF_s$), leading to the optimization of a twofold objective, i.e., (1) minimizing redundancy between the features in $F_s$ and (2) maximizing relevance between the features and the cannot-link constraints in C.
Thus, we obtain the ranked feature subset $rankedF_s$. Note that the number of features in $rankedF_s$ is equal to the number of obtained clusters, denoted by |C|. We illustrate FCRSC in Figure 2.

ALGORITHM 2: Relief-Sc. Input: (i) training data X; (ii) set of cannot-link constraints C; (iii) number of nearest neighbors K. Output: weight vector w. (1) Calculate $KH(x_n)$ and $KH(x_m)$ for each cannot-link constraint in C with respect to X. (2) For $i = 1, \ldots, F$, compute the averaged margin $z_i$ as in (13). (3) Set $w = (z)^+ / \|(z)^+\|_2$ as in (12).
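A sketch of the representative-selection step described above, assuming labels comes from the feature clustering step and w from Relief-Sc (these interfaces are ours, not the authors'):

```python
# Sketch of the FCRSC representative-selection step.
import numpy as np

def fcrsc_select(labels, w):
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # |c_l| == 1: the lone feature is kept directly; otherwise keep
        # only the member with the highest margin weight.
        reps.append(members[np.argmax(w[members])])
    reps = np.array(reps)
    return reps[np.argsort(-w[reps])]           # rankedFs, descending weight
```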

Experimental Results
In this section, we compare the performance of our proposed FCRSC approach with some well-known state-of-the-art feature selection methods. This comparison is applied in terms of classification accuracy, redundancy-removal ability, and execution time. Note that a machine with a 2.60 GHz CPU and 16 GB of RAM was used to perform the experiments. The used datasets, feature selection methods, and classifiers are detailed in the following sections.

Datasets' Description.
To evaluate the proposed approach, six well-known benchmark datasets representing a variety of problems were used. These datasets are Wine, WDBC, Ionosphere, Spambase, Sonar, and Arrhythmia from the UCI machine learning repository [50]. We summarize the main characteristics of each dataset in Table 1, where the first column gives the name of the dataset, the second column shows its number of data points, the third column shows the data dimension, and the last column specifies the number of classes.
Usually, a dataset can have features lying within different ranges, which affects the performance of feature selection algorithms, leading to unreliable outcomes. Thus, similarly to [5], we normalize the features of each dataset using the max-min criterion to scale their values between zero and one. Furthermore, for the Arrhythmia dataset, where some feature values are missing, we replace them by the average of all available values of the corresponding feature.
In addition, we also similarly partition each dataset into 2/3 for training and 1/3 for testing. This process is repeated independently 10 times, and only the averaged results are recorded. In each run, feature selection is applied to the training subset, ranking the features according to the scores assigned by the different algorithms; classifiers are then trained on these same ranked feature sets. The classification accuracy obtained by each ranked set of features (each feature selection method) is then measured by applying the learned classifier to the testing subset restricted to these same features. A sketch of this protocol is given below.
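The following illustrates the protocol, assuming a generic rank_features(Xtr, ytr) routine and a 1-NN classifier; the 2/3-1/3 split and the 10 repetitions follow the text, everything else is illustrative:

```python
# Sketch of the repeated split-select-train-test evaluation protocol.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, rank_features, n_runs=10, n_keep=10):
    accs = []
    for run in range(n_runs):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3,
                                              random_state=run)
        ranked = rank_features(Xtr, ytr)[:n_keep]   # selection on training data only
        clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr[:, ranked], ytr)
        accs.append(clf.score(Xte[:, ranked], yte))
    return float(np.mean(accs))
```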

Used Filter Feature Selection Methods for Comparison.
In order to evaluate the performance of our proposed FCRSC method, we apply experimental comparisons with respect to six filter-type ranking feature selection methods, of which two are supervised, two are unsupervised, and two are semisupervised. We briefly describe these methods as follows:

(i) Variance score [51]: it is generally known as the simplest unsupervised feature evaluation method. It uses the variance along a specific feature to reflect its representative power. The features with the maximum variance are selected, under the assumption that a feature with higher variance contains more information and is more relevant. The variance score is

$$Var(A_i) = \frac{1}{N} \sum_{n=1}^{N} (A_{in} - \mu_i)^2,$$

where N is the number of data points, $A_{in}$ is the value of feature $A_i$ on a data point $x_n$, and $\mu_i$ is the mean of feature $A_i$.

(ii) Laplacian score [11]: it is a well-known unsupervised feature selection method which not only depends on selecting the features of larger variances and higher representative power but also considers their locality-preserving ability. Its key assumption is that data points within the same class should be close to each other and far apart otherwise. Note that the smaller the Laplacian score, the better. The Laplacian score is based on the following equation:

$$L_i = \frac{\sum_{n,m} (A_{in} - A_{im})^2 S_{nm}}{\sum_{n} (A_{in} - \mu_i)^2 D_{nn}},$$

where D is a diagonal matrix with $D_{nn} = \sum_m S_{nm}$ and $S_{nm}$ is the neighborhood matrix between data points.

(iii) Minimum-redundancy-maximum-relevance (mRMR) [14]: it is a supervised multivariate feature selection method that is said to output a feature subset having the most diverse features (as uncorrelated as possible) while still having a high correlation with the class label. The maximal relevance criterion is

$$\max D(F_s, Y) = \frac{1}{|F_s|} \sum_{A_i \in F_s} I(A_i; Y),$$

and the minimal redundancy criterion is

$$\min R(F_s) = \frac{1}{|F_s|^2} \sum_{A_i, A_j \in F_s} I(A_i; A_j),$$

where $F_s$ is the selected subset of features, $|F_s|$ is the number of features in $F_s$, Y is the vector of class labels, and $I(x; y)$ denotes the mutual information between elements x and y (a greedy sketch is given after this list).

(iv) ReliefF [52]: it is a supervised margin-based feature selection algorithm. It is a robust extension of Relief that chooses random data points and uses them to calculate weights of feature relevance based on a predefined number of nearest neighbors [47].

ReliefF depends on the following update rule over a group of random data points:

$$w_i = w_i - \sum_{k=1}^{K} \frac{\Delta(A_i, x, H_k)}{mK} + \sum_{c \neq class(x)} \frac{P(c)}{1 - P(class(x))} \sum_{k=1}^{K} \frac{\Delta(A_i, x, M_k(c))}{mK},$$

where $w_i$ is the weight assigned to the feature $A_i$, $X_m$ represents the set of random instances used to evaluate w with $|X_m| = m$, K is the number of nearest hits and misses to be considered, c represents a class, and $H_k$ and $M_k(c)$ represent the near-hits and near-misses, respectively. Moreover, P(c) represents the prior probability of class c (estimated from the training set), and $1 - P(class(x))$ normalizes the sum of probabilities over the misses' classes.

(v) Simba with side constraints (Simba-Sc) [23]: it is a semisupervised margin-based algorithm that iteratively utilizes pairwise constraints, specifically cannot-link ones, to evaluate the ability of features to discriminate data points. This score uses a gradient ascent method to maximize its margin-based objective function; thus, a higher score means a more relevant feature. Note that Simba-Sc has a mechanism to deal with redundancy; however, it may still choose correlated features when this contributes positively to the overall performance. The update rule used by Simba-Sc follows the gradient of the margin with respect to w,

$$(\nabla e(w))_i = \frac{1}{2} \sum_{(x_n, x_m) \in C} \left( \frac{(x_{ni} - M(x_n)_i)^2}{\|x_n - M(x_n)\|_w} - \frac{(x_{ni} - H(x_n)_i)^2}{\|x_n - H(x_n)\|_w} \right) w_i,$$

where $H(x_n)$ and $M(x_n)$ are the near-hit and near-miss of $x_n$ and $\|\cdot\|_w$ is the w-weighted Euclidean norm.

(vi) Relief-Sc [33]: the semisupervised margin-based algorithm detailed in Algorithm 2.
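For illustration of the supervised mRMR baseline, a minimal greedy sketch using scikit-learn's mutual information estimators; the original method uses discretized MI, and the difference form (relevance minus redundancy) shown here is one of its standard variants:

```python
# Greedy mRMR sketch (difference variant) with sklearn MI estimators.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, n_keep):
    relevance = mutual_info_classif(X, y)           # I(A_i; Y)
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_keep:
        best, best_score = None, -np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue
            # Mean MI between the candidate and the already-chosen features.
            red = np.mean([mutual_info_regression(X[:, [j]], X[:, i])[0]
                           for j in selected])
            if relevance[i] - red > best_score:
                best, best_score = i, relevance[i] - red
        selected.append(best)
    return selected
```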
In fact, we use the variance and Laplacian scores as they are widely used, well-known unsupervised filter methods [5,25]. We also choose to compare our FCRSC method with the supervised mRMR method since it aims at optimizing the same twofold objective as our method. ReliefF, Simba-Sc, and Relief-Sc, on the other hand, are all margin-based, similarly to the proposed FCRSC. Note that Relief-Sc is essentially a precursor of FCRSC that does not detect feature redundancy; that is why it is important to compare the performance of FCRSC with Relief-Sc to show the significance of our proposed method. Finally, since we aim to position the performance of our proposed method with respect to the supervised, unsupervised, and semisupervised contexts, we chose two feature selection methods from each.
As the three semisupervised constrained algorithms (Simba-Sc, Relief-Sc, and the proposed FCRSC) depend on cannot-link constraints, we generate them in each run (similarly to [23]) as follows, and as sketched below. A pair of data points is chosen randomly from the training set, and the class of each point of the chosen pair is checked. If the two points belong to different classes, the pair is added to the cannot-link constraint set. This operation is repeated until the needed number of constraints is reached for each dataset.
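A sketch of this generation procedure (labels are consulted only to validate the randomly drawn pairs, mirroring the text):

```python
# Sketch of random cannot-link constraint generation from training labels.
import numpy as np

def generate_cannot_links(y_train, n_constraints, seed=0):
    rng = np.random.default_rng(seed)
    C = set()
    while len(C) < n_constraints:
        n, m = rng.choice(len(y_train), size=2, replace=False)
        if y_train[n] != y_train[m]:        # different classes -> cannot-link
            C.add((min(n, m), max(n, m)))   # same-class pairs are redrawn
    return list(C)
```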

Parameter Setting.
For the constrained algorithms that use cannot-link constraints, the number of constraints was set relative to the number of data points available in each dataset. Thus, it was set to 20 for Wine, Ionosphere, Sonar, and Arrhythmia; 40 for WDBC; and 100 for the Spambase dataset. Moreover, the number of starting points for the nonlinear optimization method (gradient ascent) in Simba-Sc is set to its default value given by the authors, i.e., 5. In addition, for fair comparisons, common parameters between different algorithms were set to the same values. For instance, the Laplacian score, ReliefF, Relief-Sc, and FCRSC had their neighborhood size set to 10 in all experiments (similarly to [5]), except for Spambase, where it was set to 60 due to its large sample size. We also set the number of features in the subset obtained by mRMR to be equal to the number of clusters obtained by FCRSC.

Used Classifiers.
We apply four different widely used classification schemes to compare, with generality, the significance of rankings or subsets by each of the used feature selection methods.
(i) K-nearest neighbor (KNN) [53]: it is a simple nonparametric method that can achieve high performance when the number of data points is sufficiently large. It utilizes only the spatial distribution of empirical samples, without any previous assumptions about their class distributions; a new data point is classified by the class of the majority of its K-nearest points. In our experiments, we use K = 1. (ii) Support vector machines (SVM) [54]: they are a set of well-known general learning methods that have become very popular in the last decade. The SVM classifier maximizes a margin between data points called the sample margin. In our experiments, similarly to [5], we apply multiclass SVM by the one-against-one method with a sequential minimal optimization (SMO) solver and a polynomial kernel; we use the fitcecoc and predict Matlab functions. (iii) Naive Bayes (NB) [55]: it is a probabilistic classifier based on Bayes' theorem. It applies classification with a naive (strong) independence assumption between features. In other words, it considers that, given the class labels, features are conditionally independent of each other. WEKA's implementation with default values was used [56]. (iv) Decision tree (C4.5) [57]: it is a well-known classifier that applies an entropy-based criterion to a set of training data to build the decision tree. For instance, the data points can be split into smaller subsets by using a feature as the decision rule. For this purpose, the algorithm measures the information gain at each split. WEKA's implementation with default values was used.
We use more than one classifier, with different decision-making natures and learning processes, in order to provide a fair evaluation of the used filter feature selection methods independently of the applied classification rules. Moreover, these experiments can also be eye-opening as to which classifier is best used with the proposed feature selection method.

Evaluation Metrics.
We evaluate the performance of the different used supervised, unsupervised, and semisupervised feature selection methods in terms of two aspects. One is related to the data classification accuracy obtained by applying a classifier to the selected set of features, and the second is directly related to measuring the redundancy of a feature subset or ranking.
(i) Classification accuracy: it is a supervised metric defined as the percentage of correct predictions. Thus, it evaluates how many data points were correctly classified by the classifiers of Section 3.4 using the selected features:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.

(ii) Representation entropy (RE) [5,58]: it is an unsupervised metric used to compare the redundancy in obtained feature subsets. It attains its maximum value when all the eigenvalues are equally important, which means the level of uncertainty is maximal; this indicates that information is evenly distributed along all the principal directions. On the contrary, it attains a value of zero only when all the information lies along a single principal coordinate direction (all eigenvalues are equal to zero except one). The RE of a d-sized feature subset, denoted by $H_R$, is calculated as follows:

$$H_R = -\sum_{j=1}^{d} \tilde{\lambda}_j \log \tilde{\lambda}_j,$$

where $\tilde{\lambda}_j$ is the normalized version of $\lambda_j$, one eigenvalue of the $d \times d$ covariance matrix of the respective feature space:

$$\tilde{\lambda}_j = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k}.$$
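A sketch of the representation entropy computation following the formula above:

```python
# Representation entropy of a candidate feature subset (columns of X_subset).
import numpy as np

def representation_entropy(X_subset):
    cov = np.cov(X_subset, rowvar=False)     # d x d covariance matrix
    lam = np.linalg.eigvalsh(cov)
    lam = lam / lam.sum()                    # normalized eigenvalues
    lam = lam[lam > 0]                       # treat 0*log(0) as 0; drops numeric noise
    return float(-(lam * np.log(lam)).sum())
```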

Performance Evaluation in Terms of Classification Accuracy and Feature Set Redundancy for Constrained Algorithms.
According to the previously detailed experimental setup, we first compare the performance of the proposed FCRSC method with that of the constrained feature selection methods (Relief-Sc and Simba-Sc) described in Section 3.2. We chose to compare our method closely with these algorithms first, as they belong to the same supervision context as the proposed method and depend on evaluation functions of similar nature. The comparison is applied in terms of two important aspects: the classification accuracy a ranked feature set can achieve and the amount of redundancy that this set possesses. In fact, we aim at showing how FCRSC improves the performance of its precursor Relief-Sc and at comparing its performance with its competitor Simba-Sc in terms of both aspects. In addition, the C4.5 classifier was used here, as it cannot detect feature interactions [59], a capability that the compared Relief-based algorithms have [60].
Hence, Figures 3 and 4 show the averaged accuracy rates obtained by the C4.5 classifier on the ranked feature sets produced by the constrained filter methods (Simba-Sc, Relief-Sc, and the proposed FCRSC) on the Wine, WDBC, Ionosphere, Spambase, Sonar, and Arrhythmia datasets over 10 independent runs. Each row of the two figures corresponds to one dataset, presenting the averaged classification accuracy on the left and the corresponding averaged representation entropy on the right.
From Figures 3 and 4, we can see that FCRSC was generally able to obtain a feature subset that provides better classification accuracy while being less redundant. This was true except for Spambase in Figures 4(a) and 4(b), where the three algorithms performed approximately the same in terms of accuracy and redundancy, and for Ionosphere in Figures 3(e) and 3(f), where FCRSC outperformed Relief-Sc and competed with Simba-Sc. On the other hand, FCRSC proved that it can significantly improve the classification performance of its constrained antecedent Relief-Sc (e.g., on Ionosphere, Sonar, and Arrhythmia) by compromising between maximum relevance and minimum redundancy in order to compose a subset that holds either weakly relevant but nonredundant features or strongly relevant ones. The fact that FCRSC maintains or slightly increases the classification accuracy of Relief-Sc while decreasing the number of ranked features, together with the fact that the representation entropy is higher, validates the hierarchical feature clustering upon the sparse graph. This shows that our suggested grouping of features and our representation of the feature space were in harmony with margin maximization, leading to the expected behavior. To be precise, from Figure 3(a) on the Wine dataset, we can see that FCRSC outperformed both constrained algorithms. Although Figure 3(b), presenting RE on Wine, shows that Simba-Sc chose less redundant features as the first three, FCRSC still outperformed it in terms of classification accuracy. In addition, the results on the WDBC dataset were interesting: Figures 3(c) and 3(d) show that FCRSC was better than Relief-Sc and Simba-Sc in both classification accuracy and redundancy reduction. For instance, FCRSC allowed a maximum classification accuracy of 94.60% after only 12 features out of 30 in the original space. Moreover, FCRSC on the Sonar dataset, as can be seen in Figures 4(c) and 4(d), outperformed Simba-Sc and Relief-Sc from the first few features until it reached its maximum of 75.94% with only 19 features out of 60 in the original space. However, starting from the 27th chosen feature, Simba-Sc and Relief-Sc performed slightly better, although FCRSC maintained less redundant feature subsets throughout all of its ranked features. It is important to mention that FCRSC enhanced the performance of Relief-Sc over Simba-Sc in approximately all panels of Figures 3 and 4, which means that the drawback of Relief-Sc (see Section 2.3) was overcome by FCRSC. This also shows that the ability of Simba-Sc to detect redundant features was the reason it could outperform Relief-Sc. This observation confirms the equal importance of redundancy removal and relevance when selecting features for classification. It also shows the superiority of FCRSC over its only similar-by-nature semisupervised competitors (Simba-Sc and Relief-Sc), noting the straightforwardness and ease of reproducing the feature clustering part of FCRSC, unlike Simba-Sc [23]. Finally, as can be seen in Figures 4(e) and 4(f) on Arrhythmia, FCRSC clearly outperformed Simba-Sc, which was not able to find a feature subset providing at least the classification accuracy obtained without feature selection. In fact, Relief-Sc was able to find such a subset, but its performance lagged behind that of FCRSC, which found a smaller feature subset with better classification accuracy and less redundancy among its features.

Performance Evaluation in Terms of Classification Accuracy Using Multiple Classifiers for the Unsupervised, Supervised, and FCRSC Algorithms.
For the sake of generality, in this section, we compare the classification performance of the proposed constrained FCRSC method with that of the unsupervised and supervised methods described in Section 3.2 using three different classifiers. We use the well-known KNN, SVM, and NB classifiers, each depending on a different decision-rule nature, to show the general positioning of FCRSC's performance with respect to some well-known state-of-the-art feature selection algorithms.
This also shows whether a performance degradation or improvement is classifier-dependent or is really imposed by the chosen feature subset.
Each of Figures 5-10 corresponds to a dataset, with the classification accuracies obtained by three different classifiers upon the feature subsets ranked by the variance score, the Laplacian score, ReliefF, mRMR, and the proposed FCRSC. From these figures, we can see that, in general, FCRSC was always able to obtain a higher accuracy curve compared to the unsupervised variance and Laplacian scores, except on WDBC, where the five feature selection algorithms showed interfering and fluctuating accuracy curves from the first few ranked features, as can be seen in Figure 6. FCRSC was also sometimes able to compete with the supervised methods, as can be seen in Figure 5 on Wine, in Figure 9 on Sonar, and in Figure 10 on Arrhythmia.
For instance, Figure 5 shows that, using the three different classifiers on the Wine dataset, FCRSC performed better than the unsupervised variance score on all of them from the first few features, and it also outperformed the Laplacian score; however, the latter chose a better starting feature. This means that the locality-preserving ability of the first feature was more significant for classification than the constraint-relevance objective respected by FCRSC.
This can be mitigated by improving the choice of constraints [22]. In addition, as mentioned before, FCRSC competed with the supervised ReliefF and mRMR, as can be seen in Figures 5(a) and 5(b), where FCRSC and mRMR in fact performed approximately the same. This can be attributed to their similar behavior in compromising between maximizing relevance and minimizing redundancy.
Moreover, FCRSC on Ionosphere, as can be seen in Figure 7, clearly outperformed the unsupervised methods on the three classifiers in a very similar manner. Again, Figures 7(a) and 7(b) show close classification accuracy values recorded by the supervised mRMR and the constrained FCRSC.
On the other hand, on the Spambase dataset, as can be seen in Figure 8, feature selection was generally not significant for any algorithm (the best accuracy was obtained on the original feature set), except for ReliefF and mRMR with the NB classifier shown in Figure 8(c). This can be due to the fact that some datasets need all the available features to obtain the best classification performance, especially when they are not very high-dimensional. However, the performance of FCRSC was generally stable with the three used classifiers; it lies between the supervised and unsupervised methods, except for the SVM classifier presented in Figure 8(b), where the unsupervised methods together with FCRSC outperformed mRMR. In addition, the results on Sonar and Arrhythmia were very interesting. As can be seen in Figure 9(a) on the Sonar dataset, both the variance and Laplacian scores reached their highest accuracy rates (82.03%) on the full feature space, i.e., 60 features, whereas FCRSC obtained an accuracy of 83.04% with only 37 features. This shows that, in this case, the unsupervised variance and Laplacian scores could not obtain a smaller feature subset providing a similar or better classification compared to the original one. It is important to note here that the performance degradation of FCRSC between approximately the 27th and 37th features on Sonar with the C4.5 classifier (analyzed in the previous section and presented in Figure 4(c)) was classifier-related, since good performance was obtained for the same features using the KNN, SVM, and NB classifiers.
On the other hand, Figure 10 on Arrhythmia also shows that FCRSC outperforms the unsupervised algorithms and competes with the supervised ones. In fact, it was also able to obtain a similar or higher classification accuracy compared to the one obtained with the full feature space using approximately only half the number of features. For example, an accuracy of 55.3% on 117 features was recorded using FCRSC with KNN (Figure 10(a)), compared to 52.93% on 279 features without feature selection and 55.5% on 141 features using ReliefF and mRMR.
In conclusion, FCRSC aims at finding a relevant and nonredundant feature subset that either maintains the classification accuracy obtained on the full feature space or provides enhanced accuracy (through removing irrelevant and noisy features). It is important to note that although FCRSC chooses a final subset of features (size(F_s) < F), it is still a ranking feature selection method, which means that, similarly to the other feature selection methods mentioned in this paper, subsets smaller than F_s can also be obtained. Moreover, as FCRSC utilizes the Relief-Sc algorithm for its margin maximization objective, it was clear from the results that removing redundant features in addition to irrelevant ones allows better classification performance. Finally, using more than one classifier with different decision-making natures provided a fair evaluation of the used filter feature selection methods and demonstrated the general independence between them and the classifiers.

Execution Time Comparison.
In this section, we consider the different supervised, unsupervised, and semisupervised feature selection methods from the execution-time perspective. Thus, Table 2 presents the average execution time (in ms) of each feature selection method mentioned in Section 3.2 over 10 independent runs.
From this table, we can see that the variance score, as expected, had the lowest execution time with the smallest differences between datasets, since it is the simplest method of all. However, Laplacian, ReliefF, and Simba-Sc show a large increase in execution time as the number of data points increases significantly, as was the case between Spambase (4601 data points) and Sonar (208 data points) with approximately equal numbers of features. Although Relief-Sc and FCRSC also show an increase in the latter case, it is less steep. Noting that Simba-Sc, Relief-Sc, and FCRSC all depend on constraints and are provided with the same number on each dataset, Simba-Sc consumes much more time because it is repeated from 5 different starting points (gradient ascent) to optimize its objective function. On the other hand, as the number of features increased significantly between 13 for Wine and 60 for Sonar (with a similar number of data points and the same number of provided constraints, i.e., 20), the execution time of all the algorithms increased reasonably; however, Relief-Sc increased very little compared to FCRSC, due to the time needed by FCRSC to apply the feature clustering and cluster-representative selection steps.
Hence, the overall average execution time of each method on all datasets, shown in the last row of Table 2, indicates that FCRSC was generally faster than Laplacian, ReliefF, and Simba-Sc. Thus, in terms of computational complexity, we can say that FCRSC is mainly related to the following:

(i) The number of cannot-link constraints |C|

(ii) The number of data points N

(iii) The dimension of the feature space F

In big O notation, the constrained Relief-Sc can be computed in O(|C|NF). Furthermore, the ranking step used within all ranking feature selection methods needs O(F log F). The proposed FCRSC is divided into multiple steps, some of which can be done in parallel (clustering the features and calculating their margin weights). For instance, the construction of the sparse graph costs O(F²) [36], and the single-linkage hierarchical clustering of features also needs O(F²). Hence, the overall computational complexity of FCRSC is O(max(|C|NF, F², F log F)).

Conclusion
In this work, a semisupervised feature selection approach was proposed. The main contribution is the novel combination of feature clustering upon a sparse graph with a margin-based objective function, called FCRSC.
This approach handles the two core aspects of feature selection (relevance and redundancy) in three main building blocks. First, it constructs the similarity matrix between features through sparse representation. Second, on top of the latter, feature clustering is applied simultaneously with the application of the margin-based algorithm Relief-Sc. Finally, FCRSC obtains its final feature subset by choosing, from each cluster of features, the feature that most enlarges the margin, hence maximizing relevance while minimizing redundancy. The performance of this approach was compared to that of supervised, unsupervised, and semisupervised filter feature selection methods using four different classification schemes on six well-known UCI benchmark machine learning datasets. The results showed the satisfactory performance of FCRSC, which outperformed the unsupervised and semisupervised methods on most datasets and also competed with the supervised ones. We believe that this research can be eye-opening for interesting future work tackling some limitations or ideas that were not covered in this paper.
This includes, first, tailoring the work to a specific application like document classification, where the used cannot-link constraints can be domain-specific and actively chosen (instead of randomly selected as in this work), which we believe would enhance their quality and thus the classification performance of FCRSC. Second, the dendrogram cutoff choice for finding the final clustering solution can sometimes be tricky; this merits more analysis on a domain-specific dataset. Although in our work we empirically chose the cutoff where clusters of clusters start to be grouped together, this can be a drawback when the feature space becomes too large.

Data Availability
The data used to support the findings of this study are openly available in the UCI archive.

Conflicts of Interest
The authors declare that they have no conflicts of interest.