Feature Selection for Better Identification of Subtypes of Guillain-Barré Syndrome

Guillain-Barré syndrome (GBS) is a neurological disorder which has not been explored using clustering algorithms. Clustering algorithms perform more efficiently when they work only with relevant features. In this work, we applied correlation-based feature selection (CFS), chi-squared, information gain, symmetrical uncertainty, and consistency filter methods to select the most relevant features from a real dataset of 156 features. This dataset contains clinical, serological, and nerve conduction test data obtained from GBS patients. The most relevant feature subsets, determined with each filter method, were used to identify four subtypes of GBS present in the dataset. We used the partitions around medoids (PAM) clustering algorithm to form four clusters, corresponding to the GBS subtypes. We used the purity of each clustering as the evaluation measure. After experimentation, symmetrical uncertainty and information gain determined a feature subset of seven variables. Used together as a dataset, these variables served as input to PAM and reached a purity of 0.7984. This result leads to a first characterization of this syndrome using computational techniques.


Introduction
Guillain-Barré syndrome (GBS) is an autoimmune neurological disorder characterized by a fast evolution, generally from a few days up to four weeks [1]. GBS has an incidence of 1.3 to 2 per 100,000 people and a mortality rate from five to fifteen percent. The exact cause of GBS is unknown; however, it is frequently preceded by either a respiratory or a gastrointestinal infection. The diagnosis of GBS includes clinical, serological, and electrophysiological criteria [2]. The severity of GBS varies among subtypes, which can be mainly acute inflammatory demyelinating polyneuropathy (AIDP), acute motor axonal neuropathy (AMAN), acute motor sensory axonal neuropathy (AMSAN), and Miller-Fisher syndrome [1]. Electrodiagnostic criteria for distinguishing AIDP, AMAN, and AMSAN are well established in the literature [3], while the Miller-Fisher subtype is characterized by the clinical triad: ophthalmoplegia, ataxia, and areflexia [1].
A better understanding of the differences in the GBS subtypes is critical for the implementation of appropriate treatments for total recovery and in certain cases for the survival of patients. Hospitalization time and the cost of treatments vary according to the severity of the specific subtype. Finding a minimum feature subset to accurately identify GBS subtypes could lead to a simplified and cheaper process of diagnosis and treatment of the GBS case. The ultimate goal of a physician is to get patients to a full recovery. This can be more effectively achieved when an early diagnosis of the case is performed using a minimum number of medical features.
This work constitutes a first attempt at using machine learning techniques, specifically cluster analysis in combination with filter methods for feature selection, to identify GBS subtypes. We aim at finding a small feature subset to identify four GBS subtypes. Machine learning techniques have been used in the literature to predict the prognosis of this syndrome [4,5] as well as to find predictors of respiratory failure and the need for mechanical ventilation in GBS patients [6][7][8]. Nevertheless, we found no previous publications on the identification of specific subtypes of the syndrome using machine learning techniques.
Cluster analysis is a machine learning technique that has been shown to be useful for finding different groups of objects in datasets [9][10][11][12]. However, datasets might contain a mixture of "bad" and "good" features. "Bad" features are redundant or noisy and make algorithms slow and inaccurate. Feature selection techniques reduce the dimensionality of a dataset so that it contains only "good" features, which improves the performance of the algorithms and makes a higher accuracy attainable [13]. Several feature selection methods are available; they are usually classified as filter [14][15][16][17][18][19], wrapper [20][21][22], embedded [23][24][25], and hybrid [26][27][28][29]. From the machine learning point of view, it is interesting to analyze the performance of feature selection methods in diverse scenarios with real data, such as this one.
In this work we use a real dataset consisting of 156 features and 129 cases of GBS patients; these are 20 AIDP cases, 37 AMAN, 59 AMSAN, and 13 Miller-Fisher cases. The dataset contains clinical, serological, and nerve conduction tests data.
We use the PAM (Partitions Around Medoids) clustering algorithm to identify, with the highest purity, groups corresponding to the four subtypes of GBS. A group with high purity contains the largest number of elements of the same type and the smallest number of elements of a different type. Purity is an external clustering validation metric that evaluates the quality of a clustering by comparing the grouping of objects into clusters with the ground truth. Although there are several clustering validation metrics, both internal and external [30], we selected purity since our interest was to find "pure" groups and to take advantage of the available prior knowledge of the true labels. The use of prior knowledge to evaluate a clustering process is also known as supervised or semisupervised clustering; some examples can be found in [31][32][33][34].
In order to identify the four groups with a high purity it is necessary to select the relevant features in the dataset; otherwise the purity would be compromised, as stated in [13]. For this initial exploratory study, we chose filter methods as they are the simplest and least computationally demanding methods available in the literature and as they work independently of the clustering algorithm. We focus on five filter methods: correlation-based feature selection (CFS), chi-squared, information gain, consistency, and symmetrical uncertainty.
The experimental results showed a good performance of the method and allowed us to obtain a first characterization of GBS using machine learning techniques.  [1][2][3]. This dataset is not yet publicly available and this is the first time it is used in an experimental study. No public dataset was found to be used as a benchmark.

Materials and Methods
Originally, the dataset consisted of 365 attributes corresponding to epidemiological data, clinical data, results from two nerve conduction tests, and results from two cerebrospinal fluid (CSF) analyses. The second nerve conduction test was conducted in 22 patients and the second CSF analysis was conducted in 47 patients only. Therefore, data from these two tests were excluded from the dataset.
The diagnostic criteria for GBS are established in the literature [1][2][3]. These formal criteria were considered to determine which variables from the original dataset could be important in the characterization of the four subtypes of GBS. We made a preselection of variables based on these criteria. Originally, the dataset had 365 variables. After preselection, it was left with 156 variables: 121 variables from the nerve conduction test, 4 variables from the CSF analysis, and 31 clinical variables. As for the type of attributes, these are 28 categorical and 128 numeric attributes. The situation of dealing with mixed data types was solved using Gower's similarity coefficient, as explained later.

Filter Methods.
We selected filter methods for this initial exploratory study as they are in computational terms the fastest and simplest methods available in the literature for feature selection. Filters work independently from any clustering algorithm and base their decision solely on characteristics of data.
We chose these five particular methods based on their performance reported in the literature [15,17,35,36]. The chosen filters apply diverse criteria to evaluate feature relevance. The filters investigated are CFS, chi-squared, information gain, symmetrical uncertainty, and consistency.

Correlation-Based Feature Selection (CFS).
CFS [14] evaluates two aspects of a feature subset: its capacity to predict the class and the correlation between the features of the subset. This method seeks to maximize the first aspect and minimize the second one. It results in a feature subset with the highest capacity to predict the class and the least correlation between features of the subset. Given a feature subset $S$ containing $k$ features, CFS finds the goodness of $S$, denoted $M(S)$, as follows:
$$M(S) = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where $\overline{r_{ff}}$ is the average correlation of all feature-feature pairs, and $\overline{r_{cf}}$ is the average correlation of all feature-class pairs.
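As a small illustration of the CFS criterion, the following sketch computes the merit of a subset from its size and its two average correlations, assuming the usual form of the merit (k times the mean feature-class correlation, divided by the square root of k + k(k-1) times the mean feature-feature correlation). The function name `cfs_merit` and the correlation values are hypothetical and are not taken from the GBS dataset.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset given its two average correlations."""
    if k < 1:
        raise ValueError("k must be at least 1")
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denominator


# Hypothetical values: two subsets with the same predictive power (0.5)
# but different redundancy among their features.
low_redundancy = cfs_merit(5, 0.5, 0.1)
high_redundancy = cfs_merit(5, 0.5, 0.8)
print(low_redundancy > high_redundancy)  # the less redundant subset scores higher
```

With equal feature-class correlation, the subset whose features are less correlated with each other receives the higher merit, which is the behavior the method is designed to reward.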

Chi-Squared.
This method evaluates the chi-square statistic of each feature taken individually with respect to the class [15] and provides a feature ranking as a result. The chi-square test for a feature $f$ and the class $c$ is defined as follows:
$$\chi^2(f, c) = \frac{N\,\left[P(f, c)\,P(\bar{f}, \bar{c}) - P(f, \bar{c})\,P(\bar{f}, c)\right]^2}{P(f)\,P(\bar{f})\,P(c)\,P(\bar{c})}$$
where $N$ is the number of observations in the dataset, $P(f, c)$ is the joint probability of $f$ and $c$, $P(f)$ is the marginal probability of $f$, and a bar denotes the complementary event.

Algorithm 1: PAM.
(1) $k$ objects are arbitrarily selected as the initial medoids $m$.
(2) repeat
(2.1) The distance is computed between each remaining object $o$ and the medoids $m$.
(2.2) Each object $o$ is assigned to the cluster with the nearest medoid $m$.
(2.3) An initial total cost $\mathrm{cost}_{ini}$ is calculated.
(2.4) A random nonmedoid object $o_{random}$ is selected.
(2.5) A total cost $\mathrm{cost}_{fin}$ is calculated as a result of swapping an arbitrary $m$ with the randomly selected $o_{random}$.
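For the chi-squared statistic described in this subsection, the sketch below shows one way to compute it from joint and marginal probabilities. It assumes the common two-by-two form for a binary feature and a binary class indicator; the function name `chi_squared_2x2` and the counts are hypothetical, and multi-valued or numeric features would need discretization first, which is not shown.

```python
def chi_squared_2x2(n_fc, n_f_notc, n_notf_c, n_notf_notc):
    """Chi-squared statistic from the four cell counts of a 2x2 table.

    n_fc: feature present, class present; n_f_notc: feature present,
    class absent; and so on for the remaining two cells.
    """
    n = n_fc + n_f_notc + n_notf_c + n_notf_notc
    # Joint probabilities of each (feature, class) combination.
    p_fc, p_f_notc = n_fc / n, n_f_notc / n
    p_notf_c, p_notf_notc = n_notf_c / n, n_notf_notc / n
    # Marginal probabilities of feature presence and class presence.
    p_f = p_fc + p_f_notc
    p_c = p_fc + p_notf_c
    denominator = p_f * (1 - p_f) * p_c * (1 - p_c)
    if denominator == 0:
        return 0.0  # a constant feature or class carries no information
    numerator = n * (p_fc * p_notf_notc - p_f_notc * p_notf_c) ** 2
    return numerator / denominator


# Hypothetical counts: a perfectly associated feature and an independent one.
perfect = chi_squared_2x2(10, 0, 0, 10)
independent = chi_squared_2x2(5, 5, 5, 5)
```

Ranking features by this value, largest first, gives the kind of ordering that the chi-squared filter returns.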

Information Gain.
Information gain measures the goodness of a feature to predict the class given that the presence or absence of the feature in the dataset is known. This method delivers a ranking according to the goodness of each feature. Information gain [16] of a feature $f_j$ and a class $c_i$ is defined as follows:
$$IG(f_j, c_i) = \sum_{c \in \{c_i, \bar{c}_i\}} \; \sum_{f \in \{f_j, \bar{f}_j\}} P(f, c)\,\log\frac{P(f, c)}{P(f)\,P(c)}$$
where $c \in \{c_i, \bar{c}_i\}$ is the set of all classes, $f \in \{f_j, \bar{f}_j\}$ is the set of all features, $P(f, c)$ is the joint probability of feature $f$ and class $c$, and $P(f)$ and $P(c)$ are the marginal probabilities of $f$ and $c$, respectively.
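The definition of information gain can be illustrated with a short sketch that estimates the joint and marginal probabilities from paired observations of a discrete feature and the class. This is a minimal example under the assumption that the feature is already discrete; the function name `information_gain`, the base-2 logarithm, and the toy values are choices made here for illustration.

```python
import math
from collections import Counter


def information_gain(feature_values, class_labels):
    """Sum of P(f, c) * log2(P(f, c) / (P(f) * P(c))) over observed pairs."""
    n = len(class_labels)
    joint = Counter(zip(feature_values, class_labels))
    feature_counts = Counter(feature_values)
    class_counts = Counter(class_labels)
    gain = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = feature_counts[f] / n
        p_c = class_counts[c] / n
        gain += p_fc * math.log2(p_fc / (p_f * p_c))
    return gain


# Toy data: one feature that separates two classes perfectly and one that
# is independent of the class.
classes = [0, 0, 1, 1]
informative = information_gain(["a", "a", "b", "b"], classes)
uninformative = information_gain(["a", "b", "a", "b"], classes)
```

Pairs that never occur contribute nothing to the sum, so iterating only over the observed pairs is sufficient.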

Consistency.
This method finds the smallest feature subset that presumably improves the discriminatory power of the original feature subset. This subset has the highest consistency. The consistency for a given feature subset $S$ is computed as follows [17]: Let us define a pattern $p$ as a set of values for $S$. An inconsistency arises when two patterns match exactly on all attributes except for the class. The inconsistency count for a pattern $p$ is the number of times it appears in the dataset minus the number of times it appears in the majority class. The inconsistency rate is the sum of all the inconsistency counts for all possible patterns of $S$ divided by the total number of patterns [18].
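The inconsistency rate can be sketched in a few lines. The example below assumes the usual reading in which the sum of inconsistency counts is divided by the number of instances in the dataset; the function name `inconsistency_rate` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels):
    """Inconsistency rate of a feature subset.

    rows holds, for each instance, its values on the candidate subset;
    labels holds the corresponding classes.
    """
    classes_by_pattern = defaultdict(Counter)
    for row, label in zip(rows, labels):
        classes_by_pattern[tuple(row)][label] += 1
    # For each pattern: occurrences minus occurrences in its majority class.
    total = sum(
        sum(counts.values()) - max(counts.values())
        for counts in classes_by_pattern.values()
    )
    return total / len(labels)


# Toy data: the pattern (0, 1) appears three times, twice with class "A".
rows = [(0, 1), (0, 1), (0, 1), (1, 0)]
labels = ["A", "A", "B", "B"]
rate = inconsistency_rate(rows, labels)
```

A subset with a rate of zero distinguishes the classes as well as its patterns allow; the filter searches for the smallest subset whose rate stays acceptably low.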

Symmetrical Uncertainty.
This method measures the correlation between pairs of attributes using normalization of information gain. The normalization is performed to compensate for the bias of information gain toward attributes with more values and to ensure that they are comparable [17]. This method results in a feature ranking. Symmetrical uncertainty is computed as follows [19]:
$$SU(X, Y) = 2\,\frac{I(X; Y)}{H(X) + H(Y)}, \qquad I(X; Y) = \sum_{x \in R_X} \sum_{y \in R_Y} P(x, y)\,\log\frac{P(x, y)}{P(x)\,P(y)}$$
where $P(x)$ is the marginal probability of feature $X$, $R_X$ is the range of feature $X$, $P(x, y)$ is the joint probability of features $X$ and $Y$, and $H$ denotes entropy. Entropy is computed using the classical equation discussed in [17].
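To make the effect of the normalization concrete, the following sketch computes symmetrical uncertainty as twice the mutual information divided by the sum of the two entropies, with Shannon entropy in base 2. This is a minimal illustration for discrete values; the function names and toy data are assumptions made here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    mutual_information = h_x + h_y - entropy(list(zip(x, y)))
    return 2.0 * mutual_information / (h_x + h_y)


# Toy data: both features determine the class exactly, but the second one
# does so by taking a different value for every instance.
classes = [0, 0, 1, 1]
binary_feature = symmetrical_uncertainty([0, 0, 1, 1], classes)
many_valued_feature = symmetrical_uncertainty([0, 1, 2, 3], classes)
```

Both toy features have the same information gain with respect to the class, yet the many-valued one receives a lower symmetrical uncertainty, which is the compensation described above.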

Clustering Algorithm: Partitions Around Medoids (PAM).
As stated before, the dataset used in this work combines categorical and numeric data. PAM is a clustering algorithm capable of handling such situations. It receives a distance matrix between observations as input. The distance matrix was computed using Gower's coefficient, explained later. PAM, introduced by Kaufman and Rousseeuw [37], aims to group data around the most central item of each group, known as the medoid, which has the minimum sum of dissimilarities with respect to all data points. PAM forms clusters that minimize the total cost of the configuration, defined as
$$\mathrm{cost} = \sum_{i=1}^{k} \sum_{o \in C_i} \mathrm{dist}(o, m_i)$$
where $k$ is the number of clusters, $o \in C_i$ is the set of objects in cluster $C_i$, and $\mathrm{dist}(o, m_i)$ is the distance between an object $o$ and a medoid $m_i$. PAM works as shown in Algorithm 1 [38].
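The swap-based search that PAM performs on a precomputed distance matrix can be sketched as follows. This is a simplified, deterministic variant written for illustration: instead of drawing a random nonmedoid object as in Algorithm 1, it tries every possible swap and keeps any that lowers the total cost. The function names, the default initial medoids, and the toy points are assumptions, and this is not the implementation used in the experiments.

```python
def total_cost(dist, medoids):
    """Sum over all objects of the distance to their nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k, initial_medoids=None):
    """Cluster objects around k medoids given a square distance matrix."""
    n = len(dist)
    medoids = list(initial_medoids) if initial_medoids is not None else list(range(k))
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                # Swap the medoid at position pos with the candidate object.
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    # Assign every object to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels, best


# Toy one-dimensional points forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam(dist, k=2, initial_medoids=[0, 1])
```

Because the algorithm only needs pairwise distances, it can be combined with any dissimilarity measure, including one derived from Gower's coefficient.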

Gower's Similarity Coefficient.
Distance metrics are used in clustering tasks to compute the distance between objects. The distance computed is used by clustering algorithms to determine how much similar or dissimilar the objects are and what cluster they belong to. There are many distance metrics. Some of them deal with numeric data, like Euclidean, Manhattan, and Minkowski [38]. To deal with binary data the Jaccard coefficient and Hamming are often used [38]. For categorical data, some distance metrics are Overlap, Goodall, and Gambaryan [39].
In this work we used for experimentation a dataset that contains mixed data, that is, both categorical and numeric data. To deal with this situation we selected Gower's coefficient, introduced by Gower in 1971 [40]. It is a robust and widely used similarity measure for mixed data. We used this coefficient to obtain a matrix of distances between observations, as PAM requires. Gower's coefficient is defined as follows [41]:
$$s_{ij} = \frac{\sum_{h=1}^{p_1}\left(1 - \dfrac{|x_{ih} - x_{jh}|}{G_h}\right) + a + \alpha}{p_1 + (p_2 - d) + p_3}$$
where $p_1$ is the number of quantitative variables, $p_2$ is the number of binary variables, $p_3$ is the number of qualitative variables, $\alpha$ is the number of coincidences for qualitative variables, $a$ is the number of coincidences in 1 (feature presence) for binary variables, $d$ is the number of coincidences in 0 (feature absence) for binary variables, and $G_h$ is the range of the $h$th quantitative variable. Gower's coefficient is within the range 0-1. A value near 1 indicates strong similarity between items and a value near 0 indicates weak similarity.
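As an illustration of how Gower's coefficient combines the three kinds of variables, the sketch below computes the similarity between two records. It assumes that quantitative variables contribute one minus the range-scaled absolute difference, that binary variables count joint presences and ignore joint absences, and that qualitative variables count exact matches. The variable names (`age`, `fever`, `ptosis`, `sex`) and their values are hypothetical and do not come from the GBS dataset.

```python
def gower_similarity(x, y, kinds, ranges):
    """Gower similarity between two records stored as dictionaries.

    kinds maps each variable name to "quantitative", "binary", or
    "qualitative"; ranges maps quantitative names to their range.
    """
    numerator = 0.0
    denominator = 0
    for name, kind in kinds.items():
        if kind == "quantitative":
            numerator += 1.0 - abs(x[name] - y[name]) / ranges[name]
            denominator += 1
        elif kind == "binary":
            if x[name] == 1 and y[name] == 1:
                numerator += 1.0  # coincidence in presence
                denominator += 1
            elif x[name] != y[name]:
                denominator += 1  # mismatch counts against similarity
            # joint absence (0, 0) is left out of both sums
        elif kind == "qualitative":
            numerator += 1.0 if x[name] == y[name] else 0.0
            denominator += 1
        else:
            raise ValueError(f"unknown variable kind: {kind}")
    return numerator / denominator if denominator else 0.0


kinds = {"age": "quantitative", "fever": "binary",
         "ptosis": "binary", "sex": "qualitative"}
ranges = {"age": 50.0}
patient_a = {"age": 30, "fever": 1, "ptosis": 0, "sex": "F"}
patient_b = {"age": 40, "fever": 1, "ptosis": 0, "sex": "M"}
similarity = gower_similarity(patient_a, patient_b, kinds, ranges)
dissimilarity = 1.0 - similarity  # the form a distance-based method expects
```

Evaluating this function for every pair of records and taking one minus the result gives a dissimilarity matrix of the kind PAM takes as input.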

Metrics to Evaluate the Quality of a Clustering Process.
The quality of a clustering process can be evaluated using two types of metrics: internal and external. Internal metrics evaluate the quality of a clustering process based on intrinsic characteristics, typically intra- and intercluster distances. Internal metrics assign high scores to clusterings with large distances between clusters (separability) and short distances among members of the same cluster (compactness). These metrics are very useful when the number of clusters is not known at all. Examples of internal metrics are Q-modularity [42], the Davies-Bouldin index, the Dunn index, and silhouette [43].
External metrics evaluate the quality of clusters based on data not used during the clustering process, such as the ground truth, that is, the real classes of the instances. The larger the number of instances correctly located according to the ground truth, the higher the index. Some examples of external metrics are the Rand index, the Fowlkes and Mallows index, Hubert's Γ statistic [30], and purity [44].

Purity.
The dataset used in this work provides the ground truth. We know there are four classes in the dataset. The objective of this study was to find the features that identify with the highest accuracy possible four clusters, each corresponding to one class. To achieve this goal we selected purity as the metric to evaluate the quality of the clustering process.
Purity validates the quality of a clustering process based on the locations of data in each cluster with respect to the true classes. The more objects in each resultant cluster belong to the true class, the higher the purity. Formally [44],
$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$$
where $N$ is the number of samples, $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters found by the clustering algorithm, $C = \{c_1, c_2, \ldots, c_J\}$ is the set of the classes of the objects, $|\omega_k \cap c_j|$ is the number of objects of cluster $k$ being in class $j$, $c_j$ is the set of objects in class $j$, and $\omega_k$ is the set of objects in cluster $k$.
The value of purity ranges from 0 to 1. A purity value of 1 indicates that all the objects in each cluster belong to the same class. An example of purity calculation is shown in Table 1.
The number of objects of the majority class in each cluster is shown in bold. The purity of the clustering is computed as follows: (9 + 14 + 21)/51 = 44/51 = 0.8627.
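The purity calculation can be reproduced with a few lines of code. In the sketch below, the majority counts (9, 14, and 21 out of 51 objects) match the worked example, while the remaining cell counts and the labels "A", "B", and "C" are hypothetical, chosen only so that the totals agree; the function name `purity` is also an assumption.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    majority_total = 0
    for cluster in set(cluster_labels):
        members = [t for c, t in zip(cluster_labels, true_labels) if c == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(true_labels)


# Hypothetical contingency table: per cluster, the count of each class.
table = {
    1: {"A": 9, "B": 2, "C": 1},
    2: {"A": 1, "B": 14, "C": 2},
    3: {"A": 0, "B": 1, "C": 21},
}
cluster_labels, true_labels = [], []
for cluster, counts in table.items():
    for label, count in counts.items():
        cluster_labels += [cluster] * count
        true_labels += [label] * count

example_purity = purity(cluster_labels, true_labels)
print(round(example_purity, 4))  # 0.8627
```

Only the majority count of each cluster enters the sum, so any arrangement of the minority cells with the same totals gives the same purity.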

Experimental Design.
We used the 156-feature GBS dataset, described earlier, for experiments. This dataset contains a combination of categorical and numeric features. Gower's coefficient is able to deal with both types of features when present in the same dataset. We used this method to compute the distance matrix among instances, which is required as input to the PAM algorithm.
As we know beforehand, there are four GBS subtypes present in our dataset. This is why the number of clusters requested from the PAM algorithm in our experiments was k = 4. We expected that the clustering algorithm would identify each subtype as a cluster, with the highest purity possible. Five filter methods were used for feature selection, as clustering algorithms perform more efficiently when they work only with relevant attributes [13].
The class attribute was not used when the clustering algorithm was executed. We used it to compute the purity of the clusters obtained with PAM.
A baseline purity using all the 156 features included in the dataset was computed. This value was compared with the purity obtained using only the relevant features as determined by each filter method. Such comparison would allow for a clear view of the benefits of the feature selection process over using the entire dataset, in terms of purity.
Each of the five filter methods selected for experiments in this work was applied to the 156-feature dataset. Along with the features, the class attribute was included in the dataset during the filtering process.
As previously described, CFS and consistency methods include in their output the subset with the most relevant features found. In contrast, chi-squared, information gain, and symmetrical uncertainty methods output a feature ranking.
In all scenarios, new datasets were created with the best feature subsets. The distance matrix of each new dataset was calculated and used as input to the PAM algorithm. Finally, the purity of the clusters was computed. For both the CFS and consistency methods, the new datasets were created with the resulting most relevant features.
For chi-squared, information gain, and symmetrical uncertainty, the feature rankings they produced were used to create the new datasets. Datasets with dimensions from 2 through 156 were created, with the best two features, the best three features, and so on. The reason for starting at dimension 2 is that the calculation of the distance matrix requires at least 2 attributes. The best feature subset was the one whose dataset led to the highest purity in the clustering process.
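The sweep over ranked features described above can be sketched generically. In the example below, `score_subset` is a hypothetical callable standing in for the full pipeline (build the dataset from the chosen features, compute the distance matrix, run PAM, and return the purity); the toy scores are invented for illustration. Keeping the smallest subset when two sizes tie is an assumption made here, since the text does not state a tie-breaking rule.

```python
def best_ranked_subset(ranked_features, score_subset, min_size=2):
    """Evaluate the top-n features for n = min_size..all and keep the best."""
    best_size, best_score = None, float("-inf")
    for size in range(min_size, len(ranked_features) + 1):
        score = score_subset(ranked_features[:size])
        if score > best_score:  # strict comparison keeps the smallest on ties
            best_size, best_score = size, score
    return ranked_features[:best_size], best_score


# Toy stand-in for the clustering pipeline: the score depends only on size.
toy_scores = {2: 0.70, 3: 0.80, 4: 0.80, 5: 0.69}
ranked = ["f1", "f2", "f3", "f4", "f5"]
subset, score = best_ranked_subset(ranked, lambda s: toy_scores[len(s)])
```

Passing the scoring step in as a function keeps the sweep independent of the clustering algorithm and of the evaluation metric.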

Identification of the Four GBS Subtypes.
The baseline purity of the four clusters obtained using all the 156 features included in the dataset was 0.6899. After experimentation, four filter methods found feature subsets that increased the purity over the baseline. Only the feature subset selected by the consistency method obtained a purity (0.6589) lower than that of the baseline experiment. Table 2 shows the purity results of the five methods. Three methods tied with the highest purity (0.7984): information gain, symmetrical uncertainty, and CFS. Both information gain and symmetrical uncertainty selected seven relevant features, while CFS selected 16. The chi-squared method chose 41 nerve conduction test variables as the most relevant and reached a purity of 0.7829. The consistency method showed the worst performance; of the six features it selected, two were clinical and four corresponded to the nerve conduction test. Table 3 shows the list of the variables selected by both information gain and symmetrical uncertainty. Used together as a dataset, these variables were able to identify the four subtypes of GBS with a purity of 0.7984. All these variables are related to the nerve conduction test.   Table 4.
Four variables from Tables 3 and 4, denoted by ( * ), were selected by all methods.
Purity results of the clustering process using the datasets formed with the most relevant features as ranked by chi-squared, information gain, and symmetrical uncertainty, as described in the methodology section, are shown in Figure 1. The three methods behave similarly. Both information gain and symmetrical uncertainty reached a maximum value with seven relevant variables, while the chi-squared method reached its maximum with 41 variables. All three methods kept purity between 0.7 and 0.8 for feature subsets of sizes between 2 and 102. For bigger subsets, purity lies between 0.65 and 0.7.
Table note: IG: information gain; SU: symmetrical uncertainty; *: only one feature was selected, therefore purity was not computed. The number of features selected in each case is shown in parentheses.

Pairwise Exploration of the GBS Subtypes.
In order to investigate whether each pair of GBS subtypes was distinguishable, we conducted an additional experiment. We created six new datasets, each one containing instances of only two GBS subtypes. We calculated a baseline purity for each pair of GBS subtypes using all the 156 features. Our goal was to determine a feature subset capable of identifying each pair of GBS subtypes with a higher purity than that of the baseline. We used the five filter methods investigated throughout this work to determine the most relevant features for each pair of GBS subtypes. For all scenarios we used k = 2, as there are only two GBS subtypes in each dataset. Finally, we applied PAM to form the clusters using only the relevant features determined with each filter method and calculated their purity. Table 5 shows the results of this experiment. Each row represents a pair of GBS subtypes. Columns 2 to 6 each represent a filter method. The rightmost column indicates the purity achieved using all the features in the dataset, that is, doing no feature selection at all. Table entries indicate the purity obtained in each case. Numbers in bold show the highest purity obtained for each pair of GBS subtypes. Based on the purity obtained, it was found that any filter method is better than using all the features. The highest purity for every pair of GBS subtypes was above 0.9. This result demonstrates the effectiveness of filter methods and highlights the importance of feature selection.

Exploring Different Values of k.
As explained at the beginning of Section 3.1, we performed the clustering process requesting k = 4 clusters, as we know this is the number of GBS subtypes in the dataset. However, we wanted to explore the clustering process with different values of k.
The results of this experiment are shown in Table 6. The first column represents the different values of k analyzed. Each remaining column represents a filter method. The rightmost column represents the purity obtained using all the features, that is, doing no feature selection at all. Each row represents the results obtained for each value of k. Table entries indicate the purity obtained in each case. The results indicate that, in general, purity keeps an ascending pattern as k increases. Purities for k = 4 and k = 5 are very close. In all cases, purity is low for k = 2 and very high for k = 20. The highest purity values were found for k = 20 in all cases; however, these numbers do not indicate that the real number of clusters in the dataset is 20; in fact, this number of clusters does not correspond to the nature of GBS subtypes in real life. This result confirms what is reported in the literature: higher values of purity are easily obtained for higher values of k [45]. Purity is a good evaluation metric for clustering when the number of clusters is known, as in this case.

Discussion.
Our objective in this work was to find the best feature subset to identify four GBS subtypes with the highest purity. We did not find any similar work in the literature; therefore this one represents the first effort in this direction. In order to achieve our purpose, we applied machine learning techniques. We used five filter methods for feature selection and compared their performance.

Importance of Feature Selection to Identify GBS Subtypes.
The clustering of the four GBS subtypes using all the 156 features in the dataset reached a purity of 0.6899. This means that many cases were mislocated in the clustering process. Table 2 shows that four of the five feature selection methods used in this work obtained a small feature subset that led to the identification of the four groups with a higher purity than that of the baseline. The identification of GBS subtypes pairwise was achieved with a high purity. The initial baseline purity was improved in all cases (Table 5) when the algorithm used only the relevant features.
These results demonstrate that the clustering algorithm underperforms in the presence of redundant and irrelevant features and highlight the importance of feature selection methods.

Analysis of Different Numbers of Clusters.
Purity is a good evaluation metric for clustering when the number of clusters is known, as in this case. Higher purity is easily achieved as the number of clusters increases [45] and that is demonstrated with the results shown in Table 6.

Identification of Four GBS Subtypes.
The main contribution of this work is the identification of a subset of seven relevant features from a dataset of 156 variables which identified four GBS subtypes with a purity of 0.7984. Another contribution is the analysis of the performance of five filter methods for feature selection. Finally, this work contributes with the feature rankings produced by chi-squared, information gain, and symmetrical uncertainty methods.
A remarkable finding is that all five methods coincided in four variables. It is also noteworthy that only two of the five methods selected clinical variables. It is important to highlight the fact that the consistency method was not able to select a feature subset to improve the baseline purity (0.6899), but instead the six features selected by this method achieved a worse purity (0.6589).
Information gain, symmetrical uncertainty, and CFS were shown to be highly effective, as each obtained a reduced subset of relevant features that allows identifying four subtypes of GBS with high purity (0.7984). The first two methods coincided in the same seven variables. CFS selected 16 variables. Further studies are needed to evaluate other methods of feature selection, such as wrapper, embedded, and hybrid methods.

Conclusions
In this work, we aimed to find a reduced feature subset for identifying four subtypes of GBS with the highest purity. This work represents the first effort on using cluster analysis to identify GBS subtypes. We used for experiments a real dataset of 156 features containing clinical, serological, and nerve conduction tests data. A clustering process was performed with PAM algorithm. In order to select the most relevant features from the dataset as input for PAM, we conducted experiments with five filter methods: CFS, chi-squared, information gain, symmetrical uncertainty, and consistency.
We succeeded, as two filter methods were able to find a feature subset consisting of only seven variables that allowed us to obtain a purity of 0.7984. This result yielded the first computational characterization of GBS subtypes. In addition, the reduced number of features found to identify the four GBS subtypes could guide physicians in designing a faster, simpler, and cheaper diagnosis of each case of the syndrome.
Other filter methods like FCBF (Fast Correlation-Based Filter) [46] and INTERACT [47] could be used in further studies. Also, more sophisticated methods of feature selection are recommended for analysis, such as those listed in [48][49][50].
Finally, machine learning techniques such as neural networks or support vector machines could be used for clustering. Purity on their resultant clusters can be compared to that of PAM. This study is planned to further our research.