Binary Matrix Shuffling Filter for Feature Selection in Neuronal Morphology Classification

A prerequisite for understanding neuronal function and characteristics is to classify neurons correctly. Existing classification techniques are usually based on structural characteristics and employ principal component analysis to reduce feature dimension. In this work, we classify neurons based on neuronal morphology. A new feature selection method named binary matrix shuffling filter was used in neuronal morphology classification. This method, coupled with support vector machine for implementation, usually selects a small number of features for easy interpretation. The reserved features are used to build classification models with support vector classification and two other commonly used classifiers. Compared with the reference feature selection methods, the binary matrix shuffling filter showed the best performance and exhibited broad generalization ability in five random replications of neuron datasets. In addition, the binary matrix shuffling filter was able to distinguish each neuron type from the other types correctly; for each neuron type, private features were also obtained.


Introduction
To accelerate the understanding of neuronal characteristics in the brain, the prerequisite is to classify neurons correctly. It is therefore necessary to develop a uniform methodology for their classification. The existing classification techniques are usually based on structural functions and the numbers of dendrites to fit the models [1]. As neuronal morphology is closely related to neuronal characteristics and functions, neuroscientists have been making great efforts to study neurons from the perspective of neuronal morphology. Renehan et al. [2] employed intracellular recording and labeling techniques to examine potential relationships between the physiology and morphology of brainstem gustatory neurons and demonstrated a positive correlation between the breadth of responsiveness and the number of dendritic branch points. In the study by Badea and Nathans [3], detailed morphologies for all major classes of retinal neurons in the adult mouse were visualized. After analyzing the multidimensional parametric space, the neurons were clustered into subgroups by using Ward's and k-means algorithms. In the study by Kong et al. [4], retinal ganglion cells were imaged in three dimensions and the morphologies of a series of 219 cells were analyzed. A total of 26 parameters were studied, of which three parameters, level of stratification, extent of the dendritic field, and density of branching, were used to get an effective clustering, and the neurons could often be matched to ganglion cell types defined by previous studies. In addition, a quantitative analysis based on topology and seven morphometric parameters was performed by Ristanović et al. in the adult dentate nucleus [5], and neurons were classified into four types in this region. A number of neuronal morphologic indices such as soma surface, number of stems, length, and diameter were designed [6], which makes it possible to classify neurons based on morphological characteristics.
In the study by Li et al. [7], a total of 60 neurons were selected randomly and five of the twenty morphologic characteristics were extracted by principal component analysis (PCA), after which neurons were clustered into four types. Jiang et al. [8] extracted four principal components of neuronal morphology by PCA and employed back propagation neural network (BPNN) to distinguish the same kinds of neuron in different species. However, the above studies [2][3][4][5] only focused on a particular neuronal type or specific region of the brain, aiming to solve specific issues rather than classify neurons systematically. In these studies, only a few samples were selected and the classification results were not independently tested, so the conclusions are not sufficiently persuasive. Moreover, the methodologies used in previous studies [7,8] were mainly based on PCA and cluster analysis. PCA is the optimal linear transformation to minimize the mean square reconstruction error, but it only considers second order statistics, and if the data have nonlinear dependencies, higher order statistics should be taken into account [9]. In addition, each principal component is a compression of the original attributes, so the contribution of each individual attribute is hard to interpret. Therefore, feature selection (FS), which simplifies the model by removing redundant and irrelevant features, is necessary.
Available feature selection methods fall into three categories. (i) Filter methods use inherent properties of the dataset to rank variables and have low algorithmic complexity. However, the selected features are often redundant, which may result in low classification accuracy. Univariate filter methods include t-test [10], correlation coefficient [11], Chi-square statistics [12], information gain [13], relief [14], signal-to-noise ratio [15], Wilcoxon rank sum [16], and entropy [17]. Multivariable filter methods include mRMR [18], correlation-based feature selection [19], and Markov blanket filter [20]. (ii) Wrapper methods achieve high training precision but have high algorithmic complexity, which usually leads to overfitting. Representative methods include sequential forward selection [21], sequential backward selection [21], sequential floating selection [22], genetic algorithm [23], and ant colony algorithm [24]. SVM and ANN are usually used for implementation. (iii) Embedded methods, including support vector machine recursive feature elimination (SVM-RFE) [25], support vector machine with RBF kernel based on recursive feature elimination (SVM-RBF-RFE) [26], support vector machine and T statistics recursive feature elimination (SVM-T-RFE) [27], and random forest [28], use internal information of the classification model to evaluate selected features.
In this work, a new feature selection method named BMSF was used. It not only overcame the overfitting problem in a large dimensional search space but also took potential feature interactions into account during feature selection. Seven types of neurons with different characteristics and functions, namely, pyramidal neuron, Purkinje neuron, sensory neuron, motoneuron, bipolar interneuron, tripolar interneuron, and multipolar interneuron, were selected from the NeuroMorpho.org database, drawn from all the existing species and brain regions (up to version 6.0). BMSF was used to reduce features nonlinearly, and a support vector classification (SVC) model was built to classify neurons based on the reserved morphological characteristics. SVM-RFE and rough set theory were used as reference feature selection methods for comparison with BMSF, while two other classifiers, the back propagation neural network (BPNN) and Naïve Bayes (NB), which are widely used in the pattern recognition field, were employed to test the robustness of BMSF. A systematic classification of neurons would facilitate the understanding of neuronal structure and function.

Data Sources.
Data sets used in this work were downloaded from the NeuroMorpho.org database [6,29]. NeuroMorpho.org is a web-based inventory dedicated to densely archiving and organizing all publicly shared digital reconstructions of neuronal morphology. NeuroMorpho.org was started and is maintained by the Computational Neuroanatomy Group at the Krasnow Institute for Advanced Study, George Mason University. This project is part of a consortium for the creation of a "neuroscience information framework," endorsed by the Society for Neuroscience, funded by the National Institutes of Health, led by Cornell University (Dr. Daniel Gardner), and including numerous academic institutions such as Yale University, Stanford University, and University of California, San Diego (http://neuromorpho.org/neuroMorpho/myfaq.jsp). The data sets used in this study are documented in Table 1. A total of 5862 neurons were selected, and training and test sets were divided randomly at a ratio of 2:1 within each neuron type. Finally, we obtained five pairs of data sets, each with random samples.
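The per-type 2:1 random split described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' MATLAB code; the function name `stratified_split`, the rounding rule, the seeds, and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_idx, test_idx), splitting every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Rounding rule is an assumption; the paper only states the 2:1 ratio.
        n_train = round(len(members) * train_fraction)
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)

# Five random replications, mirroring the five pairs of data sets (toy labels).
toy_labels = ["pyramidal"] * 6 + ["Purkinje"] * 3
replications = [stratified_split(toy_labels, seed=s) for s in range(5)]
```

Splitting within each type keeps rare types represented in both sets, which matters here because the type sizes differ widely.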

Feature Extraction and Selection.
Dendritic cells in the NeuroMorpho.org database were cut into a series of compartments, and each compartment was characterized by an identification number, a type, the spatial coordinates of the cylinder ending point, the radius value, and the identification number of the "parent." Although the digital description constituted a completely accurate mapping of dendritic morphology, it bore little intuitive information [30]. In this work, 43 attributes that held more intuitive information were extracted with L-measure software [31], and related morphological indices and descriptions are shown in Table 2. For convenience, we gave an abbreviation for each neuronal morphological index, as listed in the second column of Table 2.
Redundancy was expected among these attributes. Feature selection saves computational time and storage and simplifies models when dealing with high dimensional data sets, and it can also improve classification accuracy by removing redundant and irrelevant features.

Binary Matrix Shuffling Filter.
For rapid and efficient selection of high-dimensional features, we have reported a novel method named binary matrix shuffling filter (BMSF) based on support vector classification (SVC). The method was successfully applied to the classification of nine cancer datasets and obtained excellent results [32]. The outline of the algorithm is as follows.
Firstly, denoting the original training set as $(y_i, x_{i,j})$, which includes $n$ samples and $m$ features, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$, we randomly generate a matrix with dimensions $k \times m$ with entries being either 1 or 0, representing whether the feature in that column is included in the modeling or not. Here $k$ is the given number of combinations ($k = 50$ in this paper), and the numbers of 1s and 0s in each column (each feature) are equal.
Secondly, for each combination, a reduced training set is obtained from the original training set according to the subscripts of the selected features, and its classification accuracy is obtained through tenfold cross validation. By repeating this process $k$ times, $k$ values of accuracy are obtained.
Thirdly, taking the $k$ values of accuracy as the new dependent variable and the $k \times m$ random 0 or 1 matrix as the independent variable matrix, a new training set is constructed. To evaluate the contribution of a single feature to the model, we set every entry in the $j$th column to 1, or every entry to 0 (keeping the other columns unchanged), to produce two test sets whose $j$th column consists entirely of 1s or entirely of 0s. The newly produced training set is used to build a model that predicts the two test sets, and the predictive vectors $Y_1$ and $Y_0$ are then obtained.
Comparing the mean values of $Y_1$ and $Y_0$, if the mean of $Y_1$ is larger than that of $Y_0$, the feature corresponding to this column tends to give better classification performance. Otherwise, this feature should be excluded. Repeating this process, the features are screened in multiple rounds until no more can be deleted.
Detailed procedures can be found in our previous study [32]. This method is able to find a parsimonious set of features which has high joint prediction power.
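One screening round of the procedure outlined above might look like the sketch below. It is an illustration under stated assumptions, not the implementation from [32]: `score_subset` stands in for the tenfold cross-validated SVC accuracy, `additive_surrogate` is a simple stand-in for the model fitted on the 0/1 matrix (the outline does not fix that model here), and all function names are hypothetical.

```python
import random
from statistics import mean

def balanced_binary_matrix(k, m, rng):
    """k x m matrix of 0/1 in which every column holds k // 2 ones (k >= 2)."""
    columns = []
    for _ in range(m):
        col = [1] * (k // 2) + [0] * (k - k // 2)
        rng.shuffle(col)
        columns.append(col)
    return [[columns[j][i] for j in range(m)] for i in range(k)]

def additive_surrogate(matrix, accuracies):
    """Stand-in regression model (an assumption, not the paper's model):
    predicted accuracy = baseline + sum of per-feature inclusion effects."""
    effects = []
    for j in range(len(matrix[0])):
        with_j = [a for row, a in zip(matrix, accuracies) if row[j] == 1]
        without_j = [a for row, a in zip(matrix, accuracies) if row[j] == 0]
        effects.append(mean(with_j) - mean(without_j))
    baseline = mean(accuracies) - sum(e / 2 for e in effects)
    return lambda row: baseline + sum(e * b for e, b in zip(effects, row))

def bmsf_round(m, score_subset, k=50, seed=0, fit=additive_surrogate):
    """One screening round; returns the indices of the features to keep."""
    rng = random.Random(seed)
    matrix = balanced_binary_matrix(k, m, rng)
    # One accuracy per row (tenfold cross-validation accuracy in the paper).
    accuracies = [score_subset([j for j in range(m) if row[j]]) for row in matrix]
    # Model accuracy as a function of the 0/1 inclusion pattern.
    predict = fit(matrix, accuracies)
    kept = []
    for j in range(m):
        all_ones = [row[:j] + [1] + row[j + 1:] for row in matrix]
        all_zeros = [row[:j] + [0] + row[j + 1:] for row in matrix]
        if mean([predict(r) for r in all_ones]) > mean([predict(r) for r in all_zeros]):
            kept.append(j)
    return kept

# Toy scorer (hypothetical): only feature 0 carries signal.
def toy_score(subset):
    return 0.9 if 0 in subset else 0.1
```

In the full method the round would be repeated on the surviving features until no further feature is removed.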

Support Vector Machine Recursive Feature Elimination.
SVM-RFE is an application of recursive feature elimination (RFE) using the weight magnitude as the ranking criterion. It eliminates redundant features and yields more compact feature subsets. The features are eliminated according to a criterion related to their support to the discrimination function, and the support vector machine is retrained at each step. This method was first successfully used in gene feature selection and afterwards in the fields of bioinformatics, genomics, transcriptomics, and proteomics. For the technical details of the method, refer to the original study by Guyon et al. [25].
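The elimination loop can be illustrated with a small binary example. This is a sketch under simplifying assumptions: the linear SVM is fitted by plain full-batch subgradient descent on the hinge loss (a stand-in for a proper SVM solver), one feature is dropped per step, labels are -1 or +1, and the function names and toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Weights of a linear SVM fitted by full-batch subgradient descent
    on the L2-regularised hinge loss; labels y must be -1 or +1."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin
                for j in range(m):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w

def svm_rfe(X, y, n_keep):
    """Repeatedly retrain and drop the feature with the smallest squared weight."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_active, y)  # retrained at each step
        worst = min(range(len(active)), key=lambda i: w[i] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated

# Toy data: only the first feature separates the two classes.
X_toy = [[2.0, 1.0, 0.0], [1.5, -1.0, 0.0], [-2.0, 1.0, 0.0], [-1.5, -1.0, 0.0]]
y_toy = [1, 1, -1, -1]
kept, eliminated = svm_rfe(X_toy, y_toy, n_keep=1)
```

The reverse of the elimination order gives a feature ranking; a multi-class problem such as the seven neuron types would need a one-vs-rest or similar extension, which is not shown.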

Rough Set Theory.
Rough set theory, introduced by Pawlak [33] in the early 1980s, is a tool for representing and reasoning about imprecise and uncertain data. It constitutes a mathematical framework for inducing minimal decision rules from training examples. Each rule induced from the decision table identifies a minimal set of features discriminating one particular example from other classes. The set of rules induced from all the training examples constitutes a classificatory model capable of classifying new objects. The selected feature subset not only retains the representational power but also has minimal redundancy. A typical application of the rough set method usually includes three steps: construction of decision tables, model induction, and model evaluation [34]. The algorithm used in this work is derived from the study by Hu et al. [35][36][37].
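To make the idea of a minimal discriminating feature set concrete, the sketch below computes the classical rough-set dependency degree (the share of objects whose class is fully determined by a feature subset) and runs a greedy forward search on it. It is not the algorithm of [35][36][37]; it assumes discrete attribute values, so continuous morphological indices would first need discretization, and the function names and toy decision table are hypothetical.

```python
from collections import defaultdict

def dependency_degree(rows, decisions, features):
    """Fraction of objects whose decision is determined by `features`
    (size of the positive region divided by the number of objects)."""
    blocks = defaultdict(list)
    for row, d in zip(rows, decisions):
        blocks[tuple(row[f] for f in features)].append(d)
    consistent = sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, decisions):
    """Greedy forward search for a small subset that preserves the
    dependency degree of the full feature set."""
    all_features = list(range(len(rows[0])))
    target = dependency_degree(rows, decisions, all_features)
    chosen = []
    while dependency_degree(rows, decisions, chosen) < target:
        best = max((f for f in all_features if f not in chosen),
                   key=lambda f: dependency_degree(rows, decisions, chosen + [f]))
        chosen.append(best)
    return chosen

# Toy decision table: three discrete attributes, two classes.
toy_rows = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
toy_decisions = ["A", "A", "B", "B"]
```

On this toy table the first attribute alone separates the classes, so the search stops after selecting it.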

Classification Techniques
Support Vector Classification.
Support vector classification, based on statistical learning theory, is widely used in the machine learning field [38]. In SVM, structural risk minimization replaces traditional empirical risk minimization, which makes the method particularly suitable for problems with small sample sizes, high dimensionality, nonlinearity, and strong collinearity, and helps it to avoid overfitting, the curse of dimensionality, and local minima. It also exhibits excellent generalization ability. In this work the nonlinear radial basis function (RBF) kernel was selected, where the ranges of the parameters $C$ and $\gamma$ for optimization were −5 to 15 and 3 to −15 (base-2 logarithm), respectively. The cross validation and independent test were carried out using in-house programs written in MATLAB (version R2012a).
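The parameter search space described above can be written out explicitly. The sketch assumes that the two tuned parameters are the usual penalty C and RBF kernel width gamma, and that the exponents are stepped by 2; the step size is not given in the text, so it is an assumption, as are the helper names.

```python
import math

def log2_grid(start, stop, step):
    """Powers of two from 2**start to 2**stop inclusive (step may be negative)."""
    n = (stop - start) // step
    return [2.0 ** (start + i * step) for i in range(n + 1)]

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

C_grid = log2_grid(-5, 15, 2)        # log2(C) from -5 to 15; step 2 is assumed
gamma_grid = log2_grid(3, -15, -2)   # log2(gamma) from 3 to -15; step -2 is assumed
candidates = [(C, g) for C in C_grid for g in gamma_grid]
```

Each candidate pair would then be scored by cross validation on the training set, and the best pair used for the independent test.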

Back Propagation Neural Network.
BPNN is one of the most widely employed techniques among the artificial neural network (ANN) models. The general structure of the network consists of an input layer, a variable number of hidden layers containing any number of nodes, and an output layer. The back propagation learning algorithm modifies the feed-forward connections between the input and hidden units and between the hidden and output units, adjusting the connection weights to minimize the error [39]. Java-based software WEKA [40] was used to fit the model.
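A single weight update of back propagation can be shown on a very small network. The sketch below assumes one hidden layer, sigmoid units, a single output, and squared error; it is unrelated to WEKA's implementation, and the function names, initial weights, and learning rate are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden activations and the network output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, o

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One gradient step on E = 0.5 * (o - t)^2 for a single example."""
    h, o = forward(x, W1, b1, W2, b2)
    delta_o = (o - t) * o * (1 - o)  # error signal at the output unit
    # Propagate the error signal back to each hidden unit.
    delta_h = [delta_o * W2[j] * h[j] * (1 - h[j]) for j in range(len(h))]
    new_W2 = [W2[j] - lr * delta_o * h[j] for j in range(len(h))]
    new_b2 = b2 - lr * delta_o
    new_W1 = [[W1[j][i] - lr * delta_h[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    new_b1 = [b1[j] - lr * delta_h[j] for j in range(len(h))]
    return new_W1, new_b1, new_W2, new_b2

# Toy example: two inputs, two hidden units, target output 1.
W1, b1, W2, b2 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.5, -0.5], 0.0
x, t = [1.0, 0.0], 1.0
_, before = forward(x, W1, b1, W2, b2)
_, after = forward(x, *backprop_step(x, t, W1, b1, W2, b2))
```

After the step the output moves toward the target; training repeats such steps over all training examples.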

Naïve Bayes.
Naïve Bayes is a classification technique obtained by applying a relatively simple method to a training dataset [41]. A Naïve Bayes classifier calculates the probability that a given instance belongs to a certain class. Despite its simple structure and ease of implementation, Naïve Bayes often performs well. Naïve Bayes models were also implemented in the WEKA software, and all the parameters were set to their default values.
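The class-probability calculation can be sketched for numeric attributes as follows. The example assumes Gaussian class-conditional densities and at least two training samples per class; it illustrates the principle only and makes no claim about the settings used in WEKA. The function names and toy data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, variance

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Per class: log prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        model[label] = (
            math.log(len(rows) / len(X)),
            [mean(c) for c in cols],
            [max(variance(c), var_floor) for c in cols],
        )
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - mu) ** 2 / (2 * v)
            for x, mu, v in zip(row, means, variances))
    return max(model, key=log_posterior)

# Toy one-attribute data with two well separated classes.
X_nb = [[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]]
y_nb = ["a", "a", "a", "b", "b", "b"]
nb_model = fit_gaussian_nb(X_nb, y_nb)
```

The "naïve" part is the sum over attributes: each attribute contributes independently given the class.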

Selected Feature Subsets.
Feature selection methods are applied to training sets to get optimal feature subsets. For each method, five sets of features were obtained. Table 3 shows the reserved feature subsets derived from BMSF, SVM-RFE, and rough set theory, respectively. Five feature subsets are numbered with Roman numerals I to V for five replications. The number of selected features is also listed in Table 3. As shown in Table 3, approximately eight features on average were reserved by BMSF, while the number of features derived from SVM-RFE and rough set theory was more than ten. BMSF retained fewer features, which were more informative and easy to interpret. The feature ranking list showed the importance of a certain feature. In the feature subsets of BMSF and rough set, ranked first in five replications, which indicated that had a strong ability to discriminate neuron types. We calculated the frequency of each of the selected features in the five replications. Except for , features NW, HT, and Ta2 were also reserved in five random replications simultaneously, and their ranking lists were similar in the five BMSF subsets.

Comparison of Independent Test Accuracies Using Different Models.
In order to evaluate the performance of BMSF and make a comparison with SVM-RFE and rough set, three classifiers were employed to perform independent tests. Including classification without feature selection, this gave twelve classification accuracies (four feature sets for each of the three classifiers). The average accuracies on five random datasets are presented in Table 4.
The independent classification accuracy is the ratio of the total correctly classified samples to the total test samples. As shown in Table 4, of the twelve results obtained, the optimal classification model based on the five datasets is BMSF-SVC (97.84%), followed by SVC without feature selection (97.1%). Excellent classification results on the SVC classifier indicated that all the extracted features were useful in identifying neurons, and few irrelevant features were extracted. Further, after feature selection by BMSF, the classification accuracy of SVC increased. This suggested that BMSF deleted redundant features successfully and simplified models with fewer features. On the other hand, the feature subsets derived from SVM-RFE and rough set did not increase the accuracies on SVC; in fact, the accuracies decreased sharply. A similar finding was observed for Naïve Bayes, as the two feature selection methods decreased the performance of Naïve Bayes, while BMSF improved it. The classifier BPNN showed little sensitivity to feature subsets, and the classification performance was at similar levels. With fewer features, BMSF also obtained good accuracy on BPNN, and a simplified model may be useful in further interpretation.
The above independent accuracies indicated that BMSF has an excellent generalization ability and robustness on the three classifiers. We also calculated the average performance of each feature selection method on the three classifiers and the classification performance based on the three different feature selection methods. The results are listed in the last row and column of Table 4. The average classification accuracy based on BMSF was also the best.
As the datasets used in this work are unbalanced (as shown in Table 1), it is necessary to break down the independent test accuracy to obtain the classification performance of each cell type. Based on the predicted labels, the sensitivities of each cell type in the five replications are presented in Table 5.
For the seven neuron types, BMSF-SVC exhibited the best performance on pyramidal neuron, motoneuron, sensory neuron, and bipolar neuron. Though Naïve Bayes showed excellent performance on tripolar and multipolar neurons, it did not do very well on other neuron types. The classification result of multipolar neuron was poor; however, SVM-RFE and rough set also performed less well on SVC. We found that the predicted labels of multipolar neuron are almost the same as those of the pyramidal neuron in all the models, which indicated that the unbalanced datasets had an effect on the prediction of multipolar neuron.

Distinguishing a Certain Neuron Type from Others by BMSF-SVC.
To evaluate whether a certain feature subset is useful in identifying only a single cell type, the optimal model (BMSF-SVC) in this study was employed. For the seven neuron types, six hierarchy models were established, each of which was a binary classification problem. Because the datasets in this paper are imbalanced, accuracy and the Matthews correlation coefficient (MCC) were used to evaluate the established models, and recall was used to evaluate the classification performance for a single neuron type, as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively, derived from the confusion matrix. In this paper, the positive samples were a certain neuron type and all the remaining neuron types were negative samples. Positive samples were selected according to the number of samples in each type, and the datasets in each hierarchy are presented in Table 6. As shown in Table 6, the accuracies and MCC in each hierarchy indicated the effectiveness of the models. We obtained private feature subsets for each neuron type. These features were useful in identifying the corresponding neurons, and the perfect recall supports this conclusion. The above finding suggested that BMSF was not only useful in identifying all seven cell types but also able to discriminate specific neuron types.
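The three evaluation measures named above (accuracy, MCC, and recall) follow their standard confusion-matrix definitions, which can be computed as in this short sketch. The function names, the one-versus-rest counting helper, and the toy labels are assumptions introduced for the example.

```python
import math

def confusion_counts(true_labels, predicted_labels, positive):
    """One-versus-rest counts (TP, TN, FP, FN) for the class `positive`."""
    tp = tn = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p != positive:
            tn += 1
        elif t != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    return tp / (tp + fn)

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 when the denominator vanishes is a common convention.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy labels: P = positive type, the others are pooled as negatives.
toy_true = ["P", "P", "M", "S"]
toy_pred = ["P", "M", "M", "P"]
toy_counts = confusion_counts(toy_true, toy_pred, positive="P")
```

Unlike accuracy, MCC stays informative when one class dominates, which is why it suits the imbalanced hierarchy models.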
In this paper, we used a new feature selection method named BMSF for neuronal morphology classification. Interactions are taken into consideration to get highly accurate classification of neurons, and this method usually selects a small number of features for easy interpretation. As shown in Table 3, approximately eight features on average were reserved via BMSF, fewer than the number obtained by the other two feature selection methods. The BMSF method automatically conducts multiple rounds of filtering and guided random search in the large feature subset space and reports the final list of features. Though this process is wrapped with SVC, the features selected have general applicability to multiple classification algorithms. This conclusion is supported by the classification performance shown in Table 4.
We should point out that different runs of BMSF may produce different lists of feature subsets. This phenomenon arises from the fact that there are many possible characteristics that may be used to distinguish neurons. For example, feature subsets derived from rough set theory and BMSF achieve similar classification accuracy when applied to the SVC classifier. Our goal is to find a minimal set of features whose combination can differentiate the dependent variable well.
The reserved feature subsets on the same data set that resulted from different feature selection methods differed greatly. Li et al. [7] and Jiang et al. [8] selected features from the first twenty attributes of Table 2 only, so they inevitably ignored the attributes that were reserved by BMSF. Therefore, feature extraction by L-measure software was necessary. Another drawback of their feature selection methods was that they did not reduce the variables in a nonlinear manner. For example, PCA only considers second order statistics, and interactions cannot be taken into account.
Conventional classification techniques were built on the premise that the input data sets were balanced; if not, the classification performance would decrease sharply [42]. There were 3908 neurons in the training set, but the number of neurons in each type differed greatly (Table 1). For example, there were only 24 multipolar interneurons and 11 Purkinje neurons, whereas the number of pyramidal neurons was 3172, and the unbalanced data sets would have a negative effect on the classification results (Table 5). Therefore, we built a hierarchy model for each neuron type, and BMSF was shown to be useful in distinguishing specific neuron types from others.

Conclusion
We introduced a new feature selection method named BMSF for neuronal morphology classification, obtained satisfactory accuracy for all of the datasets and each hierarchy model, and were able to select private parsimonious feature subsets for each neuron type. However, it is clear that classification based simply on neuronal morphology is inadequate. As time goes by, dendrites may continue to grow and axons will generate additional terminals, which will undoubtedly lead to changes in the vital parameters [8]. Therefore, combining biophysical characteristics with functional characteristics to investigate the neuronal classification problem will be a productive direction in the future.
Computational and Mathematical Methods in Medicine