K Important Neighbors: A Novel Approach to Binary Classification in High Dimensional Data

K nearest neighbors (KNN) is known as one of the simplest nonparametric classifiers, but in high dimensional settings its accuracy is affected by nuisance features. In this study, we propose K important neighbors (KIN) as a novel approach for binary classification in high dimensional problems. To avoid the curse of dimensionality, we fit smoothly clipped absolute deviation (SCAD) logistic regression at the initial stage and account for the importance of each feature in the dissimilarity measure by imposing feature contributions, defined as a function of the SCAD coefficients, on the Euclidean distance. This hybrid dissimilarity measure, which combines information from both features and distances, inherits the good properties of SCAD penalized regression and KNN simultaneously. Simulation studies showed that, in comparison to KNN, KIN performs well in terms of both accuracy and dimension reduction. The proposed approach was able to eliminate nearly all of the noninformative features because it exploits the oracle property of SCAD penalized regression in the construction of the dissimilarity measure. In very sparse settings, KIN also outperforms support vector machine (SVM) and random forest (RF), which are regarded as the best classifiers.


Introduction
The aim of classification methods is to assign the true label to a new observation. Although classification is one of the oldest statistical problems, finding a mechanism that classifies new observations with the lowest error is still challenging. Fernández-Delgado et al. showed that no classifier has the highest accuracy in all situations, but they presented random forest (RF) and support vector machine (SVM) as the best classifiers among 182 classifiers [1].
K nearest neighbors (KNN) is known as one of the simplest nonparametric classifiers. For a fixed value of $K$, KNN assigns a new observation to the majority class among its $K$ nearest neighbors [2,3]. Nevertheless, in high dimensional settings, it is affected by nuisance (noninformative) features and suffers from the "curse of dimensionality" [4][5][6]. In recent years, the effect of the curse of dimensionality on KNN has been studied by many authors. For example, Pal et al. showed that, in high dimensional settings, the KNN classifier misclassifies about half of the observations [3,7], and Lu et al. noted that the sparsity inherent in high dimensional situations can lead to unstable results [5]. As a result of the curse of dimensionality, some authors have argued that the nearest neighbor can become ill defined because all pairwise distances concentrate around a single value (distance concentration) [3,4,7]. Beyer et al. stated that distance concentration can occur with as few as 15 dimensions [7]. In 2010, Radovanović et al. introduced $k$-occurrences as "the number of times a point appears among the $k$ nearest neighbors of other points in a data set." They also showed the deleterious impact of points with very high $k$-occurrences, called hubs [6]. Another challenge in the KNN method concerns ties when the sample size is small. In practice, $K$ is chosen to be no greater than the square root of the number of training observations [2]. Therefore, in binary classification, when $K$ is even there is a chance of ending with a tie vote. To avoid this problem, KNN usually considers only odd values of $K$ [2,8].
In the last decade, dimension reduction techniques have received growing attention as a remedy for classification with KNN in high dimensional settings. Fern and Brodley proposed random projection, which is based on a random matrix. This random matrix projects the data onto a subspace of lower dimension, and the KNN classifier then uses the reduced subspace for the classification task [9]. Deegalla and Boström proposed principal component based projection, in which the number of principal components (PCs) is lower than the data dimension. They recommended using these PCs instead of the initial features for constructing the dissimilarity measure and finding the nearest neighbors [10]. Another popular approach is to employ a threshold (the so-called hard threshold) and truncate less important features. In this approach, only features that exceed the threshold contribute to the KNN classifier [11]. Pal et al. proposed a new dissimilarity measure based on the mean absolute difference of distances (MADD) to cope with the curse of dimensionality [3]. Finally, in 2013, Lu et al. stated that, to enhance accuracy in sparse situations, a classifier should combine both linearity and locality information [5].
In this manuscript, we suggest a hybrid method called K important neighbors (KIN) that fits smoothly clipped absolute deviation (SCAD) regression and uses a function of the obtained coefficients as weights in the construction of the dissimilarity measure. The proposed method combines information from the features through the logit link function (i.e., linearity information) with information from the distances (i.e., locality information) in the dissimilarity measure, thereby achieving both feature selection and classification. When ties occur, KIN assigns the new observation to the class with the lower value of the dissimilarity measure.
The rest of this paper is organized as follows. Section 2 gives a brief description of KNN, SCAD penalized regression, random forest (RF), and support vector machine (SVM). Section 3 presents our proposed method. Section 4 compares the accuracy of KIN with that of KNN, RF, and SVM using simulation studies and benchmark data sets. Finally, Section 5 discusses the proposed classifier and concludes the manuscript.

2.1. K Nearest Neighbors (KNN).
The K nearest neighbors classifier assigns a new observation to the class with the majority of votes among its $K$ nearest neighbors [12,13]. The dissimilarity measure in KNN is usually defined in terms of the Minkowski distance as follows:
$$d(x_i, x_l) = \left( \sum_{j=1}^{p} |x_{ij} - x_{lj}|^{q} \right)^{1/q},$$
where $p$ is the number of features, $q$ is a positive constant (usually 1 or 2), and $d(x_i, x_l)$ is the distance between the $i$th and $l$th points. The optimum value of $K$ (the number of neighbors) can be obtained using cross validation [2,8].
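As an illustration of the distance and voting rule described above, the following Python sketch computes the Minkowski distance and a majority-vote prediction. The function names, the toy data, and the use of Python (the study itself used R) are assumptions made for illustration only.

```python
from collections import Counter

def minkowski(x, z, q=2):
    """Minkowski distance of order q between two equal-length feature vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, z)) ** (1.0 / q)

def knn_predict(train_x, train_y, new_x, k=3, q=2):
    """Assign new_x to the majority class among its k nearest training points."""
    # Sort training indices by their distance to the new observation.
    order = sorted(range(len(train_x)),
                   key=lambda i: minkowski(train_x[i], new_x, q))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features, two classes.
train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = [0, 0, 1, 1, 1]
label = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
```

With q = 2 the distance is Euclidean and with q = 1 it is the Manhattan distance. An odd k is used in the toy call so that a binary vote cannot end in a tie.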

Smoothly Clipped Absolute Deviation (SCAD).
Variable selection is one of the key tasks in high dimensional statistical modeling. By handling the curse of dimensionality, the penalized likelihood approach performs estimation and variable selection simultaneously [14]. Smoothly clipped absolute deviation (SCAD) logistic regression, proposed by Fan and Li for feature selection in high dimension and low sample size settings, is as follows:
$$\hat{\beta} = \arg\max_{\beta} \left\{ l(\beta) - \sum_{j=1}^{p} p_{\lambda}(|\beta_j|) \right\},$$
where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of coefficients, $l(\beta)$ is the log-likelihood of the regression model, $p_{\lambda}(\cdot)$ is the penalty function, and $\lambda$ is a positive constant called the regularization (tuning) parameter [15,16]. The amount of penalty depends on $\lambda$, which is estimated using 5- or 10-fold cross validation. SCAD shares the good properties of both best subset selection and ridge regression, yielding continuous and unbiased solutions. Moreover, it can estimate the coefficients of nuisance features as zero and those of signal (informative) features as nonzero with probability very close to one. This advantage of SCAD regression is called the "oracle" property: with probability tending to one, SCAD identifies the true set of nonzero coefficients and estimates them as well as if the true model were known in advance [15]. In short, SCAD selects the correct model as well as we could hope, even in very sparse and low sample size situations.
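For reference, the SCAD penalty of Fan and Li has the piecewise form coded below. This is a minimal Python sketch, not the implementation used in the study; the function name `scad_penalty` is hypothetical, and the default a = 3.7 is the value commonly suggested in the SCAD literature rather than a setting reported in this paper.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|) for a single coefficient (requires a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic blend that flattens out.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so no extra shrinkage is applied.
    return lam * lam * (a + 1) / 2

# Penalty values for a small, a moderate, and a large coefficient (lambda = 1).
values = [scad_penalty(t, 1.0) for t in (0.5, 2.0, 10.0)]
```

Because the penalty is constant beyond a times lambda, large coefficients are not shrunk further, which is the source of the near unbiasedness mentioned above, while the linear part near zero sets small coefficients exactly to zero.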

Random Forest (RF).
Random forest (RF) is a method for regression or classification that is based on an ensemble of unpruned trees. In RF, each tree is built on a bootstrap sample (containing almost two-thirds of the observations) and is grown using a random sample of features at each split. For classification tasks, the size of this random sample is the square root of the total number of features. This is repeated hundreds of times to build a forest. The optimum number of trees in RF can be estimated from the out-of-bag error, and the class with the majority of votes is taken as the class of a new observation [8,17]. In the current study, the randomForest package was used with the default number of trees, 500.
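The two sampling quantities mentioned above can be checked with a short Python sketch. It only illustrates the bootstrap fraction and the number of features tried per split; it does not build a forest. The function names are hypothetical, and rounding the square root down is an assumption of this sketch.

```python
import math
import random

def bootstrap_unique_fraction(n, rng):
    """Fraction of distinct observations in one bootstrap sample of size n."""
    sample = [rng.randrange(n) for _ in range(n)]  # draw n indices with replacement
    return len(set(sample)) / n

def features_per_split(p):
    """Number of candidate features per split for classification (floor of sqrt(p))."""
    return max(1, int(math.sqrt(p)))

rng = random.Random(0)
fractions = [bootstrap_unique_fraction(1000, rng) for _ in range(200)]
mean_fraction = sum(fractions) / len(fractions)  # close to 1 - 1/e, about 0.632
```

The observations left out of each bootstrap sample (roughly one-third) are the out-of-bag observations used to estimate the error.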

Support Vector Machine (SVM).
The aim of support vector machine (SVM) is to find a separating hyperplane that maximizes the margin between two classes. To attain this goal, SVM incorporates the kernel trick, which allows expansion of the feature space. A support vector is any observation that lies on the margin or on the wrong side of the margin for its class. Expansion of the feature space depends on the number of support vectors estimated by cross validation [8,18]. In the current study, we used a linear kernel and a cost parameter ranging between 0.001 and 5 in the e1071 package.
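The notion of a support vector can be made concrete with a small Python sketch for a linear classifier. It assumes a hyperplane (w, b) that has already been fitted and class labels coded as -1 and +1 (the paper codes classes as 0 and 1); the function names and the toy values are hypothetical.

```python
def functional_margin(w, b, x, y):
    """y * (w . x + b); values of at most 1 lie on or inside the margin."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def support_vector_indices(w, b, xs, ys, tol=1e-9):
    """Indices of observations on the margin or on its wrong side."""
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if functional_margin(w, b, x, y) <= 1.0 + tol]

# Hypothetical hyperplane x1 = 0 with margin boundaries at x1 = -1 and x1 = +1.
w, b = (1.0, 0.0), 0.0
xs = [(2.0, 0.0), (1.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (-3.0, 1.0), (0.2, 0.0)]
ys = [1, 1, 1, -1, -1, -1]
support = support_vector_indices(w, b, xs, ys)
```

Observations far from the boundary on the correct side (the first and fifth points here) do not affect the fitted hyperplane, which is why only the support vectors matter.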

K Important Neighbors (KIN) Algorithm for Binary Classification
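Suppose that $D = \{(x_i, y_i),\ i = 1, \ldots, n\}$ is a training data set, where $y_i \in \{0, 1\}$ denotes class membership and the vector of $p$ predictor features for the $i$th observation is represented as $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. After random division of the data into training and testing sets, SCAD logistic regression is fitted on the training data set, which leads to the coefficients of nuisance features being estimated as exactly zero. In the next step, the contribution (importance) of each feature is calculated using the following formula:
$$c_j = \frac{|\hat{\beta}_j|}{\sum_{m=1}^{p} |\hat{\beta}_m|}, \quad j = 1, \ldots, p,$$
where $\hat{\beta}_j$ is the coefficient of the $j$th feature in SCAD logistic regression. By imposing the obtained vector of contributions on the Euclidean distance, we introduce our proposed dissimilarity measure as follows:
$$d(x_i, x_l) = \sqrt{\sum_{j=1}^{p} c_j (x_{ij} - x_{lj})^2},$$
where $d(x_i, x_l)$ is the distance between the $i$th and $l$th points. In the next stage, we obtain the optimum number of neighbors ($K$) using the proposed dissimilarity measure, considering both even and odd values. A new observation is assigned to class one ($y = 0$) if $n_1 > n_2$ and to class two ($y = 1$) if $n_1 < n_2$, where $n_k$ is the number of observations in the $k$th class among the $K$ nearest neighbors. When a tie occurs ($n_1 = n_2$), the assignment rule is as follows:
$$\hat{y} = \begin{cases} 0 & \text{if } \delta_1 < \delta_2, \\ 1 & \text{if } \delta_1 > \delta_2, \end{cases}$$
where $\delta_k$ is the sum of the dissimilarities between the new observation and those of its $K$ nearest neighbors that belong to the $k$th class; that is, the new observation is assigned to the class with the lower dissimilarity index.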
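The steps of this section can be sketched in Python as follows. The sketch assumes that the contributions are the normalized absolute SCAD coefficients and that they weight the squared differences inside the Euclidean distance; both are one plausible reading of "a function of SCAD coefficients", not a confirmed specification. The coefficient vector `beta` is a hypothetical stand-in for the output of a fitted SCAD logistic regression, and all function names are illustrative.

```python
import math

def contributions(beta):
    """Assumed weights: |beta_j| divided by the sum of all |beta_j|."""
    total = sum(abs(b) for b in beta)
    if total == 0:
        raise ValueError("all coefficients are zero; no feature was selected")
    return [abs(b) / total for b in beta]

def kin_dissimilarity(x, z, c):
    """Euclidean distance with each squared difference weighted by c_j."""
    return math.sqrt(sum(cj * (a - b) ** 2 for cj, a, b in zip(c, x, z)))

def kin_predict(train_x, train_y, new_x, c, k):
    """Majority vote among the k important neighbors, with the tie rule."""
    scored = sorted((kin_dissimilarity(x, new_x, c), y)
                    for x, y in zip(train_x, train_y))
    nearest = scored[:k]
    n0 = sum(1 for _, y in nearest if y == 0)
    n1 = len(nearest) - n0
    if n0 != n1:
        return 0 if n0 > n1 else 1
    # Tie: choose the class whose neighbors have the lower total dissimilarity.
    d0 = sum(d for d, y in nearest if y == 0)
    d1 = sum(d for d, y in nearest if y == 1)
    return 0 if d0 < d1 else 1

beta = [2.0, 0.0, -1.0]            # hypothetical SCAD fit: feature 2 is noise
c = contributions(beta)            # the noise feature gets zero weight
train_x = [(0.0, 5.0, 0.0), (0.2, -4.0, 0.1), (1.0, 0.1, 1.0), (1.1, 0.0, 0.9)]
train_y = [0, 0, 1, 1]
new_x = (0.1, 0.0, 0.05)
pred_even_k = kin_predict(train_x, train_y, new_x, c, k=4)   # tie resolved by distance
pred_unweighted = kin_predict(train_x, train_y, new_x, [1 / 3] * 3, k=2)
```

In this toy example the noise feature dominates the unweighted distance, so equal weights select the two class-1 points, whereas the weighted measure ignores that feature and selects the class-0 points. Because ties are resolved by the dissimilarity totals, even values of k are usable.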
To avoid a significant decrease in the sample size of each fold, 5-fold cross validation was used to choose the optimum number of neighbors ($K$), because the sample size of the training data set may be as small as 30. In 5-fold cross validation, the training data set (40% of the total sample size in the current study) is randomly divided into 5 equal parts. Each time, one part is held out as the validation set while the remaining four parts are used for training the model. This is repeated 5 times, so that every part is used exactly once as the validation set, and the mean error of the 5 repeats is taken as the cross validation error. Finally, after obtaining the optimum number of neighbors and using the matrix of dissimilarity measures, the observations of the testing set (60% of the total sample size in the current study) were assigned to the classes. The misclassification rate (MC) was calculated using the following formula:
$$\mathrm{MC} = \sum_{k=1}^{2} \pi_k \frac{m_k}{N_k},$$
where $\pi_k$, $m_k$, and $N_k$ represent the ratio of observations, the number of misclassifications, and the sample size of the $k$th class, respectively. The algorithm is described as a flowchart in Figure 1.
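A minimal Python sketch of the fold construction and of the misclassification rate is given below. It assumes that the class ratio is the share of each class in the evaluated set and that MC is the sum over classes of that ratio times the per-class error; the function names are hypothetical.

```python
import random

def five_fold_indices(n, seed=0, folds=5):
    """Yield (training, validation) index lists; each part validates exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[f::folds] for f in range(folds)]
    for f in range(folds):
        validation = parts[f]
        training = [i for g in range(folds) if g != f for i in parts[g]]
        yield training, validation

def misclassification_rate(y_true, y_pred):
    """Sum over classes of (class ratio) * (misclassified in class / class size)."""
    n = len(y_true)
    mc = 0.0
    for k in set(y_true):
        n_k = sum(1 for y in y_true if y == k)
        m_k = sum(1 for y, yh in zip(y_true, y_pred) if y == k and yh != k)
        mc += (n_k / n) * (m_k / n_k)
    return mc

splits = list(five_fold_indices(30))   # a training set as small as 30
mc = misclassification_rate([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1])
```

With 30 training observations each validation part holds only 6 observations, which illustrates why more than 5 folds would leave very little data per fold.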

Simulation Framework.
In the following scenarios, the misclassification rate of the proposed KIN method was compared numerically with those of traditional KNN, random forest (RF), and support vector machine (SVM). RF and SVM were chosen because they have been reported as the best among current classifiers. All simulations were performed in R 3.1.3, and 5-fold cross validation was used to estimate the optimum number of trees in RF, the number of support vectors in SVM, and the optimum number of neighbors in KNN and KIN. We simulated 250 data sets for each scenario, comprising 100 or 200 observations from the model $Y \sim \mathrm{Bernoulli}(p(X))$, where $Y$ denotes class membership, $p(X) = \exp(X\beta)/(1 + \exp(X\beta))$, and $X$ is a vector of features, each of which has a standard normal distribution. Let $\beta = (\beta_{\text{nonzero}}, \beta_{\text{zero}})$, where $\beta_{\text{nonzero}}$ is a vector with 1 in its odd components and 2 in its even components and $\beta_{\text{zero}}$ is a vector of zeros. The degree of sparsity was defined as $p_{\text{zero}}/p$, which was set to 90, 95, or 98%, and the number of features was set to 100, 300, or 500. Moreover, to assess the effect of correlation between features on the accuracy of the proposed classifier, an autoregressive correlation structure was used. In this pattern, the closer two variables are to each other, the stronger the correlation between them: the correlation between $X_j$ and $X_m$ (two arbitrary features) was $\rho^{|j - m|}$, where $\rho$ was 0.8 or 0.4. In all scenarios, we split the simulated data set randomly into training and testing sets with a ratio of 40% to 60%, respectively. The smaller sample size was chosen for the training set in order to assess the accuracy of the proposed model relative to the best classifiers in low sample size settings.
The average number of false positive variables was 1.9, 2.7, and 3.0 when the numbers of variables were 100, 300, and 500, respectively.
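The data-generating model of the simulation framework can be sketched in Python as follows. The sketch assumes that "odd" and "even" components are counted from 1 and that the autoregressive correlation is produced by a first-order recursion on standard normal draws; the study itself used R, and all function names here are hypothetical.

```python
import math
import random

def make_beta(p, sparsity):
    """Nonzero part alternates 1, 2, 1, 2, ...; the remaining entries are zero."""
    n_zero = round(p * sparsity)
    n_nonzero = p - n_zero
    nonzero = [1.0 if (j + 1) % 2 == 1 else 2.0 for j in range(n_nonzero)]
    return nonzero + [0.0] * n_zero

def expit(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def simulate(n, p, sparsity, rho, seed=0):
    """Draw n observations with corr(X_j, X_m) = rho ** |j - m| and Bernoulli labels."""
    rng = random.Random(seed)
    beta = make_beta(p, sparsity)
    scale = math.sqrt(1.0 - rho * rho)   # keeps every feature at unit variance
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        eta = sum(b * v for b, v in zip(beta, x))
        ys.append(1 if rng.random() < expit(eta) else 0)
        xs.append(x)
    return xs, ys, beta

xs, ys, beta = simulate(n=100, p=100, sparsity=0.90, rho=0.8)
```

Under these assumptions, a scenario with 100 features and 90% sparsity has 10 signal features and 90 noise features.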
In fact, the proposed method successfully eliminated 98.8, 98.7, and 98.8% of noisy features in the 90, 95, and 98% degree of sparsity scenarios, respectively. Our results indicated that KIN also performs well in terms of assigning the true weight to signal (nonzero) features; we call this the true contribution (TC). Table 1 shows that the average true contributions were 80.2, 77.1, and 69.8% for 100, 300, and 500 predictors, respectively. The misclassification (MC) rate of KIN is compared with those of KNN, RF, and SVM for the above scenarios in Figure 2. This figure indicates that KIN is superior to KNN in all situations. Also, in very sparse situations where the degree of sparsity is 98%, KIN outperforms RF and SVM most of the time, and it has comparable accuracy in the other sparse situations. We also introduced the probability of achieving the maximum accuracy (PAMA) for each classifier, defined as the number of scenarios in which the classifier achieves the highest accuracy (among the 4 classifiers) divided by the total number of scenarios. Table 2 shows the values of PAMA for each classifier at different degrees of sparsity. We can infer that the probability of achieving the maximum accuracy with KIN increases as the degree of sparsity approaches 100%, since the highest PAMA for KIN is 66.7%, attained where only 2% of features are signal. Note that the PAMA values are far from 100%, indicating that no classifier is the best for all settings.
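The PAMA definition above amounts to a simple count, sketched here in Python. The accuracy values are made up for illustration and are not results from the study; counting every tied classifier as a winner is an assumption of this sketch.

```python
def pama(accuracy_by_scenario):
    """Share of scenarios in which each classifier attains the highest accuracy."""
    names = list(accuracy_by_scenario[0].keys())
    wins = {name: 0 for name in names}
    for scenario in accuracy_by_scenario:
        best = max(scenario.values())
        for name, value in scenario.items():
            if value == best:          # ties credit every tied classifier
                wins[name] += 1
    total = len(accuracy_by_scenario)
    return {name: wins[name] / total for name in names}

# Hypothetical accuracies for three scenarios.
scenarios = [
    {"KIN": 0.90, "KNN": 0.70, "RF": 0.88, "SVM": 0.85},
    {"KIN": 0.80, "KNN": 0.65, "RF": 0.86, "SVM": 0.84},
    {"KIN": 0.91, "KNN": 0.60, "RF": 0.90, "SVM": 0.89},
]
result = pama(scenarios)
```

Without ties the PAMA values of the classifiers sum to one, so a value far below 100% for every classifier means that the best method changes from scenario to scenario.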

Simulation Results.
Another useful measure, which credits classifiers whose accuracy is very near that of the best classifier, is the probability of achieving more than 95% of the maximum accuracy (P95). The P95 for each classifier is estimated as the number of scenarios in which it achieves 95% or more of the maximum accuracy (among the 4 classifiers), divided by the total number of scenarios. Once again, we can see that the

Benchmark Data Sets.
In order to further assess the KIN classifier, we analyzed five data sets. The first two data sets were taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html). The prostate cancer data set was taken from the SIS package (only the first 600 features) and the colon cancer data set from the HiDimDA package in R [19,20]. We also used the liver transplant data set described in [21] to examine the accuracy of KIN when class membership is highly unbalanced. In the liver transplant data set, only 11% of patients were dead ($y = 1$) and the rest were alive ($y = 0$). For these data sets, instead of using a specific training and testing set, we used random partitioning of the whole data: for each data set, we formed 200 training and testing sets, and the average accuracy rate was computed over these 200 partitions.
The results of classification on the benchmark data sets are summarized in Table 3. For data sets with a small or moderate number of features, such as liver transplant, connectionist bench, and ozone, the difference between the accuracy of KIN and that of KNN was negligible. The accuracy of KIN was higher than that of KNN in the very high dimensional data sets (prostate and colon cancer). Although the simulation results showed that the accuracy of KIN is affected by the degree of sparsity of the data set, the proposed KIN has accuracy comparable to that of SVM and RF, the best classifiers, in high dimension and low sample size (in the training data set) settings.
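The repeated random partitioning described above can be sketched in Python as follows. The training fraction of 0.4 is an assumption borrowed from the simulation study, since the split ratio for the benchmark data is not restated here; `majority_class` is a hypothetical baseline used only to make the sketch runnable, and the toy labels merely mimic an 11% positive rate.

```python
import random

def average_accuracy(xs, ys, fit_predict, n_partitions=200,
                     train_fraction=0.4, seed=0):
    """Mean test accuracy over repeated random train/test partitions."""
    rng = random.Random(seed)
    n = len(ys)
    n_train = max(1, int(round(train_fraction * n)))
    accuracies = []
    for _ in range(n_partitions):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        preds = fit_predict([xs[i] for i in train], [ys[i] for i in train],
                            [xs[i] for i in test])
        correct = sum(1 for i, p in zip(test, preds) if ys[i] == p)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

def majority_class(train_x, train_y, test_x):
    """Baseline that predicts the most frequent training class for every case."""
    label = 1 if 2 * sum(train_y) > len(train_y) else 0
    return [label] * len(test_x)

# Toy unbalanced data: 11 positives among 100 observations, one dummy feature.
ys = [1] * 11 + [0] * 89
xs = [[0.0] for _ in ys]
acc = average_accuracy(xs, ys, majority_class)
```

On such unbalanced data even this trivial baseline reaches an accuracy near 89%, which is worth keeping in mind when reading accuracy figures for the liver transplant data.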

Discussion
Following the idea of Lu et al. that, to enhance accuracy, a classifier should combine both linearity and locality information [5], we proposed a novel dissimilarity measure for the nearest neighbors classifier. The solutions proposed so far to avoid the deleterious effects of the curse of dimensionality on the KNN method can be summarized in two main categories: (1) dimension reduction, which is based on feature selection or feature extraction [22][23][24][25], or (2) introducing a new dissimilarity measure [3]. From this perspective, KIN can justifiably be placed in both categories. By handling the curse of dimensionality, KIN is able to overcome distance concentration and prevents the creation of hubs. Moreover, its handling of ties in small samples leads to stable results.
The feature extraction techniques proposed for dimension reduction in KNN, such as principal component analysis [10], linear discriminant analysis [26], locality preserving projections [27], random projection [9,10], and nearest feature subspace [24], have two main defects: (1) feature extraction does not explain 100% of the information in the features, so some valuable information is lost, and (2) since extracted features are combinations of both signal and noise, the importance of each feature in classification may not be clearly obtainable.
Our idea in the present study is very close to the approach of Chan and Hall in 2009. They suggested the truncated nearest neighbor, which performs feature selection via a threshold before the classification task [11]. Fan and Li called this threshold the hard threshold and proposed a threshold in SCAD regression called the SCAD threshold [15]. Hence, in contrast to the truncated nearest neighbor, KIN uses the SCAD threshold, which simultaneously satisfies unbiasedness and sparsity [15]. Another important difference between the two methods is that the selected features in KIN do not contribute equally to the construction of the dissimilarity measure, which is an obvious advantage. Although the MADD index, as a novel dissimilarity measure for the KNN classifier, has good accuracy in high dimensional problems, unlike our hybrid dissimilarity measure it does not take the importance of features into consideration and is based only on distances [3]. Considering this shortcoming, we can infer that, as the degree of sparsity tends to one, the MADD index becomes weaker but KIN becomes stronger in terms of accuracy.
Consequently, imposing feature contributions, as a function of SCAD coefficients, on the Euclidean distance (the novelty of the present study) leads to four good properties:
(1) It uses information from both variables and locations, unlike the usual dissimilarity measure in KNN, which ignores information about the features.
(2) It performs dimension reduction because only variables with nonzero coefficients contribute to the construction of the dissimilarity measure.
(3) It increases accuracy by eliminating noisy features from the classification procedure and considering the relative importance of the signal features.
(4) It does not necessarily choose the nearest neighbors. The nature of this hybrid measure leads to choosing important neighbors (KIN), which helps to find more complex patterns in the presence of a huge number of noisy features.

Conclusion.
In summary, KIN performs well in terms of both accuracy and dimension reduction. In very sparse settings, the proposed KIN also outperforms support vector machine (SVM) and random forest (RF), which are regarded as the best classifiers. The KIN approach was able to eliminate nearly all of the noninformative features because it exploits the oracle property of SCAD penalized regression in the construction of the dissimilarity measure. What distinguishes KIN from the KNN, SVM, and RF classifiers is that it performs not only the classification task but also feature selection. In fact, KIN carries out classification using only the very small subgroup of features that can affect class assignment.

Disclosure
This article was adapted from the Ph.D. Dissertation of Hadi Raeisi Shahraki.