A Survey of k Nearest Neighbor Algorithms for Solving the Class Imbalanced Problem

k nearest neighbor (kNN) is a simple and widely used classifier that can achieve performance comparable to that of more complex classifiers such as decision trees and artificial neural networks; accordingly, kNN has been listed among the top 10 algorithms in machine learning and data mining. On the other hand, in many classification problems, such as medical diagnosis and intrusion detection, the collected training sets are usually class imbalanced. In class imbalanced data, although positive examples are heavily outnumbered by negative ones, the positive examples usually carry more meaningful information and are more important than the negative ones. Like other classical classifiers, kNN assumes that the training set has an approximately balanced class distribution, which leads to unsatisfactory performance on imbalanced data. Moreover, under a class imbalanced scenario, the global resampling strategies that suit decision trees and artificial neural networks often do not work well for kNN, which is a classifier driven by local information. To address this problem, researchers have proposed many kNN-oriented methods over the past decade. This paper presents a comprehensive survey of these works, organizes them by perspective, and analyzes and compares their characteristics. Finally, several future directions are pointed out.


Introduction
k nearest neighbor (kNN) [1] is simple to implement, performs well, and can achieve performance comparable to that of more sophisticated classifiers including decision tree [2], artificial neural network [3], and support vector machine [4]. Therefore, kNN has been listed as one of the top 10 algorithms in data mining and machine learning [5,6]. kNN has been utilized in many applications, such as pattern recognition [7], feature selection [8], and outlier detection [9]. For a test example with an unknown class label, kNN makes a decision by employing the local information surrounding the test example. Concretely, kNN first simply stores all the training examples; then, in the classification phase, it takes the class occurring most frequently among the k (k ≥ 1) nearest training examples of the test example as the classification result. That is, kNN makes a decision according to the class distribution characteristics in the k neighborhood of a test example.
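The majority voting rule described above can be sketched in a few lines of Python (a generic illustration of the classic algorithm, not tied to any particular paper in this survey):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x_test, k=3):
    """Classic kNN: majority vote among the k nearest training examples."""
    # Rank all training examples by Euclidean distance to the test example.
    ranked = sorted((math.dist(x, x_test), y) for x, y in zip(train_X, train_y))
    # The class occurring most frequently among the k nearest wins.
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]
```

With two well-separated clusters and k = 3, the test point is simply assigned to the cluster it falls into; the imbalance-related problems discussed below arise exactly when this simple vote is dominated by the majority class.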
Nowadays, machine learning and data mining techniques are widely used in many aspects of the information society. However, for some applications such as medical diagnosis [10], system intrusion detection [11], and network fraud detection [12], the collected training example set is usually class imbalanced, i.e., there is a large difference among the sizes of different classes. In medical diagnosis, for instance, misclassifying a normal patient as a special one merely incurs additional examinations. But if a special patient is erroneously classified as a normal patient, the best treatment time will be missed and serious consequences will follow. Similarly, misclassifying illegal access as legal will lead to the disclosure of an organization's internal data or the theft of bank account information. From these two instances, it can be seen that, in class imbalanced data, although the positive class is heavily outnumbered by the negative class, the positive class is usually the one in which we are more interested and is more important than the negative class. The positive class is also called the minority class, while the negative class is also called the majority class.
Similar to classical classifiers such as decision tree, artificial neural network, and support vector machine, kNN is also built on the assumption that a training set has an approximately balanced class distribution, i.e., the classes have roughly the same number of training examples. In addition, these algorithms all employ the overall classification accuracy as the optimization objective in the classifier training phase, leading to their unsatisfactory performance on class imbalanced data. kNN takes the majority class in the k neighborhood of a test example as the classification result; this majority voting-based classification rule further degrades its performance on a class imbalanced problem, because the positive examples are usually sparse in the k neighborhood of a test example [6]. Experiments conducted in Reference [13] indicate that SMOTE oversampling integrated with Random Undersampling (RUS) [14] or SMOTE oversampling integrated with the cost-sensitive MetaCost method [15] can both significantly improve the performance of the C4.5 decision tree [2] on class imbalanced data. Unfortunately, these strategies do not work well for improving kNN in a class imbalanced scenario. The authors in [13] give the following explanation: kNN makes a decision by investigating the local neighborhood of a test example, while the resampling and cost-sensitive strategies are global methods and are therefore naturally inappropriate for kNN. Special methods for kNN thus need to be designed under the class imbalanced scenario.
As can be seen from the above illustration, improving kNN performance on imbalanced data is an important topic, which is of great significance to the expansion of its application fields and the enhancement of its practical utility. Over the past decade, researchers have proposed many methods for this purpose. This paper tries to give a comprehensive survey of these works according to their perspectives and analyzes and compares their characteristics, which serves as a foundation for further study in this field.
The rest of this paper is organized as follows. Section 2 illustrates the weighting strategy-based methods, Section 3 the local geometrical structure-based methods, and Section 4 the fuzzy logic-based methods, followed by a category of methods based on missing positive example estimation in Section 5. Section 6 presents methods based on novel distance metrics, Section 7 presents dynamic-sized neighborhood-based methods, and conclusions and future work are given in Section 8.

Methods Based on Weighting Strategy
This section introduces a category of methods that assign weights to training examples in the neighborhood of a test example. In general, these methods can be divided into the five subcategories shown in the following.
2.1. Weighting Strategy Considering the Class Distribution around the Neighborhood. The authors in [12] claim that the reason for the unsatisfactory performance of kNN on imbalanced data lies in the following: it only utilizes the local prior probabilities of each class in the neighborhood of a test example but does not employ the class distribution information around the neighborhood. In Figure 1, if the imbalanced class distribution around the test example's neighborhood is considered, i.e., the area surrounded by the dotted rectangle, then the test example can be correctly classified as the positive class, because in this dotted area the positive nearest neighbors of the test example far outnumber the negative ones. Therefore, the classification performance of kNN can be improved if such local class distribution information is utilized.
Based on the above observation, a weighting-based method is proposed in [12] to assign a test example-dependent local weight to each class, i.e., the weight of examples in a class varies with the test example rather than being a constant value. Concretely, for test example x_t ∈ ℝ^d, the weight w_t^l of examples in class C_l (1 ≤ l ≤ L, where L is the total number of classes in a classification problem) is calculated as follows. For the ⌈k/L⌉ nearest neighbors of x_t in class C_l, if they are erroneously classified by traditional kNN, it is likely that these ⌈k/L⌉ neighbors belong to the minority (positive) class in the neighborhood of x_t; in this case, the weight of class C_l is enlarged. Therefore, the learned weights take into consideration the class distribution information around the neighborhood. For the binary classification problem in Figure 1, L = 2; when k equals 7, we have ⌈k/L⌉ = ⌈7/2⌉ = 4; the test example's 4 nearest neighbors in positive class C_1 are P1 to P4, while its 4 nearest neighbors in negative class C_2 are N1 to N4. It can be seen that P1, P2, and P4 are misclassified as negative examples because the majority of their 7 nearest neighbors are negative examples; thus, the weight of the positive class should be enlarged. With the enlarged positive-class weight, in the classification phase of 7NN for the test example, the 3 positive neighbors P1 to P3 have much larger weight than the 4 negative neighbors, i.e., P1 to P3 contribute more to the classification result; thus, the correct classification is achieved.
However, the shortcoming of this weighting-based method is that approximately k extra runs of kNN (exactly ⌈k/L⌉ × L) are required around the neighborhood of each test example; thus, the computation cost is increased.
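The weight-learning step above can be sketched as follows. This is a minimal reading of [12]: for each class, plain kNN is re-run on that class's ⌈k/L⌉ members nearest to the test example, and the class weight grows with the number of them that are misclassified; the concrete weight form 1 + (error fraction) is an illustrative assumption, not the paper's exact formula.

```python
import math
from collections import Counter

def plain_knn(train, x, k):
    """Ordinary kNN, used here as a subroutine."""
    ranked = sorted((math.dist(xi, x), yi) for xi, yi in train)
    return Counter(y for _, y in ranked[:k]).most_common(1)[0][0]

def local_class_weights(train, x_test, k, classes):
    """Test example-dependent class weights in the spirit of [12]:
    the more of a class's ceil(k/L) nearest members plain kNN
    misclassifies, the larger that class's weight becomes."""
    m = math.ceil(k / len(classes))
    weights = {}
    for c in classes:
        # The m members of class c nearest to the test example.
        members = sorted(((math.dist(xi, x_test), xi)
                          for xi, yi in train if yi == c))[:m]
        # Leave-one-out kNN on each member; count misclassifications.
        errors = sum(1 for _, xi in members
                     if plain_knn([t for t in train if t[0] != xi], xi, k) != c)
        weights[c] = 1.0 + errors / max(len(members), 1)  # assumed weight form
    return weights
```

On a toy set where three positives are engulfed by negatives, the positive class's nearest members are misclassified by plain kNN, so its weight comes out larger than the negative class's.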

2.2. Weighting Strategy Based on Examples' Informativeness. The authors in [16] believe that some examples carry more information than others: if an example is close to the test example and far from examples of other classes, it is considered more informative. Following this idea, it is easily seen from Figure 2 that the example with index 2 carries more information than the one with index 1. The reason is that the two examples have roughly the same distance to the test example (the "query point" in Figure 2), but the example with index 1 is nearer to the class boundary, i.e., it is closer to the other classes. Based on the above consideration, the authors propose two informative kNN algorithms: the local information-based version LI-kNN and the global information-based version GI-kNN.
2.2.1. The Idea of LI-kNN. LI-kNN first finds the k nearest neighbors x_t^1, x_t^2, ⋯, x_t^k of test example x_t in the training set, then employs the designed metric to evaluate the informativeness of each training example in the k neighborhood, i.e., the evaluation scope is local, and selects the I most informative of them to make the classification decision; GI-kNN instead evaluates informativeness globally over the training set. Experimental results indicate that GI-kNN and LI-kNN are not very sensitive to the choice of parameter I and can achieve performance comparable to that of SVM. One drawback of GI-kNN is that the robustness of its adopted informativeness metric needs to be enhanced when there are noisy examples in the training set.

2.3. Class Confidence-Based Weighting Strategy. The class confidence-weighted (CCW) kNN method [17] assigns weights to training examples in the neighborhood. As shown in Figure 3, the real boundary between the negative class (denoted by blue triangles) and the positive class (denoted by red circles) is represented by the solid blue line. There are 4 negative examples and 1 positive example in the k (k equals 5 in this case) neighborhood of the test example (denoted by the solid green circle), and the nearest training example of the test example is a negative one. In this case, the classification result is certainly the negative class if the traditional majority voting-based classification rule is adopted. However, the test example actually belongs to the positive class, and the negative examples in its neighborhood are in reality positive ones as well. Thus, for each training example in the neighborhood, the probability that it truly belongs to its recorded class should be considered.
Based on this idea, for each training example (x_t^j, y_t^j) in the neighborhood of test example x_t, the confidence of its belonging to its current class y_t^j is calculated according to its attribute values x_t^j. On the other hand, the class confidence-weighted method has to calculate the class confidence for each training example in the neighborhood, which increases the computation cost to some extent.
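The confidence-weighted vote can be sketched as below. Note the hedge: CCW itself derives each neighbor's confidence from mixture models and Bayesian networks (as discussed later in this survey); the smoothed label purity used here is only a simple stand-in for that confidence estimate.

```python
import math
from collections import defaultdict

def ccw_knn_predict(train, x_test, k=5):
    """Confidence-weighted voting in the spirit of CCW-kNN: each neighbor's
    vote is scaled by an estimate of how reliable its recorded label is."""
    def confidence(xi, yi):
        # Stand-in confidence: Laplace-smoothed label purity of xi's
        # own k nearest neighbors (CCW uses mixture models instead).
        ranked = sorted((math.dist(xj, xi), yj) for xj, yj in train if xj != xi)
        agree = sum(1 for _, yj in ranked[:k] if yj == yi)
        return (1 + agree) / (k + 2)

    ranked = sorted((math.dist(xi, x_test), xi, yi) for xi, yi in train)
    votes = defaultdict(float)
    for _, xi, yi in ranked[:k]:
        votes[yi] += confidence(xi, yi)   # weighted rather than unit vote
    return max(votes, key=votes.get)
```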

2.4. Weighting Strategy Based on Nearest Neighbor Density. A nearest neighbor density-based weighted class-wise kNN (WCkNN) algorithm is proposed in [18], and its basic idea is as follows.
First, the k nearest neighbor density of test example x_t is determined in each class. This is implemented by constructing a k-radius sphere S_{l,k}(x_t) that takes x_t as its center and contains at least k nearest examples from class C_l (1 ≤ l ≤ L); the volume V_{l,k}(x_t) of this sphere is then used to define the density 1/V_{l,k}(x_t). It is not hard to see that, for a test example, its k-radius sphere in the positive class usually has a much larger volume than the one in the negative class due to the sparse distribution of positive class examples. As the radius of S_{l,k}(x_t) is determined by the distance d_{l,k}(x_t) between x_t and its kth nearest neighbor in class C_l, d_{l,k}(x_t) is often used to approximately represent the volume of S_{l,k}(x_t).
Second, the posterior probability of the test example belonging to each class is calculated based on the above k nearest neighbor density, as shown in formula (1). In formula (1), the weight β_l of class C_l is obtained by employing a convex optimization technique to optimize a nonlinear metric on the training set, and from this formula, we have the following observations. (a) For class balanced data, the two classes C_1 and C_2 have equal weights. (b) For imbalanced data, compared with negative class C_n, positive class C_p is more likely to be sparsely distributed around x_t, i.e., d_{C_p,k}(x_t) > d_{C_n,k}(x_t); fortunately, the effect of this imbalanced distribution can be overcome by assigning a larger weight to the positive class, i.e., β_p > β_n.
At last, the class having the largest posterior probability is taken as the classification result: y = arg max_l P(C_l | x_t), l = 1, 2, ⋯, L.
In terms of complexity, when classifying test example x_t, WCkNN needs to run kNN once on each class to determine the k nearest neighbor density in that class. Thus, L (the total number of classes) runs of kNN are needed to classify a test example.
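The density estimate above can be sketched compactly. This follows the paper's stated approximation (the k-th class-wise neighbor distance d_{l,k} standing in for the sphere volume); the β values are taken as given, since the convex-optimization step that learns them is not specified here.

```python
import math

def wcknn_posteriors(train, x_test, k, betas):
    """Sketch of WCkNN's class-wise density posterior: each class's density
    is beta_l divided by a volume proxy, then normalized over classes."""
    scores = {}
    for c, beta in betas.items():
        dists = sorted(math.dist(xi, x_test) for xi, yi in train if yi == c)
        d_k = dists[min(k, len(dists)) - 1]   # distance to k-th neighbor in class c
        scores[c] = beta / max(d_k, 1e-12)    # density ~ 1 / volume proxy
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```

The class with the largest posterior is then the decision, mirroring y = arg max_l P(C_l | x_t).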

2.5. Weighting Strategy Integrated with Self-Adaptive k. The methods introduced above all use a constant k value, i.e., for each test example, k is the sum of the number of its positive neighbors and the number of its negative ones. Thus, the number of neighbors is not considered separately for each class.
To further improve the performance of weighted kNN methods, the authors in [19] propose to integrate the self-adaptive k technique with the example weighting strategy. In terms of weight determination, positive examples are assigned larger weights than negative ones; in terms of neighborhood size, the positive class is given a small neighborhood size k_p while the negative class is given a relatively large neighborhood size k_n, i.e., k_n > k_p. In this way, the test example's k_p positive neighbors and k_n negative neighbors constitute its neighborhood of size k_p + k_n.
Accordingly, the classification result is determined by two quantities: (a) the weighted sum over the test example's k_p positive neighbors, W_p(x_t) = ∑_{j=1}^{k_p} w_t^j, and (b) the weighted sum over the test example's k_n negative neighbors, W_n(x_t) = ∑_{j=1}^{k_n} w_t^j. The class with the larger value is the decision result. To sum up, the self-adaptive k-based weighted kNN is simple and flexible. As to the formula used in the assignment of each class's neighborhood size, more effort is needed to ensure that it is theoretically sound.
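The two-score decision rule can be sketched as follows. The class-wise sizes and weights (k_n > k_p, w_pos > w_neg) follow the description above, but the distance-decayed per-neighbor weight w / (1 + d) is an illustrative choice, not the paper's exact formula.

```python
import math

def adaptive_weighted_knn(train, x_test, k_pos=3, k_neg=7,
                          w_pos=2.0, w_neg=1.0):
    """Sketch of the self-adaptive k idea of [19]: a small neighborhood of
    k_pos positives and a larger one of k_neg negatives, each neighbor
    contributing its class weight attenuated by distance."""
    def class_score(label, kk, w):
        dists = sorted(math.dist(xi, x_test) for xi, yi in train if yi == label)
        return sum(w / (1.0 + d) for d in dists[:kk])

    w_p = class_score('pos', k_pos, w_pos)   # W_p(x_t)
    w_n = class_score('neg', k_neg, w_neg)   # W_n(x_t)
    return 'pos' if w_p > w_n else 'neg'
```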

Methods Based on Local Geometric Structure of Data
An algorithm named class conditional nearest neighbor distribution (CCNND) is presented in [20], which alleviates the class imbalanced problem by using the local geometric structure of data. Its basic idea is as follows.

3.1. Calculating the k Nearest Neighbor Distances in Each Class. For each training example x_m in class C_l (1 ≤ l ≤ L), the distances to its k nearest neighbors (excluding itself) in class C_l are calculated.

3.2. Making Decisions Based on the k Nearest Neighbor Distances of the Test Example in Each Class. First, for test example x_t, the distances to its k nearest neighbors in each class C_l are calculated, where dist(x_t, x_tq^l) denotes the distance between x_t and its qth nearest neighbor x_tq^l (1 ≤ q ≤ k) in class C_l. Second, for each class, the number N_l(x_t) of its training examples whose k nearest neighbor distances are larger than those of the test example is determined. It can be seen that the more such examples a class has, the closer its class center is to the test example, i.e., the more likely the test example belongs to this class. Thus, the classification result is y_t = arg max_{l∈{1,2,⋯,L}} N_l(x_t).
The conducted experiments demonstrate that, compared with classical resampling and cost-sensitive methods, CCNND achieves comparable or even better performance. As shown in Figure 4, the decision boundary obtained using CCNND is closer to the real boundary than those obtained using SVM and nearest neighbor. In addition, another advantage of CCNND is that it still works when the imbalance degree in a training set changes over time, e.g., in the case of online streaming data [21]. Therefore, CCNND can be applied to streaming data such as oil and natural gas industrial data.

Fuzzy Logic-Based Methods
In fuzzy logic-based [22] classification methods, an example is assigned a membership degree for each class rather than a single crisp class label, which preserves richer classification information and thus supports a more informed classification. Based on this observation, a fuzzy weighted kNN algorithm that integrates the advantages of both fuzzy logic and weighted kNN is proposed in [22]; it is the first method below, introduced in Subsection 4.1, while the second method, in Subsection 4.2, is a further improvement of fuzzy kNN itself.

4.1. Fuzzy Weighted kNN Algorithm. The fuzzy weighted kNN in [22] improves the weighted kNN method by utilizing the advantages of fuzzy logic, and it has the following three steps.

Determining the Class Membership of Each Example. The membership of example x ∈ ℝ^d for class C_l (1 ≤ l ≤ L) is calculated using formula (2), where n_{C_l} is the number of training examples belonging to class C_l in the k neighborhood of example x and C(x) is the true class label of x.
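A membership initialization of this shape can be sketched as below. The 0.51/0.49 split follows the classic fuzzy kNN initialization of Keller et al., which is assumed here to be the form of formula (2); the true class gets a fixed base membership, and every class earns a share proportional to its presence among x's k neighbors.

```python
import math
from collections import Counter

def fuzzy_membership(train, x, true_label, k=5, classes=('pos', 'neg')):
    """Fuzzy kNN membership initialization (Keller-style, assumed form of
    formula (2)): base 0.51 for the true class plus 0.49 * (each class's
    share of x's k nearest neighbors)."""
    ranked = sorted((math.dist(xi, x), yi) for xi, yi in train if xi != x)
    counts = Counter(y for _, y in ranked[:k])
    return {c: (0.51 if c == true_label else 0.0) + 0.49 * counts[c] / k
            for c in classes}
```

By construction the memberships over all classes sum to 1, and the true class's membership never drops below 0.51.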

In the second step, a weight is assigned to each class according to the number of its examples in the neighborhood. It is easy to see that the positive class is assigned a weight of 1 while the negative class is assigned a weight less than 1, and the more examples the negative class has, the smaller its weight. The class membership of test example x_t is then calculated using formula (4), where μ_{C_l,j}(x_t) is the class membership of x_t's jth (j = 1, 2, ⋯, k) nearest neighbor for class C_l and w_j is the weight of the class to which this neighbor belongs.
At last, the decision is the class having the largest membership: y(x_t) = arg max_{l=1,2,⋯,L} μ_{C_l}(x_t).
4.2. Self-Adaptive k-Based Fuzzy kNN. Although the weighted fuzzy kNN introduced in Subsection 4.1 can achieve good performance, the fuzzy kNN algorithm itself cannot accurately compute examples' class memberships under the class imbalanced scenario. To solve this problem, an improved fuzzy kNN algorithm based on a self-adaptive k strategy is proposed in [23]; it contains the following steps.

Determining the Neighborhood Size k for Each Class. The basic idea is to use a relatively large neighborhood for the negative class and a small one for the positive class. Concretely, the neighborhood size of each class is determined using formula (5), where N(C_l) is the number of training examples in class C_l and λ is a constant (e.g., 1) that prevents the value of k_{C_l} from being too small.

Calculating the Class Membership of Training Examples According to the Obtained Neighborhood Size. Formula (6) adopted here differs from formula (2): the corresponding class's k value k_{C_l} is used when calculating the class membership of example x, and C(x) is the true class label of x.

Determining the Class Membership of the Test Example. The class membership of test example x_t for class C_l (l = 1, 2, ⋯, L) is calculated using formula (7), where μ_{C_l}(x_{t,j}^l) is the class membership of example x_{t,j}^l for class C_l, and x_{t,j}^l is the jth (1 ≤ j ≤ k_{C_l}) neighbor of test example x_t in class C_l. It can be seen from formula (7) that the class membership of x_t for class C_l is in fact the distance-weighted sum of the corresponding class memberships of x_t's k_{C_l} nearest neighbors.
where l takes values from {1, 2, ⋯, L} and p is an integer larger than 1.
At last, the class having the largest membership is the classification decision: y(x_t) = arg max_{l=1,2,⋯,L} μ_{C_l}(x_t).
By adopting different neighborhood sizes for different classes, this self-adaptive k-based fuzzy kNN can effectively alleviate the adverse influence of negative examples in the neighborhood of a positive example, making the obtained class membership more objective and thus improving the classification performance of fuzzy kNN on imbalanced data.
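The class-wise neighborhood sizing can be sketched as follows. Since formula (5) is not reproduced here, this is one plausible reading of it: each class's k is proportional to its share of the training set, floored at λ so the minority class's neighborhood never collapses to zero.

```python
def class_wise_k(class_sizes, k=5, lam=1):
    """Assumed reading of formula (5): neighborhood size proportional to
    each class's share of the training set, with a floor of lam (lambda)
    protecting the minority class."""
    n_total = sum(class_sizes.values())
    return {c: max(lam, round(k * n / n_total))
            for c, n in class_sizes.items()}
```

With a 90/10 split and k = 5, the majority class receives the larger neighborhood and the minority class is held at the λ floor, matching the "large k_n, small k_p" design above.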

Methods Based on Missing Positive Data Estimation
The class imbalanced problem is regarded as a missing positive data estimation problem in [24]. From this perspective, a method called Fuzzy-based Information Decomposition (FID) is proposed, and its main idea is as follows.
(1) Dividing the value range of each attribute into intervals
According to the values of the current training set on attribute attr_s (i.e., attr_s1, attr_s2, ⋯, attr_sN) and the number t of positive examples to be generated, t intervals are obtained: q_1 = [a, a + h], ⋯, where attr_si is the value of the ith (i = 1, 2, ⋯, N) training example on attribute attr_s, N is the total number of current training examples, a and b are, respectively, the minimum and maximum values among (attr_s1, attr_s2, ⋯, attr_sN), and h is the step length: h = (b − a)/t.
(2) Generating a synthetic attribute value for each interval
For the mth (1 ≤ m ≤ t) interval q_m, a synthetic value of attribute attr_s is generated as follows. The fuzzy membership μ(x_i, q_m) of training example x_i's (i = 1, 2, ⋯, N) attribute value attr_si with respect to interval q_m is calculated and used as the weight w_i^m of example x_i in estimating the mth missing value of attribute attr_s. For instance, if the attribute value attr_sj of example x_j (j = 1, 2, ⋯, N) is a "neighbor" of the center ((a + (m − 1)h) + (a + mh))/2 of interval q_m, i.e., their distance is less than the step length h, then the corresponding fuzzy membership μ(x_j, q_m) is calculated and serves as the weight w_j^m = μ(x_j, q_m) of example x_j; otherwise, the weight of x_j is set to 0: w_j^m = 0. Therefore, the mth estimated value for attribute attr_s can be represented as in formula (8).
The weights satisfy ∑_{j=1}^{N} w_j^m = 1. That is to say, only when an example's attribute value is close to the center of interval q_m (m = 1, 2, ⋯, t) can it effectively influence the calculation of the mth synthetic attribute value. Thus, the mth estimated value for attribute attr_s is the weighted sum of these effective training examples' values on this attribute.
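The per-attribute estimation can be sketched as follows. The interval construction and the "within one step h of the center" rule follow the description above; the triangular membership shape (1 at the center, falling to 0 at distance h) is an assumption, as formula (8)'s exact membership function is not reproduced here.

```python
def fid_synthetic_values(values, t):
    """Sketch of FID's per-attribute estimation: split [min, max] into t
    intervals of width h; each interval's synthetic value is the
    membership-weighted mean of the attribute values within h of its
    center (triangular membership assumed)."""
    a, b = min(values), max(values)
    h = (b - a) / t
    synthetic = []
    for m in range(1, t + 1):
        center = ((a + (m - 1) * h) + (a + m * h)) / 2
        # Membership is 1 at the interval center, 0 at distance >= h.
        weights = [max(0.0, 1.0 - abs(v - center) / h) for v in values]
        total = sum(weights)
        synthetic.append(sum(w * v for w, v in zip(weights, values)) / total
                         if total else center)
    return synthetic
```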
The advantage of FID is that it can deal with data of arbitrary dimension, as it generates the missing values separately for each attribute. Traditional methods like Random OverSampling (ROS) [14] and Clustering Based OverSampling (CBOS) [25] tend to overfit because they replicate existing positive examples; for methods like SMOTE [26] and Majority Weighted Minority (MWM) oversampling [27], an appropriate positive example needs to be selected before generating a synthetic positive example using linear interpolation. However, these traditional methods perform poorly when the positive examples in the original training set are scarce. Fortunately, FID can overcome this problem, again because it generates the synthetic values separately for each attribute. As to the disadvantage of FID, when calculating a synthetic value for an attribute, the memberships of all training examples' values on this attribute with respect to the current interval need to be computed, leading to high computation and time complexity for large training sets.

Novel Distance Metric-Based Methods
Euclidean distance is usually adopted as the metric to evaluate the similarity between two examples. However, this metric does not treat positive and negative examples differently in the distance calculation. To remedy this shortcoming, the following works each present a novel distance metric that is sensitive to positive examples. In the exemplar-based method k-ENN [28], a positive example is regarded as an exemplar one if its Gaussian ball contains no negative example. For instance, Figure 5 displays the Gaussian balls of three positive examples numbered 1 to 3, denoted using dashed circles. It is easy to see that there is no negative example (denoted by the symbol "−") in any of the three Gaussian balls; thus, the three positive examples are all exemplar ones.

Defining the Distance between the Test and Training Examples. When classifying test example x_t, its distance to each training example x_i (1 ≤ i ≤ N) needs to be computed.

In k-ENN, if x_i is an exemplar positive example, the distance is defined as in formula (9); if x_i is not an exemplar positive example, i.e., it is an ordinary positive example or a negative one, the distance is still the Euclidean distance, as in formula (10). The distance in formula (9) subtracts the radius of the Gaussian ball of exemplar positive example x_i; in effect, it is the distance between test example x_t and the boundary of x_i's Gaussian ball. In this way, the distances from exemplar positive examples to the test example are reduced so that these exemplar positives receive more attention in the classification phase, consequently improving the classification performance for positive examples.
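The radius subtraction can be sketched as below. The assumed forms of formulas (9)/(10) follow the text directly (Euclidean distance minus the ball radius for exemplars, plain Euclidean otherwise); the radius construction used here, the distance to the nearest negative example, is a simple stand-in for the paper's Gaussian-ball construction.

```python
import math

def exemplar_radius(x_pos, negatives, eps=1e-9):
    """Radius of the largest ball around positive example x_pos that
    contains no negative example (stand-in for the Gaussian ball)."""
    return min(math.dist(x_pos, xn) for xn in negatives) - eps

def kenn_distance(x_test, x_i, radius=None):
    """Formulas (9)/(10) as described: subtract the exemplar's ball radius
    from the Euclidean distance (distance to the ball boundary); use plain
    Euclidean distance when x_i is not an exemplar (radius is None)."""
    d = math.dist(x_test, x_i)
    return d - radius if radius is not None else d
```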

Distance Based on Examples' Weights. An example weighting-based distance is proposed in [29]; it considers the relative importance of each training example x_i (1 ≤ i ≤ N) when classifying test example x_t, which is implemented by incorporating the training examples' weights rather than simply computing the Euclidean distance.
It can be seen from formula (11) that the training examples' weights are incorporated directly into the distance computation, so that the importance of each training example influences the resulting distance.

Methods Based on Dynamic-Sized Neighborhood
The PNN algorithm [31] constructs, for each test example x_t, a neighborhood that is extended until it contains ⌈k/2⌉ positive training examples. In Figure 6(a), the test example is denoted by the symbol "*" and the positive and negative examples are denoted by the symbols "+" and "−," respectively. When k = 5, the neighborhood having ⌈k/2⌉ = ⌈5/2⌉ = 3 positive training examples contains r = 8 examples in total; in this case r > k, and an "extended neighborhood" S_{k/2}^p(x_t) is obtained for test example x_t.
(2) Making a decision based on whether the "k/2-PNN" neighborhood is a positive subconcept
Figure 6: Schematic diagram of the neighborhood in PNN.

If the ratio of positive examples in S_{k/2}^p(x_t) is much higher than that in the overall training set D, then S_{k/2}^p(x_t) is considered a positive class subconcept and ⌈k/2⌉/k > 1/2 is set as the posterior probability of the test example for the positive class, i.e., x_t is classified as a positive example. Otherwise, if S_{k/2}^p(x_t) is not a positive subconcept, then the ratio ⌈k/2⌉/r of positive examples in S_{k/2}^p(x_t) is taken as the posterior probability of x_t for the positive class; in this case, the probability is usually less than 0.5, i.e., x_t is classified as a negative example.
For instance, if the neighborhood in Figure 6(a) is a positive class subconcept, then the probability of the test example with respect to the positive class is P(+|x_t) = ⌈k/2⌉/k = 3/5 > 1/2; thus, the classification result of PNN is the positive class; otherwise, the probability is P(+|x_t) = ⌈k/2⌉/r = 3/8 < 1/2, and the test example is classified as a negative example. For the case in which there are fewer than k examples in S_{k/2}^p(x_t), as shown in Figures 6(b) and 6(c), which rarely occurs under the class imbalanced scenario, the positive examples are densely distributed around the test example and the corresponding probability is P(+|x_t) = ⌈k/2⌉/r = 3/4 > 1/2, i.e., the decision of PNN is the positive class. For the case in which S_{k/2}^p(x_t) has the same size as the k neighborhood S_k(x_t), as shown in Figure 6(d), PNN degrades to kNN.
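The PNN decision rule can be sketched as follows. The neighborhood expansion and the two probabilities ⌈k/2⌉/k and ⌈k/2⌉/r follow the description above; the concrete subconcept criterion used here (local positive ratio at least twice the global one) is an assumption, since the paper's threshold is not reproduced in this text.

```python
import math

def pnn_positive_probability(train, x_test, k=5):
    """Sketch of PNN: expand the neighborhood outward until it holds
    ceil(k/2) positive examples (r examples in total); report ceil(k/2)/k
    if that extended neighborhood is a positive subconcept, else
    ceil(k/2)/r."""
    need = math.ceil(k / 2)
    ranked = sorted((math.dist(xi, x_test), yi) for xi, yi in train)
    r = seen_pos = 0
    for _, yi in ranked:
        r += 1
        seen_pos += (yi == 'pos')
        if seen_pos == need:          # neighborhood now holds ceil(k/2) positives
            break
    global_ratio = sum(1 for _, y in train if y == 'pos') / len(train)
    if need / r >= 2 * global_ratio:  # assumed subconcept criterion
        return need / k
    return need / r
```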
Experiments in [31] indicate that the simple and effective decision bias of PNN can better classify positive examples; PNN usually outperforms k-ENN [28] and achieves performance comparable to the CCW-kNN [17] mentioned in previous sections. In terms of efficiency, PNN has a much lower computation cost than these two methods, which both require a "training phase": (1) k-ENN needs to determine all the exemplar positive examples in the training phase to expand its decision boundary, while (2) CCW-kNN computes the weight of each training example using a mixture model and a Bayesian network. Therefore, both methods have a high computation cost. In addition, PNN also outperforms oversampling techniques like SMOTE as well as cost-sensitive strategies like MetaCost.
7.1.2. k Rare-Class Nearest Neighbor Algorithm. k Rare-class Nearest Neighbor (kRNN) is proposed in [32]; its idea is similar to that of PNN introduced in the previous subsection. kRNN also constructs a test example-dependent dynamic-sized neighborhood and then adjusts the posterior probability of the test example according to the positive examples' distribution in the extended neighborhood. The differences between kRNN and PNN mainly lie in the following two aspects.
(1) For test example x_t, the neighborhood constructed by kRNN contains at least k′ positive examples, where k′ is a constant that takes the value 1 or 3 in most cases. (2) When calculating the test example's posterior probability of belonging to the positive class, the local and global confidence intervals of the positive class are both utilized, making the obtained probability P(+|x_t) more accurate than the one (i.e., ⌈k/2⌉/k) obtained by PNN. It has been experimentally demonstrated that kRNN significantly improves the classification performance of kNN for the positive class and often outperforms resampling and cost-sensitive strategies employing base classifiers such as decision tree and support vector machine.
In addition, the global information is not fully utilized in these methods.
To overcome these drawbacks, a gravitational fixed radius nearest neighbor (GFRNN) algorithm is proposed in [34], which is inspired by the concept of gravitation in classical dynamics. GFRNN is formed by introducing the "gravitation between example pair" into the fixed radius nearest neighbor method. Concretely, GFRNN operates as follows.
(1) The distance between each pair of training examples is first calculated, and their average value is adopted as the neighborhood radius R of a test example, as shown in formula (12), where D denotes the training set. The neighborhood of test example x_t can then be described by formula (13); it consists of the training examples whose distance to x_t is no more than R.
(2) The gravitation between each training example in the neighborhood and the test example is then computed using formula (14). To simplify the computation, both the gravitational constant G and the mass m_{x_t} of test example x_t are set to 1, where mass m_{x_t} is essentially the weight of x_t. Thus, only the mass m_{x_i} of each training example x_i in the neighborhood needs to be determined; to balance the effects of the positive and negative classes, positive neighbors are assigned a larger mass, as in formula (15). (3) The decision is made according to formula (16), where n_pos(x_t) and n_neg(x_t) are, respectively, the numbers of positive and negative examples in the neighborhood S(x_t) of test example x_t. Formula (16) indicates that GFRNN decides as follows: if the gravitation sum of the positive examples in S(x_t) is larger than that of the negative examples, then x_t is classified as a positive example; otherwise, x_t is classified as a negative example.
It can be seen from the above illustration that, in determining the neighborhood of a test example, GFRNN requires no parameter and utilizes only one piece of global information: the average distance among training example pairs. In addition, another piece of global information, namely, the class imbalance ratio IR of the training set, is used as the weight of positive neighbors. To sum up, GFRNN has the following advantage: it can effectively address the class imbalanced problem without requiring the initialization or adjustment of any parameter, which further extends the family of kNN classification algorithms. On the other hand, GFRNN employs only the overall class imbalance ratio IR of the training set to weight positive neighbors and does not utilize any local information about the training examples, which can be seen as its disadvantage. The following two works each present a solution to this problem.

Two Improvement Algorithms for GFRNN
(1) The First Improvement Algorithm. An entropy and gravitation-based dynamic radius nearest neighbor (EGDRNN) algorithm is proposed in [35]; its differences from GFRNN mainly lie in the following two aspects.
(a) Neighborhood radius determination

EGDRNN determines the radius of test example x_t's neighborhood by first computing its average distance avgdist_{D_pos}(x_t) to the positive examples D_pos in training set D and its average distance avgdist_{D_neg}(x_t) to the negative examples D_neg, and then taking the sum of these two values as the neighborhood radius, as shown in formula (17):

R(x_t) = avgdist_{D_pos}(x_t) + avgdist_{D_neg}(x_t). (17)

Therefore, the radius determined by EGDRNN depends on the location of the test example with respect to the positive and negative classes and varies from one test example to another.

(b) Weighting strategy for examples in neighborhood
In addition to IR, EGDRNN also introduces the concept of information entropy so that examples in different locations have different degrees of importance. Concretely, for each training example x_i (i = 1, 2, ⋯, |S(x_t)|) in the neighborhood, EGDRNN computes its information entropy E(x_i) using formula (18):

E(x_i) = −Σ_{j=1}^{2} p(x_i, C_j) log p(x_i, C_j), (18)

where C_1 and C_2 denote the positive and negative classes, respectively; p(x_i, C_1) denotes the probability of example x_i belonging to the positive class, and p(x_i, C_2) the corresponding probability for the negative class. Probability p(x_i, C_1) is calculated as the proportion of positive examples in the k neighborhood of example x_i, and p(x_i, C_2) as the corresponding proportion of negative examples. It can be seen that the smaller the information entropy of x_i, the higher the certainty about its class membership; conversely, the larger the information entropy, the lower the certainty, i.e., the closer x_i is to the decision boundary.
To sum up, for a test example, the gravitation sums of the examples in its neighborhood are calculated as in formula (19), in which the mass of each neighbor combines IR with its information entropy:

x_t is positive if Σ_{x_i ∈ S(x_t), x_i positive} IR · E(x_i) / d(x_i, x_t)^2 ≥ Σ_{x_i ∈ S(x_t), x_i negative} E(x_i) / d(x_i, x_t)^2; otherwise, x_t is negative. (19)

Formula (19) demonstrates that EGDRNN pays more attention to the positive examples in the neighborhood as well as to the examples that are close to the class boundary. Experimental results indicate that EGDRNN not only achieves high classification accuracy but also has the lowest time cost among the comparison algorithms.
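The dynamic radius, entropy weighting, and entropy-weighted gravitation can be sketched together as follows. Again this is an illustrative sketch under our own assumptions (labels +1/−1, Euclidean distance, base-2 logarithm for the entropy), not the implementation from [35].

```python
import numpy as np

def egdrnn_predict(X_train, y_train, x_test, k=5):
    """Sketch of EGDRNN; labels are +1 (positive) and -1 (negative)."""
    d_test = np.linalg.norm(X_train - x_test, axis=1)

    # (17) dynamic radius: avg distance to positives + avg distance to negatives
    R = d_test[y_train == +1].mean() + d_test[y_train == -1].mean()
    mask = d_test <= R

    # (18) entropy of each training example from the class proportions
    # in its own k neighborhood (self excluded)
    pair = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    ent = np.zeros(len(X_train))
    for i in range(len(X_train)):
        nn = np.argsort(pair[i])[1:k + 1]          # k nearest training neighbors of x_i
        p1 = (y_train[nn] == +1).mean()
        for p in (p1, 1 - p1):
            if p > 0:
                ent[i] -= p * np.log2(p)

    # (19) class-wise gravitation sums, IR-weighted for positive neighbors
    IR = (y_train == -1).sum() / (y_train == +1).sum()
    eps = 1e-12  # numerical guard (assumption)
    grav = ent[mask] / (d_test[mask] ** 2 + eps)
    pos_sum = IR * grav[y_train[mask] == +1].sum()
    neg_sum = grav[y_train[mask] == -1].sum()
    return +1 if pos_sum >= neg_sum else -1
```

Because the entropy of examples deep inside a pure region is zero, this sketch effectively lets the boundary examples dominate the decision, which is the behavior formula (19) is designed to encourage.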
(2) The Second Improvement Algorithm. The improvement made in [33] lies in the following aspect: when determining the weight of neighboring training example x_i (i = 1, 2, ⋯, |S(x_t)|), in addition to IR, the gravitation exerted on x_i by the other examples x_j (x_j ∈ D, x_j ≠ x_i) is also considered. The authors of [33] believe that, for a training example, the larger the sum of gravitation from the other examples, the denser its surrounding examples (local information); such a training example is regarded as less important and assigned a relatively low mass (weight). In this way, both the global information (i.e., IR) and this local information are utilized in determining the weights of neighbors.
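The density-based mass idea can be illustrated with a short sketch. The normalization by the smallest gravitation sum is our own assumption for keeping masses in (0, 1], not the exact formula from [33]; the point is only that denser examples receive lower mass.

```python
import numpy as np

def density_mass(X_train):
    """Mass from local density: the denser an example's surroundings
    (i.e., the larger the gravitation it receives from all other examples,
    with unit masses), the LOWER the mass assigned to it."""
    pair = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    np.fill_diagonal(pair, np.inf)            # exclude self-gravitation
    g_sum = (1.0 / (pair ** 2)).sum(axis=1)   # gravitation received from all others
    return g_sum.min() / g_sum                # denser -> larger g_sum -> smaller mass
```

In a full classifier, this mass would then be multiplied by IR for positive neighbors before computing the gravitation sums, combining the global and local information as described above.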

Conclusion
kNN is a simple and effective base learning algorithm that can achieve classification performance comparable with more complex classifiers such as decision trees and artificial neural networks. However, kNN does not work well on imbalanced data, owing to its use of overall classification accuracy as the optimization objective and its majority voting-based classification rule. To solve this problem, researchers have conducted many studies and proposed a variety of solutions. This paper gives a comprehensive survey of these works according to their adopted perspectives and analyzes and compares their characteristics. Moreover, several problems in this field still deserve further study; we list three of them below:

(1) Most algorithms introduced in this paper mainly consider the case in which there is only one positive class in the imbalanced data; thus, how to adjust these algorithms so that they work when there are two or more positive classes is an important problem.

(2) For the global information-based algorithm GI-kNN, how to improve the robustness of its information metric to noisy training examples needs to be investigated.

(3) For online streaming data, in which the class imbalance degree can change over time, only the class conditional nearest neighbor distribution algorithm CCNND introduced in Section 2 is applicable. More effort is therefore needed to make the other methods introduced in this paper suitable for online streaming data.

Conflicts of Interest
The authors declare that they have no conflicts of interest.