A Neighborhood Model with Both Distance and Quantity Constraints for Multilabel Data

In this paper, a novel distance-based multilabel classification algorithm is proposed. The proposed algorithm combines k-nearest neighbors (kNN) with the neighborhood classifier (NC) to impose double constraints on both the quantity and the distance of the neighbors. In short, a radius constraint is introduced into the kNN model to improve classification accuracy, and a quantity constraint k is added to the NC model to speed up computation. From the neighbors satisfying the double constraints, the probability of each label is estimated by the Bayesian rule, and the classification judgment is made according to these probabilities. Experimental results show that the proposed algorithm has slight advantages over similar algorithms in calculation speed and classification accuracy.


Introduction
The multilabel classification problem stems from text and image classification [1]. In practice, a text often has multiple keywords, and an image often has multiple scenes [2]. The authors of [3] proposed a multi-instance multilabel learning (MIML) approach for large datasets via a subspace technique and stochastic gradient descent. To handle data streams, an approach called multilabel learning with emerging new labels (MuENL) was proposed [4]. Combining a mixed dependency graph with class cooccurrence, the authors of [5] constructed an optimization problem to deal with label relevance in multilabel learning with missing labels (MLML) and applied the ADMM algorithm to solve it. In reference [6], Zhang et al. reviewed the problem of binary relevance in multilabel learning. Multilabel classification can be converted into a single-label problem by treating the label sets as vector values; however, the computational complexity is intolerable. In order to improve the computational speed, an improved algorithm called random k-labelsets (RAkEL) was proposed [7]. Zhang and Zhou [8] established a multilabel version of kNN via the Bayesian rule and named it multilabel k-nearest neighbors (ML-kNN). In their experiments, ML-kNN was compared with the multilabel classification algorithms BOOSTEXTER [9], ADTBOOST.MH [10], and RANK-SVM [11]. ML-kNN treats each label as an independent binary classification and ignores the correlation between the labels. In order to make use of this correlation, the authors of [12] regarded the neighbors' labels as features of the instances and gave an extension of ML-kNN. For multilabel data streams, self-adjusting memory (SAM) [13] was adopted to deal with data drifting, and a multilabel k-nearest neighbor method (ML-SAM-kNN) was established [14].
As is well known, the kNN algorithm makes predictions by investigating the k nearest neighbors of the unknown instance, and each instance is assigned the same uniform parameter k.
This is based on the assumption that the data are evenly distributed. In reality, most data are not uniformly distributed. The certainty factor measure [15] was designed for kNN classification to deal with skewed class distributions. Meanwhile, the shell neighbor imputation [16] fills incomplete data by the left and right nearest neighbors. For noisy data, pseudo-nearest neighbors were identified via mutual k-nearest neighbors [17].
For a long time, the choice of the parameter k was empirical. For example, the authors of [18] tried to take it as the square root of the sample size, which does not address the problem of unevenly distributed data. In view of this, a method of selecting a different parameter k for different samples was proposed [19]. Different k values were learned by correlation matrices and assigned to different test data points [20]. A kTree method [21] to learn different optimal k values for different test samples was proposed with a sparse reconstruction model. Cost-sensitive kNN classifiers were designed and further improved with minimum-cost k-value selection, feature selection, and ensemble selection [22]. However, such schemes greatly increase the complexity of the kNN model and reduce the robustness of the algorithm. In the extreme case where a different parameter is set for each sample, the computational complexity is unacceptable, especially for large data. In [23], large data were first separated into several parts by k-means clustering, and kNN classification was then conducted on each part. kNN is a distance-based classifier, which estimates the labels of unknown instances according to the labels of their nearest neighbors. Therefore, the distance between a nearest neighbor and the unknown instance has a great impact on the accuracy of the estimation. In fact, as the distance from the unknown instance increases, the reference value of a nearest neighbor decreases gradually. In [24], a distance-weighted k-nearest neighbor algorithm (DW-kNNA) was introduced to solve a permanent magnet synchronous linear motor (PMSLM) model. Alfeilat, Hassanat, Lasassmeh, et al. showed that the performance of the kNN classifier depends significantly on the distance measure used [25]. An l2,1-norm-based distance measurement was adopted in the loss function to improve model robustness [26].
Meanwhile, a mathematical framework based on differential evolution with compressed sensing can learn sparse module dictionaries and levels from low-dimensional random composite measurements for reconstructing high-dimensional data [27]. In 2020, a locality-constrained graph was introduced into the nonnegative matrix factorization algorithm to discover the geometric structure of the data [28]. A novel machine learning method based on a modified kNN algorithm was proposed, in which more features can be extracted from the datasets and the datasets are updated during the training process instead of being constructed beforehand [29]; there, a computational framework based on compressed sensing can be adopted to reduce dimensionality [30].
However, when the distance between a training instance and the unknown instance exceeds a certain value, the information of that nearest neighbor not only has no reference value for the judgment of the unknown instance but can sometimes even be misleading. Figure 1 shows a binary classification problem with nonuniformly distributed data. Assuming that the problem is linearly separable, the straight line in the figure represents the classification hyperplane. The question mark "?" is an unknown instance; as shown in Figure 1(a), its 3NN set contains two positive instances and one negative instance. However, the correct judgment is that the unknown instance belongs to the negative class. The distance range investigated by the 3NN algorithm is too large in the sparse area. When training instances are far away from the unknown instance, their features are quite different from those of the unknown instance. At this point, it is easy to obtain wrong estimates if these training instances are taken as the classification basis. In view of the above analysis, we assume that a training instance loses its referential meaning when its distance from the unknown instance exceeds a certain value. In other words, we only take the training instances within a certain range centered at the unknown instance as the classification basis.
Based on the above assumptions, we append a new parameter to the kNN model: a distance threshold centered on the unknown instance. Only the k nearest neighbors whose distance from the unknown instance is less than the given value are utilized for prediction. The neighbors used for prediction are thus constrained by both quantity and distance; the distance constraint picks out a subset of the k nearest neighbors. As shown in Figure 1(b), the label of the unknown instance can be correctly judged by the nearest neighbors under the distance constraint. At this point, the two positive instances among the 3 nearest neighbors are no longer used as a reference.
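The double constraint can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the Euclidean distance and the helper name `delta_knn_indices` are assumptions.

```python
import numpy as np

def delta_knn_indices(X_train, x, k, delta):
    """Return the indices of the delta-kNNs of x: the k nearest training
    instances (quantity constraint) that also lie within distance delta
    of x (distance constraint). Euclidean distance is an assumption."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance from x to every instance
    nearest = np.argsort(d)[:k]               # quantity constraint: k nearest
    return [int(i) for i in nearest if d[i] <= delta]  # radius constraint
```

With a small radius, distant members of the kNN set (such as the two positive instances in Figure 1(a)) are discarded before any voting takes place.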
In the neighborhood classifier, the neighborhood contains a large number of instances when the data distribution is dense, especially for numerical data. The computational complexity is greatly increased by counting the label information of all neighbors in the neighborhood. In order to improve computational efficiency, we only consider the k nearest neighbors in the neighborhood as the reference for classification.
Different from single-label classification and classification based on density estimation, multilabel classification predicts a set of labels associated with each sample, where both the number and the categories of the labels are random. Therefore, the complexity of multilabel classification is exponentially higher than that of single-label classification and classification based on density estimation, which is also one of the main problems faced by multilabel classification. In order to alleviate this exponential growth, the multilabel classification task is decomposed into independent single-label classification tasks in this paper. This not only reduces the complexity of the problem but also facilitates parallel computing.
In Section 2, we present a mathematical description of multilabel classification and introduce some of the symbols and concepts to be used. A novel multilabel classification algorithm based on ML-kNN and the neighborhood classifier is presented in Section 3. In order to verify the effectiveness of the algorithm, we selected some common multilabel datasets and carried out comparative experiments; the results are reported and analyzed in Section 4.

Preliminaries
Multilabel training data are composed of features and labels. The features are measurable properties of an instance. The labels represent the classes to which the instance belongs. The purpose of classification learning is to train a classifier from the feature and label data. The classifier can then predict the class labels from the measurable features.

Computational Intelligence and Neuroscience
Let X be a nonempty finite set of instances. An instance can be denoted as a vector x = (x_1, x_2, ..., x_n) ∈ X, which implies that its feature values are x_1, x_2, ..., x_n, respectively. Numerical and nominal feature values are expressed as real numbers and natural numbers, respectively. The distance between two instances x and y is defined as

d(x, y) = ( Σ_{i=1}^{n} d_i(x_i, y_i)^2 )^{1/2}, (1)

where d_i(x_i, y_i) = |x_i − y_i| for numeric features, and, for nominal features, d_i(x_i, y_i) = 0 if x_i = y_i and 1 otherwise. For each instance x ∈ X, let N_k(x) denote the set of kNNs of x, and the set N_δ(x) = {y ∈ X : 0 < d(x, y) ≤ δ} is said to be the δ-neighborhood of the instance x. The elements in the intersection N_k(x) ∩ N_δ(x) are called the δ-kNNs of the instance x.
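A minimal sketch of such a mixed distance, assuming a heterogeneous per-feature form for (1) — absolute difference on numeric features and 0/1 overlap on nominal ones (the exact form used by the authors may differ):

```python
import numpy as np

def mixed_distance(x, y, nominal_mask):
    """Heterogeneous distance over numeric and nominal features.
    nominal_mask[i] is True when feature i is nominal. The per-feature
    form is an assumption, not necessarily the paper's equation (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    per_feature = np.where(np.asarray(nominal_mask),
                           (x != y).astype(float),  # nominal: 0/1 overlap
                           np.abs(x - y))           # numeric: absolute difference
    return float(np.sqrt(np.sum(per_feature ** 2)))
```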
Let L be the collection of labels. Multilabel classification is to learn a function h: X → 2^L from the training data, where the label subset L_x ⊂ L is the set of labels associated with the instance x. Each label l ∈ L defines a random variable l: X → {0, 1}. A random variable C_k: X → Z^+ is defined as C_k(x) = Σ_{y ∈ N_k(x)} l(y) to count the number of instances with the label l ∈ L in the neighborhood N_k(x). In the same way, a random variable C_δ: X → Z^+ is defined as C_δ(x) = Σ_{y ∈ N_δ(x)} l(y) to count the number of instances with the label l ∈ L in the neighborhood N_δ(x), where Z^+ is the positive integer set.
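The two counting statistics can be illustrated directly; the Euclidean distance and the function name `label_counts` are illustrative assumptions:

```python
import numpy as np

def label_counts(X, l, x, k, delta):
    """Compute C_k(x) and C_delta(x): the number of instances carrying the
    label (l[i] = 1) among the k nearest neighbors of x and within the
    delta-neighborhood of x, respectively."""
    d = np.linalg.norm(X - x, axis=1)
    in_knn = np.argsort(d)[:k]                        # N_k(x)
    in_ball = np.flatnonzero((d > 0) & (d <= delta))  # N_delta(x)
    return int(l[in_knn].sum()), int(l[in_ball].sum())
```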
Multilabel k-nearest neighbor (ML-kNN) makes predictions with the label information embodied in the k nearest neighbors via the maximum a posteriori (MAP) rule. The pseudo-code of ML-kNN is shown in Table 1; here, we only list the basic framework to illustrate the idea, see [1, 8] for details. The innovation of this paper lies in the establishment of a new algorithm combining the ideas of ML-kNN and the neighborhood classifier (NC) [31]. The pseudo-code of NC is given in Table 2 to illustrate the basic idea of the proposed algorithm.
From Tables 1 and 2, it can be seen that the difference between the two algorithms lies in the definitions of the neighborhoods. The ML-kNN algorithm limits the number of neighbors, while NC restricts the distance of the neighbors. The algorithm proposed in this paper constrains the nearest neighbors by both quantity and distance.

Bounded ML-kNN
In order to reduce the computational burden, a new radius parameter is introduced into the kNN model to reduce the number of neighbors, and a novel multilabel classification algorithm is established in this section.
We decompose the multilabel data D into a series of binary classification datasets

D_l = {(x, l(x)) : x ∈ X}, l ∈ L,

which are single-label data with the same feature data as the multilabel set D. Then, the classifier deals with each binary classification independently. The neighborhood N^k_δ(x) of each instance x ∈ X is calculated according to the distance defined in Section 2.
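The decomposition D → D_l can be sketched as follows; the dictionary layout and the function name are illustrative choices, not part of the paper.

```python
def decompose(Y):
    """Split a multilabel target matrix Y (rows = instances, columns =
    labels, 0/1 entries) into one binary target vector per label,
    mirroring the construction of the single-label datasets D_l."""
    n_labels = len(Y[0])
    return {l: [row[l] for row in Y] for l in range(n_labels)}
```

Each resulting binary problem can then be handled independently, which is what makes the per-label training parallelizable.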
The random variable C: X → Z^+ is defined as

C(x) = Σ_{y ∈ N^k_δ(x)} l(y), (8)

to count the number of instances with the label l ∈ L in the neighborhood N^k_δ(x). The label l of each instance x ∈ X is estimated by maximizing the posterior probability p(l | C), i.e.,

l(x) = argmax_{l ∈ {0,1}} p(l | C(x)),

where the symbol C(x) represents the random event C = C(x), i.e., that there are C(x) instances with the label l in the neighborhood N^k_δ(x) of the instance x ∈ X. The number C(x) can be any integer between 0 and k. Bayes' theorem implies

p(l | C) = p(C | l) p(l) / p(C).

Since p(C) is constant whether l = 1 or l = 0, the optimization problem reduces to

l(x) = argmax_{l ∈ {0,1}} p(C(x) | l) p(l).

In the training phase, we learn the probability p(l) from the training data D_l. The symbol |X| represents the cardinality of the set X, i.e., the total number of instances in the set X. The ratio

p(l = 1) ≈ (Σ_{x ∈ X} l(x)) / |X|

is taken as an estimate of the probability p(l = 1). For the convenience of computer calculation, we introduce an auxiliary smoothing parameter s, so that the probabilities can be approximated as

p(l = 1) ≈ (s + Σ_{x ∈ X} l(x)) / (2s + |X|), (13)

p(l = 0) ≈ (s + |X| − Σ_{x ∈ X} l(x)) / (2s + |X|). (14)

The second probability to be learned is the conditional probability p(C | l). Consider the following instance sets:

D^1_j = {x ∈ X : l(x) = 1, C(x) = j}, D^0_j = {x ∈ X : l(x) = 0, C(x) = j}, j = 0, 1, ..., k,

which are the sets of the training instances x with the label l (when l(x) = 1) or without the label l (when l(x) = 0) whose neighborhood N^k_δ(x) contains exactly j neighbors with the label l.
Therefore, the conditional probability p(C = j | l = 1) can be approximated by the ratio

p(C = j | l = 1) ≈ |D^1_j| / Σ_{p=0}^{k} |D^1_p|,

where |D^1_j| denotes the cardinality of the set D^1_j, which is equal to the number of the instances in the set D^1_j. The auxiliary parameter s is again introduced to establish the estimation

p(C = j | l = 1) ≈ (s + |D^1_j|) / (s(k + 1) + Σ_{p=0}^{k} |D^1_p|). (17)

Similarly, we can obtain the estimation

p(C = j | l = 0) ≈ (s + |D^0_j|) / (s(k + 1) + Σ_{p=0}^{k} |D^0_p|). (18)

For each unknown instance t, the label is determined by the following principle:

l(t) = argmax_{l ∈ {0,1}} p(C(t) | l) p(l). (19)

The pseudo-code for bounded ML-kNN (BML-kNN) is presented in Table 3.
In the pseudo-code, the multilabel classification is decomposed into an independent single-label problem for each label l ∈ L. From Step 2 to Step 16, the probabilities p(l) and p(C | l) are learned from the single-label training data. The counter c(j) records the number of training instances with the label l whose neighborhood contains j instances with the label l, for each j = 0, 1, 2, ..., k. Simultaneously, the counter c′(j) records the number of training instances without the label l whose neighborhood contains j instances with the label l. From Step 17 to Step 20, the optimization problem is solved by applying the probabilities (13), (14), (17), and (18), and the labels of the unknown instances are given one by one.
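The training and prediction steps described above can be sketched for a single label as follows. This is a minimal illustration under stated assumptions, not the authors' code: the Euclidean distance, the class name, and the tie-breaking in favor of l = 1 are all assumptions.

```python
import numpy as np

class BMLkNNLabel:
    """Per-label bounded ML-kNN sketch: neighbors are constrained by both
    the quantity k and the radius delta, and probabilities are estimated
    with the smoothing parameter s in the spirit of (13), (14), (17), (18)."""

    def __init__(self, k=3, delta=1.0, s=1.0):
        self.k, self.delta, self.s = k, delta, s

    def _neighbors(self, x, X, exclude=None):
        d = np.linalg.norm(X - x, axis=1)
        if exclude is not None:
            d[exclude] = np.inf              # leave the instance itself out
        idx = np.argsort(d)[: self.k]        # quantity constraint
        return idx[d[idx] <= self.delta]     # distance constraint

    def fit(self, X, y):
        n, s, k = len(X), self.s, self.k
        self.p1 = (s + y.sum()) / (2 * s + n)       # smoothed prior, as in (13)
        c1, c0 = np.zeros(k + 1), np.zeros(k + 1)   # counters c(j) and c'(j)
        for i in range(n):
            j = int(y[self._neighbors(X[i], X, exclude=i)].sum())
            counter = c1 if y[i] == 1 else c0
            counter[j] += 1
        self.pc1 = (s + c1) / (s * (k + 1) + c1.sum())  # as in (17)
        self.pc0 = (s + c0) / (s * (k + 1) + c0.sum())  # as in (18)
        self.X, self.y = X, y
        return self

    def predict(self, t):
        j = int(self.y[self._neighbors(t, self.X)].sum())
        # MAP decision as in (19): compare p(l=1) p(C=j | l=1) with its complement
        return int(self.p1 * self.pc1[j] >= (1 - self.p1) * self.pc0[j])
```

Running one such estimator per label l ∈ L reproduces the decomposed scheme of Table 3.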

Experiment Results
[Table 1: Pseudo-code of ML-kNN (basic framework). Training: for each x ∈ X, identify the k nearest neighbors N_k(x); for each l ∈ L, learn the probabilities p(C_k(x) = j) for j = 1, 2, ..., k. Testing: identify the k nearest neighbors N_k(t); for each l ∈ L, calculate the statistic C_k(t) = Σ_{x′ ∈ N_k(t)} l(x′) and predict by the MAP rule.]

[Table 2: Pseudo-code of NC. Testing: identify the δ-neighborhood N_δ(t); for each l ∈ L, calculate the statistic C_δ(t) = Σ_{x′ ∈ N_δ(t)} l(x′) and predict.]

In this section, some experimental results are reported to compare our proposed method with ML-kNN. The experiments were conducted on a computer with an Intel(R) processor. The datasets used are summarized in Table 4. We divided each dataset into two parts: eight tenths were used as the training set and the remaining two tenths as the test set. First, we examine the sensitivity of the classification effect with respect to the radius of the neighborhood. We run the algorithm at different radii and calculate the corresponding classification metrics. The curves of the classification measures varying with the radius are drawn in the figures. The experimental results show that the variation on each dataset is roughly the same. Here, we only list the results for the datasets Scene and Yeast, as shown in Figures 2 and 3.
It can be seen from Figures 2 and 3 that, with the increase of the radius, the five classification evaluation measures improve uniformly. At first, the classification effect does not improve significantly; it then improves sharply once the radius reaches a certain value. However, when the radius increases beyond a certain extent, the values of the five classification evaluation measures remain almost unchanged. On the other hand, for different values of the parameter k, the trends of the curves are basically similar.
From the above analysis, we can conclude that there is an optimal radius for each dataset; different datasets correspond to different optimal radii, which do not change with the parameter k.
In the second experiment, with a given radius, we investigated the sensitivity of the classification accuracy to the parameter k. The resulting curves are shown in Figure 4.
We can see that the accuracy curve varies greatly at different radii. When the radius is too small, the parameter k does not play any role in the algorithm; see, for example, the blue curve in Figure 4, where the radius is 0.6. This indicates that the influence of the radius on accuracy is more significant than that of the parameter k in this algorithm. On the other hand, we can roughly conclude from the figures that the accuracy increases with k in the first stage but decreases after k reaches a certain value. Similar to the radius, there is also an optimal parameter k for each dataset. The last pictures in Figures 2, 3, 4, and 5 show the variation of the running time with the parameter k and the radius. The running time tends to increase slightly as the two parameters increase, but not significantly.
This means that the two parameters have little effect on the speed of the algorithm. Figure 5 shows the joint effects of the two parameters on classification accuracy. When the parameter k is below 10, its influence on classification accuracy is obvious. When the parameter k is above 10, the classification accuracy is mainly determined by the radius. We can see the parallel mountains in the graph, whose height depends on the radius. This indicates that, relative to the parameter k, the radius is the determinant of the classification effect. From the experimental point of view, it is therefore reasonable to introduce the radius as a parameter in this paper.
In the second part, we compare the running speeds of five algorithms: the neighborhood classifier (NC), ML-kNN, BOOSTEXTER, RANK-SVM, and bounded ML-kNN (BML-kNN). Under the same experimental conditions, the algorithms are applied to some generic datasets, and the average running times (s) are shown in Table 5. The underlining indicates the best of the five algorithms.
The experimental results show that the algorithm NC is not advantageous in running time compared with the other algorithms, especially for some datasets with high density. The algorithms ML-kNN and BML-kNN are faster than the other algorithms. Moreover, BML-kNN has a slight numerical advantage: it performed well on six of the ten datasets, especially on the larger datasets Mediamill and Nus-wide.
In the last part of the experiment, we compare the classification accuracy of the five algorithms. At present, various evaluation criteria for multilabel classification appear in the related literature. In this paper, we chose five common metrics: Hamming loss, One-error, Coverage, Ranking loss, and Average precision. For their definitions and calculation methods, please refer to Reference [1]. Among them, the larger the value of Average precision, the better the classification effect; for the other four measures, smaller values indicate better classification accuracy. We apply the five algorithms to the same datasets and compare the five classification metrics to measure the classification effect of the algorithms.

Table 3: Pseudo-code of BML-kNN.
Input: X, L, k, radius δ > 0, and unknown instance t
Output: l(t), l ∈ L
Step 1 For l ∈ L
Step 2 Compute p(l = 1) and p(l = 0) as (13) and (14)
Step 3 c = zeros(k + 1), c′ = zeros(k + 1)
Step 4 For x ∈ X
Step 5 Compute C(x) as (8)
Step 6–10 If l(x) = 1 then c(C(x)) = c(C(x)) + 1, else c′(C(x)) = c′(C(x)) + 1; End if
Step 11 End for
Step 12 For j = 0: k
Step 13 Compute p(C = j | l = 1) and p(C = j | l = 0) as (17) and (18)
Step 15 End for
Step 16 End for
Step 17 For l ∈ L
Step 18 Compute C(t) as (8)
Step 19 Compute l(t) as (19)
Step 20 End for

Table 4: Description of the experimental datasets.
Dataset | Feature type | Instances | Features | Labels | Domain
Enron | Nominal | 1702 | 1001 | 53 | Text
Medical | Nominal | 978 | 1449 | 45 | Text
Corel5k | Nominal | 5000 | 499 | 374 | Images
Genbase | Nominal | 662 | 1185 | 27 | Biology
Yeast | Numerical | 2417 | 103 | 14 | Biology
Emotion | Numerical | 593 | 72 | 6 | Music
CAL500 | Numerical | 500 | 62 | 174 | Music
Scene | Numerical | 2407 | 294 | 6 | Images
Mediamill | Numerical | 43907 | 120 | 101 | Video
Nus-wide | Numerical | 269648 | 129 | 81 | Images

Tables 6–10 show the classification accuracy under different measures.
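As a reference for the first of these metrics, Hamming loss — the fraction of misclassified instance-label pairs — can be computed as follows. This is a plain re-implementation of the standard metric, not tied to the authors' evaluation code.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average fraction of labels predicted incorrectly per instance over
    a 0/1 label matrix; lower values indicate better accuracy."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float((Y_true != Y_pred).mean())
```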
The numerical values shown in the tables are the averages of ten parallel experiments, where the underlines highlight the best result among the algorithms.
As shown in Table 6, in terms of Hamming loss, the proposed algorithm BML-kNN has a slight advantage over the other algorithms. However, the advantage is not significant; the numerical difference generally appears in the third significant digit. Table 7 reports the experimental results for the evaluation metric One-error. Except on the datasets Genbase, Medical, Corel5k, and CAL500, BML-kNN works slightly better than the other algorithms. From the experimental results, we can also find that BML-kNN has a prominent performance on all the numerical datasets. The metric Coverage evaluates the average cost to cover all the true labels. The test results on Coverage are listed in Table 8. The experimental results show that NC and BML-kNN outperform the others. However, the algorithms do not differ much in the numerical value of Coverage. Meanwhile, BML-kNN does not have obvious advantages on the nominal datasets Enron, Medical, Corel5k, and Genbase.
Different from the other evaluation metrics, the higher the Average precision, the better the classification effect.
According to the evaluation metrics Average precision and Ranking loss, as shown in Tables 9 and 10, BML-kNN is slightly better than NC, and NC is better than the others.

Conclusions
The proposed algorithm BML-kNN is based on the framework of ML-kNN. ML-kNN only restricts the number of the nearest neighbors, while NC only limits their distance. BML-kNN considers both factors at the same time and estimates the labels of test instances based on up to k training instances in the neighborhood. Experimental results illustrate that the classification accuracies of BML-kNN and NC are slightly higher than that of ML-kNN. The calculation speeds of BML-kNN and ML-kNN are basically equal, while the algorithm NC has higher computational complexity. The algorithms involved in this paper divide the multilabel classification problem into a series of single-label problems. Specifically, each label is extracted and combined with the feature data to form a single-label binary classification problem. In this way, the single-label problems are independent of each other, and the correlation between labels is not considered. How to learn the label correlations and apply them in the algorithm may be a meaningful topic for future work.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.