Semisupervised Particle Swarm Optimization for Classification

A semisupervised classification method based on particle swarm optimization (PSO) is proposed. The semisupervised PSO simultaneously uses limited labeled samples and large amounts of unlabeled samples to find a collection of prototypes (or centroids) that are considered to precisely represent the patterns of the whole data; the unlabeled data are then classified with the obtained prototypes according to the nearest neighbor rule. To validate the performance of the proposed method, we compare its classification accuracy with that of the PSO classifier, the k-nearest neighbor algorithm, and the support vector machine on six UCI datasets, four typical artificial datasets, and the USPS handwritten dataset. Experimental results demonstrate that the proposed method performs well even with very limited labeled samples, owing to its use of both the discriminant information provided by labeled samples and the structure information provided by unlabeled samples.


Introduction
The particle swarm optimization (PSO) [1,2], originally proposed by Eberhart and Kennedy, is a population-based stochastic search process. It is inspired by the social interaction behavior of bird flocking and fish schooling. In the context of PSO, a swarm refers to a number of potential solutions to the optimization problem, where each particle represents a potential solution. Each particle flies through the search space with a velocity that is dynamically adjusted according to its own local information (the cognitive component) and its neighbors' information (the social component), and tends to fly toward better and better search areas [3]. PSO has been widely applied to solve machine learning problems in various fields [4,5].

In the area of machine learning, traditional learning methods can be divided into two categories: supervised learning and unsupervised learning. In many real-world applications of machine learning, a certain number of labeled samples are needed to train a classifier; this is called supervised learning, and examples include the decision tree [6], support vector machines (SVMs) [7][8][9][10], the neural network classifier [11], and PSO-based classifiers [12][13][14][15][16][17]. In practice, however, labeled instances are difficult, expensive, and time consuming to obtain. When only unlabeled samples are available, learning can be achieved in an unsupervised way, for example by K-means clustering or fuzzy C-means clustering. It is often the case that both labeled and unlabeled samples are available, but the labeled samples are too few to give favorable performance in a supervised way, while abundant unlabeled samples are easy to obtain. Therefore, semisupervised learning [18] and reinforcement learning [19] have been introduced and have proved to be quite promising. Semisupervised learning tries to improve performance by combining limited labeled samples with large amounts of unlabeled ones to perform the classification. It has recently become more and more popular for problems such as text classification [20], mail categorization [21], and human action recognition [22], for which labeled samples are highly limited.
Nearest neighbor (NN) classification is one of the most popular classification methods. It is a "lazy" learning method because it does not train the classifier on labeled training data in advance [23]. The nearest neighbor decision rule assigns an unknown input sample vector the class label of its nearest neighbor [24], where nearness is measured by a distance defined in the feature space. In this space, each class defines a region, which is called the Voronoi region [25]. When the distance is the classical Euclidean distance, the Voronoi regions are delimited by linear borders. The method can be extended to K-nearest neighbors when more than one nearest neighbor is considered. In addition, distance measures other than the Euclidean distance can also be used.
A further improvement of the NN method replaces the original training data by a set of prototypes that correctly "represent" the original data [23]. Namely, the classifier assigns class labels by calculating distances to the prototypes rather than to the original training data. As the number of prototypes is much smaller than the total number of original training samples, a new sample is classified much faster owing to the reduced computational complexity of the solution (measured by the number of prototypes). These nearest prototype algorithms are able to achieve better accuracy than the basic NN classifiers. An evolutionary approach to the prototype selection problem can be found in [26].
PSO has been used for classification in many studies. Most PSO-based classification methods combine PSO with an existing machine learning or classification algorithm such as NN [12,27], neural networks [13], and rough set theory [28]. In [12], an unsupervised learning algorithm is proposed by minimizing the distances within clusters. In [13], an evolutionary approach-based nearest prototype classifier is introduced. In [27], PSO is applied to find the optimal positions of class centroids in the feature space of the dataset using the examples contained in the training set.
PSO has shown competitive performance in classification problems. However, it usually needs many labeled data points to obtain the optimal positions of class centroids. Semisupervised learning provides a better solution by making full use of the abundant unlabeled data along with the limited labeled samples to improve the accuracy and robustness of class predictions [18][29][30][31]. In this paper, we propose a semisupervised classification method based on the standard PSO, namely, semisupervised PSO (SSPSO), in which the available supervised information and the wealth of unlabeled data points are simultaneously used to search for the optimal positions of class centroids. The key point in SSPSO is to introduce the unlabeled information into the fitness function of PSO in a natural way. The advantages of SSPSO can be summarized as follows: firstly, it is a semisupervised learning method that can be applied with limited labeled samples; secondly, with fewer prototypes, new patterns are classified faster; thirdly, SSPSO is able to achieve competitive or even better accuracy than the basic NN classifier and SVM.
The rest of this paper is organized as follows. In Section 2, the theory of nearest neighbor classification and PSO is presented. The classification method based on PSO and our proposed SSPSO are described in Section 3. Experimental results and analysis on UCI datasets, some typical artificial datasets, and the USPS handwritten dataset are shown and discussed in Section 4. Finally, Section 5 concludes this paper.

Review of Related Methods
2.1. K-Nearest Neighbor Algorithm. In pattern recognition, the k-nearest neighbor algorithm (KNN) is a simple method for classification. KNN is a type of lazy learning where the function is only approximated locally and all computation is deferred until classification. The simplest 1-NN algorithm assigns an unknown input sample to the class of its nearest neighbor from a stored labeled reference set. Instead of looking only at the closest labeled sample, the KNN algorithm seeks the k samples in the labeled reference set that are closest to the unknown sample and applies a voting mechanism to predict the label.
Suppose $T = \{(\mathbf{x}_i, y_i)\}$ is the training set, where $\mathbf{x}_i \in \mathbb{R}^D$ denotes a training example in a continuous multidimensional feature space and $y_i \in \mathbb{R}$ is the class label of $\mathbf{x}_i$. For 1-NN classification, the class label of a test sample $\mathbf{x} \in \mathbb{R}^D$ is obtained by finding the training example nearest to $\mathbf{x}$ under some distance metric, such as the Euclidean distance in (1), and assigning the class label of that training example to it. For KNN classification, the class label of the test sample is obtained by majority voting. Consider

$$d(\mathbf{x}, \mathbf{x}_i) = \|\mathbf{x} - \mathbf{x}_i\|_2 = \sqrt{\sum_{j=1}^{D} \left(x_j - x_{ij}\right)^2}. \tag{1}$$
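The KNN rule described above can be illustrated with a minimal Python sketch. The function names (`euclidean`, `knn_predict`) and the toy data are assumptions made for illustration only; they are not taken from the paper.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x, k=1):
    """Predict the label of x by majority vote among its k nearest samples.

    train is a list of (feature_vector, label) pairs (hypothetical layout).
    With k=1 this reduces to the 1-NN rule.
    """
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled reference set with two classes.
train = [([0.0, 0.0], "a"), ([0.1, 0.2], "a"),
         ([1.0, 1.0], "b"), ([0.9, 0.8], "b")]

print(knn_predict(train, [0.2, 0.1], k=1))
print(knn_predict(train, [0.8, 0.9], k=3))
```

Sorting the whole reference set keeps the sketch short; for large reference sets a partial selection or a spatial index would be the more usual choice.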

2.2. Particle Swarm Optimization.
PSO is based on a swarm of $N$ individuals called particles, each representing a solution to a problem with $n$ dimensions. The genotype of a particle consists of $2n$ parameters: the first $n$ parameters are the coordinates of the particle's position, and the latter $n$ parameters are its velocity components in the $n$-dimensional problem space. Besides these two basic properties, the following properties exist: a personal best position $\mathbf{pbest}_i$ of each particle in the search space and the global best position $\mathbf{gbest}$ of the whole swarm. A fitness function corresponding to the problem is used to evaluate the "goodness" of each particle. Given a random initial position and velocity, the particles are updated with the following:

$$\mathbf{v}_i^{(t+1)} = w\,\mathbf{v}_i^{(t)} + c_1 r_1 \left(\mathbf{pbest}_i - \mathbf{p}_i^{(t)}\right) + c_2 r_2 \left(\mathbf{gbest} - \mathbf{p}_i^{(t)}\right), \tag{2}$$

$$\mathbf{p}_i^{(t+1)} = \mathbf{p}_i^{(t)} + \mathbf{v}_i^{(t+1)}, \tag{3}$$

where $\mathbf{p}_i^{(t)}$ and $\mathbf{v}_i^{(t)}$ are the position and velocity of the $i$th particle at the $t$th iteration, respectively. The two positive factors $c_1$ and $c_2$, known as the cognitive and social coefficients, control the contributions of the best local solution $\mathbf{pbest}_i$ (cognitive component) and the global best solution $\mathbf{gbest}$ (social component), respectively. $r_1$ and $r_2$ are two independent random variables within $[0, 1]$. The inertia weight factor $w$ is used to control the convergence of the swarm. In this paper, a nonlinearly changing inertia factor, as in [32], is used for PSO and SSPSO; it is given by (4), where $T_{\max}$ is the maximum number of iterations and $t$ is the current iteration number. Note that, during the iterations, every dimension of the velocity is restricted to the range $[-V_{\max}, V_{\max}]$ to limit the maximum distance that a particle can move in one step.
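A single PSO iteration, as described above, might look like the following Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function name `pso_step` is hypothetical, the random factors are drawn independently for every dimension (one common convention), and the inertia weight `w` is passed in by the caller because its schedule is not reproduced here.

```python
import random

def pso_step(positions, velocities, pbest, gbest, w,
             c1=2.0, c2=2.0, v_max=0.05, rng=random):
    """Update all particles in place for one iteration.

    positions, velocities, pbest: lists of per-particle coordinate lists.
    gbest: coordinate list of the best position found by the swarm.
    """
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = rng.random(), rng.random()
            v = (w * velocities[i][d]
                 + c1 * r1 * (pbest[i][d] - positions[i][d])   # cognitive pull
                 + c2 * r2 * (gbest[d] - positions[i][d]))     # social pull
            # Clamp each velocity component to [-v_max, v_max].
            velocities[i][d] = max(-v_max, min(v_max, v))
            positions[i][d] += velocities[i][d]

# Two particles in a 2-dimensional search space.
pos = [[0.0, 0.0], [1.0, 1.0]]
vel = [[0.0, 0.0], [0.0, 0.0]]
pso_step(pos, vel, pbest=[[0.0, 0.0], [1.0, 1.0]], gbest=[0.0, 0.0], w=0.9)
print(pos, vel)
```

In this example the first particle already sits on both its personal best and the global best, so it does not move, while the second is pulled toward the global best by at most `v_max` per dimension.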

Semisupervised Particle Swarm Optimization for Classification
In the context of PSO-based classification on a dataset $X$ with $C$ classes and $D$ attributes, the classification problem can be seen as that of searching for the optimal positions of the $C$ centroids of the data clusters in a $D$-dimensional space using the labeled samples [23]. Then, the NN method is applied as the classifier: it assigns class labels to the unlabeled instances by calculating their distances to the centroids. The data to be classified are a set of samples defined by continuous attributes, and the corresponding class is defined by a scalar value. Different attributes may take values in different ranges. To prevent an attribute with large values from dominating the distance measure, all attributes are normalized to the range [0, 1] before classification.
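The two ingredients mentioned above, attribute normalization and nearest-centroid labeling, can be sketched in Python as follows. The helper names and the choice of min-max scaling are assumptions; the text only states that attributes are normalized to [0, 1], and min-max scaling is one straightforward way to do so.

```python
import math

def min_max_normalize(samples):
    """Rescale every attribute (column) of samples to the range [0, 1].

    A constant attribute is mapped to 0.0 to avoid division by zero.
    """
    lows = [min(col) for col in zip(*samples)]
    highs = [max(col) for col in zip(*samples)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]

def nearest_centroid(x, centroids):
    """Return the index (class) of the centroid closest to x."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(range(len(centroids)), key=lambda c: dist(x, centroids[c]))

raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
print(min_max_normalize(raw))
print(nearest_centroid([0.9, 0.4], [[0.0, 0.0], [1.0, 0.5]]))
```

Without the rescaling step, the second attribute in `raw` (values in the hundreds) would dominate the Euclidean distance over the first.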
As a particle denotes a full solution to the classification of data with $D$ attributes and $C$ classes, $C$ centroids are encoded in each particle. A centroid corresponds to a class, so it is defined by $D$ continuous values for the attributes. Table 1 describes the structure of a single particle's position. Centroids are encoded sequentially in the particle, and a separate array determines the class of each centroid. Namely, the class of each centroid is defined by its position inside the particle. The total dimension of the position vector is $C \times D$, and similarly the velocity of the particle is made up of $C \times D$ real numbers representing its velocity components in the problem space. To simplify the notation, we denote by $\mathbf{p}_i^{j}$ the $j$th class centroid vector $\mathbf{p}_i((j-1) \times D + 1 : j \times D)$ encoded in the $i$th particle.

The fitness function plays an important role in PSO. A good fitness function lets the swarm find the optimal positions of the particles quickly. In [26], the fitness function $\psi$ of the classical PSO classification method is computed as the sum of the Euclidean distances between all the training samples and the centroids, encoded in the particle, of the classes they belong to. The sum is then divided by $l$, the total number of training samples. The fitness of the $i$th particle is defined as

$$\psi(\mathbf{p}_i) = \frac{1}{l} \sum_{j=1}^{l} d\left(\mathbf{x}_j, \mathbf{p}_i^{CL(j)}\right), \tag{5}$$

where $CL(j)$ denotes the class label of the training sample $\mathbf{x}_j$, $\mathbf{p}_i^{CL(j)}$ denotes the centroid vector of the class $CL(j)$ encoded in the $i$th particle, and $d(\mathbf{x}_j, \mathbf{p}_i^{CL(j)})$ is the Euclidean distance between the training sample $\mathbf{x}_j$ and the class centroid $\mathbf{p}_i^{CL(j)}$.

Equation (5) only considers the labeled samples, which provide the discriminant information. However, when labeled samples are limited, they are too few to represent the real distribution of the dataset, while abundant unlabeled samples are often available and may help to capture the real geometrical structure of the whole dataset. To take full advantage of the existing unlabeled samples, we modify the fitness function by introducing the structure information of unlabeled samples into the original fitness function of the PSO classifier. Under the NN assumption that neighboring samples should have the same labels, we propose the following new fitness function for SSPSO:

$$\psi(\mathbf{p}_i) = \beta \cdot \frac{1}{l} \sum_{j=1}^{l} d\left(\mathbf{x}_j, \mathbf{p}_i^{CL(j)}\right) + (1 - \beta) \cdot \frac{1}{u} \sum_{k=1}^{u} \min_{1 \le c \le C} d\left(\mathbf{x}_k, \mathbf{p}_i^{c}\right), \tag{6}$$

where $\psi(\mathbf{p}_i)$ is the fitness value of the $i$th particle; $\beta$ is a weight factor in the range [0, 1], which controls the ratio of information obtained from the labeled and unlabeled samples; $u$ is the number of unlabeled samples $X_U$; and $l$ is the number of labeled samples $X_L$. The first term on the right-hand side of the fitness function is the discriminant constraint, which means that a good classifier should achieve a better result on the labeled samples. The second term is the structure constraint, which helps to find the real distribution of the whole dataset so as to improve the classification performance.
When $\beta = 1$, we obtain the standard PSO algorithm, and when $\beta = 0$, we obtain an unsupervised PSO clustering method.
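As a rough illustration of the semisupervised fitness described above, the following Python sketch combines a labeled term (mean distance from each labeled sample to the centroid of its own class) with an unlabeled term (mean distance from each unlabeled sample to its nearest centroid). The exact form of the unlabeled term, the name `beta` for the weight factor, the function names, and the use of integer class labels 0..C-1 are assumptions made for this sketch; both sample sets are assumed to be nonempty.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def decode(particle, n_classes, n_attrs):
    """Split a flat position vector into one centroid per class."""
    return [particle[c * n_attrs:(c + 1) * n_attrs] for c in range(n_classes)]

def sspso_fitness(particle, labeled, unlabeled, n_classes, n_attrs, beta=0.5):
    """Weighted sum of a labeled (discriminant) and an unlabeled (structure) term.

    labeled: list of (feature_vector, class_index) pairs.
    unlabeled: list of feature vectors.
    """
    centroids = decode(particle, n_classes, n_attrs)
    discriminant = sum(euclidean(x, centroids[y]) for x, y in labeled) / len(labeled)
    structure = sum(min(euclidean(x, c) for c in centroids)
                    for x in unlabeled) / len(unlabeled)
    return beta * discriminant + (1.0 - beta) * structure

# One labeled sample whose class centroid is one unit away,
# one unlabeled sample lying exactly on the other centroid.
particle = [0.0, 0.0, 1.0, 0.0]          # centroid 0 = (0, 0), centroid 1 = (1, 0)
labeled = [([0.0, 0.0], 1)]
unlabeled = [[0.0, 0.0]]
for beta in (1.0, 0.5, 0.0):
    print(beta, sspso_fitness(particle, labeled, unlabeled, 2, 2, beta))
```

Setting `beta=1.0` leaves only the labeled term and `beta=0.0` leaves only the clustering-style term, which mirrors the two limiting cases mentioned in the text.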
The detailed process of the proposed SSPSO is as follows.
Output. The labels of the unlabeled samples.
Step 1. Load the training dataset and the unlabeled samples.
Step 3. Initialize the swarm with $N$ particles by randomly generating both the position and velocity vectors of each particle, with every entry in the range [0, 1]. Note that the dimension of each particle equals the product of the number of attributes $D$ and the number of classes $C$.
Step 4. Iterate until the maximum number of iterations is reached.
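The procedure can be sketched end to end in Python as follows. This is a hedged illustration, not the authors' code: the details of the iteration (updating personal and global bests after each move, per-dimension random factors, a linearly decreasing inertia weight, and nearest-centroid labeling at the end) are assumptions filled in from the standard PSO scheme, and all names (`sspso`, `fitness`, `beta`, and so on) are hypothetical. Inputs are assumed to be already normalized to [0, 1], with integer class labels 0..C-1.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def fitness(p, labeled, unlabeled, n_classes, n_attrs, beta):
    """Labeled term plus nearest-centroid term on unlabeled data (assumed form)."""
    cents = [p[c * n_attrs:(c + 1) * n_attrs] for c in range(n_classes)]
    sup = sum(euclidean(x, cents[y]) for x, y in labeled) / len(labeled)
    uns = sum(min(euclidean(x, c) for c in cents) for x in unlabeled) / len(unlabeled)
    return beta * sup + (1.0 - beta) * uns

def sspso(labeled, unlabeled, n_classes, n_attrs, beta=0.5, n_particles=20,
          t_max=100, c1=2.0, c2=2.0, v_max=0.05, seed=0):
    rng = random.Random(seed)
    dim = n_classes * n_attrs
    # Random positions and velocities with entries in [0, 1].
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, labeled, unlabeled, n_classes, n_attrs, beta) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for t in range(t_max):
        w = 0.9 - 0.5 * t / t_max  # assumed linear decrease from 0.9 toward 0.4
        for i in range(n_particles):
            for d in range(dim):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i], labeled, unlabeled, n_classes, n_attrs, beta)
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    # Decode the best particle and label unlabeled samples by nearest centroid.
    centroids = [gbest[c * n_attrs:(c + 1) * n_attrs] for c in range(n_classes)]
    labels = [min(range(n_classes), key=lambda c: euclidean(x, centroids[c]))
              for x in unlabeled]
    return centroids, labels

labeled = [([0.1, 0.1], 0), ([0.9, 0.9], 1)]
unlabeled = [[0.12, 0.08], [0.05, 0.15], [0.88, 0.93], [0.95, 0.85]]
centroids, labels = sspso(labeled, unlabeled, n_classes=2, n_attrs=2)
print(labels)
```

The toy run uses one labeled sample per class and four unlabeled samples drawn from two well-separated groups; the fixed `seed` only makes the sketch repeatable.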

Experimental Results and Analysis
In this section, we assess our proposed method SSPSO on six UCI datasets, four artificial datasets, and the USPS handwritten dataset. The datasets have different numbers of attributes and classes and cover different problems, including balanced and unbalanced ones.
To evaluate the performance of SSPSO, we compare its classification results with those of the PSO-based classifier, the traditional NN classifier, and the classical SVM classifier. To compare the algorithms fairly, all the parameters of PSO and SSPSO are selected so that each obtains its best results. The parameter settings are as follows. The inertia weight factor $w$ used in PSO and SSPSO decreases linearly from 0.9 to 0.4. Both $c_1$ and $c_2$ are set to 2. The velocity is restricted to the range [−0.05, 0.05]. The swarm size $N$ is set to 20 and the maximum number of iterations $T_{\max}$ is 1000. The parameters of the SVM with a Gaussian kernel function are selected by grid search on the training dataset. In addition, we analyze the effect of the number of unlabeled samples on the classification accuracy on the USPS dataset. To test how robust the classification performance is to the parameter $\beta$ in the fitness function, we conduct experiments on UCI datasets with different values of $\beta$ and analyze its effect on the classification performance.
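A linearly decreasing inertia weight of the kind described in these settings can be written in a few lines of Python. The interpolation formula and the function name `inertia_weight` are assumptions; the text states only the start value (0.9), the end value (0.4), and that the decrease is linear.

```python
def inertia_weight(t, t_max, w_start=0.9, w_end=0.4):
    """Linearly interpolate from w_start at t = 0 to w_end at t = t_max."""
    return w_start - (w_start - w_end) * t / t_max

# Weight at the start, midpoint, and end of a 1000-iteration run.
for t in (0, 500, 1000):
    print(t, inertia_weight(t, 1000))
```

A larger weight early on favors exploration of the search space, while the smaller weight near the end damps the velocities so that the swarm can settle.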

4.1. Artificial Two-Dimension Problems.
To test the feasibility of SSPSO for classification, the proposed method is first applied to four artificial two-dimensional datasets, that is, long1, sizes5, square1, and square4. The details are shown in Table 2, and the distributions of the four datasets are shown in Figure 1.
In the experiments, for the first two datasets we randomly select 1∼30 labeled samples per class as the training data, and for the last two datasets we randomly select 5∼40 labeled samples per class as the training set; the rest are used as the test set. Figure 2 plots the accuracy against the number of labeled samples, showing the average results over 100 runs of the proposed SSPSO compared with PSO, NN, and SVM on the four datasets. The weight factor $\beta$ in the fitness function of SSPSO (in (6)) is set to 0.5. From Figure 2, it can be observed that SSPSO obtains favorable classification results on the four datasets, which indicates that SSPSO is feasible for the classification problem. Among the four datasets, long1 is the easiest to classify: all four methods reach 100% classification accuracy on it when the number of labeled samples per class exceeds 10. But when labeled instances are few, for example, when only 3 instances per class are labeled, PSO, NN, and SVM cannot classify all the test data correctly, while SSPSO still obtains 100% classification accuracy. In Figure 2(b), the performance difference among SSPSO, NN, and SVM is not noticeable once the number of labeled samples per class reaches 15, but when the number of labeled instances is small, for example, fewer than 10, SSPSO obtains clearly better accuracy than the other methods. This is because SSPSO utilizes the information of unlabeled instances, which helps to capture the global structure. For square1 and

4.2. UCI Dataset.
To further investigate the effectiveness of SSPSO for classification, we also conduct the experiments on six real-life datasets with different numbers of attributes and classes from the UCI machine learning repository [33].The description of the datasets used in experiments is given in Table 3.
For datasets with 2 classes, we randomly select 1∼15 labeled samples per class as the training data; for datasets with 3 classes, we randomly select 1∼10 labeled samples per class as the training data. In both cases the rest are used as the test set. The results are averaged over 100 runs. The weight factor $\beta$ in the fitness function of SSPSO (in (6)) is set to 0.5. Figure 3 shows the classification accuracy with different numbers of training samples on the 6 datasets.
From Figures 3(a), 3(b), 3(c), and 3(e), it can be observed that the proposed SSPSO method outperforms the other three methods on the Heart, Wine, Thyroid, and SPECT datasets, especially when the number of labeled samples per class is small. This is because SSPSO uses the information of the available unlabeled data, which benefits the classification. As the number of labeled samples increases, the advantage becomes smaller. From Figure 3(d), it is seen that SSPSO obtains accuracy comparable with that of the other three methods. In Figure 3(f), SSPSO is slightly better than SVM and much better than the PSO and NN methods. Therefore, it can be concluded that SSPSO works well for some real-life classification tasks, especially when labeled samples are highly limited.
From an evolutionary point of view, Figure 4 reports the behavior of a typical run of SSPSO in terms of the best individual fitness and the average fitness of the population as a function of the number of iterations. The run is carried out on the Thyroid dataset. As can be seen, SSPSO shows two phases. In the first phase, of about 50 iterations, the fitness value decreases sharply, starting from 0.6712 for the best and 0.9192 for the average and reaching about 0.2305 for the best and 0.2356 for the average. The second phase follows, lasting about 50 iterations, in which both the best and the average fitness values decrease slowly and move closer and closer together until they reach 0.2231 and 0.2247, respectively. After that, the average and the best fitness values become more and more similar, and finally both reach 0.2230.

4.3. USPS Digital Recognition.
We also conduct experiments on the USPS handwritten digits dataset to test the performance of SSPSO. This dataset consists of 9298 samples in 10 classes, and each sample has a size of 16×16 pixels. Firstly, we apply principal component analysis to the dataset for feature extraction and select the first 10 principal components as the new features. We consider four subsets of the dataset in the experiment: the images of digits 0 and 8, with 2261 examples in total; the images of digits 3, 5, and 8, with 2248 examples in total; the images of digits 3, 8, and 9, with 2429 examples in total; and the images of digits 1, 2, 3, and 4, with 3874 examples in total. We randomly select 1∼10 samples per class as the training data and randomly select 200 unlabeled samples to construct the unlabeled sample set $X_U$, which is used for semisupervised learning. The weight factor $\beta$ in the fitness function of SSPSO (in (6)) is set to 0.7.
The recognition results averaged over 100 independent trials are summarized in Figure 5, where the horizontal axis represents the number of randomly labeled digit images per class in the subset and the vertical axis represents the classification accuracy. Figure 5(a) shows that when the number of labeled samples per class is below 14, SSPSO obtains performance comparable with SVM and KNN and better than PSO. In particular, SSPSO outperforms the other methods when the labeled samples are few. For the USPS subset of digits 3, 5, and 8 and the subset of digits 3, 8, and 9, shown in Figures 5(b) and 5(c), respectively, one can clearly see that SSPSO outperforms SVM and is much better than the KNN and PSO methods when the number of labeled samples is small. In Figure 5(d), SSPSO still works better than the other methods, but its advantage over them decreases as the number of labeled samples increases.

4.4. The Sensitivity Analysis of the Number of Unlabeled Samples. In this section, we examine the effect of the number of unlabeled samples on the classification accuracy. This experiment is carried out on the subset of the USPS dataset containing the images of digits 3, 5, and 8. We vary the size of the unlabeled set $X_U$ over 10, 50, 100, 200, 400, and 600. Figure 6 illustrates the classification accuracy as a function of the size of the unlabeled set and the number of labeled samples. From Figure 6, one can see that the number of unlabeled samples affects the accuracy slightly when the number of labeled samples is small. The plot with 10 unlabeled samples shows much lower accuracy than the other plots, which indicates that the number of unlabeled samples used should not be too small. As the size of $X_U$ increases, SSPSO obtains better classification accuracy because it can capture the real structure of the whole dataset more precisely with more unlabeled samples, but the gaps between the plots of SSPSO with different sizes of $X_U$ become very small. The experiment shows that, once the unlabeled set $X_U$ has reached a certain size, adding more unlabeled data may increase the classification accuracy of SSPSO slightly but also brings a higher computational cost. Therefore, 100 to 200 unlabeled samples are appropriate for SSPSO in the experiments.
4.5. The Sensitivity Analysis of the Parameter $\beta$. The weight factor $\beta$ in the fitness function is an important parameter in our proposed SSPSO, which controls the contributions of the information obtained from the labeled and unlabeled samples to the classification. In this section, we analyze the sensitivity of SSPSO to $\beta$. The experiments are conducted on two UCI datasets, that is, Thyroid and Heart. We randomly select 5 samples per class to form the labeled dataset, and the rest are used for testing. The mean results over 100 randomly selected training datasets with different values of $\beta$ are shown in Figure 7.
From Figures 7(a) and 7(b), it can be observed that SSPSO is not better than the NN and SVM methods for every value of $\beta$. When $\beta$ is small, SSPSO may perform poorly; that is, its accuracy is much lower than that of NN and SVM. In this case the effect of the labeled samples is weakened while the effect of the unlabeled samples is strengthened, so the method behaves much more like unsupervised learning. As $\beta$ increases, the accuracy rises sharply. After $\beta$ reaches 0.4 for the Thyroid dataset and 0.5 for the Heart dataset, the performance remains stable or even decreases slightly. To balance the effects of the labeled instances and the available unlabeled instances, $\beta$ is set to 0.5 in our experiments.

Conclusions
In this paper, a semisupervised PSO method for classification has been proposed. PSO is used to find the centroids of the classes. To take advantage of the large amount of unlabeled instances, a semisupervised classification method is proposed based on the assumption that nearby instances in the feature space should have the same labels. Since both the discriminative information provided by the labeled samples and the global distribution information provided by the large number of unlabeled samples are used to find a collection of centroids, SSPSO obtains better performance than the traditional PSO classification method. In the experiments, four artificial datasets, six real-life datasets from the UCI machine learning repository, and the USPS handwritten dataset are used to evaluate the effectiveness of the method. The experimental results demonstrate that the proposed SSPSO method performs well and obtains higher accuracy than the traditional PSO classification method, the NN method, and SVM when only a few labeled samples are available.

Figure 6: Recognition of digits 3, 5, and 8 on the USPS dataset by SSPSO with different numbers of unlabeled samples.

Figure 7: Classification accuracy as a function of $\beta$ in SSPSO on (a) Thyroid and (b) Heart.

Table 1: Encoding of a set of centroids in a particle for PSO.

Table 2: Artificial datasets used in experiments.

Table 3: UCI datasets used in experiments.