swarm for attribute selection in Bayesian classification: an application to protein function

The discrete particle swarm optimization (DPSO) algorithm is an optimization technique which belongs to the fertile paradigm of Swarm Intelligence. Designed for the task of attribute selection, the DPSO deals with discrete variables in a straightforward manner. This work extends the DPSO algorithm in two ways. First, it enables the DPSO to select attributes for a Bayesian network algorithm, which is more sophisticated than the Naive Bayes classifier previously used by the original DPSO algorithm. Second, it applies the DPSO to a set of challenging protein functional classification data, involving a large number of classes to be predicted. The work then compares the performance of the DPSO algorithm against the performance of a standard Binary PSO algorithm on the task of selecting attributes on those data sets. The criteria used for this comparison are (1) maximizing predictive accuracy, and (2) finding the smallest subset of attributes.


INTRODUCTION
Most of the particle swarm algorithms present in the literature deal only with continuous variables [1][2][3]. This is a significant limitation, because many optimization problems are set in a search space featuring discrete variables. Typical examples include problems which require the ordering or arranging of discrete variables, such as scheduling or routing problems [4]. Therefore, the design of particle swarm algorithms that deal directly with discrete variables is pertinent to this field of study.
The work in [5] proposed a discrete particle swarm optimization (PSO) algorithm for attribute selection in Data Mining. Hereafter, this algorithm will be referred to as the discrete particle swarm optimization (DPSO) algorithm. The DPSO deals directly with discrete variables, and its population of candidate solutions contains particles of different sizes; each particle, however, is forced to carry a constant number of attributes across iterations. The DPSO algorithm interprets the concept of velocity, used in traditional PSO, as "probability": it renders velocity as a proportional likelihood and uses this information to sample new particle positions. The motivation behind the DPSO algorithm is indeed to introduce a probability-like approach to particle swarm.
Although specifically designed for the task of attribute selection, the DPSO is not limited to this kind of application. By performing a few modifications, one can apply this algorithm to many other discrete optimization problems, such as facility location problems [6].
Many data mining applications involve the task of building a model for predictive classification. The goal of such a model is to classify examples (records or data instances) into classes or categories of the same type. Noise or unnecessary attributes may reduce the accuracy and reliability of a classification or prediction model. Unnecessary attributes also increase the costs of building and running a model, particularly on large data sets. Before performing classification, it is therefore important to select an appropriate subset of "good" attributes. Attribute selection tries to simplify a data set by reducing its dimensionality and identifying relevant underlying attributes without sacrificing predictive accuracy. As a result, it reduces redundancy in the information provided by the attributes used for prediction. For a more detailed review of the attribute selection task using genetic algorithms, see [7].
The main difference between the DPSO and other traditional PSO algorithms is that the particles in the DPSO do not represent points inside an n-dimensional Euclidean space (continuous case) or lattice (binary case) as in the standard PSO algorithms [8]. Instead, they represent a combination of selected attributes. In previous work, the DPSO was used to select attributes for a Naive Bayes (NB) classifier. The resulting NB classifier was then used to predict postsynaptic function in proteins.
The study presented here extends previous work reported in [5,9] in two ways. First, it enables the DPSO to select attributes for a Bayesian network algorithm, which is more sophisticated than the Naive Bayes algorithm previously used. Second, it increases the number of data sets used to evaluate the PSO from 1 to 6. All 6 functional classification data sets used have a much greater number of classes to be predicted, in contrast with the postsynaptic data set used in [5], which had just two classes to be predicted.
The work is organized as follows. Section 2 briefly addresses Bayesian networks and the Naive Bayes classifier. Section 3 briefly discusses PSO algorithms. Section 4 describes the standard binary PSO algorithm and Section 5 the DPSO algorithm. Section 6 describes the G-protein-coupled receptors (GPCRs) and Enzyme data sets used in the computational experiments. Section 7 reports computational experiments and includes a discussion of the results obtained. Section 8 presents conclusions and points out future research directions.

BAYESIAN NETWORKS AND NAIVE BAYES
The Naive Bayes (NB) classifier uses a probabilistic approach to assign each record of the data set to a possible class. In this work, the NB classifier assigns a protein of a data set of proteins to a possible class. A Naive Bayes classifier assumes that all attributes are conditionally independent of one another given the class [10].
A Bayesian network (BN), by contrast, detects probabilistic relationships among these attributes and uses this information to aid the attribute selection process.
Bayesian networks are graphical representations of a probability distribution over a set of variables of a given problem domain [11,12]. This graphical representation is a directed acyclic graph in which nodes represent the variables of the problem and arcs represent direct probabilistic dependencies among the nodes; the absence of an arc encodes a conditional independence. A directed acyclic graph G is an ordered pair G = (V, E), where V is a set whose elements are called vertices or nodes and E is a set whose elements are called directed edges, arcs, or arrows. The graph G contains no directed cycles: for any vertex v ∈ V, there is no directed path that starts and ends on v.
An example of a Bayesian network is as follows. (This is a modified version of the so-called "Asia" problem [13], given in [2.5.3].) Suppose that a doctor is treating a patient who has been suffering from shortness of breath, called dyspnoea. The doctor knows that diseases such as tuberculosis, bronchitis, and lung cancer are possible causes for that. The doctor also knows that other relevant information includes whether the patient is a smoker (which increases the chances of lung cancer and bronchitis) and what sort of air pollution the patient has been exposed to. A positive x-ray would indi- Table 1. Figure 1 shows a Bayesian network representing this problem. For applications of Bayesian networks on evolutionary algorithms and optimization problems, see [14,15].
More formally, let $X = \{X_1, X_2, \ldots, X_n\}$ be a multivariate random variable whose components $X_i$ are also random variables. A corresponding lower-case letter $x_i$ denotes an assignment of state or value to the random variable $X_i$. $\mathrm{Parents}(X_i)$ represents the set of nodes (variables or attributes in this work) that have a directed edge pointing to $X_i$. Let us consider a BN containing n nodes, $X_1$ to $X_n$, taken in that order. A particular value of $X = \{X_1, X_2, \ldots, X_n\}$ in the joint probability distribution is represented by

$$p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n), \tag{1}$$

or more compactly, $p(x_1, x_2, \ldots, x_n)$. The chain rule of probability theory allows the factorization of joint probabilities, therefore

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_1, \ldots, x_{n-1}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}). \tag{2}$$

As the structure of a BN implies that the value of a particular node is conditional only on the values of its parent nodes, (2) may be reduced to

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{Parents}(X_i)). \tag{3}$$

Learning the structure of a BN is an NP-hard problem [16,17]. Many algorithms developed to this end use a scoring metric and a search procedure. The scoring metric evaluates the goodness-of-fit of a structure to the data. The search procedure generates alternative structures and selects the best one based on the scoring metric. To reduce the search space of networks, only candidate networks in which each node has at most k inward arcs (parents) are considered, where k is a parameter determined by the user. In the present work, k is set to 20 to avoid overly complex models.
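The reduced factorization, in which each node is conditioned only on its parents, can be illustrated with a small worked example. The sketch below uses a hypothetical three-node chain (Smoker → Cancer → XRay) with made-up probabilities; the node names, the numbers, and the helper names `joint_probability` and `total_probability` are illustrative assumptions, not values or code from this work.

```python
from itertools import product

# Hypothetical toy network: Smoker -> Cancer -> XRay (all variables Boolean).
parents = {"Smoker": [], "Cancer": ["Smoker"], "XRay": ["Cancer"]}

# cpt[node][parent values] = P(node = True | parents); numbers are invented.
cpt = {
    "Smoker": {(): 0.3},
    "Cancer": {(True,): 0.05, (False,): 0.01},
    "XRay": {(True,): 0.9, (False,): 0.2},
}

def joint_probability(assignment):
    """Product over all nodes of p(x_i | Parents(X_i))."""
    prob = 1.0
    for node, node_parents in parents.items():
        key = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][key]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

def total_probability():
    """Sum of the joint over every assignment; should equal 1."""
    names = list(parents)
    return sum(
        joint_probability(dict(zip(names, values)))
        for values in product([True, False], repeat=len(names))
    )
```

For the assignment in which all three variables are true, the product is 0.3 × 0.05 × 0.9 = 0.0135, and summing the joint over all eight assignments gives 1, as a valid factorization must.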
A greedy search algorithm is used to generate alternative structures for the BN. Starting with an empty network, the greedy search algorithm adds into the network the edge that most increases the score of the resulting network. The search stops when no other edge addition improves the score of the network. Algorithm 1 shows the pseudocode of this generic greedy search algorithm.

Algorithm 1 (excerpt):
    ...
    if i ≠ j then
        if there is no edge between the nodes i and j in G then
            Modify G': add an edge between the nodes i and j in G' such that i is a parent of j: (i → j)
            if the resulting G' is a DAG then
                if (Score(G') > BEST) then
                    BEST = Score(G')
                    ...

To evaluate the "goodness-of-fit" (score) of a network structure to the data, an unconventional scoring metric, specific to the target classification task, is adopted. The entire data set is divided into mutually exclusive training and test sets (the standard methodology for evaluating classifiers, see Section 7.1). The training set is further divided into two mutually exclusive parts. The first part is used to compute the probabilities for the Bayesian network. The second part is used as the validation set. During the search for the best possible network structure, only the validation set is used to compute predictive accuracy. The score of a candidate network is given by the classification accuracy in the validation set. The graphical model of the network that shows the highest predictive accuracy on the validation set during the entire PSO run is then used to compute the predictive accuracy on the test set.
Once the best network structure is selected, at the end of the PSO run, the validation set and the other part of the training set are merged, and this merged data (that is, the entire original training set) is used to compute the probabilities for the selected Bayesian network. The predictive accuracy, reported as the final result, is then computed on the previously untouched test set. This process is discussed again, in more detail, in Section 7.1. A similar process is adopted for the computation of the predictive accuracy using the Naive Bayes classifier.
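The greedy edge-addition search described in this section can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names `is_dag` and `greedy_search` are hypothetical, the acyclicity check uses Kahn's algorithm as one possible choice, and the `score` callback stands in for the validation-set accuracy of a candidate network.

```python
from itertools import permutations

def is_dag(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no directed cycle."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    frontier = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while frontier:
        v = frontier.pop()
        visited += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return visited == len(nodes)

def greedy_search(nodes, score, max_parents):
    """Start from an empty network; repeatedly add the single edge that most
    increases the score, and stop when no addition improves it."""
    edges = set()
    best = score(edges)
    while True:
        best_edge = None
        for i, j in permutations(nodes, 2):          # all ordered pairs, i != j
            if (i, j) in edges or (j, i) in edges:   # already connected
                continue
            if sum(1 for _, c in edges if c == j) >= max_parents:
                continue                             # j already has k parents
            candidate = edges | {(i, j)}             # i becomes a parent of j
            if not is_dag(nodes, candidate):
                continue
            s = score(candidate)
            if s > best:
                best, best_edge = s, (i, j)
        if best_edge is None:
            return edges, best
        edges.add(best_edge)
```

With a toy score that rewards a known target structure and penalizes every other edge, the search recovers the target edges one at a time and then stops, because any further addition lowers the score.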

A BRIEF INTRODUCTION TO PARTICLE SWARM OPTIMIZATION
Particle swarm optimization (PSO) comprises a set of search techniques, inspired by the behavior of natural swarms, for solving optimization problems [8]. In PSO, a potential solution to a problem is represented by a particle, Y(i) = (Y(i,1), Y(i,2), ..., Y(i,n)), in an n-dimensional search space. Y(i) represents the ith particle in the population and n represents the number of variables of the problem. The coordinates Y(i,d) of these particles have a rate of change (velocity) V(i,d), d = 1, 2, ..., n. Note that the double subscript notation "(i,d)", as in Y(i,d), denotes the dth component of the ith particle in the swarm, Y(i); the same rationale is used for V(i,d), and so forth. Every particle keeps a record of the best position that it has ever visited. Such a record is called the particle's previous best position and denoted by B(i). The global best position attained by any particle so far is also recorded and stored in a particle denoted by G. An iteration comprises evaluation of each particle, then stochastic adjustment of V(i,d) in the direction of particle Y(i)'s previous best position and the previous best position of any particle in the neighborhood [18]. There is much variety in the neighborhood topology used in PSO, but quite often gbest or lbest topologies are used. In the gbest topology, the neighborhood of a particle consists of all the other particles in the swarm, and therefore all the particles will have the same global best neighbor, which is the best particle in the entire population. In the lbest topology, each particle has just a "local" set of neighbors, typically much fewer than the number of particles in the swarm, and so different particles can have different best local neighbors.
For a review of the neighborhood topologies used in PSO the reader is referred to [8,19].
As a whole, the set of rules that govern PSO are evaluate, compare, and imitate. The evaluation phase measures how well each particle (candidate solution) solves the problem at hand. The comparison phase identifies the best particles. The imitation phase produces new particle positions based on some of the best particles previously found. These three phases are repeated until a given stopping criterion is met. The objective is to find the particle that best solves the target problem.
Important concepts in PSO are velocity and neighborhood topology. Each particle, Y(i), is associated with a velocity vector. This velocity vector is updated at every generation. The updated velocity vector is then used to generate a new particle position Y(i). The neighborhood topology defines how other particles in the swarm, such as B(i) and G, interact with Y(i) to modify its respective velocity vector and, consequently, its position as well.

THE STANDARD BINARY PSO ALGORITHM
Potential solutions to the target problem are encoded as fixed-size binary strings; that is, Y(i) = (Y(i,1), Y(i,2), ..., Y(i,n)), where Y(i,d) ∈ {0, 1}, i = 1, 2, ..., N, and d = 1, 2, ..., n. Given a list of attributes A = (A_1, A_2, ..., A_n), the first element of Y(i), from the left to the right hand side, corresponds to the first attribute "A_1," the second to the second attribute "A_2," and so forth. A value of 0 on the site associated with an attribute indicates that the respective attribute is not selected. A value of 1 indicates that it is selected.

The initial population for the standard binary PSO algorithm
For the initial population, N binary strings of size n are randomly generated. Each particle Y(i) is generated independently. For every position Y(i,d) in Y(i), a uniform random number ϕ is drawn on the interval (0, 1). If ϕ < 0.5, then Y(i,d) = 0; otherwise, Y(i,d) = 1.
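A minimal sketch of this initialization, assuming that a position is set to 0 when ϕ < 0.5 and to 1 otherwise. The function name `init_binary_population` and the use of a seeded `random.Random` instance are illustrative choices, not part of the original algorithm description.

```python
import random

def init_binary_population(num_particles, num_attributes, rng):
    """Generate N independent binary strings of size n, one bit per attribute."""
    population = []
    for _ in range(num_particles):
        # One uniform draw per position: 0 if phi < 0.5, else 1.
        particle = [0 if rng.random() < 0.5 else 1 for _ in range(num_attributes)]
        population.append(particle)
    return population
```

Passing an explicit random generator makes a run repeatable, which also makes it straightforward to hand the same initial population to two algorithms that are being compared.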

Updating the records for the standard binary PSO algorithm
At the beginning, the previous best position of Y(i), denoted by B(i), is empty. Therefore, once the initial particle Y(i) is generated, B(i) is set to B(i) = Y(i). After that, every time that Y(i) is updated, B(i) is updated if f(Y(i)) is better than f(B(i)); otherwise, B(i) remains as it is. Here f(·) represents the fitness function used to measure the quality of the candidate solutions. A similar process is used to update the global best position G. Once all the B(i) have been determined, G is set to the fittest B(i) previously computed. After that, G is updated if the fittest B(i) in the swarm is better than G and, in that case, f(G) is set to f(G) = f(fittest B(i)). Otherwise, G remains as it is.
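The record-keeping for B(i) and G can be sketched as below, assuming that higher fitness is better and that an empty record is represented by `None`. The function name `update_records` and its argument layout are hypothetical.

```python
def update_records(positions, personal_best, global_best, fitness):
    """Update each B(i) and then G; returns the updated (B, G) pair.

    positions      -- list of current particles Y(i)
    personal_best  -- list of B(i), with None for an empty record
    global_best    -- G, or None before the first update
    fitness        -- callable f(.) where larger values are better (assumption)
    """
    for i, y in enumerate(positions):
        if personal_best[i] is None or fitness(y) > fitness(personal_best[i]):
            personal_best[i] = list(y)
    fittest = max(personal_best, key=fitness)
    if global_best is None or fitness(fittest) > fitness(global_best):
        global_best = list(fittest)
    return personal_best, global_best
```

In a wrapper setting each fitness call trains and evaluates a classifier, so a practical version would cache f(B(i)) and f(G) rather than recompute them as this sketch does.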

Updating the velocities for the standard binary PSO algorithm
Every particle Y(i) is associated with a unique vector of velocities V(i) = (V(i,1), V(i,2), ..., V(i,n)). Note that, for simplicity, this work uses row vectors rather than column vectors. The elements V(i,d) of V(i) are updated as

$$V_{(i,d)} = w\,V_{(i,d)} + \phi_1\,\bigl(B_{(i,d)} - Y_{(i,d)}\bigr) + \phi_2\,\bigl(G_{(d)} - Y_{(i,d)}\bigr), \tag{4}$$

where w (0 < w < 1), called the inertia weight, is a constant value chosen by the user and d = 1, 2, ..., n. Equation (4) is a standard equation used in PSO algorithms to update the velocities [20,21]. The factors ϕ_1 and ϕ_2 are uniform random numbers independently generated in the interval (0, 1).
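A sketch of the velocity update, assuming the standard inertia-weight form V(i,d) = w·V(i,d) + ϕ_1·(B(i,d) − Y(i,d)) + ϕ_2·(G(d) − Y(i,d)) for Equation (4), with ϕ_1 and ϕ_2 redrawn for every dimension. Both the exact form and the per-dimension redraw are assumptions; the function name `update_velocity` is hypothetical.

```python
import random

def update_velocity(velocity, position, personal_best, global_best, w, rng):
    """Return the new velocity vector V(i) for one particle."""
    new_velocity = []
    for d in range(len(velocity)):
        phi1, phi2 = rng.random(), rng.random()   # independent uniform draws
        new_velocity.append(
            w * velocity[d]
            + phi1 * (personal_best[d] - position[d])   # pull toward B(i)
            + phi2 * (global_best[d] - position[d])     # pull toward G
        )
    return new_velocity
```

When a particle already coincides with both B(i) and G, the two attraction terms vanish and the velocity simply decays by the factor w.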

Sampling new particle positions for the standard binary PSO algorithm
For each particle Y(i) and each dimension d, the value of the new coordinate Y(i,d) ∈ Y(i) can be either 0 or 1. The decision of whether Y(i,d) will be 0 or 1 is based on its respective velocity V(i,d) ∈ V(i) and is given by the equation

$$Y_{(i,d)} = \begin{cases} 1 & \text{if } rand < S\bigl(V_{(i,d)}\bigr), \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$

where 0 ≤ rand ≤ 1 is a uniform random number and

$$S\bigl(V_{(i,d)}\bigr) = \frac{1}{1 + \exp\bigl(-V_{(i,d)}\bigr)}$$

is the sigmoid function. Equation (5) is a standard equation used to sample new particle positions in the binary PSO algorithm [8]. Note that the lower the value of V(i,d) is, the more likely the value of Y(i,d) will be 0. By contrast, the higher the value of V(i,d) is, the more likely the value of Y(i,d) will be 1. The motivation to use the sigmoid function is to map the values of V(i,d), for all i, d, into the interval (0, 1), which is equivalent to the interval of a probability function.
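The sampling rule can be sketched as follows, assuming the usual binary PSO form in which a position is set to 1 when a uniform random number falls below the sigmoid of the velocity. The function names `sigmoid` and `sample_position` are illustrative.

```python
import math
import random

def sigmoid(v):
    """Map a velocity onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def sample_position(velocity, rng):
    """Draw each bit Y(i,d) with P(Y(i,d) = 1) = sigmoid(V(i,d))."""
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocity]
```

A velocity of 0 gives an even chance of selecting the attribute, while large positive or negative velocities make the outcome almost deterministic.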

THE DISCRETE PSO (DPSO) ALGORITHM
The DPSO algorithm deals directly with discrete variables (attributes) and, unlike the binary PSO algorithm, its population of candidate solutions contains particles of different sizes. Potential solutions to the optimization problem at hand are represented by a swarm of particles. There are N particles in a swarm. The size of each particle may vary from 1 to n, where n is the number of variables (attributes in this work) of the problem. In this context, the size of a particle refers to the number of different attribute indices that the particle is able to represent at a single time.
For example, given i, j ∈ {1, 2, ..., N}, in DPSO it may occur that a particle Z(i) in the population has size 6 (Z(i) = {*, *, *, *, *, *}), whereas another particle Z(j) in the same population has size 2 (Z(j) = {*, *}), and so forth, for any other sizes between 1 and n. Each particle Z(i) keeps a record of the best position it has ever attained. This information is stored in a separate vector labeled as B(i). The swarm also keeps a record of the global best position ever attained by any particle in the swarm. This information is also stored in a separate vector labeled G. Note that G is equal to the best B(i) present in the swarm.

Encoding of the particles for the DPSO algorithm
Each attribute is represented by a unique positive integer number, or index. These numbers, indices, vary from 1 to n. A particle is a subset of nonordered indices without repetition, for example, Z(k) = {2, 4, 18, 1}, k ∈ {1, 2, . . . , N}.

The initial population for the DPSO algorithm
The original work on DPSO [5] used a randomly generated initial population for the standard PSO algorithm and a new randomly generated initial population for the DPSO algorithm, when comparing these algorithms' performances in a given data set. However, the way in which those populations were initialized generated a doubt about a possible advantage of one initial population over the other-which would bias the performance of one algorithm over the other. In this work, to eliminate this possible bias, the initial population used by the DPSO is always identical to the initial population used by the binary PSO. They differ only in the way in which solutions are represented. The conversion of every particle in the initial population of solutions of the binary PSO to the Discrete PSO initial population is as follows.
The index of every attribute that has value 1 is copied to the new solution (particle) of the DPSO initial population. For instance, an initial candidate solution for the binary PSO algorithm equal to Y(k) = (1, 0, 1, 1, 0) is converted into Z(k) = {1, 3, 4} for the DPSO algorithm, because attributes A_1, A_3, and A_4 are set to 1 (are present) in Y(k), k ∈ {1, 2, ..., N}. Note that the same initial population of solutions is used for both algorithms, binary PSO and DPSO, to make the comparison between the performances of these algorithms as free from initialization bias as possible.
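The conversion from the binary representation to the DPSO representation amounts to collecting the 1-based positions of the bits that are set. A short sketch, with the hypothetical function name `binary_to_dpso`:

```python
def binary_to_dpso(binary_particle):
    """Return the set of 1-based attribute indices whose bit equals 1."""
    return {d for d, bit in enumerate(binary_particle, start=1) if bit == 1}
```

Applied to Y(k) = (1, 0, 1, 1, 0), this yields {1, 3, 4}, matching the example in the text. A set is used because a DPSO particle is an unordered collection of indices without repetition.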
In the DPSO algorithm, for simplicity, once the size of a particle is determined at the initialization, the particle will keep that same size during the entire execution of the algorithm. For example, a particle Z(k) = {2, 3, 4, 5}, which has been initialized with 4 indices, will always carry exactly 4 indices, Z(k) = {*, *, *, *}. The values of those 4 indices, however, are likely to change every time that the particle is updated.

Journal of Artificial Evolution and Applications

Velocities = proportional likelihoods
The DPSO algorithm does not use a vector of velocities as the standard PSO algorithm does. It works with proportional likelihoods instead. Arguably, the notion of proportional likelihood used in the DPSO algorithm and the notion of velocity used in the standard PSO are somewhat similar. DPSO uses M(i) to represent an array of proportional likelihoods and M(i, d) to represent one of M(i)'s components.
Every particle in DPSO is associated with a 2-by-n array of proportional likelihoods, where 2 is the number of rows in this array and n is the number of columns. Note that the number of columns in M(i) is equal to the number of variables of the problem, n.
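This is an example of a generic proportional likelihood array:

$$M(i) = \begin{pmatrix} \text{proportional-likelihood-row} \\ \text{attribute-index-row} \end{pmatrix}.$$

Each of the n elements in the first row of M(i) represents the proportional likelihood that an attribute be selected. The second row of M(i) holds the indices of the attributes associated with the respective proportional likelihoods.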
After the initial population of particles is generated, this array is always updated before a new configuration for the particle associated to it is generated.
Note that index 1 is absent in Z(i), B(i), and G. Therefore, the proportional likelihood of attribute 1 in M(i) remains as it is. In this work, the values used for α, β, and γ were α = 0.10, β = 0.12, and γ = 0.14. These values were empirically determined in preliminary experiments, but this work makes no claim that these are optimal values. Parameter optimization is a topic for future research. As a whole, these values make the contribution of B(i) (β = 0.12) to the updating of M(i) a bit stronger than the contribution of Z(i) (α = 0.10), and the contribution of G (γ = 0.14) even stronger.
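One way the update of the proportional likelihoods could work is sketched below. It assumes, and this is an assumption rather than something stated in the text here, that the update is additive: α is added to the likelihood of every index present in Z(i), β to every index in B(i), and γ to every index in G, while indices absent from all three are left unchanged. The starting value of 1.0 in the example and the function name `update_likelihoods` are also assumptions.

```python
ALPHA, BETA, GAMMA = 0.10, 0.12, 0.14   # values reported in this work

def update_likelihoods(likelihoods, current, personal_best, global_best):
    """Return an updated copy of the first row of M(i).

    likelihoods -- dict mapping attribute index -> proportional likelihood
    current, personal_best, global_best -- index sets Z(i), B(i), and G
    """
    updated = dict(likelihoods)
    for indices, increment in (
        (current, ALPHA),
        (personal_best, BETA),
        (global_best, GAMMA),
    ):
        for d in indices:
            updated[d] += increment   # additive update (assumed)
    return updated
```

Under these assumptions, with n = 5, all likelihoods starting at 1.0, Z(i) = {2, 3, 4}, B(i) = {2, 5}, and G = {3, 5}, attribute 1 stays at 1.0 while attribute 5, present in both B(i) and G, rises to 1.26.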
The updated array M(i) replaces the old one and is used to generate a new configuration for the particle associated with it, as follows.

Sampling new particle positions for the DPSO algorithm
The proportional likelihood array M(i) is then used to sample a new instance of particle Z(i), the particle associated with M(i). First, every proportional likelihood M(i,d) in the first row of M(i) is multiplied by a uniform random number; for an array with n = 5 columns,

$$M(i,d) \leftarrow \phi_d\, M(i,d), \quad d = 1, \ldots, 5,$$

where ϕ_1, ..., ϕ_5 are uniform random numbers independently drawn on the interval (0, 1). After the multiplication, the array M(i) is ranked in decreasing order of the values in its first row. The next operation is to select the indices that will compose the new particle position. After ranking the array M(i), the first s_i indices (in the second row of M(i)), from left to right, are selected to compose the new particle position. Note that s_i represents the size of the particle Z(i), the particle associated with the ranked array M(i).

Once the algorithms have been explained, the next section briefly introduces the particular data sets (case studies) used to test the algorithms.
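The sampling step described in this section (multiply each proportional likelihood by a uniform random number, rank, and keep the first s_i indices) can be sketched as below. The sketch assumes that ranking is in decreasing order, so that the attributes with the largest randomized likelihoods are selected; the function name `sample_dpso_position` is hypothetical.

```python
import random

def sample_dpso_position(likelihoods, size, rng):
    """Sample a new particle Z(i) of the given size from the first row of M(i).

    likelihoods -- dict mapping attribute index -> proportional likelihood
    size        -- s_i, the fixed number of indices the particle carries
    """
    # Multiply every likelihood by an independent uniform random number.
    noisy = {d: rng.random() * m for d, m in likelihoods.items()}
    # Rank attribute indices by randomized likelihood, largest first (assumed).
    ranked = sorted(noisy, key=noisy.get, reverse=True)
    return set(ranked[:size])
```

Because the particle size is passed in and never changed, the particle keeps the same number of indices across iterations, while the random factors let attributes with lower likelihoods still be selected occasionally.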

CASE STUDY: THE GPCR AND ENZYME DATA SETS USED IN THE COMPUTATIONAL EXPERIMENTS
The experiments involved 6 data sets comprising two kinds of proteins, namely, G-protein-coupled receptors (GPCRs) and Enzymes. The G-protein-coupled receptors (GPCRs) are a protein superfamily of transmembrane receptors. Their function is to transduce signals that induce a cellular response to the environment. GPCRs are involved in many types of stimulus-response pathways, from intercellular communication to physiological senses. GPCRs are of much interest to the pharmaceutical industry because these proteins are involved in many pathological conditions; it is estimated that GPCRs are the target of 40% to 50% of modern medical drugs [22]. Enzymes are proteins that accelerate chemical reactions; they participate in many processes in a biological cell. Some enzymes are used in the chemical industry and other industrial applications where extremely specific catalysts are required. In Enzyme Nomenclature, enzymes are assigned and identified by an Enzyme Commission (EC) number. For instance, EC 2.3.4 is an enzyme with class value 2 in the first hierarchical class level, class value 3 in the second class level, and so forth. This work uses the GPCRs and EC data sets described in Table 2.
These data sets were derived from the data sets used in [23,24]. Note that both the GPCR and the Enzyme data sets have hierarchical classes. Each protein in these data sets is assigned one class at the first (top) hierarchical level, corresponding to a broad function, another class at the second level, corresponding to a more specialized function, and another class at the third level, corresponding to an even more specialized function, and so forth. This work copes with these hierarchical classes in a simple way by predicting classes one level at a time, as explained in more detail later.
The data sets used in the experiments involved four kinds of protein signatures (biological "motifs"), namely, PROSITE patterns, PRINTS fingerprints, Pfam signatures, and InterPro entries.

Table 2: GPCR and EC data sets. "Cases" represents the number of proteins in the data set, "Attributes" represents the total number of attributes that describe the proteins in the data set, and "L1", ..., "L4" represent the number of classes at hierarchical class levels 1, ..., 4, respectively.

                                      # Classes at level
Data set         Cases   Attributes   L1   L2   L3   L4
GPCR-PRINTS        330          281    8   36   52   44
GPCR-PROSITE       190          127    8   32   32    -
GPCR-InterPro      580          448   12   43   67

PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families (a protein consists of a sequence of amino acids). PROSITE patterns are essentially regular expressions describing small regions of a protein sequence which present a high sequence similarity when compared to other proteins in the same functional family.
In the data sets, the absence of a given PROSITE pattern is indicated by a value of 0 for the attribute corresponding to that PROSITE pattern. The presence of it is indicated by a value of 1 for that same attribute.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family. In the PRINTS data sets, a fingerprint corresponds to an attribute. The presence of a fingerprint is indicated by a value of 1 for that same attribute; the absence by a 0.
Pfam signatures are produced by hidden Markov models, and InterPro integrates a number of protein signature databases into a single database. In this work, Pfam and InterPro entries also correspond to binary attributes indicating whether or not a protein matches those entries, using the same codification described for the PROSITE patterns and fingerprints.
The objective of the binary PSO and DPSO algorithms is to classify each protein into its most suitable functional class level. The classification of the proteins is performed in each class level individually. For instance, given protein Υ, at first, a conventional "flat" classification algorithm assigns a class to Υ at the first class level only. Once Υ has been classified at the

EXPERIMENTS
The quality of a candidate solution (fitness) is evaluated in three different ways: (1) by a baseline algorithm, using all possible attributes; (2) by the binary PSO, using only the attributes selected by this algorithm; and (3) by the discrete PSO (DPSO) algorithm, using only the attributes selected by this algorithm. Each of these algorithms computes the fitness of every given solution using two distinct techniques: (a) using a Naive Bayes classifier; and (b) using a Bayesian network.

Experimental methodology
Note that the computation of the fitness function f(·) for the particles Y(i) (binary PSO algorithm) and Z(i) (DPSO algorithm) follows the description given below. For simplicity, only the process using Y(i) is described, but the same is applicable to Z(i). f(Y(i)) is equal to the predictive accuracy achieved by the Naive Bayes classifier (or the Bayesian network) on each data set, using only the attributes selected in Y(i).
The measurement of f (Y(i)) follows a wrapper approach. The wrapper approach searches for an optimal attribute subset tailored to a particular algorithm, such as the Naive Bayes classifier or Bayesian network. For more information on wrapper and other attribute selection approaches, see [25].
The computational experiments involved a 10-fold crossvalidation method [25]. First, the data set being considered is divided into 10 equally sized folds. The folds are randomly generated but under the following criterion. The proportion of classes in every single fold must be similar to the proportion of classes found in the original data set containing all records. This is known as stratified crossvalidation.
Each of the 10 folds is used once as a test set and the remainder of the data is used as the training set. Out of the 9 folds in the training set, one is reserved to be used as a validation set. The Naive Bayes classifier and the Bayesian network use the remaining 8 folds to compute the probabilities required to classify new examples. Once those probabilities have been computed, the Naive Bayes (NB) classifier and the Bayesian network (BN) classify the examples in the validation set.
The accuracy of this classification on the validation set is the value of the fitness functions f_NB(Y(i)) and f_BN(Y(i)); the same holds for f_NB(Z(i)) and f_BN(Z(i)). When the run of the PSO algorithm is completed, the 9 folds are merged into a full training set. The Naive Bayes classifier and the Bayesian network are then trained again on this full training set (9 merged folds), and the probabilities computed on this final, full training set are used to classify examples in the test set (the 10th fold), which was never accessed during the run of the algorithms.
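The fold bookkeeping described above can be sketched as follows. This is an illustration under assumptions: stratification is done by dealing the shuffled members of each class round-robin into the folds, which is one common way to keep class proportions similar, and the choice of which training fold serves as the validation set is left as a parameter because the text does not specify it. The function names `stratified_folds` and `split_for_iteration` are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, num_folds, rng):
    """Assign example indices to folds so class proportions stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(num_folds)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for idx in members:                      # deal round-robin into folds
            folds[position % num_folds].append(idx)
            position += 1
    return folds

def split_for_iteration(folds, test_fold, validation_fold):
    """Return (build, validation, test) index lists for one CV iteration.

    build      -- the 8 folds used to compute the classifier's probabilities
    validation -- the fold used to compute particle fitness
    test       -- the fold held out until the PSO run has finished
    """
    test = list(folds[test_fold])
    validation = list(folds[validation_fold])
    build = [
        idx
        for k, fold in enumerate(folds)
        if k not in (test_fold, validation_fold)
        for idx in fold
    ]
    return build, validation, test
```

After the PSO run, the build and validation lists would be concatenated to retrain the classifier on the full training set before it is evaluated on the test fold.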
The reasons for having separate validation and test sets are as follows. In the classification task of data mining, by definition, the goal is to measure predictive accuracy (generalization ability) on a test set unseen during training. Hence, the test set cannot be accessed by the PSO, and is reserved just to compute the predictive accuracy associated with the Bayesian classifier constructed with the best set of attributes selected at the end of the PSO run.
The validation set, which is used to compute the fitness of particles during the PSO run, is a part of the original training set that is different from the part used to build the Bayesian classifier. The reason for having these two separate parts of the training set is to avoid overfitting of the classifier to the training data; for overfitting in the context of classification, see [7, pages 17, 18]. In other words, if the same training set that was used to build a Bayesian classifier was also used to measure the fitness (accuracy) of the corresponding particle, there would be no pressure to build classifiers with a good generalization ability on data unseen during training, and a classifier could obtain a high accuracy by simply being overfitted to idiosyncrasies of the training set which are unlikely to generalize well to unseen data. By measuring fitness on a validation set separated from the data used to build the classifier, this is avoided, and a pressure to build classifiers with good generalization ability is introduced in the fitness function.
In each of the 10 iterations of the crossvalidation procedure, the predictive accuracy of the classification is assessed by 3 different methods as follows.
(1) Using all possible original attributes: all possible attributes are used by the Naive Bayes classifier and the Bayesian network; there is no attribute selection.
(2) Standard binary PSO algorithm: only the attributes selected by the best particle found by the binary PSO algorithm are used by the Naive Bayes classifier and the Bayesian network.
(3) DPSO algorithm: only the attributes selected by the best particle found by the DPSO algorithm are used by the Naive Bayes classifier and the Bayesian network.
Since the Naive Bayes and Bayesian network classifiers used in this work are deterministic, only one run of each of these algorithms is performed for the classification using all possible attributes.
For the binary PSO and the DPSO algorithms, 30 independent runs are performed for each iteration of the crossvalidation procedure. The results reported are averaged over these 30 independent runs and over the 10 iterations of the crossvalidation procedure. The population size used for both algorithms (binary PSO and DPSO) is 200, and the search stops after 20 000 fitness evaluations, that is, 100 iterations.
The binary PSO algorithm uses an inertia weight value of 0.8 (i.e., w = 0.8). The choice of the value of this parameter was based on the work presented in [26].
The parameter values used for the DPSO were α = 0.10, β = 0.12, and γ = 0.14. They were chosen on the basis of empirical experiments and are probably not optimal.
The measurement of the predictive accuracy rate of a model should be a reliable estimate of how well that model classifies the test examples (unseen during the training phase) on the target problem.
In Data Mining, typically, the equation

$$\text{standard accuracy rate} = \frac{TP + TN}{TP + FP + FN + TN} \tag{16}$$

is used to assess the accuracy rate of a classifier, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively [25]. However, if the class distribution is highly unbalanced, (16) is an ineffective way of measuring the accuracy rate of a model. For instance, in many problems it is easy to achieve a high value for (16) simply by always predicting the majority class. Therefore, in the experiments reported in this work, a more demanding measurement of the accuracy rate of a classification model is used.
This measurement has been used before in [27]. It is given by the equation

$$\text{predictive accuracy rate} = TPR \times TNR, \tag{17}$$

where TPR = TP/(TP + FN) is the true positive rate and TNR = TN/(TN + FP) is the true negative rate.
Note that if any of the quantities TPR or TNR is zero, the value returned by (17) is also zero.
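The contrast between the two measurements can be illustrated with a minimal sketch. It assumes that (17) is the product TPR × TNR, which is consistent with the remark that (17) is zero whenever either rate is zero, and it treats a single class against all others. The function names and the example counts are hypothetical.

```python
def standard_accuracy(tp, tn, fp, fn):
    """Equation (16): fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def tpr_tnr_accuracy(tp, tn, fp, fn):
    """Assumed form of (17): true positive rate times true negative rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr * tnr


# Hypothetical unbalanced problem: 5 positive and 95 negative examples,
# with a classifier that always predicts the majority (negative) class.
tp, fn = 0, 5
tn, fp = 95, 0
print(standard_accuracy(tp, tn, fp, fn))  # 0.95
print(tpr_tnr_accuracy(tp, tn, fp, fn))   # 0.0
```

The majority-class predictor scores 0.95 under (16) but 0 under the product form, because it never identifies a positive example.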

Discussion
Computational results are reported in Tables 5 and 6. The discussion focuses on the results obtained by the 3 algorithms (binary PSO, DPSO, and baseline algorithm) for attribute selection on the GPCR-PROSITE data set (see Table 5). The results obtained for the other 5 data sets are similar. The results obtained using the Naive Bayes classifier are presented first.

Results obtained using the Naive Bayes classifier approach
To assess the performance of the algorithms, two criteria were considered: (1) maximizing predictive accuracy; and (2) finding the smallest subset of attributes.
The results for the first criterion, accuracy, show that both versions of the PSO algorithm did better than the baseline algorithm using all attributes, in all class levels.
Furthermore, the DPSO algorithm did slightly better than the binary PSO algorithm, also in all class levels. Nevertheless, the difference in predictive accuracy between these algorithms is, in some cases, not statistically significant. Table 3 shows the results of a paired two-tailed t-test for the predictive accuracy of the binary PSO versus the predictive accuracy of the DPSO at a significance level of 0.05. With Naive Bayes as the classifier, the only statistically significant difference in predictive accuracy between the binary PSO and the DPSO is at the third class level. By contrast, with Bayesian networks as the classifier, the difference is statistically significant at all class levels.
The discriminating factor between the performance of these algorithms is instead the second comparison criterion: finding the smallest subset of attributes.
The DPSO not only outperformed the binary PSO in predictive accuracy, but also did so using a smaller subset of attributes in all class levels. Moreover, when it comes to effectively pruning the set of attributes, the difference in performance between the binary PSO and the DPSO is always statistically significant, as Table 4 shows.
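The paired two-tailed t-test behind Tables 3 and 4 can be sketched as follows. This is a minimal illustration using only the standard library. The function names and the paired values are hypothetical and are not results from this work, and the critical value has to be read from a t-table for the appropriate degrees of freedom (for example, 2.262 for 9 degrees of freedom at a significance level of 0.05).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def is_significant(t, critical_value):
    """Two-tailed decision: reject equality if |t| exceeds the critical value."""
    return abs(t) > critical_value


# Hypothetical paired measurements for two algorithms (not real results).
algorithm_a = [0.71, 0.74, 0.69, 0.73]
algorithm_b = [0.68, 0.70, 0.69, 0.70]
t = paired_t_statistic(algorithm_a, algorithm_b)
# 3.182 is the two-tailed critical value for 3 degrees of freedom at 0.05.
print(is_significant(t, critical_value=3.182))
```

The test is paired because both algorithms are evaluated on the same data partitions, so only the per-partition differences enter the statistic.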

Results obtained using the Bayesian network approach
Again, the predictive accuracy attained by both versions of the PSO algorithm surpassed the predictive accuracy obtained by the baseline algorithm in all class levels.
DPSO obtained the best predictive accuracy of all algorithms in all class levels. Regarding the second comparison criterion, finding the smallest subset of attributes, again DPSO always selected the smallest subset of attributes in all hierarchical levels.
The comparison between the two classifiers, Naive Bayes and Bayesian networks, shows that Bayesian networks did a much better job. For all class levels, the predictive accuracy obtained by the 3 approaches (baseline, binary PSO, and DPSO) using Bayesian networks was significantly better than the predictive accuracy obtained using the Naive Bayes classifier. The Bayesian network approach also enabled the two PSO algorithms to do the job using fewer selected attributes than the Naive Bayes approach.
The results emphasize the importance of taking relationships among attributes into account, as Bayesian networks do, when performing attribute selection. If these relationships are ignored, predictive accuracy is adversely affected.
The results also show that for all 6 data sets tested, the DPSO algorithm not only selected the smallest subset of attributes, but also obtained the highest predictive accuracy in every single class level.

CONCLUSIONS
Computational results show that the use of unnecessary attributes tends to derail classifiers and hurt classification accuracy. Using only a small subset of selected attributes, the binary PSO and DPSO algorithms obtained better predictive accuracy than the baseline algorithm using all attributes. Previous work had already shown that the DPSO algorithm outperforms the binary PSO in the task of attribute selection [5], but that work involves only one data set. This current work