The discrete particle swarm optimization (DPSO) algorithm is an optimization technique that belongs to the fertile paradigm of Swarm Intelligence. Designed for the task of attribute selection, the DPSO deals with discrete variables in a straightforward manner. This work extends the DPSO algorithm in two ways. First, it enables the DPSO to select attributes for a Bayesian network algorithm, which is more sophisticated than the Naive Bayes classifier previously used by the original DPSO algorithm. Second, it applies the DPSO to a set of challenging protein functional classification data, involving a large number of classes to be predicted. The work then compares the performance of the DPSO algorithm against the performance of a standard binary PSO algorithm on the task of selecting attributes on those data sets. The criteria used for this comparison are (1) maximizing predictive accuracy and (2) finding the smallest subset of attributes.

Most of the particle swarm algorithms present in the
literature deal only with continuous variables [

The work in [

Although specifically designed for the task of
attribute selection, the DPSO is not limited to this kind of application. By
performing a few modifications, one can apply this algorithm to many other
discrete optimization problems, such as facility location problems [

Many data mining applications involve the task of
building a model for predictive classification. The goal of such a model is to
classify examples—records or data instances—into classes or categories of
the same type. Noise or unnecessary attributes may reduce the accuracy and
reliability of a classification or prediction model. Unnecessary attributes also
increase the costs of building and running a model—particularly on large
data sets. Before performing classification, it is therefore important to
select an appropriate subset of “good” attributes. Attribute selection
tries to simplify a data set by reducing its dimensionality and identifying
relevant underlying attributes without sacrificing predictive accuracy. As a
result, it reduces redundancy in the information provided by the attributes
used for prediction. For a more detailed review of the attribute selection task
using genetic algorithms, see [

The main difference between the DPSO and other
traditional PSO algorithms is that the particles in the DPSO do not represent
points inside an

The study presented here extends previous work
reported in [

The work is organized as follows. Section

The Naive Bayes
(NB) classifier uses a probabilistic approach to assign each record of the data
set to a possible class. In this work, the NB classifier assigns a protein of a
data set of proteins to a possible class. A Naive Bayes classifier assumes that
all attributes are conditionally independent of one another given the class
[
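The conditional-independence assumption can be made concrete with a small sketch. The following Python code (illustrative, with Laplace smoothing added for robustness; the paper does not specify its estimation details) trains a Naive Bayes classifier over binary attributes and classifies by maximizing P(class) × ∏ P(attribute | class):

```python
import math
from collections import defaultdict

def train_nb(examples, smoothing=1.0):
    """Estimate P(class) and P(attr = v | class) from (attrs, class) pairs,
    with Laplace smoothing for unseen attribute values."""
    class_counts = defaultdict(int)
    # value_counts[c][j][v] = number of class-c examples whose attribute j equals v
    value_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for attrs, c in examples:
        class_counts[c] += 1
        for j, v in enumerate(attrs):
            value_counts[c][j][v] += 1
    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}

    def likelihood(c, j, v):
        # Binary attributes, hence the 2 * smoothing in the denominator.
        return (value_counts[c][j][v] + smoothing) / (class_counts[c] + 2 * smoothing)

    return priors, likelihood

def classify_nb(priors, likelihood, attrs):
    """Pick the class maximizing log P(c) + sum_j log P(attr_j | c)."""
    best, best_score = None, float("-inf")
    for c, p in priors.items():
        score = math.log(p) + sum(math.log(likelihood(c, j, v))
                                  for j, v in enumerate(attrs))
        if score > best_score:
            best, best_score = c, score
    return best
```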

A Bayesian network (BN), by contrast, detects probabilistic relationships among these attributes and uses this information to aid the attribute selection process.

Bayesian networks are graphical representations of a
probability distribution over a set of variables of a given problem domain
[

An example of a Bayesian network is as follows. (This is a modified version of
the so-called “Asia” problem [

Bayesian network: nodes and values for the lung cancer problem. L = low, H = high, T = true, F = false, Pos = positive, and Neg = negative.

| Node name | Values |
|---|---|
| Pollution | {L, H} |
| Smoker | {T, F} |
| Cancer | {T, F} |
| Dyspnoea | {T, F} |
| X-ray | {Pos, Neg} |

Figure

A Bayesian network representing the lung cancer problem.

More formally, let

As the structure of a BN implies that the value of a
particular node is conditional only on the values of its parent nodes, (

Learning the structure of a BN is an NP-hard problem
[

A greedy search algorithm is used to generate
alternative structures for the BN. Starting with an empty network, the greedy
search algorithm adds to the network the edge that most increases the score
of the resulting network. The search stops when no further edge addition improves
the score of the network. Algorithm
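A minimal sketch of this greedy procedure, assuming only a black-box `score` function that returns higher values for better acyclic structures (the paper scores structures by classification accuracy on a validation set):

```python
import itertools

def greedy_bn_search(nodes, score):
    """Greedy BN structure search: start from an empty edge set and
    repeatedly add the single edge that most improves score(edges),
    stopping when no addition helps."""
    edges = set()
    current = score(edges)
    while True:
        best_gain, best_edge = 0.0, None
        for a, b in itertools.permutations(nodes, 2):
            if (a, b) in edges or creates_cycle(edges, a, b):
                continue
            gain = score(edges | {(a, b)}) - current
            if gain > best_gain:
                best_gain, best_edge = gain, (a, b)
        if best_edge is None:
            return edges
        edges.add(best_edge)
        current += best_gain

def creates_cycle(edges, a, b):
    """True if adding the directed edge a -> b would close a cycle,
    i.e. if a is already reachable from b."""
    stack, seen = [b], set()
    while stack:
        n = stack.pop()
        if n == a:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(d for (s, d) in edges if s == n)
    return False
```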

To evaluate the “goodness-of-fit” (score) of a
network structure to the data, an unconventional scoring metric—specific for
the target classification task—is adopted. The entire data set is divided
into mutually exclusive training and test sets—the standard methodology for
evaluating classifiers, see Section

Once the best network structure is selected, at the
end of the PSO run, the validation set and the other part of the training set
are merged and this merged data—that is, the entire original training set—is used to compute the probabilities for the selected Bayesian network. The
predicted accuracy—reported as the final result—is then computed on the
previously untouched test set. This process is discussed again, in more
detail, in Section

Particle swarm optimization (PSO) comprises a set of
search techniques, inspired by the behavior of natural swarms, for solving
optimization problems [

Every particle keeps a record of the best position
that it has ever visited. Such a record is called the particle's previous best
position and denoted by

As a whole, the set of rules that governs PSO comprises three phases: evaluate, compare, and imitate. The evaluation phase measures how well each particle (candidate solution) solves the problem at hand. The comparison phase identifies the best particles. The imitation phase produces new particle positions based on some of the best particles previously found. These three phases are repeated until a given stopping criterion is met. The objective is to find the particle that best solves the target problem.

Important concepts in PSO are velocity and
neighborhood topology. Each particle,

Potential solutions to the target problem are encoded
as fixed size binary strings; that is,

For the initial
population,

At the
beginning, the previous best position of

Every particle

For each
particle
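The standard binary PSO update described above can be sketched as follows. The inertia weight `w = 0.8` matches the value reported later in the experiments; the acceleration constants `c1`, `c2` and velocity clamp `v_max` are the conventional parameters with illustrative values. Velocities are updated as in continuous PSO and then squashed through a sigmoid, giving the probability that each bit is set:

```python
import math
import random

def binary_pso(fitness, n_bits, n_particles=20, iters=50,
               w=0.8, c1=2.0, c2=2.0, v_max=4.0, seed=0):
    """Sketch of a standard binary PSO (maximization)."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    V = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [x[:] for x in X]                    # previous best positions
    pbest_f = [fitness(x) for x in X]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]     # best particle found so far
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                V[i][d] = (w * V[i][d]
                           + c1 * rng.random() * (pbest[i][d] - X[i][d])
                           + c2 * rng.random() * (gbest[d] - X[i][d]))
                V[i][d] = max(-v_max, min(v_max, V[i][d]))  # clamp velocity
                prob = 1.0 / (1.0 + math.exp(-V[i][d]))     # sigmoid squashing
                X[i][d] = 1 if rng.random() < prob else 0
            f = fitness(X[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = X[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = X[i][:], f
    return gbest, gbest_f
```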

The DPSO
algorithm deals directly with discrete variables
(attributes) and, unlike the binary PSO algorithm, its population of candidate
solutions contains particles of different sizes. Potential solutions to the
optimization problem at hand are represented by a swarm of particles. There are

For example, given

Each particle

Each attribute
is represented by a unique positive integer number, or index. These numbers,
indices, vary from 1 to

The original work on DPSO [

The index of every attribute that has value 1 is
copied to the new solution (particle) of the DPSO initial population. For
instance, an initial candidate solution for the binary PSO algorithm equal to

Initializing the particles

In the DPSO algorithm, for simplicity, once the size
of a particle is determined at the initialization, the particle will keep that
same size during the entire execution of the algorithm. For example, particle

The DPSO
algorithm does not use a vector of velocities as the standard PSO algorithm
does. It works with proportional likelihoods instead. Arguably, the notion of
proportional likelihood used in the DPSO algorithm and the notion of velocity
used in the standard PSO are somewhat similar. DPSO uses

Every particle in DPSO is associated with a 2-by-

This is an example of a generic proportional
likelihood array

There is a one-to-one correspondence between the
columns of this array and the attributes of the problem domain. At the
beginning, all elements in the first row of

Note that

For instance, given

Note that index 1 is absent in

The new updated array

The
proportional likelihood array

To illustrate, suppose that

Suppose that this is the resulting array

A new particle position is then defined by ranking the
columns in

The next operation now is to select the indices that
will compose the new particle position. After ranking the array

Suppose that
the particle

The updating of
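Pieced together from the description above, the DPSO position update can be sketched as follows. The boost constants `a`, `b`, `c` and the additive form of the update are illustrative assumptions rather than the paper's exact formula; the essential mechanics are the proportional-likelihood row, the random scaling of each column, the ranking of columns, and the selection of as many top-ranked attribute indices as the particle's fixed size:

```python
import random

def dpso_new_position(n_attrs, size, particle, pbest, gbest,
                      a=0.10, b=0.12, c=0.14, rng=random):
    """Sketch of the DPSO position update.

    likelihood[i] is the proportional likelihood of attribute index i
    (attributes are numbered 1..n_attrs; index 0 is unused).  The boost
    constants a, b, c are illustrative guesses."""
    likelihood = [1.0] * (n_attrs + 1)
    for idx in particle:                 # reward the current position
        likelihood[idx] += a
    for idx in pbest:                    # reward the particle's previous best
        likelihood[idx] += b
    for idx in gbest:                    # reward the best particle found so far
        likelihood[idx] += c
    # Scale each column by a uniform random number, then rank the columns.
    scores = [(likelihood[i] * rng.random(), i) for i in range(1, n_attrs + 1)]
    scores.sort(reverse=True)
    # The particle keeps its fixed size: take the `size` top-ranked indices.
    return sorted(i for _, i in scores[:size])
```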

Once the algorithms have been explained, the next section briefly introduces the particular data sets (case studies) used to test the algorithms.

The experiments involved 6 data sets comprising two kinds of proteins, namely, G-protein-coupled receptors (GPCRs) and Enzymes.

The G-protein-coupled receptors (GPCRs) are a protein
superfamily of transmembrane receptors. Their function is to transduce signals
that induce a cellular response to the environment. GPCRs are involved in many
types of stimulus-response pathways, from intercellular communication to
physiological senses. GPCRs are of much interest to the pharmaceutical industry
because these proteins are involved in many pathological conditions—it is
estimated that GPCRs are the target of 40% to 50% of modern medical drugs [

Enzymes are proteins that
accelerate chemical reactions—they participate in many processes in a
biological cell. Some enzymes are used in the chemical industry and other
industrial applications where extremely specific catalysts are required. In
Enzyme Nomenclature, enzymes are assigned, and identified by, an Enzyme Commission
(EC) number. For instance, EC 2.3.4 denotes an enzyme with class value 2 in the
first hierarchical class level, class value 3 in the second class level, and so forth.
This work uses the GPCR and EC data sets described in Table

GPCR and EC data sets. “Cases” represents the
number of proteins in the data set, “Attributes” represents the total
number of attributes that describe the proteins in
the data set, and “L1”,

| Data set | Cases | Attributes | # Classes at L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|
| GPCR-PRINTS | 330 | 281 | 8 | 36 | 52 | 44 |
| GPCR-PROSITE | 190 | 127 | 8 | 32 | 32 | — |
| GPCR-InterPro | 580 | 448 | 12 | 43 | 67 | 46 |
| EC-PRINTS | 500 | 380 | 6 | 43 | 83 | — |
| EC-PROSITE | 570 | 583 | 6 | 42 | 84 | — |
| EC-Pfam | 730 | 706 | 6 | 41 | 92 | — |

These
data sets were derived from the data sets used in [

The data sets used in the experiments involved four kinds of protein signatures (biological “motifs”), namely, PROSITE patterns, PRINTS fingerprints, InterPro entries, and Pfam signatures.

PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families (a protein consists of a sequence of amino acids). PROSITE patterns are essentially regular expressions describing small regions of a protein sequence which present a high sequence similarity when compared to other proteins in the same functional family.

In the data sets, the absence of a given PROSITE pattern is indicated by a value of 0 for the attribute corresponding to that PROSITE pattern. The presence of it is indicated by a value of 1 for that same attribute.

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family. In the PRINTS data sets, a fingerprint corresponds to an attribute. The presence of a fingerprint is indicated by a value of 1 for that same attribute; the absence by a 0.

Pfam signatures are produced by hidden Markov models, and InterPro integrates a number of protein signature databases into a single database. In this work, Pfam and InterPro entries also correspond to binary attributes indicating whether or not a protein matches those entries, using the same codification described for the PROSITE patterns and Fingerprints.
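In all four cases, then, a protein reduces to a fixed-length 0/1 attribute vector. A trivial sketch (the signature identifiers below are invented for the example):

```python
# Each attribute corresponds to one signature; the value is 1 iff the
# protein matches that signature (PROSITE pattern, PRINTS fingerprint,
# InterPro entry, or Pfam model).  Names here are purely illustrative.
signatures = ["SIG_A", "SIG_B", "SIG_C", "SIG_D"]

def encode(protein_matches, signatures):
    """Return the binary attribute vector for one protein, given the
    set of signature identifiers it matches."""
    return [1 if s in protein_matches else 0 for s in signatures]
```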

The objective of the binary PSO and DPSO algorithms is
to classify each protein into its most suitable functional class level. The
classification of the proteins is performed in each class level individually.
For instance, given protein

The quality of a candidate solution (fitness) is evaluated in three different ways: (1) by a baseline algorithm—using all possible attributes; (2) by the binary PSO—using only the attributes selected by this algorithm; and (3) by the discrete PSO (DPSO) algorithm—using only the attributes selected by this algorithm. Each of these algorithms computes the fitness of every given solution using two distinct techniques: (a) using a Naive Bayes classifier; and (b) using a Bayesian network.

Note that the
computation of the fitness function

The measurement of

The computational experiments involved a 10-fold
crossvalidation method [

Each of the 10 folds is used once as the test set, and the remainder of the data is used as the training set. One of the 9 folds in the training set is reserved as a validation set. The Naive Bayes classifier and the Bayesian network use the remaining 8 folds to compute the probabilities required to classify new examples. Once those probabilities have been computed, the Naive Bayes (NB) classifier and the Bayesian network (BN) classify the examples in the validation set.
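The splitting protocol just described can be sketched as follows (the striped assignment of examples to folds is an arbitrary choice for the illustration; the paper does not specify how examples are partitioned):

```python
def crossvalidation_splits(n_examples, n_folds=10):
    """Yield (train_idx, validation_idx, test_idx) for each fold:
    one fold is the test set, one of the remaining folds is held out
    as the validation set (used to compute the PSO fitness), and the
    other folds train the classifier."""
    folds = [list(range(i, n_examples, n_folds)) for i in range(n_folds)]
    for t in range(n_folds):
        v = (t + 1) % n_folds           # reserve one training fold for validation
        train = [i for f in range(n_folds)
                 if f not in (t, v) for i in folds[f]]
        yield train, folds[v], folds[t]
```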

The accuracy of this classification on the validation
set is the value of the fitness functions

The reasons for having separate validation and test sets are as follows. In the classification task of data mining, by definition, the goal is to measure predictive accuracy—generalization ability—on a test set unseen during training. Hence, the test set cannot be accessed by the PSO, and is reserved just to compute the predictive accuracy associated with the Bayesian classifier constructed with the best set of attributes selected at the end of the PSO run.

Concerning the validation set, which is used to
compute the fitness of particles during the PSO run, this is a part of the
original training set which is different from the part of the training set used
to build the Bayesian classifier, and the reason for having these two separate
parts of the training set is to avoid overfitting of the classifier to the
training data; for overfitting in the context of classification, see [

In each of the 10 iterations of the crossvalidation procedure, the predictive accuracy of the classification is assessed by 3 different methods, as follows.

Since the Naive Bayes and Bayesian network classifiers used in this work are deterministic, only one run—for each of these algorithms—is performed for the classification using all possible attributes.

For the binary PSO and the DPSO algorithms, 30 independent runs are performed for each iteration of the crossvalidation procedure. The results reported are averaged over these 30 independent runs and over the 10 iterations of the crossvalidation procedure.

The population size used for both algorithms (binary PSO and DPSO) is 200 and the search stops after 20 000 fitness evaluations—or 100 iterations.

The binary PSO algorithm uses an inertia weight value
of 0.8 (i.e.,

Other choices of parameter values for the DPSO were

The measurement of the predictive accuracy rate of a model should be a reliable estimate of how well that model classifies the test examples—unseen during the training phase—on the target problem.

In Data Mining, typically, the
equation

However, if the class distribution is highly
unbalanced, (

This measurement has been used before in [

Note that if any of the quantities
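A common class-sensitive alternative to plain accuracy is built from the confusion-matrix counts; the product of sensitivity and specificity is one such measure and is sketched here purely as an illustrative stand-in, since the exact formula used cannot be read from this passage. Zero-valued denominators are mapped to a factor of 0:

```python
def accuracy(tp, tn, fp, fn):
    """Plain predictive accuracy: fraction of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)

def balanced_measure(tp, tn, fp, fn):
    """Illustrative class-sensitive measure: sensitivity * specificity.
    NOTE: an assumption, not necessarily the paper's exact metric.
    Any zero denominator makes the corresponding factor 0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity * specificity
```

On a 95:5 class split, a classifier that labels everything as the majority class scores an accuracy of 0.95 but a balanced measure of 0, which is exactly the failure mode plain accuracy hides on unbalanced data.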

Computational
results are reported in Tables

To assess the performance of the algorithms, two criteria were considered: (1) maximizing predictive accuracy; and (2) finding the smallest subset of attributes.

The results for the first criterion, accuracy, show that both versions of the PSO algorithm did better—in all class levels—than the baseline algorithm using all attributes.

Furthermore, the DPSO algorithm did slightly better than the binary PSO algorithm also in all class levels. Nevertheless, the difference in the predictive accuracy performance between these algorithms is, in some cases, statistically insignificant.

Table

Predictive accuracy:
binary PSO versus DPSO. Paired two-tailed

| Class level | Naive Bayes | Bayesian network |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |

Table

Nevertheless, the discriminating factor between the performance of these algorithms is on the second comparison criterion—finding the smallest subset of attributes.

The DPSO not only outperformed the binary PSO in
predictive accuracy, but also did so using a smaller subset of attributes in
all class levels. Moreover, when it comes to effectively pruning the set of
attributes, the difference in performance between the binary PSO and the DPSO
is always statistically significant, as Table

Number of selected attributes: binary PSO versus DPSO. Paired two-tailed

| Class level | Naive Bayes | Bayesian network |
|---|---|---|
| 1 | | |
| 2 | | |
| 3 | | |

Results for the
GPCRs data sets. For the binary PSO and DPSO algorithms, 30 independent runs
are performed. The results reported are averaged over these 30 independent
runs. The best result on each line for each performance criterion is marked
with an asterisk (^{*}).

GPCR-PRINTS (281 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 72.36 | 73.10 | ^{*}73.98 | 97.40 | ^{*}73.30 |
| | 2 | 35.56 | 37.10 | ^{*}40.74 | 130.30 | ^{*}117.30 |
| | 3 | 27.00 | 29.05 | ^{*}31.55 | 171.10 | ^{*}145.70 |
| | 4 | 24.26 | 26.97 | ^{*}30.14 | 165.00 | ^{*}141.30 |
| Bayesian network | 1 | 88.67 | 89.46 | ^{*}89.97 | 89.30 | ^{*}63.80 |
| | 2 | 53.46 | 56.75 | ^{*}58.91 | 123.70 | ^{*}103.00 |
| | 3 | 38.93 | 43.08 | ^{*}50.33 | 158.20 | ^{*}134.50 |
| | 4 | 28.47 | 30.56 | ^{*}39.52 | 152.60 | ^{*}126.80 |

GPCR-PROSITE (127 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 71.27 | 72.88 | ^{*}73.05 | 85.60 | ^{*}74.90 |
| | 2 | 30.00 | 31.34 | ^{*}32.60 | 101.50 | ^{*}83.80 |
| | 3 | 20.47 | 21.47 | ^{*}23.25 | 102.30 | ^{*}87.50 |
| Bayesian network | 1 | 78.05 | 79.03 | ^{*}80.54 | 78.50 | ^{*}65.50 |
| | 2 | 39.08 | 40.31 | ^{*}43.24 | 94.10 | ^{*}73.30 |
| | 3 | 24.70 | 26.14 | ^{*}28.97 | 94.90 | ^{*}77.60 |

GPCR-INTERPRO (448 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 54.17 | 55.33 | ^{*}56.55 | 136.40 | ^{*}120.70 |
| | 2 | 25.19 | 26.08 | ^{*}27.27 | 158.60 | ^{*}136.20 |
| | 3 | 20.03 | 21.19 | ^{*}22.03 | 203.60 | ^{*}162.40 |
| | 4 | 27.97 | 29.95 | ^{*}30.43 | 168.00 | ^{*}150.10 |
| Bayesian network | 1 | 86.68 | 89.20 | ^{*}89.49 | 122.60 | ^{*}107.70 |
| | 2 | 61.85 | 64.57 | ^{*}68.66 | 146.80 | ^{*}128.40 |
| | 3 | 40.77 | 44.11 | ^{*}46.51 | 184.60 | ^{*}148.10 |
| | 4 | 34.05 | 36.89 | ^{*}39.03 | 149.70 | ^{*}131.50 |

Results for the
EC data sets. For the binary PSO and the DPSO algorithms, 30 independent runs are
performed. The results reported are averaged over these 30 independent runs.
The best result on each line for each performance criterion is marked with an
asterisk (^{*}).

EC-PRINTS (380 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 72.35 | 73.78 | ^{*}74.81 | 102.80 | ^{*}64.20 |
| | 2 | 31.19 | 32.07 | ^{*}34.06 | 149.00 | ^{*}112.30 |
| | 3 | 23.37 | 24.64 | ^{*}26.97 | 211.10 | ^{*}150.60 |
| Bayesian network | 1 | 88.30 | 89.51 | ^{*}90.73 | 92.80 | ^{*}48.90 |
| | 2 | 53.15 | 55.14 | ^{*}56.92 | 129.70 | ^{*}102.00 |
| | 3 | 36.24 | 38.26 | ^{*}40.95 | 190.40 | ^{*}135.10 |

EC-PROSITE (583 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 69.52 | 70.37 | ^{*}72.31 | 118.80 | ^{*}98.90 |
| | 2 | 35.70 | 37.73 | ^{*}38.83 | 154.50 | ^{*}134.90 |
| | 3 | 21.91 | 22.86 | ^{*}24.36 | 197.70 | ^{*}154.50 |
| Bayesian network | 1 | 82.80 | 84.83 | ^{*}85.95 | 105.00 | ^{*}92.70 |
| | 2 | 45.30 | 47.82 | ^{*}49.50 | 135.20 | ^{*}119.00 |
| | 3 | 28.44 | 29.40 | ^{*}32.52 | 172.00 | ^{*}146.50 |

EC-PFAM (706 attributes)

| Method | Class level | Avg. accuracy: all attributes | Avg. accuracy: binary PSO | Avg. accuracy: discrete PSO | Avg. # selected: binary PSO | Avg. # selected: discrete PSO |
|---|---|---|---|---|---|---|
| Naive Bayes | 1 | 71.61 | 72.87 | ^{*}74.62 | 131.60 | ^{*}102.20 |
| | 2 | 46.70 | 48.24 | ^{*}49.02 | 212.60 | ^{*}153.90 |
| | 3 | 31.00 | 32.20 | ^{*}33.24 | 244.40 | ^{*}177.70 |
| Bayesian network | 1 | 85.94 | 87.94 | ^{*}89.64 | 116.60 | ^{*}91.80 |
| | 2 | 55.34 | 56.84 | ^{*}58.02 | 198.00 | ^{*}141.90 |
| | 3 | 36.56 | 37.61 | ^{*}39.44 | 221.70 | ^{*}168.60 |

Again, the predictive accuracy attained by both versions of the PSO algorithm surpassed the predictive accuracy obtained by the baseline algorithm in all class levels.

DPSO obtained the best predictive accuracy of all algorithms in all class levels. Regarding the second comparison criterion, finding the smallest subset of attributes, again DPSO always selected the smallest subset of attributes in all hierarchical levels.

The results on the performance of the classifiers—Naive Bayes versus Bayesian networks—show that Bayesian networks did a much better job. For all class levels, the predictive accuracy obtained by the 3 approaches (baseline, binary PSO, and DPSO) using Bayesian networks was significantly better than the predictive accuracy obtained using the Naive Bayes classifier. The Bayesian network approach also enabled the two PSO algorithms to do the job using fewer selected attributes than the Naive Bayes approach.

The results emphasize the importance of taking relationships among attributes into account—as Bayesian networks do—when performing attribute selection. If these relationships are ignored, predictive accuracy is adversely affected.

The results also show that for all 6 data sets tested, the DPSO algorithm not only selected the smallest subset of attributes, but also obtained the highest predictive accuracy in every single class level.

Computational
results show that the use of unnecessary attributes tends to mislead classifiers
and hurt classification accuracy. Using only a small subset of selected
attributes, the binary PSO and DPSO algorithms obtained better predictive
accuracy than the baseline algorithm using all attributes. Previous work had
already shown that the DPSO algorithm outperforms the binary PSO in the task of
attribute selection [

Even when the difference in predictive accuracy is insignificant, by selecting fewer attributes than the binary PSO, the DPSO certainly enhances computational efficiency of the classifier and is therefore preferable.

The original work on DPSO [

The results demonstrate that, even when starting from an identical initial population of particles, the DPSO still outperforms the binary PSO in both predictive accuracy and number of selected attributes. The DPSO is arguably not radically different from traditional PSO, yet it has features that enable it to improve over the binary PSO on the task of attribute selection.

Another result—although expected—from the experiments is the clear difference in performance between Naive Bayes and Bayesian networks used as classifiers. The Bayesian networks approach outperformed the Naive Bayes approach in all experiments and in all hierarchical class levels.

In this work, the hierarchical classification problem was dealt with in a simple way by “flattening” the hierarchy, that is, by predicting classes one class level at a time, which permitted the use of flat classification algorithms. The algorithms made no use of the class assigned to a protein at one level to help predict its class at the next hierarchical level. Future work intends to look at an algorithm that makes use of this information.


The authors
would like to thank Nick Holden for kindly providing them with the biological
data sets used in this work. The authors would also like to thank EPSRC (grant