Solving Two-Dimensional HP Model by Firefly Algorithm and Simplified Energy Function

In order to solve the HP model of the protein folding problem, we investigated traditional energy function and pointed out that its discrete property cannot give direction of the next step to the searching point, causing a challenge to optimization algorithms. Therefore, we introduced the simplified energy function into a turn traditional discrete energy function to continuous one. The simplified energy function totals the distance between all pairs of hydrophobic amino acids. To optimize the simplified energy function, we introduced the latest swarm intelligence algorithm, the firefly algorithm (FA). FA is a hot nature-inspired technique and has been used for solving nonlinear multimodal optimization problems in dynamic environment. We also proposed the code scheme strategy to apply FA to the simplified HP model with the clash test strategy. The experiment took 14 sequences of different chain lengths from 18 to 100 as the dataset and compared the FA with standard genetic algorithm and immune genetic algorithm. Each algorithm ran 20 times. The averaged energy convergence results show that FA achieves the lowest values. It concludes that it is effective to solve 2D HP model by the firefly algorithm and the simplified energy function.


Introduction
Protein folding is the process by which a protein structure assumes its functional shape or conformation. It is the physical process by which a polypeptide folds into its characteristic and functional three-dimensional structure from random coil. Each protein exists as an unfolded polypeptide or random coil when translated from a sequence of mRNA to a linear chain of amino acids [1]. This polypeptide lacks any developed three-dimensional structure. Amino acids interact with each other to produce a well-defined three-dimensional structure, the folded protein, known as the native state. The resulting three-dimensional structure is determined by the amino acid sequence [2,3].
The protein folding has a challenging search space, since nature identifies the global minimum from more than 10 50 possible conformations for the backbone of a small protein [4]. A successful prediction requires two major components: (1) a set of free energy components for the protein, which are computationally inexpensive enough to be used in the search procedure and sufficiently accurate to ensure the uniqueness of the native fold; (2) an efficient optimization procedure which is capable of finding an appropriate minimum for the strongly anisotropic function of hundreds of variables [5,6].
Scholars tend to use the 2D lattice model (HP model) for protein folding. The HP model was proposed by Lau and Dill [7]. In this model, proteins consist of two different kinds of residues, hydrophobic and hydrophilic. The task is to minimize the energy function, which is defined as the counting of every two hydrophobic residues that are nonconsecutive nearest neighbors on the lattice [8].
The recent literatures solving the 2D HP model were reported as follows. Lin and Hsieh [9] proposed an efficient hybrid Taguchi genetic algorithm that combines genetic algorithm, Taguchi method, and particle swarm optimization, in order to enhance the performance of predicting protein structure. In addition, Lin presented the PSO inspired by a mutation mechanism in a genetic algorithm. In the experiment, Lin demonstrated that their algorithm can be applied successfully to the protein folding problems based on the hydrophobic-hydrophilic lattice model. Zhang and Wu [10] investigated the bacterial chemotaxis optimization (BCO) on the 2D lattice model. He compared BCO with standard genetic algorithm, immune genetic algorithm, and artificial immune system for various chain lengths. He concluded that the BCO has the highest successful rate. Albrecht et al. [11] presented results from three-dimensional protein folding simulations in the HP model on ten benchmark problems. Their simulations are executed by a simulated annealing-based algorithm with a time-dependent cooling schedule. The neighborhood relation is determined by the pull-move set. The results of 10 benchmark problems provided experimental evidence of the relationship between the maximum depth, the stopping criterion, the mean of nonzero difference of the objective function, and their bounds. Chen et al. [12] suggested that protein folding cores form early in the process of folding, and proteins may have evolved to optimize both folding speed and native-state stability. Chen developed a set of empirical potential functions and applied them to analyze interaction energies among secondary structure elements in two proteins. Their work demonstrated that the predicted folding core also harbors residue that form native like interactions early in the folding reaction. Using a set of 29 unrelated proteins, Chen demonstrated that the average prediction result from their method is significantly better than predictions based on other computation methods. Sharma et al. [13] provided a description which is consistent with other natural processes, that the protein folding is formulated from the principle of increasing entropy. It then became evident that protein folding is an evolutionary process among many others. During the course of folding protein, structural hierarchy builds up in succession by diminishing energy density gradients in the quest for a stationary state determined by surrounding density in energy. Evolution toward more probable states, eventually attaining the stationary state, naturally selects steeply ascending paths on the entropy landscape that correspond to steeply descending paths on the free energy landscape. The dissipative motion of the non-Euclidian manifold is nondeterministic by its nature which clarifies why it is so difficult to predict protein folding. Zhang et al. [14] proposed a novel chaotic clonal genetic algorithm (CCGA) on a 2D lattice model. The algorithm combines chaos operator, clonal selection algorithm, and genetic algorithm. The experiments compared CCGA with standard genetic algorithm and immune genetic algorithm for various chain lengths, and found that CCGA not only find global minima more reliably, but also be significantly faster in convergence. Zhao [15] had introduced several natural computing approaches. In his paper, the evolutionary algorithm, memetic algorithm, ant colony optimization algorithm, tabu search, self-organizing map-based iterative approaches, and the typical chain growth computing approaches were reviewed. The advantages and disadvantages of various computing approaches and the current optimal results are analyzed.
However, those above mentioned algorithms cannot converge to the global optimal points within limited time. The reason lies at two sides. The first is that the energy function is discrete, so it cannot effectively guide the searching point as the continuous energy function. The second is that the optimization algorithms themselves are not rapid enough.
To solve the two outstanding issues, we introduced a simplified energy function [16], and we introduced the latest firefly algorithm (FA). FA is a hot nature-inspired technique and has been used for solving nonlinear multimodal optimization problems in dynamic environment [17]. The algorithm is based on the behavior of the fireflies. In social insect colonies, each firefly seems to have its own plans, and yet the group acts as a whole appears to be highly organized. Scholars published immense literatures reporting that its performance, effectiveness, and robustness are superior to GA, PSO, and other global algorithms in a wide range of fields [17][18][19].
The structure of this paper was organized as follows. Section 2 gave a brief overview on the simplified 2D HP model and its energy function. Section 3 introduced the fundamental of firefly algorithm, giving its pseudocodes and the encoding schemes to simplified 2D HP model. Section 4 contained the experiment of 14 chains varying from 18 to 100 residues and compared the FA algorithm to standard genetic algorithm and immune genetic algorithm. Finaly, Section 5 concluded the paper and summarized our contribution.

Overview of 2D HP Model
Proteins are the basis of biology. They are the driving force behind all of the biochemical reactions. They are the main constituent of our bones, muscles, hair, skin, and blood vessels. They recognize invading elements and allow the immune system to get rid of the unwanted invaders. For these reasons, scientists have sequenced the human genome, the blueprint for all of the proteins in biology. However, only knowing this sequence tells us little about what the protein does and how it does it. In order to carry out their function, they must take on a particular shape (fold). Therefore, proteins are genuinely amazing machines: before they do their work, they assemble themselves, which is called "folding. " [20] The protein folding problem is so difficult, so scholars prefer to use the 2D HP model proposed by Lau and Dill [7]. The task is to find the optimal folded configuration of a given sequence. Each potential configuration is associated with an energy function. The optimal configuration corresponds to the lowest energy.

The 2D HP Model.
There are two types of amino acids in the realistic world: hydrophobic amino acid and hydrophilic amino acid. Since proteins always exist in the watery environment, the hydrophobic amino acid and hydrophilic amino acid tend to cling to each other, so as to apart from water to the maximal degree [16]. Figure 1 shows an 8-residue chain (01111110) before and after folding. We use 1 to denote hydrophobic residue (denoted by red circle), and 0 to denote hydrophilic residue (denoted by green circle). Figure 1 indicates that (1) the chain can turn 90 ∘ left or right, or continue ahead at each point; (2) the hydrophobic residues compact themselves to keep away from water.

Simplified Energy Function.
In traditional HP model, the energy function is simple as adding −1 for each direct (occupying nondiagonal neighboring lattice points) between two hydrophobic amino acids, those are not consecutive in the protein sequences. Figure 2(a) shows a protein sequence with energy function as −2.
However, it is not easy to pinpoint the direct and not consecutive two hydrophobic amino acids. Therefore, a simplified energy scoring method via distance matrix is proposed at Mathworks contest in 2002 [16]. We introduced in this strategy in our work. The energy is scored by totaling the distance between all pairs of hydrophobic amino acids. For Figure 2(b), we used the blue lines to link all pairs of hydrophobic amino acids, and summed distances is calculated as 21.129. The detailed calculation procedure is shown as follows.

Theory Analysis of Simplified Energy Function.
We expect that the simplified energy function is better than the traditional function. The reason is that the traditional energy function is discrete or stair wisely with −1 decrease. It will not give instructive information for the searching point. While the simplified energy function used the sum of the distance matrix so as to transform traditional discrete energy function to a continuous one. Figure 3 gave an illustration. Figure 3(a) shows the discrete energy function, where the searching point found the energy function remains as 1 within the searching area, so it can not give the direction of the next step for the searching point. Figure 3(b) shows the continuous energy function, where the searching point found the energy function varies, so it will move towards the minimal point within the searching area.

Firefly Algorithm
A power optimization algorithm is of importance to solve 2D HP model. Firefly Algorithm (FA) is a nature-inspired algorithm based on the flashing behaviors of the firefly swarm [19]. The primary purpose for the flash of fireflies is to signal to attract other fireflies. The assumption of FA consists of three rules: (1) all fireflies are unisex, so that one firefly will be attracted to other fireflies regardless of their sex; (2) an important and interesting behavior of fireflies is to glow brighter mainly to attract prey and to share food with others; (3) attractiveness is proportional to their brightness, thus each agent firstly moves toward a neighbor that glows brighter.
In the FA, the fireflies are randomly distributed in the search space. The fireflies carry a luminescence quality, called luciferin, which emits light proportional to the quality [18]. Each firefly is attracted by the brighter glow of other approximated fireflies. The attractiveness decreases as their distance increases [21]. If there is no brighter one within the scope of a firefly, it will move randomly in the search space.

Variation of Light
Intensity. The brightness is related to the objective values, so for an optimization problem, a firefly with higher intensity will attract another firefly with higher probability, and vice versa. Assume that there exists a swarm of fireflies, and represents a solution for a firefly , whereas ( ) denotes its corresponding energy value. Here, the brightness of a firefly is equivalent to the simplified energy value = ( ) , 1 ≤ ≤ . (3)

Movement towards Attractive Flies.
The attractiveness of the firefly is proportional to the light intensity received by the adjacent fireflies [22]. Suppose 0 is the attractiveness with distance = 0, so for two fireflies and at locations and , their attractiveness is calculated as where ( , ) denotes the distance between fireflies and , denotes the light absorption coefficient. Suppose that firefly is brighter than firefly , then firefly will move to a new location as 3.3. Pseudocodes. See Pseudocode 1.

Encoding Scheme.
For the 2D Hp model, we set the firefly as a string of digits chosen from the set {0, 1, 2}. Each digit denotes a torsion angle as: 0 → left, 1 → right, and 2 → continue [8]. The initialization of FA generated the fireflies with random digits, and the first link of the sequence is not included in the encoding. Figure 4 shows that the direction of the first link only makes the structure rotated. Their structures of the lattice model does not change, neither does the corresponding energy levels.

Clash
Test. Sometimes the decoded string represents a clash, which means a lattice node is occupied by more than one residue, as shown in Figure 5. If a clash occurred, this firefly will be replaced by a new randomly generated firefly.
If the clash occurs in the new one, then the procedures are repeated until no clash occurs. Meanwhile, the clash tests need to be checked at every step after the fireflies are updated, see (5) for detailed information.

Experiments
The experiments were carried out on the platform of P4 IBM with 3.1 GHz processor and 2 G RAM, running under Windows XP operating system. The algorithm was in-house developed via Matlab 2012a. The experiment data contains 14 protein folding problems with different chain lengths varying from 18 to 100 generated randomly. Their sequences are listed in Table 1. Figure 6 gave the optimal solutions obtained by FA through the simplified HP model. It forces the hydrophobic residues to fold themselves together to the utmost extent. Obviously, all the hydrophobic residues are located at the inner side, while the hydrophilic residues are located outside. It satisfies the expectation of 2D HP model. Therefore, this simplified energy function is effective.

Algorithm Comparison.
We compared the FA with traditional standard genetic algorithm (SGA) and immune genetic algorithm (IGA). We ran each algorithm 20 times and got the averaged converged energy values of total 20 times results. The parameters are set by hundreds of experiments to get the optimal values and are shown in Table 2.
The average convergence results were illustrated in Figure 7. It indicated that the firefly algorithm achieves the least averaged convergence values among all three algorithms Mathematical Problems in Engineering 5 The pseudo-codes of the firefly algorithm can be summarized as Step 1 Initialization.
Step 1.2 Formulate light intensity of each firefly so as to be associated with the energy value ( ).
Step 1.3 Define the parameters 0 and . Step 2 Perform. while (termination criteria are not met) for = 1 to n for j = 1 to if ( < ) Move firefly towards firefly via (5). end if Update Attractiveness Table. Evaluate New Solutions and Update light intensity. end for end for Rank the fireflies and find the best. end for while Step 3 Post-processing and output. for all protein sequences. Therefore, the FA is the most effective among the three algorithms. Besides, the longer the proteins become (the index increases), the larger the advantages of FA turn.

Energy Function Comparison.
In this chapter, we compared the simplified energy function with traditional energy function. We choose the first protein in the dataset, of which the minimal traditional energy function is −5, and the minimal simplified energy function is 100.97. Their convergence curves are shown in Figure 8. We see that the simplified energy function merely costs about 1/3 iterations of those needed by the traditional energy function. The reason lies in the fact that the traditional energy function is discrete and cannot give effective direction for the searching point to move to, whereas the simplified energy function can be regarded as a continuous function, so the searching point is easy to guide itself moving towards the minimal point within its search territory. Moreover, Figure 8 validates that our developed theory in Section 2.3 is coherent to the experimental results.

Conclusions and Future Work
This study introduced the firefly algorithm and the simplified energy function and applied them to the 2D HP lattice model of protein folding problem. The experimental results on 20 runs of 14 chains with different lengths showed that the FA can achieve the lowest averaged energy values compared to standard genetic algorithm and immune genetic algorithm. We also prove the superiority of the simplified energy function on traditional energy function.
Our contributions can be summarized as follows.
(1) We applied firefly algorithm to the field of protein folding problem. (2) We introduced the simplified energy function.
(3) We demonstrate the superiority of the simplified energy function on traditional energy function on both from the discrete and continuous energy function theory and experimental results.