Exon recognition, the task of identifying the exons in a DNA sequence, is fundamental in bioinformatics. Exon recognition algorithms based on digital signal processing techniques are widely used, but they require many calculations, which makes recognition slow. To overcome this limitation, this paper proposes and implements a two-stage exon recognition model. The work has three main parts. First, a synergetic neural network rapidly determines the initial exon intervals. Second, an adaptive sliding window refines them into the final exon intervals. Third, parameter optimization based on the artificial fish swarm algorithm determines the species-specific thresholds and the corresponding adjustment parameters of the adaptive windows. Experimental results show that the proposed model achieves better exon recognition performance than the compared methods and offers a practical approach that may carry over to other recognition tasks.
1. Introduction
With the completion of the Human Genome Project, the volume of gene data has grown exponentially. Identifying the protein-coding genes in DNA [1] has important theoretical and practical implications, and obtaining accurate genetic information quickly is an urgent problem.
Early exon recognition methods were based mainly on statistical models [2], which obtain the chromosomal order of genes by statistical analysis of different genes. As the number of genomes increases, however, statistical methods can no longer meet the need for rapid exon recognition. At present, exon recognition methods based on digital signal processing are also widely used [3–5]. These techniques select a suitable mapping method and transformation method to obtain spectral values and then identify exons with a fixed-length window. Their limitations include slow recognition and the inability to determine an accurate threshold for different species.
Synergetic theory [6], proposed by Haken, describes a high-dimensional nonlinear problem as a set of low-dimensional nonlinear equations. One advantage of the synergetic neural network is that it is robust against noise and handles fuzzy matching problems well [7–9]. Exon recognition can be regarded as a pattern recognition problem, so a synergetic neural network can be applied to it.
The artificial fish swarm algorithm (AFSA) [10, 11] is a swarm intelligence optimization algorithm based on animal behavior, proposed in 2002. Its basic idea is to imitate fish behaviors such as preying, swarming, and following. AFSA is well suited to a variety of numerical optimization problems and quickly became an active topic in the optimization field. Because of its simple principle and good robustness, AFSA has been applied successfully to many kinds of optimization problems, such as image segmentation [12], color quantization [13], neural networks [14], fuzzy logic controllers [15], multirobot task scheduling [16], fault diagnosis in mine hoists [17], and data clustering [18].
In this paper, we propose a two-stage exon recognition model based on a synergetic neural network and the artificial fish swarm algorithm. The paper is organized as follows. First, the traditional exon recognition method based on digital signal processing and related work are presented. Second, an exon recognition model based on a synergetic neural network and a parameter optimization method based on the artificial fish swarm algorithm are introduced. Finally, experimental tests, results, and conclusions are given.
2. Introduction to Exon Recognition Method Based on Digital Signal Processing
A gene is usually divided into many fragments. The coding fragments are called exons and the noncoding fragments are called introns, as shown in Figure 1.
Figure 1: Structure of a eukaryotic DNA sequence.
The objective of gene recognition is to identify the exons of a DNA sequence. Gene recognition based on digital signal processing consists of several steps [19, 20]. First, the gene sequence is transformed into a numeric sequence using a mapping method [21–24]. Next, the corresponding frequency values are calculated by the fast Fourier transform, and the 3-Cycle property of the spectrum is used to identify exons [25, 26]. Finally, a fixed sliding window is used for automatic exon recognition, as shown in Figure 2.
Figure 2: Exon recognition algorithm based on the 3-Cycle spectrum.
2.1. Z-Curve Mapping
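For digital processing, a sequence over the four nucleotides A, T, G, and C must first be transformed into a corresponding numeric sequence according to fixed rules.
Let the four indicator sequences be $\{u_b[n]\}$, $b \in I = \{A, C, G, T\}$, where $u_b[n] = 1$ if nucleotide $b$ occupies position $n$ and $u_b[n] = 0$ otherwise, and let the cumulative sequence $b_n$ ($n = 0, 1, \ldots, N-1$) be $b_n = \sum_{i=0}^{n-1} u_b[i]$. Then we can define three sequences $x[n]$, $y[n]$, and $z[n]$:
$$x[n] = 2(A_n + G_n) - n, \quad y[n] = 2(A_n + C_n) - n, \quad z[n] = 2(A_n + T_n) - n. \tag{1}$$
Let
$$x[-1] = 0, \quad y[-1] = 0, \quad z[-1] = 0, \quad \Delta x[n] = x[n] - x[n-1], \quad \Delta y[n] = y[n] - y[n-1], \quad \Delta z[n] = z[n] - z[n-1]. \tag{2}$$
Thus we can get the Z-curve mapping:
$$\begin{pmatrix} \Delta x[n] \\ \Delta y[n] \\ \Delta z[n] \end{pmatrix} = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} u_A[n] \\ u_C[n] \\ u_G[n] \\ u_T[n] \end{pmatrix}. \tag{3}$$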
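For example, if the DNA sequence S(n) is ACGTTAG, the corresponding Z-curve mapping sequence is shown in Table 1.
Table 1: Z-curve mapping sequence of ACGTTAG.
| S(n) | A | C | G | T | T | A | G |
|---|---|---|---|---|---|---|---|
| Δx[n] | 1 | −1 | 1 | −1 | −1 | 1 | 1 |
| Δy[n] | 1 | 1 | −1 | −1 | −1 | 1 | −1 |
| Δz[n] | 1 | −1 | −1 | 1 | 1 | 1 | −1 |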
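As a minimal sketch of the Z-curve mapping in (3), the following Python snippet converts a DNA string into the three difference sequences. The names `Z_CURVE_STEP` and `z_curve_differences` are illustrative and do not come from the paper; the lookup table simply lists the columns of the matrix in (3), and the sketch assumes that the input contains only the letters A, C, G, and T.

```python
import numpy as np

# Columns of the matrix in (3): the (dx, dy, dz) step contributed by each nucleotide.
Z_CURVE_STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve_differences(sequence):
    """Return an integer array of shape (3, N) with rows dx[n], dy[n], dz[n]."""
    steps = [Z_CURVE_STEP[base] for base in sequence.upper()]
    return np.array(steps, dtype=int).T.reshape(3, -1)


dx, dy, dz = z_curve_differences("ACGTTAG")
# dx, dy, and dz reproduce the three rows of Table 1 for ACGTTAG.
```

A dictionary lookup is used instead of an explicit matrix product because each indicator vector has exactly one nonzero entry, so the product in (3) reduces to selecting one column.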
2.2. The Power Spectrum
To study the characteristics of DNA coding sequences (exons), we apply the discrete Fourier transform (DFT) to each of the indicator sequences:
$$U_b[k] = \sum_{n=0}^{N-1} u_b[n] e^{-j 2\pi nk/N}, \quad k = 0, 1, \ldots, N-1. \tag{4}$$
Thus we can calculate the power spectrum:
$$P_z[k] = |\Delta X[k]|^2 + |\Delta Y[k]|^2 + |\Delta Z[k]|^2, \quad k = 0, 1, \ldots, N-1, \tag{5}$$
where $\Delta X[k]$, $\Delta Y[k]$, and $\Delta Z[k]$ are the Fourier transforms of $\Delta x[n]$, $\Delta y[n]$, and $\Delta z[n]$, respectively.
In the power spectrum of an exon sequence, pronounced peaks appear at $k = N/3$ and $k = 2N/3$, whereas intron sequences show no such peaks. This statistical phenomenon is known as 3-Cycle. Suppose that the average power spectrum of a DNA sequence is
$$\bar{E} = \frac{\sum_{k=0}^{N-1} P[k]}{N}. \tag{6}$$
The ratio of the power spectrum at $k = N/3$ to the average power spectrum of the entire sequence is known as the SNR (signal-to-noise ratio):
$$R = \frac{P[N/3]}{\bar{E}}. \tag{7}$$
Figure 3 shows the power spectrum of a viral gene sequence.
Figure 3: The power spectrum of a viral gene sequence.
From Figure 3, we can see that the spectrum shows an obvious 3-Cycle, with peaks at roughly 2000, 4000, and 6000. The exon segments can therefore be located, which enables the recognition of genes.
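The highest point of the power spectrum may not appear exactly at $k = N/3$ and $k = 2N/3$ but may occur nearby. We therefore calculate the average SNRs $R_1$ and $R_2$ over the intervals $[N/3 - \gamma, N/3 + \gamma]$ and $[2N/3 - \gamma, 2N/3 + \gamma]$, respectively:
$$R_1 = \frac{\sum_{k=N/3-\gamma}^{N/3+\gamma} P[k]}{(2\gamma + 1)\bar{E}}, \quad R_2 = \frac{\sum_{k=2N/3-\gamma}^{2N/3+\gamma} P[k]}{(2\gamma + 1)\bar{E}}. \tag{8}$$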
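The quantities in (5) to (8) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, N/3 and 2N/3 are taken by integer division when N is not a multiple of 3, and band indices wrap around modulo N.

```python
import numpy as np

# (dx, dy, dz) step of each nucleotide, i.e. the columns of the matrix in (3).
STEP = {"A": (1, 1, 1), "C": (-1, 1, -1), "G": (1, -1, -1), "T": (-1, -1, 1)}


def power_spectrum(sequence):
    """P[k] = |dX[k]|^2 + |dY[k]|^2 + |dZ[k]|^2, as in (5)."""
    deltas = np.array([STEP[b] for b in sequence.upper()], dtype=float).T
    spectra = np.fft.fft(deltas, axis=1)  # one DFT per difference sequence
    return np.sum(np.abs(spectra) ** 2, axis=0)


def snr(sequence):
    """R = P[N/3] / mean(P), as in (6) and (7)."""
    p = power_spectrum(sequence)
    return p[len(p) // 3] / p.mean()


def band_snr(sequence, gamma):
    """Average SNRs R1 and R2 around N/3 and 2N/3, as in (8)."""
    p = power_spectrum(sequence)
    n = len(p)
    mean_power = p.mean()

    def band(center):
        idx = np.arange(center - gamma, center + gamma + 1) % n
        return p[idx].sum() / ((2 * gamma + 1) * mean_power)

    return band(n // 3), band((2 * n) // 3)
```

For a strictly period-3 toy sequence such as `"ATG" * 30`, all power outside k = 0 falls at k = N/3 and k = 2N/3, so `snr` is large (about 40), which is the behavior the 3-Cycle property describes.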
2.3. Automatic Recognition Algorithm Based on Fixed Sliding Windows
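Suppose $M$ is the length of the fixed window. For each of the four indicator sequences $\{u_b[n]\}$ ($0 \le n \le N-1$), we compute a DFT over the window centered at position $n$:
$$U_b[k] = \sum_{i=n-(M-1)/2}^{n+(M-1)/2} u_b[i] e^{-j 2\pi ik/M}, \quad k = 0, 1, \ldots, M-1. \tag{9}$$
Then the total spectrum $p(n; M/3)$ at frequency index $M/3$ is
$$P\left[\tfrac{M}{3}\right] = \left|U_A\left[\tfrac{M}{3}\right]\right|^2 + \left|U_T\left[\tfrac{M}{3}\right]\right|^2 + \left|U_G\left[\tfrac{M}{3}\right]\right|^2 + \left|U_C\left[\tfrac{M}{3}\right]\right|^2 \triangleq p\left(n; \tfrac{M}{3}\right). \tag{10}$$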
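A sketch of the fixed-window spectrum in (9) and (10) is given below. The function name `window_spectrum` and the default window length of 351 are assumptions made for illustration; the sketch further assumes an odd window length that is a multiple of 3, a sequence at least as long as the window, and NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def window_spectrum(sequence, window=351):
    """Return p(n; M/3) of (10) for every position n where the full window fits."""
    seq = np.frombuffer(sequence.upper().encode("ascii"), dtype=np.uint8)
    k = window // 3
    # Phase factors indexed relative to the window start. Compared with the
    # absolute index i in (9) this only multiplies U_b by a unit-magnitude
    # constant, so |U_b|^2 is unchanged.
    phase = np.exp(-2j * np.pi * k * np.arange(window) / window)
    total = np.zeros(len(seq) - window + 1)
    for base in b"ACGT":
        indicator = (seq == base).astype(float)       # u_b[n]
        windows = sliding_window_view(indicator, window)
        total += np.abs(windows @ phase) ** 2         # |U_b[M/3]|^2 per position
    return total
```

Entry `j` of the result corresponds to the window centered at position `j + (window - 1) // 2` of the sequence.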
3. Related Work
The SNR of exon sequences reflects the distribution of spectrum peak. SNR greater than a given threshold is a characteristic of exons, while introns generally do not have this property.
Protein coding regions and noncoding regions can be distinguished by the value of the SNR, but this method still has a large prediction error because the spectral peak varies among biological categories. A single fixed threshold is therefore unreasonable, and determining the SNR threshold is of great significance for exon recognition. Note that it is difficult to find a proper prediction threshold for a biological category when relying only on prior biological knowledge.
Xu [27] proposed a method based on the bootstrap algorithm to determine the best SNR threshold from marked exon sequences. That study reported an average prediction accuracy of 81%, which is 19% higher than that of other methods that employ empirical thresholds. In [28], a novel model was proposed that determines the SNR threshold based on the means of the biological categories, which improved recognition performance to some extent.
However, all the methods mentioned above have problems, such as slow recognition, inaccurate determination of the threshold for different species, and the requirement that the exon fragments of the DNA sequences be known. In the following sections, we propose a novel two-stage exon recognition model based on a synergetic neural network and the artificial fish swarm algorithm to deal with these problems.
4. A Novel Two-Stage Exon Recognition Model
In this section, a two-stage exon recognition model is presented. In the first stage, a synergetic neural network is used to determine the initial exon intervals. In the second stage, the final exon intervals are determined accurately with an adaptive sliding window, and a parameter optimization algorithm is introduced.
4.1. Initial Exon Intervals Determination Based on Synergetic Neural Network
The basic principle of the synergetic neural network [29, 30] is that the pattern recognition procedure can be viewed as a competition process among many order parameters. The strongest order parameter wins the competition, and the desired pattern is recognized.
A pattern $q$ to be recognized is driven by a dynamic process through intermediate states $q(t)$ into one of the prototype pattern vectors $v_k$, namely, the prototype pattern closest to $q(0)$. The process is described as follows:
$$q \longrightarrow q(t) \longrightarrow v_k. \tag{11}$$
A dynamic equation can be given for an unrecognized pattern $q$:
$$\dot{q} = \sum_{k=1}^{M} \lambda_k v_k (v_k^{+} q) - B \sum_{k' \neq k} (v_{k'}^{+} q)^2 (v_k^{+} q) v_k - C (q^{+} q) q + F(t), \tag{12}$$
where $q$ is the state vector of the input pattern with initial value $q_0$, $\lambda_k$ is the attention parameter, $B$ and $C$ are constant network parameters, $v_k$ is the prototype pattern vector, and $v_k^{+}$ is the adjoint vector of $v_k$, which satisfies
$$(v_k^{+}, v_{k'}^{T}) = v_k^{+} \cdot v_{k'}^{T} = \delta_{kk'}. \tag{13}$$
The corresponding dynamic equation of the order parameters $\xi_k = v_k^{+} q$ is
$$\dot{\xi}_k = \lambda_k \xi_k - B \sum_{k' \neq k} \xi_{k'}^2 \xi_k - C \left| \sum_{k'=1}^{M} \xi_{k'}^2 \right| \xi_k. \tag{14}$$
Haken has proved that when all attention parameters are equal, $\lambda_k = c$ ($c > 0$), the order parameter with the largest initial value wins the competition and the network then converges.
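The competition described by (14) can be illustrated numerically. The sketch below integrates (14) with a simple explicit Euler scheme; the function name, the step size, the iteration count, and the choice $\lambda_k = B = C = 1$ are assumptions made for illustration only, not settings reported in this paper.

```python
import numpy as np


def evolve_order_parameters(xi0, lam=1.0, B=1.0, C=1.0, step=0.01, iterations=5000):
    """Integrate the order parameter equation (14) with explicit Euler steps."""
    xi = np.array(xi0, dtype=float)
    for _ in range(iterations):
        total = np.sum(xi ** 2)      # sum over all k' of xi_k'^2
        others = total - xi ** 2     # sum over k' != k of xi_k'^2
        xi += step * (lam * xi - B * others * xi - C * total * xi)
    return xi


final = evolve_order_parameters([0.6, 0.4])
# With equal attention parameters, the initially larger order parameter
# approaches sqrt(lam / C) while the other one decays toward zero.
```

With equal attention parameters, the only stable states have a single nonzero order parameter of magnitude $\sqrt{\lambda / C}$, which is why the initially largest component wins.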
We first introduce synergetic theory into exon recognition; the exon recognition algorithm based on the synergetic neural network is shown in Figure 4.
Figure 4: Exon recognition based on synergetic neural network.
We use the synergetic neural network and the N equal method to quickly determine the initial exon region, as shown in Algorithm 1.
Algorithm 1: Determination of the initial exon region based on synergetic neural network.
(1) Let $S$ be a given gene sequence, let $S_{start}$ and $S_{end}$ be the beginning and end of the sequence, respectively, and let $T_0$ be the threshold of the spectral values;
(2) Convert the gene sequence to the corresponding numeric sequence using Z-curve mapping;
(3) Use the fast Fourier transform to obtain the spectral values $R_1$ and $R_2$ according to formula (8);
(4) Calculate the order parameters of the gene sequence: $\xi_1 = R_1 / (R_1 + R_2)$, $\xi_2 = R_2 / (R_1 + R_2)$;
(5) Set the network parameters $\lambda_k$, $B$, and $C$;
(6) Evolve the order parameters according to formula (14);
(7) If $\xi_1 > T_0$ and $\xi_2 > T_0$, then record $[S_{start}, S_{end}]$ as a possible interval, divide $S$ equally into $n$ intervals $S_1, S_2, \ldots, S_n$, and repeat steps 1 to 7 on these intervals;
(8) End.
4.2. Get Precise Exon Intervals Using Adaptive Smoothing Window
We can obtain several possible exon intervals by Algorithm 1. In this section, we propose an adaptive sliding window algorithm to get more accurate intervals, as shown in Algorithm 2.
Algorithm 2: Precise exon regions based on adaptive smoothing window.
(1) Let $W$ be a given gene sequence, and let $W_{start}$ and $W_{end}$ be the beginning and the end of the sequence, respectively;
(2) Convert the gene sequence $[W_{start}, W_{end}]$ to the corresponding numeric sequence using Z-curve mapping;
(3) Use the fast Fourier transform to obtain the spectral values;
(4) Calculate the order parameters of the gene sequence: $\xi_1 = R_1 / (R_1 + R_2)$, $\xi_2 = R_2 / (R_1 + R_2)$;
(5) Evolve the order parameters according to formula (14);
(6) If $\xi_1 > T_0$, $\xi_2 > T_0$, and $W_{start} + \gamma < W_{end} - \gamma$, then set $W_{start} = W_{start} + \gamma$ and $W_{end} = W_{end} - \gamma$, and repeat steps 2 to 6;
(7) Output the final interval $[W_{start}, W_{end}]$.
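The control flow of Algorithm 2 can be sketched as a short loop. In this sketch the callable `passes_threshold` is a hypothetical stand-in for steps 2 to 5 (Z-curve mapping, FFT, order parameter evolution, and comparison with $T_0$); it is supplied by the caller and is not an API defined in this paper. The loop follows the steps literally as they are written above.

```python
def refine_interval(start, end, gamma, passes_threshold):
    """Shrink [start, end] by gamma on both sides while the test in step 6 holds.

    passes_threshold(start, end) is a caller-supplied (hypothetical) predicate
    that returns True when both order parameters of the interval exceed T0.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive, otherwise the loop cannot progress")
    while passes_threshold(start, end) and start + gamma < end - gamma:
        start += gamma
        end -= gamma
    return start, end


# Toy predicate for illustration only: treat intervals longer than 40 as passing.
interval = refine_interval(0, 100, 10, lambda s, e: e - s > 40)
```

Passing the threshold test in as a function keeps the window logic separate from the spectral computation, so either part can be replaced independently.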
4.3. Parameter Optimization Based on Artificial Fish Swarm Algorithm
The parameters $T_0$ and $\gamma$ directly influence the performance of exon recognition. Adjusting these parameters is a global optimization problem, and at present there is no general theory for controlling them during recognition. In this section, the artificial fish swarm algorithm is used to search for the globally optimal parameters $(T_0, \gamma)$ in the corresponding parameter space.
The parameter optimization based on artificial fish swarm algorithm is shown as Algorithm 3.
Algorithm 3: Parameter optimization based on artificial fish swarm algorithm.
(1) Initialize the parameters of the artificial fish, such as the step, the visual range, the number of exploratory attempts, and the maximum number of iterations, and randomly generate n fish;
(2) Set up a bulletin board to record the current status of each fish, and select the optimal value;
(3) Perform the prey behavior, swarm behavior, and follow behavior;
(4) Update the optimal value on the bulletin board;
(5) If the termination condition is satisfied, output the result; otherwise return to step 2.
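The steps of Algorithm 3 can be sketched as follows. This is a simplified AFSA variant written for illustration, not the authors' implementation: it omits the crowding factor, each fish simply takes the best of its prey, swarm, and follow candidates, and all names, default values, and the box-constraint handling are assumptions. The `score` argument is any function to be maximized, for example the recognition accuracy as a function of $(T_0, \gamma)$.

```python
import numpy as np


def afsa_maximize(score, lower, upper, n_fish=20, visual=1.0, step=0.3,
                  try_number=5, iterations=40, seed=0):
    """Simplified artificial fish swarm search for a maximizer of score(x)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    fish = rng.uniform(lower, upper, size=(n_fish, lower.size))

    def move_toward(x, target):
        direction = target - x
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return x.copy()
        return np.clip(x + rng.random() * step * direction / norm, lower, upper)

    def prey(x):
        # Try a few random points within the visual range; move toward the first better one.
        for _ in range(try_number):
            candidate = np.clip(x + visual * rng.uniform(-1, 1, size=x.shape), lower, upper)
            if score(candidate) > score(x):
                return move_toward(x, candidate)
        return np.clip(x + step * rng.uniform(-1, 1, size=x.shape), lower, upper)

    best = max(fish, key=score).copy()  # bulletin board
    for _ in range(iterations):
        for i in range(n_fish):
            x = fish[i].copy()
            candidates = [prey(x)]
            mask = np.linalg.norm(fish - x, axis=1) < visual
            mask[i] = False
            near = fish[mask]
            if len(near) > 0:
                centre = near.mean(axis=0)
                if score(centre) > score(x):
                    candidates.append(move_toward(x, centre))   # swarm behavior
                leader = max(near, key=score)
                if score(leader) > score(x):
                    candidates.append(move_toward(x, leader))   # follow behavior
            fish[i] = max(candidates, key=score)
            if score(fish[i]) > score(best):
                best = fish[i].copy()                           # update bulletin board
    return best, score(best)
```

A call such as `afsa_maximize(lambda p: -((p[0] - 1) ** 2 + (p[1] + 2) ** 2), [-5, -5], [5, 5])` returns the best point found and its score; because the bulletin board is replaced only by strictly better fish, the returned score is never worse than that of the best initial fish.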
5. Experiment
5.1. Data Description
In our experiments, we use gene sequences provided by the Chinese Graduate Mathematical Contest in Modeling, a contest that aims to improve students' comprehensive ability to solve practical problems with mathematical modeling and computing. The contest asks participants to combine a variety of mathematical methods, from different points of view, to build mathematical models of the characteristics of such data.
We selected 100 human gene sequences, 100 rodent gene sequences (including Mus musculus and Sewer rat), and 200 mammalian gene sequences for testing. The signal-to-noise ratios of the sequences were computed with the SPSS statistical analysis software, as shown in Table 2.
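Table 2: The signal-to-noise ratio of four different gene sequences.
| Gene categories | Exon: Number | Exon: R-mean | Exon: Variance | Intron: Number | Intron: R-mean | Intron: Variance |
|---|---|---|---|---|---|---|
| Human | 35 | 3.02 | 3.071 | 26 | 0.82 | 0.533 |
| Mus musculus | 357 | 2.46 | 2.508 | 275 | 0.68 | 0.414 |
| Sewer rat | 45 | 3 | 5.233 | 35 | 0.83 | 0.624 |
| Mammalian | 827 | 2.72 | 6.243 | 626 | 0.67 | 0.394 |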
From Table 2, we can see that in every category the SNR variance of exons is greater than the SNR variance of introns.
At the same time, we analyzed the SNR distributions of the exons and introns of the 200 mammalian gene sequences, as shown in Figure 5 and Figure 6.
Figure 5: The SNR distribution of 200 mammalian exons.
Figure 6: The SNR distribution of 200 mammalian introns.
From Figure 5 and Figure 6, we can see that the SNRs of mammalian introns are mostly less than 2, but the majority of exon SNRs (55.38%) also fall in the range [0, 2]. Therefore, it is unreasonable to use one fixed SNR threshold for different categories, and accurately determining the SNR threshold for each kind of biological gene is of great importance.
5.2. Experiment Results
Let the sensitivity be SN = TP/(TP + FN) and the specificity be SP = TN/(TN + FP), where TP is the number of exons that are correctly identified, TN is the number of introns that are correctly identified, FN is the number of exons that are not identified, and FP is the number of introns that are incorrectly identified as exons. The accuracy rate is then Ac = (SN + SP)/2.
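As a small worked example of these measures, the sketch below computes SN, SP, and Ac from the four counts. The function name is illustrative, and the counts are read as in Tables 3 and 5, where TP + FN equals the number of exons and TN + FP equals the number of introns.

```python
def recognition_metrics(tp, fn, tn, fp):
    """Return (SN, SP, Ac) for the given counts.

    tp: exons correctly identified      fn: exons missed
    tn: introns correctly identified    fp: introns reported as exons
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return sn, sp, (sn + sp) / 2


# Baseline counts for Mus musculus from Table 3: TP=146, FN=211, TN=271, FP=4.
sn, sp, ac = recognition_metrics(146, 211, 271, 4)
# Rounded, these give SN = 0.409, SP = 0.985, and Ac = 0.70, as listed in Table 3.
```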
For comparison, we use four strategies.
Baseline: automatic recognition algorithm with threshold R0=2.
Bootstrap: the threshold selection algorithm based on bootstrap method.
SNN: exon recognition based on synergetic neural network.
SNN + AFSA: two-stage exon recognition model based on synergetic neural network and artificial fish swarm algorithm.
The testing performance of Baseline is shown in Table 3.
Table 3: The testing performance of Baseline.
| Gene categories | TP | FN | SN | TN | FP | SP | Ac |
|---|---|---|---|---|---|---|---|
| Human | 17 | 18 | 0.485 | 24 | 2 | 0.923 | 0.71 |
| Mus musculus | 146 | 211 | 0.409 | 271 | 4 | 0.985 | 0.70 |
| Sewer rat | 17 | 28 | 0.378 | 31 | 4 | 0.886 | 0.63 |
| Mammalian | 369 | 458 | 0.446 | 621 | 5 | 0.992 | 0.72 |
The experiments showed that the recognition accuracy rate is low when the exon is short, because the 3-base periodicity is not strictly satisfied in short coding sequences.
In our experiments, we implemented the two-stage exon recognition model based on the synergetic neural network and the artificial fish swarm algorithm. The parameter settings of the artificial fish swarm algorithm are shown in Table 4.
Table 4: The parameter settings of artificial fish swarm algorithm.
| Algorithm | Fish number | Visual | Delta | Step | Number of iterations |
|---|---|---|---|---|---|
| AFSA | 100 | 2.85 | 9 | 1 | 60 |
In the experiment, we use the recognition accuracy rate as the score function.
The testing performance of SNN + AFSA is shown in Table 5.
Table 5: The test performance of SNN + AFSA.
| Gene categories | TP | FN | SN | TN | FP | SP | Ac |
|---|---|---|---|---|---|---|---|
| Human | 30 | 5 | 0.857 | 19 | 7 | 0.731 | 0.79 |
| Mus musculus | 295 | 62 | 0.826 | 220 | 55 | 0.80 | 0.81 |
| Sewer rat | 36 | 9 | 0.80 | 28 | 7 | 0.80 | 0.80 |
| Mammalian | 630 | 197 | 0.762 | 607 | 19 | 0.97 | 0.87 |
Table 5 shows that the two-stage exon recognition algorithm improves the accuracy rate Ac compared with the Baseline system. Experiments also indicate that the improved model has a more powerful global exploration ability and a reasonable convergence speed.
The accuracy rate Ac of the different methods is shown in Table 6.
Table 6: The test performance comparison among different methods.
| Gene categories | Baseline | Bootstrap | SNN | SNN + AFSA |
|---|---|---|---|---|
| Human | 0.71 | 0.76 | 0.78 | 0.79 |
| Mus musculus | 0.70 | 0.78 | 0.80 | 0.81 |
| Sewer rat | 0.63 | 0.75 | 0.77 | 0.80 |
| Mammalian | 0.72 | 0.84 | 0.85 | 0.87 |
Detailed comparisons of the results are given in Table 6. The experimental results show that the proposed SNN and SNN + AFSA models perform well for exon recognition. On all four data sets, the accuracy rate we obtained is higher than that of the Baseline and Bootstrap methods. By evolving the order parameter equation of the SNN to obtain the best threshold, we can further improve exon recognition performance.
At the same time, we can see that SNN + AFSA performs better than the SNN model alone. This is because the attention parameters are very important for the SNN, and an optimization algorithm is essential for better performance. The experimental results show that the improved AFSA has better global and local parameter searching capabilities and thus gives a better recognition result.
It is worth noting that the run time of the proposed model is lower than that of Baseline, with a good speedup ratio. Further study shows that the procedure exhibits data parallelism, so it can be parallelized effectively by running it concurrently. In future work, we will use parallel processing techniques for rapid exon recognition based on the SNN to further reduce the run time.
6. Conclusions
In this paper, we proposed a two-stage exon recognition model based on a synergetic neural network and the artificial fish swarm algorithm. Experiments show that the proposed model improves the accuracy of exon recognition.
We draw the following conclusions.
The exon recognition procedure can be viewed as a competition process among many order parameters. The proposed model based on the synergetic neural network and the N equal method can quickly determine the exon intervals.
The artificial fish swarm algorithm has both global and local search ability and can effectively choose the parameters of the proposed model.
Using the N equal algorithm to obtain exon intervals may still miss some intervals that lie in the middle; we will further improve the algorithm or use a different pattern recognition algorithm in the future.
It must be noted that, although this paper makes some effort to explore intelligent exon recognition algorithms, many problems, such as how to accurately determine the exon interval, still need further study because of the special nature of life science itself. We believe that, with social progress and the development of technology, gene identification technology will continue to improve, and we expect it to benefit mankind in the near future.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 61005052), the Fundamental Research Funds for the Central Universities (Grant no. 2010121068), the Natural Science Foundation of Fujian Province of China (Grant no. 2011J01369), and the Science and Technology Project of Quanzhou (Grant no. 2012Z91).
References
Burge C. B., Karlin S. Finding the genes in genomic DNA.
Wang Z., Chen Y., Li Y. A brief review of computational gene prediction methods.
Sharma S. D., Shakya K., Sharma S. N. Evaluation of DNA mapping schemes for exon detection. Proceedings of the International Conference on Computer, Communication and Electrical Technology (ICCCET '11), March 2011, pp. 71–74. doi:10.1109/ICCCET.2011.5762441.
Kotlar D., Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions.
Yin C., Yau S. S.-T. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.
Haken H.
Shao J., Gao J., Yang X. Z. Synergetic face recognition algorithm based on ICA. Proceedings of the International Conference on Neural Networks and Brain, vol. 1, October 2005, Beijing, China, pp. 249–253.
Jiang Z., Dougal R. A. Synergetic control of power converters for pulse current charging of advanced batteries from a fuel cell power source.
Ma X. L., Jiao L. C., Campilho A., Kamel M. Reconstruction of order parameters based on immunity clonal strategy for image classification.
Li X. L., Feng S. H., Qian J. X., Lu F. Parameter tuning method of robust PID controller based on artificial fish school algorithm.
Li X. L., Lu F., Tian G. H., Qian J. X. Applications of artificial fish school algorithm in combinatorial optimization problems.
Tian W., Geng Y., Liu J., Ai L. Optimal parameter algorithm for image segmentation. Proceedings of the 2nd International Conference on Future Information Technology and Management Engineering (FITME '09), December 2009, pp. 179–182. doi:10.1109/FITME.2009.50.
Yazdani D., Nabizadeh H., Kosari E. M., Toosi A. N. Color quantization using modified artificial fish swarm algorithm. Proceedings of the International Conference Artificial Intelligence, Lecture Notes in Artificial Intelligence, vol. 7106, 2011, pp. 382–391.
Zhang M., Shao C., Li F., Gan Y., Sun J. Evolving neural network classifiers and feature subset using artificial fish swarm. Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '06), June 2006, Luoyang, China, pp. 1598–1602. doi:10.1109/ICMA.2006.257414.
Chen D., Shao L., Zhang Z., Yu X. An image reconstruction algorithm based on artificial fish-swarm for electrical capacitance tomography system. Proceedings of the 6th International Forum on Strategic Technology (IFOST '11), August 2011, pp. 1190–1194. doi:10.1109/IFOST.2011.6021233.
Wang C.-J., Xia S.-X. Application of probabilistic causal-effect model based artificial fish-swarm algorithm for fault diagnosis in mine hoist.
Tian W., Tian Y., Ai L., Liu J. A new optimization algorithm for fuzzy set design. Proceedings of the International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, August 2009, pp. 431–435. doi:10.1109/IHMSC.2009.230.
Cheng Y., Jiang M., Yuan D. Novel clustering algorithms based on improved artificial fish swarm algorithm. Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '09), vol. 3, August 2009, pp. 141–145. doi:10.1109/FSKD.2009.534.
Berryman M. J., Allison A., Wilkinson C. R., Abbott D. Review of signal processing in genetics.
von Öhsen N., Sommer I., Zimmer R., Lengauer T. Arby: automatic protein structure prediction using profile-profile alignment and confidence measures.
Kotlar D., Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions.
Yin C., Yau S. S.-T. Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence.
Rushdi A., Tuqan J. Gene identification using the stroke Z sign-curve representation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 2, May 2006, pp. 1024–1027.
Sharma S. D., Shakya K., Sharma S. N. Evaluation of DNA mapping schemes for exon detection. Proceedings of the International Conference on Computer, Communication and Electrical Technology (ICCCET '11), March 2011, pp. 71–74. doi:10.1109/ICCCET.2011.5762441.
Kotlar D., Lavner Y. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions.
Bo L., Ding K. Graphical approach to analyzing DNA sequences.
Xu S. L., Shao J. F., Yan X. H., Shao S. SNR of DNA sequences mapped by general affine transformations of the indicator sequences.
Jiang Z., Dougal R. A. Synergetic control of power converters for pulse current charging of advanced batteries from a fuel cell power source.
Gao J., Dong H., Shao J., Zhao J. Parameters optimization of synergetic recognition approach.