The identification of protein coding regions (exons) plays a critical role in eukaryotic gene structure prediction. Many techniques have been introduced for discriminating between exons and introns in eukaryotic DNA sequences, such as discrete Fourier transform (DFT) based techniques, but these DFT-based methods rapidly lose their effectiveness on short DNA sequences. In this paper, a novel integrated algorithm based on autoregressive spectrum analysis and wavelet packets transform is presented to improve the efficiency and accuracy of coding region identification. Experimental results on the GENSCAN65, HMR195, and BG570 benchmark datasets show that the new algorithm clearly outperforms the conventional DFT-based approaches in the prediction accuracy of protein coding regions.
A deoxyribonucleic acid (DNA) sequence consists of genic and intergenic regions. Identification of protein coding regions is an elementary but very important problem in bioinformatics because the exonic regions code for amino acids. Knowing the primary structure of a protein makes it possible to study and analyze its secondary and tertiary structures as well as its function. Once the structure and function of a protein are clearly known, we can design drugs, cure diseases, improve crop productivity, and synthesize biofuel. In addition, coding regions represent the conserved part of genomes, and predicting conserved regions is also important for studying evolution and predicting phylogenetic trees [
All living organisms can be divided into two categories according to their fundamental cell structures: prokaryotes and eukaryotes. In prokaryotes, the coding genes, which are in charge of protein synthesis, are long and continuous (that is, open reading frames (ORFs)). In eukaryotes, however, genes consist of coding segments interrupted by long noncoding segments. The coding segments are termed exons and the noncoding segments introns (Figure
The DNA structure of eukaryotes and the splicing process. This figure shows that eukaryotic DNA consists of genic and intergenic regions, and the exon regions are interrupted by introns in eukaryotic DNA. Generally, the introns are much longer than the exons.
Genomic sequence processing has been an active area of research for the past twenty years and has increasingly attracted the attention of many researchers, and a number of methods have been proposed to predict the protein coding regions [
Although these model-dependent methods perform more precisely by using a priori information to train the classifiers, the coding regions may not be represented in the available datasets but still exist in the sequenced organism [
The power spectrum density (PSD) calculated by Voss-DFT for an exon and an intron from the gene F56F11.4. Owing to the symmetry of the spectrum, only the first half of the PSD is presented. (a) The PSD of an exon with a length of 330 base pairs (bp), at nucleotide positions 2528 to 2857, with the three-base periodicity (TBP) producing a peak at frequency
The aforementioned DFT-based spectrum analysis techniques may be roughly categorized as one kind of nonparametric (also named classical spectrum estimation) method, that is, the periodogram method. This method has the advantage that it can be implemented using the fast Fourier transform (FFT), and it has made clear progress in exon finding [
In this paper, a novel technique based on the Marple algorithm for AR PSD estimation and wavelet packets transform is presented to identify the protein coding regions in eukaryotic DNA sequences. This method first employs a mapping method to convert the DNA sequences into numerical sequences; the sequences are then passed through a bandpass filter to enhance the TBP characteristics. After that, taking the numerical sequence as the observed signal of an AR model, the efficient Marple algorithm is used to estimate the PSD of the AR model by calculating the parameters of the Yule-Walker equations. Then the wavelet packets transform (WPT) technique is employed to reduce the background noise of the PSD. Finally, similar to the SDFT [
The remainder of the paper is organized as follows. In Section
In this subsection, several widely used benchmark datasets will be described for the purpose of comparing the performance of different algorithms in identifying exonic regions. They are listed in the following paragraphs.
The gene sequence F56F11.4 (GenBank old number AF009962, new number FO081497,
In order to demonstrate the performance of our proposed algorithm, we also apply it to three benchmark datasets, HMR195 [
In this section, a novel integrated algorithm using Marple algorithm and WPT denoising technique is proposed for the identification of protein coding regions. We divide the integrated algorithm into five steps (Figure
Convert the DNA sequence into a numerical sequence using the Code13 mapping method.
Enhance the TBP characteristics of the numerical sequence using an FIR bandpass filter.
Extract the TBP components using the Marple algorithm with a proper model order. Similar to the SDFT [
Reduce the noise in the SNR curve using the wavelet packets transform.
Classify (or predict) the protein coding regions according to the optimal threshold.
Block diagram of the proposed algorithm.
The following subsections give the detailed presentation of the aforementioned steps, respectively.
Converting DNA sequences into digital signals is an important foundation because it opens the possibility of employing all kinds of powerful DSP techniques to analyze genomic data and reveal features of chromosomes [
There are a number of representations for nucleotide sequences [
In this paper, we use the K-Quaternary Code I (denoted as Code13) technique to convert the sequence into a numerical signal. Meanwhile, for comparison purposes, we select four conventional mapping methods from the aforementioned methods for the following spectral analysis, that is, in shorthand form, the Voss method, the EIIP method, the SP method, and the PN method. The detailed representation methods are described as follows.
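To make the idea of a numerical mapping concrete, the following minimal sketch shows two of the conventional mappings named above. It assumes the usual definitions: the Voss method produces four binary indicator sequences (one per nucleotide), and the EIIP method replaces each nucleotide with its electron-ion interaction potential value. The function names `voss_map` and `eiip_map` are hypothetical, and the Code13 mapping is deliberately left out because its code table is not given in this passage.

```python
# Illustrative sketch only; function names are hypothetical.
# Code13 is omitted because its code table is not given here.

# Commonly quoted EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def voss_map(seq):
    """Return four binary indicator sequences, one per nucleotide."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


def eiip_map(seq):
    """Return a single real-valued sequence of EIIP values."""
    return [EIIP[s] for s in seq.upper()]


indicators = voss_map("ATGCA")   # four 0/1 sequences of length 5
signal = eiip_map("ATGCA")       # one real-valued sequence of length 5
```

The Voss mapping yields four signals that are usually analyzed separately and then summed in the spectral domain, whereas single-sequence mappings such as EIIP need only one spectral estimate per window.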
In order to emphasize the three-base property in the protein coding regions, the numerical sequences are passed through an FIR bandpass filter with a Hamming window, whose order is 8 and central frequency is
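A filter of this kind can be sketched with a windowed-sinc design. The sketch below is an assumption-laden illustration, not the authors' filter: the passband center of 1/3 cycles per sample (motivated by the three-base periodicity) and the passband width of 0.1 are assumed values, and the helper names `bandpass_fir`, `gain`, and `fir_filter` are hypothetical.

```python
import math


def bandpass_fir(order=8, center=1 / 3, width=0.1):
    """Windowed-sinc bandpass FIR design with a Hamming window.

    center and width are in cycles per sample; both are assumed values.
    An order-8 filter has order + 1 = 9 taps.
    """
    f1, f2 = center - width / 2, center + width / 2
    taps = []
    for n in range(order + 1):
        m = n - order / 2  # offset from the middle tap
        if m == 0:
            ideal = 2 * (f2 - f1)
        else:
            ideal = (math.sin(2 * math.pi * f2 * m)
                     - math.sin(2 * math.pi * f1 * m)) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / order)
        taps.append(ideal * window)
    return taps


def gain(taps, freq):
    """Magnitude of the frequency response at freq (cycles per sample)."""
    re = sum(h * math.cos(2 * math.pi * freq * n) for n, h in enumerate(taps))
    im = sum(h * math.sin(2 * math.pi * freq * n) for n, h in enumerate(taps))
    return math.hypot(re, im)


def fir_filter(x, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * x[n - k]."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]


taps = bandpass_fir()
filtered = fir_filter([1, 0, 0] * 10, taps)  # a period-3 toy sequence
```

With only nine taps the passband is broad, but the response at 1/3 cycles per sample is still much larger than the response at zero frequency, which is the effect the enhancement step relies on.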
The available spectrum estimation techniques may be categorized as nonparametric (also named classical spectrum estimation) and parametric. The nonparametric methods include the periodogram, the Bartlett and Welch modified periodograms, and the Blackman-Tukey method. All these methods have the advantage that they can be implemented using the fast Fourier transform (FFT), but for short data lengths they suffer from limited frequency resolution and require windowing to reduce spectral leakage. Parametric methods, on the other hand, can provide high resolution, applicability to short data lengths, and avoidance of spectral leakage, scalloping loss, spectral smearing, and window biasing effects [
The idea of the AR spectrum analysis is that the digitized signal is modeled as an AR time series plus a white noise error term. The spectrum is then obtained from the AR model parameters and the variance of the error term. The model parameters are found by solving a set of linear equations obtained by minimizing the mean squared error term (the white noise power) over all the data.
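The idea can be illustrated with a short sketch that fits an AR model and evaluates its spectrum. Note the substitution: for brevity it solves the Yule-Walker equations with the Levinson-Durbin recursion (the autocorrelation method) rather than the Marple least-squares algorithm used in this paper. The sign convention x[n] + a[1]x[n-1] + ... + a[p]x[n-p] = e[n], the function names, and the synthetic test signal are all assumptions made for this example.

```python
import cmath
import math
import random


def autocorr(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of a zero-mean signal."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns (a, err) with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for p in range(1, order + 1):
        acc = r[p] + sum(a[k] * r[p - k] for k in range(1, p))
        refl = -acc / err                      # reflection coefficient
        prev = a[:]
        for k in range(1, p):
            a[k] = prev[k] + refl * prev[p - k]
        a[p] = refl
        err *= 1.0 - refl * refl               # updated white-noise power
    return a, err


def ar_psd(a, err, freq):
    """AR spectrum err / |A(f)|^2 at freq (cycles per sample)."""
    denom = sum(a[k] * cmath.exp(-2j * math.pi * freq * k)
                for k in range(len(a)))
    return err / abs(denom) ** 2


# Synthetic signal: a period-3 component buried in white noise.
rng = random.Random(0)
x = [math.cos(2 * math.pi * n / 3) + rng.gauss(0, 0.5) for n in range(300)]
mean = sum(x) / len(x)
x = [v - mean for v in x]

a, err = levinson_durbin(autocorr(x, 4), 4)
freqs = [k / 300 for k in range(151)]          # 0 to 0.5 cycles per sample
peak = max(freqs, key=lambda f: ar_psd(a, err, f))
```

In this toy case the spectral peak is expected to fall close to 1/3 cycles per sample, the frequency associated with the three-base periodicity.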
The process of the AR spectrum analysis using the Marple-WPT method is described as follows. First, an important consideration is the choice of the number of terms in the AR model, which is known as its order. If the order is too low, the power density estimate will be excessively smoothed, so some peaks may be obscured. If the order is too high, spurious peaks may be introduced. Hence, it is important to determine the appropriate model order for each set of data.
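One common way to make this trade-off explicit is an information criterion. The sketch below assumes the textbook form AIC(p) = N ln(err_p) + 2p, where err_p is the prediction error power of the order-p model, and obtains err_p from the Levinson-Durbin recursion rather than from the Marple algorithm. The function names and the synthetic AR(2) data are hypothetical and serve only to illustrate the selection step.

```python
import math
import random


def prediction_errors(x, max_order):
    """Levinson-Durbin recursion; error power for orders 0..max_order."""
    n = len(x)
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) / n
         for lag in range(max_order + 1)]
    a = [1.0]
    errs = [r[0]]
    for p in range(1, max_order + 1):
        acc = sum(a[k] * r[p - k] for k in range(p))   # a[0] = 1 covers r[p]
        refl = -acc / errs[-1]
        a = [1.0] + [a[k] + refl * a[p - k] for k in range(1, p)] + [refl]
        errs.append(errs[-1] * (1.0 - refl * refl))
    return errs


def aic_scores(errs, n):
    """AIC(p) = n * ln(err_p) + 2p for each candidate order p."""
    return [n * math.log(e) + 2 * p for p, e in enumerate(errs)]


# Synthetic AR(2) data: x[n] = 0.75 x[n-1] - 0.5 x[n-2] + e[n].
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(500):
    x.append(0.75 * x[-1] - 0.5 * x[-2] + rng.gauss(0, 1))
x = x[2:]
mean = sum(x) / len(x)
x = [v - mean for v in x]

errs = prediction_errors(x, 10)
scores = aic_scores(errs, len(x))
best_order = min(range(1, len(scores)), key=lambda p: scores[p])
```

The error power can only decrease as the order grows, so the penalty term 2p is what stops the criterion from always preferring the largest order.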
In an AR model of a time series the current value of the series,
This equation incorporates
The power spectrum density,
It can be found that the parameters in the right-hand side of (
The model parameters,
There are several methods to solve the YW equations, such as the autocorrelation method (also named the Levinson-Durbin algorithm) [
In the Marple method, the YW equations (
The
Generally, AIC
After the mapping and TBP enhancement steps, the next critical step of our algorithm is to extract the TBP components, which can be implemented similarly to the SDFT [
The window is slid along the sequence one position at a time, and the resulting SNR curve reveals the coding regions in the DNA. The protein coding regions are expected to have high SNR, while the noncoding regions have low SNR. We can therefore identify the exonic regions with a proper threshold; that is, a region whose SNR curve lies above the horizontal threshold line may be an exonic region, while a region that lies below it may be a noncoding region.
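The sliding-window procedure and the thresholding rule can be sketched as follows. Two assumptions are made here: the SNR of a window is taken as the power at 1/3 cycles per sample divided by the average spectral power (which, by Parseval's relation, equals the sum of squared samples), and the power at 1/3 is computed with a direct DFT sum instead of the AR spectrum used in this paper. The function names, window length, threshold, and toy sequence are hypothetical.

```python
import cmath
import math
import random


def tbp_snr(window):
    """Power at 1/3 cycles per sample divided by the average spectral power."""
    n = len(window)
    mean = sum(window) / n
    centered = [v - mean for v in window]
    x3 = sum(v * cmath.exp(-2j * math.pi * i / 3)
             for i, v in enumerate(centered))
    total = sum(v * v for v in centered)   # average of |X[k]|^2 over all k
    return abs(x3) ** 2 / total if total > 0 else 0.0


def snr_curve(signal, win):
    """Slide the window one position at a time; normalize by the maximum."""
    curve = [tbp_snr(signal[s:s + win])
             for s in range(len(signal) - win + 1)]
    top = max(curve) or 1.0
    return [v / top for v in curve]


def regions_above(curve, threshold):
    """Return (start, end) index pairs where the curve exceeds the threshold."""
    regions, start = [], None
    for i, v in enumerate(curve):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(curve) - 1))
    return regions


# Toy sequence: random background with a period-3 stretch at 200..399.
rng = random.Random(1)
signal = [rng.randint(0, 1) for _ in range(600)]
for n in range(200, 400):
    signal[n] = 1 if (n - 200) % 3 == 0 else 0

curve = snr_curve(signal, 90)
regions = regions_above(curve, 0.5)
```

Each curve index is the start position of its window, so a detected region has to be shifted by about half a window length to align it with nucleotide positions.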
There are several assistant strategies for the identification algorithm.
The values of the SNR curve are normalized by dividing by their maximum value, which facilitates the following comparisons.
Different mapping methods and the sliding window technique will make the obtained SNRs have different lengths, so we will use the mirror-symmetric boundary-extension method [
It should be noted that before the SNR curve and the threshold are used to determine the exons, the background noise in the SNR curve should be reduced. The WPT noise reduction technique and the optimal threshold selection method are described in detail in the following two subsections.
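The normalization and boundary-extension strategies listed above can be sketched in a few lines. The exact form of the extension used in this paper is not reproduced here; the sketch assumes a whole-sample reflection in which the edge sample itself is not repeated, and the names `mirror_extend` and `normalize` are hypothetical.

```python
def mirror_extend(x, pad):
    """Reflect pad samples at each end without repeating the edge sample.

    Requires pad < len(x). This reflection variant is an assumption.
    """
    left = x[pad:0:-1]
    right = x[-2:-pad - 2:-1]
    return left + x + right


def normalize(curve):
    """Divide every value by the maximum so the curve peaks at 1."""
    top = max(curve)
    return [v / top for v in curve]


x = [1, 2, 3, 4, 5]
win = 5                                  # odd window length
extended = mirror_extend(x, win // 2)    # [3, 2, 1, 2, 3, 4, 5, 4, 3]
n_windows = len(extended) - win + 1      # one window per original sample
scaled = normalize([2.0, 4.0, 1.0])
```

Padding by half the window length on each side means a sliding window of odd length yields exactly one output per original position, so curves obtained with different settings can be compared point by point.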
Wavelet packets transform (WPT) is a generalization of wavelet decomposition that offers a richer signal analysis. In the decomposition of a signal by using discrete wavelet transform (DWT), only the lower frequency band is decomposed, giving a right recursive binary tree structure, where its right lobe represents the lower frequency band. Its left lobe represents the higher frequency band. In the corresponding decomposition by using WPT, the lower, as well as the higher, frequency bands are decomposed giving a balanced binary tree structure [
Wavelet packets decomposition tree at level 3.
Denoising is an important application of WPT and its main idea is to reconstruct the useful frequency contents after the decomposition. The WPT denoising procedure of MATLAB toolbox (MATLAB R2011a Wavelet Toolbox) involves four steps.
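As a rough illustration of decomposition, thresholding, and reconstruction, the sketch below implements a full wavelet packet tree by hand. It is not the MATLAB toolbox procedure: the Haar wavelet, the soft-thresholding rule, the fixed threshold value, and all function names are assumptions chosen to keep the example short, and the input length must be a multiple of 2 raised to the decomposition level.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_split(x):
    """One Haar analysis step: (low-pass half, high-pass half)."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def haar_merge(approx, detail):
    """Inverse of haar_split."""
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / SQRT2, (a - d) / SQRT2]
    return out


def wp_decompose(x, level):
    """Split every node at every level, giving 2**level leaf nodes."""
    nodes = [list(x)]
    for _ in range(level):
        nodes = [half for node in nodes for half in haar_split(node)]
    return nodes


def wp_reconstruct(nodes):
    """Merge sibling leaves pairwise until one signal remains."""
    while len(nodes) > 1:
        nodes = [haar_merge(nodes[i], nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    return nodes[0]


def soft(v, t):
    """Soft thresholding: shrink the magnitude by t, clipping at zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def wp_denoise(x, level=3, threshold=0.5):
    """Threshold all leaves except the pure low-pass leaf, then rebuild."""
    nodes = wp_decompose(x, level)
    kept = [nodes[0]] + [[soft(v, threshold) for v in node]
                         for node in nodes[1:]]
    return wp_reconstruct(kept)


x = [float(i % 5) for i in range(16)]
leaves = wp_decompose(x, 3)              # 8 leaves of 2 coefficients each
rebuilt = wp_reconstruct(leaves)
smooth = wp_denoise(x, level=3, threshold=0.5)
```

Unlike the DWT, every node is split again at each level, which is why a level-3 tree ends with eight leaves instead of four.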
Threshold selection plays an important role in discriminating between coding and noncoding regions based on the SNR curve. The proper threshold can help to optimize the accuracy of the identification. Xu et al. [
In these evaluations, results of different methods are compared at the nucleotide level. At this level, we evaluate the accuracy of a prediction on a test sequence by comparing the predicted coding value (coding or noncoding) with the true coding value for each nucleotide along the test sequence [
Measures of prediction accuracy at the nucleotide level.
That is,
Neither
Here we introduce several other global measures as the previous researchers [
According to Burset and Guigó [
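The nucleotide-level measures can be sketched as follows. The formulas are the ones commonly attributed to Burset and Guigó and are assumed to match those intended here: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and the approximate correlation AC = 2(ACP - 0.5), where ACP averages four conditional probabilities. Skipping terms with a zero denominator is a convention chosen for this sketch, and the function names are hypothetical.

```python
def confusion(pred, truth):
    """Count TP, FP, TN, FN over per-nucleotide 0/1 labels (1 = coding)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def accuracy_measures(pred, truth):
    """Return (Sn, Sp, AC) under the assumed definitions."""
    tp, fp, tn, fn = confusion(pred, truth)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    pairs = [(tp, tp + fn), (tp, tp + fp), (tn, tn + fp), (tn, tn + fn)]
    terms = [num / den for num, den in pairs if den]   # skip undefined terms
    acp = sum(terms) / len(terms) if terms else 0.5
    return sn, sp, (acp - 0.5) * 2


truth = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
pred = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sn, sp, ac = accuracy_measures(pred, truth)   # 0.75, 0.75, 7/12
```

In the toy example there are three true positives, one false positive, one false negative, and five true negatives, which gives Sn = Sp = 0.75 and AC = 7/12.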
In this section, the results of the proposed algorithm are compared with those of existing techniques, such as sliding Fourier transform spectrum (SDFT) (referred to as Voss-DFT) [
The outline of this section is as follows. Firstly, the denoising performance of WPT technique is given by comparing the SNR of a short benchmark data. Then we compare our Code13 mapping method with four widely used mapping methods selected from the aforementioned mapping approaches, that is, Voss, EIIP, SP, and PN mapping methods. It should be noted that in the comparison only the mapping method is different; that means in Figure
Firstly, we use the DNA sequence F56F11.4 to test the noise reduction performance of WPT. The SNR curve of sequence F56F11.4 calculated by our algorithm is shown in Figure
Denoising performance results of WPT for DNA sequence F56F11.4. (a) The output SNR of sequence F56F11.4 based on the proposed algorithm before WPT is utilized. (b) The output SNR based on the proposed algorithm using WPT. Both cases employ the Code13 mapping method, and all other procedures are identical except for whether denoising is applied.
Secondly, the Code13 and the other four aforementioned mapping methods (Voss, EIIP, SP, and PN) are, respectively, utilized to map the F56F11.4 sequence into five different numerical sequences. Then according to the procedure of our proposed algorithm (Figure
The performance measures of five mapping methods for gene sequence F56F11.4 are represented in Table
Performance measures of five mapping methods for sequence F56F11.4 based on Marple-WPT technique.
Mapping method  Sn  Sp  AC  Optimal threshold
Voss  0.7810  0.2270  0.3087  0.6338
EIIP  0.4661  0.1402  0.0656  0.3862
SP  0.5079  0.1095  −0.0235  0.6656
PN  0.6411  0.1539  0.1265  0.6690
Code13        0.5062
Exonic identification results of gene sequence F56F11.4 using five mapping methods and the proposed algorithm. (a) Voss method, (b) EIIP method, (c) SP method, (d) PN method, and (e) Code13 method. The red bold line segments represent the true exons that must be identified, the black thin line segments represent the predicted candidate exons, and the vertical heights of those line segments represent their optimal thresholds. The square shadow represents the missing true exon; the rectangle shadow represents the false predicted exon.
Finally, our proposed algorithm is applied to the three widely used benchmark datasets: GENSCAN65, HMR195, and BG570. For comparison purposes, several conventional exonic identification techniques are also applied to these datasets, and performance measures such as AC, ROC curves, and AUC are used in the comparison.
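For readers unfamiliar with the ROC-based measures, the following sketch shows one way to obtain an ROC curve and its AUC from per-nucleotide scores (for example, SNR values) and true coding labels. It assumes the standard construction, in which the decision threshold is swept over the observed scores and the area is computed with the trapezoidal rule; the function names and toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Sweep the threshold over distinct scores; return (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]          # 1 = coding, 0 = noncoding
points = roc_points(scores, labels)
area = auc(points)                    # 8/9 for this toy example
```

An AUC of 0.5 corresponds to a score that carries no information about coding status, and 1.0 to a score that separates the two classes perfectly. This quadratic-time sweep is meant for clarity rather than speed.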
Taking HMR195 as an example, this benchmark dataset contains 195 sequences, each with exactly one complete gene, either single-exon or multi-exon (43 single-exon genes and 152 multi-exon genes) [
The distribution of exonic region lengths in HMR195. The horizontal axis represents the exonic region length, and the vertical axis represents the number of exonic regions.
Five identification techniques are applied to the aforementioned three datasets, that is, Voss-DFT [
Performance measures of five mapping methods for three benchmark datasets.
Datasets  Measure  Voss-DFT  EIIP-DFT  SP-DFT  PN-DFT  Code13-Marple
GENSCAN65  AC  0.1533  0.1162  0.1850  0.2010
GENSCAN65  AUC  0.6385  0.5754  0.6372  0.6283
HMR195  AC  0.1045  0.0572  0.1300  0.1434
HMR195  AUC  0.5626  0.5113  0.5962  0.5971
BG570  AC  0.1093  0.0599  0.1263  0.1356
BG570  AUC  0.5329  0.4867  0.5470  0.5391
ROC curves of different techniques for three benchmark datasets. (a) The ROC curves of five methods (Voss-DFT, EIIP-DFT, SP-DFT, PN-DFT, and Code13-Marple) for the GENSCAN65 dataset. (b) The ROC curves of five methods for the HMR195 dataset. (c) The ROC curves of five methods for the BG570 dataset.
In this paper, we propose a new technique based on the Marple algorithm and wavelet packets transform with the Code13 numerical mapping approach to improve the accuracy of identifying protein coding regions in eukaryotic DNA sequences. Tests on several benchmark datasets show that the proposed algorithm outperforms some well-known DFT-based methods. Several factors contribute to the improved identification accuracy: first, the FIR filter helps to enhance the TBP characteristics of the numerical sequences before the PSD calculation; second, the Marple algorithm can calculate the PSD more efficiently and accurately than the conventional methods because it yields statistically stable spectral estimates of high resolution; third, the WPT reduces the noise in the SNR curves, which clearly benefits the subsequent identification of exonic regions; finally, the assistant strategies such as threshold selection, normalization of the SNR curves, and the mirror-symmetric boundary-extension method also help to improve the final accuracy of the whole algorithm.
At the same time, it should be noted that our proposed algorithm still has some shortcomings, such as the need to select the order of the AR model when using the Marple algorithm and the fact that the Marple algorithm is somewhat more time-consuming.
There are also two important and challenging problems that deserve further study. First, how can we obtain the precise exon location information [
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors thank Professor Fengzhu Sun from the University of Southern California deeply for the interest in the project and useful discussion about the coding region discriminating criteria. The authors also thank all the anonymous reviewers for their valuable suggestions and support. This work is supported by the Natural Science Foundation of China Grants 11371227 and 10921101 and Graduate Independent Innovation Foundation of Shandong University (GIIFSDU) (yzc12098).