Prediction of DNase I Hypersensitive Sites by Using Pseudo Nucleotide Compositions

DNase I hypersensitive sites (DHS) are associated with a wide variety of regulatory DNA elements. Knowledge about the locations of DHS is helpful for deciphering the function of noncoding genomic regions. With the accelerating pace of genome sequencing in the postgenomic age, it is highly desirable to develop cost-effective computational methods to identify DHS. In the present work, a support vector machine based model is proposed to identify DHS by using the pseudo dinucleotide composition. In the jackknife test, the proposed model obtained an accuracy of 83%, which is competitive with that of the existing method. This result suggests that the proposed model may become a useful tool for DHS identification.


Introduction
DNase I hypersensitive sites (DHS) are regions of chromatin that are sensitive to cleavage by the DNase I enzyme. Since the discovery of DHS in the 1980s [1], they have been used as markers of regulatory DNA regions. These regions are generally nucleosome-free and are associated with a wide variety of genomic regulatory elements, such as promoters, enhancers, insulators, silencers, and suppressors [2][3][4]. Therefore, mapping of DHS has become an effective approach for discovering functional DNA elements in noncoding sequences.
Although the traditional Southern blotting technique is a gold-standard approach for identifying DHS, obtaining information from the Southern blot approach is a tricky, time-consuming, and inaccurate task [5]. Recently, the DNase-seq technique (a combination of DNase I digestion and high-throughput sequencing) has been proposed [6], and this technique allows for an unprecedented increase in resolution. However, methodologies for the analysis of DNase-seq data are relatively immature [7]. Therefore, computational models will be an important complement to experimental techniques for identifying DHS.
Based on nucleotide compositions, a support vector machine model for identifying DHS in the K562 cell line was proposed [8]. This method yielded quite encouraging results and played a role in stimulating the development of this area. However, further work is needed for the following reasons. First, the sequences in its dataset share high sequence similarity. Second, the DNA structural properties were ignored. To address these problems, we propose a new model for identifying DHS, which is trained on a high-quality benchmark dataset. In the new model, each DNA sample is encoded by using the pseudo dinucleotide composition, into which the DNA structural properties are incorporated.

Benchmark Dataset.
The 280 experimentally confirmed DHS and 731 non-DHS sequences were obtained from http://noble.gs.washington.edu/proj/hs/; these sequences have been used to train DHS prediction models [8]. As elucidated in [9], a predictor, if trained and tested on a dataset containing redundant samples with high similarity, might yield misleading results with an overestimated accuracy. To get rid of the redundancy and avoid bias, the CD-HIT software [10] was used to remove those DNA fragments that have ≥60% pairwise sequence identity to each other. Finally, we obtained 247 positive and 710 negative samples for the benchmark dataset $S$, which can be formulated as
$$S = S^{+} \cup S^{-} \tag{1}$$
where the subset $S^{+}$ contains 247 DHS sequences and $S^{-}$ contains 710 non-DHS sequences, while $\cup$ represents the "union" in set theory. The detailed sequences in the benchmark dataset $S$ are given in Supplementary Information S1, available online from The Scientific World Journal at http://dx.doi.org/10.1155/2014/740506.

DNA Sequence Representation.
In order to integrate the sequence-order effects and DNA physicochemical properties, the pseudo nucleotide composition was proposed in 2011 [11]. Since then, the concept of pseudo nucleotide composition has penetrated into many branches of computational genomics, such as predicting recombination spots [12], predicting promoters [13], predicting nucleosome positioning sequences [14], and identifying splice sites [15]. Because of its wide and increasing usage, a flexible web-server called "pseudo $k$-tuple nucleotide composition (PseKNC)" was recently developed [16], which can be used to generate various kinds of pseudo $k$-tuple nucleotide compositions. Encouraged by the success of introducing pseudo nucleotide composition to computational genomics, in the current study the pseudo dinucleotide composition was used to represent DNA sequences in the benchmark dataset, which can be expressed as [12,16]
$$\mathbf{D} = [d_1\ d_2\ \cdots\ d_{16}\ d_{16+1}\ \cdots\ d_{16+\lambda}]^{T} \tag{2}$$
where
$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16 \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 16 < k \le 16 + \lambda \end{cases} \tag{3}$$
In (3), $f_k$ ($k = 1, 2, \ldots, 16$) is the normalized occurrence frequency of the $k$th dinucleotide in the DNA sequence, $\lambda$ is the number of the total counted ranks (or tiers) of the correlations along a DNA sequence, and $w$ is the weight factor. The concrete values for $\lambda$ and $w$ will be further discussed in Section 3.1, while the correlation factor $\theta_j$ represents the $j$-tier structural correlation factor between all the $j$th most contiguous dinucleotides, that is, between the dinucleotide $R_i R_{i+1}$ at position $i$ and the dinucleotide $R_{i+j} R_{i+j+1}$ at position $i + j$.
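The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the PseKNC web-server's implementation: the function name, the `lam` and `w` arguments (number of tiers and weight factor), and the single toy property table are all hypothetical. The correlation factor is assumed here to follow the form commonly used in the pseudo dinucleotide literature [12,16], namely the mean over positions of $\frac{1}{\mu}\sum_{u=1}^{\mu}[P_u(R_iR_{i+1}) - P_u(R_{i+j}R_{i+j+1})]^2$, where $P_u$ is the $u$th (normalized) structural property and $\mu$ is the number of properties. Sequences are assumed to contain only A, C, G, and T.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT


def pseudo_dinucleotide_composition(seq, properties, lam, w):
    """Return the (16 + lam)-dimensional pseudo dinucleotide vector of `seq`.

    `properties` maps each dinucleotide to a list of normalized structural
    values (e.g. twist, tilt, roll, shift, slide, rise).
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping dinucleotides
    if lam >= len(pairs):
        raise ValueError("lam must be smaller than the number of dinucleotides")

    # Normalized occurrence frequencies of the 16 dinucleotides.
    freqs = [pairs.count(d) / len(pairs) for d in DINUCLEOTIDES]

    # Correlation factors for tiers 1..lam (assumed mean squared difference).
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(len(pairs) - j):
            a, b = properties[pairs[i]], properties[pairs[i + j]]
            total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        thetas.append(total / (len(pairs) - j))

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Hypothetical placeholder property: GC count of each dinucleotide.
TOY_PROPERTIES = {d: [float(d.count("G") + d.count("C"))] for d in DINUCLEOTIDES}
vector = pseudo_dinucleotide_composition("ACGTACGTAC", TOY_PROPERTIES, lam=2, w=0.2)
print(len(vector))  # 18 components: 16 frequencies plus 2 correlation terms
```

Because every component shares the same denominator, the components of the returned vector sum to one; in practice the toy table would be replaced by the six normalized structural parameters.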

Support Vector Machine (SVM).
SVM is a supervised learning algorithm and has been widely used in computational genomics and proteomics [17][18][19][20][21][22][23]. The basic principle of SVM is to transform the input vector into a high-dimensional space and then seek a separating hyperplane with the maximal margin in this space by using the decision function
$$f(\vec{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N} y_i \alpha_i K(\vec{x}, \vec{x}_i) + b\right) \tag{4}$$
where $N$ is the number of training vectors, $\alpha_i$ are the Lagrange multipliers, $b$ is the offset, $\vec{x}_i$ is the $i$th training vector, and $y_i$ represents the type of the $i$th training vector. $K(\vec{x}, \vec{x}_i)$ is a kernel function which defines an inner product in a high-dimensional feature space, and sgn is the sign function. Due to its effectiveness and speed in nonlinear classification, the radial basis kernel function (RBF) $K(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \|\vec{x}_i - \vec{x}_j\|^2)$ was used in the current study. The LIBSVM 2.84 package [24], which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used to perform the SVM. The regularization parameter $C$ and the kernel width parameter $\gamma$ were optimized using a grid search. The search spaces for $C$ and $\gamma$ are $[2^{15}, 2^{-5}]$ and $[2^{-5}, 2^{-15}]$ with steps of $2^{-1}$ and $2$, respectively.
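To make the decision rule and the search grid concrete, the following Python sketch evaluates an RBF-kernel decision function for given multipliers and enumerates candidate parameter values. It does not train an SVM (that step is handled by LIBSVM in the paper); the function names are hypothetical, and the grids assume that both parameters are stepped through consecutive powers of two.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision(x, training_vectors, labels, alphas, b, gamma):
    """Sign of the kernel expansion plus offset, returned as +1 or -1."""
    score = sum(
        y * a * rbf_kernel(x, v, gamma)
        for v, y, a in zip(training_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1


# Candidate grids (assumption: consecutive powers of two in both directions).
C_GRID = [2.0 ** e for e in range(15, -6, -1)]       # 2^15 down to 2^-5
GAMMA_GRID = [2.0 ** e for e in range(-5, -16, -1)]  # 2^-5 down to 2^-15

print(len(C_GRID) * len(GAMMA_GRID))  # number of (C, gamma) pairs to evaluate
```

Each pair from the two grids would be scored by cross-validation and the best-scoring pair retained.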

Performance Evaluation.
Three cross-validation methods, that is, independent dataset test, subsampling (or $K$-fold cross-validation) test, and jackknife test, are often used to evaluate the anticipated success rate of a predictor. Among the three methods, the jackknife test is deemed the least arbitrary and most objective one [9,25] and, hence, has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors [26][27][28][29][30]. Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample, and all the rule-parameters are calculated without including the one being identified.
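The jackknife procedure described above amounts to leave-one-out evaluation, which can be sketched as follows. The `fit` and `predict` callables are placeholders for any classifier; the nearest-centroid classifier below is a hypothetical stand-in used only to keep the sketch self-contained, whereas the model in this study is an SVM.

```python
def jackknife_predictions(samples, labels, fit, predict):
    """Predict each sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


# Hypothetical stand-in classifier: assign the label of the nearest class centroid.
def fit_centroids(xs, ys):
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]  # made-up feature vectors
toy_y = [1, 1, 1, -1, -1, -1]                       # made-up class labels
print(jackknife_predictions(toy_x, toy_y, fit_centroids, predict_centroid))
```

The model is refitted once per sample, so for the 957 sequences of the benchmark dataset the classifier would be trained 957 times.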
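A set of parameters, namely, sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and accuracy (Acc), are used to evaluate the performance of the proposed model, and they are defined as follows:
$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{5}$$
$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{6}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}$$
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
where TP, TN, FP, and FN represent the number of correctly recognized DHS, the number of correctly recognized non-DHS, the number of non-DHS recognized as DHS, and the number of DHS recognized as non-DHS, respectively.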

Results and Discussions
3.1. Parameter Optimization.
By analyzing the dinucleotide composition of DHS and non-DHS sequences, we found that the frequencies of CC, CG, GC, and GG are higher in DHS sequences, while the frequencies of the remaining dinucleotides are higher in non-DHS sequences (Figure 1). This difference in dinucleotide usage is the reason why the pseudo dinucleotide composition was chosen for the current case. A series of studies [12,14,31,32] have demonstrated that DNA local structural properties, that is, angular parameters (twist, tilt, and roll) and translational parameters (shift, slide, and rise), are effective in identifying DNA attributes. Therefore, in the present work, the six structural parameters of dinucleotides were used to calculate the pseudo dinucleotide composition by using the PseKNC web-server, which is available at http://lin.uestc.edu.cn/pseknc/default.aspx.
As we can see from (2) and (3), the present model depends on the two parameters $w$ and $\lambda$. Here, $w$ is the weight factor, usually within the range from 0 to 1, and $\lambda$ reflects the global sequence-order effect. Generally speaking, the greater $\lambda$ is, the more global sequence-order information the model contains. However, if $\lambda$ is too large, it would reduce the cluster-tolerant capacity and thus lower the cross-validation accuracy due to overfitting or the "high dimension disaster" problem [33]. Therefore, we searched for the optimal values of the two parameters in the ranges $w \in [0, 1]$ and $\lambda \in [1, 10]$ with steps of 0.1 and 1, respectively.
In order to reduce the computational time, the 5-fold cross-validation approach was used to optimize the two parameters together with the parameters $C$ and $\gamma$ of the SVM. We found that when $w = 0.2$ and $\lambda = 6$, with $C = 512$ and $\gamma = 0.0078125$, a peak was observed for the Acc. Accordingly, these values of $w$ and $\lambda$ were used in the following analysis.
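The size of the search described above can be made explicit with a short Python sketch that enumerates the candidate values of the weight factor and the number of tiers and splits sample indices into five folds. The names are hypothetical, and the interleaved fold assignment is a simplifying assumption; the paper does not state how the folds were formed.

```python
from itertools import product

# Build the weight-factor grid from integers to avoid floating-point drift.
W_VALUES = [k / 10 for k in range(0, 11)]  # 0.0, 0.1, ..., 1.0
LAMBDA_VALUES = list(range(1, 11))         # 1, 2, ..., 10
PARAMETER_GRID = list(product(W_VALUES, LAMBDA_VALUES))


def feature_dimension(lam):
    """16 dinucleotide components plus one correlation component per tier."""
    return 16 + lam


def five_fold_indices(n_samples):
    """Split indices 0..n_samples-1 into 5 interleaved folds (assumed scheme)."""
    return [list(range(k, n_samples, 5)) for k in range(5)]


print(len(PARAMETER_GRID))   # candidate (weight factor, tiers) pairs
print(feature_dimension(6))  # vector length at the reported optimum of 6 tiers
```

Each candidate pair changes the feature vectors themselves, so the SVM parameters have to be re-tuned for every pair, which is why the cheaper 5-fold cross-validation is a reasonable choice for this stage.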

Prediction Quality.
The prediction quality of the present model in identifying DHS in the benchmark dataset S, measured by the four metrics defined in (5)-(8) via the rigorous jackknife test, is listed in Table 1, where, to facilitate comparison, the corresponding results obtained by the previous predictor [8] on the same benchmark dataset are also given. As we can see from Table 1, the current method outperformed the existing model in all four metrics, indicating that our proposed method may become a useful tool for identifying DHS sequences.

Conclusions
Since DHS are associated with a wide variety of functional elements, knowledge about the locations of DHS is helpful for deciphering genomes. However, strong DNA sequence conservation is not observed among DHS sequences, suggesting that it is difficult to computationally identify DHS from the primary DNA sequence alone. A series of recent studies have demonstrated that the information encoded by DNA structural properties contributes to the identification of regulatory elements in genomes [12,14,31,32]. Hence, in the present study, we proposed an SVM-based model for identifying DHS by using the pseudo dinucleotide composition. In this model, we integrate dinucleotide composition with DNA structural properties. The predictive results of our model are better than those of the existing method. Therefore, it is anticipated that the proposed method may become a useful tool for identifying DHS sequences or, at the very least, that it can play a complementary role to the existing methods in this area.