Recombination Hotspot/Coldspot Identification Combining Three Different Pseudocomponents via an Ensemble Learning Approach

Recombination presents a nonuniform distribution across the genome. Genomic regions that present relatively higher frequencies of recombination are called hotspots while those with relatively lower frequencies of recombination are recombination coldspots. Therefore, the identification of hotspots/coldspots could provide useful information for the study of the mechanism of recombination. In this study, a new computational predictor called SVM-EL was proposed to identify hotspots/coldspots across the yeast genome. It combined Support Vector Machines (SVMs) and Ensemble Learning (EL) based on three features including basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). These features are able to incorporate the nucleic acid composition and their order information into the predictor. The proposed SVM-EL achieves an accuracy of 82.89% on a widely used benchmark dataset, which outperforms some related methods.


Introduction
Meiotic recombination is the exchange of alleles between homologous chromosomes during meiosis [1]. It provides material for natural selection by producing diverse gametes, and it may also contribute to the evolution of the genome via gene conversion or mutagenesis [2][3][4].
Although the exact locations where recombination happens in the genome and the mechanism of recombination are still unclear, it is well established that recombination plays an important role in promoting genome evolution. Several chromosome-wide studies [5][6][7] have found that recombination presents a nonuniform distribution across the genome. Genomic regions with relatively higher frequencies of recombination are called hotspots, while those with relatively lower frequencies are called coldspots [8,9]. As the number of sequenced genomes grows explosively, more reliable methods for identifying recombination spots are urgently needed.
The prediction of recombination hotspots or coldspots is still a challenging task, although much information can be acquired from the experiments. Recently, several computational models have been presented to identify the recombination hotspots/coldspots. For example, Liu et al. [10], based on sequence Kmer frequencies, proposed a model which combines the increment of diversity with quadratic discriminant analysis (IDQD). Later, this method was improved by adding gaps into the kmers [11]. Chen et al. presented a predictor called iRSpot-PseDNC trained with pseudo dinucleotide composition features [12].
The aforementioned methods extract features from DNA sequences from different aspects. For example, the model based on oligonucleotide frequencies considers the nucleic acid composition information, whereas iRSpot-PseDNC incorporates both the local nucleic acid composition information and the global sequence-order information of the DNA sequences. It is therefore reasonable to combine these complementary predictors to further improve the performance of recombination hotspot/coldspot identification. In this regard, three basic predictors trained with basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14,15], and pseudo dinucleotide composition (PseDNC) [16], respectively, were combined within an ensemble learning framework, and a novel predictor called SVM-EL was proposed. All these features can be easily generated by a recently proposed tool called Pse-in-One [17], which generates various features based only on DNA, RNA, or protein sequence information.

Benchmark Dataset.
The benchmark dataset S was obtained from Liu et al. [10]:
$$S = S^{+} \cup S^{-}, \tag{1}$$
where the subset $S^{+}$ contains 490 recombination hotspots, the subset $S^{-}$ contains 591 recombination coldspots, and the symbol $\cup$ represents the "union" in set theory.

Feature Vectors Generated by Pse-in-One
SVM-EL is developed by combining the outcomes of three individual predictors trained on different features: basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14,15], and pseudo dinucleotide composition (PseDNC). These basic features can be generated with Pse-in-One [17], which provides two ways to generate feature vectors: the web server (http://bioinformatics.hitsz.edu.cn/Pse-in-One/) and the stand-alone tool (http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/). Suppose a DNA sequence D is
$$\mathbf{D} = R_1 R_2 R_3 \cdots R_L, \tag{2}$$
where $L$ represents the DNA sequence length and $R_i$ ($i = 1, 2, \ldots, L$) is the nucleic acid at position $i$. The three basic features used in the current study can then be described as follows.

Kmer.
Kmer [13] is an approach representing DNA sequences by the occurrence frequencies of kmers. Kmer contains the local sequence-order information, and it can be generated with the help of Pse-in-One by the following steps.
For the web server approach, first choose DNA sequences (PseDAC-General), then select Kmer in the Mode tab, and set the value of k. Second, input or upload the DNA sequence file in FASTA format and click the Submit button; the results are then displayed and can be downloaded as a text file (Figure 1).
For the stand-alone approach, Kmer features can be generated from the command line, where −f svm sets the format of the output file to the LIBSVM training data format, −l +1 indicates that the input file contains positive samples only, k equals 3, and the sequence type is DNA.
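To make the Kmer representation concrete, the following is a minimal sketch of how k-mer occurrence frequencies could be computed and written in the LIBSVM data format. It is not the Pse-in-One implementation: the function names `kmer_frequencies` and `to_libsvm_line` are hypothetical, and the normalization by the total number of counted windows is an assumption.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return the 4**k k-mer occurrence frequencies of a DNA sequence.

    Frequencies are ordered lexicographically over the alphabet ACGT
    (an assumed ordering, chosen here for illustration only).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' (1-based indices)."""
    body = " ".join(f"{i}:{v:.6g}" for i, v in enumerate(features, start=1))
    return f"{label} {body}"


# Toy usage: a positive (+1) sample described by its 16 dinucleotide frequencies.
vec = kmer_frequencies("ACGTACGT", k=2)
print(to_libsvm_line("+1", vec))
```

With k = 3, as in the setting described above, each sequence would be mapped to a vector of 4^3 = 64 frequencies.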
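
Dinucleotide-Based Auto-Cross Covariance (DACC).
DACC combines the dinucleotide-based auto covariance (DAC) and the dinucleotide-based cross covariance (DCC). Given a DNA sequence D represented as (2), the DAC feature can be calculated as [17]
$$\mathrm{DAC}(u, \mathrm{lag}) = \sum_{i=1}^{L-\mathrm{lag}-1} \frac{\left(P_u(R_i R_{i+1}) - \bar{P}_u\right)\left(P_u(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_u\right)}{L - \mathrm{lag} - 1},$$
where $u$ is the dinucleotide property index; $L$ is the length of the DNA sequence; lag represents the distance between two dinucleotides; $P_u(R_i R_{i+1})$ represents the value of dinucleotide $R_i R_{i+1}$ at position $i$ for the dinucleotide property index $u$; and $\bar{P}_u$ represents the average value of $P_u(R_i R_{i+1})$ for a DNA sequence:
$$\bar{P}_u = \frac{1}{L-1} \sum_{j=1}^{L-1} P_u(R_j R_{j+1}).$$
Given a DNA sequence D represented as (2), the DCC feature can be calculated as [17]
$$\mathrm{DCC}(u_1, u_2, \mathrm{lag}) = \sum_{i=1}^{L-\mathrm{lag}-1} \frac{\left(P_{u_1}(R_i R_{i+1}) - \bar{P}_{u_1}\right)\left(P_{u_2}(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_{u_2}\right)}{L - \mathrm{lag} - 1},$$
where $u_1$ and $u_2$ are two different dinucleotide property indices; $L$ is the DNA sequence length; lag is the distance between two dinucleotides; $P_{u_1}(R_i R_{i+1})$ ($P_{u_2}(R_i R_{i+1})$) represents the value of dinucleotide $R_i R_{i+1}$ at position $i$ for the dinucleotide property index $u_1$ ($u_2$); and $\bar{P}_{u_1}$ ($\bar{P}_{u_2}$) represents the average value of $P_{u_1}(R_i R_{i+1})$ ($P_{u_2}(R_i R_{i+1})$) for a DNA sequence. The DACC features contain global sequence-order information, and they can be generated via Pse-in-One [17], which offers two generation approaches. The generation steps of the DACC feature are as follows.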
For the web server approach, first choose the DNA sequences (PseDAC-General) option, then select DACC in the Mode tab, and set the value of lag. Second, upload a user-defined physicochemical index file called user property; the values of the fifteen dinucleotide physicochemical properties are shown in Table 1. Finally, input or upload the DNA sequence file in FASTA format and click the Submit button; the results are then displayed and can be downloaded as a text file (Figure 2).
For the stand-alone approach, DACC features can be generated with the following command line: './acc.py −e user property −f svm −l +1 3 DNA DACC', where −e user property specifies the user-defined physicochemical index file, −f svm and −l +1 have the same meaning as for the Kmer features above, the parameter lag equals 3, the sequence type is DNA, and the method used is DACC.
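The DAC and DCC definitions can be illustrated with a short sketch. The code below is an independent illustration rather than the code behind './acc.py': the function names are hypothetical, the two properties in `TOY_PROPS` are made-up stand-ins for the fifteen physicochemical properties of Table 1, and the ordering of the output features is an assumption.

```python
def _dinucleotides(seq):
    """Return the L - 1 overlapping dinucleotides of a sequence of length L."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def dcc(seq, prop1, prop2, lag):
    """Cross covariance between two dinucleotide properties at distance lag."""
    dinucs = _dinucleotides(seq.upper())
    n_terms = len(dinucs) - lag  # equals L - lag - 1
    if n_terms <= 0:
        raise ValueError("sequence too short for this lag")
    mean1 = sum(prop1[d] for d in dinucs) / len(dinucs)
    mean2 = sum(prop2[d] for d in dinucs) / len(dinucs)
    total = sum(
        (prop1[dinucs[i]] - mean1) * (prop2[dinucs[i + lag]] - mean2)
        for i in range(n_terms)
    )
    return total / n_terms


def dac(seq, prop, lag):
    """Auto covariance: the cross covariance of a property with itself."""
    return dcc(seq, prop, prop, lag)


def dacc(seq, props, max_lag):
    """Concatenate all DAC values and all DCC values for lags 1..max_lag."""
    names = sorted(props)
    features = [dac(seq, props[u], lag)
                for lag in range(1, max_lag + 1) for u in names]
    features += [dcc(seq, props[u1], props[u2], lag)
                 for lag in range(1, max_lag + 1)
                 for u1 in names for u2 in names if u1 != u2]
    return features


# Hypothetical toy properties (NOT the Table 1 values): counts of G/C and of purines.
TOY_PROPS = {
    "gc_count": {a + b: float((a in "GC") + (b in "GC")) for a in "ACGT" for b in "ACGT"},
    "purine_count": {a + b: float((a in "AG") + (b in "AG")) for a in "ACGT" for b in "ACGT"},
}

features = dacc("ACGTTGCAAC", TOY_PROPS, max_lag=2)
print(len(features))  # N * N * max_lag values for N properties
```

For N properties and a maximum lag of LAG, this layout yields N × LAG auto covariance values plus N × (N − 1) × LAG cross covariance values, that is, N² × LAG features in total.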

Pseudo Dinucleotide Composition (PseDNC).
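Given a DNA sequence D represented as (2), the PseDNC feature vector D can be defined as [17]
$$\mathbf{D} = \left[d_1 \; d_2 \; \cdots \; d_{16} \; d_{16+1} \; \cdots \; d_{16+\lambda}\right]^{T},$$
$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16, \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 17 \le k \le 16 + \lambda, \end{cases}$$
where $f_k$ ($1 \le k \le 16$) represents the normalized frequency of dinucleotides along the DNA sequence; $w$ ($0 \le w \le 1$) represents the weight factor; $\lambda$ is the number of top counted tiers of the correlation in a DNA sequence; and $\theta_j$ ($1 \le j \le \lambda$) measures the correlation between dinucleotides in the DNA, which is defined as
$$\theta_1 = \frac{1}{L-2} \sum_{i=1}^{L-2} \Theta(R_i R_{i+1}, R_{i+1} R_{i+2}),$$
$$\theta_2 = \frac{1}{L-3} \sum_{i=1}^{L-3} \Theta(R_i R_{i+1}, R_{i+2} R_{i+3}),$$
$$\vdots$$
$$\theta_\lambda = \frac{1}{L-1-\lambda} \sum_{i=1}^{L-1-\lambda} \Theta(R_i R_{i+1}, R_{i+\lambda} R_{i+\lambda+1}),$$
where
$$\Theta(R_i R_{i+1}, R_j R_{j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[P_u(R_i R_{i+1}) - P_u(R_j R_{j+1})\right]^2,$$
where $\mu$ represents the number of dinucleotide property indices, and $P_u(R_i R_{i+1})$ ($P_u(R_j R_{j+1})$) represents the value of dinucleotide $R_i R_{i+1}$ ($R_j R_{j+1}$) at position $i$ ($j$) for the dinucleotide property index $u$.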
Pseudo dinucleotide composition (PseDNC) [17] not only incorporates the local nucleic acid composition information and the global or long range information along the DNA sequences, but also incorporates the dinucleotide properties into feature vectors.
For the web server approach, the generation steps of the feature vectors are similar to those for DACC; an example is shown in Figure 3.
For the stand-alone approach, the command line is './pse.py −e user property −f svm −l +1 7 0.3 DNA PseDNC', where −e user property, −f svm, and −l +1 have the same meaning as in the command line above, lambda equals 7, the weight equals 0.3, the sequence type is DNA, and the method used is PseDNC. The meanings of all the parameters for these scripts are described in [17].
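The PseDNC construction can likewise be sketched in a few lines. This is only an illustrative reading of the definition, not the './pse.py' code: the function name `psednc` is hypothetical, the properties in `TOY_PROPS` are made-up stand-ins for the real (typically normalized) physicochemical indices, and the sequence is assumed to contain only A, C, G, and T.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def psednc(seq, props, lam, w):
    """Return a (16 + lam)-dimensional PseDNC-style vector.

    props is a list of dicts, each mapping a dinucleotide to a property value.
    """
    seq = seq.upper()
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # L - 1 dinucleotides
    if len(dinucs) <= lam:
        raise ValueError("sequence too short for this lambda")

    # Normalized dinucleotide frequencies (the first 16 components).
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCS]

    def coupling(a, b):
        # Mean squared property difference between two dinucleotides.
        return sum((p[a] - p[b]) ** 2 for p in props) / len(props)

    # Tier-j correlation factors, j = 1..lam.
    thetas = []
    for j in range(1, lam + 1):
        n_terms = len(dinucs) - j  # equals L - 1 - j
        thetas.append(
            sum(coupling(dinucs[i], dinucs[i + j]) for i in range(n_terms)) / n_terms
        )

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Hypothetical toy properties (NOT real physicochemical indices).
TOY_PROPS = [
    {a + b: float((a in "GC") + (b in "GC")) for a in "ACGT" for b in "ACGT"},
    {a + b: float((a in "AG") + (b in "AG")) for a in "ACGT" for b in "ACGT"},
]

vec = psednc("ACGTTGCAAC", TOY_PROPS, lam=3, w=0.3)
print(len(vec))
```

Because the composition part and the weighted correlation part share one denominator, the components of the resulting vector sum to 1; with lambda equal to 7, as in the command line above, the vector would have 16 + 7 = 23 components.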
In the current study, the LIBSVM package version 3.21 [18] was employed. The two SVM parameters, the kernel width parameter $\gamma$ and the regularization parameter $C$, were optimized via the grid tool provided by LIBSVM [18].
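The parameter search can be pictured as an exhaustive scan over powers of two. The sketch below is an assumption-laden illustration and not the LIBSVM grid tool itself: `grid_search` and `toy_score` are hypothetical names, the exponent ranges are only assumed to resemble the tool's usual defaults, and `toy_score` stands in for the cross-validation accuracy that a real search would compute by training an SVM.

```python
import math


def grid_search(score, log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Scan (C, gamma) = (2**a, 2**b) and return (best_score, C, gamma).

    score is any callable taking (C, gamma) and returning a number to
    maximize, for example a cross-validation accuracy.
    """
    best = None
    for exp_c in log2c:
        for exp_g in log2g:
            c, g = 2.0 ** exp_c, 2.0 ** exp_g
            s = score(c, g)
            if best is None or s > best[0]:
                best = (s, c, g)
    return best


def toy_score(c, g):
    """Hypothetical smooth score peaking at C = 2**3, gamma = 2**-3."""
    return -(math.log2(c) - 3) ** 2 - (math.log2(g) + 3) ** 2


best_score, best_c, best_g = grid_search(toy_score)
print(best_c, best_g)
```

Scanning exponents rather than raw values is the usual design choice here, because SVM performance tends to vary on a logarithmic scale in both parameters.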

Ensemble Learning.
In machine learning, ensemble learning is the process by which multiple classifiers are constructed on the same dataset and combined to obtain better performance than a single classifier [28,29]; existing popular multiobjective optimization evolutionary algorithms can also be used for ensemble learning [30,31]. Ensemble classifiers have also performed well on several bioinformatics problems. In the current study, the basic framework of the ensemble classifier is illustrated in Figure 4. The final results are obtained by fusing the outcomes of three individual classifiers, as described below.
Suppose the ensemble classifier C is defined as
$$C = C_1 \oplus C_2 \oplus C_3,$$
where $C_1$ represents the classifier SVM-Kmer, $C_2$ represents the classifier SVM-DACC, and $C_3$ represents the classifier SVM-PseDNC. The symbol $\oplus$ denotes the fusing operator.
Therefore, the process of the ensemble classifier can be formulated as follows:
$$Y_j = \frac{1}{3} \sum_{i=1}^{3} P_i(\mathbf{S}, \Omega_j), \quad j = 1, 2, \tag{13}$$
where $\Omega_1$ is the set containing only recombination hotspots, $\Omega_2$ is the set of recombination coldspots, and $P_i(\mathbf{S}, \Omega_j)$ is the probability, obtained by the $i$th basic classifier, that DNA sequence S belongs to category $\Omega_j$. The category to which the query DNA S belongs is thus determined by its average probability calculated by (13); that is, suppose that
$$Y_\mu = \max\{Y_1, Y_2\},$$
where the operator max selects the larger value in the brackets, and the subscript $\mu$ indicates that the query DNA S belongs to category $\Omega_\mu$.
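The fusion rule described above, averaging the class probabilities of the basic classifiers and choosing the category with the larger average, can be sketched as follows. The function name `fuse` and the tie-breaking rule in favor of the hotspot class are assumptions made for illustration.

```python
def fuse(prob_rows):
    """Fuse base-classifier outputs by averaging their class probabilities.

    prob_rows holds one (p_hotspot, p_coldspot) pair per basic classifier,
    for example from SVM-Kmer, SVM-DACC, and SVM-PseDNC.
    Returns the predicted label and the two averaged probabilities.
    """
    n = len(prob_rows)
    y_hot = sum(p[0] for p in prob_rows) / n
    y_cold = sum(p[1] for p in prob_rows) / n
    # Ties go to the hotspot class; this tie-break is an assumption.
    label = "hotspot" if y_hot >= y_cold else "coldspot"
    return label, (y_hot, y_cold)


# Toy probabilities from three hypothetical base classifiers.
label, averages = fuse([(0.9, 0.1), (0.4, 0.6), (0.6, 0.4)])
print(label, averages)
```

In this toy case one classifier votes for the coldspot class, but the averaged hotspot probability is still the larger of the two, so the fused decision is "hotspot".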

Performance of the Three Basic Classifiers.
As an inherent property, sequence order is important for the classification of DNA sequences. Therefore, three basic methods based on sequence-order information were adopted to identify recombination hotspots/coldspots. Table 2 shows the performance of the three methods: SVM-DACC and SVM-PseDNC outperform SVM-Kmer in prediction accuracy. The main reason is that SVM-Kmer is based only on local sequence-order information, while both SVM-DACC and SVM-PseDNC also contain global sequence-order information.
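The conclusion of this article states that the predictors were assessed with the jackknife test. A minimal sketch of such a leave-one-out evaluation is given below, under clearly stated assumptions: `jackknife_accuracy` is a hypothetical helper, and a toy nearest-centroid classifier stands in for the SVMs trained with LIBSVM in the study.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave each sample out once, train on the rest, and test on it."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def train_centroids(xs, ys):
    """Toy stand-in for SVM training: store one mean vector per class."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


# Toy one-dimensional data: +1 for hotspots, -1 for coldspots.
xs = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
ys = [1, 1, 1, -1, -1, -1]
print(jackknife_accuracy(xs, ys, train_centroids, predict_centroid))
```

Each sample is predicted by a model that never saw it during training, which is why the jackknife test gives a unique result for a given dataset.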

The Performance of the Three Basic Predictors Can Be Further Improved by Using Ensemble Learning.
Based on the analysis above, we have proposed three basic predictors for identifying recombination hotspots/coldspots. These methods capture DNA information from different aspects. Therefore, we presented a complementary method, SVM-EL, which fuses these basic methods to improve the prediction performance. The performance of SVM-EL is shown in Table 2, from which we can see that SVM-EL outperforms the three basic methods. In addition, the corresponding receiver operating characteristic (ROC) curves of the four classifiers are drawn in Figure 5. AUC, the area under the ROC curve, is often used to indicate the performance of a classifier: the larger the value, the better the classifier. As shown in Figure 5, the predictor SVM-EL showed the top performance, outperforming the three basic methods: SVM-Kmer, SVM-DACC, and SVM-PseDNC.
Notes to Table 2: b, the parameters used are lag = 6 for SVM-DACC and $C = 2^{3}$ and $\gamma = 2^{-3}$ for LIBSVM [18]; c, the parameters used are $\lambda = 7$ and $w = 0.3$ for SVM-PseDNC and $C = 2^{13}$ and $\gamma = 2^{3}$ for LIBSVM [18].
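The AUC used in this comparison can be computed directly from prediction scores, without drawing the curve. The sketch below uses the standard pairwise (rank-based) formulation, in which the AUC equals the fraction of positive/negative pairs that the classifier orders correctly; the function name `auc` and the toy scores are illustrative assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores are classifier outputs (higher means more hotspot-like) and
    labels are 1 for positives and anything else for negatives.
    Tied positive/negative pairs count as half a correct ordering.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores: one positive sample is ranked below one negative sample.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A value of 1 corresponds to a perfect ranking and 0.5 to a random one, which is why a larger AUC indicates a better classifier.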

Comparison with Other Related Predictors.
Two state-of-the-art methods, IDQD [10] and iRSpot-PseDNC, were selected for comparison with the proposed SVM-EL. Table 3 shows the results of the various methods on the benchmark dataset. According to Table 3, SVM-EL outperforms the other methods. The main reason is that IDQD and SVM-Kmer consider only local sequence-order information, whereas iRSpot-PseDNC, SVM-DACC, and SVM-PseDNC improve on them by incorporating global sequence-order information. SVM-EL goes one step further: by fusing the basic predictors, it incorporates both the local nucleic acid composition information and the global sequence-order information. Therefore, we conclude that SVM-EL would be a useful tool for hotspot/coldspot identification.

Conclusion
In this article, we proposed a predictor called SVM-EL for yeast hotspot/coldspot identification, which combines Support Vector Machines (SVMs) with Ensemble Learning (EL). Combining predictors trained on different features contributes to the improvement in prediction accuracy. The basic predictors of SVM-EL are trained on three features: basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). All these features can be generated by Pse-in-One [17], a powerful web server for generating various DNA, RNA, or protein features, which also provides an easy-to-use stand-alone version. In the jackknife test, SVM-EL outperformed the other predictors. In the future, we will consider using other approaches for yeast hotspot/coldspot identification, such as bioinspired computing models [38][39][40][41][42][43][44][45].