A De Novo Genome Assembly Algorithm for Repeats and Nonrepeats

Background. Next generation sequencing platforms can generate shorter reads, deeper coverage, and higher throughput than those of the Sanger sequencing. These short reads may be assembled de novo before some specific genome analyses. Up to now, the performances of assembling repeats of these current assemblers are very poor. Results. To improve this problem, we proposed a new genome assembly algorithm, named SWA, which has four properties: (1) assembling repeats and nonrepeats; (2) adopting a new overlapping extension strategy to extend each seed; (3) adopting sliding window to filter out the sequencing bias; and (4) proposing a compensational mechanism for low coverage datasets. SWA was evaluated and validated in both simulations and real sequencing datasets. The accuracy of assembling repeats and estimating the copy numbers is up to 99% and 100%, respectively. Finally, the extensive comparisons with other eight leading assemblers show that SWA outperformed others in terms of completeness and correctness of assembling repeats and nonrepeats. Conclusions. This paper proposed a new de novo genome assembly method for resolving complex repeats. SWA not only can detect where repeats or nonrepeats are but also can assemble them completely from NGS data, especially for assembling repeats. This is the advantage over other assemblers.


Background
Over the past twenty years, genome sequencing technologies have made great progress in many aspects, such as speed, cost, coverage, and so forth. The automated Sanger sequencing is regarded as the first-generation genome sequencing technology which has the ability to read longer than 1000 base pair (1000-2000 bp). The latter sequencing technologies are referred to as the next-generation sequencing (NGS) technologies. Currently, the available commercial NGS platforms include GA, MiSeq, and HiSeq from Illumina [1], SOLiD and Ion Torrent from Life Technologies [2], RS system from Pacific Bioscience, and Heliscope from Helicos Biosciences [3][4][5][6][7]. Next generation sequencing machines can sequence the whole human genome in a few days, and this capability has inspired a flood of new projects that are aimed at sequencing large kinds of animals and plants [8,9]. NGS can be characterized by highly parallel operation, higher yield, simpler operation, shorter reads, and much lower cost [10].
However, the NGS technologies all share a common intrinsic characteristic of providing very short read length (30∼ 250 bp), which is substantially shorter than the Sanger sequencing reads.
These short reads may be assembled de novo before further genome analysis if the reference genome is not available. Currently, there are tens of genome assembly algorithms and software. Among them ABySS [11], ALLPATHS-LG [12], Bambus2 [13], CABOG [14], MSR-CA (http://www.genome.umd.edu/masurca.html), SGA [15], SOAPdenovo [16], and Velvet [17] are the typical ones. Each of them is able to run large and whole genome assembly using NGS short read data from mate-pair or paired-end information. In terms of repeats smaller than read length, these current assemblers performed well. But for the repeats longer than read length, most of them performed poorly in completeness and accuracy of assembling repeats from NGS data. It is shown that repetitive DNA comprises a significant fraction of the eukaryotic genomes, for example, ∼20% of 2

BioMed Research International
Caenorhabditis elegans and Caenorhabditis briggsae genomes [18] and ∼50% of the human genome [19] have been identified as repetitive DNA. Most of these repetitive DNA sequences have some important biomedical functions and are closely related to some complex disease [20,21], such as cancer [22], neuropsychiatric disorders [23], and autism [24]. Thus, it is necessary to improve the genome assembly algorithms, especially in assembling repeats.
To this end, we proposed a new genome assembly algorithm aiming for assembling repeats and nonrepeats, named SWA (sliding window assembly), which can assemble repeats and nonrepeats completely and accurately. In SWA, sliding window function is used to filter out the sequencing bias caused by sequencing process and improve the confidence of separating repeats and nonrepeats. There are five typical properties.
(1) Assembling repeats and nonrepeats completely and accurately. SWA can assemble repeats and nonrepeats from NGS data directly in a parallel way which can reduce the memory usage and executive time. In addition, SWA cannot only detect where repeats or nonrepeats are but also can assemble them completely. Therefore, SWA provides an alternative solution to resolve long repeats in some extent.
(2) Adopting dynamic overlapping strategy to extend each seed. The so-called dynamic overlapping strategy is to compute the reads overlapped with seed in intervals composed of maximum overlap and minimum overlap. This strategy can search the optimal read for extension in dynamic interval and jump over the short repeats.
(3) Adopting sliding window to filter out the sequencing bias so as to improve the confidence of detecting boundary of repeats. NGS data is always full of sequencing bias and is highly uneven, which caused the difficulty of distinguishing where repeats or nonrepeats are. Sliding window technique is used to filter out the sequencing bias in the genome assembly process so as to increase the confidence of detecting boundary of repeats.
(4) Proposing a compensational mechanism for the loss caused by low coverage. Low coverage makes the statistical properties of read counts less significant. To improve the statistical significance, SWA proposed a compensational mechanism based on sliding window. This mechanism can improve the statistical significance of read counts under the condition of low coverage.
(5) Estimating copies of assembled ones as an auxiliary function.
The main contributions of our approach are as follows. (1) Assembling repeats and nonrepeats completely and accurately rather than only detecting where repeats or nonrepeats are. Complex repeats structures have very important biomedical functions. Consequently, the completeness and accuracy of assembling repeats are what SWA is mainly concerned about the completeness of assembled repeats and nonrepeats rather than the continuity of whole genome assembly. (2) Sliding window functions to filter out the sequencing bias are used in genome assembling process. Filtering noise by window function is very common in information processing but is rare in genome assembly process. SWA adopts sliding window to filter out NGS data bias and improve the statistical significance of read counts. In addition, a compensational mechanism based on sliding window was embedded in SWA. This mechanism can improve the significance of read counts under the condition of low coverage.
The assessments were performed in simulated datasets and real NGS datasets which are all generated by Illumina sequencer. Simulated study is only used to validate the performances of SWA. Therefore, it has little meaning to compare with other assemblers in simulated datasets. Extensive comparisons were conducted with other eight famous assemblers, such as ABySS, ALLPATHS-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, and Velvet, in real NGS datasets.
The results indicate that for whatever small genomes or large genomes, SWA outperformed other eight leading assemblers in the completeness of assembling repeats. SWA is freely available at http://222.200.182.71/swa/SWA.rar. Firstly, preprocessing is performed. SWA firstly filters out the raw reads that contain any " ", which is noninformative, or any low quality value region, which may contain errors and lead to false positive overlaps with other reads (Figure 2(a)). And then the parameters are set in advance (see parameters): the maximum overlapping length max, dynamic overlapping interval , read length , length of sliding window , threshold of repetitive seeds , threshold of nonrepetitive seeds , threshold of repeats assembly 2 , threshold of nonrepeats assembly 1 , and sequencing depth after filtering by sliding window fd .

Results
Thirdly, unique processing is performed (Figure 2(b)), which is used to find the unique reads and corresponding frequencies in raw datasets. This step is performed using unique function provided by MATLAB platform. For example, = unique ( ) for the array and returns the same values as in but with no repetitions. The values of are in sorted order. For the sequencing reads, this step is performed in a similar way. The reads that are exactly the same-that is identical reads-are collapsed into one unique read and the corresponding frequency is also recorded. For each raw read, the reverse-complement is also used. After unique processing, all unique reads are sorted in a table , which is a variable to store these processed data. By unique processing, the amount of data is decreased, especially for the high coverage data, which can also reduce the computational requirement and accelerate the running time.  Figure 1: High-level diagram of the SWA assembly pipeline. The assembly has three main modules: preparatory stage, constructing repeats, and constructing nonrepeats. Preparatory stage includes data cleaning, unique processing, and hash index constructing. The stage of constructing repeats and constructing nonrepeats can be performed in parallel or in series. Figure 1 shows the parallel manner. The specific steps of constructing repeats and nonrepeats are detailed in Methods.
Thirdly, hash index is constructed ( Figure 3). As it is time consuming to identify unique reads by directly comparing the whole raw reads one by one, SWA constructs the indirect hash index of unique reads which is able to index complex and nonsequential data in a way that facilitates rapid searching. The indirect hash index structures firstly transform the keywords to the quaternary integers and then index these integers rather than the strings directly. This is especially appropriate for DNA sequencing reads.  Figure 2: Some key steps of SWA. (a) Raw reads processing. Input reads containing any " " or a low quality region are discarded and then sorted in alphabetical order. (b) Graphical illustration of unique process. The five different color lines represent the five unique reads in preprocessed raw reads. Each of them appears more than twice. By unique processing, the identical reads are collapsed into one unique and corresponding frequency. (c) Seed selection. The unique reads are ranked by read count (from high to low). Unique reads with read count larger than are selected as seeds for repeat (the red dotted frame), while unique reads with read counts smaller than are selected as seeds for nonrepeat extension (the blue dotted frame). (d) The graphical example of extending repeats using sliding window function in dynamic overlapping interval. The dotted box represents the dynamic overlapping interval . After overlapping with seed in , the overlapped read counts are recorded and then sliding window function is used to filter out the read bias in this interval continuously, as shown in C1. C2 is the corresponding results filtered by sliding window function, and then the mean value of this interval is recorded in variable to detect the boundary of repeats ( Figure 4) and nonrepeats ( Figure 5). In this extension, SWA regards 1 as the optimal extendable read. The extension of nonrepeats is performed in a similar way. The detailed extension and boundary detection are graphically shown in Figures 4 and 5. Fourthly, seed selection is conducted. In SWA, each extension requires a unique read, called a seed, to initiate the extension. In an extension-based assembler, a good seed should not contain any sequencing errors and should not be selected from the boundary of repeats and nonrepeats.
Consequently, data cleaning is necessary in the data preprocessing stage before seed selection, and then the seeds are selected in table . In addition, theoretically, a read from repetitive region usually has a high read count because identical repeats from other loci are counted as well; a read  Sequence reads with associated read identifiers are shown, with the regions that will be used for seed selection in capital letters and matched seeds of two bases from AA to TT. Given read identifiers are associated with the seeds using a hash function (e.g., a unique integer representation of each seed). Once such hash table has been built for unique reads; the corresponding data can be scanned with the same hash function, resulting in a much smaller subset of reads to more exactly search the extendible reads. Extending repeats Potential non-repeats

Potential boundary
Extended repeats Figure 4: Schematic of extending repeats and boundary detection. The graphic illustration of extending repeats and boundary detection. Red line represents the extended repeats, blue line represents the extending repeats, and the green line represents the potential nonrepeats. The yellow lines represent the supporting reads overlapped with the extended contig. We assume that the sequencing depth = 2, and let 2 = 3. Therefore, in the process of extending repeats, the mean value of dynamic overlapping interval filtered by sliding window as shown in Figure 2(d) should be larger than or equal to 2 . The dotted box represents the potential boundary of repeats. Consequently, if we set > 2 , the extension will be stopped at B1 or the extension will be stopped at B2. from nonrepetitive region always has a low read count. On the other hand, the read from the boundary always has the middle read count; these seeds always lead to misassembled and short contigs. Thus, the seeds for repeats are chosen with high read count in table , larger than , while the one for nonrepeats should be chosen with low read count in table , smaller than . And seeds with middle read count should be avoided. In order to avoid the risk of selecting seed with full errors, the lower limit of read count is also necessary. Furthermore, in seed selection stage, sequencing base quality value will be used to avoid the risk of picking seed with errors, especially for seed of nonrepeats. For the seeds with same read count, SWA selects the one with higher quality value, because higher base quality means lower errors. This strategy can avoid the risk of picking seed with errors to maximum extent.
Fifthly, seed extension (Figures 2(d), 4, and 5). We first define some terms. Let be the length of each read and and Extending repeats Potential non-repeats non-repeats Potential boundary Extended M n1 < 3 M n2 = 3 M n3 > 3 B2 B1 Figure 5: Schematic of extending nonrepeats and boundary detection. The graphic illustration of extending repeats and boundary detection. Green line represents the extended nonrepeats, blue line represents the extending nonrepeats, and the red line represents the potential repeats. The yellow lines represent the supporting reads overlapped with the extension. We assume that the sequencing depth = 2, and let 1 = 3. Therefore, in the process of extending nonrepeats, the mean value of dynamic overlapping interval filtered by sliding window should be smaller than or equal to 1 . The dotted box represents the potential boundary of repeats. Consequently, if we set < 1 , the extension will be stopped at B1 or the extension will be stopped at B2. be two unique reads in table . We say that and overlap if the suffix bases of are identical to the prefix bases of , where min ≤ ≤ max (max and min are, resp., the allowed maximal and minimal numbers of overlapping bases) and the default value of them are /2 ≤ min < max = − 1. More explicitly, we say that the 3 -end of overlaps with the 5 -end of . In the circumstances of long reads, the max overlap max is not needed to use this default value, and the empirical value is min < max < min{100, − 1}. The dynamic overlapping interval , that is the -mer in SWA, and is defined as = -mer = max − min. Given a seed, SWA first extends it at the 3 -end and then at the 5 -end. A read is extendable for seed if its 5 -end overlaps with the 3 -end of seed and its 3 -end overlaps with one or more unique reads. To extend a seed seed at the 3 -end, SWA searches all unassembled unique reads in table for extendable reads in the dynamic range of [1, -mer], that is, from maximal overlapping to minimal overlapping. The read counts overlapping with seed in this interval are recorded (as shown in Figure 2(d)). Then sliding window function is used to filter out the read count bias of this interval, and then the mean value filtered by sliding window is recorded in variable which is used to detect the boundary of repeats. For the extension of repeats (Figure 4), the extension will continue if > 2 , where 2 is the threshold of repeats, else the extension will be stopped at this end. For the extension of nonrepeats ( Figure 5), the extension will continue if < 1 , where 1 is the threshold of nonrepeats. Meanwhile, the mean value of variable can be used to estimate the copy numbers.
In seed extending stage, dynamic overlapping interval is used to search the optimal read to extend seed and which has three functions: (1) the optimal read can be searched in this interval for seed extension (Discussion); (2) the bias of the read counts in this interval can be filtered out by sliding window function so as to increase the confidence of detecting boundary (Figures 4 and 5); (3) copies of assembled contig can be estimated based on the combination of this interval.
Moreover, sliding window is a determinant factor of judging whether the extension of seed is up to the bound of repeats or nonrepeats. The threshold for repeats and nonrepeats is closely related to the sliding window and filtered times, which determines the accuracy and correctness of repeat or nonrepeat contig construction.
The concrete contents and abbreviations are described in detail in methods and parameters.

Compensational Mechanism.
Another property of SWA is the mechanism of compensating the loss caused by low sequencing depth. In practice, the genome sequencing process is highly uneven among the whole genome and full sequencing bias. High coverage can decrease the influence of sequencing bias on the statistical significance of read counts. However, under the condition of low coverage, sequencing bias reduces statistical significance of read counts, which leads to difficulty of distinguishing boundary of repeats. In order to improve the statistical significance of read counts under low coverage, sequencing bias will be filtered out more completely. To this end, SWA proposed a compensational mechanism by the combination of filtered times and sliding window function. The so-called filtered times is the parameter of SWA which means the times of sequencing data filtered by sliding window function. For example, if the read counts filtered by sliding window function is filtered by sliding window once more, that is filtered times = 2. The main idea of the compensational mechanism is as follows: SWA can increase the length of dynamic overlapping interval and the size of sliding window and then coalesce several points into one point, which can be achieved by both increasing the size of sliding window and number of filtered times.
The relationship between sequencing depth filtered by sliding window, filtered times , and average sequencing depth is given as follows: After filtering by sliding window, read counts will be more flat. To run this compensational mechanism, we suggest that only the values, size of sliding window, and filtered times, may need to be adjusted. The empirical value of optimal sliding window should be set in the range -mer/5 ≤ ≤ -mer/3, where -mer is the size of dynamic overlapping interval, and optimal filtered times should be set = 2. Notably, this compensational mechanism is a double edge sword. On the one hand, it can decrease the sequencing bias and compensate the read loss of low coverage. On the other hand, it reduces the sensitiveness of detecting boundary of repeats and increases the computing complexity and executive time. In order to get the best assembly, the high coverage NGS data is also preferred.

Copy Number Estimation.
The auxiliary function of SWA is to estimate the copy numbers of each assembled contig including repeats and nonrepeats directly from NGS data rather than aligning them back to reference genomes. Copy number estimation is also an important factor for the genomic function analysis related to CNVs. SWA estimates the copy number of the assembled contig when its extension is stopped at both ends. SWA output this finished contig and its corresponding copies simultaneously. Furthermore, the accuracy of estimated copy numbers is high up to 99% as shown in assessments in simulated datasets. Because in the process of extension of a seed, the mean value of each extension is recorded in variable , which is used to estimate the copy number of corresponding items (Methods). TRC is the types of repeats and CNRC is the copy numbers of each corresponding type of repeat. These two metrics are used to evaluate the correctness of assembled repeats. NNC is the number of nonrepeat contigs which is used to evaluate the correctness of assembled nonrepeats. Because N50 size might sometimes be a misleading statistic, we also computed another statistic, which we called E-size.

Assessments
(i) CN-accuracy is the accuracy of estimated copy number of each repeat. And which is designed to evaluate the accuracy of estimating copy number of each repeat contig and defined as CN-accuracy = 1 − (∑ error / total ), where total is the sum of the copy numbers of all types of repeats and error is the absolute value between estimated copy numbers and theoretical copy numbers. Therefore, the larger CNaccuracy is preferred. The larger the CN-accuracy is, the better the performances of SWA of estimating copy numbers of repeats will be.
(ii) Rep-accuracy (Rep-acc) is the accuracy of assembled repeat contigs, which is designed to evaluate the correctness of all types of assembled repeat contigs and defined as Rep-accuracy = 1−(∑ | − |/ ∑ ), where represents the length of real repeat and represents the length of assembled repeat contig. | − | is the error tolerance. The higher Rep-accuracy represents the better performances of SWA for assembling repeat contigs. So higher Repaccuracy is preferred.
(iii) C-accuracy is the accuracy of the total contigs, which is designed to evaluate the accuracy of all assembled contigs totally and defined as C-accuracy = 1 − ( error / total ); total represents the total number of all contigs including repeats and nonrepeats and error is the sum of all error contigs.
(iv) The E-size is designed to answer the question: if you choose a location (a base) in the reference genome at random, what is the expected size of the contig or scaffold containing that location? This statistic is Three sequences with length = 500 kb, 500 kb, and 1 Mb containing different types of repeats. Contigs of repeat and nonrepeat are generated independently by SWA with basic parameters: read length = 50, filtered times = 1, sliding window size = 3, and -mer = 10. Contigs smaller than 200 are removed. one way to answer the related question: how many genes will be completely contained within assembled contigs or scaffolds, rather than split into multiple pieces? E-size is computed as E = ∑ ( ) 2 / , where is the length of contig and is the genome length estimated by the sum of all contig lengths.
For evaluating correctness, the metrics, such as CNaccuracy, Rep-accuracy, and C-accuracy, are all computed by aligning the corresponding items back to the reference genome using program swalign from MATLAB platform. Swalign constructs local pairwise alignments between two sequences using Smith-Waterman algorithm [25].
Among these metrics, TRC, CNRC, NNC, CN-accuracy, and Rep-accuracy are specially designed for judging the correctness of assembling repeats and nonrepeats. Notably, the length of contigs is not what we are mainly concerned about due to following reasons.
(1) The primary goal of SWA is to resolve repeats by assembling repeats regions and nonrepeats regions separately. Therefore, the correctness of assembling repeats and nonrepeats is what SWA is firstly concerned about.
(2) In order to separate repeat or nonrepeat correctly, SWA must detect the boundary of them accurately and stop extending seed at the boundary automatically.
(3) The assembly with larger contig is always preferred. However, if the contig is assembled or constructed with errors, the larger the contig is, the worse the assembly will be. So accuracy is another important metric for the correct contig.

Assessments in Simulated Datasets.
In this part, we validated the performances of SWA in three kinds of simulated datasets containing interspersed repeats, tandem repeats, and compound repeats, respectively. And then the effect of sequencing depth and read length to SWA was evaluated, respectively. The detailed results were shown in Tables 1, 2,  and 3. From Table 1, three kinds of simulated datasets containing interspersed repeats, tandem repeats, and compound repeats were used to validate the performances of SWA. The repetitive contents contained in these three sequences represented a wide range of repeats with different copies and lengths. The maximum of copy number and length of repetitive sequence are set up to 18 and 5 kb, respectively. The CN-accuracy, C-accuracy, and Rep-accuracy were almost up to 100% and 99.9%, respectively, which indicated that the estimated copy numbers of assembled repeats and the constructed contigs were all absolutely correct, and the error tolerance of constructed repeats was lower than 0.2%. The error rate of genome coverage was up to 1% low. All of these indicate that SWA not only can assemble different kinds of repeats and nonrepeats independently but also can estimate their copy numbers accurately.
From Table 2, we can clearly see that sequencing depth has a great influence on the performances of SWA. When depth = 6 or 4, the performances of SWA were so much better, metrics such as TRC, CNRC, CN-accuracy, and Caccuracy were absolutely correct and high. Rep-accuracy and N50 were a little down but still good; when depth dropped to 2 or 1, TRC and NNC were increasing, while N50, Max, and Rep-accuracy were decreasing; meanwhile the completeness of assembled repeats is getting worse. All of these indicated that the performances of SWA were degenerating; that is,  Sequence length = 500 kb containing four kinds of repeats. Size of sliding window varies from 3 to 11. Contigs of repeat and nonrepeat are generated in an independent way by SWA at different levels with basic parameters: sequencing depth = 0.5, filtered times = 1, read length = 60, and -mer = 20. Contigs smaller than 200 are removed. long repeats and nonrepeats were assembled into several short fragments. CNRC of corresponding assembled repeat contig was shown in Appendix. But CN-accuracy and Caccuracy were still up to 100% which indicated that the copy number estimation of each assembled repeat contig was still right. When depth fell to 0.5, the metrics except CN-accuracy and C-accuracy were almost not accurate. Particularly, the Rep-accuracy was only 66.7%, which indicated that the completeness of repeats was destroyed. Notably, when depth dropped to 0.2, almost all metrics were getting bad. TRC and NNC were far from the real value. CN-accuracy and Rep-accuracy were so low that almost half repeats were not assembled and their copy numbers were estimated with large errors. N50 and Max were so small which indicated that all long repeats and nonrepeats were broken into smaller fragments. The assembled repeats and nonrepeats were far from completeness. But C-accuracy was more robust than other metrics, which indicated that although these contigs were so small they were at least correct. All of these indicate that sequencing depth affects the performances of SWA greatly. Some extent of high coverage depth is necessary for SWA to generate best assembly.
From Table 3, we can clearly see that read length has little influence on the performances of SWA. When read length varied from 36 to 150, almost all metrics were good and had little change except TRC and NNC. TRC and NNC were decreasing, which indicated that long repeats and nonrepeats were assembled more completely. Therefore, N50 increased a little with the increase of read length. What is more, the Rep-accuracy and genome coverage were increasing a little with the increasing of read length, which indicated the completeness of assembled repeats and assembly redundant were increasing simultaneously.
From Tables 1, 2, and 3, the property of assembling repeats and nonrepeats independently and separately was validated. The effect of sequencing depth and read length to the performances of SWA was also evaluated, respectively. From these results we can safely come to a conclusion that SWA can assemble repeats and nonrepeats independently and correctly; meanwhile sequencing depth has a greater influence on SWA than read length. In high coverage depth, the total performances of SWA are perfectly good. But in a low coverage depth situation, the performances of SWA are a little down. In practice, the higher coverage generated increases the higher sequencing cost. So a compensational mechanism of low sequencing depth is described in the following section.
In the following section, we evaluated the effect of sliding window and filtered times to SWA in low sequencing depth situation, respectively. The detailed results were shown in Tables 4 and 5. Table 4 indicated that the performances of SWA were not so much good in low sequencing depth and sliding window can improve the performances of some metrics, such as TRC, NNC, and Rep-accuracy. When sliding window size varied from 3 to 11, the performances of SWA were getting from good to bad, then to bad generally. Particularly for the metric of Rep-accuracy, which was getting up to 98.7% from 76.1% and then getting down to 89.6%. So the appropriate sliding window can improve the performances of SWA and compensate the read loss of low coverage depth. The choice of optimal sliding window is very important for the effect of compensational mechanism and is closely related to read length, sequencing depth, and dynamic overlapping interval.
It was clearly shown in Table 5 that the appropriate increase of filtered times could also improve the accuracy Sequence length = 500 kb containing five kinds of repeats. Contigs of repeat and nonrepeat are generated in an independent way by SWA at different levels with basic parameters: sequencing depth = 0.5, size of sliding window = 3, read length = 60, and -mer = 15. Contigs smaller than 200 are removed.
of assembling repeats. The effect of sliding window is to filter out the bias caused by sequencing process; therefore the increase of filtered times can improve the effect of filtering bias. It is clear that the Rep-accuracy was up to 82.1% after being filtered twice from 66.8% but got worse to 68% after three times filtering. So too much filtered times can lead to misassembled contigs across the boundary of repeats and nonrepeats as shown in Table 5. Therefore, for real NGS data, some compromise is necessary for choosing optimal filtered times.

Assessments in Reference Datasets.
In this section, we validated the performances of SWA in reference genome datasets of three species in Materials. We analysed their repeat structures by whole genome scan using RepeatScout [26]. The repeats structures including lengths and copies were detailed in the Appendix and the link detected repeats and their copies in Table 6 in supporting data. The results of SWA are presented in Table 6. Table 6 shows the assembly statistics of three species by SWA. All contigs were corrected and verified by aligning them back to reference genome, so the C-accuracy was 100%. For chrIV-S.c datasets, SWA generated 32 repeats and 315 nonrepeats; the copy numbers of assembled repeats are presented in the Appendix. In our whole genome scanning, 41 repeats longer than 200 were identified. By aligning these assembled repeats back to the reference, 4 tandem repeats were assembled together and the left were all correct. So the CN-accuracy is about 88%, but C-accuracy and genome coverage are almost up to 100%. For E. coli, SWA generated 50 repeats and 259 nonrepeats; the copy numbers of assembled repeats are presented in the Appendix. By whole genome scanning, 57 repeats were identified. So CN-accuracy is about 88%, but genome coverage is 99.7%. For chrIII-C.e datasets, SWA generates 198 repeats and 5471 nonrepeats. In our whole genome scanning, 339 repeats longer than 200 were identified. By aligning these assembled repeats back to reference, 103 short tandem repeats were assembled together. So the accuracy is about 89%. But the genome coverage is a little lower.

Assessments in NGS Datasets.
In order to assess the performances of SWA more comprehensively, we performed comparisons with other eight leading genome assemblers presented in GAGE [27], such as ABySS, ALLPATHS-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, and Velvet. We used the assembly evaluation script provided by GAGE to assess various assembly metrics. Briefly, the GAGE script aligns contigs back to the reference genome and calculates the corrected N50 length by breaking contigs at misassembled sites. Tables 7, 8, and 9 show the assembly metrics for SWA and eight others in three species including S. aureus, R. sphaeroides, and human chromosome 14. We did not run these assemblers on the whole human genomes due to the following reasons: (1) some of the assemblers in our comparison would take many weeks to assemble the complete genome and others would fail entirely; and (2) high computing platform is not available. The statistics for these eight assemblers were taken from GAGE study. In order to compare the ability of assembling repeats fairly, all contigs are aligned back to repeats using swalign function [25]. Table 7 shows the assembly statistics for S. aureus dataset by nine assemblers. For S. a, SWA performs best in assembling repeats and nonrepeats. By aligning them back to repeats, SWA generated 30 repeats with total size 20.7 kb. SGA generated 18 repeats with total size 19.4 kb. Velvet generated 15 repeats with total size 13.8 kb. In terms of the completeness of types of repeats, SWA achieved the best assembly. The Rep-acc of SWA and SGA and Velvet are 82% and 83.8%, respectively. Therefore, for the accuracy of assembling repeats, SWA and SGA are the top two assemblers. Other assemblers had poor performances in the accuracy of assembling repeats and nonrepeats. Particularly, Allpath-LG only generates 3 repeats with size 2.6 kb and Repacc 4.3%. Because Allpath-LG achieved the longest corrected N50 length (66.2 kb), long contig can cross the boundary of repeat and lead to indistinguishable contig. So the better the continuity of assembler is the worse the completeness of assembling repeats and nonrepeats will be. For the continuity of assembly, SWA was read loss to other eight assemblers. However, for the assembly size and genome coverage, SWA was also the best one with assembly size 2,939 kb and genome coverage 100.1%, which is most approximate to the real genome size. So in terms of completeness of assembly, SWA achieved the best assembly. Table 8 shows the assembly statistics for R. sphaeroides dataset by nine assemblers. By aligning them back to repeats, SWA generated 13 repeats with total size 7.8 kb. AByss, SGA, and SOAPdenovo generated 13 repeats with total size 8.3 kb, 9 repeats with total size 3.9 kb, and 9 repeats with total size 3.7 kb, respectively. The Rep-acc of them is 44.5%, 44.5%, 64.3%, and 84.5%, respectively. Therefore, in terms of accuracy of assembling repeats, SOAPdenovo outperformed others. But for the completeness of size of repeats, SWA and AByss are the top two assemblers. Other Contigs of repeat and nonrepeat are generated in an independent way by SWA with basic parameters: sequencing depth = 2, read length = 60, filtered times = 1, and sliding window = 3. Contigs smaller than 200 are removed.   five assemblers performed worse in assembling repeats and nonrepeats. Particularly, Allpath-LG, Bambus2, and CABOG only generated less than three repeats with length less than 1.4 kb, because in terms of continuity, these three assemblers, Allpath-LG, Bambus2, and CABOG, are the top ones. But long contigs can lead to indistinguishable contigs. However, in terms of assembly size and genome coverage, SWA also achieved the best assembly with genome size 4,600 kb and genome coverage 99.9%, which is most approximate to the real genome size. So in terms of completeness of assembly, SWA performed best among these nine assemblers. Table 9 shows the assembly statistics for human chromosome 14 dataset by nine assemblers. For H. s 14, SWA performs best in assembling repeats and nonrepeats. By aligning them back to repeats, there are four top assemblers in terms of assembling repeats, such as SWA, MSR-CA, SOAPdenovo, and Velvet. There are 198 repeats with total size 129.3 kb, 190 repeats with total size 526 kb, 188 repeats with total size 476 kb, and 192 repeats with total size 339 kb generated by SWA, MSR-CA, SOAPdenovo, and Velvet, respectively. The Rep-acc of corresponding items is 88.3%, 85.3%, 73.3%, and 56.9%, respectively. For the accuracy of assembling repeats, SWA and MSR-CA achieved the best results. In terms of completeness of types of repeats, SWA achieved the best results. However, in terms of continuity, Allpath-LG, CABOG, and SOAPdenovo outperformed SWA. But for the genome size and genome coverage, SWA achieved the best results with assembled size 8,7936 kb and genome coverage 99.6%. So in terms of completeness of assembly, SWA outperformed other assemblers.
One can safely come to a conclusion from Tables 7, 8, and 9 that SWA performed best in assembling repeats and nonrepeats in three NGS datasets. In terms of assembling repeats and completeness of repeats, SWA is the top one among these nine assemblers. One may argue that the contiguity of SWA is not better than others; the metrics such as N50, N90, and E-size are smaller than some of the other assemblers. This is because SWA is specially designed for assembling repeats and nonrepeats. Therefore, SWA stops extending contigs automatically when the boundary of repeats is detected. Meanwhile, the continuity of assembly and the completeness of repeats and nonrepeats are the pair of contradiction. On the other hand, the better completeness of repeats and nonrepeats requiers that contigs must be stopped at the boundary of repeats. Therefore, the continuity of assembly will be down.

Sequencing Strategies for SWA.
The uniformity of the sequencing process is very important for SWA, because assembling repeats and nonrepeats independently of SWA is based on the combination of coverage depth and sliding window. The effect of sliding window is to filter out the bias caused by sequencing process, because sequencing bias makes the frequency of repeat and nonrepeat more ambiguous to determine, so large sequencing bias may result in short contigs or misassembly. For the appropriately uniformed sequencing data, SWA cannot only assemble repeats and nonrepeats independently but can also estimate their copy numbers correctly. In this situation, contigs of repeats and nonrepeats can be easily grouped into scaffolds by SWA using only the short insert paired-end information, while other current assemblers all need the good combination of several mate-pair libraries with different insert lengths, which increase the cost of sequencing and complexity of technologies. Therefore SWA provides a simple sequencing strategy for NGS technologies; that is, long-distant library is not necessary. So similar to the strategies recommended by ALLPATHS-LG [19], we recommend that for the Illumina technology one should use an overlapping paired-end library with a suitable insert size to generate PE raw reads for contig assembly and there is no need for several mate-pair libraries with different insert lengths to generate long-distant jumping reads for scaffolds. The average genome coverage is at least 100× or higher. For generating overlapping paired-end reads, we provide a simple formula to calculate the insert size for constructing a paired-end library: insert Size = (read length ( ) + max error tolerance ( )) × 2 − max overlap length ( ). For example, if the read length is = 150 bp, the max overlap length = 100 bp and the max error tolerance = 50; then the recommended insert size is 300 bp.

Seed Selection.
In an extension-based assembler, a good seed should not contain any sequencing errors and should not be selected from the boundary of repeats and nonrepeats. A read from repeat region usually has a high read count because identical repeats from other loci are counted as well. On the other hand, a read from nonrepeat region always has a low read count. However, the read from the boundary always has the middle count under the condition of uniformed sequencing process. These seeds are hard to determine whether they belong to repeat region or nonrepeat region and always lead to misassembly or short contigs. Thus, the seeds for repeat region are chosen with high read count, while the one for nonrepeat region should be chosen with low read count, and seeds with middle read count should be avoided.

Parallel Operation.
Obviously, parallel operation can save the executive time and reduce the memory use. SWA can assemble repeats and nonrepeats independently by using two different computers without any communication. This property can shorten the executive time almost a half and reduce the memory use in some extent. Furthermore, the raw data also can be classified into two parts, repeats and nonrepets, according to read count. This strategy can reduce the memory usage largely.

Optimal Sliding
Window. The sliding window plays an important role in contig construction in SWA. By filtering out sequencing bias, SWA can distinguish repeats and nonrepeats from NGS data easily so as to assemble repeats and nonrepeats independently. The mean value of read count in sliding window determines whether the extension should continue or not. So too small sliding window cannot filter out the bias efficiently but has high sensitiveness of detecting changes of read count. A too large sliding window can filter out the bias efficiently but decreases the sensitiveness of detecting repeats and leads to misassembly. The optimal sliding window should have both the property of filtering bias efficiently and detecting repeats sensitively. So the compromise is necessary in practice.

Optimal Read.
In the stage of seed extension, the optimal read is needed in order to extend the seed in dynamic overlapping interval.
In SWA, the optimal read was identified using the following strategy: the one overlapped most bases with seed in dynamic overlapping interval was taken as the optimal read. In theory, longer overlapping with seed means higher accuracy of assembly. But this strategy has low speed, because the seed only extends one or two bases at one extension. Of course, the other strategy for choosing optimal read in dynamic overlapping interval also can be adopted, such as the optimal read can be identified as the one with read counts nearest to the theoretically sequencing depth. This strategy has a higher speed, but the correctness of extension will be low compared with the first strategy. In practice, the strategy of identifying optimal read can be chosen by users.

4.6.
Comparisons. SWA is specially designed for assembling repeats and nonrepeats, respectively. What we are mainly concerned with is the correctness and accuracy of assembling repeats and estimating their copy numbers rather than the length of assembled contigs, so SWA stops extending contig automatically when detecting the boundary of repeat and nonrepeat. Duo to this, the validations and evaluations are performed rather than comparisons with other assemblers in simulations and reference datasets. In real NGS datasets, the comparisons were performed comprehensively with other eight leading assemblers. But the accuracy and completeness of repeats are what we are firstly concerned with.

Conclusions
In this paper, we developed a de novo genome assembly algorithm named SWA, which can assemble repeats and nonrepeats independently. The most important features of SWA are (1) assembling repeats and nonrepeats completely and accurately; (2) adopting sliding window function to filter out sequencing bias in genome assembly process; (3) compensating the loss of low coverage; and (4) estimating the copies of each assembled contigs. Consequently, in this study, we have validated the performances of SWA and compared them with other leading assemblers in three real NGS datasets. For comparisons in real NGS datasets, the metrics such as TRC, NNC, and Rep-size are used to evaluate the completeness of assembled repeats and nonrepeats; the metrics such as Rep-accuracy, C-accuracy, and CN-accuracy are used to evaluate the accuracy of assembled repeats and nonrepeats, while the N50, N90, and maximum contig are used to evaluate the continuity of whole genome assembly. Results indicated that SWA outperformed other leading assemblers in the completeness and correctness of assembling repeats, but the continuity was not better than some of the others. It is natural, because SWA is not specially designed for whole genome assembly and continuity is not what SWA is firstly concerned with.
In general, without long insert-size libraries, repeats that extend beyond the paired-end insert sizes will be difficult to resolve and assemble. Although, some of the compared assemblers can assemble long repeats with simple structure, the completeness and accuracy are not good. It is natural that the continuity is what a whole genome assembler is mainly concerned with. However, SWA is not a whole genome assembly. So bridging two unique sequences around a repeat is not allowed by SWA in order to ensure the completeness of separating repeats and nonrepeats. Even though SWA was not aiming for the whole genome assembly, SWA also provided another solution to resolve long repeats without the help of long insert-size libraries by assembling from nonrepeats completely. In theory, for the whole genome assemblers, if repeats and nonrepeats are assembled correctly and completely, their copies are estimated correctly. Scaffolds can be grouped easily by using short-insert paired-end information rather than the good combination of several libraries. In practice, repeat characteristics in different genomes can vary extensively and depths of sequencing can be highly uneven along the genome, so the expected theoretical de novo assembly results from different genomes will also vary. (Figure 4) (1) Selecting a seed in repeat regions with high frequency larger than in table . (2) Computing read counts overlapped with seeds in dynamic overlapping intervals. (3) Filtering the overlapped read counts by sliding window and then computing the mean value of this interval and recording in . (4) Judging whether the extension of seed is out of the bound of repeat or not. If > 2 , the extension will continue or else stop extension at this end. (5) Extending seed using the optimal read in the dynamic overlapping interval. The optimal read in SWA is the unique read with longest overlapped bases. (6) Continue Step 2-Step 5 until this seed is stopped at both 3 -end and 5 -end. (7) If repetitive seed sets are not empty, go to Step 1 and repeat these steps, else repetitive contigs construction are finished.

The
Steps of Extending Nonrepeats ( Figure 5) (1) Selecting a seed in nonrepeat regions with low frequency smaller than in table .
BioMed Research International

13
(2) Computing read counts overlapped with seeds in dynamic overlapping intervals.
(3) Filtering the overlapped read counts by sliding window and then computing the mean value of this interval and recording in .
(4) Judging whether the extension of seed is up to the boundary of repeat or not. If < 1 , the extension will continue, or else stop extension at this end.
(5) Extending seed using the optimal read in the dynamic overlapping interval. The optimal read in SWA is the unique read with longest overlapped bases.
(6) Continue Step 2-Step 5 until the extension is stopped at both 3 -end and 5 -end.
(7) If nonrepetitive seed sets are not empty, go to Step 1 and repeat these steps, else nonrepetitive contigs construction are finished.

Sliding Window.
Coverage bias is inevitable in genome sequencing process and is usually caused by whole genome amplification (WGA) [28]. Three primary forms of WGA have been developed: multiple displacement amplification (MDA) [29], primer extension preamplification (PEP) [30], and degenerate oligonucleotide primed PCR (DOP) [31]. Furthermore, coverage bias also can be amplified by data cleaning and error correction stage. The existence of coverage bias increases the nonuniformity of read depth for detecting copy numbers of repeats, which can lead to extending contigs crossing the bound of repeats and incorrect estimation of copy numbers of repeat contigs. How to eliminate or decrease these noises is a very important step in constructing contigs of repeat or nonrepeat regions. So the sliding window is used to filter out the noise caused by coverage bias so as to improve the performances of distinguishing repeats from nonrepeats and estimating the copy number of each repeat contig. In order to decrease the coverage bias, we use rectangular window function to smooth coverage bias. Rectangular window function is defined as follows: here is the length of sliding window. So where represents read counts by overlapping − bases.
In theory, the longer the is, the better the sliding effect will be. However, cannot be too large or too small which is closely bounded to read length and length of dynamic overlapping interval . In order to guarantee the correctness of SWA, we set ≥ (3/4) , 2 ≤ ≤ /2. Of course, this is an empirical value.

Hash Index.
Overlap computing is the most time consuming stage for all against all. In order to speed up SWA, we use hash index to store the location of each keyword and read rather than the keyword itself in hash index. The details are as follows. For each read, the continuous bases from 3 -end and 5 -end are selected and then mapped into two variables, forward and backward, respectively. And then, map → 0, → 1, → 2, → 3. So a string consisting of continuous bases is transferred into quaternary integers; the quaternary integers then are transferred into decimal integer. So each keyword is mapped to a unique location in hash index which stores the identification of unique processed reads. We define the hash function as is the length of keywords in hash function.
6.5. Parameters of Kernel SWA Program. The kernel program of SWA has eight parameters: the maximum overlapping length max, dynamic overlapping interval , read length , length of sliding window , threshold of repetitive seeds , threshold of nonrepetitive seeds , threshold of repeats assembly 2 , threshold of nonrepeats assembly 1 , and sequencing depth after filtering by sliding window fd . On the basis of the extensive comparisons of three species as shown in the paper, we suggest that the only values of maximum overlapping length, length of dynamic overlapping interval, and read length may not be adjusted. We recommend the following: = 101, max = 100, = 9, and -mer can be set in the range [35, 50]. These four parameters: , , 1 , and 2 are closely related to sequencing depth and length of sliding window . For example, if sequencing depth is = 2, length of sliding window = 3 and filtered times = 1, and then should be higher than double sequencing depth; that is, = 5. should be lower than sequencing depth, so let = 1. 1 should be a little higher than the filtered sequencing depth fd = × = 2 × 3 = 6, so let 1 = 8. 2 should be a little lower than the 2 × fd = 2 × 6 = 12, so let 2 = 10, where sequencing depth is = / , coverage is = ( / ) × = × , and is the number of reads. Parameters have a great influence on the performances of SWA. Therefore, in practice, fine tuning is necessary for special genome with intrinsic complex repeat structure or different backgrounds of sequencing bias. We suggest that the parameters needed to fine tune are 1 and 2 . The abbreviations are presented in the Abbreviation Section. 6.6. Thresholds. The threshold for repeats and nonrepeats are based on the size of confidence intervals and significance testing, which are closely related to coverage depth, size of sliding window, and filtered times. Let = { 1 , 2 , . . . , } be the overlap numbers in dynamic overlapping stage and the mean of is = (1/ ) ∑ . Let = { 1 , 2 , . . . , } be the read counts filtered by sliding window, so the mean of is = (1/ ) ∑ ≈ × . According to the law of large number in the probability and statistic theory, we can easily get ∼ ( , 2 ). So if the average sequencing depth is , the average read counts filtered by sliding window is fd = × . Thus, the read counts of the nonrepeat region filtered by sliding window should be fd if sequencing bias is free, and the corresponding items of repeat region with two copies should be 2 fd . Clearly fd = if sequencing bias is free. Let = /2; random variable is the mean read counts filtered by sliding window.
If { ≤ + 1 } ≥ 1 − , so the confidence upper limit 1 of nonrepeat region at the confidence level 1 − is 1 = + 1 which is the threshold of nonrepeats. If { ≥ 2 − 2 } ≥ 1 − , the confidence lower limit 2 of nonrepeat region at the confidence level 1 − is 2 = 2 − 2 which is threshold of repeats, where 0 ≤ { 1 , 2 } ≤ , and { 1 , 2 } can be used to control the type-I error and type-II error since the statistical tests of overlapping intervals of windows are not independent. The construction of repeat contig and nonrepeat contig is generated separately. 6.7. Estimating Copy Numbers. The copy number of each repeat contig is estimated by using significant testing methods. After finishing contig construction, the variable has stored the whole filtered read counts and its mean value is computed. The copy number of corresponding items is estimated by rounding mean( )/ fd to the nearest integer.

Materials
In this study, three kinds of datasets are used to validate the performances of SWA and compare with other assemblers. They are real simulated datasets, reference datasets, and real NGS datasets.
For real simulated datasets, the model sequences, A, B, and C are randomly sampled from {A, T, C, and G} with different repetitive contents. These three sequences contain tandem repeats, interspersed repeats, and compound repeats, respectively (as shown in Figure 6). These repetitive contents represent a wide range of length and copies.

Sequence Ainterspersed repeats
Sequence Btandem repeats Sequence Ccompound repeats Figure 6: The graphic explaining the real simulated datasets. The yellow lines represent the model of random sequences. The red lines, blue lines, and purple lines represent different contents of repeats. Sequence A represents the interspersed repeats that is different repeats do not link each other closely. Sequence B represents tandem repeats that is same repeats link each other in the cascade manner. Sequence C contains the compound repeats, which is the combination of sequence A and sequence B. The detailed generation process is presented in supplementary materials.
The detailed information is presented in supplementary table (see Table S3 in Supplementary Material available online at http://dx.doi.org/10.1155/2014/736473). And the generation process is also presented in supplementary materials. Then, the paired-end NGS reads are randomly sampled from the fragments with normal distribution (300, 30).
For reference genome datasets, we download the reference genome of S. cerevisiae, C. elegans from UCSC (http://hgdownload.soe.ucsc.edu/downloads.html) and E. coli k12 (GenBank: U00096.3). For S. cerevisiae and C. elegans, we only randomly take chromosome IV and chromosome III, respectively. The sizes of chrIV-S.c, E. coli, and chrIII-C.e are 1,531,933 bp, 5,132,068 bp, and 13,783,700 bp, respectively. Their repeats structures can be easily analyzed by RepeatScout [26], which is a very effective and sensitive de novo repeats identification method for large genomes and is freely available at http://bix.ucsd.edu/repeatscout/. The repeats structures including lengths and copies are detailed in the Supplementary Materials Appendix and are freely available at http://222.200.182.71/swa/Table6.rar.
For real NGS datasets, two bacterial genomes (Staphylococcus aureus and Rhodobacter sphaeroides, genome sizes of 2.9 and 4.6 Mb, resp.) and human chromosome 14 (genome size of 88.3 Mb) were downloaded from http://gage.cbcb.umd.edu/data/. In the GAGE study [27], all reads were error-corrected before assembly by ABySS, ALLPATHS-LG, Bambus2, Celera Assembler with the Best Overlap Graph (CABOG), Maryland Super-Reads Celera Assembler (MSR-CA), SGA, SOAPdenovo, and Velvet. For a fair comparison, we also obtained these corrected datasets for using in GAGE. These three species have perfect reference genomes. Therefore, their real repeats structures can be easily detected by RepeatScout [26]. For S. a, R. s, and H. s 14, there are 52 repeats, 21 repeats, and 259 repeats detected by RepeatScout [26], respectively, with length longer than 100 bp; their total sizes are 15.8 kb, 3.6 kb, and 146.5 kb, respectively. Table 10 presents the detailed memory usage and CPU times of SWA in three real NGS datasets. Different stages have different requirements of memory. The  Table 10 is the maximum memory. The CPU times refer to the run time of main procedure except for the preprocessing stage. Because a different assembler has different hardware requirements; therefore the direct comparisons are not reasonable to some extent. However, Table 10 gives users a rough guidance for hardware requirement and run time. Those of others have been accessed clearly in GAGE study and are freely available at http://genome .cshlp.org/content/early/2012/01/12/gr.131383.111/suppl/DC1. SWA is implemented in MATLAB computing environment. Programming language: m language. Operation systems: Windows, Linux. Computing platform: 3.5 GHz eight Intel Celeron CPU with 32 GB RAM and 64-bit operational system.

Availability of Supporting Data
The supporting data including NGS data and assembling results are freely available at http://222.200.182.71/swa/ Results.rar.
The detected repeats and their copies of three species used in Table 6 can be freely found at http://222.200.182.71/swa/ Table6.rar.
The detected repeats and their copies of three species used in Table 7

SWA:
Sliding window assembling TRC: The types of repetitive contigs CNRC: Copy number of each repeat contig NNC: Number of nonrepeat contigs Rep-acc: The accuracy of assembled repeat contigs C-accuracy: The accuracy of the total contigs CN-accuracy: The accuracy of estimated copy number of each repeat -mer: Maximum overlap minus the minimum overlap : The number of filtered times by sliding window : The length of sliding window : Th el e n g t ho fr e a d s : The length of dynamic overlapping interval : The threshold of repetitive seeds : The threshold of nonrepetitive seeds : Th el e n g t ho fs e q u e n c i n gr e a d s : The sequencing depth 1 : Threshold for nonrepeats extension 2 : Threshold for repeats extension.