Repetitive element sequences are adjacent, repeating patterns, also called motifs, and can be of different lengths; repetitions can involve their exact or approximate copies. They have been widely used as molecular markers in population biology. Given the sizes of sequenced genomes, various bioinformatics tools have been developed for the extraction of repetitive elements from DNA sequences. However, currently available tools do not provide options for identifying repetitive elements in the genome or proteome, displaying a user-friendly web interface, and performing-exhaustive searches. ProGeRF is a web site for extracting repetitive regions from genome and proteome sequences. It was designed to be efficient, fast, and accurate and primarily user-friendly web tool allowing many ways to view and analyse the results. ProGeRF (Proteome and Genome Repeat Finder) is freely available as a stand-alone program, from which the users can download the source code, and as a web tool. It was developed using the hash table approach to extract perfect and imperfect repetitive regions in a (multi)FASTA file, while allowing a linear time complexity.
Repetitive elements are found in large quantities in eukaryotic genome, both in coding and noncoding region, and also in intergenic regions of prokaryotes [
Repetitive sequences can be categorized into two groups: interspersed repeats and tandem DNA repeat. Interspersed repeats are mainly active or inactive copies of transposable elements dispersed throughout the genome and are divided into DNA transposons and retrotransposons [
Normally, tandem repeats are classified according to the repetitive motifs length in microsatellites, minisatellites, and macrosatellites. Microsatellites (also known as short tandem repeats (STRs) or simple sequence repeats (SSRs)) are small stretches of DNA sequences (usually <200 bp), with motif lengths between 1 and 6 bp. Minisatellites are large repetitive sequences, with motif lengths of 5 to 25 bp, and the macrosatellites are large regions of repeats with lengths larger than 25 bp [
Microsatellites can be classified as perfect, imperfect, and compound. Perfect repetitive elements are formed from identical repetitive units. Imperfect repetitive elements are units with small mutations and may have been caused by insertions, deletions, or replacements. Repetitive compounds elements are composed of sequences in which two or more repetitions (perfect or imperfect) are arranged successively with or without nucleotide bases between them [
Repetitive elements, mainly microsatellites, have been widely used as molecular markers in phylogenetic studies, analyses of genetic populations, construction of genetic maps, paternity testing, and forensic medicine [
Given the importance of identifying repeating regions and the possibility of identifying them in silico, many tools for identifying repeating regions have been developed. Work carried out by Lim et al. [
However, it should be taken into consideration that all of the above software tools are unable to obtain all of the possible sequences because they (a) locate only perfect repetitions (GMATo, TROLL, and Misa); (b) make use of probabilistic or statistical patterns heuristics that do not meet all possible repetitions (TRF and Mreps); and (c) are unable to execute on large FASTA files (SciRoKo and Mreps). Finally, none of these tools can be executed in both DNA and protein datasets.
Thus, this paper presents a fast and efficient algorithm inspired by the concepts of “Sequence Search and Alignment by Hashing Algorithm,” SSAHA [
Some definitions are presented below to facilitate understanding.
Here
ProGeRF is available in two execution modes, as illustrated in Figure
ProGeRF architecture. Structure of the tool both for the web environment and for the stand-alone mode. The dark blue rectangles with rounded corners represent interfaces with the system. The transparent rectangles with a blue background represent algorithms done in C or Perl. The process script receives data from the web environment, treats the data, saves them in a MySql database, and calls the repetition extract module.
Repeat extraction module has been used in this two execution modes. This module consists of three algorithms: one developed in Perl and two developed in C language. The perfect and imperfect repetitions are identified by algorithms in C language, called RepeatFinderDNA and RepeatFinderProteome. The first algorithm works on a FASTA file with DNA sequences and the second algorithm works on a FASTA file with amino acids sequences.
The Perl script, called ProGeRF, receives the input parameters, performs the call to the RepeatFinder algorithms, and after treating overlaps calculates statistics and generates the output file.
The ProGeRF algorithm receives as input parameters (a) a (multi)FASTA file, (b) minimum size of the repetitive pattern, (c) the minimum and maximum sizes of the motif (word DNA or amino acids length or sliding window length), (d) maximum amount of gaps accepted between each motif of a repeat, (e) percentage of maximum degeneration accepted for a motif, (f) overlapping percentage, and (g) run mode, that is, whether using a FASTA file of DNA or of amino acids.
RepeatFinder procedure executes, in parallel, for each motif size within the range of minimum and maximum values, to identify sequences with all motif lengths in this range of values in the FASTA file.
An overview of the ProGeRF algorithm is as shown in Figure Dynamically allocate two hash tables (repetitive element hash table and degeneration hash table) of Read the first sequence from FASTA file. Creating degeneration hash table Repeat the previous process for For each position marked with 1 in the degeneration hash table, run the generating degeneration procedure. This procedure generates all possible degenerations of a motif up to a maximum percentage of defined imperfection. Each motif degenerate generated is converted to an integer key Creating repetitive elements hash table (REHT), illustrated by Figure calculate Step 1: if a single bucket does not exist at position set increase the field set However, if there is a single bucket and the condition Step 2: check whether the current sliding window is a degeneration of some motif ever recorded in buckets of REHT. For this, degeneration buffer at position Step 3: for each existing degeneration in the buffer, the function Save the REHT results in a file and later erase its data. Repeat steps 1–7 for the other sequences in the FASTA file. In dealing with overlaps, join all the files from step 7 into a single file, sort the rows by the initial position of the repetition and for each row that represents a repetitive element, and check the following: if the current repetitive element has an initial position less than the final position of the previous repetitive element then compute the degree of overlap; if the degree of overlap is within the permitted value, skip to the next repetition. Otherwise, delete the smallest repetitive element and pass on to the next line. Print the remaining reps in the file.
Creating degeneration hash table: Step 1: sliding window maps each motif of the sequence for a position in the degeneration hash table and sets value 1 to mapped position. Step 2: generate possible degeneration of the sliding window and store in the buffer at position
Creating repetitive element hash table: Step 1: sliding window maps each motif of the sequence for a position in the repetitive element hash table and sets value 1 to mapped position, and add or remove the sliding window to single bucket; Step 2: check whether the current sliding window is a degeneration of some motif ever recorded in buckets of REHT; Step 3: for each existing degeneration in the buffer the function
Hash tables were developed to perform a dynamic allocation of memory which allows the program to read FASTA files of any size. Furthermore, degeneration buffer and the buckets were implemented through circular doubly linked lists, which allow you to insert or remove degenerations or single buckets in the hash table quickly, without the need to scroll through the whole list.
Time complexity to create the degeneration hash table is approximately linear in function of the number of nucleotides or amino acids sequence, because the algorithm runs once the input sequence to identify the existing motif, scoring with 1 the position in the degeneration hash table of motif found. Then, it traverses the degeneration hash table, and at positions marked with 1, the possible degenerations are generated for the corresponding motif.
The algorithm accepts a maximum of 35% degeneration, that is, at most two degenerate characters in a motif of size 7. The amount of possible combinations for a motif of size
The step of generating REHT also presents linear time complexity depending on the size of the sequence input. Because the sliding window traverses the FASTA sequence once for every sliding window, the corresponding motif is inserted or deleted in the bucket in constant time and then tested at most
ProGeRF is designed to have web and command line interface. The command line interaction may be performed by indicating the (multi)FASTA file address containing DNA or amino acids sequence(s), the motif length range, the minimum repeated times for all motif lengths or the minimum repeated time for each one, the maximum gaps allowed between motifs, the maximum degeneration percentage, the motif shifting percentage, and the run mode that defines DNA or amino acids input sequence and the output file name.
For example, the command sequence perl progerf.pl −q Linfantum_JPCM 5.FASTA −o output −i 2 −y 6 −r 5 −g 3 −v 0 −d 20 −m n will search repetitive elements in the file Linfantum_JPCM5.FASTA of motif with length range between 2 and 6, with maximum gaps of 3, motif overlap of 0%, degeneration of 20%, and run mode of nucleotide, and the result will be saved in the output file.
The results file presents a table, wherein each column represents the following information in order: sequence ID, size of the DNA/protein sequence, minimal repetitions allowed, repetition amount, repetitive element start and final position, number of gaps, statistics (only nucleotide run mode), and complete repetitive element.
The web mode, available at
Web interface provides three ways of sending the FASTA file containing DNA or amino acid sequences. File upload: the users can send a FASTA file from their own computer. Copy and paste sequence: the user copies a sequence of interest and pastes in the text box. Automatic download from the NCBI data base: the user enters one or more GI numbers separated by commas, and the tools will download the sequences from the NCBI data base and run the repetition extraction algorithm. GI number (GenInfo identifier) is a unique number that identifies a particular sequence in the NCBI databases.
The results on the web page can be viewed in two ways: tabular format using the jqGrid script and graphical format, through the JBrowse plugin [
jqGrid is an Ajax-enabled JavaScript control that provides solutions for representing and manipulating tabular data on the web dynamically [
JBrowse is a browser for genome viewing, developed in JavaScript, in which the user can navigate through the genome annotations on the web. In JBrowse, it is possible to zoom, navigate, and select range of subsequence within a genome [
We present two experiments in this paper. The first experiment demonstrates the efficiency of ProGeRF compared with other microsatellite identification tools, and the second experiment demonstrates the use of the repetitive element identification algorithm in protein FASTAs files.
Our current implementation features a Dell Inspiron, Intel core 2 duo 2.2 GHz processor with 2 MB cache, 3 GB RAM, 320 GB hard drive, and the Ubuntu operating system 14.04 LTS 32 bits.
For the first experiment, the tools Misa [
For tools that allow for configuring the parameters minimum size, maximum size, and a minimum number of repetitions of five motifs, the values set for these parameters were 1, 6 and 5, respectively. For the remaining parameters, the following values were used according each tool: (a) Misa: maximum difference between 2 SSRs of 0; (b) Mreps: a resolution of 5; (c) SciRoKo: mode mismatched fixed penalty, with other parameters' score using default values; (d) Sputnik: a maximum size of 5 (maximum allowed by the tool), a minimal score: 5, maximal recursion: 0, minimum length of SSR to report: 10, and points for a mismatch and points for a match: 1; (e) TRF: matching weight: 2, mismatching penalty: 7, indel penalty: 7, match probability: 80, indel probability: 10, Minscore: 2, and MaxPeriod: 15; and (f) ProGeRF: a maximum number of gaps allowed 1, overlap of 0%, a degeneration of 20, and nucleotide mode.
IMEX tool presented error during the execution of the versions 1.0 and 2.0 in Ubuntu operating system 14.04 of 32 bits; thus it has not been possible to compare the results of this tool. In the first three sequences, Table
Comparison of amount detection and execution times (in seconds) of Mreps, Misa, Sputnik, GMATo, SciRoKo, TRF, and ProGeRF. The features were run on a Dell Inspiron, Intel core 2 duo 2.2 GHz processor with 2 MB cache, 3 GB RAM, 320 GB hard drive, Ubuntu operating system 14.04 LTS 32 bits.
Sequence | Mreps | Misa | Sputnik | GMATo | SciRoKo | TRF | ProGeRF |
---|---|---|---|---|---|---|---|
Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) | Rep (time) | |
NC_004318.1 (1204 kb) | 9608 (2.8) | 22867 (3.2) | 7420 (0.7) | 23539 (10.3) | 3763 (1.1) | 30244 (72.4) | 26164 (3.9) |
NC_001136.8 (1531 kb) | 935 (1.4) | 10640 (3.3) | 1427 (0.9) | 10721 (7.7) | 185 (0.7) | 8101 (4.5) | 11552 (2.4) |
NC_000962.2 (4411 kb) | 1412 (3.9) | 6832 (8.9) | 3140 (1.46) | 6846 (12.1) | 72 (1.5) | 19496 (24.5) | 11422 (4.0) |
|
— (—) | 2054241 (868.3) | 480644 (105.7) | 2073643 (9859.1) | 47770 (129.0) | 2438036 (1481.5) | 2319812 (1352.0) |
The smaller numbers of repetitive elements found by tools SciRoKo, Sputnik, and Mreps are due to the fact that (a) Sputnik does not report hexanucleotide since maximum allowed is pentanucleotide; (b) according Mudunuri et al. [
For three files, Table
By default, ProGeRF does not allow overlaps and chooses the biggest repetitive elements sequence. However, the user can define the overlap percentage allowed, through the parameter −v. Nevertheless, the runtime of ProGeRF was lower than TRF and 7 times smaller than that of GMATo.
We evaluated whether the detections returned by tools on sequences NC_004318.1, NC_001136.8, and NC_000962.2, Table
Loci and nucleotide coverage between tools.
Sequence | B | ||||||||
---|---|---|---|---|---|---|---|---|---|
Tools | Mreps | Misa | Sputnik | GMATo | SciRoKo | TRF | ProGeRF | ||
|
A | Mreps | — | 78 (45) | 53 (33) | 78 (45) | 41 (36) | 98 (74) | 89 (60) |
Misa | 47 (63) | — | 21 (4) | 100 (99) | 18 (40) | 88 (79) | 100 (98) | ||
Sputnik | 91 (86) | 70 (74) | — | 70 (74) | 58 (69) | 0 (0) | 83 (84) | ||
GMATo | 49 (62) | 100 (99) | 21 (40) | — | 19 (40) | 89 (79) | 100 (98) | ||
SciRoKo | 96 (95) | 93 (76) | 95 (71) | 92 (76) | — | 95 (96) | 98 (91) | ||
TRF | 51 (41) | 54 (32) | 0 (0) | 54 (32) | 16 (21) | — | 68 (46) | ||
ProGeRF | 46 (56) | 86 (67) | 21 (31) | 86 (67) | 16 (32) | 87 (78) | — | ||
|
|||||||||
SAC Chr4 |
A | Mreps | — | 60 (40) | 42 (26) | 68 (40) | 18 (20) | 95 (74) | 86 (62) |
Misa | 7 (12) | — | 3 (7) | 100 (99) | 1 (4) | 33 (37) | 100 (99) | ||
Sputnik | 30 (35) | 29 (30) | — | 29 (30) | 12 (1) | 74 (73) | 38 (39) | ||
GMATo | 7 (12) | 100 (99) | 3 (7) | — | 1 (4) | 33 (37) | 100 (99) | ||
SciRoKo | 91 (89) | 77 (61) | 86 (59) | 77 (61) | — | 99 (99) | 94 (72) | ||
TRF | 15 (16) | 41 (26) | 13 (12) | 41 (26) | 2 (5) | — | 48 (35) | ||
ProGeRF | 8 (16) | 92 (80) | 4 (7) | 92 (80) | 1 (5) | 34 (39) | — | ||
|
|||||||||
MTB H37Rv |
A | Mreps | — | 9 (3) | 15 (7) | 9 (3) | 3 (3) | 91 (71) | 75 (58) |
Misa | 2 (3) | — | 1 (1) | 100 (99) | 0.5 (1) | 13 (13) | 100 (99) | ||
Sputnik | 6 (7) | 2 (2) | — | 2 (2) | 1 (1) | 66 (64) | 14 (14) | ||
GMATo | 2 (3) | 100 (99) | 1 (1) | — | 0.5 (1) | 13 (13) | 100 (99) | ||
SciRoKo | 73 (74) | 47 (36) | 63 (41) | 47 (36) | — | 100 (100) | 79 (72) | ||
TRF | 8 (7) | 4 (1) | 10 (7) | 4 (1) | 0.4 (0.5) | — | 18 (13) | ||
ProGeRF | 9 (16) | 60 (35) | 4 (4) | 60 (35) | 0.5 (1) | 29 (33) | — |
Percentage of the total number of detections (perfect and imperfect) of tools A also detected (i.e., covered) by tools B. The value in brackets is the proportion of nucleotides detected by A and covered by B.
Sputnik and TRF present low amount of loci covered by ProGeRF on sequences NC_001136.8 and NC_000962.2. This low coverage is consequence of the lack of a parameter to set maximum size and minimum number of repetitions, which allows them to find a larger number of repetitive elements.
Therefore, we filter the results of Sputnik and TRF tools, limiting the results to minimal repeat of 5, minimal size of 1, and maximum size of 6. Thus, ProGeRF coverage increases to 100% over results of Sputnik and more than 80% over results of TRF (97% for sequences NC_004318.1 and NC_001136.8).
On the other hand, the coverage of ProGeRF by SciRoKo, Mreps, and Sputnik is lower than 46% for all sequences and much lower than 9% when observing the last two sequences. This is consistent with the fact that ProGeRF detects more repetitive elements than others tools.
In the second experiment, we run the ProGeRF web version in the protein mode in circumsporozoite protein (ACO49545.1), merozoite surface protein 1 (XP_001352170.1), and merozoite surface protein 9 (AAN36363.1).
Table
Repetitive protein elements found by the web tool ProGeRF.
ID sequence | Locus | Motif | Rep. |
|
|||
|
62–97 | GASAQS | 6 |
|
146–297 | PNAN | 38 |
|
693–712 | KEKEE | 4 |
The parameters used were motif size between 2 and 6, repetitions of the least 4 motifs, and zero for the gaps, overlap, and degeneration.
Screen shot from circumsporozoite protein (ACO49545.1), merozoite surface protein 1 (XP_001352170.1), and merozoite surface protein 9 (AAN36363.1) element repetitive search: (a) visualization of results through the jqGrid plugin: by clicking over the repetitive element the graphical view is opened; (b) repetitive elements are mapped and displayed graphically through JBrowse.
Regarding the tool in web mode, no other web tool offers the user the possibility to consult executions previously carried out and the integration/visualization of results using a dynamic and friendly environment for navigation genome with jBrowse.
ProGeRF, the proposed identification algorithm for repetitive elements, presents itself as an efficient, fast, accurate, and easy to use tool and is available in either stand-alone or web mode. It offers a dynamic and user-friendly web interface, the identification of perfect and imperfect repetitive elements, repeat size detection from 1 to 12, repeating the search for specific sizes, preview of the alignment, the flanking sequence, repetition statistics, and a graphical output.
Among the tools that locate both perfect and imperfect repeats ProGeRF is the one that provides graphical visualization and allows for the filtering of the results. Another advantage is the possibility of executing it on genomic and proteomic data and the ability to treat large genomic/proteomic data files.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)