Sequencing and restriction analysis of genes such as 16S rRNA and HSP60 are widely used for molecular identification in microbial communities. Aided by rapid progress in bioinformatics, genome sequencing has become the method of choice for bacterial identification. However, genome sequencing technology is still out of reach in developing countries. In this paper, we propose FN-Identify, a sequencing-free method for bacterial identification. FN-Identify uses the gene sequence data available in GenBank and other databases, together with two algorithms that we developed, CreateScheme and GeneIdentify, to create a restriction enzyme-based identification scheme. FN-Identify was tested using three different and diverse bacterial populations (members of
Bacterial identification is an important routine in clinical and industrial microbiology laboratories. Microbiologists and researchers have stepped up their efforts to improve and facilitate the rapid characterization of various microbial communities. Traditional bacterial identification strategies are mainly based on morphological, biochemical, enzymatic, antigenic, staining, and antibiogram characterization [
In the early 1980s, polymerase chain reaction (PCR) provided novel approaches for bacterial identification through amplification of specific sequences/genes from the bacterial genome. Several ribosomal RNA (rRNA) genes and Internal Transcribed Spacers (ITSs) have been used for PCR-based bacterial identification, such as 16S rRNA, 23S rRNA, 5S rRNA, and SSU rRNA [
Numerous protein-coding genes, such as Hsp65, rpoB, gyrB, groEL, and recA, have also been tested as genetic markers in bacterial identification [
With the rapid progress in DNA and RNA sequencing technology, sequencing of 16S rRNA gene and several other genes became a popular method for bacterial identification and phylogenetic reconstruction. Furthermore, it is employed in nucleic acid-based detection, quantification of microbial diversity, and discovery of novel bacterial isolates in different microbiology laboratories [
Despite the outstanding advances in speed and accuracy and the remarkable decrease in the cost of sequencing technologies in recent years, these technologies remain out of reach for the majority of clinical and research laboratories in developing countries. This is mainly due to the high cost of establishing a sequencing facility and the high cost of reagents and maintenance [
In this work, we present FN-Identify, an efficient and sequencing-free bacterial identification method, as a proposed alternative that can be employed when genome sequencing is inaccessible. FN-Identify, which stands for fragment number-identify, is based on techniques that are available in most standard microbiological laboratories. Our new method depends on sequences available in GenBank and other public databases, such as RDP-II [
Comparison between the sequencing-based identification approach and the proposed FN-Identify approach.
We downloaded the 33, 33, and 22
Names and GenBank accession number of
Strain ID | Organism | GenBank accession number |
---|---|---|
1 | | CP002559 |
2 | | CP000033 |
3 | | CP002338 |
4 | | CP002609 |
5 | | CP000416 |
6 | | CP002652 |
7 | | CP000423 |
8 | | FN692037 |
9 | | CP000156 |
10 | | CR954253 |
11 | | CP000412 |
12 | | CP002033 |
13 | | AP008937 |
14 | | CP000413 |
15 | | CP000517 |
16 | | CP002429 |
17 | | CP002464 |
18 | | FN298497 |
19 | | AE017198 |
20 | | CP001617 |
21 | | CP002222 |
22 | | CP000705 |
23 | | AP007281 |
24 | | AP011548 |
25 | | FM179322 |
26 | | FM179323 |
27 | | CR936503 |
28 | | CP002764 |
29 | | CP002391 |
30 | | CP003032 |
31 | | CP002034 |
32 | | CP000233 |
33 | | CP002461 |
This table lists the studied
The files that contain the
16S rRNA and HSP60 copy numbers and genomic positions.
Strain ID | 16S rRNA copy number | 16S rRNA position | HSP60 position |
---|---|---|---|
1 | 4 | 57091⋯58665 | 407805⋯409506 |
2 | 4 | 59255⋯60826 | 379688⋯381333 |
3 | 4 | 66295⋯67869 | 403452⋯405083 |
4 | 4 | 55901⋯57475 | 376234⋯377865 |
5 | 5 | 86149⋯87711 | 645454⋯647079 |
6 | 5 | 706262⋯707824 | 1429276⋯1430898 |
7 | 5 | 259510⋯261077 | 2233684⋯2235318 |
8 | 4 | 62524⋯64075 | 391450⋯393075 |
9 | 9 | 35825⋯37395 | 1448011⋯1449624 |
10 | 9 | 45160⋯46720 | 1392354⋯1393967 |
11 | 9 | 43705⋯45265 | 1405173⋯1406786 |
12 | 5 | 169808⋯171375 | 394255⋯395886 |
13 | 5 | 169391⋯170958 | 393747⋯395378 |
14 | 6 | 477570⋯479148 | 425524⋯427155 |
15 | 4 | 76215⋯77787 | 408372⋯409994 |
16 | 4 | 85110⋯86682 | 393232⋯394854 |
17 | 4 | 546957⋯548607 | 490210⋯491841 |
18 | 4 | 455618⋯457268 | 412091⋯413722 |
19 | 6 | 558550⋯560200 | 502509⋯504140 |
20 | 5 | 484838⋯486408 | 631044⋯632669 |
21 | 5 | 487643⋯489213 | 591466⋯593091 |
22 | 6 | 177728⋯179296 | 401807⋯403435 |
23 | 6 | 177347⋯178880 | 401630⋯403258 |
24 | 5 | 306772⋯308345 | 2303140⋯2304732 |
25 | 5 | 307756⋯309313 | 2308734⋯2310368 |
26 | 5 | 289782⋯291339 | 2265733⋯2267367 |
27 | 7 | 306178⋯307748 | 358686⋯360625 |
28 | 4¹ | 125303⋯126858 | 82036⋯83667 |
29 | 5 | 274946⋯276503 | 2240006⋯2241640 |
30 | 6 | 274311⋯275837 | 650101⋯651714 |
31 | 7 | 74995⋯76521 | 1247027⋯1248649 |
32 | 7 | 74540⋯76056 | 1246385⋯1248007 |
33 | 7 | 40703⋯42272 | 485966⋯487585 |
¹ Our annotation for 16S rRNA sequences in
We tested 13 different primer sequences obtained from 8 published studies (Table
Primer sequences used for 16S rRNA and HSP60.
ID | Gene name | Name | Sequence | Reference |
---|---|---|---|---|
1 | 16S rRNA | 8F | 5′AGAGTTTGATCCTGGCTCAG3′ | [ |
2 | 16S rRNA | U1492R | 5′GGTTACCTTGTTACGACTT3′ | [ |
3 | 16S rRNA | 928F | 5′TAAAACTYAAAKGAATTGACGGG3′ | [ |
4 | 16S rRNA | 336R | 5′ACTGCTGCSYCCCGTAGGAGTCT3′ | [ |
5 | 16S rRNA | 1100F | 5′YAACGAGCGCAACCC3′ | [ |
6 | 16S rRNA | 1100R | 5′AGGGTTGCGCTCGTTG3′ | [ |
7 | 16S rRNA | 907R | 5′CCGTCAATTCCTTTRAGTTT3′ | [ |
8 | 16S rRNA | 785F | 5′GGATTAGATACCCTGGTA3′ | [ |
9 | 16S rRNA | 805R | 5′GACTACCAGGGTATCTAATC3′ | [ |
10 | 16S rRNA | 515F | 5′GTGCCAGCMGCCGCGGTAA3′ | [ |
11 | 16S rRNA | 518R | 5′GTATTACCGCGGCTGCTGG3′ | [ |
12 | 16S rRNA | 27F | 5′AGAGTTTGATCMTGGCTCAG3′ | [ |
13 | 16S rRNA | 1541R | 5′AAGGAGGTGATCCAGCCGCA3′ | [ |
14 | HSP60 | HSP60-F | 5′ATGGCWAARGANNTHAARTT3′ | Designed |
15 | HSP60 | HSP60-R | 5′TCDGCVACNACNGCTTCNGA3′ | Designed |
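Several primers in the table above contain IUPAC ambiguity codes (for example, M in 27F and N, H, W, R, D, and V in the HSP60 primers). The following Python sketch shows one way such degenerate primers could be matched against a sequence in silico. It is only an illustration: the function names `primer_to_regex`, `reverse_complement`, and `find_amplicon` are hypothetical, exact matching without mismatches is assumed, and this is not the authors' implementation.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Translate a (possibly degenerate) primer into a regular expression."""
    parts = []
    for base in primer.upper():
        bases = IUPAC[base]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def reverse_complement(seq):
    """Reverse complement of a sequence, ambiguity codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_amplicon(template, forward, reverse):
    """Return the region from the forward primer site to the reverse primer
    site (both inclusive) on the given strand, or None if either is absent."""
    template = template.upper()
    fwd = re.search(primer_to_regex(forward), template)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so on this strand
    # we look for its reverse complement downstream of the forward primer.
    rev = re.search(primer_to_regex(reverse_complement(reverse)),
                    template[fwd.end():])
    if rev is None:
        return None
    return template[fwd.start():fwd.end() + rev.end()]


# Toy template (hypothetical): 27F site, a short insert, then the 1541R site.
toy = ("TT" + "AGAGTTTGATCCTGGCTCAG" + "ACGTACGT"
       + reverse_complement("AAGGAGGTGATCCAGCCGCA") + "GG")
amplicon = find_amplicon(toy, "AGAGTTTGATCMTGGCTCAG", "AAGGAGGTGATCCAGCCGCA")
```

Because M stands for A or C, the degenerate 27F pattern also matches the non-degenerate 8F sequence listed in the table.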
A universal degenerate primer for picking up HSP60 sequences was designed based on the conserved regions in the HSP60 extracted sequences. We identified the conserved regions by performing multiple sequence alignment (MSA) using CLC Sequence Viewer software (CLC Bio, Swansea, UK). Table
We collected the information about restriction enzymes and restriction sites from the database of restriction enzymes (REBASE), Roberts 1980 and Roberts et al., 2010 [
Currently, genome sequencing is the technology of choice for several research and clinical applications due to its rapid development, remarkable speed, continuously improving accuracy, and affordable sample processing cost. However, in several developing countries, genome sequencing technologies are still out of reach for most researchers and scientists, for several reasons. First, the cost of establishing and maintaining a sequencing facility is high for resource-poor countries. Second, well-trained personnel to run the facility are lacking. Third, power, Internet, and computational infrastructures are weak. Finally, access to up-to-date scientific data, literature, and training is limited [
The scientific community anticipated this problem over a decade ago, with the rise of next-generation sequencing technologies [
The identification of the family of certain bacteria is usually based on the morphological and other characteristics of the colony, while the identification of the species and strains requires molecular and more sophisticated methods [
We downloaded the 33 complete
In order to select standard universal primer(s) for 16S rRNA sequences from all
In some cases, these two primers are not present in 16S rRNA separated sequences. For instance, the two primers failed with the separated 16S rRNA genes of the strain
In some cases, there was a difference in length between the 16S rRNA returned
After selecting the 8F and 1541R primers as universal primers for 16S rRNA, we used them to annotate the 16S rRNA gene in the
For the HSP60 gene, we could not find a universal primer in the published literature. Therefore, we designed a universal primer based on the conserved nucleotide sequences of HSP60. The conserved nucleotide sequences were identified by multiple sequence alignment (MSA) using CLC Sequence Viewer software (CLC Bio, Swansea, UK). Based on the alignment results, we were able to design two degenerate primers for HSP60 (HSP60-F and HSP60-R, Table
In order to perform an
The
The exclusion of the short fragments was observed in several species and strains from those we used in this study. For instance,
Other sources of differences in ribotyping between the
To determine the number of DNA fragments returned from a particular species/strain that contains several copies of the 16S rRNA sequence, we compare the lengths of the fragments and count fragments of equal length only once. This mirrors what happens in the lab, as fragments of the same length migrate to the same band in the gel. For instance, for
For HSP60 gene, the construction of the restriction map was straightforward. Each
This section describes our proposed sequencing-free bacterial identification method in detail. The proposed method identifies bacterial species/strains based on the number of fragments and/or the fragment lengths that result from the restriction of certain genes using a given set of restriction enzymes. Therefore, we refer to it as the fragment number-identification method or FN-Identify. The main goal of FN-Identify is to establish an identification scheme for bacterial species using the fragment patterns of enzymatic restrictions, such as the restriction map we built in the above section. The established scheme specifies the set of enzymes that can be employed to identify a given (unknown) gene sequence as well as the order of their application. The identified gene refers to a particular species/strain within the restriction map.
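As a minimal sketch of the quantity FN-Identify works with, the Python snippet below digests a gene sequence in silico and reports the resulting fragment lengths. The function names `digest` and `visible_bands`, the `cut_offset` and `min_length` parameters, and the toy sequences are illustrative assumptions rather than the authors' implementation; EcoRI (recognition site GAATTC, cutting after the first G) is used only as a familiar example enzyme.

```python
def digest(sequence, site, cut_offset=0):
    """Fragment lengths after cutting a linear sequence at every occurrence
    of `site`; `cut_offset` is the cut position inside the site."""
    sequence, site = sequence.upper(), site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Advance by one so overlapping sites are also found.
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length pieces (a cut exactly at either end of the sequence).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def visible_bands(copies, site, cut_offset=0, min_length=0):
    """Distinct fragment lengths pooled over all copies of a gene.

    Fragments of equal length co-migrate in one gel band, so each length is
    counted once. `min_length` (an assumed parameter) drops fragments taken
    to be too short to be seen on the gel."""
    lengths = set()
    for seq in copies:
        lengths.update(n for n in digest(seq, site, cut_offset)
                       if n >= min_length)
    return sorted(lengths)


# Toy example (hypothetical sequences), EcoRI: G^AATTC.
copy_a = "AAAGAATTCTTTTGAATTCCC"
copy_b = "AAAGAATTCTT"
fragments = digest(copy_a, "GAATTC", cut_offset=1)            # [4, 10, 7]
bands = visible_bands([copy_a, copy_b], "GAATTC", cut_offset=1)  # [4, 7, 10]
number_of_fragments = len(bands)
```

In this sketch, the "number of fragments" for a multi-copy gene is the number of distinct bands, so two copies that yield identical fragment lengths contribute no extra bands.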
The idea behind FN-Identify is inspired by two basic observations. First, the number of fragments resulting from the restriction of a DNA sequence (e.g., a 16S rRNA gene sequence) differs depending on the restriction enzyme employed. Generally speaking, a given gene sequence
For the purpose of illustration, consider an extreme example where all sequences of
The
Predict the restriction map Search number-of-fragments when applied to all gene-sequences of Group all strains/species having the same number-of-fragments in a distinct group on the results produced by createScheme(
An example of a tree
Predict the restriction map
Search
Use results obtained from the application of
Apply Step
The algorithm stops if either (1) every group being processed contains exactly one species/strain or (2) no further application of any restriction enzyme can discriminate species/strains in groups containing more than one species/strain. The former case indicates that the algorithm can identify all species/strains of
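The grouping procedure and its two stopping conditions can be sketched in Python as a recursive tree builder. This is a hedged illustration, not the published CreateScheme algorithm: the function name `create_scheme`, the dictionary-based tree layout, and in particular the greedy rule of choosing the enzyme that yields the most distinct groups are assumptions made for the example. The restriction map is assumed to be a nested dictionary giving the predicted number of fragments for each strain and enzyme.

```python
from collections import defaultdict


def create_scheme(strains, restriction_map, enzymes):
    """Build an identification tree.

    restriction_map[strain][enzyme] is the predicted number of fragments.
    A leaf is {"strains": [...]}; an inner node is
    {"enzyme": name, "children": {fragment_count: subtree}}.
    """
    # Stopping condition (1): the group holds a single strain.
    if len(strains) == 1:
        return {"strains": list(strains)}

    best_enzyme, best_groups = None, None
    for enzyme in enzymes:
        # Group strains that give the same number of fragments.
        groups = defaultdict(list)
        for strain in strains:
            groups[restriction_map[strain][enzyme]].append(strain)
        # Assumed greedy criterion: prefer the enzyme giving most groups.
        if best_groups is None or len(groups) > len(best_groups):
            best_enzyme, best_groups = enzyme, groups

    # Stopping condition (2): no enzyme separates this group any further.
    if best_groups is None or len(best_groups) == 1:
        return {"strains": list(strains)}

    remaining = [e for e in enzymes if e != best_enzyme]
    return {
        "enzyme": best_enzyme,
        "children": {
            count: create_scheme(members, restriction_map, remaining)
            for count, members in best_groups.items()
        },
    }


# Toy restriction map with hypothetical strains and enzymes.
toy_map = {
    "s1": {"E1": 2, "E2": 1},
    "s2": {"E1": 2, "E2": 3},
    "s3": {"E1": 4, "E2": 3},
    "s4": {"E1": 4, "E2": 3},
}
scheme = create_scheme(list(toy_map), toy_map, ["E1", "E2"])
```

In the toy map, strains s3 and s4 have identical fragment counts for every enzyme, so they end up together in one leaf, which corresponds to the second stopping condition.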
Once an identification scheme
currentNode Cut find a child
as currentNode.chld; currentNode
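The traversal outlined above can be illustrated with a short Python sketch. It assumes a tree layout in which inner nodes name an enzyme and map each fragment count to a child, and leaves list the candidate strains. The function name `gene_identify`, the dictionary layout, and the toy scheme are hypothetical and are not taken from the published GeneIdentify algorithm.

```python
def gene_identify(scheme, observed_counts):
    """Walk the identification tree using fragment counts observed in the lab.

    observed_counts maps an enzyme name to the number of bands seen after
    digesting the unknown gene with that enzyme. Returns the list of
    candidate strains, or an empty list if the observed pattern is not
    covered by the scheme.
    """
    node = scheme
    while "enzyme" in node:
        # "Cut" with the enzyme at the current node and follow the child
        # whose label equals the observed number of fragments.
        count = observed_counts.get(node["enzyme"])
        child = node["children"].get(count)
        if child is None:
            return []
        node = child
    return node["strains"]


# Toy scheme with hypothetical strains and enzymes.
toy_scheme = {
    "enzyme": "E1",
    "children": {
        2: {"enzyme": "E2",
            "children": {1: {"strains": ["s1"]},
                         3: {"strains": ["s2"]}}},
        4: {"strains": ["s3", "s4"]},
    },
}

match = gene_identify(toy_scheme, {"E1": 2, "E2": 3})
```

Only the enzymes along one root-to-leaf path need to be applied at the bench, which is why the number of required restriction experiments can differ between strains.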
The
In order to develop our proposed method and algorithms, we used the 16S rRNA sequences of a population of 33 members of
Identification scheme of
Identification scheme of
To further improve the identification efficiency of FN-Identify method and algorithms, we used the HSP60 genes as an example for genes with a single copy in the genome (Table
Summary of the employed training and testing datasets and FN-Identify performance.
Bacterial group | Gram¹ | Members | 16S rRNA: unique sequences² | 16S rRNA: required enzymes (1 factor) | 16S rRNA: required enzymes (2 factors) | 16S rRNA: max.-min. enzymes/species³ (1 factor) | 16S rRNA: max.-min. enzymes/species³ (2 factors) | HSP60: unique sequences² | HSP60: required enzymes (1 factor) | HSP60: required enzymes (2 factors) | HSP60: max.-min. enzymes/species³ (1 factor) | HSP60: max.-min. enzymes/species³ (2 factors) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Training set | ||||||||||||
 | P. | 33 | 24 | 6 | 5 | 6-6 | 5-3 | 23 | 6 | 5 | 4-1 | 3-1 |
Testing sets | ||||||||||||
 | N. | 33 | 32 | 8 | 6 | 8-7 | 7-4 | — | — | — | — | — |
 | P. | 22 | 18 | 7 | 4 | 7-5 | 4-3 | — | — | — | — | — |
¹ P: positive; N: negative.
² Members with differences in 16S rRNA sequences. In some cases, two or more members have 100% similarity in their 16S rRNA sequences; those members are considered one entry to FN-Identify.
³ The maximum and minimum number of enzymes required to identify a given member of the group.
FN-Identify method and the two algorithms were developed using a training set of 33 members of
We obtained the sequences of the 16S rRNA genes of 22 members of
Collectively, these results demonstrate the efficiency and utility of the FN-Identify method and the two developed algorithms in identifying bacterial species/strains within a genus and show that the method is applicable in bacterial groups with distinct properties.
The assessment of the FN-Identify method and the two developed algorithms shows the potential of the method when used with standard microbiology protocols and instruments. FN-Identify is a computational method designed to help design and minimize the experimental procedures required for bacterial identification. Ideally, FN-Identify interfaces with experimental and clinical workflows by receiving inputs (expected bacterial group, gene(s) to be used for identification, and list of restriction enzymes) and providing outputs that guide the subsequent bench experiments (the list and order of enzymatic restriction experiments and the identification scheme used to interpret the experimental results).
To be fully utilized, FN-Identify needs a software tool connected to a database of gene sequences (e.g., 16S rRNA and HSP60) from different bacterial families and to a database of restriction enzymes. The software should implement the two algorithms and automate the selection of the species and the enzymes, as well as the construction of the restriction map and the identification scheme. We are currently building this tool as a web server that provides these services for free to enable the scientific community in developing countries to utilize FN-Identify.
Bacterial identification is an important routine required in several microbiological and environmental applications and research. Current techniques depend heavily on sequencing certain genes that are present in almost all bacterial species. Although genome sequencing techniques have seen outstanding improvements in accuracy and decreases in cost, developing countries remain far from employing these indispensable technologies due to several barriers. Therefore, alternative sequencing-independent methods are required to perform the needed tasks at affordable cost using available facilities. We developed FN-Identify, a sequencing-independent method for bacterial identification that uses standard microbiological protocols and instruments, restriction enzymes, and two algorithms that we developed (CreateScheme and GeneIdentify). FN-Identify was tested against standard bacterial populations of 22 and 33 bacterial species/strains of the
The authors declare that there is no conflict of interests regarding the publication of this paper.