Gene conversion is an important biological process that involves the transfer of genetic (sequence) information from one gene to another. This can have a variety of effects on an organism, both short-term and long-term and both positive and detrimental. In an effort to better understand this process, we searched through over 3,000 abstracts that contain research on gene conversions, tagged the important data, and analyzed what we extracted. Through this we established trends that give better insight into gene conversion research and genetic research in general. Our results show the importance of the process and of continuing gene conversion research.
At its most basic level, gene conversion is the exchange of information between two genes [
Also of interest is the case in which only a part of the acceptor sequence is replaced. This typically leads to a change in functionality in the acceptor gene and is an important mechanism for achieving genetic diversity. We see this prominently in the human major histocompatibility complex genes (the
As demonstrated by our description and examples, gene conversion is an important process for all organisms. Further understanding of gene conversions will lead to advances such as a better understanding of evolution and a better understanding of (and hopefully prevention of) genetic diseases and disorders. A large amount of research has been devoted to gene conversions, which has given much insight into the process and its effects, both short-term and long-term. However, identification of gene conversions is a difficult process as it typically requires comprehensive phylogenetic analyses across multiple species. While programs do exist that are meant to identify gene conversions (GENECONV [
In order to help further understand gene conversions, we have searched PubMed for research on this topic and extracted and analyzed what was contained in the abstracts. Through this we have identified important trends in gene conversions that can help with continued research. Furthermore, we have listed these papers online to allow the reader to access this research with ease.
We downloaded abstracts from PubMed [
At their most basic level, the majority of papers can be broken down into these two categories. If the research is focused on identifying the biological mechanisms of a gene conversion (such as which proteins are involved in the process or under what conditions an organism shows an increase/decrease in gene conversion), then it is labeled as a “mechanism” paper. If the research is more focused on identifying a gene conversion event (such as an ancient gene conversion between two genes or two alleles), then it is labeled as an “event.”
A key characteristic of a gene conversion is which two genes it occurs between: did the gene conversion occur between two distinct loci (“intergenic”) or between two alleles of the same gene (“interallelic”)?
While gene conversions can often be detrimental (as we will illustrate when we look at the listing of genetic diseases and disorders associated with gene conversions), there are evolutionary reasons for gene conversions existing. On the one hand, gene conversions can lead to genetic diversity by creating new alleles through combining pieces of other alleles. On the other hand, gene conversion can also lead to gene conservation by having two genes maintain similarity through sequence exchange.
In cases where gene conversion occurs between two distinct genes, the type of these genes is of interest. Are they both functional, protein-coding genes (“gene-to-gene”) or is one of them a nonfunctional pseudogene (“gene-to-pseudogene”)? Gene conversions involving pseudogenes can have a variety of effects, both good and bad.
Essentially, all gene conversions are between two genes. But if we model the evolution of a group of genes over a longer time period, interesting trends can be established within this group in terms of gene conversion. Sometimes, it is still just isolated to two genes (“1-to-1”). Other times multiple genes are involved in gene conversion events. These can be broken down into situations where one gene is used as a donor to multiple other genes (“1-to-Many”) and when gene conversions have occurred amongst a group of genes with no one donor being determined (“Many-to-Many”).
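The three categories above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstracts in this study were tagged manually, and the function name, the representation of events as (donor, acceptor) pairs, and the gene names `A`, `B`, and `C` are all hypothetical.

```python
def classify_conversion_pattern(events):
    """Label a gene group's conversion history as 1-to-1, 1-to-Many, or Many-to-Many.

    `events` is an iterable of (donor, acceptor) gene-name pairs (a hypothetical
    representation, not the format used in the original tagging).
    """
    events = list(events)
    if not events:
        raise ValueError("at least one conversion event is required")
    for donor, acceptor in events:
        if donor == acceptor:
            raise ValueError("donor and acceptor must be different genes")

    donors = {donor for donor, _ in events}
    genes = donors | {acceptor for _, acceptor in events}

    if len(genes) == 2:
        # Conversion is isolated to a single pair of genes.
        return "1-to-1"
    if len(donors) == 1:
        # One gene donates to several different acceptors.
        return "1-to-Many"
    # More than two genes and no single donor.
    return "Many-to-Many"


if __name__ == "__main__":
    print(classify_conversion_pattern([("A", "B")]))
    print(classify_conversion_pattern([("A", "B"), ("A", "C")]))
    print(classify_conversion_pattern([("A", "B"), ("B", "C"), ("C", "A")]))
```

In this sketch a group is only called “1-to-Many” when every recorded event shares the same donor; any other group of more than two genes falls into “Many-to-Many,” which mirrors the case where no single donor can be determined.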
Important for this analysis and for future research is a listing of which genes were involved in the gene conversion. When this information is available (this data is not always within the abstract), it is extracted in whatever detail is provided. Sometimes this gives exact gene names (
Gene conversion does not always encompass the entire gene. In many cases, only a specific region within the gene (for instance the second exon) is involved in the conversion. While this data does not often appear in an abstract, we extracted it when it was available. Specifically we looked for gene conversions that occur within exons, introns,
In our search for abstracts on gene conversions, we also encountered papers whose emphasis was not the identification of gene conversions but rather the presentation of means of identifying them (“algorithms”) or the discussion of how and/or why they occur (“models”). This information was extracted as well.
An important part of gene conversion research is identifying the genetic diseases and disorders that a gene conversion can cause. Due to diseases having different names or abbreviations, we created a unique tag for each disease. So “Congenital Adrenal Hyperplasia” and “CAH” are identified as the same disease. In total 58 unique tags were created, giving us 58 different diseases.
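A minimal sketch of this kind of alias-to-tag normalization is given below. The tag string `CAH`, the dictionary, and the function names are illustrative assumptions; the text does not specify how the unique tags were actually stored.

```python
# Hypothetical alias table: every spelling of a disease maps to one unique tag.
# The three aliases for congenital adrenal hyperplasia are the ones named in the text.
DISEASE_ALIASES = {
    "congenital adrenal hyperplasia": "CAH",
    "cah": "CAH",
    "21-hydroxylase deficiency": "CAH",
}


def disease_tag(mention):
    """Return the unique tag for a disease mention, or None if it is unknown."""
    # Lowercase and collapse whitespace so trivial spelling variants match.
    key = " ".join(mention.lower().split())
    return DISEASE_ALIASES.get(key)


def unique_disease_tags(mentions):
    """Collect the set of unique tags found in a list of mentions."""
    return {tag for tag in map(disease_tag, mentions) if tag is not None}


if __name__ == "__main__":
    found = ["CAH", "Congenital Adrenal Hyperplasia", "21-Hydroxylase Deficiency"]
    print(unique_disease_tags(found))
```

Counting the size of the resulting set, rather than the raw mentions, is what keeps a disease with several names from being counted more than once.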
The final important data element to extract is in which species this gene conversion occurs. Much like with the diseases, a unique tag was created for each species. So “
All abstracts were read through and tagged using the traditional tagging style used in standards such as XML and HTML. So there was an opening tag (
Taxonomy for all species was determined through the taxonomy database of NCBI [
In total 3575 abstracts were downloaded from PubMed that contained the term “gene conversion.” However, only 2478 were tagged as having information on gene conversions. There are several reasons for this. Some files contained only titles with no abstracts, because some papers do not possess abstracts or their abstracts are not present in PubMed. Other papers may have dealt with another “conversion” process that also happened to involve genes. This is because the search algorithm uses a Boolean “AND”; therefore, as long as an abstract contains both “gene” and “conversion,” it will be included. And finally, some papers mentioned gene conversions but this was not the focus of the research being presented. For instance, the
In total, we identified 308 unique species. The actual number of species is likely to be higher. Because exact species names are not always given in the abstract, for the cases where no exact species names are given, we counted the species as one as long as the common species names are the same. For instance, many abstracts say they used mice in their research; so the assumption we are going with is that they used the species
Species Breakdown.
Species | Total Count | Mechanism | Event | N/A |
---|---|---|---|---|
Humans/ | 642 | 115 | 522 | 6 |
Yeast/ | 490 | 463 | 24 | 3 |
Mouse/ | 188 | 89 | 97 | 2 |
Chicken/ | 89 | 54 | 35 | 0 |
Fruit Fly/ | 88 | 45 | 41 | 2 |
E. Coli/ | 46 | 37 | 7 | 2 |
Rabbit/ | 46 | 22 | 23 | 1 |
Rat/ | 34 | 4 | 30 | 0 |
| | 34 | 21 | 13 | 0 |
Fission yeast/ | 31 | 26 | 4 | 1 |
Chimp/ | 20 | 2 | 17 | 1 |
Cow/ | 18 | 4 | 14 | 0 |
| | 18 | 4 | 14 | 0 |
Chinese Hamster/ | 16 | 16 | 0 | 0 |
Gonococci/ | 16 | 11 | 4 | 1 |
| | 15 | 8 | 7 | 0 |
Maize/ | 15 | 6 | 9 | 0 |
| | 13 | 6 | 7 | 0 |
Salmonella/ | 11 | 6 | 5 | 0 |
Asexual Yeast/ | 11 | 7 | 4 | 0 |
Silk Moth/ | 10 | 0 | 10 | 0 |
Tobacco/ | 10 | 7 | 3 | 0 |
| | 9 | 7 | 2 | 0 |
In this table, we have sorted the species by the number of abstracts in which they were found (listed here as Total Count). In addition, we list the number of abstracts that dealt with a mechanism of gene conversion or with a specific gene conversion event, and the number for which we were unable to determine either based on the information given (N/A).
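As a rough illustration of how the counts in this table can be compared across species, the following Python sketch computes the fraction of classifiable abstracts (mechanism plus event, leaving out N/A) that deal with mechanisms. The dictionary and function names are illustrative assumptions and not part of the original analysis; only the first five rows of the table are transcribed, and the sketch does not assume that the three columns sum exactly to Total Count.

```python
# Rows transcribed from the species table: (total, mechanism, event, n/a).
SPECIES_COUNTS = {
    "Humans": (642, 115, 522, 6),
    "Yeast": (490, 463, 24, 3),
    "Mouse": (188, 89, 97, 2),
    "Chicken": (89, 54, 35, 0),
    "Fruit Fly": (88, 45, 41, 2),
}


def mechanism_share(species):
    """Fraction of mechanism abstracts among those tagged mechanism or event."""
    _total, mechanism, event, _na = SPECIES_COUNTS[species]
    classified = mechanism + event
    if classified == 0:
        raise ValueError(f"no classified abstracts for {species}")
    return mechanism / classified


if __name__ == "__main__":
    for name in SPECIES_COUNTS:
        print(f"{name}: {mechanism_share(name):.1%} mechanism")
```

Under these assumptions, yeast comes out as overwhelmingly mechanism-oriented and humans as overwhelmingly event-oriented, in line with the raw counts in the table.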
In addition to the general count, we have listed the count of in what capacity the gene conversion research was done. Was a
Breaking down the species into superkingdoms we found that gene conversion research has been mostly done in eukaryotes (Table
Superkingdom and Eukaryote Kingdom Breakdown.
Superkingdom | Species count | Paper count | Mechanism | Event | N/A |
---|---|---|---|---|---|
Eukaryotes | 247 | 2030 | 990 | 1025 | 15 |
Bacteria | 40 | 137 | 98 | 36 | 3 |
Viruses | 15 | 30 | 22 | 8 | 0 |
Archaea | 1 | 1 | 0 | 1 | 0 |
Kingdom | Species count | Paper count | Mechanism | Event | N/A |
Metazoa | 154 | 1268 | 372 | 887 | 9 |
Fungi | 24 | 589 | 541 | 42 | 6 |
Viridiplantae | 44 | 93 | 34 | 59 | 0 |
In this table, we list how many abstracts fall into superkingdom and kingdom categories (for the kingdom breakdown, we focus on eukaryotes). We list the total count of abstracts as well as a breakdown of how many of these abstracts dealt with gene conversion mechanisms, specific gene conversion events, or whether there was not enough information to determine the type of gene conversion research (N/A).
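The rows of this table can be checked for internal consistency with a short Python sketch. The names used here are illustrative assumptions, and the share is computed relative to the sum of the four superkingdom paper counts rather than the 2478 tagged abstracts, which is a choice made for this example only.

```python
# Rows transcribed from the table: (paper count, mechanism, event, n/a).
ROWS = {
    "Eukaryotes": (2030, 990, 1025, 15),
    "Bacteria": (137, 98, 36, 3),
    "Viruses": (30, 22, 8, 0),
    "Archaea": (1, 0, 1, 0),
    "Metazoa": (1268, 372, 887, 9),
    "Fungi": (589, 541, 42, 6),
    "Viridiplantae": (93, 34, 59, 0),
}
SUPERKINGDOMS = ("Eukaryotes", "Bacteria", "Viruses", "Archaea")


def inconsistent_rows(rows):
    """Names of rows whose paper count differs from mechanism + event + n/a."""
    return [
        name
        for name, (papers, mechanism, event, na) in rows.items()
        if papers != mechanism + event + na
    ]


def superkingdom_share(rows, name):
    """Share of one superkingdom among the papers of all four superkingdoms."""
    total = sum(rows[key][0] for key in SUPERKINGDOMS)
    return rows[name][0] / total


if __name__ == "__main__":
    print("inconsistent rows:", inconsistent_rows(ROWS))
    print(f"eukaryote share: {superkingdom_share(ROWS, 'Eukaryotes'):.1%}")
```

A check of this kind is cheap to run whenever counts are transcribed by hand, and it makes the dominance of eukaryotes in the paper counts explicit.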
Since eukaryotes are overwhelmingly favored, we further broke down this superkingdom into three kingdoms. Of the species, 24 fall into the fungi category, 154 into the metazoa category, and 44 into the viridiplantae category. Of the papers, 589 are in fungi, 1268 in metazoa, and 93 in viridiplantae.
Table
We further broke down the species into classes and orders. In total, the 308 species encompass 44 different classes and 94 different orders. The top five classes include Mammalia, Saccharomycetes, Insecta, Aves, and Gammaproteobacteria. Mammalia and Insecta have more event abstracts (704 versus 251 and 91 versus 59, resp.) while Saccharomycetes and Gammaproteobacteria have more mechanism abstracts (476 versus 29 and 48 versus 16, resp.). Aves had similar counts for both (55 mechanism versus 46 event).
The top five orders include Primates, Saccharomycetales, Rodentia, Diptera, and Galliformes. Primates have a clear bias towards event abstracts (543 versus 117). Saccharomycetales and Galliformes have more mechanism abstracts (476 versus 29 and 54 versus 38, resp.). Both Rodentia and Diptera seem to have a balance between mechanism and event abstracts (110 mechanism versus 125 event and 55 mechanism versus 65 event).
Figure
Trend Analysis Graphs. These graphs show the chronological trend of the tagged abstracts. For all graphs, the
Panels: Total Abstracts; Species Comparison.
To further analyze the chronological trends, we looked at the top three studied species: human (
A trend that appears in every graph is a large increase in the amount of gene conversion research in the early 1980s. Between 1981 and 1986 there is a sharp spike in the total number of gene conversion abstracts. This spike seems to include both mechanism and event research as well as research in all three species.
In an almost contradictory fashion, gene conversions are important for the creation of genetic diversity and the conservation of genetic sequences. Table
Species Breakdown of Diversity and Conservation.
Species | ia | ia/div | ia/con | ig | ig/div | ig/con |
---|---|---|---|---|---|---|
Humans/ | 152 | 142 | 1 | 149 | 22 | 69 |
Yeast/ | 17 | 1 | 1 | 5 | 0 | 4 |
Mouse/ | 7 | 6 | 1 | 44 | 3 | 29 |
Chicken/ | 1 | 1 | 0 | 19 | 4 | 12 |
Fruit Fly/ | 6 | 4 | 0 | 16 | 2 | 11 |
E. Coli/ | 1 | 1 | 0 | 5 | 2 | 1 |
Rabbit/ | 1 | 1 | 0 | 7 | 2 | 4 |
Rat/ | 0 | 0 | 0 | 19 | 1 | 15 |
| | 1 | 0 | 0 | 3 | 0 | 3 |
Fission yeast/ | 2 | 0 | 0 | 0 | 0 | 0 |
Chimp/ | 3 | 2 | 1 | 11 | 2 | 5 |
Cow/ | 1 | 1 | 0 | 9 | 2 | 6 |
| | 1 | 0 | 0 | 0 | 0 | 0 |
Chinese Hamster/ | 0 | 0 | 0 | 0 | 0 | 0 |
Gonococci/ | 0 | 0 | 0 | 0 | 0 | 0 |
| | 0 | 0 | 0 | 5 | 0 | 5 |
Maize/ | 0 | 0 | 0 | 5 | 1 | 4 |
| | 0 | 0 | 0 | 4 | 3 | 0 |
Salmonella/ | 0 | 0 | 0 | 4 | 0 | 3 |
Asexual Yeast/ | 1 | 1 | 0 | 0 | 0 | 0 |
Silk Moth/ | 0 | 0 | 0 | 7 | 0 | 7 |
Tobacco/ | 0 | 0 | 0 | 1 | 0 | 1 |
| | 0 | 0 | 0 | 0 | 0 | 0 |
In this table we list the species with the highest abstract counts and detail the type of gene conversion events they have undergone. Our focus here is on whether the conversion was between two distinct genes or two alleles from the same gene and whether the gene conversion event led to genetic diversity or gene conservation. Ia refers to an interallelic event and ia/div and ia/con refer to interallelic events that lead to genetic diversity and gene conservation, respectively. Ig refers to an intergenic event and ig/div and ig/con refer to intergenic events that lead to genetic diversity and gene conservation, respectively.
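To make the diversity-versus-conservation comparison concrete, the sketch below reports, for a few rows of the table, which outcome dominates among interallelic and among intergenic events. The names and the labels `no data` and `tie` are illustrative assumptions; the counts are transcribed from the table above.

```python
# (ia, ia/div, ia/con, ig, ig/div, ig/con), transcribed from the table above.
DIV_CON = {
    "Humans": (152, 142, 1, 149, 22, 69),
    "Mouse": (7, 6, 1, 44, 3, 29),
    "Rat": (0, 0, 0, 19, 1, 15),
}


def dominant_outcome(diversity, conservation):
    """Return which outcome has more abstracts, or a label when neither does."""
    if diversity == 0 and conservation == 0:
        return "no data"
    if diversity == conservation:
        return "tie"
    return "diversity" if diversity > conservation else "conservation"


def summarize(species):
    """Dominant outcome for interallelic and intergenic events of one species."""
    _ia, ia_div, ia_con, _ig, ig_div, ig_con = DIV_CON[species]
    return {
        "interallelic": dominant_outcome(ia_div, ia_con),
        "intergenic": dominant_outcome(ig_div, ig_con),
    }


if __name__ == "__main__":
    for name in DIV_CON:
        print(name, summarize(name))
```

The comparison uses only the abstracts in which an outcome was tagged, so events without a diversity or conservation tag do not influence the label.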
In total, 1412 abstracts had genes listed that were involved in gene conversion. In addition, 138 abstracts had regions that underwent gene conversion. In this section, we detail the genes that have been extensively studied for gene conversion.
The RAD family of genes (among them
Additional species that had RAD genes research done on them are mouse,
The CYP group of genes (the genes that encode cytochrome P450 enzymes) contains the gene
In humans it is typically linked to diseases, with over 80 abstracts containing study of this gene in
The major histocompatibility complex genes (MHC genes) play important roles in the immune system [
The immune system must be able to defend an organism from a diverse range of pathogens. Therefore a great deal of diversity is needed in the genes involved in the immune system, and gene conversion is an important mechanism through which high diversity is achieved [
In Tables
Gene Conversion Diseases/Disorders Part 1.
Disease/Disorder | Gene 1 | Gene 2 | Pseudogene? | Papers |
---|---|---|---|---|
Phenylketonuria (PKU) | N/A | N/A | N/A | 1 |
Huntington’s Disease | N/A | N/A | N/A | 3 |
Thalassemia | IVS-2 | N/A | N/A | 1 |
APRT Deficiency | APRT | N/A | N/A | 1 |
Congenital Adrenal Hyperplasia/Hydroxylase Deficiency | CYP21 | CYP21P | Yes (CYP21P) | 77 |
Hereditary Persistence of Fetal Hemoglobin | protein S alpha | protein S beta | No | 3 |
Debrisoquine polymorphism | CYPD6 | CYPD7 (CYPD6*2) | No | 3 |
Sickle Cell Anemia | A Gamma | G Gamma | No | 4 |
Gaucher’s Disease | GBA | psGBA | Yes (psGBA) | 6 |
Thrombocytopenia | HLA Class II | N/A | N/A | 1 |
Haemoglobin H Disease | N/A | N/A | N/A | 1 |
Rheumatologic Disease | HLA complex | N/A | N/A | 1 |
Beta Thalassemia | Beta-Globin Locus | N/A | N/A | 3 |
Blue Cone Monochromacy | RCP | GCP | No | 2 |
K36.16 thymoma | N/A | N/A | N/A | 1 |
Rheumatoid Arthritis | DR4 | N/A | N/A | 1 |
Spinal Muscular Atrophy | SMN | SMNtel | No | 16 |
Hypertension | CYP11B2 | CYP11B1 | No | 3 |
Chronic Myeloid Leukaemia (CML) | ABL | N/A | N/A | 1 |
Fragile X Syndrome | FMR1 | FMRa/FRAXAC2 | No | 5 |
Homocystinuria | CBS | N/A | N/A | 1 |
Von Willebrand Disease | VWF | N/A | Yes | 5 |
Myotonic Dystrophy | N/A | N/A | N/A | 3 |
Myeloma | GAU Hyprid Alpha | N/A | N/A | 1 |
Human Complement C4A Deficiency | C4A | C4B | No | 1 |
Neurofibromatosis Type 1 (NF1) | NF1 | NF1 pseudogene | Yes | 3 |
Colorectal Cancer | APRT | N/A | N/A | 2 |
Carbonic anhydrase II deficiency | CA II | N/A | N/A | 1 |
Fanconi Anemia | FAC | N/A | N/A | 2 |
Mucopolysaccharidosis type I Hurler/Scheie | alpha-L-iduronidase | N/A | N/A | 1 |
In this table we list the diseases and disorders associated with gene conversions. In addition, we list the genes involved where applicable (listed here as Gene 1 and Gene 2) as well as whether one was a pseudogene (and which one, if this information was available). Finally, we list the number of abstracts that dealt with the disease/disorder.
Gene Conversion Diseases/Disorders Part 2.
Disease/Disorder | Gene 1 | Gene 2 | Pseudogene? | Papers |
---|---|---|---|---|
Hereditary Neuropathy with liability to Pressure Palsies | N/A | N/A | N/A | 2 |
Charcot-Marie-Tooth disease type 1A | N/A | N/A | N/A | 3 |
Polycystic Kidney Disease | PKD1 | N/A | Yes | 3 |
Autosomal dominant facioscapulohumeral muscular dystrophy | N/A | N/A | N/A | 2 |
Breast Cancer | BRCA1 | BRCA2 | No | 1 |
Hereditary Pancreatitis | PRSS1 | R122H | No | 5 |
non-Hodgkin’s Lymphoma | D6S347 | N/A | N/A | 1 |
Spinocerebellar ataxia type 8 | N/A | N/A | N/A | 2 |
Neural Tube Defects | N/A | N/A | Yes | 1 |
Friedreich’s ataxia | N/A | N/A | N/A | 1 |
Pseudoxanthoma elasticum | ABCC6 | psiABCC6 | Yes (psiABCC6) | 1 |
Incontinentia pigmenti | NEMO/LAGE2 | N/A | N/A | 1 |
Shwachman-Diamond Syndrome | SBDS | SBDSP | Yes (SBDSP) | 6 |
Hypergonadotrophic Hypogonadism | FSHR | N/A | N/A | 1 |
Smith-Magenis Syndrome | N/A | N/A | N/A | 1 |
Human Male Infertility | DAZ genes | N/A | N/A | 2 |
Hemophilia A | F8 | N/A | N/A | 1 |
Chronic Pancreatitis | PRSS1 | PRSS2 | No | 1 |
Campomelic dysplasia | SOX9 | N/A | N/A | 1 |
Machado-Joseph Disease | MJD/SCA3 | N/A | N/A | 2 |
Sodium-sensitive cardiac hypertrophy | CYPB112 | N/A | N/A | 1 |
Obesity | HTR2C | N/A | N/A | 1 |
Velo-cardio-facial syndrome/DiGeorge syndrome | LCR22-2 | LCR22-4 | No | 1 |
Hereditary Nonpolyposis Colorectal Cancer | MLH1 | MSH2 | No | 1 |
Atypical Hemolytic Uremic Syndrome | CFH | CFH1 | No | 1 |
Pyridoxine-responsive Homocystinuria | CBS | N/A | N/A | 1 |
Autosomal Dominant Cataract | CRYBB2 | CRYBB2P1 | Yes (CRYBB2P1) | 1 |
In this table we list the diseases and disorders associated with gene conversions. In addition, we list the genes involved where applicable (listed here as Gene 1 and Gene 2) as well as whether one was a pseudogene (and which one, if this information was available). Finally, we list the number of abstracts that dealt with the disease/disorder.
It has been shown that a genetic disease or disorder can involve a gene conversion with a pseudogene [
So if part (or all) of a pseudogene’s sequence information is transferred to the functional gene, the functional gene may lose function, which might be detrimental to the organism. As can be seen in the disease tables, this is a common cause of genetic diseases/disorders.
However, it is also possible that a gene conversion between two functional genes can lead to a genetic disease/disorder. A slight change in DNA sequence is sometimes all a gene needs to alter its functionality and while the donor gene is generally highly similar to the acceptor gene, there is enough of a difference to be detrimental to the entire organism.
The most studied genetic disease associated with gene conversions is congenital adrenal hyperplasia (OMIM ID: 201910). In our abstract literature analysis, we encountered this disease in 77 papers, although in some papers it was also listed as 21-hydroxylase deficiency. Congenital adrenal hyperplasia can take on many forms but is most often associated with altered production of sex steroids and altered development of sex organs. While it can have many genetic causes, it has been shown to be caused in many situations by a gene conversion between the
Much research has been devoted to gene conversions and bacteria [
The following bacteria were identified as using gene conversions to achieve adaptation:
Eukaryotic microorganisms have also been shown to use gene conversion for adaptation, for example,
In Table
Further analyses.
Tagged data | Paper counts |
---|---|
Region | |
Exon | 12 |
Intron | 7 |
| 1 |
| 0 |
Amount | |
1-to-1 | 187 |
1-to-Many | 3 |
Many-to-Many | 36 |
Type | |
Algorithm | 4 |
Model | 4 |
In this table we list additional data that was gathered in this project and the number of papers in each category. Region refers to the region of the gene on which the gene conversion occurred. Amount refers to the number of genes involved in gene conversion. Type refers to whether the papers dealt with algorithms or models.
As expected, the majority of gene conversions found (in which we were able to clearly establish the amounts) were 1-to-1. This is unsurprising, as gene conversion is by definition a process that involves only two genes. However, we did find papers that established larger trends of gene conversion, most often involving gene conversion events that lead to conservation. On a few occasions (3), one gene is used as the primary donor to two or more genes. More often (in 36 abstracts), a group of genes maintains sequence similarity by engaging in multiple gene conversions over a long period of time.
In total, we found 2478 abstracts in PubMed that detail research on the causes and outcomes of gene conversions. As can be seen by looking at the chronological trends in Figure
Gene conversion research can be found in all three superkingdoms (Eukaryota, Bacteria, and Archaea) as well as viruses. The fact that it encompasses all types of life shows how universal the process is. Unfortunately, little research exists on Archaea. Only one abstract is on gene conversion in
The fact that 308 different species were found is also significant. Although close to 2/3 of the species appear in only one abstract, we can safely conclude that gene conversion is indeed a widespread process. The type of research is also relatively widespread: 127 species have had gene conversion mechanism research done on them, and 234 species have had gene conversion events identified.
Interesting also is how many species have gene conversions associated with the same genes. 7 species were shown to use RAD genes as part of the gene conversion process (although it is very likely that this number is higher). The CYP genes undergo deleterious gene conversions in humans and 9 other species (although these do not cause any known diseases). Meanwhile, the MHC genes are associated with gene conversions in 22 species and immunoglobulin genes have 19 species that exhibit gene conversions with them. It is therefore very likely that more of these patterns of overlap will be found as gene names become more standardized and listings of orthology increase.
The effect of gene conversion on the evolution of genes can be both long-term and short-term. There are two main categories for the long-term evolutionary effects of gene conversion: genetic diversity and gene conservation. As can be seen by the data in Table
The exceptions to this rule are of particular interest, and more research in this regard could yield interesting results. Table
In total, more intergenic gene conversions (418, encompassing 150 species) were identified than interallelic (208, encompassing 106 species). This is most likely due to the fact that the identification of intergenic gene conversions is relatively easier than that of interallelic gene conversions. Due to the high similarity between alleles, it is often difficult to determine which two alleles combined to make a new one.
The immediate short-term effect that is evident from our results is the number of diseases and disorders associated with gene conversion research. Fifty-eight diseases is a large number, and it is in fact larger than what was found in a recent review of human gene conversions [
Another short-term effect is the use of gene conversions to achieve genetic diversity. Here we have two competing sides: bacteria and other pathogenic microorganisms that use it to adapt and immune system genes that use it to create diverse B cells to combat those pathogens. While mutation would work as well (and immunoglobulin genes do use hypermutation), this type of “templated mutation” ensures a more structured alteration of the genetic sequence. It is not random like regular mutation, thus ensuring better, quicker results.
Out of the 2478 abstracts containing gene conversion research, more than 25% deal with gene conversions in humans. If we count yeast (
Humans (
The other species listed here are model organisms. Of particular interest are the yeast species. Three types of yeast are listed in Table
Naturally, there are problems that stem from this bias. If one just looks at the raw counts in Table
Another problem stems from the detrimental effects of gene conversions. Due to the sheer number of examples, it is clear that gene conversions can lead to diseases and disorders in humans. However, we found no diseases/disorders in other species. Does this mean that gene conversion does not cause these detrimental effects in other species? Also, with one exception, those bacteria and microorganisms that have their adaptation linked with gene conversions can be pathogenic. Is this process more prevalent in pathogenic organisms or is it just that pathogenic organisms are more studied due to the fact that they cause diseases? Clearly more gene conversion research is needed to address these issues and give us a better idea of the whole process.
While we were able to extract a significant amount of data from the abstracts we collected, there were some shortcomings to this process. Oftentimes we were not able to identify the exact species (i.e., “Primates” or “Plants” were listed) and other times exact gene names were difficult to find (if they were there at all). And as can be seen in Table
Genetic research is increasing exponentially and our own trend analyses of gene conversion research (Figure
Because we tagged the terminology related to gene conversion in all of these abstracts, this manual tagging process has generated a large amount of data. We can use these data for both training and testing machine learning algorithms that predict, from an abstract, things such as the genes involved in a gene conversion and the consequences of the gene conversion at both the gene sequence level and the phenotypic level (that is, diseases or pathogenicity). Similarly, we can expand our work to mining entire research papers. In this way, we can create a database of gene conversion data including species, genes, and diseases/disorders.
Furthermore, we hope to use this data to facilitate the identification of gene conversions. With the increasing number of sequenced genomes, it would be ideal if we could use a software solution to automatically predict gene conversions. Our own research has used an ensemble of existing gene conversion identification programs in addition to rare-class learning techniques to identify gene conversions, and the results have been promising [