In Silico Expressed Sequence Tag Analysis in Identification of Probable Diabetic Genes as Virtual Therapeutic Targets

Expressed sequence tags (ESTs) are a major resource for gene discovery, transcript characterization, and single nucleotide polymorphism (SNP) analysis, as well as for functional annotation of putative gene products. In our search for novel diabetic genes as virtual targets for type II diabetes, we searched various publicly available databases and found 7 reported genes. In silico EST analysis of these genes produced 6 consensus contigs that matched a number of chromosomes of the human genome. Conceptual translation of these contigs produced 3 protein sequences. Functional and structural annotation of these proteins revealed features that may lead to the discovery of novel therapeutic targets for the treatment of diabetes.


Introduction
To understand the behavior and function of biological processes, it is important to have a clear picture of the genes and gene products involved, as evidenced by the regulatory interactions of DNA, RNA, and proteins. Rapid advances in technologies such as microarrays, sequencing, and spectrometry have contributed vast amounts of data for analysis and prediction in genomics and proteomics.
Expressed sequence tags (ESTs) are short stretches of nucleotide sequence (200-800 bases) derived from cDNA libraries. They can be used to identify the full-length complementary gene and are mostly used to identify expressed genes. The EST generation process involves single-pass sequencing of either the 5′ end or the 3′ end of random clones from a cDNA library of an organism. Because each EST needs only a single sequencing reaction, and DNA isolation, sequencing, and analysis are automated, many ESTs can be generated at a time.
Since their original description and use as primary resources in human gene discovery [1], ESTs have grown exponentially in number in various public databases, and this growth will continue as long as sequencing projects are funded. Although the original ESTs were of human origin, large numbers of ESTs have also been isolated from model organisms such as Caenorhabditis elegans, Drosophila, rice, and Arabidopsis. Public databases such as dbEST [2], TIGR Gene Indices [3], and UniGene [4][5][6] now contain ESTs from a number of organisms for research and analysis. In addition, several commercial establishments maintain privately funded, in-house collections of ESTs that are available for research. At present, ESTs are widely used throughout the genomics and molecular biology communities for gene discovery, complementing genome annotation, mapping, polymorphism analysis, gene prediction, gene structure identification, and expression studies; they also help establish the viability of alternative transcripts and facilitate proteome analysis.
Diabetes is a metabolic disorder characterized by hyperglycemia, glucosuria, negative nitrogen balance, and sometimes ketonemia. The clinical complications associated with it include retinopathy, neuropathy, and peripheral vascular insufficiencies. Overweight populations with sedentary lifestyles are more prone to diabetes. A recent study reports that it affects 150 million people and that almost 300 million more will be diabetic by the year 2025 [7]. Of the three major types of diabetes, non-insulin-dependent diabetes (type II diabetes or NIDDM) accounts for 90-95% of the diagnosed cases of the disease. There is no single approach to treating this disease, and a combination of therapeutic approaches is usually adopted. The worldwide epidemic of type II diabetes has led to the development of new strategies for its treatment.
The discovery of the nuclear receptors known as peroxisome proliferator-activated receptors (PPARs) heralded a new era in understanding the pathophysiology of insulin receptors and its related complications [8]. PPARs are known to be the receptors for the fibrate class of hypolipidemic agents, while PPARγ agonists reduce hyperglycemia without increasing insulin secretion. Other validated targets include protein tyrosine phosphatase-1B (PTP1B) and glycogen synthase kinase-3 (GSK-3). PTP1B is a cytosolic phosphatase with a single catalytic domain [9]. In vitro, it is a nonspecific PTP and dephosphorylates a wide variety of substrates. In vivo, it is involved in downregulation of insulin signaling by dephosphorylating specific phosphotyrosine residues on the insulin receptor. GSK-3 is a protein kinase that mediates the phosphorylation of certain serine and threonine residues in particular cellular substrates. This phosphorylation mainly inhibits the target proteins; in glycogenesis, for example, it inhibits glycogen synthase [10][11][12]. While much research is focused on validated targets such as PTP1B, PPARs, and GSKs, this paper aims to identify novel diabetic genes as virtual target(s). The approach is purely in silico and relies on analysis of ESTs available in public databases.

Materials.
Databases such as dbEST, TIGR Gene Indices, and UniGene are the most useful resources, containing raw and clustered ESTs for many organisms. dbEST, maintained by NCBI, is the largest repository of EST data. The TIGR Gene Indices of DFCI list the ESTs of many organisms alphabetically. NCBI's UniGene contains gene-oriented clusters of transcript sequences obtained by alignments between transcript sequences and genomic sequences originating from the same gene. The current information content of these three databases is shown in Table 1.
To initiate the in silico analysis, the UniGene database was searched for human diabetes gene clusters. The search returned seven gene entries, whose mRNA and EST information is listed in Table 2.
The ESTs of all seven gene entries were downloaded, and only those originating from pancreas and liver tissue were taken for analysis. Only the 5′ ESTs were considered, as ESTs generated from the 3′ end are the most error prone because of the low base-call quality at the start of sequence reads. There were no ESTs of pancreatic or hepatic tissue origin for the gene entry "Aquaporin 2 (AQP2)." Thus we obtained a total of 34 ESTs from six reported gene entries, as listed in Table 3.
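The selection rule described above (keep only 5′ reads whose tissue of origin is pancreas or liver) can be sketched as a simple filter. This is a minimal illustration only: the record fields (`accession`, `gene`, `tissue`, `read_end`) and the sample entries are hypothetical and do not reflect the UniGene record format or the actual 34 ESTs of Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstRecord:
    accession: str  # hypothetical identifier, not a real accession
    gene: str
    tissue: str
    read_end: str   # "5'" or "3'"


TARGET_TISSUES = {"pancreas", "liver"}


def select_ests(records):
    """Keep 5' ESTs whose tissue of origin is pancreas or liver."""
    return [
        r for r in records
        if r.read_end == "5'" and r.tissue.lower() in TARGET_TISSUES
    ]


# Made-up records used only to exercise the filter.
demo = [
    EstRecord("EST_A", "GCK", "Liver", "5'"),
    EstRecord("EST_B", "GCK", "pancreas", "3'"),
    EstRecord("EST_C", "AQP2", "kidney", "5'"),
    EstRecord("EST_D", "ICA1", "pancreas", "5'"),
]
selected = select_ests(demo)
```

With such a filter, a gene whose ESTs all come from other tissues (as reported for AQP2) simply contributes no records to the downstream analysis.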

EST Pre-Processing. EST sequences are often of low quality because they are generated automatically without verification and thus contain higher error rates. ESTs are also contaminated by vector sequences, because part of the vector is sequenced along with the EST. These sequences should be removed from the ESTs to reduce overall redundancy and to improve the efficacy of further analysis. Comparing ESTs with nonredundant vector databases identifies the contamination, which is deleted prior to analysis; for example, vector contamination can be detected by searching the EMVEC database [13,14] with NCBI BLAST2 [15,16]. UniGene clusters were an obvious choice for our analysis, as each cluster is generated from the combined information of dbEST, the GenBank mRNA database, and electronically spliced genomic DNA. Furthermore, the clusters are already cleaned of contamination (by bacterial vector sequences or by linker sequences).

EST Clustering and Assembly. The purpose of EST clustering is to collect overlapping ESTs from the same transcript of a single gene into a unique cluster to reduce redundancy. This is important because all the expressed data coming from a single gene are grouped into an index class that represents the information for that particular gene. Clustering or assembly is mainly done by pairwise sequence similarity search between sequences, and it consists of three major phases. In the first phase, poor-quality regions at the 5′ and 3′ ends of reads are identified and removed. Then the overlaps between the sequences are calculated, and false overlaps are identified and removed. In the second phase, reads are joined to form contigs in decreasing order of overlap scores. Then forward-reverse constraints are used to make corrections to the resulting contigs. In the third phase, a multiple sequence alignment of reads is constructed, and a consensus sequence, along with a quality value for each base, is computed for each contig. Base quality values are used in the computation of overlaps and the construction of multiple sequence alignments. The tissue-based ESTs from the six reported genes were subjected to cluster analysis on the CAP3 Server [17]. The submitted ESTs, along with their gene names and resulting contigs, are listed in Table 4.
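The second phase described above (joining reads in decreasing order of overlap score) can be illustrated with a toy greedy assembler. This is a sketch under strong simplifying assumptions and not the CAP3 algorithm: it uses exact suffix-prefix matches only, ignores base quality values, reverse complements, and forward-reverse constraints, and the function names and `min_len` threshold are illustrative choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining pair overlaps by at least min_len
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that tile a single short sequence.
demo_reads = ["ATGGCCTTA", "CCTTAGGAC", "GGACTTTGA"]
demo_contigs = greedy_assemble(demo_reads)
```

Reads that share no sufficiently long overlap with any other sequence remain as separate entries in the returned list, which loosely corresponds to ESTs that stay unassembled after clustering.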

Database Similarity Searches. The consensus sequences or contigs (putative genes) obtained from clustering are useful only if their function is ascertained, which is possible through database similarity searches using freely available tools such as BLASTN and BLASTX. For transcriptome analysis, the ESTs are additionally aligned to the genome sequence of the organism using specialized programs such as BLAT (BLAST-like alignment tool) [18] to assist genome mapping and gene discovery. The 6 contigs generated from 4 genes (GCK, AVPR2, ICA1, and SOX13) were subjected to BLAT analysis with the following parameters: genome, human; assembly, Feb. 2009 (GRCh37/hg19); query type, translated DNA; sort output, score; output type, hyperlink. The outputs are listed in Table 5.

Conceptual Translation of ESTs. EST sequence data are informative only when their ontology, structure, and function are known; for this purpose, the ESTs are linked to protein-centric annotations through accurate and robust polypeptide translations. The rationale is that polypeptides serve as better templates for the identification of domains and motifs, for the study of protein localization, and for the assignment of gene ontology. Translation of ESTs begins by identifying the protein-coding regions, or open reading frames (ORFs), in the consensus sequences or contigs. Here all 6 reported contigs were submitted to the ESTScan2 tool [19,20] with the following parameters: format, plain text; species, human; insertion/deletion penalty, −50; output, protein. A graphical view of the 6 resulting proteins, obtained with BioEdit [21], is shown in Figure 1. From these proteins, only 3 long continuous sequences (GCK liver, GCK pancreas, and ICA1 liver) were selected for further structural and functional annotation.
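The idea of locating a protein-coding region in a contig can be illustrated with a naive six-frame translation that reports the longest stop-free stretch. This is only a sketch and not the method used by ESTScan2, which scores coding potential and tolerates insertions and deletions; here frameshifts are not corrected, codons containing ambiguous bases are written as X, and the function names and example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Map (strand, frame offset) to the translated protein string."""
    seq = seq.upper()
    strands = (("+", seq), ("-", reverse_complement(seq)))
    return {(strand, frame): translate(s[frame:])
            for strand, s in strands
            for frame in range(3)}


def longest_orf(seq):
    """Longest stop-free stretch over all six frames, with its frame key."""
    best = ("", None)
    for key, protein in six_frame_translations(seq).items():
        for segment in protein.split("*"):
            if len(segment) > len(best[0]):
                best = (segment, key)
    return best


frames = six_frame_translations("ATGGCCAAGTAA")  # made-up example sequence
```

On short or noisy sequences, the longest stop-free stretch may fall on the wrong strand or frame, which is one reason dedicated tools that model coding bias are preferred for real EST data.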

Functional Annotation. The function of a putative polypeptide is predicted by matching it against nonredundant databases of protein sequences, motifs, and families; this is because proteins act as better templates for functional annotation implementing multiple-sequence alignment, profile, HMM generation, phylogenetic analysis, domains, and motif

Results and Discussion
Current EST analysis includes several steps, and a wide range of computational tools is available for each step, each with different strengths and each systematically generating vital information. There is still some disagreement and confusion about selecting suitable tools for the individual steps of EST analysis and for subsequent annotation at the DNA and protein levels. In our EST analysis for identification of novel diabetic genes as virtual targets for type II diabetes, we followed a much-cited procedure described by Nagaraj et al. [23]. Among several successful and widely accessed EST databases, UniGene was selected because it uses mRNA and other coding sequence data from GenBank [24] as reference sequences for cluster generation. The UniGene clusters are updated weekly to keep pace with the ever-increasing EST data in GenBank. UniGene stores all gene isoforms in a single cluster and does not generate consensus sequences. After searching UniGene for human diabetic genes, the EST sequences of pancreatic and hepatic origin were selected because these tissues are closely associated with, and functionally central to, the onset and progression of diabetes. Only the 5′ ESTs of six genes were considered for analysis, as ESTs generated from the 3′ end are the most error prone. After clustering the ESTs of each gene, we found 1 contig each of hepatic and pancreatic origin for GCK. Translation of all six contigs in ESTScan2 provided six protein sequences, of which we considered only three (GCK liver, GCK pancreas, and ICA1 liver) as best for our analysis. The remaining three sequences were left out because of erroneous readings (X, which does not code for any amino acid or refers to a stop codon) in their sequences.
Thus the three proteins were GCK liver, a protein of 136 amino acids with a molecular weight of 15474.87 Daltons; GCK pancreas, a protein of 313 amino acids with a molecular weight of 34694.19 Daltons; and ICA1 liver, a protein of 270 amino acids with a molecular weight of 31690.83 Daltons. These three proteins were named hypothetical protein 1, hypothetical protein 2, and hypothetical protein 3 for further annotation.
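Molecular weights such as those quoted above are conventionally computed from the amino acid sequence by summing average residue masses and adding one water molecule for the free termini. The sketch below shows that calculation; it is an assumption about how such values are typically obtained, since the text does not state which tool was used, and the translated sequences themselves are not given here, so the reported values are not reproduced.

```python
# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water added for the N- and C-terminal groups


def molecular_weight(protein):
    """Average molecular weight (Da) of an unmodified linear polypeptide."""
    protein = protein.upper()
    unknown = sorted(set(protein) - set(RESIDUE_MASS))
    if unknown:
        # Ambiguity codes such as X have no defined mass.
        raise ValueError(f"cannot weigh residues: {unknown}")
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER
```

Rejecting sequences that contain X mirrors the decision described above to set aside translations with erroneous readings, since an undefined residue makes the computed mass meaningless.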

The Hypothetical Protein 1.
We identified it from 5′ ESTs of liver tissue; it belongs to the hexokinase family of proteins, with distinct N-terminal and C-terminal regions. It is involved in primary metabolic processes such as glycolysis and helps in the ATP-dependent conversion of aldohexose and ketohexose sugars to hexose-6-phosphate. Its main function is carbohydrate kinase activity in metabolic pathways such as the pentose phosphate pathway, fructose and galactose metabolism, and glycolysis. It contains two structurally similar domains represented by PFAM families PF00349 [25] and PF03727 [26]. In the CATH structural classification, it belongs to lineage 3.30.420.40, comprising class 3 (alpha beta), architecture 3.30 (2-layer sandwich), and topology 3.30.420 (nucleotidyltransferase; domain 5).

The Hypothetical Protein 2.
We identified it from 5′ ESTs of pancreas tissue; it also belongs to the hexokinase family of proteins, with distinct N-terminal and C-terminal regions. It is involved in primary metabolic processes such as glycolysis and helps in the ATP-dependent conversion of aldohexose and ketohexose sugars to hexose- motif 6, contain amino acids that project into or near the ATP/sugar-binding pocket. Previously we assumed that the glucokinase (GCK) gene is expressed and functions irrespective of tissue type, but it now appears that there exist some structural differences between the tissue forms, although they function in the same way.

The Hypothetical Protein 3.
We identified it from 5′ ESTs of liver tissue; it belongs to a family of proteins containing an arfaptin domain, with distinct N-terminal and C-terminal regions. The arfaptin domain interacts with ARF1 (ADP-ribosylation factor 1), a small GTPase involved in vesicle budding at the Golgi complex and immature secretory granules. The structure of arfaptin shows that, upon binding to a small GTPase, arfaptin forms an elongated, crescent-shaped dimer of three-helix coiled coils [27]. The N-terminal region of ICA69 is similar to arfaptin [28]. It is involved in neurological system processes and in the secretion of several neurotransmitters. It is also involved in a cellular process like cell communication and  [36]. The statistical information gathered from the Ramachandran plots revealed that 95.10%, 93.60%, and 95.70% of the residues were in the allowed region for hypothetical protein 1, hypothetical protein 2, and hypothetical protein 3, respectively, at an average resolution of 2.68 Å. Thus the models were well suited for structural annotation. Hypothetical protein 1 had 4 α-helices and 3 β-sheets arranged in a 2-layer sandwich model (Figure 2(a)). Hypothetical protein 2 had 16 α-helices and 2 β-sheets arranged in a 3-layer sandwich model (Figure 2(b)). Hypothetical protein 3 had only 7 α-helices, forming an up-down bundle model (Figure 2(c)). The structural annotations from InterProScan and from the subsequent modeling were therefore consistent, which supports our work.
In general, glucokinase occurs in human liver, pancreas, gut, and brain cells and plays an important role in the regulation of carbohydrate metabolism. It works as a glucose sensor and triggers shifts in metabolism or cell function in response to rising or falling levels of glucose. Mutations of the gene for this enzyme cause several forms of diabetes or hypoglycemia. Human islet cell autoantigen 1 protein is encoded by the ICA1 gene [37,38]. This protein contains an arfaptin domain and is found both in the cytosol and bound to membranes of the Golgi complex and immature secretory granules. It also acts as an autoantigen in insulin-dependent diabetes mellitus. Our in silico analysis revealed three new proteins, of which two (hypothetical protein 1 and hypothetical protein 2) were functionally similar to glucokinase and one (hypothetical protein 3) was functionally similar to human islet cell autoantigen 1. Because of their association with diabetes, these can be treated as virtual therapeutic targets for the treatment of diabetes. There were structural variations between these three proteins and their functional homologues, which need further structural analysis and interpretation.

Conclusion
The in silico EST analysis of seven reported genes associated with diabetes produced 6 consensus contigs, which were annotated functionally and structurally. The functional annotations were similar to those of the corresponding proteins under which the ESTs were originally categorized. The structural annotations revealed variation that may be due to differences in source tissue type. This information can be used for further structure-based annotation and for new drug design for the treatment of diabetes.