Bioinformatics Approach for Prediction of Functional Coding/Noncoding Simple Polymorphisms (SNPs/Indels) in Human BRAF Gene

This study examined Homo sapiens simple variants (SNPs/indels) in the coding and noncoding regions of the BRAF gene. Variant data were obtained from the database of SNP (dbSNP) as of its November 2015 update. Several bioinformatics tools were used to identify SNPs and indels that may affect protein function, structure, and expression. For coding polymorphisms, 111 SNPs were predicted to be highly damaging and six others less so. In the 3′ UTR, five SNPs and one indel were predicted to alter microRNA binding sites, whereas no SNP or indel in the 5′ UTR was predicted to alter a transcription factor binding site. For the 5′/3′ splice sites, one SNP within a 5′ splice site and one indel within a 3′ splice site showed potential alteration of splicing. In conclusion, these predicted functional SNPs and indels could alter the gene and may contribute, directly or indirectly, to the occurrence of many diseases.


Introduction
Genetic alterations (mutations) can be divided into two general categories: inheritable (germline mutations), with 2% to 4% occurrence, and sporadic (somatic mutations) [1,2]. The BRAF gene, a member of the RAF family, is located on chromosome 7 (7q34), in the region from 140,715,951 to 140,924,764 base pairs, which covers approximately 190 kb. It is composed of 18 exons, and its protein product is named "B-Raf proto-oncogene serine/threonine protein kinase." This protein belongs to the raf/mil family, which plays a role in regulating the MAP kinase/ERK signaling pathway, which in turn affects cell division, differentiation, and secretion [3]. Several studies have reported the prevalence of BRAF mutations in various cancers, including non-Hodgkin lymphoma, colorectal cancer, malignant melanoma, thyroid carcinoma, non-small-cell lung carcinoma, and adenocarcinoma of the lung [3–5]. Mutations in this gene have also been associated with other diseases, such as cardiofaciocutaneous syndrome (a disease characterized by heart defects, mental retardation, and a distinctive facial appearance), Noonan syndrome, multiple lentigines syndrome or LEOPARD syndrome, giant congenital melanocytic nevus, and Erdheim-Chester disease [6,7].
Single nucleotide polymorphisms (SNPs) are single-base changes in the DNA sequence with an allele frequency of 1% or greater in a population. They occur throughout the genome at a frequency of about one in every 1000 nucleotides and are considered the simplest and most common type of genetic marker underlying DNA variation among individuals [8]. Nonsynonymous SNPs (nsSNPs) are an important class of coding SNPs that contribute to the diversity of encoded human proteins; they can affect gene regulation by altering DNA and transcription factor binding, the maintenance of the structural integrity of the cell, and protein function in different signal transduction pathways [9]. About 2% of all known single nucleotide variants associated with genetic diseases are nonsynonymous SNPs, which contribute to the functional diversity of the encoded proteins in the human population [10]. SNPs may be responsible for genetic diversity, evolutionary processes, differences in traits, drug response, and complex and common diseases such as diabetes, hypertension, and cancers. Therefore, identifying and analyzing the numerous SNP variations in genes may help in understanding their effects on gene products and their association with diseases, and could also support the development of new medical testing markers and individualized medication [11].
The 1000 Genomes Project showed that most human genetic variation is represented by SNPs. The database of SNP (dbSNP) is one of the main databases serving as a central, public store for genetic variation since its initiation in September 1998 [12]. Any laboratory or individual can use the indexed variation, the sequence information around a polymorphism, and the specific experimental conditions for further research applications. As with all NCBI resources, the data within dbSNP are available for free and in a variety of forms. On November 17, 2015, dbSNP contained 160508575 Homo sapiens variants, of which 144205811 were SNPs and 16064552 were indels (single or multiple insertions/deletions). dbSNP contains the results of the HapMap and 1000 Genomes Projects (http://www.ncbi.nlm.nih.gov/snp/). In noncoding regions (3′ UTR, 5′ UTR), polymorphisms such as SNPs in microRNA (miRNA) binding sites on mRNA, called mirSNPs, can affect miRNA function and hence gene expression, resulting in many human diseases such as cancers [13]. Identifying the SNPs responsible for phenotypic change is difficult because it requires multiple testing of different SNPs in candidate genes [9]. One possible way to overcome this problem is to prioritize SNPs according to their structural and functional significance using different bioinformatics prediction tools. This study focused on functional simple polymorphisms (SNPs/indels) within the coding region, 5′ UTR, 3′/5′ splice sites, transcription factor binding sites, and miRNA binding sites of the human BRAF gene.

Materials and Methods
SNPs located in the target gene were obtained from the database of SNPs (dbSNP), a public-domain archive for a broad collection of simple genetic polymorphisms. This collection includes single-base nucleotide substitutions (SNPs) and small-scale multibase deletions or insertions (indels). dbSNP classifies SNPs and indels as 3′/5′ UTR, 3′/5′ splice site, coding synonymous, intronic, and nonsynonymous, the last comprising missense, nonsense, stop-gain, and frameshift variants. In this study, Homo sapiens SNPs and indels (single insertions or deletions) within coding (nonsynonymous), 3′/5′ UTR, and 3′/5′ splice site regions were selected and submitted to bioinformatics tools for further investigation. The distribution of single variants is shown in Table 1.
In the overall analysis workflow, missense SNPs were analyzed using three tools (SIFT server, PolyPhen, and SNAP2), and SNPs predicted as functional or damaging by all three servers are listed in Table 2. More information about these triple-predicted SNPs is shown in Table 3. Frameshift variants were analyzed using the SIFT server. 3′ UTR SNPs and indels were analyzed with the PolymiRTS database (Table 6), 5′ UTR SNPs (in transcription factor binding sites) with the PROMO tool (Table 7), and 3′/5′ splice site SNPs and indels with the HSF tool (Table 8).

SIFT (Sorting Intolerant from Tolerant) Server.
SIFT is an online bioinformatics server used to predict the damaging effect of nucleotide substitutions and frameshifts (insertions/deletions) on protein function. The prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, under the main assumption that evolutionarily conserved regions tend to be less tolerant to mutations, so mutations in these regions are more likely to affect function [14]. The SIFT server accepts several input types: dbSNP reference number (rs ID number), protein sequence, and chromosome location. For this tool, coding SNPs and indels were separated from the total and submitted as rs ID numbers for missense, nonsense, and stop-gain SNPs and as chromosome locations for frameshift indels. SIFT assigns each residue a score from 0 to 1, where a score ≤0.05 is considered by the algorithm to indicate a damaging amino acid substitution and a score >0.05 predicts tolerance [15]. SIFT version 5.2.2 is available at http://sift.bii.a-star.edu.sg/index.html.
Table fragment (columns in order: SNP ID, chromosome position, alleles, protein ID, amino acid change; incomplete rows are left incomplete):
[...]  [...]  A|G  ENSP00000418033  F23S
rs121913337  140453153  A|T  ENSP00000418033  D22E
rs121913362  140453159  T|C  ENSP00000418033  I20M
rs121913365  140453132  T|G  ENSP00000418033  K29N
rs180177042  140449165  [...]
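The SIFT cutoff described above can be sketched in a few lines of Python. This is only an illustration of the ≤0.05 rule stated in the text, not the SIFT algorithm itself; the function name `classify_sift` and the example scores are hypothetical.

```python
SIFT_DAMAGING_CUTOFF = 0.05  # threshold stated in the text


def classify_sift(score: float) -> str:
    """Label a SIFT score (range 0 to 1) as 'damaging' or 'tolerated'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie between 0 and 1, got {score}")
    # Scores at or below the cutoff are treated as damaging.
    return "damaging" if score <= SIFT_DAMAGING_CUTOFF else "tolerated"


# Hypothetical scores, for illustration only (not results from this study).
example_scores = {"variant_A": 0.00, "variant_B": 0.05, "variant_C": 0.32}
labels = {name: classify_sift(score) for name, score in example_scores.items()}
print(labels)
```

Note that a score of exactly 0.05 falls on the damaging side of the cutoff, as the text specifies.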

PolyPhen-2 (Polymorphism Phenotyping) Server.
An online bioinformatics server that automatically predicts the effect of an amino acid substitution caused by an nsSNP on the structure and function of a protein, using a comparative method. PolyPhen searches several protein databases for protein 3D structures, multiple alignments of homologous sequences, and amino acid contact information. It calculates position-specific independent count (PSIC) scores for each of the two variants and then computes the difference between them, where a higher PSIC score difference indicates that the substitution is more likely to have a functional impact [16]. The PolyPhen-2 outcome is one of the following: probably damaging, possibly damaging, or benign, with a score ranging from 0 to 1 [9]. The PolyPhen server is available at http://genetics.bwh.harvard.edu/pph2/index.shtml.

SNAP2 Server.
SNAP2 is a trained classifier based on a machine learning method called a "neural network." It distinguishes between effect and neutral variants/nonsynonymous SNPs by taking a variety of sequence and variant features into account. The most important input signal for the prediction is the evolutionary information taken from an automatically generated multiple sequence alignment. Structural features such as predicted secondary structure and solvent accessibility are also considered. If available, annotation (i.e., known functional residues, patterns, and regions) of the sequence or of close homologs is also included. SNAP2 predicts a score ranging from −100 (strong neutral prediction) to +100 (strong effect prediction), and analysis suggests that the prediction score is to some extent correlated with the severity of the effect [17] (https://rostlab.org/services/snap/). From the functional nsSNPs predicted by all three tools (SIFT server, PolyPhen, and SNAP2), the 15 nsSNPs with the highest prediction scores were selected for further analysis.

I-Mutant Suite.
I-Mutant version 3.0 is a suite of support vector machine-based predictors integrated in a single web server. It predicts protein stability changes upon single-site variations from the protein structure or sequence. The I-Mutant result is reported as follows: DDG < 0, decreased stability; DDG > 0, increased stability; DDG = 0, neutral [18]. I-Mutant 3.0 is available at http://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0.cgi.
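The sign convention for DDG given above can be written as a small helper. This is a minimal sketch of the three-way rule as stated in the text; the function name `classify_ddg` and the example values are hypothetical and are not I-Mutant output.

```python
def classify_ddg(ddg: float) -> str:
    """Map a predicted stability change (DDG) to the label used in the text."""
    if ddg < 0:
        return "decrease stability"
    if ddg > 0:
        return "increase stability"
    return "neutral"


# Hypothetical DDG values, for illustration only.
for value in (-1.25, 0.0, 0.4):
    print(value, "->", classify_ddg(value))
```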

CPH Models.
CPH models is a protein homology modeling server used to predict the 3D structure of proteins with no known 3D structure; its template recognition is based on profile-profile alignment guided by secondary structure and exposure predictions [19]. The required protein sequences were submitted to the CPH server to obtain models as PDB files (for structures that could not be predicted by the automated Project HOPE server). The resulting PDB files were opened and visualized with the Chimera program (http://www.cbs.dtu.dk/services/CPHmodels/).

UCSF Chimera Model Software.
Chimera is a high-quality, extensible molecular graphics program designed for interactive visualization and analysis of molecular structures and related data [20]. This software was produced by the University of California, San Francisco [9]. Chimera was used to obtain high-quality images of, first, the whole protein 3D structures for protein IDs ENSP00000288602, ENSP00000418033, and ENSP00000419060 (Figure 1) and, second, the native and mutant residues for mutations that could not be processed by the automated Project HOPE server (Figure 2) (http://www.cgl.ucsf.edu/chimera/).

Automatic Protein Structural Analysis and Information Using HOPE Server.
HOPE is an automatic mutant analysis server that can provide insight into the structural effects of a mutation. HOPE collects information from a wide range of sources, including calculations on the 3D coordinates of the protein using WHAT IF Web services, sequence annotations from the UniProt database, and predictions by DAS services. Homology models are built with YASARA. Data are stored in a database and used in a decision scheme to identify the effects of a mutation on the protein's 3D structure and function. HOPE builds a report with text, figures, and animations that is easy to use and understandable for (bio)medical researchers [21] (http://www.cmbi.ru.nl/hope/method) (Figure 2).
Figure 1: Tertiary structure backbones and secondary structures (alpha helix, beta sheet, and random coil) of the proteins related to the most deleterious nsSNPs (ENSP00000288602, ENSP00000418033, and ENSP00000419060), generated using the CPH models 3.2 server and Chimera software. The ID numbers below the figures relate to protein sequence records in the UniProt database.

PolymiRTS Database (3′ UTR).
PolymiRTS is an integrated platform for analyzing the functional impact of genetic polymorphisms (SNPs and indels) within microRNA binding sites [13]. Single variants within the 3′ UTR were selected from the total variants and submitted to the PolymiRTS server to check whether they could disrupt existing miRNA binding sites, create new ones, or have no impact at all. PolymiRTS is available at http://compbio.uthsc.edu/miRSNP/ (Table 6).

Effect of SNPs within 5′ UTR on Transcription Factor Binding Sites.
PROMO is a virtual laboratory for the identification of putative transcription factor binding sites (TFBS) in DNA sequences from a species or group of species of interest. TFBS defined in the TRANSFAC database are used to construct specific binding site weight matrices for TFBS prediction. The user can inspect the result of the search through a graphical interface and downloadable text files [22]. The input data were two sequences for each SNP within the 5′ UTR: the first sequence contained the wild-type nucleotide allele

Effect of 3′/5′ Splice Sites SNPs/Indels (HSF Tool).
Human Splicing Finder (HSF) is a tool to predict the effects of mutations on splicing signals or to identify splicing motifs in any human sequence. It contains all available matrices for auxiliary sequence prediction as well as new ones for binding sites of the 9G8 and Tra2-β serine-arginine proteins and the hnRNP A1 ribonucleoprotein. It also provides new position weight matrices to assess the strength of 5′ and 3′ splice sites and branch points [23]. In this study, HSF was used to detect functional SNPs and indels within 3′/5′ splice sites. The input data were nucleotide sequences containing the single substitution (SNP) or the insertion/deletion (indel), as in Table 8 (http://www.umd.be/HSF3/index.html).
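Sequence-based tools such as HSF take a reference sequence and a variant sequence as input. The following Python sketch shows one way such sequence pairs could be prepared for a substitution, a deletion, and an insertion. The helper names, the 1-based coordinate convention, and the short example sequence are all assumptions made for illustration; they are not BRAF sequence and not part of any tool's interface.

```python
def apply_substitution(seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace the base at 1-based position `pos`, checking the reference allele."""
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


def apply_deletion(seq: str, pos: int, length: int) -> str:
    """Delete `length` bases starting at 1-based position `pos`."""
    return seq[:pos - 1] + seq[pos - 1 + length:]


def apply_insertion(seq: str, pos: int, inserted: str) -> str:
    """Insert bases immediately after 1-based position `pos`."""
    return seq[:pos] + inserted + seq[pos:]


# Hypothetical reference sequence, for illustration only.
wild_type = "CAGGTAAGTCCT"
snp_variant = apply_substitution(wild_type, 5, "T", "C")
deletion_variant = apply_deletion(wild_type, 6, 2)
insertion_variant = apply_insertion(wild_type, 4, "A")

print(wild_type)
print(snp_variant)
print(deletion_variant)
print(insertion_variant)
```

Checking the reference allele before substituting guards against off-by-one coordinate errors, which are easy to make when switching between 0-based and 1-based positions.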

Results and Discussion
Information about the total single variants and the functional nsSNPs predicted by three or two tools was obtained from several databases (dbSNP, UniProt, HapMap, 1000 Genomes Project, GenBank, and ClinVar) (Tables 1 and 2). None of the functional SNPs was present in the HapMap or 1000 Genomes Project databases.

Predicted Results by SIFT, PolyPhen, and SNAP2 Servers.
Of 232 nsSNPs in the BRAF gene, 111 were predicted to be damaging or to have an effect by all three servers (SIFT, PolyPhen, and SNAP2) (Table 3). One further SNP (rs180177032, R70I) was predicted to be functional by two tools only (SIFT and SNAP2), and five SNPs (V600M, L597V, L205V, V208M, and H2Q) were predicted to be functional by two servers only (SIFT and PolyPhen) (Table 4). In contrast, two frameshift indels (rs35546910, ch7:140834611; rs777474487, ch7:140783126-) showed no predicted effect on the protein. From the results in Table 3, the 15 nsSNPs with the highest prediction scores across the three servers were selected to predict their stability index (Table 5) and to visualize wild-type and mutant residues in their protein 3D structures (Figure 2).
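The grouping of variants by how many tools call them functional can be sketched as follows. This is a minimal illustration, not the procedure used in the study: treating both "probably damaging" and "possibly damaging" as functional, and cutting SNAP2 scores at 0, are assumptions, and all variant names and values are hypothetical.

```python
from collections import Counter

SIFT_CUTOFF = 0.05  # stated in the SIFT section
# Assumption: both damaging PolyPhen classes count as functional.
POLYPHEN_FUNCTIONAL = {"probably damaging", "possibly damaging"}


def supporting_tools(sift_score: float, polyphen_label: str, snap2_score: int) -> tuple:
    """Return the tools that call a variant functional."""
    tools = []
    if sift_score <= SIFT_CUTOFF:
        tools.append("SIFT")
    if polyphen_label.lower() in POLYPHEN_FUNCTIONAL:
        tools.append("PolyPhen")
    if snap2_score > 0:  # assumption: positive SNAP2 score means "effect"
        tools.append("SNAP2")
    return tuple(tools)


# Hypothetical predictions, for illustration only (not data from this study).
predictions = {
    "variant_A": (0.00, "probably damaging", 85),
    "variant_B": (0.02, "benign", 40),
    "variant_C": (0.01, "possibly damaging", -20),
    "variant_D": (0.40, "benign", -70),
}

support = {name: supporting_tools(*values) for name, values in predictions.items()}
tally = Counter(len(tools) for tools in support.values())

for name, tools in support.items():
    print(name, tools)
print(dict(tally))
```

Keeping the tool names, rather than only a count, makes it possible to separate the "SIFT and SNAP2" group from the "SIFT and PolyPhen" group, as the results above do.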

UTRs and Splice Sites.
The untranslated regions showed fewer functional SNPs and indels than the coding region. In the 3′ UTR, five SNPs and one indel were predicted to alter microRNA binding sites, either disrupting existing binding sites or creating new ones (Table 6). Furthermore, the miRNAs associated with these functional SNPs/indel are associated with many genes, so a defect in these miRNAs could affect the expression of all associated genes.
For the 5′ UTR (five SNPs obtained), two SNPs were located in transcription factor binding sites, but neither altered them, and the remaining three were not located within any TF binding site; thus none of the five SNPs showed an effect on TF binding sites (Table 7). Of the three single variants (two SNPs and one indel) within 5′/3′ splice sites, one SNP within a 5′ splice site and one indel within a 3′ splice site showed potential alteration of splicing (Table 8).
To date the complete mechanisms by which a nucleotide variant may result in a phenotypic change are for the most part unknown. In silico analysis using powerful software tools can facilitate predicting the phenotypic effect of nonsynonymous coding SNPs on the physicochemical properties of the concerned proteins. Such information is critical for genotype-phenotype correlations and also to understand disease biology. Given the fact that nsSNPs in critical cellular genes such as BRAF modify the normal programs of cell proliferation, differentiation, and death, they are believed to play an important role in disease predisposition. Therefore, efforts were made to identify SNPs that can modify the structure, function, and expression of the BRAF gene.
One of the most significant BRAF mutations is the substitution of thymine with adenine at nucleotide 1799, which results in an amino acid substitution at position 600 from valine (V) to glutamic acid (E). This mutation, called V600E, is located in the activation segment and has been found in many human cancers. For example, it was reported as the most common genetic mutation related to papillary thyroid cancer, occurring in approximately 45% of patients [24,25]. In silico investigation with the SIFT and PolyPhen online tools also classified this mutation as a highly damaging substitution that could cause disease. Furthermore, Project HOPE server results showed that the wild-type residue (V) is smaller (Figure 3), neutral in charge, and more hydrophobic, whereas the mutant residue (E) is bigger (Figure 3), negatively charged, and less hydrophobic. In addition, the mutated residue is located in a domain that is important for the activity of the protein and is in contact with another domain that is also important for the activity. The interaction between these domains could be disturbed by the mutation, which might affect the function of the protein.
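The relationship between nucleotide 1799 and codon 600 can be checked with a short calculation. The sketch below assumes that position 1799 is counted from the first base of the coding sequence and that the wild-type codon is GTG; the codon is taken from general knowledge of this mutation, not from the text above. The hydrophobicity values are from the Kyte-Doolittle scale, used here only to illustrate the "more/less hydrophobic" statement. All names are hypothetical.

```python
# Only the two codons needed here, from the standard genetic code.
CODON_TO_AA = {"GTG": "V", "GAG": "E"}
# Kyte-Doolittle hydropathy values for the two residues.
HYDROPATHY = {"V": 4.2, "E": -3.5}


def codon_number(cds_pos: int) -> int:
    """1-based codon index for a 1-based coding-sequence position."""
    return (cds_pos - 1) // 3 + 1


def position_in_codon(cds_pos: int) -> int:
    """Position (1, 2, or 3) of the base within its codon."""
    return (cds_pos - 1) % 3 + 1


cds_pos, ref_base, alt_base = 1799, "T", "A"
wild_codon = "GTG"  # assumption: wild-type codon 600

codon = codon_number(cds_pos)
offset = position_in_codon(cds_pos)
assert wild_codon[offset - 1] == ref_base  # reference base must match

mutant_codon = wild_codon[:offset - 1] + alt_base + wild_codon[offset:]
change = f"{CODON_TO_AA[wild_codon]}{codon}{CODON_TO_AA[mutant_codon]}"

print(change)  # V600E
print(HYDROPATHY["V"], HYDROPATHY["E"])
```

Nucleotide 1799 is the middle base of codon 600, so a T-to-A change turns GTG into GAG, consistent with the valine-to-glutamic-acid substitution described above.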

Conclusion
The current study presents an in silico analysis of single genetic variants within the coding region, 3′/5′ UTRs, and 3′/5′ splice sites of the BRAF gene. These polymorphisms could directly or indirectly influence the intermolecular and intramolecular interactions of amino acid residues and protein expression, and may result in disease risk. By analyzing the conformational changes and interactions of amino acid residues within BRAF proteins, we have identified significant structural and functional changes that can explain the deviations in activity caused by several mutations. Furthermore, many of the detected SNPs were classified as pathogenic or likely pathogenic and associated with diseases in the clinical variation database ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/). These diseases include cardiofaciocutaneous syndrome, Noonan syndrome, LEOPARD syndrome, RASopathy, non-small-cell lung cancer, carcinoma of the colon, adenocarcinoma of the lung, thyroid cancer, malignant lymphoma, and non-Hodgkin lymphoma. Screening for BRAF variants may be useful for molecular diagnosis and for the development of molecular inhibitors of the gene's pathways. This study demonstrates the value of different bioinformatics tools for assessing phenotypic changes and protein function associated with the structure-function relationship of the BRAF gene. More evidence is required for the involvement of deregulated miRNA networks in cancer development. The resulting SNPs can be used for further investigation and diagnosis of many associated diseases.