A Comprehensive In Silico Analysis of the Functional and Structural Impact of SNPs in the IGF1R Gene

Insulin-like growth factor 1 receptor (IGF1R) acts as a critical mediator of cell proliferation and survival. Many single nucleotide polymorphisms (SNPs) found in the IGF1R gene have been associated with various diseases, including both breast and prostate cancer. The genetics of these diseases could be better understood by knowing the functions of these SNPs. In this study, we performed a comprehensive analysis of the functional and structural impact of all known SNPs in this gene using publicly available computational prediction tools. Out of a total of 2412 SNPs in IGF1R retrieved from dbSNP, we found 32 nsSNPs, 58 sSNPs, 83 mRNA 3′ UTR SNPs, and 2225 intronic SNPs. Among the nsSNPs, a total of six missense nsSNPs were found to be damaging by both a sequence homology-based tool (SIFT) and a structural homology-based method (PolyPhen), and one nonsense nsSNP was found. Further, we modeled mutant proteins and compared the total energy values with the native IGF1R protein, and showed that a mutation from arginine to cysteine at position 1216 (rs61740868) on the surface of the protein caused the greatest impact on stability. Also, the FASTSNP tool suggested that 31 sSNPs and 3 intronic SNPs might affect splicing regulation. Based on our investigation, we report potential candidate SNPs for future studies on IGF1R mutations.


Introduction
Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome is altered. SNPs make up about 90% of all human genetic variation, occurring every 100-300 bases along the 3-billion-base human genome, although their density varies between regions [1]. SNPs are found in both coding (gene) and noncoding regions of the genome. Many SNPs have no effect on cell function; however, others could predispose people to disease or influence their response to a drug. Nonsynonymous SNPs (nsSNPs) that lead to an amino acid residue substitution in the protein product are of particular interest because they are responsible for nearly half of the known genetic variations related to human inherited disease [2]. Coding synonymous SNPs (sSNPs) and SNPs occurring outside gene promoter or coding regions may nevertheless have consequences for gene expression, splicing, or transcription-factor binding [3,4].
Identifying the SNPs responsible for specific phenotypes is a difficult problem, requiring multiple testing of hundreds or thousands of SNPs in candidate genes [5]. The choice of the set of SNPs to be screened is therefore critical to the success of association studies. A possible way to overcome this problem would be to prioritize SNPs according to their functional significance [6,7] by using bioinformatics prediction tools, which may help discriminate neutral SNPs from SNPs of likely functional importance and could also be useful to reveal the structural basis of disease mutations. Without any careful preselection of SNPs to be screened, a huge number of individuals might be required to detect association at a reasonable level of statistical significance [5].
Journal of Biomedicine and Biotechnology
Although wet-lab approaches for identifying disease-associated SNPs among a large number of neutral SNPs remain crucial evidence for the functional role of SNPs [8], numerous published disease associations could not be confirmed by subsequent independent studies [6,9]. Hence, independent evidence of SNP functionality obtained with prediction tools could also serve as an additional argument to discriminate true associations from false positives [5], as shown recently by the functional SNP analysis of the BRCA1, ABL1, ERBB2, CFTR, and EGFR genes [10][11][12][13][14].
Insulin-like growth factor 1 receptor (IGF1R) is a growth factor receptor tyrosine kinase that acts as a critical mediator of cell proliferation and survival. This receptor is implicated in several cancers, including both breast and prostate cancer [15,16]. Evidence suggests that IGF1R signaling is required for survival and growth when prostate cancer cells progress to androgen independence [17], as increased levels of the receptor are expressed in the majority of primary and metastatic prostate cancer patient tumors [18]. Studies have also shown associations of IGF1R polymorphisms with dementia and ischemic stroke [19,20].
Although several articles describe the association of SNPs in the IGF1R gene with different types of diseases, no computational analysis of the functional consequences of SNPs in this gene has yet been undertaken. We applied different publicly available computational algorithms, namely, Sorting Intolerant From Tolerant (SIFT) [21], Polymorphism Phenotyping (PolyPhen) [22], and Function Analysis and Selection Tool for Single Nucleotide Polymorphisms (FASTSNP) [23], to identify likely deleterious SNPs which could affect protein function.
The SIFT algorithm predicts whether an amino acid substitution affects protein function based on sequence homology among related genes and domains over evolutionary time, and on the physical-chemical properties of the amino acid residues [24][25][26]. PolyPhen also incorporates sequence conservation and the nature of the amino acid residues involved, but additionally takes into account the location of the substitution within known structures and the structural features of the protein available in the annotated database SwissProt [5,27]. By accessing a variety of heterogeneous biological databases and analytical tools, FASTSNP is able to identify SNPs most likely to have functional effects, such as changes to the transcriptional level and pre-mRNA splicing [23].
SIFT and PolyPhen were approximately 80% successful in benchmarking studies employing amino acid substitutions assumed to have a major negative impact on the residual activity of the variant protein as the test set [22,25,27,28,29], and it has been estimated that the "false negative" and "false positive" error rates are 31% and 20% for SIFT, and 31% and 9% for PolyPhen [26]. FASTSNP was used to analyze 1569 SNPs from the SNP500 cancer database, and the results showed that SNPs with a high predicted risk exhibited low allele frequencies for the minor alleles, which is consistent with the finding that a strong selective pressure exists for functional polymorphisms [23,30].
As the majority of disease mutations affect protein stability [31,32], we also proposed modeled protein structures for the mutant proteins and compared them with the native protein in order to evaluate stability changes.

Evaluation of the Functional Impact of Coding nsSNPs Using a Sequence Homology Tool (SIFT).
SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence (http://sift.jcvi.org) [21]. It is a multistep procedure that, given a protein sequence, (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function, (3) obtains the multiple alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions at each position from the alignment. Substitutions at each position with normalized probabilities less than a tolerance index of 0.05 are predicted to be intolerant or deleterious; those greater than or equal to 0.05 are predicted to be tolerated [24,26].
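The cutoff rule described above can be illustrated with a minimal Python sketch. It only applies the 0.05 threshold to scores that SIFT has already computed; it does not reproduce the alignment or probability calculation. The function name, the substitution labels, and the example scores are hypothetical and are not taken from Table 1.

```python
# Cutoff quoted in the text: normalized probabilities below 0.05 are
# predicted deleterious; values >= 0.05 are predicted tolerated.
SIFT_CUTOFF = 0.05


def classify_sift(tolerance_index: float) -> str:
    """Label a substitution from its SIFT tolerance index (hypothetical helper)."""
    if not 0.0 <= tolerance_index <= 1.0:
        raise ValueError("tolerance index must lie in [0, 1]")
    return "deleterious" if tolerance_index < SIFT_CUTOFF else "tolerated"


# Hypothetical scores, for illustration only (not values from this study).
example_scores = {"sub_A": 0.00, "sub_B": 0.03, "sub_C": 0.05, "sub_D": 0.42}
labels = {name: classify_sift(score) for name, score in example_scores.items()}
print(labels)
```

Note that a score of exactly 0.05 falls on the tolerated side under this reading of the rule.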
The analysis was performed by allowing the algorithm to search for homologous sequences using the default settings (UniProt-TrEMBL 39.6 database, median conservation of sequences of 3.00, and removal of sequences more than 90% identical to the query sequence). The IGF1R FASTA amino acid sequence of the NCBI Protein accession id NP_000866.1 was used as the query sequence, and a total of 24 IGF1R nsSNPs filtered from the dbSNP database were analyzed.

Evaluation of the Functional Impact of Coding nsSNPs Using a Structural Homology-Based Method (PolyPhen).
PolyPhen prediction is based on straightforward empirical rules which are applied to the sequence, phylogenetic, and structural information characterizing the substitution [5]. The online input form available at http://coot.embl.de/PolyPhen was filled with the IGF1R amino acid sequence in FASTA format (NCBI Protein accession id NP_000866.1), and the position and substitution of each of the 24 nsSNPs analyzed by SIFT were also submitted for PolyPhen analysis. PolyPhen then searched for 3D protein structures, multiple alignments of homologous sequences, and amino acid contact information in several protein structure databases, calculated position-specific independent counts (PSIC) scores for each of the two amino acid residues entered (the original residue and the nsSNP), and then computed the difference between the PSIC scores of the two residues. The higher the PSIC score difference, the higher the functional impact a particular amino acid substitution is likely to have. A PSIC score difference of 1.5 and above is considered to be damaging. The query options were left with default values.
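As a rough illustration of the final step, the following Python sketch applies the 1.5 cutoff to a PSIC score difference. It assumes the two PSIC scores are already available and that the difference is taken as an absolute value; both function names and the example scores are hypothetical, and the sketch does not reproduce PolyPhen's profile calculation or its additional structural rules.

```python
# Cutoff quoted in the text: a PSIC score difference of 1.5 and above
# is considered damaging.
PSIC_DAMAGING_CUTOFF = 1.5


def psic_score_difference(psic_native: float, psic_variant: float) -> float:
    """Difference between the PSIC scores of the native and variant residues.

    Assumption: the absolute value is used, so the order of the two
    residues does not matter.
    """
    return abs(psic_native - psic_variant)


def is_damaging(score_difference: float) -> bool:
    """Apply the 1.5 cutoff to a PSIC score difference (hypothetical helper)."""
    return score_difference >= PSIC_DAMAGING_CUTOFF


# Hypothetical PSIC scores for one substitution, for illustration only.
diff = psic_score_difference(2.0, 0.5)
print(diff, is_damaging(diff))
```

Under this rule, the smallest and largest PSIC score differences reported for damaging nsSNPs in this study (1.503 and 2.609) both fall above the cutoff.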

Functional Significance of SNPs in Regulatory Regions.
The online tool FASTSNP [23] was used to determine the impact of the sSNPs, 3′ UTR SNPs, and intronic SNPs on the regulation of the IGF1R gene. The FASTSNP server (http://FASTSNP.ibms.sinica.edu.tw) follows the decision tree principle with external Web service access to TFSearch, which predicts whether a non-coding SNP alters the transcription factor binding site of a gene. The score is given on the basis of levels of risk with a ranking of 0, 1, 2, 3, 4, or 5. This signifies the levels of no, very low, low, medium, high, and very high effect, respectively.
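The risk ranking can be written as a simple lookup, sketched here in Python. The dictionary follows the six levels listed above; the helper name is hypothetical, and the optional upper bound is an assumption made so that ranges such as 2-3 can be described in words.

```python
from typing import Optional

# Ranking-to-label mapping as listed in the text (0 = no effect ... 5 = very high).
FASTSNP_RISK_LABELS = {
    0: "no effect",
    1: "very low",
    2: "low",
    3: "medium",
    4: "high",
    5: "very high",
}


def describe_risk(lower: int, upper: Optional[int] = None) -> str:
    """Translate a risk ranking, or a lower-upper range, into words."""
    if upper is None or upper == lower:
        return FASTSNP_RISK_LABELS[lower]
    if upper < lower:
        raise ValueError("upper bound must not be smaller than lower bound")
    return f"{FASTSNP_RISK_LABELS[lower]} to {FASTSNP_RISK_LABELS[upper]}"


print(describe_risk(2, 3))  # a ranking reported as the range 2-3
print(describe_risk(5))     # a single ranking
```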

Modeling of nsSNPs on Protein Structures and Calculation of their RMSD Difference.
Structural analysis was performed in order to evaluate and compare the stability of native and mutant structures. Information about mapping the nsSNPs in the protein structure was obtained from dbSNP [33]. The highest resolution (2.00Å) native structure of the IGF1R protein available in the Protein Data Bank (PDB) [34] has an id of 2oj9 [35]. The positions of the studied nsSNP mutations on PDB id 2oj9 were confirmed by pairwise alignment between the FASTA amino acid sequence of the IGF1R protein obtained from the NCBI (NP_000866.1) and the 2oj9 FASTA amino acid sequence, using the Sequence Manipulation Suite [36]. The amino acid residue substitutions were performed using the Swiss-Pdb Viewer [37], followed by energy minimization of the modeled 3D structures using the GROMACS software version 4.0 [38]. The algorithms used for energy minimization were the steepest descent (1000 steps), followed by conjugate gradient (1500 steps) alternating with the steepest descent every 100 steps. The native and modeled structures were then compared by calculating their potential energy and RMSD values.
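For readers unfamiliar with the RMSD measure, a minimal Python sketch of the calculation follows. It assumes that both structures contain the same atoms in the same order and have already been superimposed; it is not the routine used in this study, and the function name and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coordinate], coords_b: Sequence[Coordinate]) -> float:
    """Root-mean-square deviation between two matched coordinate sets.

    Assumptions: identical atom order in both sets and prior superposition
    of the structures (no fitting is performed here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates in angstroms, for illustration only.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mutant = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(native, mutant))
```

A value of zero means the two coordinate sets coincide; larger values indicate larger average atomic displacements.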

SNP Dataset.
Polymorphism data of the IGF1R gene investigated in this paper was retrieved from the dbSNP database [33]. It contained a total of 2412 SNPs, out of which 32 (1.3%) were nsSNPs, 58 (2.4%) were sSNPs, 83 (3.4%) occurred in the mRNA 3′ UTR, and 2225 (92.2%) occurred in intronic regions. SNPs in the 5′ UTR region were not found. It can be seen from the distribution in Figure 1 that the vast majority of SNPs occur in the intronic region, and that there are more 3′ UTR region SNPs than nsSNPs or sSNPs. We selected missense nsSNPs, sSNPs, 3′ UTR SNPs, and intronic SNPs for our investigation.
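The percentages quoted above follow directly from the counts, as this short Python sketch shows. The counts and the total are those given in the text; the variable names are illustrative.

```python
# Counts and total as reported in the text for the IGF1R gene.
total_snps = 2412
category_counts = {
    "nsSNP": 32,
    "sSNP": 58,
    "3' UTR": 83,
    "intronic": 2225,
}

# Share of each category relative to the total, rounded to one decimal place.
percentages = {
    name: round(100 * count / total_snps, 1)
    for name, count in category_counts.items()
}

for name, share in percentages.items():
    print(f"{name}: {category_counts[name]} SNPs ({share}%)")
```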

Deleterious nsSNPs by SIFT Program.
The protein sequence, with the mutational positions and amino acid residue variants associated with 24 missense nsSNPs, was submitted as input to the SIFT server, and the results are shown in Table 1, along with the corresponding heterozygosity and validation status description for each SNP, when available from dbSNP. According to the classification proposed by Ng and Henikoff [24] and Xi et al. [28], the lower the tolerance index, the higher the functional impact a particular amino acid residue substitution is likely to have and vice versa. Among the 24 nsSNPs analyzed, 8 nsSNPs were identified to be deleterious with a tolerance index score ≤0.05. Five nsSNPs (rs61740868, rs45578132, rs45553041, rs45526336, and rs45504297) showed a highly deleterious tolerance index score of 0.00. The remaining deleterious nsSNPs showed tolerance index scores of 0.01 (rs45524940 and rs45512296) and 0.03 (rs45445894). Four deleterious nsSNPs showed a nucleotide change from G/A, four a change from C/T, two a change from T/C, and one a change from A/G.

Damaged nsSNPs by PolyPhen Server.
All 24 protein sequences of missense nsSNPs submitted to SIFT were also submitted to the PolyPhen server. A PSIC score difference of 1.5 and above is considered to be damaging. Eight nsSNPs (rs70958401, rs61740868, rs45578132, rs45504297, rs45553041, rs45512296, rs45524940, and rs33958176) were considered to be damaging and exhibited PSIC score differences between 1.503 and 2.609 (Table 1). Out of these damaging nsSNPs, two changed from positively charged amino acid in the native protein to hydrophobic amino acid in the mutant type, two from aliphatic nonpolar amino acid to non-polar amino acid, two from positively charged amino acid to aromatic positively charged amino acid, one from polar amino acid to non-polar amino acid, and one from positively charged to polar amino acid. It can be seen from Table 1 that the results obtained from the evolutionary-based approach SIFT and the structural-based approach PolyPhen agreed for six of the nsSNPs predicted to be damaging by PolyPhen, suggesting that these nsSNPs may disrupt both the protein function and structure. The most damaging nsSNP (rs61740868) showed a PSIC score difference of 2.609, due to a mutation from arginine to cysteine.
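The overlap between the two predictions can be checked with a set intersection, sketched here in Python. The rs identifiers are those listed in the text as deleterious by SIFT and damaging by PolyPhen; the variable names are illustrative.

```python
# nsSNPs reported as deleterious by SIFT (tolerance index 0.00, 0.01, or 0.03).
sift_deleterious = {
    "rs61740868", "rs45578132", "rs45553041", "rs45526336",
    "rs45504297", "rs45524940", "rs45512296", "rs45445894",
}

# nsSNPs reported as damaging by PolyPhen (PSIC score difference >= 1.5).
polyphen_damaging = {
    "rs70958401", "rs61740868", "rs45578132", "rs45504297",
    "rs45553041", "rs45512296", "rs45524940", "rs33958176",
}

# Consensus candidates flagged by both tools, and those flagged by only one.
consensus = sift_deleterious & polyphen_damaging
sift_only = sift_deleterious - polyphen_damaging
polyphen_only = polyphen_damaging - sift_deleterious

print(len(consensus), sorted(consensus))
print(sorted(sift_only), sorted(polyphen_only))
```

The intersection contains the six nsSNPs flagged by both tools; two nsSNPs are flagged by SIFT alone and two by PolyPhen alone.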

SNPs in Regulatory Regions.
According to FASTSNP, out of 58 sSNPs in the IGF1R gene, 31 sSNPs were predicted to be damaging with a risk ranking of 2-3, and a possible functional effect on splicing regulation (Table 2). Among these, the A/G polymorphism (rs2229765) has been shown experimentally to affect the susceptibility to ischemic stroke in a Chinese population [19] and to be associated with higher plasma concentrations of circulating IGF1R and premature pubarche [39,40] and with adult height variation in the human population [41]. Out of 2225 SNPs which occur in the intronic region of the IGF1R gene, 3 SNPs (rs55895813, rs36108138, and rs45495500) were predicted to affect the splicing site (risk ranking of 3-4) (Table 2). It can be seen from Table 2 that a coding nonsense SNP (rs45437300) due to a nucleotide change from A to T was detected and showed a very high level of risk, as it can truncate and even inactivate the IGF1R protein, causing disease as a result.

Structural Analysis of Mutant Structures.
Out of eight nsSNPs predicted to be deleterious by SIFT or PolyPhen, four (rs61740868, rs45526336, rs45512296, and rs45504297) were mapped to the PDB ID 2oj9 native structure. The amino acid residue substitutions were performed by Swiss-Pdb Viewer independently to get four mutant modeled structures (2oj9 R1216C, 2oj9 E1253K, 2oj9 R1216H, and 2oj9 L1211P, respectively). Then, energy minimizations were performed by GROMACS for the native structure (2oj9) and the mutant modeled structures (Table 3). Three out of four mutant modeled structures (2oj9 R1216C, 2oj9 R1216H, and 2oj9 L1211P) showed an increase in energy (less favorable change) in comparison with the native structure. This result correlates with the structural homology method (PolyPhen) results, which predicted all these three mutants to be deleterious (PSIC scores 2.609, 2.128, and 2.372, respectively) (Table 1). The mutant model 2oj9 R1216C showed the greatest increase in energy, which may be explained by the energetically unfavorable substitution of a positively charged arginine amino acid residue to a nonpolar cysteine amino acid residue at the surface of the protein structure (Figure 2).
It can be seen from Table 3 that the RMSD values between the native structure (2oj9) and the mutant modeled structures are all similar, ranging from 0.22Å to 0.48Å. Because these values are low, we can suggest that these mutations do not cause a significant change in the mutant structures with respect to the native protein structure.

Conclusion
In this paper, we investigated the functional and structural impact of SNPs in the IGF1R gene using computational prediction tools. Out of a total of 2412 SNPs in the IGF1R gene, 32 SNPs were found to be non-synonymous, 58 were synonymous, 83 occurred in the mRNA 3′ UTR, and 2225 were found in intronic regions. Out of 24 missense nsSNPs, eight were found to be deleterious by SIFT, and eight were found to be damaging by the PolyPhen tool. A total of six nsSNPs were found to be damaging by both SIFT and PolyPhen tools. The structural analysis results showed that the amino acid residue substitutions which had the greatest impact on the stability of the IGF1R protein were mutations 2oj9 R1216C (rs61740868) and R1216H (rs45512296). Among the nsSNPs studied, a nonsense SNP (rs45437300) was found. Out of 58 sSNPs, 31 were predicted to affect splicing regulation by FASTSNP, including an sSNP (rs2229765) associated with several diseases. In the intronic region, 3 SNPs (rs55895813, rs36108138, and rs45495500) were predicted to affect splicing regulation. Based on our results, we conclude that these SNPs should be considered important candidates in causing diseases related to IGF1R malfunction.