A major challenge in the analysis of human genetic variation is to distinguish functional from nonfunctional SNPs. Discovering functional SNPs is one of the main goals of modern genetics and genomics studies. There is a need to identify, effectively and efficiently, the functionally important nonsynonymous SNPs (nsSNPs) that may be deleterious or disease causing, and to identify their molecular effects. Predicting the phenotype of nsSNPs by computational analysis may provide a good way to explore the function of nsSNPs and their relationship with susceptibility to disease. In this context, we surveyed and compared variation databases along with
Understanding the genomic variation in the human population is one of the primary challenges of current genomics research. Identifying genomic variations that underlie the etiology of human diseases is of primary interest in current molecular epidemiology, medicine, and pharmacogenomics [
Because most sequence variants are SNPs, a massive effort has been undertaken by several private and public organizations [
Alkaptonuria (AKU; MIM # 203500) is a rare autosomal recessive disorder of the phenylalanine and tyrosine catabolic pathway caused by a deficiency of homogentisate dioxygenase (HGO, EC 1.13.11.5). AKU was the first disease to be interpreted as a single-gene trait, and its mode of inheritance was reported by Garrod and Oxon (2002) [
The SNPs information (Protein accession number (NP), mRNA accession number (NM) and SNP ID) of
Over the past few years, many computational methods based on machine-learning techniques (support vector machines, neural networks, and decision trees) have been applied successfully to the prediction of sequence-structure relationships. Support vector machines (SVMs) are universal classifiers that learn a variety of data distributions from training samples and, as such, are applicable to classification and regression tasks [
Sorting intolerant from tolerant (SIFT) software developed by Kumar et al. [
PANTHER version 7 (Protein Analysis Through Evolutionary Relationships) estimates the likelihood of a particular nsSNP to cause a functional impact on the protein [
PolyPhen differs from SIFT in that it predicts how damaging a particular variant may be by using a set of empirical rules based on sequence, evolutionary conservation, and structural information characterizing the variant. PolyPhen is a multiple sequence alignment server that aligns sequences using structural information. Input for the PolyPhen server is either a protein sequence or a SWALL database ID or accession number, together with the sequence position and the two amino acid variants. We submitted the query in the form of a sequence with mutational positions, each with two amino acid variants. In addition to using sequence alignments, PolyPhen utilizes protein structure databases, such as PDB (Protein Data Bank) or PQS (Protein Quaternary Structure), DSSP (Dictionary of Secondary Structure in Proteins), and three-dimensional structure databases to determine whether a variant may affect the protein’s secondary structure, interchain contacts, functional sites, and binding sites [
To efficiently identify nsSNPs with a high possibility of having a functional effect, the FASTSNP tool was applied to detect the influence of nsSNPs on cellular and molecular biological function, for example, transcriptional and splicing regulation. The online tool FASTSNP [
Structural analyses were performed on the crystal structure of the protein to evaluate the structural stability of the native and mutant proteins. We used the SAAPdb [
The structure and function of proteins are determined by various factors. To check the stability of the native and mutant modeled structures, identification of the stabilizing residues is useful. We used the server SRide [
The functional impact of nsSNPs can be assessed by evaluating the importance of the amino acids they affect. We employed four widely used computational tools to determine the functional significance of nsSNPs. In this analysis, we applied two different approaches to the computational analysis of deleterious nsSNPs, namely, an empirical rule-based method and a support vector machine (SVM) method. These approaches use alternative classification methods to decide which nsSNPs are likely to have deleterious or neutral phenotypes. SVM approaches require a set of training data and trained attributes to predict precisely the effects of amino acid substitutions on various protein properties such as protein stability, protein secondary structures, solvent accessibility of residues, residue-residue interactions, and protein 3D structures [
List of nsSNPs predicted to be deleterious by SIFT, PolyPhen, PANTHER, and I-Mutant 2.0 in the coding region of
| rs ID | Allele frequency and change | AA position | SIFT tolerance index | SIFT predicted impact | PolyPhen PSIC score | PolyPhen predicted impact | PANTHER subPSEC score | PANTHER predicted impact | I-Mutant 2.0 DDG | I-Mutant 2.0 predicted impact | Reference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs138356501 | A(0.000)/T(1.000) | Y37F | 0.15 | Tolerant | 0.534 | Benign | −1.92527 | Tolerated | 0.01 | Increase stability | |
| rs138846036 | A(0.012)/C(0.988) | A48S | 0.12 | Tolerant | 0.497 | Benign | −2.05903 | Tolerated | −0.59 | Decrease stability | |
| rs141965690 | A(0.000)/T(1.000) | E74V | 0.29 | Tolerant | 0.524 | Benign | −2.29059 | Tolerated | 0.33 | Increase stability | |
| rs2255543 | A(0.262)/T(0.738) | Q80H | 0.45 | Tolerant | 0.258 | Benign | −1.49933 | Tolerated | −1.17 | Decrease stability | [ |
| rs35702995 | A(0.996)/C(0.004) | E87A | 0.50 | Tolerant | 0.881 | Benign | −2.18204 | Tolerated | −1.85 | Decrease stability | |
| rs143267384 | A(0.000)/T(1.000) | E101V | 0.06 | Tolerant | 1.817 | Probably damaging | −2.67878 | Tolerated | 0.82 | Increase stability | |
| | A(0.000)/G(1.000) | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| rs140543217 | A(0.000)/G(1.000) | L163F | 0.00 | Intolerant | 1.105 | Benign | −3.70687 | Deleterious | −1.12 | Decrease stability | |
| | C/T | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| | A(0.000)/G(1.000) | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| rs148641817 | G(1.000)/T(0.000) | A293E | 0.09 | Tolerant | 1.128 | Benign | −1.96145 | Tolerated | 0.84 | Increase stability | |
| | G/T | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| rs143556739 | A(0.001)/G(0.999) | R307C | 0.01 | Intolerant | 1.535 | Probably damaging | −3.80886 | Deleterious | −1.59 | Decrease stability | |
| rs143396290 | C(1.000)/T(0.000) | D326N | 0.02 | Intolerant | 0.503 | Benign | −2.1555 | Tolerated | 0.37 | Increase stability | |
| | G/T | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| | A(0.000)/C(1.000) | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | |
| rs120074173 | A(1.000)/G(0.000) | M368V | 0.00 | Intolerant | 2.373 | Probably damaging | −2.45276 | Tolerated | −0.35 | Decrease stability | [ |
| rs149326001 | G(1.000)/T(0.000) | T369N | 0.00 | Intolerant | 1.535 | Probably damaging | −2.88444 | Tolerated | −0.60 | Decrease stability | |
| | A/G | | | Intolerant | | Probably damaging | | Deleterious | | Decrease stability | [ |
| rs150145204 | C(0.001)/G(0.999) | D376E | 0.84 | Tolerant | 0.089 | Benign | −1.09282 | Tolerated | 0.14 | Increase stability | |
| rs141753513 | C(1.000)/G(0.000) | E379Q | 0.01 | Intolerant | 1.096 | Benign | −2.62114 | Tolerated | −0.38 | Decrease stability | |
| rs138558042 | A(0.000)/G(1.000) | P373L | 0.00 | Intolerant | 2.074 | Probably damaging | −2.72749 | Tolerated | −0.66 | Decrease stability | |
nsSNPs predicted to be highly deleterious by SIFT, PANTHER, PolyPhen, and I-Mutant 2.0 are indicated in bold.
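As an illustration of how the four predictions in the table can be combined, the following minimal Python sketch counts, for each nsSNP, how many tools flag the substitution. The cutoffs (SIFT tolerance index ≤ 0.05, subPSEC ≤ −3, DDG < 0) are assumptions chosen to be consistent with the calls reported in the table, the PolyPhen call is copied from the table rather than recomputed, and the names `NsSnp` and `deleterious_votes` are hypothetical. Only a few fully reported rows are included.

```python
from dataclasses import dataclass

# Assumed cutoffs, chosen to agree with the calls listed in the table;
# they are not quoted from the tools' own documentation.
SIFT_MAX_TOLERANCE = 0.05    # tolerance index <= 0.05 -> intolerant
PANTHER_MAX_SUBPSEC = -3.0   # subPSEC <= -3 -> deleterious
IMUTANT_MAX_DDG = 0.0        # DDG < 0 -> decreased stability


@dataclass(frozen=True)
class NsSnp:
    rs_id: str
    aa_change: str
    sift_tolerance: float
    polyphen_call: str       # label copied from the table, not recomputed
    panther_subpsec: float
    imutant_ddg: float


def deleterious_votes(snp: NsSnp) -> int:
    """Return how many of the four tools flag the substitution."""
    return sum([
        snp.sift_tolerance <= SIFT_MAX_TOLERANCE,
        snp.polyphen_call.lower() == "probably damaging",
        snp.panther_subpsec <= PANTHER_MAX_SUBPSEC,
        snp.imutant_ddg < IMUTANT_MAX_DDG,
    ])


# A subset of the fully reported rows of the table.
ROWS = [
    NsSnp("rs138356501", "Y37F", 0.15, "Benign", -1.92527, 0.01),
    NsSnp("rs140543217", "L163F", 0.00, "Benign", -3.70687, -1.12),
    NsSnp("rs143556739", "R307C", 0.01, "Probably damaging", -3.80886, -1.59),
    NsSnp("rs120074173", "M368V", 0.00, "Probably damaging", -2.45276, -0.35),
    NsSnp("rs150145204", "D376E", 0.84, "Benign", -1.09282, 0.14),
]

# Substitutions flagged by all four tools.
consensus = [snp.aa_change for snp in ROWS if deleterious_votes(snp) == 4]

for snp in ROWS:
    print(f"{snp.rs_id} {snp.aa_change}: {deleterious_votes(snp)}/4 tools")
print("Flagged by all four tools:", consensus)
```

Requiring agreement of all four tools is the strictest possible filter; lowering the required vote count trades specificity for sensitivity.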
The functional prediction of SNPs in untranslated region for the
List of SNPs predicted to be functionally significant by FASTSNP.
| SNP ID | Allele frequency and change | Region | Possible functional effect | Ranking and level of risk |
|---|---|---|---|---|
| rs7652072 | A/G (No frequency) | Intron | Splicing site | 3–4 (Medium to high) |
| rs55661952 | C/T (No frequency) | 5′UTR (−201A>G) | Promoter/regulatory region | 1–3 (Low to medium) |
| rs2733829 | C/T (No frequency) | 5′UTR (−339C>T) | Promoter/regulatory region | 1–3 (Low to medium) |
| | C/T (No frequency) | nsSNP | Missense (conservative) | 2–3 (Low to medium) |
| | A(0.000)/G(1.000) | nsSNP | Missense (conservative); Splicing regulation | 2–3 (Low to medium) |
| rs35702995 | A(0.996)/C(0.004) | nsSNP (E87A) | Missense (conservative); Splicing regulation | 2–3 (Low to medium) |
| rs2255543 | A(0.514)/T(0.700) | nsSNP (Q80H) | Missense (conservative); Splicing regulation | 2–3 (Low to medium) |
| rs2293734 | G/T (No frequency) | csSNP (P158P) | Sense/synonymous; Splicing regulation | 2–3 (Low to medium) |
SNP IDs highlighted in bold were found to be deleterious by SIFT, PANTHER, PolyPhen, and I-Mutant 2.0.
Knowledge of the 3D structure of a gene product is of major assistance in understanding its function within the cell and its role in causing disease. Proteins with mutations do not always have 3D structures that have been analyzed and deposited in the Protein Data Bank (PDB). Therefore, it is necessary to construct 3D models by locating the mutation in the 3D structure. This is a simple way of detecting what kind of adverse effects a mutation can have on a protein. The linear sequence of amino acids specifies the 3D structure of the protein. Even a single amino acid substitution can disrupt the structure of a protein by affecting its stability, which changes the structural and thermodynamic properties that govern protein dynamics. Mutation analysis was performed on the substitutions with the highest SIFT, PolyPhen, I-Mutant 2.0, and PANTHER scores. The mutations at their corresponding positions were introduced independently with SWISS-PDB viewer to obtain modelled structures. Energy minimizations were then performed with the NOMAD-Ref server for the native and mutant structures. According to this in
Superimposed structures of native and mutant modeled of
Molecular approaches are laborious and time-consuming, so it is now possible to apply computational approaches first to filter out substitutions that are unlikely to affect protein function. Computational approaches, which are fast and relatively inexpensive, offer a more feasible means of phenotype prediction based on the biochemical severity of the amino acid substitution and on protein sequence and structural information. The computational analysis performed here suggests that individual tools correlate modestly with observed results and that combining information from a variety of tools may significantly increase the predictive power for determining the functional impact of a given SNP. Each computational method employed in this analysis has its own advantages and disadvantages in predicting functional SNPs. Users must decide which tool is best suited to the specific objectives of their analysis. This SNP prioritization analysis integrates relevant biomedical information and computational methods to provide a systematic analysis of functional and deleterious nsSNPs. We intend these methods to serve as a first-pass filter that identifies the deleterious substitutions worth pursuing in further experimental research.
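How far the four tools agree with one another on the same substitutions can be checked with a short Python sketch. It encodes the predicted impacts of the fully reported rows of the nsSNP table as 1 (flagged: intolerant, probably damaging, deleterious, or decreased stability) or 0 (not flagged) and reports the fraction of substitutions on which each pair of tools gives the same call. The encoding and the helper name `agreement` are assumptions made for illustration; this is a simple concordance measure, not a validation against experimental data.

```python
from itertools import combinations

TOOLS = ("SIFT", "PolyPhen", "PANTHER", "I-Mutant 2.0")

# One tuple per fully reported row of the nsSNP table, in TOOLS order.
# 1 = flagged by the tool, 0 = not flagged (encoding assumed for this sketch).
CALLS = {
    "Y37F":  (0, 0, 0, 0),
    "A48S":  (0, 0, 0, 1),
    "E74V":  (0, 0, 0, 0),
    "Q80H":  (0, 0, 0, 1),
    "E87A":  (0, 0, 0, 1),
    "E101V": (0, 1, 0, 0),
    "L163F": (1, 0, 1, 1),
    "A293E": (0, 0, 0, 0),
    "R307C": (1, 1, 1, 1),
    "D326N": (1, 0, 0, 0),
    "M368V": (1, 1, 0, 1),
    "T369N": (1, 1, 0, 1),
    "D376E": (0, 0, 0, 0),
    "E379Q": (1, 0, 0, 1),
    "P373L": (1, 1, 0, 1),
}


def agreement(i: int, j: int) -> float:
    """Fraction of substitutions on which tools i and j give the same call."""
    same = sum(1 for calls in CALLS.values() if calls[i] == calls[j])
    return same / len(CALLS)


for i, j in combinations(range(len(TOOLS)), 2):
    print(f"{TOOLS[i]} vs {TOOLS[j]}: {agreement(i, j):.2f}")
```

Pairs with low agreement are the cases where a combined call carries information that neither tool provides alone.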
The authors declare that they have no conflict of interests.
The authors thank the management of VIT University for providing the facilities to carry out this work.