Computational Analysis of Missense Variants in the Human Transmembrane Protease Serine 2 (TMPRSS2) and SARS-CoV-2

The human transmembrane protease serine 2 (TMPRSS2) protein plays an important role in prostate cancer progression. It also facilitates viral entry into target cells by proteolytically cleaving and activating the S protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In the current study, we used different available tools like SIFT, PolyPhen2.0, PROVEAN, SNAP2, PMut, MutPred2, I-Mutant Suite, MUpro, iStable, ConSurf, ModPred, SwissModel, PROCHECK, Verify3D, and TM-align to identify the most deleterious variants and to explore possible effects on the TMPRSS2 stability, structure, and function. The six missense variants tested were evaluated to have deleterious effects on the protein by SIFT, PolyPhen2.0, PROVEAN, SNAP2, and PMut. Additionally, V160M, G181R, R240C, P335L, G432A, and D435Y variants showed a decrease in stability by at least 2 servers; G181R, G432A, and D435Y are highly conserved and identified posttranslational modifications sites (PTMs) for proteolytic cleavage and ADP-ribosylation using ConSurf and ModPred servers. The 3D structure of TMPRSS2 native and mutants was generated using 7 meq as a template from the SwissModeller group, refined by ModRefiner, and validated using the Ramachandran plot. Hence, this paper can be advantageous to understand the association between these missense variants rs12329760, rs781089181, rs762108701, rs1185182900, rs570454392, and rs867186402 and susceptibility to SARS-CoV-2.

To date, physiological roles of the transmembrane protease serine 2 are unknown, but it participates in many biological processes such as digestion, fertility, blood coagulation, tissue remodeling, inflammatory responses, tumor cell invasion, and apoptosis [2]. TMPRSS2 in turn plays an essential role in prostate tumorigenesis via the proteolytic activation of the protease-activated receptor 2 (PAR-2) [3,4]. A study by Magi-Galluzzi et al. about prostate cancer (Pca) revealed that TMPRSS2-ERG fusion was significantly correlated with ethnicity and geography (50% of Caucasians, 31.3% African-Americans, and 15.9% of Japanese patients) [5]. Another study by Kong et al. explored the association between the TMPRSS2-ERG gene fusion and clinicopathological characteristics and reported that no significant correlation was observed between the TMPRSS2-ERG gene fusion and clinical parameters [6].
The S protein is composed of an extracellular N-terminal associated with S1 essential for binding the receptor and a Cterminal labelled S2 that is used for membrane fusion. The envelope E protein is composed of a hydrophilic amino acid terminus (7)(8)(9)(10)(11)(12), the transmembrane hydrophobic domain, and a long C-terminal domain that are essential   BioMed Research International for viral assembly and maturation. The M protein is composed of a hydrophilic C-terminal and amphipathic Nterminal which are needed for viral assembly. The N protein consists of an N-terminal RNA domain (NTD) and a Cterminal dimerization (CTD) domain separated by a serine-rich linker region that are essential for viral entry and assembly [8,9]. As TMPRSS2 is expressed in bronchial and lung cells, it can therefore facilitate entry of SARS-CoV-2 into host cells by cleaving the ACE2 receptor at arginine 697-716 positions [2]. The TMPRSS2 protein is responsible for the proteolytic cleavage of the viral spike protein (S) [10]. Several studies have demonstrated the existence of three residues of catalytic triad of TMPRSS2, namely, His 296, Asp 345, and Ser 441, that play a crucial role in the involvement of molecular complex between TMPRSS2 and viral spike protein S and consequently SARS-CoV-2 [10]. Recent studies show the existence of unique variants in TMPRSS2 (p. Val160Met, p. Gly181Arg, p. Arg240Cys, p. Pro335Leu, p. Gly432Ala, and p. Arg435Tyr) that can alter the efficiency of TMPRSS2 and might influence susceptibility to SARS-CoV-2 [11].
Taking into account all these considerations, this article is aimed at elucidating the plausible effect of TMPRSS2 genetic missense variants in structure, stability, and functions of TMPRSS2 using different publicly available bioinformatics algorithms. The use of a wide array of pathogenicity tools like SIFT, PolyPhen2.0, PROVEAN, SNAP2, and PMut provides consistent results. Also, stability, conservation, and flexibility approaches using bioinformatics tools, namely, I-Mutant Suite, MUpro, iStable, STRUM, CUPSAT, ConSurf, ModPred, and FlexPred, will aid comprehending the mutation effect on TMPRSS2 protein [12][13][14][15].

Materials and Methods
2.1. Datasets. The amino acid sequence of the TMPRSS2 gene was obtained in FASTA format from UniProt databases (UniProt ID: O15393) (https://www.uniprot.org). All the variants of the TMPRSS2 gene were collected from Ensembl Genome Browser (https://www.ensembl.org/Homo_sapiens/ Gene/Variation_Gene/Table?db=core;g= ENSG00000184012;r=21:41464305-41531116). A total of 392 missense variants were mapped in the human TMPRSS2 gene, but we limited our study to those SNPs who provide explanations for genetic susceptibility to COVID-19; therefore, six variants remained.

Functional Analysis of Human TMPRSS2 Missense
Variants. SIFT (Sort Intolerant from Tolerant) is a sequence homology-based algorithm that predicts tolerable and intolerable change in protein function caused by the substitution in amino acid sequence, which is available at https://sift.bii .a-star.edu.sg/ [16]. A substitution is predicted to be "deleterious" if the prediction score ranges from 0 to 0.05 and "tolerable" if the prediction score is greater than or equal to 0.05 [17]. PolyPhen2.0 (Polymorphism Phenotyping v2) is a web server that uses physical and comparative considerations to estimate the effect of substitution of an amino acid on protein function and structure, which is available at https:// genetics.bwh.harvard.edu/pph2/ [18]. PROVEAN (Protein Variation Effect analyzer) is an algorithm that predicts the possible impact of the substitution of amino acid, based on the alignment score approach, which is available at http:// provean.jcvi.org/ [50]. SNAP2 (Screening of nonacceptable Polymorphism 2) is a bioinformatics tool that uses the annotations from the protein mutant database (PMD) to predict       [22]. MUpro (http://mupro.proteomics.ics.uci.edu/) is an online server used to predict the stability change for single-site mutations, depending on the structure or sequence information. MUpro adopts a support vector machine as an estimator to calculate the ΔΔG value and evaluate the direction of stability change of the protein [23]. iStable (Integrated Predictor for Protein Stability Change Upon Single Mutation) (http://predictor.nchu.edu.tw/istable/indexSeq.php) is a web server that uses a support vector machine as an integrator to predict the value of free energy stability (DDG). DDG < 0 for each variant was considered as decreasing the stability of the protein [24,25]. STRUM (https://zhanglab .ccmb.med.umich.edu/STRUM/) is a tool based on a gradient boosting regression approach to predict the effect of a site mutation on stability. ΔΔG < 0 means the variant decreases the protein stability and vice versa [26]. CUPSAT (Cologne University Protein Stability Analysis Tool) is a web server that combines both structural environment-specific atoms and torsion angles to predict protein stability changes upon point mutations, which is available at http://cupsat.tubs.de/ [27].

Identification of Conserved Residues and Sequence
Motifs. Clustal Omega, a bioinformatics program, was used to align multiple homologous proteins or DNA/RNA sequences. It uses both the older clustalX and clustalW for multiple sequence alignment. Clustal Omega is available at https://www.ebi.ac.uk/Tools/msa/clustalo/ or can be used from the command line [28]. Jalview is a freely available system (https://www.jalview.org), which was used for visualization, editing, figure generation, and analysis of molecular sequences, alignment, and structures, provided by the European Bioinformatics Institute (EBI) and the University of Dundee [29].

Evolutionary Phylogenetic
Analysis of TMPRSS2. Con-Surf (https://consurf.tau.ac.il/) is an in silico tool that uses an empirical Bayesian method for estimating the degree of evolutionary conservation of an amino acid in macromolecules (protein or nucleic acid). The conservation grades are ranged from 0 to 9, where 1-4 score is variable, 5-6 score is intermediate, and 7-9 score is conserved [30].

Protein
Flexibility. FlexPred (http://flexpred.rit.albany .edu/) a bioinformatic program uses two sequence-derived information and solvent accessibility to evaluate residue positions involved in conformational switches. FlexPred classifies amino acid residues into rigid or flexible [32]. Pre-dyFlexy (https://www.dsimb.inserm.fr/dsimb_tools/ predyflexy/) is an online tool, which was used to predict protein flexibility. PredyFlexy adopts the X-ray B-factors and the root mean square fluctuations (RMSF) for predicting the flexibility of local protein structures [33]. RaptorX Property (http://raptorx.uchicago.edu/StructurePropertyPred/ predict/) is a web-based server implementing a powerful machine learning method named deepCNF (deep convolutional neural fields) to evaluate and calculate protein secondary structure, disorder regions, and solvent accessibility [34].
2.3.6. Secondary Structure. PredictProtein (https:// predictprotein.org/) is an automatic server that uses FASTA amino acid sequence as input and predicts protein structure such as secondary structure, solvent accessibility, disulfide     it starts from C-alpha trace and main chain hydrogenbonding networks. Secondly, the side chain is added onto the backbone conformation with the guide of a composite of physics and knowledge-based force fields [37]. PRO-CHECK (https://servicesn.mbi.ucla.edu/PROCHECK/) is a web-based tool for assessing the quality of protein structure. Its outputs contain a large number of plots including the Ramachandran plot [38]. Verify3D, a freely available online server (https://servicesn.mbi.ucla.edu/Verify3D/), was used to verify the quality assessment of protein models with three-dimensional profiles. A PDB file format was provided as input to generate a profile window plot [39]. TM-align (https://zhanglab.ccmb.med.umich.edu/TM-align/), an online tool, was employed to predict the best alignment between two structures using both TM-score rotation matrix and dynamic programming. A TM − score < 0:2 means that there is no similarity between two protein structures [40].

Ligand Binding Site Prediction.
COACH is a metaserver approach for prediction of protein-ligand binding sites. The server employs other comparative methods, like TM-Site and S-Site, FINDSITE, COFACTOR, and ConCavity, which are available at https://zhanglab.ccmb.med.umich.edu/ COACH/ [41]. RaptorX binding site (http://raptorx .uchicago.edu/BindingSite/), a tool, was used for the prediction of ligand binding regions by submitting the FASTA format as input [42].

Dynamic Cross-Correlation Matrix Analysis Using Bio3d
Package by RStudio Software and DynOmics Server. We  of TMPRSS2 native and mutants using the Bio3d package by RStudio program [9]. Then, we used DynOmics ENM server to determinate the correlation between observed and predicted fluctuations of TMPRSS2 native and mutants. DynOmics ENM, an online server, was used for computing biomolecular system dynamics of any PDB file. DynOmics ENM uses both elastic network models (ENMs)-the Gaussian Network Model (GNM) and the Anisotropic Network Model (ANM) [44]. Bio3d is an automated R package for the comparative analysis of biomolecular structure, sequence, analysis, and dynamic. Bio3d integrates multiple comparative methods such as principal component analysis (PCA), new ensemble difference distance matrix (eDDM) analysis, network analysis, and normal mode analysis (NMA) [45].

Results
All the reported missense variants of the TMPRSS2 gene were retrieved from Ensembl Genome Browser (https:// www.ensembl.org/Homo_sapiens/Gene/Variation_Gene/ Table?db=core;g=ENSG00000184012;r=21:41464305-41531116). In this paper, we selected only six missense variants (rs12329760, rs781089181, rs762108701, rs1185182900, rs570454392, and rs867186402) to investigate the potential genetic susceptibility to COVID-19. For that, we used a multitier approach using different algorithms such as functional analysis of human TMPRSS2 missense variants using SIFT, PolyPhen2.0, PROVEAN, SNAP2, PMut, and MutPred; stability analysis of mutant proteins using I-Mutant Suite, MUpro, CUPSAT, iStable, and STRUM; the implication of missense variants with conserved and exposed residues in TMPRSS2 protein by using Clustal Omega and ConSurf tools; analysis of the effect of missense variants on protein flexibility and secondary structure using FlexPred, PredyFlexy, RaptorX property, and PredictProtein, respectively; structure analysis and comparison between tertiary structures of mutant and native proteins using Swiss Model, ModRefiner, PROCHECK, Verify3D, TM-align; and finally ligand binding site prediction using COACH and RaptorX binding site servers.

Functional Analysis of Human TMPRSS2 Missense
Variants. Among the six missense variants tested, five were predicted damaging (prediction score was ranged from 0 to 0.02) ( Table 1). According to PolyPhen2.0, all the variants were identified as probably damaging (prediction score close to 1), while PROVEAN predicted five of the SNPs to be deleterious (G181R, R240C, P335L, G432A, and D435Y), SNAP2 predicted all of the submitted SNPs to affect protein function. When using the PMut, five of the subjected mutations were found to be disease-related (V160M, G181R, P335L, G432A, and D435Y). As presented in Table 2, MutPred analysis revealed that G181R was significantly associated with gain of a helix, loss of disulfide at C185, and the gain of ADPribosylation at G181 with g − value = 0:607 and p value < 0.05. It did find also that the G432A substitution induced a loss of loop, altered metal binding, the gain of disulfide linkage at C437, the gain of the catalytic site at D435, and gain of pyrrolidone carboxylic acid at Q431 with g − value = 0:874 and p value > 0.05. Finally, the D435Y substitution showed the highest g − value = 0:919 and a lower p value that was associated with a gain of disulfide linkage at C437.  The bold values show the residues included in the current study.   (Tables 3 and 4).

Conservation
Analysis of TMPRSS2 Gene. The amino acid sequence of TMPS2_Human transmembrane protease serine 2 protein was blasted against the UniprotKB/Swis-sProt in NCBI databases, and 100 sequences producing significant alignments were downloaded as Hit Table (CSV) files. Therefore, all sequences share more than 70% identity and an E-value equal to 0. Clustal omega was used for multiple sequence alignment (MSA). The residue identities were visualized and colored using Jalview program, according to the Clustal color scheme and the conservation score.

Evolutionary Phylogenetic Analysis of TMPRSS2.
The amino acid evolutionary conservation in TMPRSS2 protein was checked using the ConSurf server. As presented in Figures 1 and 2 and

12
BioMed Research International residues G181 (buried), G432, and D435 (exposed and functional) are highly conserved with an index conservation of 9 and identified less conserved amino acid residues V160 (buried) and R240 (exposed) with an index conservation of 5-6. P335 was observed to have a conservation score of 3 (variable and exposed).

Protein Flexibility.
FlexPred program was used to predict fluctuations and evaluate which amino acid residues are located in flexible or rigid regions of the TMPRSS2 protein.
For identifying the levels of residue dynamics, we used the PredyFlexy program based on B-factor (relative vibrational motion) and root mean square fluctuations (RMSFs). As shown in Table 6 and Figure 2, PredyFlexy analysis showed that residues V160, G181, and P335 shared moderately and highly flexibility scores (predicted flexibility between 0 and 0.5) with a confidence index of 7-11, while the residues R240 and D435 were identified as rigid with low index scores. Then, G432 is predicted as flexible but the low confidence score (CI = 2) makes the result not reliable.
To determine protein secondary structure, disorder regions, and solvent accessibility of TMPRSS2 protein, the RaptorX property was used. As exposed in Figure 3(c), 88 (17%) positions were predicted as disordered by RaptorX property; then, eight secondary structure types were identified in the TMPRSS2 protein, such as α helix, 3-helix, 5helix (ℼ helix), extended strand in β ladder, isolated β bridge, hydrogen-bonded turn, bend, and coil. Results of solvent accessibility of TMPRSS2 protein were 27% intermedia, 46% exposed residues, and 25% buried residues (Figure 3(c)).

Secondary Structure.
To validate the solvent accessibility and protein secondary structure, we applied the Predict-Protein tool. The most types of secondary structure of the TMPRSS2 protein are the helix, buried, exposed, and disordered regions. Then, three types of protein secondary structure were identified in the TMPRSS2 protein, which was helix 2.64% (H; includes α, Pi-, and 3_10-helix), β-strand 23.37% (E; extended strand in the β-sheet conformation of at least two residues length), and loop (L) 73.98%. Figures 3(a) and 3(b), display the PredictProtein analysis of the TMPRSS2 protein (46.14% buried residues and 53.86% exposed residues).
3.2.6. Modeling. The full three-dimensional structure of human TMPRSS2 protein was not available in the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB-PDB, http://rcsb.org). For that, the SwissModeller group has modeled the TMPRSS2 structure with a resolution of 1.95 Å and sequence identity equal to 98.69% was used for further analysis. The selected structures wild type and mutants were refined using ModRefiner and were validated using PROCHECK and Verify3D (Table 7). Ramachandran plot of the native protein identified 258 residues (86.9%) in favored regions, 38 residues (12.8%) in allowed regions (additional and generously allowed regions), and one residue (0.3%) in disallowed regions (Figure 3(d)). Furthermore, Verify3D analysis of the native and mutant proteins revealed that 95.38% (native) of the residues had an average 3D-1D score ≥ of 0:2, while the models (V160M, G181R, R240C, P335, G432, and D435) showed that  Besides, structural similarities between the wild-type and mutant structures were performed using TM-align tool based on TM-score to assess the topological similarity of two proteins and the RMSD (Root Mean Square Deviation) to measure the distance between the backbones of the superimposed protein structures. The RMSD values for all missense variants were significant (RMSD > 0:15), suggesting dissimilarity between wild-type and mutant models ( Figure 4, Table 8).
According to the COACH server, D435 was predicted as a binding residue. The detailed results of COACH are shown in Table 10.
3.2.8. Protein Display. The topology prediction was shown by the Protter server; the figure illustrates a long cytoplasmic N-terminus and suggests that the TMPRSS2 protein was located mostly at the extracellular part of the cell membrane. Then, the five amino acids (orange color) represent the predicted variants such as V160, S254, E331, K451, and D491 ( Figure 5).

Dynamic Cross-Correlation Matrix Analysis Using
Bio3d Package by RStudio Software and DynOmics Server. DCCM was done to comprehend the correlated communications between residues. The result showed that as compared with the wild type, the V160M, G181R, R240C, P335L, G432A, and D435Y variants decreased the degree of positive (red color) and negative (blue color) correlations observed in the TMPRSS2 native, despite the fact that no significant correlation in the movement of residues has been remarked in the Dynamic Cross-Correlation Matrix analysis ( Figure 6, Table 11).

Discussion
The transmembrane serine protease 2 (TMPRSS2) plays a crucial role in human cell entry of a diverse range of viruses including SARS-CoV-2 [2]. Strikingly, a recent investigation by Hou et al. found six deleterious variants such as p. Val 160Met, p. Gly181Arg, p. Arg240Cys, p. Pro335Leu, p. Gly432Ala, and p. Arg435Tyr in the TMPRSS2 gene, which are demonstrated as somatic mutations in different cancer databases and also suggest explanations for genetic susceptibility to COVID-19 [11]. This analysis reported that TMPRSS2 variants were probably associated with susceptibility to SARS-CoV-2 [11]. So, in this report, we look for these six missense variants (V160M, G181R, R240C, P335L, G432A, and D435Y) which previously might be important risk factors associated with COVID-19 suscepti-bility. The current study might also be helpful to understand the effect of those variants on TMPRSS2 structure, function, and stability. A series of in silico prediction analyses were used for the functional and structural annotations of human TMPRSS2 missense variants like SIFT, PolyPhen2.0, PRO-VEAN, SNAP2, PMut, MutPred2, I-Mutant Suite, MUpro, iStable, CUPSAT, and STRUM, respectively, which were utilized to find out the most deleterious variants of TMPRSS2 and to evaluate their effects on TMPRSS2 function, structure, and stability.
From our functional analysis of human TMPRSS2 missense variants, SIFT predicted five of total variants are deleterious; these five variations were predicted deleterious by PROVEAN (except for V160M), SNAP2, and PMut (except for R240C). Protein stability is essential for understanding the relationship between protein structure and function [46]. A total of six variants tested were identified decreasing the stability of TMPRSS2 by all algorithms for V160M and G181R and by at least three tools for the rest (R240C, P335L, G432A, and D435Y), by analyzing all missense variants through different servers. The six nsSNPs are potentially damaging. ConSurf analysis results showed that variants at positions G181R, G432A, and D435Y were in the highly conserved region and confirmed by MutPred2 to have crucial alterations on the TMPRSS2 protein. The prediction of posttranslational modification sites (PTMs) is one of the important characteristics for understanding different biological processes such as the cell signalling state, localization, and interactions. It can also be essential for the study of diseases or for development of drugs [47]. Therefore, the R240, P335, G432, and D435 residues identified PTMs for proteolytic cleavage and ADP-ribosylation. Flexibility is one of the most essential criteria related to protein functions. Herein, we used FlexPred and PredyFlexy to determine conformational changes and to comprehend dynamic system of TMPRSS2. Variants R240C and D435Y were predicted to be in a relatively rigid region, while G432A was defined as a flexible area. We have also investigated the secondary structure of native and mutants by identifying disordered regions in TMPRSS2 using PredictProtein and RaptorX property. Compared to the native structure of TMPRSS2 protein, 5 disordered regions were formed due to V160M and P335L variants, since this can change the function of TMPRSS2 because disordered regions are dynamically flexible. Prediction of three-dimensional structures of TMPRSS2 models is necessary for the validation of structural changes. Therefore, the three-dimensional structure of the TMPRSS2 native and mutants was generated using 7 meq as a template from the SwissModeller group and refined by using ModRefiner. Quality checking of Swiss-Model constructed models was done by using PROCHECK and Verify3D. Ramachandran plot analysis showed that all models of TMPRSS2 (wild-type and mutants) were of good quality and can be used for further study; then, quantitative assessment was done by using the TM-align tool for comparing native and mutant proteins by calculating RMSD values and TM-score. All RMSD values were significant (RMSD > 0:15). The highest RMSD value 0.58 was scored by the variant R240C, while the lowest 0.44 was scored by 14 BioMed Research International the variant G432A. Besides, we used the RaptorX binding site and COACH servers for finding sites of further variants. However, two residues D435 and P335 were identified to be implicated in ligand binding site interactions of ligands with the TMPRSS2 protein. Consequently, our results give the clue that V160M, G181R, R240C, P335L, G432A, and D435Y can be the most significant variants in the human TMPRSS2 gene and may influence stability, structure, function, and interaction of ligands with the TMPRSS2 protein.
To date, various in silico analyses have been made using different bioinformatics tools to identify and predict TMPRSS2 gene host polymorphism against SARS-CoV-2. As our results show, a study by [48] has shown that the TMPRSS2 p. Val160Met polymorphism was associated with SARS-CoV-2 infectivity. A recent investigation by Asselta et al. reported the existence of some TMPRSS2 polymorphisms, namely, rs2070788, rs9974589, and rs7364083. These variants showed a significant association between these SNPs and the SARS-CoV-2 infectivity [40]. Another study by Irham et al. (2020) demonstrated that some variants of TMPRSS2, namely, rs2070788, rs383510, rs464397, and rs469390, might affect the expression of TMPRSS2 in some many tissues and consequently were probably associated with SARS-CoV-2 infectivity [51].
Overall, this in silico analysis gives an interesting insight into the role of the TMPRSS2 variants in susceptibility to SARS-CoV-2 infection. The analysis consortium would also involve researchers and scientists in the future to confirm the selected mutations (V160M, G181R, R240C, P335L, G432A, and D435Y) as candidate variants. In the future, it should be noted that further in silico analysis and laboratory experiments must be combined for more justifying such important results.

Conclusion
Overall, we conclude that rs12329760 (V160M), rs781089181 (G181R), rs762108701 (R240C), rs1185182900 (P335L), rs570454392 (G432A), and rs867186402 (D435Y) are the most significant variants. All six nsSNPs were predicted to alter protein function and stability. Most of them are highly conserved (V160M, G181R, G432A, and D435Y) and comprise posttranslational modification sites (PTMs) (R240C, P335L, G432A, and D435Y). D435 was identified as a ligand-binding site that may interfere in the binding interactions of the TMPRSS2 protein. In this in silico analysis, for the first time, we tested the effect of those missense variants on TMPRSS2 structure, stability, and function by using various bioinformatics algorithms that may serve an important role in SARS-CoV-2 infection.

Data Availability
All data used in this paper are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Lahcen Wakrim and Anass Kettani contributed equally to this work.