Development of a Novel Bioinformatics Tool for In Silico Validation of Protein Interactions

Protein interactions are crucial in most biological processes. Several in silico methods have been recently developed to predict them. This paper describes a bioinformatics method that combines sequence similarity and structural information to support experimental studies on protein interactions. Given a target protein, the approach selects the most likely interactors among the candidates revealed by experimental techniques, but not yet in vivo validated. The sequence and the structural information of the in vivo confirmed proteins and complexes are exploited to evaluate the candidate interactors. Finally, a score is calculated to suggest the most likely interactors of the target protein. As an example, we searched for GRB2 interactors. We ranked a set of 46 candidate interactors by the presented method. These candidates were then reduced to 21, through a score threshold chosen by means of a cross-validation strategy. Among them, the isoform 1 of MAPK14 was in silico confirmed as a GRB2 interactor. Finally, given a set of already confirmed interactors of GRB2, the accuracy and the precision of the approach were 75% and 86%, respectively. In conclusion, the proposed method can be conveniently exploited to select the proteins to be experimentally investigated within a set of potential interactors.


Introduction
Proteins rarely perform their biological functions independently, since they usually interact with each other. In fact, most of the biological activities require the creation of protein complexes. As a consequence, the different levels of complexity of the biological systems are not exclusively determined by the number of proteins of an organism, but also by the number of their interactions. Many experimental methods have been developed to study protein interactions, such as the two hybrid system in yeast, the affinity purification followed by mass spectrometry and the phage display libraries [1][2][3][4]. The use of these techniques led to the creation of many databases containing a great number of protein-protein interactions, such as the Database of Interacting Proteins (DIPs), the General Repository for Interaction Datasets (BioGRIDs), the Human Protein Reference Database (HPRD), and the Biomolecular Interaction Network Database (BIND) [5][6][7][8]. Because of the high amount of false positives and false negatives resulting from the application of these experimental approaches on a large scale, such data repositories must be cautiously used. Once a list of candidates is obtained, it is necessary to analyze in vivo every possible interactor by expensive, timeconsuming, and labour-intensive experimental techniques in order to validate the in vitro experimental result. For this reason, in silico methods for the prediction of protein interactions are considered valid tools to reduce the number of candidates [9].
Two main computational approaches, based on sequence similarity and structural modelling, have been applied for the prediction of protein interactions. The former is based on the selection of potential interacting partners on the basis of the sequence similarity with confirmed interactors [10][11][12][13]. This approach is based on the "homology modelling" principle: similar protein sequences, sharing similar structures, should also share similar interactants [14]. Among sequencesimilarity based methods, one of the most widely used is the "mirror tree" approach, based on the assumption that interacting protein pairs are likely to evolve in a correlated fashion [15,16]. The latter set of approaches is based on the properties of the three-dimensional structures of the proteins, generally referred to as docking methods; these methods exploit surface complementarity and electrostatics properties to predict reliable structural complexes [17][18][19][20][21][22][23][24][25].
Although both these approaches are based on strong theoretical and experimental data, they exhibit some limitations. The first approach might fail to predict protein interactions, since a protein complex might be subjected to a different selection pressure than each single constituting protein during evolution [26,27]. For this reason, all the structural details of a protein interaction become important to determine the affinity and the specificity of protein interactions. Docking methods, in fact, analyze the interactions at a three-dimensional level and are therefore considered to be more accurate, also evaluating the biophysical parameters of the interaction sites [28]. However, docking methods are generally limited by the lack of the structures for the majority of the proteins and by incomplete bio-physical interaction knowledge [29,30]. For these reasons, at present, an integration of the two approaches is essential to better predict putative interacting proteins [31][32][33].
The aim of this work is to design a knowledge-based tool that can integrate the two approaches described above. In particular, considering a target protein, we first select a set of the candidate interactors already obtained from other experimental results, but not yet in vivo validated. Then, we exploit the sequence and the structural information on in vivo confirmed interacting proteins and complexes, to finally select the most reliable partners of the target protein.

Materials and Methods
The proposed bioinformatics approach is summarized in Figure 1. The algorithm predicts the protein interactors of  The approach follows five steps:  [34]. The score of the best alignment of each I ∈ PINT is the first element of the score function (Score 1 (I)). CINT aminoacid conservation with known protein structures is evaluated using the ConSurf tool [35]. PINT sequences are then aligned with the conserved regions of CINT members, using the Needleman and Wunsch algorithm, thus computing the second component of the score (Score 2 (I)).

Extraction of the Interaction Motifs from the Protein
Complexes. The aminoacids at the interface binding sites belonging to chains interacting with TP are retrieved on the basis of the information on known protein complexes. These interacting aminoacids are exploited to build a set of interacting motifs, which are then searched within the PINT members for the calculation of the third and fourth components of the score function (Score 3 (I) and Score 4 (I)). These steps of the methods are described in details in the following sections.

Finding the Interacting Aminoacids by Looking at the Intermolecular Distances.
Using the Cartesian coordinates of the complexes involving TP, reported in PDB files, we identify putative TP interacting proteins. For every interactor, we find the interacting aminoacids between TP and the interactor chains: these residues are defined as "centres of bond". In order to reduce the computational burden involved in the identification of the interacting residues, we primarily select the aminoacids at the protein interfaces. In detail, we first look for the amino acids that were less than 15 Å far from one residue of the target chain. Then, for each interactor, we consider as interface residues also those closer than 10 Å to the aminoacids already found. In this step, the distance between two aminoacids is calculated as the distance between the α carbon atoms. Once the interfaces are defined, we search for the interacting aminoacids, deriving from disulfur bridges, hydrogen bonds and salt bridges electrostatic interactions. The following aspects are taken into consideration: (i) A cysteine sulfur bridge satisfies the following geometrical constraints [36]: the two sulfur atoms (SG, according to PDB nomenclature) must be 2-2.1 Å far and the distance between the β carbon of a cysteine and the sulfur atom of the other cysteine is set to 3-3.1 Å . (ii) For hydrogen bonds, the distances between the acceptor and donor are computed. As shown in Figure 2, the distance (d) between a donor (D) and an acceptor (A) should be less than 3.5 Å and the angle σ should be less than π/2. Because the majority of Table 1: List of the atoms considered as acceptors and donors. For both classes, the 3-letter codes of the amino acids, the symbol used in the PDB files of the considered atom, and the maximum number of hydrogen bonds of atom are reported.
the PDB files do not contain the coordinates of the hydrogen atoms, we do not consider the other three geometrical features of the bond (r < 2.5 Å , β > π/2 and θ > π/2) [37,38]. Moreover, we define the maximum number of hydrogen bonds that an atom can form (as showed in Table 1), following the results of a statistical analysis reported by McDonald and Thornton [39]. If a residue overtook the maximum number of bonds, we considered only those with the lowest distances between acceptor and donor.
(iii) Salt bridges are electrostatic interactions between residues with opposite charges. The negatively charged atoms at physiological pH are OD1 and OD2 of asparagine, and OD1 and OE2 of glutamic acid. The positively charged ones are NH1, NH2 of arginine and NZ of lysine. For a salt bridge between two residues with opposite charges, the distance between the charged atoms is set to be less than 3.5 Å [36].

Enlargement of the Center of Bond.
Because of the folding of the primary structure, two residues that are neighbours on the surface of the three-dimensional structure can be far apart in the protein sequence. This explains why the interacting amino acids are often spread in the protein linear sequence, so that we can find them completely isolated. Moreover, although the interacting amino acids are the most important components of the interaction, also the neighbouring residues may effectively contribute. For this reason, we enlarge the "centre of bond" considering the neighbouring residues with the same hydropathy of the center of bond and then adding the so called proximal amino acids. Hydropathy and proximal amino acids were computed as follows. show the amino acid one-letter code, the residues coordinates, the status of proximity with respect to the pattern chain in the complex, the status of hydropathy (1 hydrophobic, 0 hydrophilic) and the secondary structure (H = alpha chain; L = loop) of every amino acid around a center of bond. The center of bond is represented by the amino acids within the bold lines (i.e., S and N); the greyhighlighted rows are the results of the symmetrical enlargement due to hydropathy, while the amino acid reported in italic (V) are grouped because of its proximity to the opposite chain.
Amino acid Position Proximity Hydropathy Sec.Struct. (i) We calculate the hydrophobic and the hydrophilic regions of the proteins, using the hydropathy scale of Kyte-Doolittle [40]. In particular, we set to 1 every amino acid with a positive value in the scale (hydrophobic residues) and to 0 those having negative values (hydrophilic residues).
(ii) Then, in addition to directly interacting residues, we also consider those that do not interact, but are closer than 4.2 Å to the corresponding ones of the TP chain, defined as "proximal aminoacids". To calculate the distances, we consider the most external atoms of the backbone of each amino acid. Table 2 reported an example of the centre of bond enlargement process.
(1) If the center of bond (i.e., SN) is in a hydrophilic region, we perform a symmetrical enlargement, until a hydrophobic amino acid is found (i.e., PK and RK).
(2) Then the proximal amino acids adjacent to the result of the enlargement (i.e., PKSNRK) is added (i.e., V at position 48) As a result, it is possible to group two or more adjacent centers of bond. We define a set of one or more grouped centers of bond as a "binding site." The set of binding sites belonging to an interface between two chains is denoted as "interacting site".

Building the Interacting Motifs.
For each binding site, we extract "interacting motif " through the analysis of secondary and tertiary structures of the proteins; these motifs are then searched within the PINT sequences.
To analyse the sequence motifs, we divide the aminoacids into six classes looking at their hydropathy and charge, as Table 3: Amino acids classes considered with respect to their hydropathy and charge. Every row shows the identification class and the amino acids components.

Class
Amino Acids  I  ILE-VAL-LEU  II  PHE-CYS-MET-ALA  III  GLY-THR-SER-TRP-TYR-PRO  IV  HIS-GLN-ASN  V  GLU-ASP  VI LYS-ARG shown in Table 3. Every amino acid belonging to the center of bond is assumed to be invariant in the motif, while the others residues (except proline) belonging to the binding site are variable within their class, as defined in Table 3. In case of proline, its structure do not allow the movement of the bond between the α carbon and the nitrogen of the backbone, blocking the rotation of the chain; as a consequence, the substitution of the proline with any other aminoacid could influence the stability of the interaction. The secondary structures of the interactors, as extracted from the PDB files of the protein complexes, are also considered to compute of the structural motifs. The obtained motifs of the example sequence (Table 2) is therefore where loops (L) and alpha chains (H) are reported. Finally, we assign a score to every motif according to the following rules: every unchanged residue is scored 20, while every variable amino acid is scored as the ratio between 20 and the number of allowed variations.
Then, the final score of the motif (motif score, ScoreM) is computed by multiplying the single scores of the aminoacids. For example, the score of the sequence motif

Searching for the Interacting Motifs in the Potential
Interactors: Score 3 and Score 4 . We search for both the sequence and structural motifs in all the predicted interactors. Because the secondary structure is unknown for the majority of the potential interactors, three of the most used tools for secondary structure prediction, that is, PREDATOR, NNPREDICT and NPS [41][42][43], are employed. All these algorithms assign to every amino acid a secondary structure as loop (L), alpha chain (H) and beta sheet (E). Therefore, we compute a score for every interactor I of the list PINT as Score 3 (I) = max ScoreM I length sequence(I) , where ScoreM I is the list of motif scores of the interactor I. Finally, assuming that a single motif is not sufficient to determine an interaction, it is also important to take into account how many complete interaction sites (sets of binding sites belonging to an interface between two chains) are present. For this reason, we define another score related to the subset of the motifs of the interactor I belonging to an interaction site Score 4 (I) = Score MI length sequence(I) , where Score M I is the list of motif scores of the interactor I belonging to an interaction site that includes all the motifs.

Final Score Calculation.
Once the four scores are computed, we calculate a normalised final score, which expresses a measure of the likelihood that a potential interactor is a real interactor of the target protein.

Results
The proposed procedure was tested by using the growth factor receptor-bound protein 2 (GRB2) as the target protein.
Moreover, a validation approach was applied (seeSection 3.6) to evaluate the accuracy and precision of the procedure and to choose a suitable score threshold to predict new interactors.

Database Search of GRB2 Interactors.
We initially retrieved 21 known structures of protein complexes containing GRB2 main domains (i.e., SH2 and SH3). The list of potential interactors (PINT) was then obtained by retrieving from HPRD database 46 in vitro discovered interactors. We also retrieved from the same database 141 interactors in vivo validated, which together with the sequences of the proteins extracted from the 21 GRB2 complexes, formed a list of 247 confirmed interactors (CINT).

Alignment with the Confirmed Interactors
: Score 1 and Score 2 . Each PINT member was aligned with CINT counterparts, evaluating Score 1 for the best alignments. Next, every potential interactor was aligned with the conserved regions of the confirmed ones, computed with ConSurf, thus obtaining the values of Score 2 (Table 4).

Extraction of the Interaction Motifs from the Protein
Complexes. We extracted binding sites for every interface between GRB2 and the different chains of each of the 21 retrieved complexes. We found 190 different interaction motifs, with scores ranging from a minimum of 20 (single aminoacid motifs) to a maximum of 1.58 ×

Searching for the Interacting Motifs in the Potential
Interactors: Score 3 and Score 4 . We searched for the interacting motifs in the PINT list and we selected, for each potential Table 4: Overview of the scores used to rank the putative GRB2 interactors. NCBI protein accession number, Score 1 , Score 2 , the sequence and the structure configuration of the best motif, Score 3 and Score 4 , were reported. The last column showed the final score assigned to each interactor. interactor, the one with the highest score. Table 4 highlighted the sequence, the structural configuration and the score (Score 3 ) of the motif with the highest ranking. By grouping the motifs belonging to the same interface between two chains, we found 76 interaction sites. Table 4 showed the number and the sum of the calculated scores only for the motifs belonging to an interaction site (Score 4 ).

3.5.
Final Score Calculation. The final score of each of the 46 possible GRB2 interactors was computed as the normalized sum of all the scores, as illustrated in Table 4.

Validation of the Proposed Method.
To test the methodology, a dataset made of 10 GRB2 true (confirmed) and 10 false (not confirmed, randomly chosen proteins) interactors was employed; the final scores were computed for each simulated interaction. To perform an unbiased comparison, we removed the positive interactors from the CINT sequences. Then, we applied a "Leave-one-out" crossvalidation procedure by considering each time a different protein as the unique "test" case and the remaining as "training set". We then applied a simple classifier, obtained by computing the best threshold (Th) on the scores of the training set to maximize the Information Gain (IG): p(cT Score)log 2 (cT Score), where c was the class of the protein (confirmed interactors/not interacting proteins) and TScore a binary variable such that As a result, the mean accuracy of the Leave-one-out procedure was 75%, and the mean precision was 86%. Finally, we calculated the median value of the thresholds obtained in the leave-one-out process (i.e., 1.63); this value was then used to select 21 proteins with the highest probability to be GRB2 interactors: among these, the isoform 1 of the mitogen-activated protein kinase 14 (MAPK14) had the highest probability score.

Discussion
We have developed a novel algorithm for prediction of protein-protein interactions that combines structure similarity and sequence conservation of protein complex interfaces.
The performance of the algorithm was tested on the ability to predict GRB2-interacting proteins. GRB2 is a small adapting protein composed of a SH2 and two SH3 domains. This protein plays a very important role in the process of signal transduction, as a mediator between the growth factors receptors at the cellular membrane level and the cytosolic RAS proteins. In particular, the mitosis promoting signal, stimulated by the epidermal growth factor (EGF), requires the tyrosine kinase activity, originated by specific trans membrane receptors (EGRFs). Starting from these receptors, the activation of RAS consists in a cascade of protein interactions that involves the GRB2/SOS-1 complex. Different sites of auto phosphorylation in the C-terminal region of EGFRs are binding sites for the SH2 domain of GRB2, while its SH3 domains mediate the recruitment of the exchange factor SOS-l, inducing the subsequent activation of RAS proteins. Then, RAS lead to a cascade that ends with the nuclear translocation of phosphorylated MAP kinase, which then activates transcription factors [44].
After our bioinformatics prediction, we ranked a set of 46 potential GRB2 interactors according to their scores, which we assumed to be putative interactors with the target protein. Resorting to a score threshold chosen by means of a cross-validation strategy, we further screened 21 of the most probable interactors. Among these, MAPK14 had the highest probability score of interaction with GRB2. MAPKs are a group of serine/threonine kinases, activated in response to many extracellular stimuli and mediating different signal transduction pathways. Four different MAPKs families were identified in mammalian cells: extracellular signalregulated kinase (ERK), c-Jun N-terminal kinase/stressactivated protein kinase (JNK/SAPK), ERK5/big MAP kinase 1 (BMK1) and MAPK p38. In particular, MAPK p38 proteins are involved in growth regulation, cellular differentiation, apoptosis, cellular response to inflammation and stress [45]. This subfamily is composed of four members (α, β, γ and δ) and MAPK14 is the isoform α, that together with β, is ubiquitary expressed [46]. To further confirm our in silico prediction, the MAPK14-GRB2 interaction was previously in vitro observed using the GST pull-down technique [47]. In particular, this study hypothesized that in platelets p38α bind to the SH2 domain of GRB2 in response to stimulation mediated by activation of FcγRIIA (CD32) receptor. This association could act by carrying the cytosolic GRB2, with its complexed proteins, towards specific subcellular topologies, driving the complex to specific substrates [47].
At a structural level, the best motif found for MAPK14-GRB2 interaction was PPP[IVL]; this motif was also identified as an interacting site, because in the in silico extracted complex (1GBQ), it allowed the SH3 domain to bind to SOS-1 protein.
However, global alignments between GRB2 and MAPK14 protein sequences did not directly reveal a contribution of the above mentioned motif: differently, three motifs, extracted from complexes in which the SH2 domain of GRB2 was involved, were aligned (i.e., PF, [ED]N[IVL] and [IVL]K). These motifs were localized in Pro58-Phe59, Glu81-Asn82-Val83, and Leu151-Lys152 in MAPK14 protein tertiary structure; as reported in Figure 3, a similar structural topology of the residues Pro56-Phe57, Glu79-Asn80-Ile81, and Leu148-Lys149 was highlighted for ERK2, a tyrosine kinase which has been in vivo confirmed to interact with GRB2 [48]. For this reason, it was possible to predict that MAPK14, as well as ERK2, can interact with the SH2 domain of GRB2, probably through the above mentioned amino acids.
In summary, the method herein proposed is a first step to the definition of a bioinformatics tool to support experimental studies on protein interactions. According to the validation procedure performed, the accuracy and precision of this method were 75% and 86%, respectively. These results might suggest that the proposed bioinformatics approach can be effectively applied to preliminary screen a wide set of protein interactants, such as those deriving from two hybrids systems, to select those to be primarily investigated.
Currently, the main limitations of our method are the small number of complexes with known structures and the relatively poor knowledge on confirmed interactors.
To overcome these limits, we are working on future refinements of the method, in particular on exploiting the available bioinformatics and database knowledge to define different levels of prediction.