Structure Topology Prediction of Discriminative Sequence Motifs in Membrane Proteins with Domains of Unknown Functions

Motivation. Membrane proteins play essential roles in cellular processes of organisms. Photosynthesis, transport of ions and small molecules, signal transduction, and light harvesting are examples of processes which are realised by membrane proteins and contribute to a cell's specificity and functionality. The analysis of membrane proteins has been shown to be an important part of understanding complex biological processes. Genome-wide investigations of membrane proteins have revealed a large number of short, distinct sequence motifs. Results. The in silico analysis of 32 membrane protein families with domains of unknown functions discussed in this study led to a novel approach which describes the separation of motifs by residue-specific distributions. Based on these distributions, the topology structure of the majority of motifs in hypothesised membrane proteins with unknown topology can be predicted. Conclusion. We hypothesise that short sequence motifs can be separated into two groups. On the one hand, there are structure-forming motifs, which show high prediction accuracy in all investigated protein families. This points to their general importance in α-helical membrane protein structure formation and interaction mediation. On the other hand, motifs which show high prediction accuracies only in certain families can be classified as functionally important and relevant for family-specific functional characteristics.


Introduction
Membrane proteins are essential for many fundamental biological processes within organisms. Active nutrient transport, signal and energy transduction, and ion flow are only a few of the numerous functions enabled by membrane proteins [1]. Membrane proteins obtain their specific functionality by individual folding and interactions with the hydrophobic membrane environment as well as, in many cases, by oligomeric complex formation and protein-protein interactions [1,2]. The identification of such complexes and interactions is valuable, since, on the one hand, detailed information on the function of an unknown membrane protein can be obtained by analysing its interactions with proteins of known function. On the other hand, biological processes can be comprehended as a dynamically fluctuating system, whereby the biological role of the unknown membrane protein can be defined more precisely [3,4]. Accordingly, destabilisation of the three-dimensional structure of a membrane protein caused by mutations or ligand interactions is a trigger for numerous diseases, for example, diabetes insipidus, cystic fibrosis, hereditary deafness, and retinitis pigmentosa [5][6][7].
Although 20%-30% of all open reading frames of a typical genome encode membrane proteins [5,8,9] and 60% of all drug targets are membrane proteins [2], membrane proteomics is still an experimentally challenging field due to poor protein solubility, wide intracellular concentration range, and thus, inaccessibility to many proteomics methodologies [10]. Hence, the number of known three-dimensional structures is relatively small, with 394 nonredundant membrane protein chains currently available [11][12][13]. Therefore, there is a need for approaches that allow the prediction of structural and functional features of unknown membrane proteins. A variety of methods have been developed to predict structural features from sequence, such as α-helical membrane-spanning helices and extra/intracellular domains (e.g., TMHMM [14], PHDhtm [15], MEMSAT3 [16]) as well as membrane-spanning beta-strands of transmembrane barrel proteins (e.g., BOCTOPUS [17]). Furthermore, in genome-wide membrane protein sequence analyses, numerous short conserved sequence motifs were identified [18]. As an example, the most widely discussed GxxxG motif has been shown to be significantly present in transmembrane helices. With both glycines resting on one side of the helix as spatially neighbouring residues, and thereby forming a smooth helix membrane surface, structural studies confirmed that the GxxxG motif plays an important part in mediating helix-helix interactions [18][19][20][21][22]. In general, short conserved membrane protein motifs are considered to be significantly relevant for membrane protein folding and structural stability as well as being involved in defining a protein's function. Hence, sequence motif analyses and the resulting insights can support the understanding of protein dynamics. Information can be derived which may contribute to the study of the dynamics of mutant proteins and the effects of mutagens [23][24][25]. Additionally, as addressed in [26], the analysis of sequence motifs in proteins with similar function or structure might help to identify essential functional sites and locations which contribute to structural stability.
In this work, we focused on previous studies and results that have been reported by Liu and colleagues [18]. In the process, various integral membrane protein families with polytopic membrane domains had been obtained from the Pfam database [27]. As part of their studies, locations of the least conserved residues (glycine, proline, and tyrosine) in helical transmembrane regions had been investigated. As a result, short motifs consisting of pairs of small residues (glycine, alanine, and serine) surrounding single or multiple variable positions had been identified in conserved sequences and Pfam-classified families. Based on these results, we have developed a prediction approach to allocate the topological state of a sequence motif in the protein structure based on sequence information. We have used cross-validation to verify the prediction accuracy. However, prediction accuracy has been found to be variable for certain motifs with regard to the investigated protein families. Accordingly, we hypothesise that short sequence motifs can be separated into two groups. On the one hand, there are structure-forming motifs, which show high prediction accuracy in all investigated protein families. This points to their general importance in α-helical membrane protein structure formation and interaction mediation. On the other hand, motifs which show high prediction accuracies only in certain families can be classified as functionally important and relevant for family-specific functional characteristics.

Used Membrane Protein Families.
As the first step of our analysis, 32 membrane protein families with domains of unknown functions (DUF) were obtained from the Pfam database [27] using extended keyword searching. All 7051 sequences were retrieved for statistical analysis. The full list of employed membrane protein families is given in Table 1.
Table 1: Thirty-two membrane protein families were derived from Pfam database [28] and employed for statistical analysis.

Subsequently, 50 sequence motifs, identified by Liu and colleagues [18], were localised in the obtained set of families.

Programs and Tools.
To avoid generating misleading statistics by including identical or highly similar sequences, a set of nonredundant sequences was generated. Here, we defined the sequence redundancy threshold at 25% sequence identity. In the first step of sequence processing, CD-HIT [29] was applied for an initial clustering. However, CD-HIT accepts only nonredundancy thresholds of >40%. This limitation is caused by the internal word-length filtering approach and statistical presets. Hence, to ensure clustering sensitivity, a 60% nonredundancy threshold, which corresponds to the tetrapeptide word filtering used by the program, was applied. In the second step, sequence clustering using the 25% redundancy threshold was carried out with BLASTClust [30]. The representative sequences of all clusters were extracted, leading to a set of 2511 nonredundant sequences.
Intuitively, the reported short sequence motifs can be written in a generalised, regular expression-like form of XYn, where X and Y correspond to amino acids separated by $n - 1$ highly variable positions. However, in the process of analysis we found that short motifs with a relatively small number of variable positions (more precisely, if $n$ is found to be <3) do not contain enough information to be investigated by our approach. Thus, these motifs have been discarded in the process, which resulted in a final set of 33 sequence motifs. In our nonredundant sequence set, almost 250,000 single motif occurrences were identified. As an example of motifs located in a membrane protein structure, Figure 1 illustrates seven motifs which can be found in the structure of bacteriorhodopsin (PDB-Id: 1brr).
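The XYn notation maps directly onto a regular expression. The following Python sketch illustrates how motif occurrences could be located; it is not the implementation used in this study. The function name `find_motif` and the toy sequence are hypothetical, and reporting overlapping occurrences is an assumption.

```python
import re

def find_motif(sequence, x, y, n):
    """Find every XYn occurrence in `sequence`, overlapping hits included.

    Returns (start_index, variable_residues) tuples, where variable_residues
    are the n - 1 residues between the anchor residues X and Y.
    """
    # The lookahead matches without consuming characters, so the scan moves
    # on one residue at a time and overlapping occurrences are reported too.
    pattern = re.compile(
        rf"(?=({re.escape(x)}(.{{{n - 1}}}){re.escape(y)}))"
    )
    return [(hit.start(), hit.group(2)) for hit in pattern.finditer(sequence)]

# Toy sequence (not taken from the study) containing two LG5 occurrences.
print(find_motif("ALAVIFGLSTCWG", "L", "G", 5))
# prints [(1, 'AVIF'), (7, 'STCW')]
```

For LG5 the generated pattern is `L.{4}G` wrapped in a lookahead, so the four variable residues between leucine and glycine are captured for the later position-wise statistics.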

Information Extraction and Clustering.
In this work, a novel approach is presented which predicts the topology state of a short sequence motif in membrane proteins. The following steps were completed to realise this approach.
At first, all single motif occurrences were identified in the nonredundant sequence set. Including TMHMM predictions, each motif occurrence was assigned to a topology state as elucidated in Section 2.2. In addition to the defined topology states "TM" and "nTM," a further state has been defined for this study. Each motif occurrence whose beginning and end are located in the different topology states "TM" and "nTM" has been assigned the "trans" state. Subsequently, all variable positions within each motif occurrence were examined more closely. Ultimately, for each variable position, the relative occurrence of each amino acid at the specified position of each motif was calculated. To define a separation rule for the investigated motifs, an information-based approach was applied. Formally, a motif $M$, for instance LG5, can be interpreted as a set of variable strings with a length of $l$. Intuitively, in case of LG5 $l$ equals 4. To include the membership information of the three topology states, we separated $M$ into three motif subsets $M_{\mathrm{TM}}$, $M_{\mathrm{nTM}}$, and $M_{\mathrm{trans}}$ according to the topology state $t$ in which each single motif occurrence $m \in M$ is located. Furthermore, in each motif subset $M_t$ each position $\mathrm{pos}_i$ with $i \in [1, l]$ can be investigated concerning its amino acid distribution. To this end, interpreting $M_t$ as a set of strings $s_1, s_2, \ldots, s_k$ (all identified motif occurrences found in topology state $t$) allows formulating the relative probability $P(a \mid \mathrm{pos}_i \mid M_t)$:

$$P(a \mid \mathrm{pos}_i \mid M_t) = \frac{1}{k} \sum_{j=1}^{k} \delta(s_j[i], a) \quad (1)$$

with

$$\delta(x, a) = \begin{cases} 1 & \text{if } x = a, \\ 0 & \text{otherwise,} \end{cases} \quad (2)$$

where $s_j[i]$ denotes the residue at position $\mathrm{pos}_i$ of string $s_j$ and $a$ corresponds to one of the 20 canonical amino acids.
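The state assignment and the position-wise relative frequencies described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the per-residue boolean mask `in_membrane` is a hypothetical stand-in for parsed TMHMM output, and the state is decided from the first and last residue of the occurrence only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def topology_state(in_membrane, start, motif_length):
    """Assign "TM", "nTM" or "trans" from the two ends of a motif occurrence.

    `in_membrane` is a hypothetical per-residue mask (True = predicted
    transmembrane); how it is derived from TMHMM output is not shown here.
    """
    first = in_membrane[start]
    last = in_membrane[start + motif_length - 1]
    if first and last:
        return "TM"
    if not first and not last:
        return "nTM"
    return "trans"  # the occurrence begins and ends in different states

def position_probabilities(variable_strings):
    """Relative frequency of each amino acid at each variable position.

    `variable_strings` are the equally long variable parts s_1..s_k of all
    occurrences of one motif in one topology state.
    """
    k = len(variable_strings)
    length = len(variable_strings[0])
    profiles = []
    for i in range(length):
        counts = Counter(s[i] for s in variable_strings)
        profiles.append({a: counts[a] / k for a in AMINO_ACIDS})
    return profiles

# Toy data: an LG5 occurrence spans 6 residues (L, four variable, G).
mask = [False, True, True, True, True, True, True, False]
print(topology_state(mask, 1, 6))  # prints TM
print(position_probabilities(["AVIF", "AVLF", "GVLF", "ALLF"])[0]["A"])  # prints 0.75
```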
To weight the significance of each probability $P(a \mid \mathrm{pos}_i \mid M_t)$, the probability $P(a \mid \mathrm{Nature})$ is applied in a log-odd formula:

$$\mathrm{LO}(a \mid \mathrm{pos}_i \mid M_t) = \log \frac{P(a \mid \mathrm{pos}_i \mid M_t)}{P(a \mid \mathrm{Nature})} \quad (3)$$

The amino acid distribution $P(a \mid \mathrm{Nature})$ used to test the significance of the observed relative probability at each motif position was computed from the NCBI nonredundant protein sequence set [32] (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz).
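A log-odd value of this kind could be computed as in the following sketch. Several details are assumptions rather than facts from this study: the natural logarithm is used because the base is not stated, a small pseudocount (`pseudo`) is added so that amino acids never observed at a position do not lead to a logarithm of zero, and the background is estimated from whatever sequences are passed in rather than from the NCBI set.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def background_frequencies(sequences):
    """Estimate background amino acid frequencies from a set of sequences."""
    counts = Counter(res for seq in sequences for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: counts[a] / total for a in AMINO_ACIDS}

def log_odds(p_observed, p_background, pseudo=1e-3):
    """Log-odd of an observed probability against the background.

    The pseudocount is an assumption of this sketch, not part of the study.
    """
    return math.log((p_observed + pseudo) / (p_background + pseudo))

def log_odd_profile(position_probs, background, pseudo=1e-3):
    """20-dimensional vector of log-odd values for one variable position."""
    return [
        log_odds(position_probs.get(a, 0.0), background[a], pseudo)
        for a in AMINO_ACIDS
    ]

# Toy background (two made-up sequences), for illustration only.
background = background_frequencies(["AAGL", "GG"])
profile = log_odd_profile({"G": 0.5}, background)
print(profile[AMINO_ACIDS.index("G")])  # prints 0.0 (observed equals background)
```

Positive values mark amino acids that are enriched at a position relative to the background, negative values mark depleted ones.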
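Using these log-odd values, visualisation, clustering, and information extraction can be performed. To this end, we transformed each position $\mathrm{pos}_i$ into a vector consisting of the log-odd values $\mathrm{LO}(a \mid \mathrm{pos}_i \mid M_t)$ from (3), which we refer to as log-odd profile $\mathrm{LOP}(\mathrm{pos}_i \mid M_t)$ and which is defined as

$$\mathrm{LOP}(\mathrm{pos}_i \mid M_t) = \big(\mathrm{LO}(a_1 \mid \mathrm{pos}_i \mid M_t), \ldots, \mathrm{LO}(a_{20} \mid \mathrm{pos}_i \mid M_t)\big) \quad (4)$$

Clustering of all resulting $\mathrm{LOP}(\mathrm{pos}_i \mid M_t)$ was based on a distance $d$ between two profiles, defined in (5) in terms of $\rho(\mathrm{LOP}(\mathrm{pos}_i \mid M_t), \mathrm{LOP}(\mathrm{pos}_j \mid M_u))$, the Spearman's rank correlation coefficient of the two profiles. Clustering methods were applied to the LOPs to derive characteristics in motifs which determine the protein's structural and functional features. Furthermore, with these values at hand, the algorithm for predicting the topology state $t$ based on a single motif occurrence $m$ was implemented. Here, the precalculated LOPs of the corresponding motif $M$ are employed as lookup values to compute a straightforward winner-takes-all formula:

$$t^{*}(m) = \arg\max_{t \in \{\mathrm{TM},\, \mathrm{nTM},\, \mathrm{trans}\}} \sum_{i=1}^{l} \mathrm{LO}(m[i] \mid \mathrm{pos}_i \mid M_t) \quad (6)$$

where $m[i]$ is the residue at variable position $\mathrm{pos}_i$ of $m$. The assessment of topology state prediction was performed by means of cross-validation and F-measure calculation. By utilising clustering methods, differences and similarities of all LOPs can be visualised and analysed in detail.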
For dimensionality reduction and, finally, clustering of the 20-dimensional LOP data, we used the unweighted pair group method with arithmetic mean (UPGMA) [33] and the exploratory observation machine (XOM) [34]. This analysis is helpful to understand the correspondences of physicochemical properties observed in LOPs and topology states. Furthermore, this analysis reinforces the observed predictability of topology states. We chose UPGMA as the visualisation approach, since it is a widely used bottom-up clustering method that can be understood intuitively.
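To make the bottom-up character of UPGMA concrete, the following sketch clusters a handful of items from pairwise distances. It is a generic textbook formulation, not the implementation used here. The labels and distances are made up, and the nested-tuple tree representation is chosen purely for illustration.

```python
import itertools

def upgma(labels, distances):
    """Cluster `labels` bottom-up and return the tree as nested tuples.

    `distances` maps frozenset({a, b}) to the distance of leaves a and b.
    When two clusters merge, the distance to every other cluster is the
    size-weighted mean of the two old distances, which equals the
    arithmetic mean over all leaf pairs.
    """
    clusters = {label: 1 for label in labels}  # cluster -> number of leaves
    d = dict(distances)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            itertools.combinations(clusters, 2),
            key=lambda pair: d[frozenset(pair)],
        )
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Made-up distances between three hypothetical log-odd profiles.
toy = {
    frozenset(("TM_1", "TM_2")): 1.0,
    frozenset(("TM_1", "nTM_1")): 4.0,
    frozenset(("TM_2", "nTM_1")): 6.0,
}
print(upgma(["TM_1", "TM_2", "nTM_1"], toy))
```

In the toy example the two "TM" profiles are merged first, and the remaining profile joins the tree at the averaged distance of 5.0.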
The XOM algorithm is a relatively new method for dimensionality reduction. A great advantage lies in its visualisation capabilities, since it can transform neighbourhood or distance relations embedded in multidimensional data into human-intelligible spaces, such as $\mathbb{R}^2$. In the literature, this property is referred to as topology-preserving mapping. However, the degree of topology-preserving mapping achieved by the XOM depends on the given problem (mainly influenced by the structure of the data and the applied distance measure), and thus the XOM output can be insufficient for analysis. In application to LOP data, however, it has been shown to perform more than satisfactorily. Further visualisations were obtained by generating heat maps.

Results and Discussion
3.1. Identification of Topology-Discriminative Positions.
The identification of topology-discriminative positions in motifs is crucial for drawing meaningful correlations between physicochemical properties and structural and functional features. A straightforward approach to address this task is to determine the residue conservation at each variable motif position. WebLogo [31], for instance, is a widely used method to address such problems. However, WebLogo does not include any amino-acid-specific background information in deriving residue conservation, since natural amino acid frequencies are not taken into account. To circumvent this problem, we used LOPs for visualisation instead, which, as shown in (4), include natural amino-acid-specific background probabilities. Essentially, this approach is quite similar to the methods recently described in [36]. Single LOPs can be visualised as heat maps [37] (see Figure 3), so that amino-acid-specific propensities at each variable position in each motif can be extracted.

LOP Visualisation and Classification.
The LOP heat map depicted in Figure 3 shows, by way of example, the apparent amino-acid-specific propensities for the three topology states. Here, increasing amino acid propensities defined in (3) are illustrated by increasing red colour content. In comparison to the WebLogos (Figure 2), distinct amino acid propensities become obvious. For instance, glycine is observed more frequently in all LG5 motifs which are located in transmembrane regions. In nontransmembrane regions, the propensity of glycine is found to be distinctly reduced. As a second example, in LG5 motifs found in transmembrane regions, leucine is observed more frequently at the third variable position than at other positions. This sequence constellation results in two spatially adjacent leucine residues that form a bulky helix surface. In general, relations between topology states and amino-acid-specific propensities can be derived. This emphasises the predictability of topology states based on single motif occurrences. The full LOP heat map generated by this approach consists of 471 motif positions. To visualise LOP-wide correspondences, we applied UPGMA hierarchical clustering as well as the XOM algorithm. Distance measurement between LOPs was realised by utilising (5). Since 471 variable motif positions were investigated, the UPGMA-tree generated by the first approach consists of 471 nodes. To ease the analysis of the tree, the nodes were coloured according to the topological state in which the corresponding motif is located. Due to the huge number of nodes, we depicted the tree only as a schematic representation which shows the observed general tree topology and identified memberships (see Figure 4). As shown, a distinct clustering, more precisely a formation of three distinct subtrees according to the topology states, is obvious. The cluster arrangement correlates with the physicochemical properties found in membrane and nonmembrane located regions, since greater LOP distances are mainly dictated by the propensities of hydrophobic, hydrophilic, and polar amino acids. The subtree mainly consisting of motifs located in "trans" regions is arranged in between, which points to intermediate physicochemical motif compositions and equally distributed amino acid compositions. Similar to these findings, the XOM output (see Figure 5) shows three main clusters which also correspond to the topology states. Additionally, the cluster arrangement is found to be equal to the arrangement observed in the UPGMA-tree, and the causes of cluster formation are analogous as well. The distinct cluster formation observed in the output of both methods points to a good separability of the variable motif positions.

Figure 2: WebLogos [31] of the LG5 motif in the order of the three topology states "TM" (a), "nTM" (b), and "trans" (c). The symbol height in each logo reflects only the relative occurrence of the corresponding amino acid. Additionally, background amino-acid-specific frequencies are not taken into account, which decreases the sensitivity of this method. Compared to the heat map generated from LOPs (see Figure 3), less information can be gained. By applying WebLogo, residue propensities with regard to the topology states cannot be derived or identified. For instance, leucine in "TM" (WebLogo (a)) cannot be observed as being more frequent at the third variable position than at other positions.

Figure 3: LOP heat maps of the LG5 and LY6 motifs. LOP heat maps reflect the propensities of each amino acid relative to natural amino-acid-specific frequencies. Increasing amino acid propensities are illustrated by an increased red colour content. The colour scale listed below represents the colour assigned to each amino acid propensity. This visualisation allows a sensitive analysis of the amino acid propensities of each variable position of a motif according to topology states. Here, the LOP heat map is separated by topology states, so that amino acid propensities become obvious. For example, cysteine can be observed more frequently at the second variable position of the transmembrane-located LY6 motif. This results in two spatially adjacent cysteine residues which form a bulky surface in transmembrane helices. Such a bulky helix surface might be important in mediating helix-helix interactions, as knob-to-hole helix packing has been reported as a key folding process in many studies (e.g., [1,35]).
Table 2: Statistical analyses of the motifs in the protein families with domains of unknown functions (EDS1). The results are split into three subtables: "TMHMM prediction," "Prediction on log-odds," and "F-measures." The "TMHMM prediction" table represents the absolute occurrences of a motif in all investigated protein families with domains of unknown functions. The "Prediction on log-odds" table represents the topology state winners (see (6)), followed by the "F-measures" table, which indicates how well a motif can be separated and assigned to a topology state.
Motif | TMHMM prediction | Prediction on log-odds | F-measures

A possible approach to predict the topology state of a motif from the amino acid sequence alone was implemented as elucidated in Section 2.4. In this calculation, for each motif occurrence, the three log-odd sums of all variable positions are computed with respect to the three topology states. The highest log-odd sum leads to the topology state winner (see (6)). Cross-validation was performed by excluding the evaluation set of motifs from the training motif set, which was used to generate the look-up log-odd values. In the process, each topology state winner has been assessed by F-measure. The corresponding F-measures for each investigated sequence motif are listed in Tables 2, 3, and 4. It is apparent from these tables that there are motifs with high and with rather small F-measures. Each F-measure value indicates how well a motif can be separated and assigned to the respective topology state. For example, the LY6 motif, with an F-measure >0.8 in all result tables, is well assignable (by (6)) to each topology state.
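The per-state assessment can be illustrated with a short sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall), since the weighting is not stated in the text. The label lists are made up and stand for reference states (for example from TMHMM) and predicted states of held-out motif occurrences.

```python
def f_measure(reference, predicted, state):
    """Balanced F-measure (F1) of one topology state.

    `reference` and `predicted` are equally long lists of state labels.
    Using F1 rather than a weighted F-beta is an assumption of this sketch.
    """
    pairs = list(zip(reference, predicted))
    tp = sum(r == state and p == state for r, p in pairs)
    fp = sum(r != state and p == state for r, p in pairs)
    fn = sum(r == state and p != state for r, p in pairs)
    if tp == 0:
        return 0.0  # no correct prediction for this state
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for five held-out motif occurrences.
reference = ["TM", "TM", "nTM", "trans", "TM"]
predicted = ["TM", "nTM", "nTM", "trans", "TM"]
for state in ("TM", "nTM", "trans"):
    print(state, round(f_measure(reference, predicted, state), 3))
```

In a cross-validation setting, the log-odd lookup values would be computed from the training occurrences only, and the function above would be applied to the predictions for the held-out occurrences.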

Evaluation of the Prediction Accuracy.
To evaluate the prediction accuracy, our new approach has been applied to three datasets. The first dataset (EDS1) consists of the DUF-family sequence information described in Section 2.1. The second dataset (EDS2) consists of 2254 membrane protein sequences with 55 known structures of the bacteriorhodopsin-like protein (PF01036) family. EDS2 was also obtained from the Pfam database [27]. EDS1 and EDS2 include the topology-specific, statistically recorded occurrence of each motif generated from TMHMM information. These statistics are listed under the "TMHMM prediction" table heading, followed to the right by our predicted information (see (6)). The prediction quality is determined by the respective F-measures. Comparing the number of statistically determined motif occurrences with the predicted ones shows how well our approach works for most motifs. For the proteins from the DUF families and for the bacteriorhodopsin-like protein family, our approach works well for the majority of motifs. Deviations can be traced back to motifs with different functions. Furthermore, our approach has been transferred to all commonly known structures. EDS3, the third evaluation dataset, consists of all known alpha-helical membrane proteins with structures obtained from PDBTM [13]. It is important to note that results from EDS3 only include PDBTM protein information. That is, each motif found has been annotated with one of the three given topology states "H," "Side1," and "Side2," in which "H" stands for alpha-helix structure and both Side states refer to the outside or inside of the membrane. Here, "H" can be equated with "TM" because "H" includes only alpha-helical information referring to the interior of the cell membrane. Both Side states can be equated with "nTM." The "trans" state is not included at this point because less membrane information is available. This means that we have separated a motif $M$ into three motif subsets $M_{\mathrm{H}}$, $M_{\mathrm{Side1}}$, and $M_{\mathrm{Side2}}$ according to the topology state $t$. The results in Table 4 show that our approach can be applied to known structures. The topology-specific, statistically recorded motif occurrence is listed under the "PDBTM prediction" table heading, followed to the right by our predicted information.

Table 3: Statistical analyses of the motifs in the bacteriorhodopsin-like protein families (EDS2). The results are split into three subtables: "TMHMM prediction," "Prediction on log-odds," and "F-measures." The "TMHMM prediction" table represents the absolute occurrences of a motif in all investigated bacteriorhodopsin-like protein families. The "Prediction on log-odds" table represents the topology state winners (see (6)), followed by the "F-measures" table, which indicates how well a motif can be separated and assigned to a topology state.

Conclusion
In this work, 33 short sequence motifs reported in [18] were investigated in 32 polytopic membrane protein families with domains of unknown functions. Transmembrane and nontransmembrane sequence regions were predicted using the TMHMM method [38], and topology states were annotated to all detected sequence motif occurrences. Amino acid propensities were derived and employed to define log-odd profiles (LOP) of all variable sequence positions in the investigated motifs. Propensity tendencies according to the topology states were identified using UPGMA and XOM clustering. Both methods pointed to good separability and predictability of the topology state of a motif from its amino acid sequence. An information-based prediction algorithm was implemented and assessed using cross-validation and F-measure evaluation. Motifs showing high F-measures over all or only in certain investigated protein families were identified. From this insight, we postulate that short sequence motifs can be divided into two groups. The first comprises general, structure-forming elements, which are present in numerous protein families and highly specific to their topology location but are probably less important for functional properties. The second comprises motifs showing high F-measures only in certain membrane protein families, which may be important elements in establishing the individual properties necessary for the function of an entire protein family. Also, information on the spatial structure and the folding of proteins to be explored can be evaluated by affinities, because the spatial structure of proteins has been more strongly conserved in evolution than the sequential composition of the folded protein chains. These are individual motifs or characteristic sequence parts which expose a certain biochemical function of proteins. Why does nature pursue the principle of separating structure and function? Residues which support a stable domain folding are separated from those that induce a specific function. This is a very efficient evolutionary strategy. Two areas were simultaneously optimised [39]: (i) the stability of the protein backbone in a given folding pattern and (ii) the design of the amino acid sequence according to a specific function.
Based on this information, further work will address how evolution has produced motifs in their function as structural building blocks. In addition, motifs that originated through evolution and interact spatially with one another should be identified as structure-stabilising.

Figure 1: In the bacteriorhodopsin trimer (PDB-Id: 1brr), seven of the 33 sequence motifs which were analysed in this study are present. Each motif can be written in a regular expression-like form XYn, where X and Y are amino acids separated by $n - 1$ highly variable positions. For example, the LG5 motif occurrence (highlighted in red) corresponds to a pair of leucine and glycine residues which are separated by four amino acids.

Figure 4: Schematic UPGMA-tree derived from LOP clustering: the 471 LOPs of all variable motif positions were clustered using UPGMA hierarchical clustering [33] by utilising the LOP distance measure defined in (5). Due to the original size, the resulting UPGMA-tree is only depicted schematically. However, the tree shows three separated, distinct subtrees which correlate with the topology states in which the corresponding motifs are located. The cluster arrangement corresponds to amino acid propensities and thus to physicochemical properties observed in motifs. This tree shows that the topological location of short sequence motifs is well separable and, in particular, predictable from the amino acid sequence in the variable positions.

Figure 5: Output of the XOM clustering: XOM [34] is a relatively new approach for dimensionality reduction and clustering of multidimensional data. We used this approach to visualise the distance relations of the 471 investigated variable motif positions by employing the distance measure defined in (5). Here, XOM delivers a two-dimensional mapping of the distance relations of all LOPs. Coloured according to the topology state in which the corresponding motif is located, three well separable clusters can be seen. The LOP distances which contribute to the cluster formation are mainly dictated by the propensities of hydrophilic, hydrophobic, and polar residues. Thus, the XOM output reflects physicochemical correspondences, which also applies to the general cluster arrangement, with the cluster of LOPs mainly observed in "trans" topology states (which correspond basically to helix caps) located between the other two clusters. Similar to the UPGMA-tree depicted in Figure 4, the XOM output points to a good separability and predictability of topology states of short sequence motifs from their amino acid sequence in variable motif positions.

Table 4: Statistical analyses of the motifs in all known PDBTM protein structures (EDS3). The results are split into three subtables: "PDBTM prediction," "Prediction on log-odds," and "F-measures." The "PDBTM prediction" table represents the absolute occurrences of a motif in all investigated PDBTM protein structures. The "Prediction on log-odds" table represents the topology state winners (see (6)), followed by the "F-measures" table, which indicates how well a motif can be separated and assigned to a topology state.