Prediction of G Protein-Coupled Receptors with SVM-Prot Features and Random Forest

G protein-coupled receptors (GPCRs) are the largest receptor superfamily. In this paper, we try to employ physical-chemical properties, which come from SVM-Prot, to represent GPCR. Random Forest was utilized as classifier for distinguishing them from other protein sequences. MEME suite was used to detect the most significant 10 conserved motifs of human GPCRs. In the testing datasets, the average accuracy was 91.61%, and the average AUC was 0.9282. MEME discovery analysis showed that many motifs aggregated in the seven hydrophobic helices transmembrane regions adapt to the characteristic of GPCRs. All of the above indicate that our machine-learning method can successfully distinguish GPCRs from non-GPCRs.


Introduction
The G protein-coupled receptors (GPCRs) are only discovered in eukaryotes, which constitute a vast protein family and perform their various functions always through coupling with G proteins in the cell. GPCRs have many aliases such as heptahelical receptors, serpentine receptor, G protein-linked receptors (GPLR), and seven-transmembrane (7TM) domain receptors; all the GPCRs contain a single polypeptide chain that pass through the cell membrane seven times [1]. There are roughly 1000 GPCRs in human genome (accounting for about 2% coding genes); thus, they form the largest receptor superfamily [2]; they are also involved in various diseases and constituted approximately 40% of drug targets. Because Robert J. Lefkowitz and Brian K. Kobilka revealed the biochemical mechanism of GPCRs for signaling pathways, they were awarded with 2012 Nobel Prize in chemistry [3].
Many different approaches have been utilized for GPCRs classification, such as protein motif-based systems, machinelearning methods [4], and other techniques. Based on the original sequence similarity and phylogenetic studies, GPCRs superfamily can be divided into five, six, or seven classes at different periods [5,6]. According to GPCRdb (http://gpcrdb.org/) database developed by Kolakowski and updated by Horn et al. [7], which contains data, diagrams, and web tools involving collection of both GPCRs crystal structures and receptor mutants, GPCRs are classified into six main families: class A (Rhodopsin), class B1 (Secretin), class B2 (Adhesion), class C (Glutamate), class F (Frizzled), and other GPCRs. The former five classes are consistent with the Glutamate, Rhodopsin, Adhesion, Frizzled, and Secretin (GRAFS in short) classification system [8,9]. Table 1 shows the protein number and composition for every class.
Class A rhodopsin-like receptors constitute the largest (more than 80%) of the human GPCR subtypes. They mediate numerous effects of endogenous peptides including neurotransmitters, hormones, and paracrine signals. For example, biogenic amines [10] such as norepinephrine, dopamine, and serotonin commonly play their role of drugs for pathological diseases through binding to GPCRs. Although the Nterminal extracellular domain is very short, class A receptors can form dimers, in homo/heterodimerization [11]. This class also includes approximately 60 orphan receptors which have no defined ligands or functions at all [12,13].
Class B1 secretin-like receptors belong to one of hormone and neuropeptide receptor families; they consist of a large  [14]. Moreover, they are of ancient origin and can bind with various peptides such as secretin, corticotrophin releasing factor, glucagon, parathyroid hormone, calcitonin, growth hormone releasing hormone, and calcitonin gene-related peptide [15]. Class B2 adhesion-like receptors are also known as the adhesion G protein-coupled receptors (ADGRs) with ancient origin; they make the function in various tissues include synapses of the brain [16]. Most ADGRs contain various domains in the N-terminus provided for binding site of other cells [17]; these domains have over sixteen types, including cadherin-like repeats, thrombospondin-like repeats, and calnexin domain. ADGRs have the characteristic of N-terminal adhesive domains [18]. For example, ADGR subfamily G4 (ADGRG4) has the sequence characteristics of a unique highly conserved motif and some functionally important motifs similar to class A, class B1, and combined elements [19].
Class C GPCRs mainly comprise metabotropic glutamate receptors (mGluRs), one type of L-glutamate binding receptors; another type is ionotropic glutamate receptors (iGluRs) which belong to a ligand-gated ion channels not the GPCR family. Class C GPCRs contain a large N-terminal domain for ligand-binding. There exist 8 isoforms of mGluRs to form signaling molecules via second messenger systems [20], which transfer extracellular signal through the mechanism of receptor dimer packing and allosteric regulation [21]. The activation of mGluRs is an indirect metabotropic process by the aid of binding to glutamate, a major excitatory neurotransmitter in the brain. The extracellular glutamate concentration (at micromolar range) is lower than the intracellular (at millimolar range) in neuron [22]. Human mGluRs are found in pre-and postsynaptic neurons, including the hippocampus, cerebellum, and other brain regions' synapses, and in peripheral tissues. mGluRs play an important role in regulating neuronal excitability and synaptic plasticity and in serving as mental disorders drug targets [23].
Class F frizzled/smoothened receptors are involved in Wnt binding whereas the smoothened receptor (belongs to GPCRs) reconciles hedgehog signaling via the required region cysteine-rich domain (CRD) in the N-terminus [24], because smoothened protein sequence is homologous to frizzled. The two proteins have the same 7TM structure and evolutionary relationship [25]. But the secreted frizzledrelated proteins can exert its function by promoting or blocking Wnt3 / -catenin signaling in different concentration of secreted frizzled-related protein 1 and cellular context [26].
Other GPCRs include some orphan receptors except for the above classes; the characteristics of these receptors are that they have a similar structure to other identified receptors but lack endogenous ligand. They have altogether 37 proteins and 6 in human. Among them, Gpr175 (also called Tpra1) and GPR157 are well studied. Gpr175 is an orphan GPCR with positive regulation of the Hedgehog signaling pathway [27]; GPR157 couples with Gq protein and then activate IP 3mediated Ca 2+ cascade, which is also a signaling molecule involved in positive regulation of neuronal differentiation of radial glial progenitors through the GPR157-Gq-IP 3 cascade pathway [28].
GPCRs are transmembrane receptors that transduce extracellular stimuli into intracellular signals through activating intracellular heterotrimeric G protein complex, which comprise 15 G subunits, 5 G subunits, and 12 G subunits. Based on the sequence similarity and functional characteristics of G subunits, G proteins are divided into four major classes: G s, G i/o, G q/11, and G 12/13 [34]. G activation or deactivation cycle controls the signal transduction, when cell is at resting mode, GDP binds to G forming G -GDP and then joins G generating G complex, and G is inactive at this stage; when stimulate signal is introduced from GPCR, G raises a conformational change, GTP binds to G forming G -GTP and destabilizing the G complex, G are disassociated and bound by G interacting proteins, and G is active at this stage. When G fulfilled signal transduction to the downstream pathway, G hydrolyzes GTP to GDP through its intrinsic GTPase activity to form G -GDP and returns to the resting mode; this process constitutes a G protein cycle [35]. Activated G s catalyzes ATP to cAMP by adenylyl cyclase (AC) and results in the activation of protein kinase A (PKA) and phosphorylation of downstream effector. On the contrary, G i plays inhibition role of AC and suppresses cAMP production. G q/11 activates phospholipase C (PLC ) and produces inositol-1,4,5-trisphosphate (IP 3 ) and diacylglycerol (DAG) which can form PLC -IP 3 -DAG signaling pathway. G 12/13 activates Rho GTPase families through RhoGEF to regulate cytoskeleton remodeling; these G protein families take the major effect in signal transduction [3]. Therefore, GPCR-G -AC-PKA and GPCR-G -PLC-IP 3 constitute two main signal transduction cascades within the cell.
In this paper, we performed an in silico analysis on the GPCRs amino acids information and other polypeptide physicochemical features and constructed 188D feature vectors (Table 2) of the proteins into an ensemble classifier [36][37][38][39][40][41]. The first 20D of 188D represents the 20 kinds of natural amino acids composition; the other 168D includes eight physical-chemical properties each deriving from the so-call CTD mode [42], where C stands for amino acid contents for each type of hydrophobic amino acids, T stands for the frequency of bivalent peptide, and D stands for amino acid distribution from five positions of a sequence. These 188D feature vectors have been integrated into software BinMemPredict which performed well in membrane protein prediction [42]. Moreover, we also performed motif analysis by MEME Suite (http://meme-suite.org/) because a motif may directly accord with the active site of an enzyme or a domain of the protein. MEME have been not only used to predict conserved motif regions but also employed for primers design with low quality sequence similarity patterns in multiple global alignments [43].

Data Retrieval and Pretreatment.
GPCR sequences with fasta format were retrieved from the UniProt database (http:// www.uniprot.org/); we obtained initial 5027 sequences altogether. To improve analysis performance, the raw dataset was preprocessed by the protein-clustering program CD-HIT (http://cd-hit.org/) for reducing the sequence homology bias of prediction; the sequence identity threshold was set at 0.80 and other parameters as default; thus, the highly homology sequences were removed, and finally 2495 GPCR protein sequences were gained as positive dataset, and the negative examples were from all the protein sequences but removing the positive ones, and 10386 entries (non-GPCRs) were acquired as negative dataset.

Extracting the Discriminative Feature
Vector for Classifying and Testing by Random Forest Classifier. Protein features were extracted from the primary sequences according to their compositions of 20 kinds of amino acids and their eight types of physical-chemical properties; based on these characteristics, Cai et al. [44] and Zou et al. [42] had raised 188D feature vectors of SVM-Prot. The workflow was as follows: (1) All distinct positive protein samples were employed to extract their corresponding protein families for Pfam number from the "Family and Domains" section of uniprot website and excluded the same and redundant Pfam number; the unique Pfam number set for positive dataset (in fasta format) was acquired.
(2) All the protein sequences were integrated into a Pfam number file; the same Pfam sequences were combined to the same file named with Pfam number; then, the positive Pfam number files were removed; the rest of Pfam number files were extracted only in the longest sequence for each Pfam as the negative dataset (in fasta format).
(3) Because the protein sequences possessed different length, each sequence needed to transform into fixed-size vectors for classification, both the positive and negative datasets were input to the 188D SVM-Prot programme for their feature vectors, the positive samples were given the label "1" at the end of vectors, the negative samples were given the label "−1" at the end of vectors, and the positive and negative files combined into a file with the filename format ended in .arff.
(4) The above file on positive and negative vector datasets was randomly divided into five parts, respectively, among which, every four parts were served as training examples and the remaining one part as test ones, every part contained both positive and negative samples (Table 3), and fivefold crossvalidation was used.
(5) The training and test datasets were successively imported into weka data mining package (http://www.cs .waikato.ac.nz/ml/weka/), a machine-learning workbench. In weka, the training datasets were filtered with the synthetic minority oversampling technique (SMOTE) [45,46] and changed the positive samples from 100 percent into 300 percent to overcome the highly imbalanced property of positive and negative cases; after preprocessing with SMOTE technique the two-group data kept an amount equilibrium, and the vector data were classified automatically via visualization analysis [47]. Based on the optimal features with some preliminary trials, we finally chose a Random Forest (RF) [48] module and "use training set" item on test options as classifier for training dataset, while for test dataset we chose "supplied test set" item on test options to predict the samples as GPCRs or non-GPCRs: that is, the prediction module  Total number  1st  Training  1996  8309  10305  1st  Test  499  2077  2576  2nd  Training  1996  8309  10305  2nd  Test  499  2077  2576  3rd  Training  1996  8309  10305  3rd  Test  499  2077  2576  4th  Training  1996  8309  10305  4th  Test  499  2077  2576  5th  Training  1996  8308  10304  5th Test 499 2078 2577 Correlation coefficient TP (true positive) stands for the number of true GPCRs that are predicted correctly, TN (true negative) stands for the number of true non-GPCRs that are predicted correctly, FP (false positive) is the number of true non-GPCRs that are incorrectly predicted to be GPCRs, and FN (false negative) is the number of true GPCRs that are incorrectly predicted to be non-GPCRs.
using the results of the just training set to distinguish the two classes.
To measure the performance quality of the statistical classification more intuitively in the field of machine learning, we adopted 5-fold cross-validation for test dataset and calculated four common parameters [49,50]: sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthew's correlation coefficient (MCC) to adopt for evaluating the SVM-Prot features and classifier, which are formulated as Table 4.

Conserved Motif Analyses of Human GPCR Proteins.
Online MEME Suite 4.11.0 (http://meme-suite.org/) was used to analyze conserved motif analyses. MEME was a powerful, comprehensive web-based tool for mining sequence motifs in proteins, DNA, and RNA [51]. Currently, the MEME Suite has added 6 new tools since the Nucleic Acids Research Web Server Issue in 2009, and the web-based version tools reached 13. The maximum motif width, the minimal motif width, and the maximum number of motifs were set to 50, 6, and 10, respectively.

Reclassification of Positive and Negative Proteins on Five
Test Datasets. We obtained the 188D feature vectors containing positive and negative samples and divided them into training and test datasets as input to the Weka explorer, respectively, the results showed exactly classifying for all the five training datasets; therefore, the trained classifier could be utilized to verify the predication effect, and the test dataset was used to predict its class label directly. The correctly classified rates for five testing datasets were 90.64%, 90.37%, 88.04%, 93.28%, and 95.73%, respectively (mean ± SD: 91.61%±2.96%); the other indices were shown in Table 5.

Conserved Motifs Analysis for Human GPCRs.
For the purpose of disclosing the evolutionary relationship of the conserved motifs of GPCRs, we randomly selected six classes of human GPCRs and gained 66 protein sequences which were analyzed by MEME software. The multiple local alignments were performed by MEME to generate the most significant 10 conserved motifs for the sequences (Figure 1 and Table 6).

Discussion
In this study we show that the novel SVM-Prot features based binary classifier can well discriminate GPCRs from non-GPCRs; we obtain exact classification model from the five training datasets and the AUC equals 1, and on the five testing datasets we get the average correctly classified rates of 91.61% and the average AUC of 0.9282; these indicate that predicted GPCRs and true GPCRs have a good overall consistency. AUC is a plot with -axis representing false positives (equal to  1 − specificity) and -axis representing true positives (equal to sensitivity), which is based on different cutoff values of a score from a binary classifier [52,53]. AUC of 1 represents a perfect model; the more AUC is close to 1, the better prediction model we can develop, but if the value is reduced to 0.5, the model becomes no predictive ability at all. On our binary classification model we acquired high specificity and accuracy for testing datasets, but the values of sensitivity and Matthew's correlation coefficient were relatively low at about 0.7; this might be due to the problem of imbalance dataset where the size of positive was less than negative with the proportion of about 1 : 4; thus the false negative rate was relatively higher. This defect may also come from the intrinsic restriction of supervised learning algorithm, because the classification model built from training dataset can only have a good predictive effect on the test dataset having the same probability distribution as the training dataset [54]. The top ten human GPCR motifs show the feature of some motifs aggregation that appeared from the block diagram; this reflected in the structure characteristic of 7TM helices regions of GPCRs. Motifs 1,4,6,7, and 10 belonged to these 7TM domains; among them, the former 4 motifs displayed containing the region highly homologous to the class B1 secretin family, and motif 10 was a Fz domain in the membrane spanning region which is located near to the intracellular C-terminal region of GPCRs, which contained an alpha-helical Cys-rich domain (CRD) of Frizzled that was essential for Wnt binding [55,56]. Motifs 3, 8, and 9 were CRD Frizzled-1 like domains involved in Wnt signal as well [57]. Motif 5 was latrophilin/CL-1-like G protein-coupled receptor proteolysis site motif (GPS) which was first identified in a neuronal Ca 2+ -independent receptor of alpha-latrotoxin (CIRL)/latrophilin, an orphan GPCR [58]. GPS was a part of GPCR autoproteolysisinducing (GAIN) domain which held a formative feature of adhesion GPCRs, and GPS cleavage process played an important role in renal organ physiology [59]. Take the first sequence Q9BY15, for instance, there listed 3 kinds of conserved domains start from the N-terminus: calciumbinding EGF domain (not shown), GPS domain, and 7TM domain of secretin family. The latter two domains appeared with concentration on the block diagram. Support Vector Machine (SVM) is a supervised machinelearning algorithm on the basis of statistical learning theory [53,[60][61][62][63][64][65]. Due to the robustness, rapidness, and repeatability, machine-learning method is regarded as one of the best ways to efficiently classify numerous protein molecules. In two-class problems, our SVM classifier mapped the input 188D feature vectors into a higher dimensional feature space and then founded the optimal separation hyperplane [66] for GPCRs and non-GPCRs, while avoiding overfitting and underfitting problems. This approach belongs to linear classification model [67].
intracellular loops [IL1-3], and the protein termini. Therefore, GPCR can be sequentially distributed into the following regions: N-terminus-TM1-IL1-TM2-EL1-TM3-IL2-TM4-EL2-TM5-IL3-TM6-EL3-TM7-C terminus. In summary, we have successfully developed a SVM-Prot features based Random Forest for identifying GPCRs from non-GPCRs based on the protein sequence information and their physicochemical properties. Nevertheless, this prediction model needs to be further explored so as to discriminate the subfamily and sub-subfamily of GPCRs.