PROCARB: A Database of Known and Modelled Carbohydrate-Binding Protein Structures with Sequence-Based Prediction Tools

Understanding the three-dimensional structures of proteins that interact with carbohydrates covalently (glycoproteins) or noncovalently (protein-carbohydrate complexes) is essential to the study of many biological processes, because these interactions play a significant role in normal and disease-associated functions. It is important to have a central repository of knowledge about these protein-carbohydrate complexes, together with preprocessed data on predicted structures. Such a repository can be significantly enhanced by de novo tools that predict carbohydrate-binding sites for proteins whose structure or binding site has not been determined experimentally. PROCARB is an open-access database comprising three independently working components, namely, (i) Core PROCARB module, consisting of three-dimensional structures of protein-carbohydrate complexes taken from the Protein Data Bank (PDB), (ii) Homology Models module, consisting of manually developed three-dimensional models of N-linked and O-linked glycoproteins of unknown three-dimensional structure, and (iii) CBS-Pred prediction module, consisting of web servers to predict carbohydrate-binding sites using a single sequence or a server-generated PSSM. Several precomputed structural and functional properties of the complexes are also included in the database for quick analysis. In particular, information about function, secondary structure, solvent accessibility, hydrogen bonds, literature reference, and so forth, is included. In addition, each protein in the database is mapped to Uniprot, Pfam, PDB, and so forth.


Introduction
Carbohydrates play a key role in a variety of important biological recognition processes like infection, immune response, cell differentiation, and neuronal development. All of these biological phenomena may be regulated by the interaction of these carbohydrates with proteins [1][2][3][4]. One area of therapeutic significance in protein-carbohydrate interactions has relied on the role of carbohydrates as cell surface receptors enabling adherence of bacteria, parasites, and viruses by a process known as bioadhesion [5][6][7][8][9][10]. Bacteria are often competent enough to efficiently adhere to the surface membranes of the host cells via lectin binding, thus enabling subsequent colonization and progression of the disease [11]. Irregular structure and levels of certain tumor cell surface sugars may also present opportunities for therapeutic intervention [12]. On the other hand, the ubiquitous application of carbohydrates in nature potentially poses severe specificity issues. Understanding the molecular basis of carbohydrate recognition might offer the essential basis to rationally plan biologically active saccharide analogues [13].
In spite of their numerous important biological roles, there is no appropriate database dedicated to these protein-carbohydrate complexes. Although the Protein Data Bank (PDB) [14] stores all the experimentally determined protein-carbohydrate complexes, it is not easy to identify a protein-carbohydrate complex in the PDB. The GLYCO-SCIENCES.de web resource [15] provides numerous tools and databases which aid in searching the PDB for various carbohydrates. Moreover, the available databases dedicated to protein-carbohydrate complexes, such as the Lectines [16] and Glycoconjugate [17] databanks, do not have detailed information on the functionally important carbohydrate-binding residues and proteins. Hence, there is a need for a single resource where all the relevant information about a pair of interacting protein and carbohydrate is available. PROCARB (Figure 1) has therefore been developed to provide not only a single source of annotated complexes but also a number of precomputed features of these carbohydrate-binding proteins, such as solvent accessibility, secondary structure, and hydrogen bonding information. The role of the carbohydrates in each complex is also provided in the database wherever possible. The core module consists of 604 protein-carbohydrate complexes, each containing at least one carbohydrate molecule. The total number of carbohydrate molecules is thus 4240, and these are bound to 5360 residues in proteins.
The structure-based approach to drug design has become a standard protocol in the pharmaceutical industry, where large databases of potential small drug candidates may be docked into an active site of a particular target molecule [18]. Structures of many glycoproteins of interest have not been solved yet but can be modeled because suitable templates of matching structures are available. Therefore, we have also attempted to generate the three-dimensional structures of different types of glycoproteins (both N- and O-linked) with unknown structures by using homology modelling. This module of PROCARB consists of 26 N-linked and 20 O-linked modelled structures.
Finally, functional annotation of proteins for which only an amino acid sequence is available requires predicting potential carbohydrate-binding sites, which experimentalists can then verify. Based on our previous work in this direction [19], we developed a web server that takes a user-provided amino acid sequence and predicts carbohydrate-binding sites. Given the difficulty of sequence-based prediction, the success rate is modest, but the predictions nonetheless provide useful clues for experiments.

Database Description
The overall organization of the database is illustrated in Figures 2(a) and 2(b). As shown in the figures and stated above, PROCARB is composed of three modules, which work largely independently. These modules are described in the following sections.

PROCARB Core Module.
The PROCARB core module was developed by systematically locating protein-carbohydrate complexes in the Protein Data Bank (PDB) and manually verifying the existence and identity of each carbohydrate ligand. A residue is considered carbohydrate binding if any of its atoms lies within a 3.5 Å cutoff distance of any atom of the sugar in the protein-carbohydrate complex [19]. Various structural and contact properties, such as secondary structure, hydrogen bonds, van der Waals contacts, and solvent accessibility, are computed for all entries and stored in this core module of the database. In addition, a Jmol [20] visualisation is provided with preloaded scripts that help identify the location and nature of carbohydrate-binding sites. All structures found by keyword search were validated manually for the presence of carbohydrate ligands. Specifically, at the time of the last update, a keyword search of the PDB returned 914 hits, of which only 604 proteins were found to have a carbohydrate attached, which makes manual annotation of these ligands important. The compiled data sets are also available for free download, both as raw PDB files and as a subset of representative structures selected at 25% sequence similarity. For each complex, the carbohydrate details were retrieved from PDBsum [21]. To confirm that one of the bound ligands is a carbohydrate, all ligands were manually checked either in the PDBeChem [22] database, which classifies sugars as saccharides, or against the literature reference.
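The 3.5 Å contact criterion described above can be illustrated with a minimal Python sketch. It is not the in-house program used to build PROCARB: the `Atom` record, the `binding_residues` function, and the choice of an inclusive comparison (distance less than or equal to the cutoff) are assumptions made for illustration, and the coordinates are toy values rather than data from any PDB entry.

```python
import math
from typing import Iterable, List, NamedTuple, Tuple


class Atom(NamedTuple):
    """Hypothetical minimal atom record (not a PROCARB data structure)."""
    chain: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float


def binding_residues(
    protein_atoms: Iterable[Atom],
    sugar_atoms: Iterable[Atom],
    cutoff: float = 3.5,
) -> List[Tuple[str, int, str]]:
    """Return sorted (chain, res_seq, res_name) for residues that have at
    least one atom within `cutoff` angstroms of any sugar atom."""
    sugar = [(s.x, s.y, s.z) for s in sugar_atoms]
    hits = set()
    for p in protein_atoms:
        # Assumption: the cutoff is inclusive (distance <= cutoff).
        if any(math.dist((p.x, p.y, p.z), s) <= cutoff for s in sugar):
            hits.add((p.chain, p.res_seq, p.res_name))
    return sorted(hits)


# Toy coordinates, made up for illustration only.
protein = [
    Atom("A", "ASN", 10, 0.0, 0.0, 0.0),
    Atom("A", "ASN", 10, 1.5, 0.0, 0.0),
    Atom("A", "GLY", 11, 10.0, 0.0, 0.0),
]
sugar = [Atom("A", "NAG", 501, 3.0, 0.0, 0.0)]

print(binding_residues(protein, sugar))
```

The brute-force all-pairs comparison is adequate for a single complex; a spatial index would be a reasonable substitute for larger batches.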
FASTA-formatted sequences and 3D coordinates for both the raw and nonredundant data sets are also stored in the database. These data sets are scheduled to be regularly updated as new entries become available from the PDB. For quick analysis, a set of four residue-wise structural features is included, namely, contact with carbohydrate, secondary structure, solvent accessibility, and hydrogen bonds. The latter three are computed using the standard software DSSP [23], ASAView [24], and HBPlus [25], respectively.
Information on each complex is stored in a MySQL database in which the central protein table contains information on the protein, its bound ligand, its function, and the literature reference. The web interface uses PHP and JavaScript and allows searches by a variety of text-based options such as PDB code, ligand name, protein name, and source organism. Data entries are displayed on dynamically generated pages that present the relevant information, including protein name, source, ligands, Pfam [26] description, Uniprot [27] ID, and so forth. Information about gene name, SCOP [28] classification, function of the protein, mutation (if any), and attached ligands or metal ions is also provided. Information about all these proteins was extracted from various biological databases such as PDB [14], Swissprot [29], and Pfam [26], and each entry is directly hyperlinked to its corresponding entry in these databases. Precomputed structure information such as secondary structure, solvent accessibility, hydrogen bonds, and residue-carbohydrate contacts at a 3.5 Å distance cutoff (computed with an in-house Perl program) is also provided for further analysis. To help us keep the database up to date, users are encouraged to add protein-sugar complexes to the database through an online submission system. User submissions will be reviewed and added to the database after manual inspection and calculation of related properties.
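To make the text-based search described above concrete, the following Python sketch builds a small in-memory table and queries it by one of the four searchable fields. It is only an illustration under stated assumptions: the actual PROCARB schema is not given in this paper, so the table layout, the column names, and the `search` helper are hypothetical, SQLite stands in for MySQL, and the two rows are placeholders rather than real database entries.

```python
import sqlite3

# Fields the web interface is described as searching; column names are assumed.
SEARCHABLE = ("pdb_code", "ligand_name", "protein_name", "source_organism")


def create_protein_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical central protein table."""
    conn.execute(
        """
        CREATE TABLE protein (
            pdb_code         TEXT PRIMARY KEY,
            protein_name     TEXT,
            source_organism  TEXT,
            ligand_name      TEXT,
            protein_function TEXT,
            reference        TEXT
        )
        """
    )


def search(conn: sqlite3.Connection, field: str, term: str):
    """Substring search on one whitelisted field; the term is bound as a
    parameter so that user input is never spliced into the SQL text."""
    if field not in SEARCHABLE:
        raise ValueError(f"unsupported search field: {field}")
    sql = (
        "SELECT pdb_code, protein_name, ligand_name "
        f"FROM protein WHERE {field} LIKE ? ORDER BY pdb_code"
    )
    return conn.execute(sql, (f"%{term}%",)).fetchall()


conn = sqlite3.connect(":memory:")
create_protein_table(conn)
# Placeholder rows, invented purely to exercise the query.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("demo1", "placeholder lectin", "placeholder organism",
         "MAN", "sugar binding", "ref A"),
        ("demo2", "placeholder hydrolase", "placeholder organism",
         "NAG", "hydrolysis", "ref B"),
    ],
)
print(search(conn, "ligand_name", "MAN"))
```

Restricting the field name to a fixed whitelist is one simple way to let a user choose the search column without exposing the query to injection.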

Homology Models Module.
In this module, we have attempted to generate the three-dimensional structures of a large number of glycoproteins (both N- and O-linked) of hitherto unknown structure, using automated web-based homology modeling. As a case study, a detailed model-based 3D structure of Hev b 4, a latex allergen N-glycoprotein, has also been completed and is described in our earlier work [30].
To select proteins for modeling, a Swissprot [29] search was performed for N-linked glycoproteins using the keyword "N-linked". O-linked glycoprotein sequences were collected from the O-glycbase [31] database. To have at least one model for each protein family, the sequence data were grouped into families at 30% sequence identity, and one member from each family was selected for modeling. In all cases, at least one glycosylation site was identified and annotated in Swissprot [29]. The resulting data set has two groups, one corresponding to O-linked and the other to N-linked glycoproteins.
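The family grouping step can be sketched as follows. The text does not state which clustering procedure was used, so the greedy first-representative strategy below is an assumption, as are the function names `percent_identity` and `group_into_families`. The identity measure is a toy one that expects sequences already aligned to equal length; a real workflow would obtain identities from an alignment program.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity for two pre-aligned, equal-length sequences ('-' = gap).
    Gapped columns are ignored."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def group_into_families(
    seqs: Dict[str, str],
    threshold: float = 30.0,
    identity: Callable[[str, str], float] = percent_identity,
) -> Dict[str, List[str]]:
    """Greedy grouping: each sequence joins the first family whose
    representative it matches at >= threshold percent identity;
    otherwise it founds a new family and becomes its representative."""
    families: Dict[str, List[str]] = {}
    for name, seq in seqs.items():
        for rep in families:
            if identity(seqs[rep], seq) >= threshold:
                families[rep].append(name)
                break
        else:
            families[name] = [name]
    return families


# Made-up sequences for illustration only.
demo = {
    "P1": "ACDEFGHIKL",
    "P2": "ACDEFGHIKV",
    "P3": "WWWWWWWWWW",
}
families = group_into_families(demo)
print(families)        # keys are the representatives selected for modeling
```

With a greedy scheme the outcome depends on input order, so sorting the sequences beforehand (for example by length or annotation quality) determines which member represents each family.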
Selected glycoprotein sequences, each having at least one experimentally verified glycosylation site, were used as input for the web server 3D-JIGSAW [32]. This server builds three-dimensional models of proteins based on homologues of known 3D structure. The automated mode of the 3D-JIGSAW [32] web server produced 50 homology-based models of N-linked glycoproteins from 73 N-glycoprotein sequences and 104 structure models of O-linked glycoproteins from an initial 173 O-glycoprotein sequences. After careful examination of each model, it was noted that only 26 N-linked and 20 O-linked models included at least one experimentally verified glycosylation site. In the other models, the 3D-JIGSAW [32] server was not able to model the experimentally determined glycosylation site because no suitable template was available, so they were not included in the web resource. The retained models were optimized by CHARMm all-atom force field minimization. Energy was minimized to a gradient of 1.0 kcal/mol using the conjugate gradient protocol available in Discovery Studio version 2.0 (Accelrys Software Inc.) [33] to remove any steric clashes and stabilize the models. The initial potential energy and the potential, van der Waals, and electrostatic energies of the N- and O-glycoprotein models after energy minimization are listed in Tables 1 and 2. Additionally, Ramachandran analysis was performed on all 46 models using the SAVES [34] web server to guide subsequent optimization (Tables 3 and 4). Graphics highlighting the experimentally determined glycosylation sites were generated for the modeled structures using VMD [35]; these form part of the database and can also be visualized in Jmol [20].
Although these models were produced by automated web-based homology modeling, most of them are within the acceptable ranges of Ramachandran score (Tables 3 and 4), which may provide some initial encouragement to use them for understanding structure-function relations by designing mutagenesis and drug design experiments. Protein structure models can be of enormous help in functional genomics, where they can provide structural insights into protein function [36]. 3D models have already been employed to identify the enzymatic activities [37] and ligand-binding [38] functions of proteins. Additionally, it is well known that homology modeling requires a high-quality sequence alignment between the target and the template proteins; therefore, human intervention may be a possible solution for models with low scores. In spite of various limitations, homology modelling will remain an essential tool for predicting the 3D structures of proteins, because the number of protein sequences will keep increasing and it is impracticable to resolve the 3D structure of each sequence experimentally [39].

CBS-PRED Module.
Many proteins which interact with carbohydrates (either covalently or noncovalently) are known without knowledge of the residues that participate in these interactions. Only a few computational methods that predict covalently attached glycosylation sites in proteins have been described to date [40,41]. Similarly, only three methods have been reported for the prediction of carbohydrate-binding sites in proteins based on the 3D structure of the complex [42][43][44]. In view of this, we earlier developed algorithms to identify carbohydrate-binding residues from single sequences or their evolutionary profiles [19]. CBS-Pred is an implementation of these algorithms in PROCARB. This module is made up of two submodules, namely, CBS-SS and CBS-PSSM, which use a single sequence or an alignment profile, respectively, in the backend to make a residue-wise prediction. Although PSSM-based predictions are more accurate, the single-sequence module is provided as a high-speed alternative because generating a PSSM is time consuming. The exact performance scores of these submodules are likely to change, because the neural network parameters used for prediction are updated with every update of the training data sets. Therefore, prediction performance scores are returned with the server output and can be used to estimate the degree of false predictions.
We also tested CBS-Pred using the area under the ROC curve (AUC) [...] ligand in a PDB coordinate file is a carbohydrate or not. Carbohydrate Finder identifies diverse types of carbohydrates in a given protein-carbohydrate complex. Currently, it can recognize 100 different types of carbohydrates.

Contact Calculator.
Contact Calculator calculates the contacting pairs in a given protein-carbohydrate complex at different cutoff distances and can also recognize 100 different types of carbohydrates that may be in contact with the amino acid residues (Table 6).

Conclusions
A database of protein-carbohydrate complexes and models of glycoproteins of unknown structure was developed, and an associated sequence-based prediction module was compiled. We expect that PROCARB will facilitate functional annotation, the design of site-directed mutagenesis experiments, and the modeling of protein-carbohydrate interactions, which in turn will help experimental and bioinformatics research on understanding these interactions.