CMD: A Database to Store the Bonding States of Cysteine Motifs with Secondary Structures

Computational approaches to predicting the disulphide bonding state and its connectivity pattern are based on various descriptors. One descriptor is the amino acid sequence motif flanking the cysteine residue. Despite the existence of disulphide bonding information in many databases and applications, no complete reference with motif querying is currently available. The cysteine motif database (CMD) is the first online resource that stores all cysteine residues, their flanking motifs with their secondary structure, and propensity value assignments derived from laboratory data. We extracted more than 3 million cysteine motifs from PDB and UniProt data, annotated with secondary structure assignment, propensity value assignment, frequency of occurrence, and coefficiency of their bonding status. Removal of redundancies generated 15875 unique flanking motifs that are always bonded and 41577 unique patterns that are always nonbonded. Queries can be made by protein ID, FASTA sequence, sequence motif, or secondary structure, either individually or in batch format; the provided APIs allow remote users to query the database from third-party software and to run high-throughput queries. CMD offers extensive information about bonded and free cysteine residues and their motifs, allowing in-depth characterization of the sequence motif composition.


Background
Disulphide bonds are formed by oxidation of two cysteine residues in a protein and are significant to a protein's conformational stability: they confer greater thermal and chemical stability and stabilize structural intermediates to ensure the correct folding pathway. However, the connectivity of the disulphide bonds in protein sequences can only be determined experimentally. Given this difficulty, the ability to evaluate or predict the disulphide bonding state and connectivity from the sequence would be highly valuable in engineering proteins for biotechnological and medical applications. Computational approaches to disulphide connectivity prediction have been based on various descriptors. One of these descriptors is the sequence motif generated by combining the flanking residues on either side of the cysteine residue [1,2]. These immediate flanking residues have been shown to influence the cysteine's redox potential and steric accessibility [3]. These sequence motifs have been fed into various prediction methods [4], such as machine learning approaches (e.g., statistical methods, neural networks (NNs) [5], and support vector machines (SVMs) [6][7][8]), in tools such as DiaNNA [3], DISULFIND [9], DCON [10], and CysView [11]. Currently, cysteine motifs are extracted by parsing data from protein databases and feeding them into the prediction tools. Motivated by the absence of a dedicated database and by the usefulness of cysteine flanking motifs in predicting the cysteine bonding state and connectivity, we have developed the cysteine motif database (CMD) as a tool to mine and store these motifs. CMD supports motif extraction and facilitates the study of the motifs' secondary structures and their bonding and connectivity propensities. In this paper, we present CMD as a publicly available tool that complements existing prediction tools.

Construction and Content
Content.
The CMD data were compiled from the Protein Data Bank (PDB) (http://www.rcsb.org) and UniProt (http://www.uniprot.org). For each databank, two datasets were created: a complete protein dataset and a second dataset of unique sequences, from which sequences with 100% similarity were omitted. CMD provides both datasets for PDB and for UniProt, allowing researchers to use the database in its entirety (73656 structures for PDB and 531462 structures for UniProt) or to include only unique sequences (33874 for PDB and 140723 for UniProt). Using these datasets, we extracted 878,000 cysteine motifs based on the 1st, 2nd, 3rd, 4th, and 5th flanking residues of the cysteine, as these immediate residues are close enough to exert influence on the cysteine (Table 1). The bonding state of each cysteine residue and its bonding partner are assigned from the SSBOND and DISULPHIDE BOND tags in the PDB and UniProt files, respectively. The motifs were clustered according to their observed bonding states, that is, always bonded, always nonbonded, and both bonded and nonbonded (nonbonded state with another cysteine or to other atoms such as metals). Each bonded cysteine is also mapped to its inter- or intrachain disulphide bond partner.
The motifs were categorized as inter- or intradomain, and the secondary structure assignment for each motif sequence (if available) was determined using secondary structure reference files retrieved from PDB.
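The clustering by bonding state described above can be sketched as follows. This is a minimal illustration rather than the CMD implementation: the function name, the record layout, and the toy motifs are hypothetical, and the bonded fraction computed here is only an assumed stand-in for the frequency statistics stored in CMD.

```python
from collections import defaultdict


def cluster_by_bonding_state(records):
    """Group motifs into always bonded, always nonbonded, or both.

    `records` is an iterable of (motif, is_bonded) pairs, one per
    observed cysteine occurrence (hypothetical layout).
    """
    # motif -> [bonded occurrences, nonbonded occurrences]
    counts = defaultdict(lambda: [0, 0])
    for motif, is_bonded in records:
        counts[motif][0 if is_bonded else 1] += 1

    clusters = {"always_bonded": {}, "always_nonbonded": {}, "both": {}}
    for motif, (bonded, nonbonded) in counts.items():
        if nonbonded == 0:
            key = "always_bonded"
        elif bonded == 0:
            key = "always_nonbonded"
        else:
            key = "both"
        total = bonded + nonbonded
        # Store the occurrence count and the fraction of bonded occurrences.
        clusters[key][motif] = (total, bonded / total)
    return clusters


# Toy records: motifs and labels are illustrative, not CMD data.
records = [
    ("ACA", True), ("ACA", True),
    ("GCG", False),
    ("SCT", True), ("SCT", False),
]
clusters = cluster_by_bonding_state(records)
```

With these toy records, one motif falls into each of the three clusters, and each entry carries its occurrence count together with the fraction of occurrences observed in the bonded state.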

Construction.
The data contained in CMD are stored in a Microsoft SQL Server 2005 database. The cysteine motif pattern tables are indexed on protein ID, motif, chain number, and secondary structure to improve query performance. Table-based partitioning was used to increase the flexibility and performance of the motif data tables. These tables store over three million motifs, which can be queried and processed. All preprocessing, data extraction, and insertion of motif sequences and their secondary structures were carried out on the .NET 4.0 platform using the C# programming language. The web interface of CMD is based on ASP.NET integrated with Ajax technology to provide a robust, simple, and user-friendly environment for end users. The web application is hosted on an Internet Information Services (IIS) HTTP server, version 7.5.7600.16385. CMD will be updated automatically with the latest data from PDB and UniProt.
In addition, several APIs available in CMD enable developers to query our database remotely and embed the results in their own system independently. A complete list of available APIs together with the method of inline implementation is available in the FAQ section of the CMD website.
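The indexing scheme described in this section can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, and the table name, column names, and rows are hypothetical rather than the actual CMD schema; table partitioning is not shown.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE motif (
    protein_id          TEXT    NOT NULL,
    chain_id            TEXT    NOT NULL,
    position            INTEGER NOT NULL,  -- cysteine position in the chain
    motif               TEXT    NOT NULL,  -- cysteine-centred window
    secondary_structure TEXT,              -- NULL when not available
    is_bonded           INTEGER NOT NULL   -- 1 = disulphide bonded
);
-- Indexes on the fields the text says are used for querying.
CREATE INDEX idx_motif_protein ON motif (protein_id, chain_id);
CREATE INDEX idx_motif_sequence ON motif (motif);
CREATE INDEX idx_motif_ss ON motif (secondary_structure);
""")

# Toy rows; identifiers and values are invented for the example.
rows = [
    ("TOY1", "A", 10, "AACAA", "HHHHH", 1),
    ("TOY2", "B", 5, "AACAA", None, 0),
    ("TOY2", "B", 9, "GGCGG", "EEEEE", 0),
]
conn.executemany("INSERT INTO motif VALUES (?, ?, ?, ?, ?, ?)", rows)


def find_by_motif(connection, motif):
    """Return (protein_id, chain_id, position, is_bonded) rows for a motif."""
    cursor = connection.execute(
        "SELECT protein_id, chain_id, position, is_bonded "
        "FROM motif WHERE motif = ? "
        "ORDER BY protein_id, chain_id, position",
        (motif,),
    )
    return cursor.fetchall()


hits = find_by_motif(conn, "AACAA")
```

An exact-match lookup on the motif column, as in `find_by_motif`, is the kind of query that an index on that column is meant to serve.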

Data Update
Using the RCSB and UniProt APIs, the software retrieves all protein IDs available in these resources. A query lists all protein IDs already present in our local dataset, and new protein IDs are identified by comparing the two lists. Using the RCSB and UniProt FTP services, the files for all newly identified protein IDs are downloaded to our local server. As in our preprocessing and dataset preparation, all SEQRES and SSBOND tags are extracted from the downloaded files. All cysteine motifs based on the 1st, 2nd, 3rd, 4th, and 5th flanking residues on each side (neighboring residues) are captured and extracted as data records with the cysteine in the middle. Each record contains the motif sequence, chain ID, position of the cysteine residue in the sequence, bonding status of the cysteine residue, and the protein ID as the reference. Each record is inserted into our database. A log is generated to record successful completion or any run-time error. (Figures 1, 2, and 3).
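The update steps above can be sketched in Python. This is a simplified illustration under stated assumptions, not the C# code used for CMD: the function names and record fields are hypothetical, records are split on whitespace (a production parser would use the fixed column layout of the PDB format), SSBOND residue numbers are assumed to equal 1-based SEQRES positions, and windows that would run past a chain terminus are skipped.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def find_new_ids(remote_ids, local_ids):
    """IDs present in the remote resource but not in the local dataset."""
    return sorted(set(remote_ids) - set(local_ids))


def parse_pdb_records(lines):
    """Return ({chain: sequence}, {(chain, position)}) from SEQRES/SSBOND lines."""
    residues, bonded = {}, set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "SEQRES":
            # SEQRES, serial, chain, chain length, residue names...
            residues.setdefault(fields[2], []).extend(
                THREE_TO_ONE.get(name, "X") for name in fields[4:]
            )
        elif fields[0] == "SSBOND":
            # SSBOND, serial, CYS, chain1, pos1, CYS, chain2, pos2, ...
            bonded.add((fields[3], int(fields[4])))
            bonded.add((fields[6], int(fields[7])))
    sequences = {chain: "".join(seq) for chain, seq in residues.items()}
    return sequences, bonded


def extract_motifs(protein_id, sequences, bonded, max_flank=5):
    """Build one record per cysteine and flank size (1..max_flank)."""
    records = []
    for chain, seq in sequences.items():
        for i, residue in enumerate(seq):
            if residue != "C":
                continue
            for k in range(1, max_flank + 1):
                if i - k < 0 or i + k >= len(seq):
                    break  # assumption: skip windows that leave the chain
                records.append({
                    "protein_id": protein_id,
                    "chain_id": chain,
                    "position": i + 1,  # 1-based position of the cysteine
                    "flank": k,
                    "motif": seq[i - k:i + k + 1],
                    "bonded": (chain, i + 1) in bonded,
                })
    return records


# Toy input lines written for this example, not a real PDB entry.
toy_lines = [
    "SEQRES   1 A   13  GLY ILE VAL GLU GLN CYS CYS THR SER ILE CYS SER LEU",
    "SSBOND   1 CYS A    6    CYS A   11",
]
sequences, bonded = parse_pdb_records(toy_lines)
records = extract_motifs("TOY1", sequences, bonded)
```

Each resulting record carries the fields listed in the text (motif sequence, chain ID, cysteine position, bonding status, and protein ID), plus the flank size that produced the window.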

Utility: Example Applications
CMD facilitates studies focused on the prediction and analysis of cysteine disulphide bonding status by processing the data. Here we present two applications of our system that illustrate the potential of CMD in greater detail.

Application 1: Statistical Analysis of Bonding State.
To analyze the predictive power of CMD, we investigated the cysteine bonding pattern of human protein disulphide isomerase (PDI, P07237 [UniParc]). PDI catalyses the formation (oxidation) and rearrangement (isomerisation) of disulphide bonds during the folding of secretory and membrane-bound proteins (for review see [12]), thus stabilising the native structure of these proteins. PDI contains two domains with high sequence homology to thioredoxin. One of these thioredoxin motifs is found at positions 52-55, while the second motif is located at positions 396-399. The active site cysteine residues in the thioredoxin motifs are essential for the oxidase/isomerase activity of PDI. In each motif, the two cysteine residues within the sequence -WCGHC- can potentially form a disulphide bond.
To investigate whether both thioredoxin motifs have similar disulphide bond propensities, that is, whether both thioredoxin motifs are in the same bonded form, we analysed the disulphide bonding pattern with CMD (Figure 4 and Table 2). Our analysis predicted that the first thioredoxin motif around residues 52-55 indeed forms an intradomain disulphide bond; the second cysteine residue in the sequence CGHCKAL has a very high propensity of forming a disulphide bond with the first cysteine residue. However, the second thioredoxin motif is not predicted to be disulphide bonded, since the second cysteine residue in the sequence CGHCKQL has zero propensity of forming a disulphide bond with the first cysteine residue in this motif. We therefore predict that the two thioredoxin motifs in PDI are in different bonding states: while the first -WCGHC- motif is in the oxidized and thus disulphide bonded form, the second thioredoxin motif is in the reduced form. From this analysis we conclude that the two thioredoxin motifs in PDI have different reduction potentials. This result is in excellent agreement with the findings of Chambers and coworkers [13], who showed that the two thioredoxin motifs react differently to Ero1α, the in vivo oxidant of PDI.

Application 2: Protein Identification and Motif Exploration.
The catalytic function of some enzymes depends on the oxidation state of their cysteine residues. The oxidation of cysteine residues and the formation of disulphide bonds take place in an oxidizing environment. In prokaryotes, disulphide bonds are mainly formed in the periplasmic space outside the membrane. In contrast, the formation of disulphide bonds in eukaryotes takes place in the endoplasmic reticulum (ER). As a result, proteins with stable disulphide bonds rarely reside in the cytoplasm. Applied on a larger scale, this knowledge suggests that the local and global profile of each protein's environment, its folding localization, and its classification could contribute to disulphide bonding prediction.
CMD offers the user a unique ability to identify and mine all known proteins using a specific motif sequence, and to explore their classification, motif sequences, structure, and bonding status. During the creation of the datasets, we discovered 15875 unique motifs that are always bonded (EATLRCWALGF with the highest occurrence) and 41577 unique patterns that are always nonbonded (ALSVPCSDSKA with the highest occurrence) for the five flanking residues that can be utilized for cysteine state prediction. The number of these unique motifs is considerably higher than the number of motifs previously used in cysteine bond prediction [3,14] and is not limited to specific genomes [15].

Data Availability.
The CMD databases are accessible through a web portal at http://birg4.fbb.utm.my/cmd. The entire database with annotations is available for download in SQL format, describing the relations between classes and fragments. As an additional service for programmers and third-party developers, all queries available in CMD are freely accessible through web services and web application programming interfaces (APIs). For automated high-throughput querying, all information contained in the CMD database can also be downloaded using FTP services.

Discussion
By combining data on bonded and free cysteine motifs, CMD aims to fill a gap in the available query resources and to allow in-depth characterization of motif composition, propensity, and its role in determining the bonding state. Although bonding information regarding cysteine residues in proteins is available in many databases, and several applications focus on predicting disulphide bridge formation, no complete reference with a proper form of representation and analysis is currently available. The database is automatically updated from PDB and UniProt and currently contains 878,000 cysteine motifs, with more than 77,000 unique cysteine motifs and cysteine pairing motifs. The compilation of these cysteine motifs together with their secondary structures and propensity value assignments, and the ability to query using protein IDs and motif sequences, is a novel and significant feature compared with prior prediction works, which use considerably smaller datasets [3]. In addition to the motif query tool, CMD offers several novel features, such as the inclusion of UniProt data, the distinction between inter- and intrachain disulphide bonds and between inter- and intradomain bonds, and application programming interfaces (APIs) for interfacing with other bioinformatics tools.

Conclusion
CMD is useful for analyzing cysteine/disulphide bond formation and motif sequence composition because it provides (1) a query tool for cysteine motifs based upon a comprehensive cysteine motif database curated from PDB and UniProt, (2) secondary structure and propensity value assignments for each motif sequence, and (3) datasets with detailed information on the motifs, such as occurrence frequency and amino acid propensity values. We believe that CMD will be most useful as a query tool that complements other protein 3D structural databases and similar motif-based prediction tools.