The advent of high-throughput technologies in the past decade has provided an unprecedented opportunity to identify disease biomarker candidates.
Current strategies for biomarker discovery tend to focus on one of two approaches: data-driven [
Chronic obstructive pulmonary disease (COPD), one of the leading causes of death worldwide, is a prevalent condition characterized by progressive and not fully reversible airflow limitation.
Accumulating evidence suggests that elevated adenosine levels in the lung are associated with chronic lung diseases in both humans and animal models.
In this paper, we describe a semiautomated framework, identification of signatures from integrated clustering (ISIC), for merging data-driven and knowledge-driven approaches into a biomarker selection scheme in an iterative manner, with a defined metric provided for performance evaluation. To demonstrate ISIC, we applied it to proteomics data sets of bronchoalveolar lavage fluid (BALF) and plasma from a mouse model of COPD, the Ada-deficient mouse model.
Data from the Ada-deficient mice were used for the initial biomarker identification.
The peptides were identified and quantified using a collection of in-house developed tools that are freely available at
The quantified proteins in BALF and plasma were compared quantitatively and qualitatively between the time-matched Ada +/− and Ada
All clustering analyses using the mouse data were performed on the complete data set of significantly altered proteins. Missing values were imputed at the protein abundance level using a regularized expectation-maximization algorithm.
A biological function-centric approach was used to determine the functionally enriched biological processes in the BALF and plasma samples of mice.
An expert knowledge-driven disease selection was subsequently implemented on the enriched GO terms selected above. Specifically, a subset of the enriched terms was further selected based on expert knowledge on the Ada-deficient model, the
A Bayesian integration approach was applied to the clusters to derive the optimal probability models for the data sets.
For each cluster, the optimal algorithm was the one that gave the best CA among the four probability models. The posterior probabilities were then integrated through a Bayesian approach across different combinations of the subsets, each using the optimal algorithm determined for it.
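The integration step can be illustrated with a minimal sketch. The source does not give the exact formulation, so this example assumes that clusters are conditionally independent given the class (a naive Bayes style combination), that the class prior is 0.5, and that CA denotes classification accuracy. All function names are hypothetical.

```python
from math import prod


def best_model_per_cluster(ca_by_model):
    """Return the name of the probability model with the highest CA."""
    return max(ca_by_model, key=ca_by_model.get)


def integrate_posteriors(posteriors, prior=0.5):
    """Combine per-cluster posteriors P(disease | cluster i) for one sample.

    Assumption (not from the source): clusters are conditionally
    independent given the class, so the evidence multiplies.
    """
    k = len(posteriors)
    pos = prod(posteriors) / prior ** (k - 1)
    neg = prod(1.0 - p for p in posteriors) / (1.0 - prior) ** (k - 1)
    return pos / (pos + neg)


def classification_accuracy(integrated, labels, threshold=0.5):
    """Fraction of samples whose thresholded posterior matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(integrated, labels))
    return hits / len(labels)
```

Under these assumptions, two clusters that each assign a posterior of 0.8 to the diseased class reinforce one another and yield an integrated posterior above 0.8, whereas a single cluster is returned unchanged.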
Biomarker candidate selection was conducted separately for the clustering approach and for the expert-driven functional selection. In the clustering approach, we selected the biomarker candidates as the set of protein clusters that gave the best integrated CAs. In the expert-driven functional selection, the biomarker candidates were selected as the few most differentially altered proteins belonging to the functional clusters providing the best integrated CAs.
Validation was performed on a proteomics data set of human plasma at both the cluster (six cluster optimization) and individual protein levels. For the validation on clusters, we used the clusters identified from the mouse BALF and plasma proteins using their joint distance matrices. In each cluster, we filtered to only include proteins that were detected in both human and mouse. The CAs for individual clusters and the integrations from all six clusters were calculated. For the validation on the individual proteins, the biomarker candidates selected from mouse BALF, which were also detected in the plasma data sets, were evaluated in the human plasma data set.
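The cross-species filtering of clusters described above can be sketched as follows. This is an illustration only: it assumes that mouse and human proteins have already been mapped to shared identifiers, and the function name and cluster assignments are hypothetical.

```python
def restrict_clusters(clusters, detected_in_human):
    """Keep only the cluster members that were also detected in human data.

    clusters: dict mapping a cluster id to a list of protein identifiers
    detected_in_human: iterable of protein identifiers found in human plasma
    """
    shared = set(detected_in_human)
    return {
        cluster_id: [protein for protein in members if protein in shared]
        for cluster_id, members in clusters.items()
    }


# Hypothetical cluster assignments, for illustration only.
mouse_clusters = {1: ["THRB", "CO3", "X1"], 2: ["VTDB", "X2"], 3: ["X3"]}
human_detected = ["THRB", "VTDB", "CO3", "ADIPO"]
filtered = restrict_clusters(mouse_clusters, human_detected)
```

Clusters that lose all their members are kept as empty lists here so that cluster numbering stays aligned with the mouse-derived clusters.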
The objective of this study was to develop a semiautomated framework for integrating expert knowledge into a disease marker selection scheme in an iterative manner, guided by a defined metric for evaluating performance. The framework, named ISIC, was designed to serve as a conceptual pipeline rather than a collection of detailed protocols; the basic flow is illustrated in Figure
Flowchart of the ISIC framework in a biosignature discovery process.
To demonstrate ISIC, we applied it to three proteomics data sets associated with COPD. First, we identified the potential biomarkers in data obtained in the BALF and plasma samples from the Ada-deficient mouse model of COPD. This model system has a clear distinction between diseased and nondiseased samples and therefore is well-suited to developing and testing our classification approach. To validate the candidates identified from the development phase, we chose to examine plasma data from smokers with COPD along with their corresponding controls. This data set, derived from the actual patient samples, allowed us to test whether the signatures identified from mouse would be robust in their ability to classify diseased and nondiseased human samples with complex and varied genetic and environmental backgrounds.
Special attention was given to handling missing values appropriately. Missing values are an inevitable issue in many proteomics studies. Even in a carefully designed and implemented proteomics data set, it is not uncommon for 30% or more of the values to be missing, that is, for measurements of a specific protein to be absent from some samples but present in others.
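To make the notion of missingness concrete, the following minimal sketch computes the overall and per-protein fractions of missing values. The data layout (a dictionary of abundance lists with `None` marking a missing measurement) and the function names are assumptions made for illustration, not the format used in the study.

```python
def overall_missing_fraction(abundance):
    """Fraction of all protein-by-sample measurements that are missing."""
    total = sum(len(values) for values in abundance.values())
    missing = sum(v is None for values in abundance.values() for v in values)
    return missing / total


def per_protein_missing_fraction(abundance):
    """Fraction of samples in which each protein was not measured."""
    return {
        protein: sum(v is None for v in values) / len(values)
        for protein, values in abundance.items()
    }


# Hypothetical abundances across four samples; None marks a missing value.
example = {
    "P1": [1.2, None, 0.9, 1.1],
    "P2": [None, None, 2.0, 2.2],
    "P3": [0.5, 0.6, 0.4, 0.7],
}
```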
We calculated distances between all proteins based either on their abundance levels across all observations or on their functional similarity according to their annotations in GO. These sets of distances were then integrated and used for hierarchical clustering. The clusters derived from the three different dissimilarity matrices were used in the Bayesian integration and classification step to obtain CA scores for each combination of parameters. No significant differences in the CA scores were observed between different weighted averages or logistic functions (data not shown). The integrated CA scores from using the full and partial data sets are listed in Table
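The combination of expression-based and function-based distances followed by hierarchical clustering can be sketched as below. This is an illustration under stated assumptions: Euclidean distance for the abundance profiles, a precomputed functional distance matrix scaled to [0, 1], an equal-weight average for the integration, and average linkage for the clustering. None of these specific choices is confirmed by the source, and the function names are hypothetical.

```python
from math import sqrt


def euclidean_matrix(profiles):
    """Pairwise Euclidean distances between rows of abundance profiles."""
    return [
        [sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in profiles]
        for p in profiles
    ]


def rescale(matrix):
    """Scale a distance matrix to [0, 1] by dividing by its largest entry."""
    top = max(max(row) for row in matrix) or 1.0
    return [[value / top for value in row] for row in matrix]


def joint_distance(expression, functional, weight=0.5):
    """Weighted average of two distance matrices of the same shape."""
    return [
        [weight * e + (1.0 - weight) * f for e, f in zip(e_row, f_row)]
        for e_row, f_row in zip(expression, functional)
    ]


def average_linkage(dist, n_clusters):
    """Agglomerative clustering; returns lists of item indices."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


# Hypothetical profiles and functional distances for four proteins.
profiles = [[0, 0], [0, 1], [10, 10], [10, 11]]
functional = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
joint = joint_distance(rescale(euclidean_matrix(profiles)), functional)
groups = average_linkage(joint, n_clusters=2)
```

Setting `weight` to 1.0 or 0.0 recovers clustering on the expression profiles or on the functional relationships alone, which corresponds to the three dissimilarity matrices compared in the text.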
Optimal integrated CAs derived from (A) the distance-based hierarchical clustering and (B) the disease-model-related functional selection approaches.
Clustering based on | Optimal Integrated CA | ||||||||
---|---|---|---|---|---|---|---|---|---|
Distance matrix | Data expression profiles | Functional relationships | A combination of the two | ||||||
No. of clusters | 1 | 6 | 12 | 6 | 12 | 6 | 12 | ||
(A) Different distance matrices | BALF | Full | 0.83 | 0.86 | 0.79 | 0.93 | 0.80 | 0.81 | 0.81 |
Partial | 0.93 | 0.96 | 0.96 | ||||||
Plasma | Full | 0.66 | 0.68 | 0.54 | 0.62 | ||||
Partial | 0.77 | 0.79 | |||||||
Number of proteins1 | All | Top 3 | |||||||
No. of clusters | 1 | 12 | 1 | 12 | |||||
(B) Disease model-related functional selection | BALF | Full | 0.81 | 0.90 | 0.88 | 0.82 | |||
Partial | 0.93 | ||||||||
Plasma | Full | 0.57 | 0.56 | 0.59 | 0.65 | ||||
Partial | 0.73 |
A total of 303 GO terms (data not shown) were determined as enriched
The list of the enriched general functional groups from the Ada-deficient model of COPD extracted by the expert knowledge-driven functional analysis using the BALF data. This list is based on (A) a GO-based biological process enrichment and (B) the disease-model-related expert selection.
No. | Enriched general functional group | (A) GO-based biological process | (B) The disease model-related GO cluster |
---|---|---|---|
1 | Immune system process | (13-1)1 Immune system process | |
2 | Stress/stimulus response | ||
3 | Cellular response to stimulus | ||
4 | Metabolic process | ||
5 | Biological regulation | ||
6 | Death | ||
7 | Localization | (13-2)1 Localization; | |
8 | Cellular organization | (13-4)1 Cellular component organization or biogenesis | |
9 | Proliferation | ||
10 | Others | (13-1)1 Immune system process; |
The CA performances of the ten functional groups were assessed in the same way as in the clustering approach. Specifically, the optimal individual CA scores for the functional clusters were calculated using either all or only the top three differentially expressed proteins within each cluster, and are summarized in Table
In the distance-based clustering approach, the biomarker candidates were the protein clusters. Specifically, 215 (out of 396) proteins in two clusters (with the total number of clusters set as six) or 129 proteins in two clusters (with the total number of clusters set as twelve) in BALF from the best performing combination of clusters were considered to be the biosignatures. Similarly, a group of 13 proteins from the best performing cluster was selected as a narrowed set of biomarker candidates in plasma. Because our approach combines patterns in abundance with functional relationships, we hypothesized that these signatures would be more robust than the individual candidates with top performances.
In the disease model-related functional selection driven by expert knowledge, CAs from the top three most differential proteins of each cluster generally outperformed the CAs from all differential proteins of the cluster. Therefore, we examined the biomarker candidates in the top three proteins from each GO term instead of all proteins under it. The information on the five best CA performances is listed in Table
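Selecting the top three most differential proteins in each functional cluster can be sketched as follows. The ranking statistic is an assumption: this example ranks by absolute log2 fold change, whereas the source does not state which measure of differential abundance was used. The function name and the values are hypothetical.

```python
def top_differential(cluster_members, log2_fold_change, n=3):
    """Return the n proteins with the largest absolute change per cluster.

    cluster_members: dict mapping a cluster id to a list of protein names
    log2_fold_change: dict mapping a protein name to its log2 fold change
    """
    return {
        cluster_id: sorted(
            members, key=lambda p: abs(log2_fold_change[p]), reverse=True
        )[:n]
        for cluster_id, members in cluster_members.items()
    }


# Hypothetical values, for illustration only.
members = {"immune": ["A", "B", "C", "D"], "localization": ["E", "F"]}
changes = {"A": 0.4, "B": -2.1, "C": 1.3, "D": -0.2, "E": 0.9, "F": -1.5}
selected = top_differential(members, changes)
```

Clusters with fewer than `n` members simply return all of their proteins in ranked order.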
To compare the robustness of signatures derived using different approaches and to validate our findings, we used a data set of human plasma samples. These samples were taken from subjects with low BMI (<25) who smoke and have been diagnosed with COPD and from their corresponding healthy controls. A total of 44 proteins detected in the human data were also differentially expressed in mouse plasma, and these were used for validation. The optimal CAs from using the 44 common proteins in the six clusters that were defined by the mouse plasma are listed in Table
The validation results (in CA) on the cluster-based biomarker candidates using a human plasma data set.
Functional group no. in Table | CA from mouse plasma-defined clusters in | CA from mouse BALF-defined clusters in | ||
---|---|---|---|---|
Mouse plasma | Human plasma | Mouse BALF | Human plasma | |
1 | 0.54 | 0.93 | 0.79 | 0.93 |
2 | 0.58 | 0.86 | 0.93 | 0.79 |
3 | 0.56 | 0.71 | 0.72 | 0.71 |
4¹ | 0.83 | 0.71 | 0.79 | 0.79 |
5 | 0.63 | 0.79 | 0.83 | 0.64 |
6 | 0.53 | 0.79 | 0.62 | 0.64 |
1, 4 | 0.80 | 0.86 | ||
2, 5¹ | 1.00 | 0.71 | ||
1–6 | 0.83 | 0.81 |
At the level of individual proteins, four of the six candidates selected by the expert-driven functional selection were also identified in human plasma samples. The CA scores from individual marker candidates and several top integrations for both mouse and human samples are summarized in Table
The validation results in CA on the individual biomarker candidates of COPD in the human data and the CAs from the mouse data.
Individual protein or a panel of proteins | Optimal CA in | Belong to the general functional group1 | ||
---|---|---|---|---|
Mouse BALF | Mouse plasma | Human plasma | ||
Prothrombin, THRB | 0.86 | 0.50 | 0.93 | 1–10 |
Vitamin D-binding protein, VTDB | 0.69 | 0.63 | 0.86 | 2–10 |
Complement C3, CO3 | 0.69 | 0.67 | 0.79 | 1–10 |
Adiponectin, ADIPO | 0.66 | 0.53 | 0.64 | 1–9 |
THRB; VTDB | 0.97 | 0.57 | 0.93 | |
THRB; CO3 | 0.86 | 0.70 | 0.93 | |
VTDB; CO3 | 0.83 | 0.67 | 0.79 | |
THRB; CO3; ADIPO | 0.86 | 0.70 | 1.00 | |
VTDB; CO3; ADIPO | 0.83 | 0.70 | 0.93 | |
THRB; VTDB; CO3; ADIPO |
The bar graph of the average fold changes of the protein abundances in diseased group relative to their controls of four potential biomarkers identified in mouse BALF. The positive fold changes indicate the observed upregulation in the diseased group, the Ada −/− mice, and the negative fold changes indicate the observed downregulation. The significances of these changes are indicated with two (
With data sets available for time-matched diseased and control animals, we were able to compare the individual and integrated optimal CAs at different time points during the developmental course of COPD in the Ada-deficient mouse model. In particular, we compared the optimal CAs derived from the proteins that were individually and cumulatively significantly changed at the five time points in both fluids (Figure
The comparative results in the optimal CAs between the BALF (solid lines) and plasma (dashed lines) from the Ada-deficient mice during the disease developmental course. The CAs were derived from the cumulatively (blue) and individually (green) significantly changed proteins at different time points. The CAs were obtained from the resulting clusters using the joint distance matrix of protein expression patterns and their functional relationships (XOA).
Not surprisingly, the discriminative ability increases in both plasma and BALF as the Ada −/− mice get sicker. It is also interesting that these four candidates are able to classify mice fairly well at very early time points, even before outward manifestations of disease.
In biomedical science, the primary challenge has been shifting from data generation to data interpretation. The explosive growth of high-dimensional data sets has made semiautomated or automated tools a necessity for knowledge discovery.
The semiautomated clustering approach separates the initial marker candidates, that is, the differentially expressed proteins in the data sets, into groups whose members share similar expression patterns and functions. In contrast, the expert-driven functional selection may be somewhat subjective; however, with the incorporation of proper knowledge, it can be an efficient way to extract a handful of biomarker candidates. In this demonstration on COPD data sets, both approaches achieved comparably good classification performance. Our intention here is to illustrate the individual merits and weaknesses of both approaches in biomarker selection schemes in order to gain insight into how to extract valuable information from data sets comprehensively and efficiently.
Biomarker identification is a process of selecting a limited number of biomolecules that convey the essential biological information distinguishing a disease state from a nondisease state. In the clustering approach, our results show that our cluster-based biomarkers are much more robust in their ability to classify human patients than the individual proteins. A possible explanation is that features in clusters may capture more consistent and comprehensive information from the data than individual proteins do. We are currently working to include an extra step of feature selection, which will rigorously identify subsets of proteins with optimal CAs, in order to focus on smaller biomarker sets.
We found that small sets of proteins could be selected with good performance using our expert knowledge-driven approach. Selection in this way has a subjective component, but it can also filter out low-quality markers identified by purely statistical processes. Another limitation of the expert-driven strategy is that not all genes or gene products have annotations, which rules out exploring the functional relationships among them in the currently available knowledge databases.
One noticeable consistency of the two approaches is that all the optimal classification performances, indicated by the optimal integrated CAs, resulted from using partial instead of the full data sets (Table
As previously emphasized, ISIC is intended to be conceptual as well as flexible. The five components function independently and collaboratively. Each component serves a distinct function and is implemented at a different stage in the biomarker discovery process. Their independence makes it easy to tailor individual components to specific data sets or to substitute other methods with similar functions. Data reduction is largely a data-dependent process. As a means of grouping similar data into clusters based on a similarity criterion, the distance-based hierarchical clustering can be replaced by other grouping mechanisms, such as k-means, self-organizing maps, and fuzzy clustering.
In conclusion, we describe a generalizable framework for integrating expert knowledge into processes of disease biomarker discovery. Our framework, ISIC, consists of several independent and collaborative components and is flexible enough to accommodate addition, subtraction, and modification of analyses. The integration of data-driven and knowledge-driven information is used in a distance-based clustering approach in a semiautomated manner. An expert-driven functional selection approach was also performed to select individual proteins for comparison to our automated approach. We identified signatures in a mouse model of COPD and subsequently validated them in a human cohort, where they demonstrated a comparable accuracy in discriminating patients with COPD from those without COPD. This was in contrast to standard approaches to identify biomarkers in the mouse model, which were not robust in the human cohort. We believe that ISIC represents a generalizable platform for identification of robust biosignatures from integrated data sources.
The authors would like to thank Dr. Harish Shankaran for his help in the functional enrichment analysis. This study was supported by the Signatures Discovery Initiative, a component of the Laboratory Directed Research and Development Program at Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830. The proteomics measurements were conducted in the Proteomics Facility at the Environmental Molecular Sciences Laboratory, a DOE BER national scientific user facility at Pacific Northwest National Laboratory (PNNL) in Richland, WA. Support was also received from the Pulmonary Systems Biology Initiative of the Battelle Memorial Institute. The work on the patient samples was supported by funds from the National Institutes of Health (NIEHS) (Grant U54-ES016015).