Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data

Recent reports from our laboratory and others support the SELDI ProteinChip technology as a potential clinical diagnostic tool when combined with n-dimensional analysis algorithms. The objective of this study was to determine if the commercially available classification algorithm biomarker patterns software (BPS), which is based on a classification and regression tree (CART), would be effective in discriminating ovarian cancer from benign diseases and healthy controls. Serum protein mass spectrum profiles from 139 women (patients with ovarian cancer or benign pelvic diseases, and healthy women) were analyzed using the BPS software. A decision tree, using five protein peaks, resulted in an accuracy of 81.5% in the cross-validation analysis and 80% in a blinded set of samples in differentiating the ovarian cancer from the control groups. The potential, advantages, and drawbacks of the BPS system as a bioinformatic tool for the analysis of the SELDI high-dimensional proteomic data are discussed.


INTRODUCTION
Ovarian cancer has the highest fatality-to-case ratio of all gynecologic malignancies [1,2]. This is attributed to the lack of early warning signs and efficacious early detection techniques [1,3]. Another problem hindering the successful management of the disease is the paucity of prognosticators that could assist the selection of treatment modality. One of the most promising routes towards improvement in the detection and surveillance of ovarian cancer is the identification of serum markers. Utilization of CA125 as an ovarian cancer serum marker has improved cancer detection rates during the last few years [1,2,3]. Nevertheless, CA125 does not diagnose early-stage cancers with high accuracy and is prone to false positives. Therefore, the need to identify additional serum markers for ovarian cancer is paramount to the successful management of this disease.
A major obstacle in finding a diagnostic biomarker is the tremendous molecular heterogeneity that exists for nearly all human cancer, suggesting that simultaneous screening of a patient specimen for multiple biomarkers will be required to improve the early detection/diagnosis of cancer. DNA chip technologies address this problem at the genomic level, and provide accessibility to gene expression profiles. However, since proteins are, for the most part, the mediators of a cell's function, the study of the changes in proteins that result from a pathological lesion, such as cancer, would appear to be a rich source of potential cancer biomarkers.
Most of the previous studies in search of diagnostic biomarkers have employed two-dimensional electrophoresis (2DE) which can resolve hundreds to thousands of proteins present in complex protein mixtures, such as cell lysates and body fluids. Although some successes have been reported in detecting potential ovarian cancer-associated biomarkers [4,5,6,7], this classical proteomic technique is very time consuming, not highly reproducible, and not easily adaptable to a clinical assay format.
A recently developed mass spectrometry proteomic approach, the SELDI (surface-enhanced laser desorption/ionization) ProteinChip System (Ciphergen Biosystems, Inc, Fremont, Calif), appears to hold promise for biomarker discovery and as a potential clinical assay format [8,9]. (The SELDI system and its applications are described in the report by Reddy and Dalmasso [10] and in a recent review by Wright [11].) Using this system, distinct protein patterns of normal, premalignant, and malignant cells were found for ovarian, esophageal, prostate, breast, and hepatic cancers [12,13,14]. Potential biomarkers for breast and bladder cancers were also detected by the SELDI system in nipple aspirate fluid and urine, respectively [15,16].
Recent reports also support that analysis of the SELDI data by "artificial intelligence" algorithms can lead to the identification of protein "fingerprints" specific for prostate, ovarian, and breast cancers, significantly increasing the accuracy in differentiating cancer from the noncancer groups [17,18,19,20]. These studies employed different algorithms to analyze the SELDI data, including a genetic algorithm [19], a decision tree [17,18], and a support vector machine algorithm [20]. Each method appeared to be effective in developing accurate classification systems.
The high dimensionality of the data generated by SELDI requires a mathematical algorithm to analyze the data without overfitting. Since the SELDI protein profiling approach is new, it is difficult to determine up-front which algorithm to select for the data analysis and development of a "diagnostic" classifier. It is also fair to assume that different bioinformatic tools may be required for different cancer or disease systems. The objective of this study was to evaluate the commercially available classification algorithm (biomarker pattern software [BPS]) developed by Ciphergen Biosystems Inc for analysis of the SELDI serum protein profiling data from patients with ovarian cancer, benign pelvic diseases, and normal women. The potential, advantages, and drawbacks of this approach as well as suggestions for improvement are discussed.

Serum samples
Serum samples were obtained from patients with epithelial ovarian cancer prior to treatment administration (n = 44), benign pelvic diseases (n = 61), and from women with no evidence of pelvic disease (n = 34) enrolled through the Division of Gynecologic Oncology, University of Texas, Southwestern Medical Center. Informed consent was obtained from all patient and control groups. The demographics of the patients and the stage distribution of the ovarian cancers are presented in Table 1. Benign conditions included benign pelvic masses (endometriosis, cystadenomas, hydrosalpinx, lipoma, Brenner tumor, fibroids, endometrial polyp). The sera were aliquoted and stored at −80 °C.

SELDI processing of serum samples
Serum samples were applied on the strong anion exchange (SAX) and immobilized-copper (IMAC) chip surfaces. In brief, 21 µL of serum were mixed with 30 µL 8M urea in 1% CHAPS-PBS pH 7.4 buffer for 30 minutes at 4 °C, followed by the addition of 100 µL of 1M urea in 0.125% CHAPS-PBS buffer and 600 µL of binding buffer compatible with the type of surface in use (PBS for IMAC and 20 mM Hepes containing 0.1% Triton for SAX). Fifty µL of the diluted samples were then applied onto the chips using a bioprocessor. Following a 30-minute incubation, nonspecifically bound molecules were removed by 3 brief washes in binding buffer followed by 3 washes with HPLC-gradient H2O. Sinapinic acid (2 × 1 µL of 50% SPA in 50% ACN-0.1% TFA) was applied to the chip array surface and mass spectrometry was performed using a PBS2 SELDI mass spectrometer (Ciphergen Biosystems Inc). Protein data were collected by averaging a total of 192 laser shots. Mass calibration was performed using the all-in-one peptide standard (Ciphergen Biosystems Inc) which contains vasopressin (1084.2 daltons), somatostatin (1637.9 daltons), bovine insulin β-chain (3495.9 daltons), human insulin recombinant (5807.6 daltons), and hirudin (7033.6 daltons). All samples were processed in duplicate.

Processing of SELDI data
Protein peaks were labeled and their intensities were normalized for total ion current (mass range 2-200 kd) to account for variation in ionization efficiencies, using the SELDI software (version 3.1). Peak clustering was performed using the Biomarker Wizard software (Ciphergen Biosystems) and the following specific settings: spectral data from IMAC surface; signal/noise (first pass): 4, minimum peak threshold: 10%, mass error: 0.3%, and signal/noise (second pass): 2 for the 2-20 kd mass range and signal/noise (first pass): 5, minimum peak threshold: 10%, mass error: 0.3%, and signal/noise (second pass): 2.5 for the 20-100 kd mass range. Spectral data from the SAX surface were analyzed with the same settings, except that the minimum peak threshold was set to 5%. With these labeling parameters, a total of 122 protein clusters (45 from the IMAC and 77 from the SAX surface) were generated. Peak mass and intensity were exported to an Excel file, and the peak intensities from each pair of duplicate spectra were averaged. Pattern recognition and sample classification were performed using the BPS. The decision tree described in the Results section was generated using the Gini method nonlinear combinations. A 10-fold cross-validation analysis was performed as an initial evaluation of the test error of the algorithm. Briefly, this process involves splitting the dataset into 10 random segments and using 9 of them for training and the 10th as a test set for the algorithm. Multiple trees were initially generated from the 122 classifiers by varying the splitting factor by increments of 0.1. These trees were evaluated by cross-validation analysis. The peaks that formed the main splitters of the tree with the highest prediction rates were then selected, and the tree was rebuilt based on these peaks alone and evaluated on the test set. The values of P were calculated using the t-test (Biomarker Wizard software).
The value P < .05 was considered to be statistically significant.
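The two ideas underlying the tree-building procedure, a Gini-based split on a single peak and the partition of samples into 10 cross-validation segments, can be illustrated with a minimal sketch. This is not the BPS implementation: it assumes a plain univariate threshold split, and the function names and toy intensities below are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(intensities, labels):
    """Find the intensity threshold on one peak that minimizes the
    size-weighted Gini impurity of the two child nodes.
    Returns (threshold, weighted_impurity); threshold is None if no
    split improves on the unsplit node."""
    pairs = sorted(zip(intensities, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between identical values
        thr = (lo + hi) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint segments;
    each segment serves once as the held-out set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Toy data (hypothetical intensities, not taken from the study)
toy_intensities = [1.0, 1.5, 0.5, 3.0, 2.5, 3.5]
toy_labels = ["control", "control", "control", "cancer", "cancer", "cancer"]
threshold, impurity = best_split(toy_intensities, toy_labels)

# 124 is the size of the learning set reported in the Results
folds = k_fold_indices(124, k=10)
```

A full CART procedure applies such splits recursively and prunes the resulting tree; the sketch shows only one node and the fold assignment.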

RESULTS
One hundred thirty-nine serum samples were assayed by SELDI mass spectrometry. Both SAX and IMAC surfaces could effectively resolve low-mass (< 20 kd) protein peaks, although the SAX surface appeared superior in resolving larger (> 20 kd) protein peaks. Figure 1 shows representative protein spectra from one serum sample processed on SAX and IMAC chips.
Of a total of 139 serum samples, 124 (85 controls and 39 cancers) were randomly selected to form the learning set and 15 (10 controls and 5 cancers) to form the blinded test set for the algorithm. Five peaks were selected by the BPS algorithm to discriminate cancer from the noncancer groups. Figure 2 is the decision tree that was generated from the learning set to classify the two groups. Three peaks (5.54, 6.65, and 11.7 kd) detected on the IMAC chip and 2 (4.4 and 21.5 kd) detected on the SAX surface form the main splitters. Their mass spectra and gray-scale/gel views are shown in Figures 3, 4, 5, 6, and 7. These peaks have significantly different intensity levels between the cancer and benign or normal controls, with the exception of the 6.65 and 21.5 kd peaks, which did not differ significantly between cancer and benign samples (Table 2). A 10-fold cross-validation analysis was performed as an initial evaluation of the accuracy of the algorithm in predicting ovarian cancer. A specificity of 80% and sensitivity of 84.6% were obtained (Table 3). In the test set, sensitivity and specificity of 80% were obtained (Table 3). The misclassified samples in the test set included one benign (uterine fibroid), one normal, and a stage IIIC cancer.
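The reported rates can be checked with a short calculation. The test-set counts follow directly from the text (5 cancers with one misclassified, 10 controls with two misclassified). The cross-validation counts (33 of 39 cancers, 68 of 85 controls) are an assumption: they are back-calculated from the reported percentages, not read from Table 3.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Blinded test set: 5 cancers (1 missed), 10 controls (2 called cancer)
test_set = rates(tp=4, fn=1, tn=8, fp=2)

# Cross-validation on the learning set: counts ASSUMED from the
# reported 84.6% sensitivity (39 cancers) and 80% specificity (85 controls)
cross_val = rates(tp=33, fn=6, tn=68, fp=17)

for name, (sens, spec, acc) in [("test set", test_set), ("cross-validation", cross_val)]:
    print(f"{name}: sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
```

Under these assumed counts, the cross-validation accuracy works out to 101/124, which agrees with the 81.5% quoted in the abstract.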

DISCUSSION
The high degree of genetic heterogeneity associated with human cancers makes it likely that panels of multiple biomarkers will be needed to improve early detection/diagnosis. This entails the development of high-throughput proteomic and genetic approaches as well as of reliable bioinformatic tools for data analysis. The SELDI ProteinChip system offers the advantage of rapid and simultaneous detection of multiple proteins from complex biologic mixtures. We employed this system in combination with the BPS classification algorithm for protein profiling of ovarian cancer in serum. Using this approach, we generated a classifier that was 80% accurate in discriminating patients with ovarian cancer from patients with benign disease and healthy controls in a blinded test set. Evaluation of the classifier by cross-validation and by analysis of the independent test set offers statistical confidence in the potential of this approach as an ovarian cancer detection tool. However, the small sample size of this study limits the validity of generalized conclusions. Complete evaluation of this classifier will require testing its prediction rates on larger "blinded" and independent serum sets.
The BPS software was found to be relatively simple to use. However, BPS, like other mathematical algorithms, is prone to data overfitting, and is not reliable when the number of variables included in the analysis is large relative to the sample size. A preselection process of the most significant variables using statistical analysis (eg, ROC curve, ANOVA) may help in alleviating this problem.
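One possible form of such a preselection step is to rank peaks by the area under the ROC curve (AUC) and pass only the top-ranked peaks to the classifier. The sketch below is an illustration under stated assumptions, not a procedure used in this study: the peak names, intensities, and the `preselect` helper are hypothetical, and the AUC is computed by direct pairwise comparison.

```python
def auc(cancer_values, control_values):
    """AUC as the probability that a randomly chosen cancer sample has a
    higher intensity than a randomly chosen control (ties count 0.5)."""
    wins = 0.0
    for c in cancer_values:
        for n in control_values:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_values) * len(control_values))

def preselect(peaks, labels, top_n):
    """Rank peaks by |AUC - 0.5| so that both upregulated and
    downregulated peaks score highly, and keep the top_n names.
    peaks maps a peak name to intensities aligned with labels."""
    scored = []
    for name, values in peaks.items():
        cancer = [v for v, y in zip(values, labels) if y == "cancer"]
        control = [v for v, y in zip(values, labels) if y != "cancer"]
        scored.append((abs(auc(cancer, control) - 0.5), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]

# Hypothetical toy data: three peaks measured in six samples
labels = ["cancer", "cancer", "cancer", "control", "control", "control"]
peaks = {
    "peak_A": [5, 6, 7, 1, 2, 3],  # higher in cancer
    "peak_B": [1, 5, 2, 4, 3, 6],  # weakly discriminating
    "peak_C": [1, 2, 3, 7, 8, 9],  # lower in cancer
}
selected = preselect(peaks, labels, top_n=2)
```

To avoid an optimistic bias, the ranking would need to be recomputed inside each cross-validation fold rather than once on the full dataset.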
Petricoin et al [19] recently reported the successful application of a genetic algorithm for the analysis of SELDI proteomic data from ovarian cancer patients. In that study, five discriminatory peptides were detected in the molecular mass range of 500-2500 daltons, and the accuracy in predicting ovarian cancer in a blinded set of samples was 97.4%. We focused on the analysis of potential biomarkers in higher mass ranges (> 2000 daltons). Furthermore, whereas BPS analyzes labeled peak information, the genetic algorithm employed by Petricoin et al analyzes "raw" SELDI data. In that case, further identification of the potential discriminatory markers requires coupling the genetic algorithm with a peak identification system that translates the raw data into protein peak information. BPS employs the peak identification system of the SELDI software, which facilitates biomarker detection. It should be noted, however, that careful and precise selection of the peak labeling settings and normalization of peak intensities are critical for biomarker identification and for the efficient and reliable performance of any learning algorithm used in conjunction with the SELDI system. Besides providing a preliminary evaluation of the suitability of BPS for the comparison of SELDI data, our study also demonstrates the potential of combining spectral data from different types of surfaces as a means to increase protein resolution. Although, compared to SELDI, the resolving power of 2D gel electrophoresis remains unchallenged, we have found that this combinatorial approach can significantly enhance biomarker discovery and increase test accuracy for ovarian and breast cancers from 70-75% up to 90% [21].
Figure 6. Spectra (top) and grey-scale or gel views (bottom) of the peaks (arrows) forming the splitting rules. The protein peak was detected on the SAX surface. The peak appears to be upregulated in the cancer (C1-C4) compared to the benign (B1-B2) and normal (N1-N2) groups.
Figure 7. Spectra (top) and grey-scale or gel views (bottom) of the peaks (arrows) forming the splitting rules. The protein peak was detected on the SAX surface. The peak appears to be downregulated in the cancers.
In conclusion, the BPS software appears to be potentially suitable for analysis of the high-dimensional SELDI spectral data. Avenues for improvement of the algorithm performance include optimization of the peak labeling process as well as preselection of the most significant peaks by statistical approaches. More extended studies will be required to validate the potential and reliability of BPS as a bioinformatic tool for proteomic studies. It should also be emphasized that comparative analysis of different types of algorithms will be of paramount importance for the better evaluation of their performance and the selection of the bioinformatic features needed for effective biomarker discovery and discrimination of cancer.