The Use of Urine Proteomic and Metabonomic Patterns for the Diagnosis of Interstitial Cystitis and Bacterial Cystitis

The advent of systems biology approaches that have stemmed from the sequencing of the human genome has led to the search for new methods to diagnose diseases. While much effort has been focused on the identification of disease-specific biomarkers, recent efforts are underway toward the use of proteomic and metabonomic patterns to indicate disease. We have developed and contrasted the use of both proteomic and metabonomic patterns in urine for the detection of interstitial cystitis (IC). The methodology relies on advanced bioinformatics to scrutinize information contained within mass spectrometry (MS) and high-resolution proton nuclear magnetic resonance (1H-NMR) spectral patterns to distinguish IC-affected from non-affected individuals as well as those suffering from bacterial cystitis (BC). We have applied a novel pattern recognition tool that employs an unsupervised system (self-organizing-type cluster mapping) as a fitness test for a supervised system (a genetic algorithm). With this approach, a training set comprised of mass spectra and 1H-NMR spectra from urine derived from either unaffected individuals or patients with IC is employed so that the most fit combination of relative, normalized intensity features defined at precise m/z or chemical shift values plotted in n-space can reliably distinguish the cohorts used in training. Using this bioinformatic approach, we were able to discriminate spectral patterns associated with IC-affected, BC-affected, and unaffected patients with a success rate of approximately 84%.


Introduction
With the rapid development of methods in the fields of genomics (DNA), transcriptomics (mRNA), proteomics (proteins), and metabonomics (low molecular weight metabolites), there is general enthusiasm towards revolutions in systems biology that will lead to more advanced approaches to diagnostics and therapeutics. Much of the effort in these areas focuses on comparing thousands of species between unaffected and diseased individuals with the hope that one, or a few, key differences in the two states may be identified. While ideally these differences would be recognized in readily obtainable biofluids such as urine, plasma, or serum, the inter-person variability of these samples makes the identification of unique, disease-reflective differences quite challenging. While unique biomarkers, such as HCG for pregnancy, are extremely effective, others such as Cancer Antigen 125 and prostate specific antigen possess poor positive-predictive value, particularly for early disease stage diagnosis. Petricoin et al. have recently demonstrated that low molecular weight serum proteomic patterns from surface-enhanced laser desorption ionization time-of-flight mass spectral (SELDI TOF-MS) data can distinguish neoplastic from non-neoplastic disease within the ovary [16]. A key aspect of their study was the application of a high-order self-organizing cluster analysis approach based on a genetic algorithm that was "trained" on SELDI TOF-MS spectra from serum derived from either healthy women or women with ovarian cancer. The "trained" algorithm was applied to a masked set of samples and resulted in a sensitivity of 100%, a specificity of 95%, and a positive-predictive value for ovarian cancer of 94% [16]. The success of the use of proteomic patterns for the diagnosis of stage I ovarian cancer suggests that patterns generated from other biomolecules within biofluids may also provide a useful indicator of the early onset of a particular disease state.
Since proteomic patterns of serum acquired using SELDI TOF-MS can be diagnostic of a particular disease state, it follows that spectral patterns of biofluids acquired using other types of analytical techniques may also be useful diagnostic tools. Nuclear magnetic resonance (NMR) spectroscopic analysis of bulk biofluids such as urine or plasma (e.g. metabonomics) has been utilized as a means to measure time-related biochemical responses resulting from physiological, pathological, or interventional genetic events [12][13][14]. High-field proton (1H) NMR spectra of biofluids typically contain several hundred resolvable lines, potentially providing structural and quantitative information on hundreds of compounds in a single, nondestructive analysis that takes only a few minutes. The resulting spectrum provides a profile of the metabolic status of the organism. Recently, Brindle et al. showed the capability of discriminating serum samples acquired from patients with coronary heart disease from those with angiographically normal coronary arteries by analyzing the 1H-NMR spectra of each sample using a supervised partial least squares discriminant algorithm [1]. This non-invasive method was shown to have a specificity of >90%.
We studied the effectiveness of analyzing MS and 1H-NMR data using a genetic algorithm combined with a self-organizing cluster analysis to correctly discriminate urine samples from individuals suffering from interstitial cystitis (IC) from those of healthy individuals. IC is a debilitating chronic bladder disease of unknown etiology that affects an estimated 750,000 women in the United States, with one-tenth as many men also diagnosed with this disease [2,15,17,18]. IC is currently diagnosed only by symptomatic criteria (urinary frequency plus pain and/or urgency) in the absence of specific identifiable causes, combined with cystoscopic findings (including petechial hemorrhages called "glomerulations" in approximately 90% of patients, or ulcers that extend into the lamina propria in approximately 10%) [3,6,20]. None of these symptoms, however, are specific for IC, and the specificity of glomerulations for this disorder has also been called into question [21], making it currently difficult to establish the diagnosis of IC in a particular patient. Several urine biomarkers have been associated with IC that ultimately may prove to be useful for the noninvasive diagnosis of this disorder, including an antiproliferative factor (APF) that inhibits the proliferation of normal primary human bladder epithelial cells in vitro [4,7,9], heparin-binding epidermal growth factor-like growth factor, and epidermal growth factor [4,8,10]. Additional noninvasive diagnostic criteria based on urine or serum markers would be useful for establishing the diagnosis of IC as well as for understanding the pathogenesis of this disorder. In addition, to determine the specificity of findings related to IC specimens, we generated proteomic and NMR spectral patterns of urine samples from people suffering from acute bacterial cystitis (BC).
In the following we describe the use of MS and 1H-NMR spectra of urine obtained from patients with IC, patients with BC, and unaffected controls to identify those patients with interstitial cystitis.

Patients
All 40 female and 10 male IC patients had previously undergone cystoscopy and fulfilled the NIDDK diagnostic criteria for IC [3]. In addition, 30 females were identified as having acute bacterial cystitis (diagnosed by the presence of bacteriuria with >10³ of a single type of bacteria per milliliter of urine, plus pyuria, in combination with appropriate symptoms). Asymptomatic controls included individuals who were age- (±5 years), race-, and sex-matched to the IC patients (i.e. 40 females and 10 males). All participants were at least 18 years old and were enrolled in accordance with guidelines of the Institutional Review Board of the University of Maryland School of Medicine.

Urine specimens
Urine was collected by the clean catch method in which each IC patient, bacterial cystitis patient, or control wiped the labial area with 10% povidone iodine solution and then collected midstream urine into a sterile container. Specimens were initially kept at 4 °C, then transported to the laboratory where cellular debris was removed by low speed centrifugation at 4 °C. Urine samples were adjusted to pH 7.2 (using 10 N HCl or 10 N NaOH) and 300 mOsm (using 1 M NaCl or ddH2O), and filtered through a 0.2 µm pore size filter (Gelman Sciences, Ann Arbor, MI). Each specimen was aliquoted under sterile conditions and stored at -80 °C.

ProteinChip array sample preparation
WCX-2 ProteinChip arrays were loaded into a 96-well bioprocessor (Ciphergen Biosystems Inc., Palo Alto, CA) and activated with 10 mM HCl. Arrays were washed with HPLC-grade water and pre-equilibrated with binding buffer (50 mM sodium acetate, pH 4.5). One hundred µL of urine (diluted 1:1 in binding buffer) was added in duplicate to the WCX-2 ProteinChip array surface and incubated for 3 hours at ambient temperature with gentle agitation. The ProteinChip arrays were washed three times with 100 µL of binding buffer, followed by a final wash with 100 µL of HPLC-grade water. ProteinChip arrays were removed from the bioprocessor and air-dried. One µL of 20% α-cyano-4-hydroxycinnamic acid solution in 50% acetonitrile, 0.5% trifluoroacetic acid was added to each WCX-2 ProteinChip array bait surface.

PBS-II TOF MS analysis
ProteinChip™ arrays were analyzed by a Protein Biological System II time-of-flight mass spectrometer (PBS-II, Ciphergen Biosystems Inc.) and mass spectra were recorded using the following settings: laser intensity 185, detector sensitivity 8, m/z range 0-20,000, 130 shots per sample. Data were collected using the Ciphergen ProteinChip software version 3.0. The PBS-II TOF MS was externally calibrated using the "All-In-One" peptide mass standard (Ciphergen Biosystems, Inc.).

Proteomic pattern analysis
Proteomic pattern analysis was performed by exporting the raw data files generated from the PBS-II into tab-delimited files possessing approximately 15,000 data points. The mass spectra were randomly segregated into equal groups for training and blind testing. The models were built on the training set using ProteomeQuest™ (Correlogic Systems Inc., Bethesda, MD) and tested using blinded sample sets. The ProteomeQuest™ software itself implements a pattern discovery algorithm combining elements from genetic algorithms [5] and self-organizing adaptive pattern recognition systems [11]. Genetic algorithms organize and analyze complex data sets as if they were information comprised of individual elements that can be manipulated through a computer-driven analog of a natural selection process. Self-organizing systems cluster data patterns into similar groups. Adaptive systems recognize novel events and track rare instances. The genetic algorithm component of analysis begins with the random generation of a population of 1500 subsets of combinations of features in the urine mass spectra. This number was chosen based on adequate coverage of the data, with a heuristic that no value can be duplicated within each of the 1500 feature subsets. Each feature subset in the population specifies the identities of the exact m/z values in each urine mass spectrum but not their relative amplitude. The number of features in the subset ranges from 5 to 20. For this study, MS data were normalized by linearly scaling the intensity at each m/z value, V, within any randomly generated pattern subset between the largest and the smallest values within that subset so that 0 ≤ NV ≤ 1. In this way, differences in spectral quality that may emanate from biases such as ProteinChip variance, and not from the inherent disease process itself, can be minimized.
The spectra are normalized according to the following formula:

$$NV = \frac{V - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

where NV is the normalized intensity value, V is the intensity value for the specific randomly chosen m/z bin, Min is the smallest intensity value of any of the m/z bins within the randomly selected feature set, and Max is the largest intensity value of any of the m/z bins within the randomly selected feature set. This equation linearly normalizes the peak intensities in the feature set so as to fall within the range of 0 to 1. Prior to analysis, the data are randomly divided into training and testing data sets. The spectra in the training data set are further labeled as diseased or unaffected based upon known clinical diagnosis.
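The linear scaling described above can be illustrated with a minimal sketch. The function name `normalize_subset`, the example intensities, and the handling of a subset whose bins are all equal are assumptions made for illustration; they are not taken from the ProteomeQuest software.

```python
def normalize_subset(intensities):
    """Linearly scale the intensities of one feature subset into [0, 1].

    NV = (V - Min) / (Max - Min), with Min and Max taken over the
    subset itself rather than over the whole spectrum.
    """
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        # Assumed convention: a flat subset carries no pattern, so map it to 0.
        return [0.0 for _ in intensities]
    return [(v - lo) / (hi - lo) for v in intensities]


# Hypothetical intensities at five randomly selected m/z bins of one spectrum.
subset = [12.0, 48.0, 30.0, 12.0, 21.0]
normalized = normalize_subset(subset)
print(normalized)
```

Because Min and Max come from the selected bins only, two spectra with different overall signal levels but the same relative pattern across those bins map to the same normalized point.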

NMR data acquisition
Before 1H-NMR data acquisition, each urine sample was equilibrated to ambient temperature. A D2O stock solution containing 0.21% (w/v) sodium 4,4-dimethyl-4-silapentanoate-2,2,3,3-d4 (TSP) was prepared by dissolving 10.5 mg of TSP in 5 mL of D2O. Thirty-three µL of this solution was added to each 325 µL urine sample, which was then vortexed and transferred to a 5 mm Shigemi NMR tube with 15 mm susceptibility matched plungers. 1H-NMR spectra were acquired of urine samples obtained from 47 control, 50 IC-affected, and 30 BC-affected individuals.
NMR spectra were acquired on a 500 MHz Varian INOVA Spectrometer equipped with a Nalorac indirect gradient HCNP probe and using the B1-insensitive WET water suppression pulse sequence as described by Smallcombe et al. [19]. The WET selective pulses were a 6 ms Gaussian at 8.1 dB. The gradients were 2 ms in length with levels of 24000, 12000, 6000, and 3000, respectively, each followed by a 2 ms delay. The spectra were collected at 27 °C with a spectral width of 6500 Hz, 5 s acquisition, 5 s equilibrium delay, 32 transients preceded by 1 steady-state transient, and a 45° acquisition pulse (4.5 µs at 56 dB). The transmitter was set on the water resonance at -175.5 Hz and was not changed from one sample to the next. The probe was retuned for each sample.

Metabonomic pattern analysis
The 32499 complex points from each 1H-NMR spectrum were zero-filled to the next power of two, 32768 complex points, and transformed with 0.5 Hz exponential line broadening. Each spectrum was phased, referenced to TSP, and drift corrected. A Varian binning package, provided by Dr. Bruce Adams, was used to convert each spectrum into 531 bins starting from 0.16 ppm to 10.80 ppm with widths of 0.02 ppm. The integration values for bins in the regions between 4.60-4.88 ppm and 5.52-6.04 ppm were set to zero to remove contributions from the residual water and urea peaks, respectively. The data were normalized by scaling the sum of the 531 bins for each spectrum to a value of 50,000.
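The zeroing and scaling steps can be sketched as follows. This is not the Varian binning package: the function names, the zero-based bin indexing, and the rule that a bin is excluded when its center falls inside an excluded region are assumptions chosen for illustration.

```python
BIN_START_PPM = 0.16
BIN_WIDTH_PPM = 0.02
TARGET_SUM = 50000.0
# Residual water and urea regions, in ppm, as given in the text.
EXCLUDED_REGIONS = [(4.60, 4.88), (5.52, 6.04)]


def bin_center(index):
    """Center (ppm) of a zero-based bin index (assumed indexing convention)."""
    return BIN_START_PPM + (index + 0.5) * BIN_WIDTH_PPM


def normalize_binned(bins):
    """Zero the excluded regions, then scale the bins to sum to 50,000."""
    cleaned = []
    for i, value in enumerate(bins):
        center = bin_center(i)
        excluded = any(lo <= center <= hi for lo, hi in EXCLUDED_REGIONS)
        cleaned.append(0.0 if excluded else value)
    total = sum(cleaned)
    if total == 0:
        raise ValueError("spectrum has no intensity outside excluded regions")
    return [v * TARGET_SUM / total for v in cleaned]


# Hypothetical flat spectrum with 531 bins, used only to exercise the code.
flat = [1.0] * 531
scaled = normalize_binned(flat)
```

Scaling every spectrum to the same total makes the later overlap and distance calculations comparable between samples with different urine concentrations.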

Identification of spectral outliers
The 127 binned NMR spectra were normalized so that each has a binned intensity that sums to 50,000. Prior to classification, the data were analyzed for the presence of any strikingly different spectra (i.e. an "outlier") from all others within the cohort of 127 spectra. Outliers were identified using either principal component analysis (PCA) or a Sammon Map [1]. A Sammon Map is a projection of a high-dimensional set of data onto a lower-dimensional space such that the distance between all pairs of data points is preserved to the greatest extent. If $D_{i,j}$ is the calculated distance between a pair of cohorts and $d_{i,j}$ is the distance in the lower-dimensional space, the Sammon mapping tries to minimize the following metric.
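Assuming the metric meant here is the conventional Sammon stress (the exact normalization used by the authors is an assumption), it can be written in terms of the two distances defined above as:

```latex
E = \frac{1}{\sum_{i<j} D_{i,j}} \sum_{i<j} \frac{\left(D_{i,j} - d_{i,j}\right)^{2}}{D_{i,j}}
```

Dividing each squared difference by $D_{i,j}$ gives small original distances more weight, so local neighborhoods are preserved preferentially in the projection.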
In this study, each of the 127 NMR spectra was projected onto a 2-dimensional plot by randomly placing each cohort in a plane and then performing 400 Newton-Raphson optimizations of each coordinate to minimize the above expression. This procedure is repeated 400 times, and the 2-dimensional mapping that yields the lowest metric is used.
Instead of using the difference-squared as a measure of the disagreement between two cohorts in a given bin, the agreement between their overall profiles can be used as a measure of their similarity. By comparing the intensities in each bin for two cohorts, the sum of the minimum intensities represents the overlap in their profiles. This overlap is equivalent to the summed intensity (50,000) minus one-half of the Manhattan distance between them. The Manhattan distance (L1-norm) is simply the sum of the absolute difference in intensities. The percent similarity is then 100.0 times the overlap divided by 50,000. The similarity matrix is then used in a corresponding K-Most Similar Neighbor analysis using the same predictive procedure as above.
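The relationship between the overlap and the Manhattan distance can be checked with a short sketch. The three-bin profiles are hypothetical and serve only to illustrate the arithmetic; real profiles have 531 bins.

```python
TOTAL = 50000.0  # summed intensity of every normalized profile


def overlap(a, b):
    """Sum of the bin-wise minimum intensities of two profiles."""
    return sum(min(x, y) for x, y in zip(a, b))


def manhattan(a, b):
    """L1-norm: sum of the absolute bin-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def percent_similarity(a, b):
    """100 times the overlap divided by the common total intensity."""
    return 100.0 * overlap(a, b) / TOTAL


# Hypothetical three-bin profiles, each summing to 50,000.
profile_a = [20000.0, 20000.0, 10000.0]
profile_b = [10000.0, 25000.0, 15000.0]

print(overlap(profile_a, profile_b))             # 40000.0
print(manhattan(profile_a, profile_b))           # 20000.0
print(percent_similarity(profile_a, profile_b))  # 80.0
```

The identity holds because min(x, y) = (x + y - |x - y|) / 2 in every bin, and both profiles sum to the same total.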
These procedures that use the overall NMR profiles are unbiased, but may be strongly affected by dietary or other random factors. In addition, though they use the magnitude of the differences in the profiles, they do not determine where the profiles are significantly correlated to the Class of the cohort and therefore yield no information about a possible metabolite that may be useful in the classification. In an attempt to find significant bins, a feature selection method is used to select a small number of bins and only the intensities in these bins are used to construct a distance matrix. This matrix is then used in a K-Nearest Neighbor algorithm that is slightly different from that described above.

Distance-dependent K-nearest neighbors analysis
A modified evolutionary programming (EP) method was used to identify features that can classify the NMR spectra as being obtained from urine acquired from IC-affected, BC-affected, or healthy patients. An EP was selected because it allows the parent population to maintain diverse solutions from one generation to the next, thereby allowing the final population from this method to be used as a good starting population for a new search if new samples are added to the analysis. With the other three methods, this new search would have to start from scratch since the population is homogeneous. The same argument applies if, upon analysis of the features selected, it is found that one or more of the features have no biological basis. This feature can be randomly changed to another feature in any members of the final EP population that contain it and the search can continue, while the other methods would again have to start from scratch since the new search would be limited to a one-dimensional search of the replaced feature.
In the EP feature selection method used here, a population of N genetic vectors of length L (L = 4 or 8 here) was randomly generated. Each set of L features was then used in the modified KNN procedure described below to generate a cost function that measures the degree to which the cohorts are incorrectly classified. In each generation, each parent generates a new genetic vector by randomly replacing one or two of the features in the parent's genetic vector with new features. One of the features is required to be replaced while the second is probabilistically replaced. In the results presented here, the probability of a second replacement is 50% in the first generation and linearly decreases to 1% in the last generation. Before the cost of this offspring is determined, its genetic vector is compared with all genetic vectors in the parent population and all vectors of offspring that it has produced so far. If it is found to be the same as any existing solution, this offspring is destroyed and the same parent is used to generate a new offspring. This uniqueness operator represents one of many possible maturation operators that can be used with the EP method and guarantees that the parent population will be diverse from generation to generation. A generation is complete once each parent has generated a unique offspring and the offspring's cost has been determined. At the end of each generation a (µ + λ) deterministic selection procedure is used to select the parents for the next generation. This process means that the parent and offspring populations are combined to form a population with 2 × N solutions, and the N solutions with the lowest cost become parents in the next generation. This process is continued for M generations, at which time the search stops and the 50 feature sets with the lowest cost are reported.
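The search loop described above can be sketched compactly. This is an illustrative outline, not the authors' implementation: the function names, the treatment of a feature set as an unordered (sorted) tuple, the check of each offspring against all offspring produced so far, and the toy cost function are all assumptions. In the study the cost comes from the distance-dependent KNN classifier.

```python
import random


def second_replacement_probability(generation, n_generations):
    """Linear decrease from 0.50 in the first generation to 0.01 in the last."""
    if n_generations == 1:
        return 0.50
    frac = generation / (n_generations - 1)
    return 0.50 + frac * (0.01 - 0.50)


def mutate(parent, n_features, p_second, rng):
    """Replace one feature always, and a second one with probability p_second."""
    child = list(parent)
    n_replace = 2 if rng.random() < p_second else 1
    for pos in rng.sample(range(len(child)), n_replace):
        choices = [f for f in range(n_features) if f not in child]
        child[pos] = rng.choice(choices)
    return tuple(sorted(child))


def evolve(cost, n_features, length, pop_size, n_generations, seed=0):
    rng = random.Random(seed)
    parents = set()
    while len(parents) < pop_size:
        parents.add(tuple(sorted(rng.sample(range(n_features), length))))
    scored = {p: cost(p) for p in parents}
    for gen in range(n_generations):
        p_second = second_replacement_probability(gen, n_generations)
        offspring = {}
        for parent in list(scored):
            while True:  # uniqueness operator: discard duplicate offspring
                child = mutate(parent, n_features, p_second, rng)
                if child not in scored and child not in offspring:
                    break
            offspring[child] = cost(child)
        # (mu + lambda) selection: keep the pop_size lowest-cost solutions.
        merged = {**scored, **offspring}
        best = sorted(merged, key=lambda v: (merged[v], v))[:pop_size]
        scored = {v: merged[v] for v in best}
    return sorted(scored.items(), key=lambda kv: (kv[1], kv[0]))


# Toy cost (hypothetical): number of selected bins outside a made-up target set.
INFORMATIVE = {3, 7, 11, 19}


def toy_cost(vector):
    return sum(1 for f in vector if f not in INFORMATIVE)


ranked = evolve(toy_cost, n_features=30, length=4, pop_size=20, n_generations=40)
```

Because duplicates are rejected before costing, the surviving population never collapses onto a single solution, which is the property the text relies on when restarting a search from the final population.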
When each set of L features is examined, the intensities in these bins map each spectrum onto a point in L-dimensional space. Either a Manhattan or a Euclidean distance metric was used to determine the distance between each cohort-pair and construct a distance matrix. In the KNN procedure outlined above, the K nearest samples to a given sample are used to predict its class (i.e. IC, BC, or healthy). In the case of 4-nearest neighbors, the probability that this cohort belongs to a certain class can be 0, 25, 50, 75, or 100%, depending upon the classification of the four closest samples, and independent of the distance they are from the given cohort. In this analysis, the distances to the four closest neighbors affect the prediction.
The classification method used in this analysis is actually a Distance-Dependent KNN (DD-KNN). In a DD-KNN classification, if one of the four neighbors is significantly closer to the given sample than the rest, its classification influences the prediction more than the others. Similarly, if the given sample is far away from all of its neighbors, its classification is less certain. If one of the nearest neighbors is Cohort-i with a classification of Cl(i), the unnormalized probability that the given cohort belongs to this Class, p[Cl(i)], is determined by a monotonically decreasing function of the distance between the given cohort and Cohort-i. If d(i) is the distance from a given sample to Neighbor-i, the unnormalized probability of being in the same class is

$$p[Cl(i)] = \frac{\alpha}{d(i)}$$

This unnormalized probability is truncated at a large value if d(i) is sufficiently small. This function increases the probability that the cohort has the same class as a neighbor if the neighbor is close, but does not take care of the case where the cohort is far away from all neighbors. To handle this case, a fourth classification called Unknown is added, and the unnormalized probability that the cohort belongs to this class relative to each neighbor is expressed as a function of p[Cl(i)] with a maximum value Pu. In the results presented here, Pu is set to 0.1, meaning that the unnormalized probability of the cohort belonging to the Unknown-Class for a given neighbor is 0.1 if the probability that it is the same class as this neighbor is 0.8 or less; it decreases from 0.1 to 0.0 as the probability of being in the same class as the neighbor rises above 0.8. Once the unnormalized probabilities have been obtained from each of the four nearest neighbors, they can be normalized by division by their sum. If the given cohort belongs to Class-I, the error in the characterization of this cohort is simply (1-P(I)), where P(I) is the predicted, normalized probability that it is in this Class based upon its four neighbors. By summing this error for all cohorts, the Cost of this set of features is obtained.
The last requirement is to set a value for α in the expression for p[Cl(i)] above. This constant is determined by the user-supplied value of HALF, which controls the distance d_h(i) at which the unnormalized probability falls to one-half, such that

$$p[Cl(i)] = \tfrac{1}{2} \quad \text{when} \quad d(i) = d_h(i)$$

Since the magnitudes of the intensities change for different bins, d_h(i) is set to HALF times the theoretical maximum distance (TMD) between cohorts for a given set of bins (features). TMD is determined by using the difference between the maximum and minimum intensities in each of the selected bins. α is then determined from the expression

$$\alpha = \tfrac{1}{2} \times \mathrm{HALF} \times \mathrm{TMD}$$

In the results presented here, both scaled and unscaled differences in intensities are used to calculate either the Manhattan or Euclidean distance between two samples (and the TMD). The scaled difference is the absolute difference divided by the average of the intensities, and this option produces a relative change instead of an absolute change. In addition, HALF is set to 0.1, 0.15, and 0.2 in different runs to study the effect of increasing α.
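The distance-dependent vote can be sketched as follows. Several details are assumptions made for illustration and are labeled in the code: the inverse form α/d(i) (inferred from the expression for α and from the later statement that an inverse probability function is used), the truncation value `cap`, and a linear decline of the Unknown probability between 0.8 and 1.0. The function names and the example neighbors are hypothetical.

```python
def neighbor_probability(distance, alpha, cap=10.0):
    """Unnormalized same-class probability alpha / d, truncated at `cap`.

    The inverse form and the value of `cap` are assumptions.
    """
    if distance <= alpha / cap:
        return cap
    return alpha / distance


def unknown_probability(p_same, p_u=0.1, threshold=0.8):
    """Unknown-class contribution of one neighbor.

    Equal to p_u while p_same <= threshold; the linear fall to 0 at
    p_same = 1 is an assumed shape, not taken from the source.
    """
    if p_same <= threshold:
        return p_u
    if p_same >= 1.0:
        return 0.0
    return p_u * (1.0 - p_same) / (1.0 - threshold)


def classify(neighbors, half, tmd, classes):
    """neighbors: list of (distance, class_label) for the K nearest samples."""
    alpha = 0.5 * half * tmd
    scores = {c: 0.0 for c in classes}
    scores["Unknown"] = 0.0
    for distance, label in neighbors:
        p_same = neighbor_probability(distance, alpha)
        scores[label] += p_same
        scores["Unknown"] += unknown_probability(p_same)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def feature_set_cost(neighbor_lists, true_labels, half, tmd, classes):
    """Sum over cohorts of 1 - P(true class)."""
    return sum(
        1.0 - classify(neighbors, half, tmd, classes)[label]
        for neighbors, label in zip(neighbor_lists, true_labels)
    )


# Hypothetical cohort: four neighbors at the given distances, TMD = 100.
CLASSES = ["control", "IC", "BC"]
neighbors = [(5.0, "IC"), (10.0, "IC"), (10.0, "control"), (25.0, "BC")]
probs = classify(neighbors, half=0.1, tmd=100.0, classes=CLASSES)
cost = feature_set_cost([neighbors], ["IC"], 0.1, 100.0, CLASSES)
```

With HALF = 0.1 and TMD = 100, α is 5, so a neighbor at distance 10 (HALF times TMD) contributes exactly one-half, which is the role the text assigns to HALF.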
The next section therefore presents the results of 24 different classification runs. The EP method searches for the optimum set of either four or eight features and the Cost of each set is determined from one of 12 Distance-Dependent KNN examinations (two possible differences, two distance metrics, and three values of α).

Results
Urine samples were collected from both healthy individuals and those who had previously undergone cystoscopy and fulfilled the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) diagnostic criteria for IC [3]. Prior to 1H-NMR analysis the samples were adjusted to pH 7.2 and 300 mOsm and filtered through a 0.2 µm pore filter. The MS and 1H-NMR spectra of a selection of urine samples from both healthy and IC-affected individuals are shown in Figs 1 and 2, respectively. A comparison of the spectra showed that while there is variability between those acquired from the two sample sets, there also exists variability within a single sample set. This inherent variability makes it difficult to visually identify signals that are consistently unique to either the control or disease samples, requiring the application of analytical methods that utilize bioinformatic algorithms to distinguish patterns between the groups.

Diagnosis of urine samples by proteomic pattern analysis
Proteomic pattern analysis was performed by exporting the raw data file generated from the PBS-II TOF-MS. The training set consisted of MS spectral data accumulated from urine samples obtained from 14 asymptomatic controls and 29 patients with IC. The models were built on the training set using ProteomeQuest™ and blind testing was performed with 16 control and 21 IC urine samples. The m/z values (and their intensities) that were found to be classifiers distinguishing urine from a patient with IC from that of an unaffected individual are based on actual values from the raw MS spectra. A total of eight different m/z classifiers (m/z 2980.07, 3939.11, 4003.20, 4391.23, 5386.83, 9769.87, 10090.55, and 18893.23) were required to correctly segregate the urine samples obtained from healthy vs. IC-affected individuals. Blinded testing of this model generated from the training set resulted in 100% sensitivity and specificity for the classification of the spectra obtained from the urine samples of the 16 control and 21 IC-affected individuals. This level of sensitivity and specificity applies only to this limited sample set; whether this diagnostic accuracy can be achieved over a larger cohort would need to be determined by conducting a larger trial.

Classification of source of urine samples by metabonomics pattern analysis
The encouraging results obtained in the proteomic profiling of urine samples from healthy and IC-affected individuals prompted us to investigate whether similar diagnostic capabilities could be obtained using 1H-NMR. In addition, we sought to investigate whether the 1H-NMR spectra acquired from urine samples obtained from more than two groups of patients with specific disorders could be segregated. For this investigation, we tested alternative bioinformatic algorithms with the goal of distinguishing IC patients not only from healthy controls, but also from patients with bacterial cystitis (BC). While there are much more cost-effective methods than NMR to diagnose BC, it would nonetheless be crucial to develop methods that could effectively diagnose IC with a high positive predictive value. Therefore, it was important to determine if NMR could be used to segregate urine samples obtained from three distinct conditions with a low rate of false positive identification.

Search for outlying spectra
A plot of the first versus the second Principal Component (PC) of all of the 1H-NMR spectra acquired was constructed in order to identify outlier spectra that may arise from errors in sample collection, processing, or data acquisition, as shown in Fig. 3. A plot of the first few PCs is not guaranteed to reveal outliers, but is sufficient for this dataset because the first PC accounts for approximately 50.3% of the total variation in the data and an analysis of this component shows that the coefficient for bin/feature 332 is −0.9936. Since this component is virtually composed of a single feature, its coefficient is negative, and one sample spectrum has a large negative value relative to the rest, this cohort has an intensity in this bin that is many times larger than for any other cohort. Because the difference is concentrated in a single feature, this feature will have a large variance and it will dominate one of the first few PCs. If, on the other hand, the difference between this cohort and the rest was spread out across all of the features, it may not appear as an outlier in this type of plot.
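The effect described above, in which a single extreme bin dominates the first PC, can be reproduced with a small sketch. The data are hypothetical (six samples, four bins), and the first PC is obtained by power iteration using only the standard library; this is an illustration, not the software used in the study.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (loadings, scores) of the first PC via power iteration."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    loadings = [1.0 / math.sqrt(m)] * m  # arbitrary unit starting vector
    for _ in range(n_iter):
        scores = [sum(c[j] * loadings[j] for j in range(m)) for c in centered]
        # Multiply by X^T X, which is proportional to the covariance matrix.
        nxt = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        loadings = [v / norm for v in nxt]
    scores = [sum(c[j] * loadings[j] for j in range(m)) for c in centered]
    return loadings, scores


# Hypothetical binned intensities; sample 3 has one extreme value in bin 2.
spectra = [
    [10.0, 12.0, 11.0, 9.0],
    [11.0, 11.0, 10.0, 10.0],
    [9.0, 13.0, 12.0, 11.0],
    [10.0, 12.0, 500.0, 10.0],
    [12.0, 10.0, 11.0, 9.0],
    [10.0, 11.0, 10.0, 11.0],
]
loadings, scores = first_principal_component(spectra)
dominant_bin = max(range(len(loadings)), key=lambda j: abs(loadings[j]))
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
```

The sign of a principal component is arbitrary, so the comparison uses absolute values; a coefficient near −1, as reported for bin 332, carries the same information as one near +1.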
To confirm the presence of the outlier spectrum determined by PCA, a two-dimensional Sammon Map of the set of 127 samples was generated (data not shown). This map attempts to conserve the inter-cohort distances to the largest possible extent, and it also confirms the presence of a single outlier recognized by PCA. In a Sammon Map this outlier would appear whether the differences between it and the other cohorts are uniformly distributed across all features or, as in this case, concentrated in a single feature. Since this outlier can adversely affect subsequent classification studies, it was removed from consideration. The outlier identified by the Sammon Map corresponds to the exact outlier identified by PC analysis (described above). Therefore, the classification is only performed on the remaining 126 spectra (46 controls, 50 IC patients, and 30 BC patients).

Classification by distance-dependent K nearest neighbors
The Euclidean distance matrix used to construct the Sammon Map was also used in a standard KNN study, as shown in Table 1(A). … classified, 50% correctly classified, or completely misclassified. Table 1 shows that the quality of the classification decreases as the number of neighbors increases. When the Manhattan distance matrix is constructed to determine the similarity between the NMR profiles and is then used in a KNN (or K-Most Similar Neighbors) classification study, the results in Table 1(B) are obtained. These simple examinations show that there are features in these spectra that separate these cohorts to some degree, since the results for four neighbors still yield classifications that are significantly above those expected from random chance (36.51, 39.68, and 23.81% for control, IC patients, and BC patients, respectively). By using a Feature Selection method to search for the optimum set of J features, a model using four nearest neighbors should yield results that are superior to the 4-Neighbor results listed in Tables 1(A) and (B). Including the distance dependence will cause the classification of a point to reflect the local environment in this J-dimensional space (i.e. proximity of neighbors and their classifications), and may increase or decrease the accuracy of the classifier.
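The random-chance figures quoted above follow directly from the class sizes of the 126 retained spectra, as this short check shows (the variable names are illustrative):

```python
# Class sizes after removal of the single outlier spectrum.
counts = {"control": 46, "IC": 50, "BC": 30}
total = sum(counts.values())

# Chance of a correct assignment if labels were drawn in proportion to class size.
chance = {label: round(100.0 * n / total, 2) for label, n in counts.items()}
print(chance)  # {'control': 36.51, 'IC': 39.68, 'BC': 23.81}
```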
The first set of classification results uses four features and a Euclidean distance. In all runs, the EP Feature Selection method has a population size of 2000 (sets of four features) and the search runs for 400 generations. The intensity change between two cohorts in a given bin can be either a relative difference (i.e., the absolute difference divided by their average intensity) or an absolute difference. An inverse probability function is used throughout and the probability of being in the Unknown-Class has a maximum value of 0.1 for all neighbors. The value of HALF can be 0.1, 0.15, or 0.2, representing increasing widths in the probability function. The results for the six runs are shown in Table 2. Included in this table are the results for the best set of features and for the 50th best set. In addition, the features used in each of the best 50 sets are examined, and if a feature is used in five or more sets it is listed along with the number of times it appears in these sets.
These results show that selecting an optimal set of four features produces better classification models than the one using the overall NMR profile (K = 4 result in Table 1(A)). It is interesting to note that greater accuracy is obtained when the absolute difference between intensities is used and that this accuracy is less affected by changing the value of HALF (α). Conversely, there is a larger drop in the overall accuracy when the 50th best feature set is compared to the best, but this decrease is less than 4%. The features present in the top 50 sets do not significantly change when HALF changes, but differ markedly depending on whether the intensity change is a relative or an absolute difference.
Very similar results are obtained when a Manhattan distance is used instead of a Euclidean distance (Table 3), though the overall accuracy of the best feature set increases by ∼1% when the relative difference is used and by >1% when the absolute difference is used. The most heavily used features in the top 50 sets in Table 2 are still the most heavily used when a Manhattan distance is applied (Table 3), though there are some changes in the less-used features.
The predicted classifications for the 126 cohorts using the best feature set from run KNN(4b5) are shown in Table 4. The source (i.e., Class) of the urine sample (i.e., normal, IC patient, or BC patient) is also shown. These results show that in the great majority of cases the classification is correct and definitive, or there is an obvious question about the classification. For example, the first two cohorts are almost evenly assigned to normal healthy and IC patient, so their classification is indeterminate between these two classes. The next two cohorts have a high probability of being from an IC patient, but have a 14.4 and 27.9% chance of being unknown. This result means that they are quite far from two or more of their four neighbors and the confidence in the classification is reduced. In only a few cases   A graphical analysis of the features presented in Tables 2 and 3 produces a few interesting results. A display of the intensity of Feature 377, which is used prominently in the top feature sets when a relative difference in intensities is used, is shown in Fig. 4(A). In this plot, the 46 normal control individuals are shown in the first set of green data points, the 50 IC patients are the red points, and the 30 BC patients are the blue points. The dotted lines represent the average intensities for each of the three Classes. Though the average is exaggerated by two high intensity values, the average intensity of this feature is greater in normal control individuals than in either the IC or BC patients.
Similar plots are shown for Features 356, 384, and 437 in Figs 4(B-D), respectively. They also show a reasonable separation between the normal control individuals and the IC- or BC-affected patients, and it is interesting to note that Feature 437 is present in five or more of the best sets for at least one run using either intensity change measure or either distance metric. Though Feature 356 is only present five times in one of the 12 runs listed in Tables 2 and 3, it is the only feature of those listed that shows a reasonable separation in the averages of all three cohorts.
When eight features are used instead of four, the results obtained using Euclidean and Manhattan distances are shown in Tables 5 and 6, respectively. Because the search space of sets of eight unique features is many orders of magnitude larger than that of sets of four, the EP method uses a population size of 4000 and runs for 800 generations.
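The growth of the search space can be made concrete with a short calculation. The sketch below assumes that a feature set is an unordered selection of distinct bins from the 531-bin NMR spectra and that each member of the population is evaluated once per generation; both are assumptions, and the variable names are illustrative.

```python
from math import comb

N_BINS = 531  # number of bins in each NMR spectrum (from the text)

# Number of distinct unordered feature sets of each size (assumed definition of a set).
four_feature_sets = comb(N_BINS, 4)
eight_feature_sets = comb(N_BINS, 8)
ratio = eight_feature_sets / four_feature_sets

# Candidate sets examined by the EP search, assuming one evaluation
# per population member per generation.
evaluations_four = 2000 * 400
evaluations_eight = 4000 * 800

print(f"4-feature sets: {four_feature_sets:.3e}")
print(f"8-feature sets: {eight_feature_sets:.3e}")
print(f"ratio:          {ratio:.3e}")
```

Under these assumptions the eight-feature space is larger by a factor of about 4.5 × 10^7, roughly seven orders of magnitude, while the number of candidate sets examined grows only fourfold, so the eight-feature search samples a far smaller fraction of its space.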
These results again show that an absolute difference in intensities produces better classifiers than relative differences and that the former is less sensitive to changes in HALF. What these results also show is that as the number of features increases from four to eight, the overall quality of the classifier is virtually independent of whether Euclidean or Manhattan distances are used. Euclidean distances generally improve the classification of BC-affected patients, while the Manhattan distance classifiers generally improve the classification of control individuals and IC-affected patient cohorts. The use of more features should continue to improve the accuracy of the classification model up to a certain point. A 100% correct classification would not be expected in a four-neighbor model for all cohorts, and increasing the accuracy further would require polling fewer than four neighbors.

Discussion
The development of technologies that provide a global view of the cell at the genomic, transcriptomic, proteomic, and metabonomic level is and will continue to be a major trend in biological science for the foreseeable future. While there are many different types of information that can be gleaned using these global approaches, one of the major initiatives is to use these technologies to more effectively diagnose diseases and develop better therapies. A vast majority of these initiatives use these technologies to rapidly screen thousands of species within complex mixtures in search of a biomarker that is unique to either the healthy or diseased state. The present approach, however, does not rely on a single unique species, but rather it takes into account the abundances of several key features within each spectrum to select the diagnosis.
An effective diagnostic tool should be, amongst other things, non-invasive, technically simple, and require a minimal amount of sample. While such tools currently exist to screen urine samples for evidence of acute BC (including urine dipstick and microscopy), the current "diagnosis" of IC often involves cystoscopy with hydrodistension performed under general anesthesia. Unfortunately, the finding of glomerulations or Hunner's ulcers at cystoscopy with hydrodistension does not provide a definitive diagnosis of IC, although the procedure is recommended for fulfilling the NIDDK diagnostic criteria for IC. Although other diagnostic parameters (including the measurement of urine APF activity or HB-EGF/EGF levels) have been described for IC, a key advantage of using NMR-based technology is that many of the resonances observed in a typical spectrum of any human biofluid may be readily assignable to a known compound based solely on the resonance frequency values, thereby potentially providing additional information about the disease process itself. In addition, for those signals that cannot be readily assigned, experiments such as total correlation spectroscopy (TOCSY), correlation spectroscopy (COSY), nuclear Overhauser effect spectroscopy (NOESY), etc. can be used in an attempt to identify their compounds of origin. Clearly, the relative sensitivity, specificity, positive predictive value, negative predictive value, and cost for each type of analysis will need to be considered in determining the optimal diagnostic test for IC.
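For reference, the four performance measures named above follow directly from the counts of true and false positives and negatives. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from this study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard two-class diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # affected individuals correctly detected
        "specificity": tn / (tn + fp),  # unaffected individuals correctly cleared
        "ppv": tp / (tp + fp),          # positive calls that are truly affected
        "npv": tn / (tn + fn),          # negative calls that are truly unaffected
    }

# Hypothetical counts, for illustration only.
example = diagnostic_metrics(tp=45, fp=4, tn=42, fn=5)
```

Because the predictive values depend on how common the disease is in the tested population, a test with fixed sensitivity and specificity can show quite different positive predictive values in screening and referral settings.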
An obvious concern in using pattern matching of spectra generated from biofluids for disease diagnostics is the variability of the samples from the human subjects. Unlike experimental animals, such as mice, humans cannot be kept under strictly controlled conditions of diet, rest, physical activity, or drug intake (especially for over-the-counter medications). While there is no universally accepted treatment for IC, many of the individuals affected by IC in this study were using a variety of different medications with the goal of alleviating their symptoms. While some patients were on no medications, most were on various medications including pentosan polysulfate (Elmiron), dimethyl sulfoxide (DMSO), aloe vera, nonsteroidal anti-inflammatory medications, and antihistamines. To show that the 1H-NMR-based diagnostic was not simply classifying spectra based on the medications each individual was taking, patients that formed each node in the diagnostic pattern were analyzed based on their medication intake. It was found that the cluster the individuals fell into was independent of their drug intake. For example, IC-affected individuals taking no medication were spread out amongst the various clusters that were diagnostic of IC, and individuals taking medications were also distributed within the various IC clusters. In addition, none of the frequency values so far identified as being key to generating the diagnostic patterns were directly related to the medication being taken by the individual or any of its known metabolites.
The MS-based analysis was able to correctly classify the urine samples as being obtained from either normal or IC-affected individuals with an accuracy of 100%. However, the bioinformatic tool used to segregate the spectra is unable to perform a three-tiered classification. Therefore, an alternative bioinformatic analysis was used to determine if urine samples from three different conditions could be correctly diagnosed. An analysis of the 531-binned NMR spectra of 126 cohorts produced a single model that is able to correctly classify the cohorts at approximately the 84% level. It uses a Distance-Dependent Four-Nearest Neighbor procedure to predict the Class of each cohort, and the resulting distribution of Class probabilities can suggest to the researcher that the classification of a particular cohort is suspect. The classification of the control and IC patient cohorts is more accurate than that of the BC patient cohorts, and this may be caused by the smaller size of the BC patient training set, the lack of a strong biomarker specific to the diagnosis of BC, or both. However, NMR would never be cost-effective for the diagnosis of BC, because inexpensive, sensitive, and specific methods for diagnosing BC already exist.