Proteomic Characterization of Cerebrospinal Fluid from Ataxia-Telangiectasia (A-T) Patients Using a LC/MS-Based Label-Free Protein Quantification Technology

Cerebrospinal fluid (CSF) has been used for biomarker discovery of neurodegenerative diseases in humans since biological changes in the brain can be seen in this biofluid. Inactivation of A-T-mutated protein (ATM), a multifunctional protein kinase, is responsible for A-T, yet biochemical studies have not succeeded in conclusively identifying the molecular mechanism(s) underlying the neurodegeneration seen in A-T patients or the proteins that can be used as biomarkers for neurologic assessment of A-T or as potential therapeutic targets. In this study, we applied a high-throughput LC/MS-based label-free protein quantification technology to quantitatively characterize the proteins in CSF samples in order to identify differentially expressed proteins that can serve as potential biomarker candidates for A-T. Among 204 identified CSF proteins with high peptide-identification confidence, thirteen showed significant protein expression changes. Bioinformatic analysis revealed that these 13 proteins are involved in either neurodegenerative disorders or cancer. Future molecular and functional characterization of these proteins would provide more insight into potential therapeutic targets for the treatment of A-T and into biomarkers that can be used to monitor or predict A-T disease progression. Clinical validation studies are required before any of these proteins can be developed into clinically useful biomarkers.


Introduction
A-T is a neurodegenerative disease characterized by progressive cerebellar degeneration, immunodeficiency, cancer predisposition, premature aging, growth retardation, gonadal atrophy, high sensitivity to ionizing radiation, and genomic instability [1][2][3][4][5][6]. Many studies have suggested that a deficiency in the ability to repair DNA double-strand breaks (DSBs) is the main cause of the A-T phenotype [7]. A major breakthrough in understanding the pathophysiology of A-T came with the identification of the defective gene, ATM (ataxia-telangiectasia mutated), which is mutationally inactivated in individuals with the disease [8]. The identification of ATM has facilitated rapid progress in understanding many aspects of the molecular basis of this disease.
The ATM protein is a serine-threonine kinase that undergoes autophosphorylation in response to DNA damage and subsequently initiates a signaling cascade that involves the phosphorylation of several downstream substrates, including p53, p53BP1, Chk2, BRCA1, and TRF2 [7,9,10]. Recently, substantial insight has been obtained regarding the mechanism by which ATM signals to effector proteins after DNA double-strand breaks have occurred. Although ATM is an essential factor for sensing and signaling the repair of DSBs, other factors such as the MRN complex (Mre11/Rad50/Nbs1) may play an important upstream role in the activation of ATM [11]. In addition, ATM is a member of a large protein complex called the BRCA1-associated genome surveillance complex, suggesting that DNA damage recognition and signaling also involve other proteins, several of which are substrates for ATM [12]. A vast amount of literature has demonstrated the role of ATM in regulating a damage response pathway that ultimately leads to cell cycle checkpoint arrest, DNA repair, or apoptosis [13]. Understanding this role of ATM has explained the predisposition of A-T patients to develop immunodeficiency and cancer but has not explained the observed neurodegeneration. A global quantitative analysis of proteins associated with the A-T phenotype from A-T patient samples has not yet been reported but might shed new light on this dilemma.
One of the goals of proteomics is to measure and characterize the protein expression profiles in specific tissues and biofluids. Even though a tremendous effort has been made to improve proteomic technologies, there are still numerous challenges associated with even the most advanced technologies when analyzing global protein expression due to the inherent complexity of clinically relevant biological samples. These challenges include: (1) the sensitivity of the instrument and its ability to identify novel proteins, (2) the need to be moderate to high throughput, (3) the wide range of protein masses and abundances (dynamic range) that need to be covered, (4) the ability to quantitatively analyze protein expression and posttranslational modifications, and (5) access to the appropriate tissue and/or biofluid. With the recent development of a label-free protein quantification technology [14], large-scale and high-throughput analysis of complex biological samples has become possible, overcoming some of the challenges in proteomics [15][16][17][18]. This unique technology combines a proprietary sample preparation protocol [19], the LC/MS method, and statistical data analysis tools to quantitatively analyze proteins from whole tissue homogenates, cell lysates, or depleted serum/plasma samples.
In this work, we used cerebrospinal fluid (CSF) samples from A-T patients and age- and gender-matched unaffected controls to identify and verify potential biomarkers of A-T. CSF was selected as it has been shown to be a relevant biological sample to study other neurodegenerative diseases such as Alzheimer's disease (AD), and to study changes in the predominant clinical phenotype of A-T (neurodegeneration) that have not been addressed in previous studies [20][21][22].

Human Subjects.
We contacted all adults (≥18 years of age) with A-T followed at the A-T Clinical Center at Johns Hopkins Medical Center. Eight were willing to have a lumbar puncture for the purpose of this research. The diagnosis of A-T was made by the combination of three observations: (1) characteristic neurologic abnormalities such as oculomotor apraxia, bulbar dysfunction, and postural instability, (2) oculocutaneous telangiectasia, and (3) at least two of the following laboratory abnormalities: elevated serum alpha-fetoprotein level, absence of ATM on a western blot, increased rate of X-ray-induced chromosomal breakage in comparison to a control population, and/or mutations in both alleles of the ATM gene.
Controls were otherwise healthy individuals having a lumbar puncture performed for a clinical indication (e.g., suspected pseudotumor cerebri or evaluation of chronic headache) and found not to have another neurologic disease.
The institutional review board of the Johns Hopkins Medical Institutions approved the study, and informed consent was obtained from every subject.

CSF Samples.
The CSF samples were collected at Johns Hopkins Hospital (Baltimore, Md, USA). Lumbar punctures were performed by standard clinical technique using local anesthetic. The first 4 mL of CSF were used for standard chemistry and hematology tests. The next 1 mL was collected for proteomic analysis, immediately transported to the laboratory, and frozen at −80 °C.
The A-T group consisted of eight patients, six women and two men, ranging in age from 20 to 26 years old (mean ± S.D., 22.17 ± 2.13 years) (Table 1). As determined by the Bradford protein assay [23], the total CSF protein concentration in all samples ranged from 211.5 μg/mL to 441.5 μg/mL with a mean of 288.5 ± 93 μg/mL. The control group consisted of five gender- and age-matched healthy controls. In the control group, the mean CSF protein concentration was 200.7 ± 81 μg/mL, ranging from 98.9 μg/mL to 369.9 μg/mL. Aliquots of CSF were stored in polypropylene tubes at −80 °C until use.

Sample Preparation.
The two most abundant serum proteins, albumin and IgG, were removed from the CSF samples using ProteoPrep spin columns. Depleted CSF samples were denatured by 8 M urea for 1 h with agitation at room temperature. Chicken lysozyme (0.25 μg, used as QA/QC reagent) and a volatile reduction/alkylation solution (97.5% acetonitrile, 2% iodoethanol, and 0.5% triethylphosphine) were added to each sample, and the solutions were incubated at 37 °C for 1 h according to a previously published procedure [19]. The samples were dried under vacuum on a speedvac. The resulting pellets were redissolved in 100 μL of 100 mM ammonium bicarbonate (pH 8) containing 0.4 μg of [...]
Identified proteins were categorized into two priority groups based on the quality of the peptide identification and the number of unique peptides identified [24]. All the proteins were identified with at least one best peptide identified at a confidence level ≥90% (q-value ≤ 0.1; the q-value represents a false discovery rate, or FDR, as described previously [14,21]). Proteins were assigned to Priority 1 if two or more unique peptides were identified or Priority 2 if only a single peptide was identified. Peptides assigned to proteins with a confidence level of less than 90% were filtered out of this study. The estimation of the confidence levels, which is based on a random forest recursive partition supervised learning algorithm, was described previously [24]. Protein quantification was carried out using a proprietary protein quantification algorithm licensed from Eli Lilly & Company (Indianapolis, Ind, USA) as described previously [14]. Briefly, once the raw files were acquired from the LTQ, all extracted ion chromatograms (XICs) were aligned by retention time. To be used in the protein quantification procedure, each aligned peak must match the parent ion, charge state, fragment ions (MS/MS data), and retention time (within a 1-min window).
After alignment, the area-under-the-curve (AUC) for each individually aligned peak from each sample was measured, quantile normalized [25], and compared for relative abundance. All peak intensities were transformed to a log2 scale before quantile normalization. Quantile normalization was employed to ensure that every sample has a peptide intensity histogram of the same scale, location, and shape. This normalization removes trends introduced by technical variations including sample handling, sample preparation, total protein differences, and changes in instrument sensitivity while running multiple samples [25]. If multiple peptides have the same protein identification, then their quantile-normalized log2 intensities were averaged to obtain log2 protein intensities. The log2 protein intensity is the final quantity that is fit by a separate ANOVA statistical model for each protein, in which Sample(Group) is a random effect. Group effect refers to the effect caused by the experimental conditions or treatments being evaluated. Sample effect represents the random effects from individual biological samples. It also includes random effects from sample preparation. All of the injections were randomized, and the same person operated the instrument for all samples in this study. The inverse log2 of each sample's mean was calculated to determine the fold-change between groups.
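The normalization and peptide-to-protein averaging described above can be illustrated with a minimal Python sketch. It is not the proprietary quantification algorithm used in this study: the function names (`quantile_normalize`, `protein_log2_intensity`), the toy AUC values, and the tie-breaking by position are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists of log2 values.

    Each value is replaced by the mean, across samples, of the values that
    share its rank, so every sample ends up with the same distribution.
    Ties are broken by position, which is a simplification.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must contain the same number of peaks")
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def protein_log2_intensity(peptide_log2, peptide_to_protein):
    """Average normalized log2 peptide intensities that map to the same protein."""
    grouped = defaultdict(list)
    for peptide, value in peptide_log2.items():
        grouped[peptide_to_protein[peptide]].append(value)
    return {protein: sum(vals) / len(vals) for protein, vals in grouped.items()}


# Hypothetical AUC values for three aligned peaks in two samples.
auc = [[4.0, 1.0, 16.0], [2.0, 32.0, 8.0]]
log2_auc = [[math.log2(v) for v in sample] for sample in auc]  # log2 first
normalized = quantile_normalize(log2_auc)

# Hypothetical mapping: two peptides assigned to one protein in sample 1.
protein_means = protein_log2_intensity(
    {"pepA": normalized[0][0], "pepC": normalized[0][2]},
    {"pepA": "P1", "pepC": "P1"},
)
```

After normalization both toy samples contain the same set of values, which is the property the text describes as histograms of the same scale, location, and shape.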

Linear Discriminant Analysis (LDA).
LDA was performed using JMP (version 8) to separate the A-T group from the control group. The individual protein intensities of the 13 Priority 1 proteins that showed significant expression changes were used as input for this analysis. The smallest set of proteins that gave the best discrimination between the two groups was selected as biomarker candidates.
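For readers unfamiliar with the technique, the idea behind a two-group linear discriminant can be sketched in plain Python. This is a hand-written Fisher discriminant restricted to two features with equal priors, not the JMP procedure used in this study; the function names and the log2 intensities below are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_covariance_2d(group_a, group_b):
    """Pooled within-group covariance matrix for two groups, two features."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def fit_lda_2d(group_a, group_b):
    """Return (weights, threshold); a score above threshold assigns to group_a."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    s = pooled_covariance_2d(group_a, group_b)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # Fisher direction: inverse pooled covariance times the mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]  # equal priors assumed
    return w, threshold


def classify(x, w, threshold):
    score = w[0] * x[0] + w[1] * x[1]
    return "A-T" if score > threshold else "control"


# Hypothetical log2 intensities: [up-regulated protein, down-regulated protein].
at_samples = [[10.4, 8.1], [10.6, 8.0], [10.5, 7.9], [10.7, 8.2]]
control_samples = [[10.0, 8.6], [10.1, 8.8], [9.9, 8.7], [10.2, 8.5]]
w, threshold = fit_lda_2d(at_samples, control_samples)
predictions = [classify(x, w, threshold) for x in at_samples + control_samples]
```

With more than two proteins the same idea applies, but the pooled covariance matrix must be inverted with a general linear solver rather than the closed-form 2 × 2 inverse used here.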

Pathway Analysis.
After LDA, a list of five proteins that could be used to distinguish A-T from normal samples was created and analyzed by Pathway Studio (v6.0) (Ariadne, Rockville, Md, USA) in an attempt to link them with the key A-T protein ATM. Briefly, the proteins' corresponding gene list was run against the ResNet database that is equipped with functional relationships from other scientific literature and commercial databases. The filters that we applied included "Add shortest path" and "Protein." Protein interactions and the biological processes in which they were involved were noted. The information received was further explored in the literature to determine the interactions and regulatory relationships between the proteins of interest and ATM.

Multiple-Reaction-Monitoring (MRM) Analysis.
To verify and validate the candidate biomarkers of A-T, an MRM-based targeted proteomic analysis was performed to quantify the relative protein expression levels between the control and A-T patient samples. An AB/SCIEX 4000 QTRAP mass spectrometer interfaced with a Dionex Ultimate 3000 HPLC system was used for this targeted proteomic quantification study. In this study, five candidate proteins (listed in Table 4) were selected for verification/validation. The analytes, which were the same tryptic peptides used for the label-free discovery study, were first loaded onto a trapping column (75 μm i.d. × 20 mm) and then onto an analytical column (75 μm i.d. × 150 mm packed in-house with C18 [...] Table 4. Relative quantification was accomplished using the Analyst software (version 1.5.1, Applied Biosystems).

Results
To characterize the alterations in protein expression related to A-T, we performed LC/MS-based quantitative proteomic analysis of CSF from control and A-T patients. The patient information in each group is summarized in Table 1. Proteins identified based on priority groups are summarized in Table 2. A total of 204 proteins were identified and quantified with high confidence in the samples. The expression levels of 13 proteins from Priority 1 and 7 proteins from Priority 2 were statistically significantly changed (listed in Table 3). The 13 significantly changed proteins from the Priority 1 group were further analyzed by Linear Discriminant Analysis (LDA) and pathway analysis for their roles in biological processes. Figure 1 illustrates the LDA results. Expression differences of proenkephalin-A (PENK, P01210), isoform 1 of extracellular matrix protein 1 (ECM1, Q16610), secretogranin-2 (SCG2, P13521), isoform 1 of CD166 antigen (ALCAM, Q13740), and insulin-like growth factor binding protein 7 (IGFBP7, Q16270) can clearly discriminate A-T samples from healthy controls. The literature search results demonstrate that these five proteins are involved in either human cancers or neurodegenerative processes [26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42]. Figure 2 shows a protein-protein interaction network linking these five proteins to ATM from the pathway analysis. Their relative protein expression levels as determined by MRM are shown in Figure 3.
For QA/QC purposes, chicken lysozyme was spiked into every individual sample at a constant amount (10 ng chicken lysozyme/2 μg of sample) before tryptic digestion. Nine unique chicken lysozyme peptides were detected and quantified. After averaging these peptide concentration values, a −1.099 fold-change was observed with a q-value of 0.77 (77% FDR), suggesting that this small change is not statistically significant and that the data obtained from this study were reliable.
MRM results demonstrate the same direction of protein expression changes (up- or downregulation) as the global discovery study, even though the absolute fold-change may be slightly different in some cases, likely due to differences in the platform used. In this targeted proteomic study, we were able to detect and quantify four out of the five proteins of interest. Unfortunately, we were unable to confidently detect the MRM peptide "SSPSFSSLHYQDAGNYVCETA" from ALCAM due to its low abundance. All of the MRM peptides and transitions for each protein of interest are listed in Table 4.

Discussion
Much of the effort in proteomics has been devoted to improving the sensitivity of the instrument and measurement accuracy. At the present time, there is no consensus within the field of proteomics on any one technology that can attain complete and quantitative protein coverage of all proteins in a given tissue or biofluid. The most commonly used proteomic approach, the so-called "bottom-up" approach, utilizes two steps: peptide separation followed by peptide/protein identification and quantification by mass spectrometry (MS). Two-dimensional gel electrophoresis (2DE) has been the workhorse for protein separation in proteomics research efforts in the past decade, but its inability to widen the protein dynamic range and its low throughput remain its biggest disadvantages and thus limit its utility in large-scale and high-throughput proteomic analysis. One alternative approach to 2DE is the nongel-based liquid chromatography mass spectrometry-based shotgun proteomic technology [43][44][45][46]. It provides a powerful analytical platform to resolve and identify thousands of proteins from a complex biological sample in a single experiment. This approach is rapid and more sensitive, and it increases the protein dynamic range 3- to 4-fold as compared to 2DE. The hallmark of this method is its ability for high-throughput large-scale proteomic analysis [47,48]. Although some success using isotopic labeling technology in combination with mass spectrometry for protein quantification has been reported [48], recently developed label-free protein quantification technology [14] has become a major platform for biomarker discovery primarily due to the high costs of the labeling reagents, especially for a large-scale study. In this study, we used a peak-intensity-based label-free protein quantification method that was previously applied in many other studies [14,15,17,18].
One challenge to studying the neurodegeneration seen in A-T is access to affected brain tissue. For this reason, we chose to analyze CSF since this biofluid is in direct contact with the brain and studies of other neurodegenerative diseases have shown that disease-specific changes in the brain can be seen in this biofluid [20][21][22]. A recent study by Cheema et al. [49], which analyzed ATM-mediated gene and protein expression in A-T fibroblasts, found a completely different set of proteins than those observed in our CSF study, which highlights the importance of selecting a clinically relevant tissue for biomarker discovery.

Confidence in the Methodology.
In this proteomic study, we did not detect A-T-related proteins, such as ATM-related protein kinases or their substrates [7]. This could be due to the inability of current LC/MS technology to confidently detect low-abundance proteins. However, the advantages of the method far outweigh this limitation. Firstly, by detecting only the end products of gene activity, proteomic analysis ignores transcripts that may never be translated, giving it an advantage over genomic analysis. Secondly, the LC/MS-based label-free protein quantification technology used here has proven itself a powerful tool to resolve and identify thousands of proteins from complex biological samples [16,50]. It is a method that compares the relative expression levels of the same protein under different physiological conditions. The method is rapid, high-throughput, and more sensitive than many other proteomic platforms [16], and it increases the protein dynamic range 3- to 4-fold compared to the conventional 2D gel-based proteomic platform. During the development of the method, chicken lysozyme was used for QA/QC purposes, and the method has since been robustly tested on many different types of samples [15][16][17][18]. Automation allows it to be applied to large-scale proteomic analysis; thus, it has become a tool of choice for biomarker discovery [15,51]. The inclusion of statistics in both the experimental design and data analysis allows for the detection of small but statistically significant changes, a capability not offered by other methods. We are, therefore, confident in the qualitative and quantitative data produced by this method.

Significance of Results
A. Statistical Motivation. The size of the treatment or disease effect (signal) needs to be evaluated relative to the sample and replicate variation (noise). The signal-to-noise ratio is estimated based on a statistical model. If the data have multiple sources of random variation, such as biological samples and replicates, then the data are modeled as a linear mixed model (a generalization of ANOVA, analysis of variance). This kind of model, especially when applied to complex experimental designs, cannot be handled by introductory methods such as t-tests. The exact scale of the protein expression used in the model can make a difference in the sensitivity. There is usually a large technical variation introduced by the act of "measurement" in any "omics" study. Randomization of measurement order will eliminate the bias, but it is still extremely important to "normalize" or mathematically calibrate the measurement. This is a highly technical matter but can be viewed as similar to mathematically resetting a scale to zero before each measurement. We use a statistically based method called "quantile normalization" [25], which was the result of considerable research on genomic data. Because "omics" measures of expression are usually on an arbitrary scale, it is best to evaluate ratios or their equivalent differences on the log scale. Log base 2 is chosen because a unit difference on the log scale is equivalent to a two-fold change.
[Figure caption, beginning lost: ... Table 4. For secretogranin-2 and IGFBP7, the fold-changes from both MRM peptides were also averaged. Statistical analysis was performed by ANOVA models using PROC MIXED in SAS. P < .05.]
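As a worked illustration of the log2 convention, the short Python sketch below converts a difference between two group means on the log2 scale into a fold-change. The function name and the signed convention (negative values denoting down-regulation, in the spirit of the −1.099 fold-change reported for lysozyme) are assumptions made for this example, not a description of the software used in the study.

```python
import math


def signed_fold_change(mean_log2_case, mean_log2_control):
    """Convert a difference of log2 group means into a signed fold-change.

    Values of 1 or more mean higher in cases; negative values mean lower in
    cases (for example, -1.47 reads as "1.47-fold down"). This sign
    convention is an assumption made for illustration.
    """
    ratio = 2.0 ** (mean_log2_case - mean_log2_control)
    return ratio if ratio >= 1.0 else -1.0 / ratio


# A unit difference on the log2 scale is a two-fold change in either direction.
up = signed_fold_change(11.0, 10.0)
down = signed_fold_change(10.0, 11.0)

# A 30% increase corresponds to a log2 difference of about 0.38.
log2_diff_for_30_percent = math.log2(1.3)
```

On this scale, a 1.3-fold threshold (a 30% increase or decrease) corresponds to an absolute log2 difference of roughly 0.38.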
B. Five Biomarker Candidate Proteins. From the LDA, five candidate proteins whose relative expression levels could be used to precisely discriminate control samples from A-T patient samples were discovered. After reviewing the literature, all of these proteins were found to play some role in either cancer or neurodegenerative processes, or both, which lends support to these proteins being viable biomarkers of A-T.
The first of these five proteins is proenkephalin-A (PENK), which is an opioid neuropeptide precursor, a neuroendocrine hormone, and a cytokine. It is involved in pain perception, modulation of the immune system, anticonvulsant activity, and the neurodegenerative disorder Huntington's disease [27,30]. It is also involved in several cancers, including breast cancer and prolactin-secreting pituitary adenoma [26,28,29]. This protein was found to be 30% over-expressed in A-T samples.
The second protein, isoform 1 of extracellular matrix protein 1 (ECM1), which is 42% over-expressed in A-T samples, is involved in many cancers, including breast, esophageal, laryngeal, thyroid, and lung cancers, and may play a role in angiogenesis [31]. It is mutated in lipoid proteinosis, a dermatological disease in which patients may develop neurological abnormalities such as temporal lobe epilepsy and mental retardation [32].
The third protein, the neuroendocrine prohormone secretogranin 2 (SCG2), has a role in both neurological processes and cancer. SCG2 is over-expressed by 35% in the A-T patients and is involved with the packaging and sorting of peptide hormones and neuropeptides into secretory vesicles. One of its gene products promotes neuroprotection and neuronal plasticity in mice and humans [38]. Secretogranin 2 has also been suggested to be involved in neuroendocrine tumors and amyotrophic lateral sclerosis (a neurodegenerative disorder) [39,40].
The fourth protein, isoform 1 of CD166 antigen (ALCAM), which has a role in cancer and neurological disorders [33][34][35][36][37], was found to be 52% over-expressed in the A-T samples. It is involved in neurite extension by neurons in chickens [35] and in the neurodegenerative disorder multiple sclerosis [34]. In addition, CD166 plays a role in many cancers, including melanoma, prostate, colorectal, pancreas, and breast [33,36,37].
The final protein, insulin-like growth factor binding protein 7 (IGFBP7), is downregulated 46% in A-T compared to control samples. It plays a role in regulating proliferation, differentiation, and apoptosis. Additionally, it is involved in several types of cancers, including colorectal and inflammatory breast cancers [41,42].
C. Other Priority 1 Proteins. The remaining eight significantly changed proteins in the Priority 1 group are SPARC (secreted protein acidic and rich in cysteine), neurosecretory protein VGF, TPP1 (cDNA FLJ56402, highly similar to tripeptidyl-peptidase 1), neurocan core protein, chromogranin A, cathepsin D, SOD3 (extracellular superoxide dismutase), and ENPP2 (isoform 1 of ectonucleotide pyrophosphatase/phosphodiesterase family member 2) (Table 3). Among these proteins, SPARC, neurosecretory protein VGF, TPP1, and SOD3 are of particular interest because they have been implicated in neurodegenerative processes.
SPARC is a unique matricellular glycoprotein involved in many types of diseases including cancer, inflammation, and neurodegeneration [52][53][54]. Its function is associated with cell development, remodeling, cell turnover, and tissue repair. Our observed downregulation (1.47-fold) of this protein in A-T patients implicates deficiencies associated with these cellular functions in this disease.
Neurosecretory protein VGF has been identified by many proteomic studies [55][56][57]. It plays a role in neuronal communication [56]. This gene is specifically expressed in a subpopulation of neuroendocrine cells and is upregulated by nerve growth factor [57]. However, its exact function remains to be discovered.
TPP1 (cDNA FLJ56402, highly similar to tripeptidyl-peptidase 1), also known as CLN2, is a member of the family of serine-carboxyl proteinases and plays a crucial role in lysosomal protein degradation; a deficiency in this enzyme leads to fatal neurodegenerative disease [58]. It is also involved in telomere protection [59]. Based on its known functions, TPP1 is expected to be down-regulated in A-T patients, which is what we observed (down-regulated 1.44-fold).
SOD3 is an antioxidant enzyme associated with many pathways and diseases. Its association with neurodegeneration has been reported previously in a study of antioxidant gene therapy [60]. SOD3's function is to protect against neurodegeneration. Down-regulation (1.37-fold) of this protein in A-T patients would suggest less of a protective effect by SOD3. Importantly, a large body of evidence suggests that oxidative stress plays some role in the pathophysiology of A-T. As a recent example, one group has shown that ATM can be directly activated by oxidation in the absence of DNA double-strand breaks, implying that ATM may act as a redox sensor capable of regulating cellular responses to oxidative stress as well as genotoxic stress [61].

Conclusion
We identified novel CSF biomarker candidates for A-T from the 13 Priority 1 proteins with significant absolute fold-changes of at least 1.3 (30% increase or decrease) (q < 0.05). LDA was applied to assess the ability of individual proteins and/or combinations of these proteins to correctly classify individuals into the control or disease group. The selectivity and specificity from the LDA were high, suggesting that it is possible to correctly assign individuals to the proper group (control or A-T patient) when the expression levels of these biomarker candidates are accurately measured in the CSF. Findings from our study confirm that the mass spectrometry-based label-free protein quantification and MRM technologies can be used successfully for biomarker discovery and validation. However, limitations of our study require us to interpret the data with caution. First, the current study had a small sample size, and further validation studies with a larger patient cohort are necessary. Second, the fold-changes observed in the study are relatively small, which requires high measurement precision to produce high-quality, clinically valid data. Thus, mass spectrometry-based methods may not be a practical approach for clinical applications. Third, CSF may not be an ideal biospecimen for prognostic applications due to the invasiveness involved in sample collection. Future studies involving serum or plasma samples would make this biomarker discovery strategy even more attractive, with the hope that such less invasive biospecimens can be incorporated into routine clinical practice and utilized in clinical trials for the assessment of potential therapeutic compounds.