Raman Spectroscopy in Colorectal Cancer Diagnostics: Comparison of PCA-LDA and PLS-DA Models

1 State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 No. 4 Hospital, Jinan 250031, China
4 National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China


Introduction
Colorectal cancer has high morbidity and mortality rates and is the third most commonly diagnosed cancer as well as the third leading cause of cancer death for both males and females in the United States [1]. Accurate detection is a crucial first step toward improving the survival rate of patients with colorectal cancer. Currently, colonoscopy and histopathology are the standard screening and diagnostic techniques for colorectal tissues. Although colonoscopic screening has significantly increased the survival rate of patients with colorectal cancer, it remains a challenge to distinguish adenomas and early adenocarcinomas from benign hyperplastic polyps using colonoscopy [2]. This difficulty arises mainly because conventional white light reflectance colonoscopy relies heavily on subjective visual assessment of colorectal polyps [3]. The gold standard for cancer diagnostics is histopathology, which is based on the visual investigation of tissue biopsies. A pathologist can diagnose a sample using specific staining, for example, with hematoxylin and eosin, to highlight the focus. Disadvantages of histopathology include time-consuming sample preparation and the subjectivity of pathologists [4]. There is a clear need for an objective and sensitive technique that can assist clinicians in the differential diagnosis of benign and malignant cysts.
Raman spectroscopy, a vibrational analysis technique, is gaining popularity in cancer diagnostics. This technology investigates molecular vibrations that can be used for functional group identification and compositional analysis. Extensive research has demonstrated that Raman spectroscopy can support gold-standard techniques and substantially improve clinical diagnostics [5][6][7][8][9]. Raman measurements on biopsies can help pathologists identify the tumor margins in a fast and precise way.
Raman spectroscopy has been employed to study human colorectal tissues in vivo or ex vivo to collect spectral information for cancer diagnosis. Since biochemical changes lead to only subtle changes in the Raman spectra, statistical methods are necessary to extract diagnostic information [10,11]. Typical statistical methods can be categorized into supervised and unsupervised approaches. The unsupervised approach relies only on the Raman spectra to make a decision, whereas the supervised approach uses additional information acquired by the gold-standard method. Principal component analysis (PCA), a frequently used unsupervised approach, reduces the number of variables and assesses the data as a first step. Following PCA, a supervised approach such as linear discriminant analysis (LDA), which takes advantage of the PCA scores and the histopathological results, can classify tissues or cells [12][13][14]. A study on mice with colon cancer showed that the PCA-LDA model can correctly discriminate tumors from healthy tissues with an accuracy of 86.8% [15].
Another commonly used supervised approach, partial least-squares-discriminant analysis (PLS-DA), can provide additional group affinity information by encoding class membership as zeros and ones and thus can maximize the variations between groups of samples. PLS-DA rotates the latent variables (LVs) to achieve maximum group separation [16,17]. Thus, the LVs capture the diagnostically relevant variations rather than simply the most prominent differences in the dataset. The PLS-DA model has been employed to analyze colon tissues [3,18]. In the previous study of Bergholt et al., the PLS-DA model was used to diagnose colorectal cancer with an accuracy of 88.8% [3].
Different models can result in different diagnostic performance when employed to analyze the same dataset. Thus, the use of a proper statistical model plays an important role in achieving diagnostic accuracy. Comparing the diagnostic performance of different statistical models in terms of sensitivity and specificity makes it possible to find the optimal model and, based on it, the suitable diagnostic method for the target tissue system. However, studies comparing models for colorectal cancer diagnosis are lacking. This study evaluated and compared two statistical models for colorectal tissue classification. Our aim is to bridge the knowledge gap in identifying the appropriate model for Raman spectroscopy in cancer diagnosis.
Two multivariate statistical methods, PCA-LDA and PLS-DA, were used in combination with leave-one-patient-out cross-validation to establish the discrimination model. This work demonstrates that Raman spectroscopy is a promising tool for the diagnosis of colorectal cancer during clinical examinations and that the PLS-DA model is superior in detecting spectral differences between normal and cancerous colorectal tissues.

Sample Preparation.
The formalin-fixed, paraffin-embedded colorectal tissues were retrieved from the Jinan No. 4 Hospital in accordance with the regulations of its ethics committee. The Jinan No. 4 Hospital has approved this study. Normal regions were taken from outside the tumor areas in tissue obtained during surgery. The paraffin-embedded tissues were cut into 10 μm thick sections. Each section was put on a glass microscope slide and stained with hematoxylin and eosin for histopathological diagnosis of the suspected area. The adjacent section was placed on a glass slide without being stained for Raman spectroscopy analysis [12,19]. The histological analysis was conducted by professional medical doctors who are board-certified pathologists.

Raman Spectroscopy.
Raman spectra were acquired with a 10 s integration time in the spectral range of 400-4000 cm⁻¹ using a Raman system (Horiba JY HR Evolution, France) equipped with an Olympus BXFM open space optical microscope and a charge-coupled device (CCD) detector. A 532 nm laser was focused through a 100x objective (NA = 0.9, WD = 0.21 mm) to excite the samples. The laser power on the sample was about 1.33 mW. The 520.7 cm⁻¹ band of a silicon wafer was used for calibration. The spectral resolution was about 0.65 cm⁻¹, and the wavenumber accuracy was ±0.03 cm⁻¹. The normal spectra were acquired from healthy regions outside the tumor areas in tissue.

Data Processing and Multivariate Data Analysis.
A linear baseline correction was applied to the Raman spectra using Labspec6 software (Horiba JY). About 10 spectra were collected for each tissue and then averaged. Mean-centering was carried out prior to multivariate statistical analysis to remove common variance from the colorectal tissue Raman spectra dataset.
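The averaging and mean-centering steps can be illustrated with a minimal Python sketch. It is not the authors' Labspec6 or Matlab workflow; the function names `average_spectra` and `mean_center` and the toy intensity values are hypothetical, chosen only to show the arithmetic.

```python
from statistics import fmean

def average_spectra(spectra):
    """Average equal-length spectra point by point (one value per channel)."""
    return [fmean(channel) for channel in zip(*spectra)]

def mean_center(dataset):
    """Subtract the dataset-wide mean spectrum from every spectrum."""
    mean_spectrum = average_spectra(dataset)
    return [[value - mean for value, mean in zip(spectrum, mean_spectrum)]
            for spectrum in dataset]

# Hypothetical toy data: two tissues, three replicate spectra each,
# four wavenumber channels per spectrum.
tissue_a = [[1.0, 2.0, 3.0, 4.0], [1.2, 2.2, 3.2, 4.2], [0.8, 1.8, 2.8, 3.8]]
tissue_b = [[2.0, 2.0, 5.0, 4.0], [2.2, 2.2, 5.2, 4.2], [1.8, 1.8, 4.8, 3.8]]

# One averaged spectrum per tissue, then mean-centering across tissues.
dataset = [average_spectra(tissue_a), average_spectra(tissue_b)]
centered = mean_center(dataset)
print(centered)
```

After centering, every channel sums to zero across the dataset, so the subsequent decomposition describes variation around the mean spectrum rather than the mean spectrum itself.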
PCA-LDA and PLS-DA methods were applied for discriminant analysis. Leave-one-patient-out cross-validation was used to validate and optimize the PLS-DA model. Distinct molecular features of the colorectal tissues were extracted and visualized through loadings and scores. Differences between the PCA/PLS scores of normal and cancerous tissues were considered statistically significant at a p value of less than 0.05.
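Leave-one-patient-out cross-validation holds out all spectra of one patient at a time, so that no patient contributes to both training and testing within a fold. The following Python sketch shows only the splitting logic; the function name `leave_one_patient_out` and the patient labels are hypothetical and are not taken from the study.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices) per patient."""
    for held_out in sorted(set(patient_ids)):
        test = [i for i, pid in enumerate(patient_ids) if pid == held_out]
        train = [i for i, pid in enumerate(patient_ids) if pid != held_out]
        yield held_out, train, test

# Hypothetical patient label for each spectrum in the dataset.
patient_ids = ["P1", "P1", "P2", "P3", "P3", "P3"]

folds = list(leave_one_patient_out(patient_ids))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

A model would be fitted on the `train` indices and evaluated on the `test` indices of each fold, and the per-fold predictions pooled to estimate sensitivity and specificity.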
The PCA-LDA statistical analysis was performed using in-house written scripts. The PLS-DA statistical analysis was carried out with PLS toolbox (Eigenvector Research, Wenatchee, US). All statistical analyses were carried out in the Matlab programming environment (Mathworks Inc., Natick, US).
General Raman-active tissue components were comparable among colorectal tissues, and subtle but highly molecule-specific variations in peak position and intensity were observed. Prominent Raman bands were observed for normal and cancerous colorectal tissue at about 1063 (lipids/collagen), 1134 (fatty acids and proteins), 1174 (L-tryptophan), 1297 (lipids and phospholipids), 1414 (lipids), 1442 (fatty acids and triglycerides), 1461 (lipids/proteins), 2847 (fatty acids and triglycerides), 2879 (lipids), and 2927 cm⁻¹ (proteins and lipids). The difference between normal and cancerous tissues reflects the molecular changes in the tissue associated with the dysplastic progression (Figure 1(b)). For instance, the peak intensities at 1134 and 1297 cm⁻¹ increased significantly in cancerous tissue relative to normal tissue, suggesting a higher amount of lipid material (Figure S1). However, these subtle variations alone made it hard to differentiate the two types of tissue. This motivated further studies of PCA-LDA and PLS-DA to analyze the suitability of each in colorectal cancer diagnosis.

PCA-LDA Analysis of Raman Spectra.
To reduce the dimension and complexity of the biological dataset, we performed PCA-LDA on normal and cancerous colorectal tissues in the spectral range of 400-4000 cm⁻¹. PCA modeling can extract the most fundamental features, resolving highly specific biomolecular information. Figure 2 shows that the first two principal components accounted for 98% (PC1, 82.7%; PC2, 15.3%) of the total Raman variations around the major Raman peak positions. These two components alone contributed to the most characteristic vibrational frequencies. They are dominated by the vibrational features of fatty acids, lipids, proteins, and nucleic acids from the colorectal tissues (Figure 3). The PC1 loading contained Raman peaks for fatty acids (1134 cm⁻¹); proteins (1174 cm⁻¹ from L-tryptophan and 1461 cm⁻¹ from C-H wagging); and lipids and fatty acids (1442 cm⁻¹ [...] symmetric CH2 stretching, and 2879 cm⁻¹ from asymmetric CH2 stretching). The loading on PC2 captured Raman peaks similar to those in the PC1 loading, reflecting the main components of the Raman spectra. The LDA classification was validated with the leave-one-patient-out cross-validation approach. The PCA-LDA model resulted in a sensitivity of 72.8% and a specificity of 85.9%, which yielded a diagnostic accuracy of 79.2%.
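The three figures of merit follow from the standard confusion-matrix definitions. The Python sketch below is illustrative and is not the authors' in-house script. The counts are assumptions: the group sizes (81 cancerous, 78 normal) come from the description of Figure 1(a), while the numbers of correct calls (59 and 67) are inferred here because they reproduce the reported percentages; the paper does not state them.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)            # cancerous spectra called cancerous
    specificity = tn / (tn + fp)            # normal spectra called normal
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Assumed counts (inferred, not reported): 59 of 81 cancerous and
# 67 of 78 normal spectra classified correctly.
sens, spec, acc = diagnostic_metrics(tp=59, fn=22, tn=67, fp=11)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
```

Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two when both classes are present.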

Comparison of PCA-LDA and PLS-DA.
Figure 4 shows box charts of the significant PC and LV scores to visualize different degrees of diagnostic efficiency. The PCA scores show the classification comparisons between normal and cancerous tissues through PC1 and PC2. Compared with the PCs and the other LVs, LV2 (p = 1.16 × 10⁻⁷) shows the greatest efficacy in distinguishing colorectal cancer. Analysis of the LV2 scores showed that increased protein and nuclear contents occurred during the neoplastic progression in the colon tissue, indicating the elevated number of cells associated with cancerous development. Channelling this increased biomolecule biosynthesis is absolutely required for tumorigenic transformation [27]. The cancerization could induce lipid and nucleic acid changes, which are reflected in the Raman spectra and further analysis. Human cancer cells express high levels of lipogenic enzymes to meet the great demand for lipid synthesis [28]. Meanwhile, changes in the levels of nucleic acids were associated with tumor burden and malignant progression [29]. Receiver operating characteristic (ROC) curves (Figure 5) were also generated from the spectral datasets to further evaluate the separation. The integrated area under the ROC curve of the PLS-DA model was 0.856, while that of the PCA-LDA model was 0.696, substantiating the efficiency of using the Raman technique with PLS-DA for diagnosing cancerous colorectal tissues. Table S2 shows that the PLS-DA model combined with Raman spectroscopy had better diagnostic performance than the PCA-LDA model. Previous studies reported that the PLS-DA model provided a diagnostic sensitivity of 90.9% and specificity of 83.3% for differentiating adenomas from hyperplastic polyps [18]. The PCA-LDA model could distinguish cancer from normal colon tissues in mice with a diagnostic accuracy of 86.8% [15]. The diagnostic results from the literature indicated that the PLS-DA model resulted in better diagnostic accuracy than the PCA-LDA model, even in different tissue systems.
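The area under the ROC curve can be read as the probability that a randomly chosen cancerous spectrum receives a higher classifier score than a randomly chosen normal one. The Python sketch below computes it that way on hypothetical scores; the function name `roc_auc` and the score values are illustrative and do not reproduce the study's data or its Matlab routines.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive score is
    higher, 0.5 if the scores tie, and 0 otherwise.
    """
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical classifier scores (higher means "more likely cancerous").
cancer_scores = [0.9, 0.8, 0.6, 0.4]
normal_scores = [0.7, 0.5, 0.3, 0.2]

auc = roc_auc(cancer_scores, normal_scores)
print(auc)  # 13 of 16 pairs are ranked correctly -> 0.8125
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.856 and 0.696 are compared.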
In the PLS-DA model, the diagnostic specificity (91.0%) in our study was higher than that (83.3%) in the previous study, while the sensitivity (77.7%) in our study was lower than that (90.9%) in the previous study. The diagnostic accuracy (79.2%) of the PCA-LDA model in our study was lower than that (86.8%) in the mouse colon tissue study. These differences may be attributed to differences in Raman spectra collection and sample preparation. The diagnostic results of our study and of the previous studies may be acceptable for clinical application.
PCA is a classical technique for dimensionality reduction. This method identifies several principal directions with high variance, which is especially useful for high-dimensional data X with a small number of samples and a large number of features, such as Raman spectra. By projecting the original data X onto these directions, much of the information in X is retained by just a small number of new projected variables (i.e., LVs) [30]. However, PCA involves only one set of data. PLS, on the other hand, realizes dimensionality reduction by considering the relations between two data blocks (X and Y) across the same samples. PLS maximizes the covariance between X and Y, which balances the requirement to explain as much variance as possible against the correlated relationships between X and Y [31]. Thus, the three LVs identified using PLS captured not only the high variance in the spectral dataset but also the relationship between the spectral dataset and the sample class. In this experiment, the accuracy of PLS-DA modeling (84.3%) was about five percentage points higher than that of PCA-LDA (79.2%). Thus, of the two, the PLS-DA model is preferable for discriminating cancerous from normal tissues following the strategies in this study.
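The contrast between the two projections can be made concrete on a toy example. The Python sketch below is an illustration under stated assumptions, not the study's analysis: the two-feature dataset is invented so that feature 0 carries most of the variance while only feature 1 separates the classes, and the helper names (`first_pc`, `first_pls_weight`) are hypothetical. It uses the standard facts that the first principal direction is the leading eigenvector of XᵀX (found here by power iteration) and that the first PLS weight vector for a single response is proportional to Xᵀy, with X and y centered.

```python
from math import sqrt

def center_rows(rows):
    """Subtract the column means from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def normalize(vec):
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def first_pc(rows, iterations=50):
    """Leading principal direction by power iteration on X^T X."""
    x = center_rows(rows)
    w = normalize([1.0] * len(x[0]))
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, w)) for row in x]  # t = X w
        w = normalize([sum(t * row[j] for t, row in zip(scores, x))
                       for j in range(len(w))])                     # w = X^T t
    return w

def first_pls_weight(rows, labels):
    """First PLS weight vector, proportional to X^T y."""
    x = center_rows(rows)
    y_mean = sum(labels) / len(labels)
    y = [label - y_mean for label in labels]
    return normalize([sum(yi * row[j] for yi, row in zip(y, x))
                      for j in range(len(x[0]))])

# Invented data: feature 0 varies widely within both classes,
# feature 1 varies little but differs between classes (0 vs 1).
X = [[-4, -1], [4, -1], [-2, -1], [2, -1],
     [-4, 1], [4, 1], [-2, 1], [2, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

pc1 = first_pc(X)
pls1 = first_pls_weight(X, y)
print("first PC direction:", pc1)
print("first PLS weight:", pls1)
```

On these data the first principal direction aligns with the high-variance feature 0, whereas the first PLS weight aligns with the class-separating feature 1, which is the behavior the paragraph above attributes to each method.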

Conclusions
In summary, Raman spectroscopy was applied as a sensitive diagnostic alternative for identifying pathologic changes (e.g., dysplasia) in colon tissue at the molecular level, using an optimized multivariate data analysis model. In a side-by-side comparison of PCA-LDA and PLS-DA with respect to the characterization of molecular profiles (e.g., proteins, lipids, and nucleic acids) of normal and cancerous colorectal tissues, the PLS-DA model was found to be the superior choice. The subtle Raman variations between normal and cancerous colorectal tissues are associated with cancerous tissue transformation. Confocal Raman spectroscopy is a promising technique in the diagnosis and characterization of colorectal cancer.
Figure 1(a) shows the averaged Raman spectra of normal (n = 78) and cancerous (n = 81) colorectal tissues, and the peak assignments are listed in

Figure 2: Percent variance captured and classification error as a function of model complexity (i.e., the number of retained PCs and LVs) using PCA-LDA and PLS-DA together with leave-one-patient-out cross-validation.

Figure 3: Loading plot of PCA components (PC) and PLS components (LV) in each model calculated from the Raman spectra of colorectal tissues. Each loading is shifted vertically for better visualization. The broken interval (-//-) indicates the region of ∼2000-2500 cm⁻¹, which does not contain much tissue-related biochemical information.

Figure 4: Box charts of the two PCA component (PC) scores and three significant PLS component (LV) scores calculated from the Raman dataset for normal and cancerous tissue types: (a) PC1 score, (b) PC2 score, (c) LV1 score, (d) LV2 score, and (e) LV3 score. The line within each box represents the median, while the lower and upper boundaries of the box indicate the first (25th percentile) and the third (75th percentile) quartiles, respectively. Error bars (whiskers) represent the 1.5-fold interquartile range.