To facilitate the enhanced reliability of Raman-based tumor detection and analytical methodologies, an

Breast cancer is the most common cancer experienced by women worldwide [

Currently, breast cancer screening is conducted principally with a triple assessment using imaging examination that integrates X-ray mammography and ultrasound, clinical tests, and histological assessment [

Raman microspectroscopy allows a qualitative and quantitative analysis of the chemical nature of biological samples which requires minimal sample preparation and does not require a staining process. After years of development, it has been widely accepted by clinicians and research communities for the early diagnosis of cancer, the identification of cancer progression, and intraoperative guidance [

In the present study, we characterized the spectral variations in healthy (H), DCIS, and IDC tissues to identify the spectral features caused by cancer progression and to facilitate the development of Raman-based tumor detection algorithms. Two multivariate analysis models, principal component analysis (PCA) followed by linear discriminant analysis (LDA) and PCA followed by support vector machine (SVM) analysis, were then used to analyze and classify the Raman spectra of the three tissue types. Following a comparison of the performance of the PCA-LDA and PCA-SVM models, an effective algorithm was verified, helping to bridge the knowledge gap in identifying the appropriate model for Raman spectroscopy in breast cancer diagnosis.

A total of twelve healthy breast samples, which contain both collagenous and adipose tissue, from four female patients were purchased from Alenabio (Xi’an, Shaanxi, China), and biopsies were performed using protocols approved by the IRB (Institutional Review Board) and the HIPAA (Health Insurance Portability and Accountability Act). It was additionally approved as commercial product development. IDC (

Immediately after lesion excision, the samples were embedded in optimal cutting temperature medium (OCT, Surgipath® FSC 22®, Leica Biosystems, USA) and frozen in liquid nitrogen for better preservation of native morphology. 12

The equipment used for Raman spectroscopy has been described in detail previously [^{−1} grating with a back-illuminated deep-depletion charge-coupled device camera (Du401A-BR-DD-352, Andor Technology, UK) at a resolution of approximately 3 cm^{−1}.

WITec Project FOUR software (WITec GmbH, Germany) was used to preprocess all datasets, including band range selection, cosmic ray removal, background subtraction, and spectral smoothing, with the same parameters in each case. Background subtraction was performed with a ninth-order polynomial fit, and fifth-order Savitzky–Golay smoothing was applied for noise removal. All Raman spectra were normalized using an area-under-the-curve method over the ranges 600–1800 cm^{−1} and 2800–3000 cm^{−1} to minimize the effects of sample and instrument variability.
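The area-under-the-curve normalization can be illustrated with a short NumPy sketch. It is not the WITec implementation: it assumes trapezoidal integration, assumes that the areas of the two windows are summed into a single normalization constant, and uses a synthetic spectrum; the function names are hypothetical.

```python
import numpy as np

def trapezoid_area(x, y):
    """Area under y(x) by the trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def window_area(wavenumbers, intensities, windows):
    """Summed area over the given (low, high) wavenumber windows."""
    total = 0.0
    for low, high in windows:
        mask = (wavenumbers >= low) & (wavenumbers <= high)
        total += trapezoid_area(wavenumbers[mask], intensities[mask])
    return total

def normalize_by_area(wavenumbers, intensities,
                      windows=((600.0, 1800.0), (2800.0, 3000.0))):
    """Scale a spectrum so that its summed window area equals 1."""
    total = window_area(wavenumbers, intensities, windows)
    if total <= 0:
        raise ValueError("non-positive area; check the background subtraction")
    return intensities / total

# Synthetic spectrum (assumption): flat offset plus one band near 1450 cm^-1.
WINDOWS = ((600.0, 1800.0), (2800.0, 3000.0))
wavenumbers = np.arange(600.0, 3001.0, 2.0)
spectrum = 5.0 + np.exp(-((wavenumbers - 1450.0) / 15.0) ** 2)
normalized = normalize_by_area(wavenumbers, spectrum, WINDOWS)
normalized_area = window_area(wavenumbers, normalized, WINDOWS)
```

After this step every spectrum has the same summed area over the two analysis windows, so differences in overall signal level between samples or sessions no longer dominate the comparison.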

The spectral datasets were mean-centered and then used to conduct additional analysis. PCA was used to simplify complexity and identify key variables in the multidimensional datasets [

Based on Vapnik–Chervonenkis (VC) theory and the principle of structural risk minimization, an SVM algorithm was also adopted, using PC scores as input variables to construct a PCA-SVM model. In the present study, three kernel types were tested in the PCA-SVM model, namely, a linear kernel, a polynomial kernel, and a Gaussian radial basis function (RBF) kernel. All acquired spectral data were divided into a training set (80%) and a testing set (20%). To obtain the model with the best performance, a grid search combined with 10-fold cross-validation was employed to determine the most appropriate combination of parameters for each kernel. Finally, those parameters and the trained algorithms were used to construct the final PCA-SVM model and identify the unknown spectra. All statistical analyses were performed using Matlab R2015b software (Mathworks, Inc., Natick, MA, USA).
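The 80/20 split and the grid search with 10-fold cross-validation can be sketched in plain Python as follows. This is an illustration, not the Matlab code used in the study. The parameter names `C` and `gamma` are assumptions (the usual RBF-SVM parameters), and `toy_score` is a hypothetical stand-in for "train an SVM on the training fold and return its validation accuracy". The exponent range of 2^{−5} to 2^{5} follows the range quoted later in the text.

```python
import itertools
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def grid_search(train_indices, param_grid, score_fn, k=10):
    """Return (best_params, best_mean_score) over all grid combinations.

    score_fn(params, train, validation) must return an accuracy-like score.
    """
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, train, validation)
                  for train, validation in k_fold_indices(train_indices, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_score(params, train, validation):
    """Hypothetical scorer that peaks at C = 4, gamma = 0.5 (illustration only)."""
    return 1.0 / (1.0 + abs(params["C"] - 4.0) + abs(params["gamma"] - 0.5))

train_idx, test_idx = train_test_split_indices(150)
grid = {"C": [2.0 ** e for e in range(-5, 6)],
        "gamma": [2.0 ** e for e in range(-5, 6)]}
best_params, best_score = grid_search(train_idx, grid, toy_score)
```

Only the training indices enter the grid search; the held-out 20% is reserved for the final evaluation, so the reported test accuracy is not influenced by parameter tuning.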

Using H&E-stained tissue sections, significant morphological differences were observed among the H, DCIS, and IDC tissues, representing pathological progression (Figure

H&E-stained images of healthy breast tissue (a), ductal carcinoma

As shown in Figure ^{−1} (C-C stretching, collagen) [^{−1} (C-C stretching, lipid) [^{−1} (CH_{2} twisting, wagging, phospholipids) [^{−1} (CH_{2} deformation) [^{−1} (C=C lipid stretching) [^{−1} (T and G in nucleic acids) [^{−1} (symmetric ring breathing in tryptophan, in protein) [^{−1} (C=C stretching in tryptophan) [^{−1} (C=C stretching in phenylalanine) [^{−1} (lipids) and 2854 cm^{−1} (CH_{2} symmetric stretch, lipids) [^{−1} (CH stretching, lipids and proteins) [^{−1} (CH_{2} anti-symmetric stretching in lipids) [^{−1} (CH_{2} wagging, C-N stretching, amide III of collagen) [^{−1} (CH_{2} asymmetric stretch in lipids and proteins) [

(a) The mean ± standard deviations (SD) of normalized spectra in H, DCIS, and IDC tissues; shading area represents standard deviations. (b) The differential spectra calculated from the normalized Raman spectra among different tissues.

To better identify the underlying compositional information for the different stages of breast cancer invasion, differential spectra were calculated by subtracting the acquired featured spectra from each tissue type, as shown in Figure ^{−1} (phenylalanine) [^{−1} (lipids and proteins) [^{−1} (lipids) and 1524 cm^{−1} (carotenoids), indicating that phenylalanine content increased while lipid and carotenoid levels decreased during the evolution of cancer from healthy tissue to DCIS.

Meanwhile, in the subtractive spectra for IDC and H tissues, positive peaks were observed at 669 cm^{−1} (nucleic acids), 754, 1243, 1552, and 1608 cm^{−1} (protein), and 2934 cm^{−1} (lipids and proteins), while there were negative peaks at 1302, 1450, 1654, 2854, and 2900 cm^{−1} (lipids) [^{−1} (carotenoids) [^{−1} (nucleic acids), 754 cm^{−1}, 1243, and 1552 cm^{−1} (protein), 1608 cm^{−1} (phenylalanine), and 2934 cm^{−1} (lipids and protein), and positive peaks at 1302, 1450, 2854, and 2900 cm^{−1} (lipids). This suggests that the nucleic acid, protein, and phenylalanine levels were higher in IDC than in DCIS, but lipid levels in IDC were lower than in DCIS.
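A difference spectrum of this kind can be computed as the mean spectrum of one group minus that of another. The NumPy sketch below uses synthetic spectra with invented band amplitudes, so the numbers carry no biological meaning; the helper names are hypothetical, and reading the sign at the nearest sampled wavenumber is an assumption about how peaks are inspected.

```python
import numpy as np

def gaussian_band(wavenumbers, center, width=10.0):
    """Synthetic Raman band centered at `center` (cm^-1)."""
    return np.exp(-((wavenumbers - center) / width) ** 2)

def difference_spectrum(group_a, group_b):
    """Mean spectrum of group_a minus mean spectrum of group_b (rows = spectra)."""
    return group_a.mean(axis=0) - group_b.mean(axis=0)

def band_sign(wavenumbers, diff, band):
    """Report whether the difference spectrum is positive or negative at a band."""
    idx = int(np.argmin(np.abs(wavenumbers - band)))
    return "positive" if diff[idx] > 0 else "negative"

rng = np.random.default_rng(0)
wavenumbers = np.arange(600.0, 3001.0, 2.0)

def synthetic_group(lipid, protein, n_spectra=5):
    """Invented spectra: lipid band at 1450 cm^-1, protein band at 754 cm^-1."""
    clean = (lipid * gaussian_band(wavenumbers, 1450.0)
             + protein * gaussian_band(wavenumbers, 754.0))
    return clean + 0.01 * rng.standard_normal((n_spectra, wavenumbers.size))

healthy = synthetic_group(lipid=1.0, protein=0.2)
idc = synthetic_group(lipid=0.5, protein=0.6)

diff = difference_spectrum(idc, healthy)
sign_754 = band_sign(wavenumbers, diff, 754.0)
sign_1450 = band_sign(wavenumbers, diff, 1450.0)
```

A positive value at a band means the first group has the higher mean intensity there, which is how the positive and negative peaks above are read.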

A ratio plot of Raman intensity of relevant specific wavenumbers is depicted in Figure ^{−1}/1267 cm^{−1} and 1654 cm^{−1}/1450 cm^{−1} can be used to evaluate levels of saturated and unsaturated lipids of breast tissue ^{−1}/1450 cm^{−1} and 754 cm^{−1}/1450 cm^{−1} indicates the change in proteins, nucleic acid, and lipid content as cancer progresses. The level of saturated lipids (Figure

Comparisons among relative intensity ratios of the selected Raman bands with the corresponding tentative biochemical assignments of the tissue samples. All data are represented as mean ± standard deviation values. (a) Ratio for saturated lipid. (b) Ratio for unsaturated lipid. (c) Ratio for nucleic acid to lipid. (d) Ratio for protein to lipid.
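The band ratios summarized in this caption can be computed per spectrum and then reported as mean ± standard deviation. The sketch below assumes that each ratio uses the intensity at the sampled wavenumber nearest to the band (not an integrated peak area) and that the sample standard deviation is reported; neither detail is stated in the text. The data are synthetic.

```python
import numpy as np

# Band pairs (numerator, denominator) in cm^-1, as named in the text.
RATIOS = {
    "saturated lipid": (1302.0, 1267.0),
    "unsaturated lipid": (1654.0, 1450.0),
    "nucleic acid / lipid": (669.0, 1450.0),
    "protein / lipid": (754.0, 1450.0),
}

def intensity_at(wavenumbers, spectra, band):
    """Intensity of every spectrum at the sampled wavenumber nearest to `band`."""
    idx = int(np.argmin(np.abs(wavenumbers - band)))
    return spectra[:, idx]

def ratio_stats(wavenumbers, spectra):
    """Return {ratio name: (mean, sample SD)} over the rows of `spectra`."""
    stats = {}
    for name, (numerator, denominator) in RATIOS.items():
        ratio = (intensity_at(wavenumbers, spectra, numerator)
                 / intensity_at(wavenumbers, spectra, denominator))
        stats[name] = (float(ratio.mean()), float(ratio.std(ddof=1)))
    return stats

# Synthetic data (assumption): three flat spectra with two bands set by hand.
wavenumbers = np.arange(600.0, 1801.0, 1.0)
spectra = np.ones((3, wavenumbers.size))
spectra[:, wavenumbers == 1450.0] = 4.0
spectra[:, wavenumbers == 669.0] = np.array([[1.0], [2.0], [3.0]])

stats = ratio_stats(wavenumbers, spectra)
```

Running the same function on the spectra of each tissue group gives the values that a ratio plot of this kind would compare.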

To identify the important variations within the acquired Raman data, multivariate analysis was conducted to distinguish spectral features characteristic of the different tissues. Raman spectra of the low-wavenumber region (600–1800 cm^{−1}) and high-wavenumber region (2800–3000 cm^{−1}) were obtained from the three tissue sample types and categorized by PCA to obtain corresponding PC scores and loading values. The first PC accounted for the largest variance within the spectral dataset (PC1, 89.6%), while PC2 and PC3 represented 5.3% and 0.7% of the total variance, respectively. In order to visualize the spectral distribution of different tissues, Figure ^{−1}) and lipid components (1076, 1302, 1450, 1654, 2854, and 2900 cm^{−1}); it can be seen that the PC1 contained more lipids and carotenoids. Compared to the loading of PC1 with the single spectrum of Figure

(a) A scatter plot of the first three principal components acquired from the dataset consisting of all the collected spectra from three tissue types. (b) The corresponding PCA loading spectra of PC1, PC2, and PC3.

For the positive peaks of PC2, the corresponding loading can be assigned to biochemical components such as nucleic acids at 669 cm^{−1}, tryptophan at 754 and 1552 cm^{−1}, phenylalanine at 1608 cm^{−1}, collagen at 1243 cm^{−1}, and lipids at 2934 cm^{−1}, while negative peaks can be attributed to lipids at 1076, 2854, and 2900 cm^{−1}. Comparing the PC2 loading with the characteristic spectra of the three tissues, the positive peaks at 669, 754, 1243, 1552, and 1608 cm^{−1} were prominent in IDC tissues, while the positive peak at 2934 cm^{−1} was observed in all three tissue types. Therefore, the positive features extracted by PC2 were principally derived from IDC tissue, while the negative features were mainly contributed by DCIS.

The loading of PC3 was evenly distributed on both sides of the zero line, but principal characteristic information in the negative loading appeared at 669, 754, 1552, 1608, 2854, and 2900 cm^{−1}, while positive peaks were observed at 1267, 1302, 1450, and 1654 cm^{−1}, peaks representing the spectral contribution of nucleic acids, proteins, and lipids. Component PC3 was rather noisy, displaying a mixture of spectral characteristics of both PC1 and PC2.
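The scores, loadings, and explained-variance fractions discussed above all come from one decomposition of the mean-centered spectra. The sketch below computes PCA through a singular value decomposition in NumPy. It is not the Matlab routine used in the study, and the synthetic data (two invented spectral patterns plus noise) are an assumption for illustration.

```python
import numpy as np

def pca(spectra, n_components=3):
    """PCA of mean-centered spectra (rows = spectra) via SVD.

    Returns scores (n_spectra x n_components), loadings (n_components x
    n_channels), and the fraction of total variance explained by each PC.
    """
    centered = spectra - spectra.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    return scores, loadings, explained[:n_components]

# Synthetic dataset (assumption): one dominant pattern, one weaker, plus noise.
rng = np.random.default_rng(1)
axis = np.linspace(0.0, 1.0, 200)
pattern_1 = np.sin(2 * np.pi * axis)
pattern_2 = np.cos(6 * np.pi * axis)
spectra = (5.0 * rng.standard_normal((60, 1)) * pattern_1
           + 1.0 * rng.standard_normal((60, 1)) * pattern_2
           + 0.05 * rng.standard_normal((60, 200)))

scores, loadings, explained = pca(spectra)
```

Each row of `loadings` plays the role of a loading spectrum: positive and negative peaks mark the bands that raise or lower the corresponding score, which is how the band assignments for PC1 to PC3 are read.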

All three significant PCs were loaded into the LDA model to develop an effective breast tissue diagnostic model. Figure ^{−1}) and high-wavenumber range (2800–3000 cm^{−1}) from the investigated tissue types obtained by the PCA-LDA algorithm. The LDA scatter plot separates the spectra of the three tissue types: the zero line of the first discriminant function distinguishes the spectral features of the H group from those of cancerous tissue. The spectra of the H group were all distributed on the negative axis, while those of cancerous tissue appeared on the positive axis. The spectra of the DCIS group were all located on the negative axis of the second discriminant function, and the spectra of the IDC group on the positive axis. Thus, the zero line of the second discriminant function can separate the IDC group from the DCIS group. The posterior probabilities of the H, DCIS, and IDC groups were also calculated and are shown as a two-dimensional ternary scatter plot in Figure ^{−1}) and the high-wavenumber region (2800–3000 cm^{−1}), the classification of three different types of breast tissues using PCA-LDA diagnostic model is shown in Figures

The scatter plot of linear discriminant scores for three types of tissue.

A two-dimensional ternary plot of the posterior probabilities belonging to the investigated H, DCIS, and IDC samples calculated from the acquired dataset consisting of all the collected spectra from three tissue types, using the PCA-LDA discriminant model combined with LOOCV method.
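The posterior probabilities and the leave-one-out cross-validation (LOOCV) named in this caption can be illustrated with a minimal NumPy sketch. It assumes a Gaussian LDA with a pooled covariance matrix and class priors estimated from each training fold; the exact LDA settings of the study are not given here. The "PC scores" are synthetic, and the function names are hypothetical.

```python
import numpy as np

def lda_fit(scores, labels):
    """Fit Gaussian LDA: classes, class means, inverse pooled covariance, priors."""
    classes = np.unique(labels)
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    pooled = np.zeros((scores.shape[1], scores.shape[1]))
    for c, mean in zip(classes, means):
        centered = scores[labels == c] - mean
        pooled += centered.T @ centered
    pooled /= len(labels) - len(classes)
    priors = np.array([np.mean(labels == c) for c in classes])
    return classes, means, np.linalg.pinv(pooled), priors

def lda_posteriors(model, scores):
    """Posterior probability of each class for each row of `scores`."""
    _, means, inv_cov, priors = model
    discriminant = (scores @ inv_cov @ means.T
                    - 0.5 * np.sum(means @ inv_cov * means, axis=1)
                    + np.log(priors))
    discriminant -= discriminant.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(discriminant)
    return prob / prob.sum(axis=1, keepdims=True)

def loocv_accuracy(scores, labels):
    """Refit on all samples but one, classify the held-out sample, and repeat."""
    hits = 0
    for i in range(len(labels)):
        keep = np.arange(len(labels)) != i
        model = lda_fit(scores[keep], labels[keep])
        posterior = lda_posteriors(model, scores[i:i + 1])
        hits += int(model[0][np.argmax(posterior)] == labels[i])
    return hits / len(labels)

# Synthetic, well-separated "PC scores" for three tissue labels (assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
pc_scores = np.vstack([c + 0.5 * rng.standard_normal((20, 3)) for c in centers])
tissue = np.repeat(np.array(["H", "DCIS", "IDC"]), 20)

accuracy = loocv_accuracy(pc_scores, tissue)
posteriors = lda_posteriors(lda_fit(pc_scores, tissue), pc_scores)
```

Each row of `posteriors` sums to one, so the three probabilities of a spectrum can be placed directly on a ternary plot like the one described in the caption.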

To optimize classification performance, SVM with three kernel functions (linear, polynomial, and RBF) was also implemented. PC1 and PC2 scores were used as input variables in the SVM model for visual classification. The optimal parameters of each kernel type were determined from the training set using a grid search combined with cross-validation. To observe the influence of different parameters on classification accuracy during training of the SVM model, three-dimensional surface maps of the parameters and the corresponding classification accuracy were constructed, as shown in Figures ^{−5} to 2^{5}, with a step of power of two. It can be observed that the accuracy of the RBF kernel in the PCA-SVM classification model gradually increased with increasing values for parameters ^{−5} to 2^{5} was selected, with a step of power of two. Figure

(a, b) The 3D map of classification accuracy as a function of parameter

These optimized parameters were used to build the final SVM classification model to classify the spectra in the test set. The classification accuracy of the RBF kernel PCA-SVM model in the test set was 96.7%, while the linear and polynomial kernel PCA-SVM models both achieved 100%. The PCA-SVM diagnostic model was used to classify the test set data of the three breast tissues, as shown in Tables ^{−1}) and high-wavenumber (2800–3000 cm^{−1}) region could be found in Figures

Scatter plots of the PCA-SVM algorithm based on three kernel functions. (a) PCA-SVM with linear kernel, (b) PCA-SVM with polynomial kernel, and (c) PCA-SVM with RBF kernel. Points in different colors represent different tissue types; the background color represents the class domain created by the SVM.
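The three kernels compared in these panels differ only in how they measure similarity between two score vectors. The NumPy sketch below uses the standard textbook forms. The parameter names and default values (`degree`, `gamma`, `coef0`) are assumptions, since the parameterization used in the study is not given here, and the input points are arbitrary.

```python
import numpy as np

def linear_kernel(a, b):
    """k(x, y) = x . y"""
    return a @ b.T

def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    """k(x, y) = (gamma * x . y + coef0) ** degree"""
    return (gamma * (a @ b.T) + coef0) ** degree

def rbf_kernel(a, b, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = (np.sum(a ** 2, axis=1)[:, None]
                        + np.sum(b ** 2, axis=1)[None, :]
                        - 2.0 * (a @ b.T))
    return np.exp(-gamma * np.maximum(squared_distance, 0.0))

# Arbitrary two-dimensional points standing in for (PC1, PC2) scores.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
k_linear = linear_kernel(points, points)
k_poly = polynomial_kernel(points, points)
k_rbf = rbf_kernel(points, points)
```

The linear kernel yields a straight decision boundary in the score plane, whereas the polynomial and RBF kernels allow curved boundaries, which is the difference the background class domains in the three panels illustrate.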

Based on the calculated differential spectra in Figure ^{−1}) was significantly lower in the DCIS and IDC tissues than in the H group. These results indicate that lipid content declined in the DCIS and IDC groups, possibly related to the high rate of cell division and the thinning of the lipid cell membranes in the process of cancer cell invasion and migration [^{−2}) or iron complexes in cancer tissues may also reduce lipid content [^{−1}) was higher in DCIS and IDC tissues than that in the H group, which may be associated with the large quantity of protein synthesized by cancer cells during uncontrolled growth, which in turn leads to an increase in phenylalanine levels [^{−1} (carotenoids) exhibited decreased intensity in DCIS tissue compared with healthy tissue, possibly attributable to the free radical oxidation of carotenoids [^{−1}) and protein (754 cm^{−1}) levels in the IDC group compared with the H group may be associated with the large quantity of protein synthesized by cancer cells during uncontrolled growth [^{−1}), phenylalanine (1608 cm^{−1}), and nucleic acid (669 cm^{−1}) content and lower lipid (1302, 1450, 1654, 2854, and 2900 cm^{−1}) levels in the IDC group compared with the DCIS group, consistent with features of cancer progression.

Ratio plots of saturated lipids (1302 cm^{−1}/1267 cm^{−1}), as displayed in Figure ^{−1}/1450 cm^{−1}) in cancerous tissue was higher than that in healthy tissues, possibly related to lipid peroxidation during the development of breast cancer. As cancer invasion occurred, the ratio of nucleic acid to lipid (669 cm^{−1}/1450 cm^{−1}) and protein to lipid (754 cm^{−1}/1450 cm^{−1}) gradually increased, consistent with changes in nucleic acid and lipid levels for DCIS and IDC in Figure

The above analysis employed only a limited number of Raman peaks for tissue classification; however, many biochemical species are involved in cancer evolution and progression. Therefore, a multivariate statistical analysis method (such as PCA-LDA) that identifies the most significant spectral features from the whole spectrum can improve the diagnostic efficiency of Raman-based tissue analysis and classification. The advantage of PCA-LDA is that the modeler can query the spectral variables using principal components selected in the model to provide a source for classification [

SVM is another multivariate analysis technique that can handle both linearly and nonlinearly separable data. For SVM, the most important step is to choose an appropriate kernel function and parameter optimization strategy, which is critical for developing a robust model. When choosing the kernel function, one should first consider whether the data are linearly separable, and the optimization should maximize the accuracy and minimize the complexity of the model. Compared with other multivariate statistical methods, SVM can handle complex class boundaries by changing the kernel function. In the present study, an SVM model with three traditional kernels was developed, using the PCA algorithm to reduce the dimension of the spectral data, which greatly simplifies the SVM algorithm and improves its performance. The results indicated that the PCA-SVM models with linear and polynomial kernels had the best classification performance, followed by the PCA-LDA method. The performance of the linear and polynomial kernel PCA-SVM models was slightly higher than that of PCA-LDA, possibly due to the use of a hyperplane to separate the classes [

In conclusion, the present study demonstrated that significant biochemical differences in breast cancer can be observed by Raman spectroscopy. Compared with H tissue, the protein and nucleic acid content in DCIS and IDC tissue was higher, while the lipid and carotenoid content was lower or had even disappeared. Combined with multivariate analysis, the spectral characteristics of the H, DCIS, and IDC groups were further extracted from PCA loading and score plots. We also confirmed that the tissue classification model based on the PCA-LDA algorithm, together with LOOCV, was able to distinguish the three breast tissue types. In addition, a PCA-SVM diagnostic technique was developed with different kernel functions, and its diagnostic performance was comprehensively evaluated and compared. This method greatly reduced computational complexity without sacrificing the performance of the algorithm. The linear and polynomial PCA-SVM algorithms were superior to the PCA-LDA algorithm for classifying the spectra of breast tissue, indicating great diagnostic potential in future applications. Therefore, the study confirmed the feasibility of Raman spectroscopy combined with multivariate analysis for the diagnosis of breast cancer.

Although our work and that of other groups have already demonstrated that Raman spectroscopy benefits early cancer diagnosis and pathological studies in a noninvasive way and without sample preparation procedures, some practical issues should still be noted. First, individual spectral diversity is a particularly prominent factor in making appropriate final diagnostic decisions; it is therefore necessary to adopt machine learning methods to accurately classify the spectral features of different tissue types and to provide a reference for treatment practice. In this context, continued efforts are required to facilitate the transition of Raman (micro)spectroscopy from benchtop to bedside by developing advanced detection methodologies that bridge the gap between experimental studies and clinical practice. Newly developed Raman instruments should offer a high signal-to-noise ratio together with automatic data analysis techniques, allowing faster, earlier, and more accurate diagnosis. Meanwhile, more

The spectroscopic data used to support the findings of this study were supplied by Dr. Shuang Wang under license and so cannot be made freely available. Requests for access to these data should be made to Dr. Shuang Wang (

The authors declare that they have no conflicts of interest.

This work was supported by the National Natural Science Foundation of China (61911530695) and Science Development Foundation of Shaanxi Province, China (2020KW-055).

The supplementary documents explain the mathematical principle of the PCA-SVM model and provide its evaluation results in both the fingerprint and high-wavenumber regions.