Cancer diagnosis is one of the most important tasks of biomedical research and has become a main objective of medical investigations. The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining near-infrared (NIR) spectroscopy with chemometrics. The successive projections algorithm-linear discriminant analysis (SPA-LDA) was used to select a reduced subset of variables/wavenumbers and build an LDA diagnostic model. For comparison, partial least squares-discriminant analysis (PLS-DA) based on the full spectrum was used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients with partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a more parsimonious model using only three wavenumbers/variables (4065, 4173, and 5758 cm−1) to achieve sensitivities of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.
Nowadays, cancer has become one of the principal causes of death [
Colorectal cancer is a disease of genes that control the proliferation, differentiation, and death of colon cells [
Recent research has demonstrated the applicability of optical spectroscopic techniques for fast, noninvasive, and in situ diagnosis of various diseases, including cancer. Infrared (IR) and near-infrared (NIR) spectroscopy in particular have proven to be useful tools for disease diagnosis because of their potential to probe changes in tissues and cells at the molecular level [
However, the NIR spectrum mainly corresponds to overtones and combinations of fundamental vibrational transitions that occur in the IR region; its bands are therefore overlapping, broad, and weak, without distinct signatures of individual components [
The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining NIR spectroscopy with variable selection. For this purpose, SPA-LDA was used to select a reduced subset of variables/wavenumbers and build an LDA diagnostic model. For comparison, partial least squares-discriminant analysis (PLS-DA) based on the full spectrum was used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients with partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a simpler and better model, which used only three wavenumbers/variables (4065, 4173, and 5758 cm−1) to achieve sensitivities of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.
Partial least squares (PLS) regression is a classic latent variable-based multivariate calibration method. Partial least squares-discriminant analysis (PLS-DA) is a classification algorithm that combines the properties of PLS regression with discriminant analysis [
The model constructed on the experimental dataset can be used to assign an unknown sample to a previously defined class based on its measured features, such as its spectrum. Classification of a new sample is derived from the output value of the PLS model. The output value is a real number rather than an integer, and it should ideally be close to one of the values used to code the classes (either 1 or 2). A threshold between 1 and 2 is set so that a sample is assigned to class 1 if the predicted value is below the threshold and to class 2 if the predicted value is above it. PLS-DA uses an appropriate number of latent variables (LVs), that is, linear combinations of the original variables, to maximize the discrimination among the classes. The number of LVs can be optimized by the criterion of lowest prediction error in cross validation.
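The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the threshold of 1.5 (the midpoint of the two class codes) is an assumption, and assigning a value exactly equal to the threshold to class 2 is an arbitrary choice that the text does not specify.

```python
def assign_class(predicted_value, threshold=1.5):
    """Map a real-valued PLS output to class 1 or class 2.

    threshold=1.5 is an assumed midpoint; the paper only says the
    threshold lies between 1 and 2.
    """
    return 1 if predicted_value < threshold else 2


def misclassified_ratio(predicted_values, true_labels, threshold=1.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    assigned = [assign_class(v, threshold) for v in predicted_values]
    errors = sum(a != t for a, t in zip(assigned, true_labels))
    return errors / len(true_labels)


# Hypothetical PLS outputs for four samples and their true class codes.
example_outputs = [0.9, 1.4, 1.7, 2.1]
example_labels = [1, 2, 2, 2]
example_mcr = misclassified_ratio(example_outputs, example_labels)
```

In this made-up example the second sample (output 1.4, true class 2) falls on the wrong side of the threshold, so one of four samples is misclassified.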
The successive projections algorithm (SPA) is a forward variable selection method aimed at minimizing variable collinearity in modeling. It was originally developed by Araújo et al. in the context of multivariate calibration [
The combination of SPA and LDA is expressed as SPA-LDA. Figure
The flowchart of the SPA-LDA algorithm.
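The projection step that gives SPA its name can be sketched as follows. This is a hedged illustration of how one chain of variables is built by successive orthogonal projections, assuming NumPy and column mean-centring; the function name `spa_chain` is hypothetical, and the sketch does not cover the later stage in which candidate chains are compared by a validation cost.

```python
import numpy as np


def spa_chain(X, start, n_select):
    """Build one SPA chain of column indices by successive projections.

    X        : (samples x variables) matrix of spectra
    start    : index of the initial variable
    n_select : number of variables in the chain
    """
    Xp = np.array(X, dtype=float)
    Xp -= Xp.mean(axis=0)  # assumed preprocessing: centre each column
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # Project every column onto the subspace orthogonal to v.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never reselect a chosen variable
        # The column with the largest residual norm is the least
        # collinear with the variables already in the chain.
        selected.append(int(np.argmax(norms)))
    return selected


# Toy data: column 1 is exactly twice column 0, column 2 is independent.
toy = np.array([[1, 2, 0],
                [2, 4, 1],
                [3, 6, 0],
                [4, 8, 1]])
toy_chain = spa_chain(toy, start=0, n_select=2)
```

Starting from column 0, the collinear column 1 is reduced to (numerically) zero by the projection, so the chain continues with column 2.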
Once the variables have been selected, a SPA-LDA model can be obtained. For a new sample, its Mahalanobis distance with respect to the mean vector of each class can be calculated and it can then be assigned to the class for which its Mahalanobis distance is the smallest.
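The nearest-class rule described above can be sketched as follows, assuming NumPy. The use of a single pooled within-class covariance matrix is an assumption (it is the usual LDA choice but is not stated in the text), and the function names and toy data are hypothetical.

```python
import numpy as np


def fit_class_model(X, y):
    """Return per-class mean vectors and the inverse pooled covariance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = {int(c): X[y == c].mean(axis=0) for c in classes}
    # Pooled within-class scatter, divided by (samples - classes).
    resid = np.vstack([X[y == c] - means[int(c)] for c in classes])
    cov = resid.T @ resid / (len(X) - len(classes))
    return means, np.linalg.inv(cov)


def classify(x, means, cov_inv):
    """Assign x to the class with the smallest squared Mahalanobis distance."""
    x = np.asarray(x, dtype=float)
    d2 = {c: float((x - m) @ cov_inv @ (x - m)) for c, m in means.items()}
    return min(d2, key=d2.get)


# Toy two-variable data; labels follow the paper's coding (1 cancer, 2 normal).
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1],
         [5, 5], [6, 5], [5, 6], [6, 6]]
y_toy = [1, 1, 1, 1, 2, 2, 2, 2]
means_toy, cov_inv_toy = fit_class_model(X_toy, y_toy)
label_a = classify([0.2, 0.4], means_toy, cov_inv_toy)
label_b = classify([5.8, 5.1], means_toy, cov_inv_toy)
```

In an SPA-LDA setting, `X` would hold only the columns selected by SPA, which keeps the covariance matrix small and well conditioned.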
Colonic tissue samples were collected from 20 patients who underwent partial colorectal resection at the Affiliated Hospital of North Sichuan Medical College and the First People’s Hospital of Yibin of China. All patients had histopathologically proven malignancies of the colon. After surgical resection, the tissue samples were immediately fixed in 10% formalin solution and then stored in the laboratory for spectral measurements. To ensure that the NIR spectra were representative of the pathology, the peer tissues were processed as paraffin-embedded blocks for pathologic confirmation. The average age of the patients was 54 years, with the youngest being 31 years and the oldest being 71 years. The study was approved by the local ethics committee, and consent for using the tissue samples was obtained. Positions 5–10 cm from the tumor were taken to be healthy, and each site was also confirmed by an experienced pathologist. A total of 186 NIR spectra were acquired from different sites of the colonic tissue specimens, of which 78 spectra were from cancerous positions and 108 from normal positions. Different spectra correspond to different positions. All spectra were divided into three subsets: the training set, the validation set, and the test set. Each subset consisted of 26 cancerous and 36 normal spectra from different patients. For classification purposes, each spectrum was assigned a class label (1 for cancer and 2 for normal). The training and validation sets were used in the modeling procedures, whereas the test set was used only in the evaluation and comparison of the final classification models.
An Antaris II FT-NIR spectrometer (Thermo Fisher Scientific, USA) equipped with an InGaAs detector and a fiber-optic probe (SabIR) was used in this work for spectra collection. The SabIR is a high-performance optical probe able to perform remote nondestructive sampling. Measurements were made in diffuse reflectance mode. The outer and light spot diameters of the probe were about 20 mm and 3 mm, respectively. Thus, during each measurement, the measured area was approximately 7.0 mm2. The spectrometer was controlled by the accompanying Result 3.0 software. Each spectrum was taken as an average of 32 successive scans from 4000 to 10000 cm−1 with a spectral resolution of 4 cm−1. The record format of spectrum was
Figure
Population mean spectra and standard deviations of cancerous and normal tissue specimens.
Principal component analysis (PCA) was used to examine the possible clustering in samples and investigate the extent to which NIR features can differentiate cancerous and normal tissues. Figure
Three-dimensional scatter plot of the first three principal components (PCs) and its two-dimensional projection. The variance explained by each PC is indicated in parentheses.
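A score plot of this kind, together with the variance explained by each PC, can be obtained along the following lines. This is a minimal sketch assuming NumPy and column mean-centring as the only preprocessing; the function name `pca_scores` is hypothetical and the random matrix merely stands in for a spectra matrix.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Return PC scores and the fraction of variance explained by each PC."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)  # centre each variable
    # Singular value decomposition of the centred data matrix.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


# Synthetic stand-in for a (samples x wavenumbers) matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 6))
scores_demo, explained_demo = pca_scores(X_demo)
```

Each row of `scores_demo` gives the coordinates of one sample in the space of the first three PCs, which is what a three-dimensional scatter plot would display.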
Both the PLS-DA and SPA-LDA algorithms were used for constructing the diagnostic models.
When the PLS-DA model was constructed, one major issue was the choice of the optimal number of latent variables (LVs), which was made by a 5-fold cross validation procedure. For cross validation, the samples in the training set were first divided into five cross validation groups, that is, cancellation groups. Each of the first four groups was assigned 5 cancerous spectra and 7 normal spectra, and the remaining spectra were placed in the fifth group. Each cross validation group was removed from the training set, one at a time. Each time, the model was trained on the remaining samples and then used to predict the samples in the removed group. Figure
The influence of the number of latent variables (LVs) in the PLS-DA model on the misclassified ratio (MCR) and the first two loading vectors.
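The grouping used for cross validation can be sketched as below. This is an illustration under stated assumptions: spectra are allotted to groups in sequential order (the text does not say how they were allotted), the identifiers are hypothetical placeholders, and the function name `make_folds` is not from the paper. With the 26 cancerous and 36 normal training spectra reported earlier, applying the rule literally would leave 6 cancerous and 8 normal spectra for the fifth group.

```python
def make_folds(cancer_ids, normal_ids, n_folds=5, n_cancer=5, n_normal=7):
    """Split sample ids into cross validation groups.

    The first n_folds - 1 groups each get n_cancer cancerous and
    n_normal normal samples; all leftovers go to the last group.
    """
    folds = []
    for k in range(n_folds - 1):
        fold = (cancer_ids[k * n_cancer:(k + 1) * n_cancer]
                + normal_ids[k * n_normal:(k + 1) * n_normal])
        folds.append(fold)
    used_cancer = (n_folds - 1) * n_cancer
    used_normal = (n_folds - 1) * n_normal
    folds.append(cancer_ids[used_cancer:] + normal_ids[used_normal:])
    return folds


# Placeholder ids for the 26 cancerous and 36 normal training spectra.
cancer_ids = [f"C{i}" for i in range(26)]
normal_ids = [f"N{i}" for i in range(36)]
folds = make_folds(cancer_ids, normal_ids)
fold_sizes = [len(fold) for fold in folds]

# Leave-one-group-out: train on four groups, predict the fifth.
splits = []
for k, held_out in enumerate(folds):
    train = [s for j, fold in enumerate(folds) if j != k for s in fold]
    splits.append((train, held_out))
```

For each candidate number of LVs, the misclassified ratio would be accumulated over the five held-out groups, and the number of LVs giving the lowest value would be retained.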
The prediction performance of the final PLS-DA model on different subsets.
The SPA-LDA modeling selected only three variables/wavenumbers, corresponding to the minimum point of the validation cost curve, as indicated by the arrow in Figure
Validation cost (
Mean spectra of cancerous and normal tissues, preprocessed by first derivative, in the range of 8000–4000 cm−1. The solid circle markers indicate the positions in the spectra of the wavenumbers selected by the SPA-LDA algorithm.
The prediction performance of the final SPA-LDA model on different subsets.
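The sensitivity and specificity figures reported for SPA-LDA can be related to sample counts as in the sketch below. Cancer is treated as the positive class. The counts of 22 and 24 correctly classified cancerous spectra are inferred here from the reported percentages and the subset size of 26 cancerous spectra; they are not stated in the text and should be read as an assumption.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of cancerous spectra classified as cancerous."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of normal spectra classified as normal."""
    return true_neg / (true_neg + false_pos)


# Each subset holds 26 cancerous and 36 normal spectra.
# Inferred (not reported) counts consistent with the stated percentages:
sens_training = round(100 * sensitivity(22, 4), 1)    # 22 of 26 detected
sens_validation = round(100 * sensitivity(24, 2), 1)  # 24 of 26 detected
spec_all = round(100 * specificity(36, 0), 1)         # no false positives
```

Under this reading, the training set would contain four missed cancerous spectra and the validation and test sets two each, with no normal spectrum misclassified in any subset.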
The combination of NIR spectroscopy and two classification algorithms was evaluated for distinguishing cancerous colon tissue from normal tissue. The results showed that SPA-LDA was preferable, since it used only three wavenumbers to achieve better performance than PLS-DA. The NIR technique has several advantages: it is inexpensive, is less time-consuming, and does not require special sample preparation. It can be applied in oncology not only to distinguish cancerous tissue from normal tissue but also to understand basic processes, such as changes in metabolite concentration at the molecular level before histological manifestation. With a more representative sample set, NIR is also expected to be useful in the grading of malignancies, which remains future work.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (21375118), the Applied Basic Research Programs of Science and Technology Department of Sichuan Province of China (2013JY0101), Scientific Research Foundation of Sichuan Provincial Education Department of China (12ZA201, 13ZB0300), Yibin Municipal Innovation Foundation (2013GY018), and Innovative Research and Teaching Team Program of Yibin University (Cx201104).