Glioma is one of the most common and deadly malignant brain tumors originating from glial cells. For personalized treatment, an accurate preoperative prognosis for glioma patients is highly desired. Recently, various machine learning-based approaches have been developed to predict the prognosis based on preoperative magnetic resonance imaging (MRI) radiomics, which extract quantitative features from radiographic images. However, major methodological challenges remain in optimizing feature extraction and providing rapid information flow in clinical settings. This study investigates two machine learning-based prognosis prediction tasks using radiomic features extracted from preoperative multimodal MRI brain data: (i) prediction of tumor grade (higher-grade vs. lower-grade gliomas) from preoperative MRI scans and (ii) prediction of patient overall survival (OS) in higher-grade gliomas (<12 months vs. >12 months) from preoperative MRI scans. Specifically, these two tasks use conventional machine learning models built with various classifiers. Moreover, feature selection methods are applied to increase model performance and decrease computational costs. In the experiments, models are evaluated in terms of their predictive performance and stability using a bootstrap approach. Experimental results show that classifier choice and feature selection technique play a significant role in model performance and stability for both tasks; a variability analysis indicates that the choice of classification method is the dominant source of performance variation for both tasks.
Glioma is one of the most common and deadly malignant brain tumors originating from glial cells. About 50 percent of nervous system tumors and 80 percent of all malignant brain tumors are gliomas. Glioblastoma multiforme (GBM) (also called glioblastoma) is a fast-growing glioma that develops from star-shaped glial cells (astrocytes and oligodendrocytes) that support the health of the nerve cells within the brain. In adults, GBM occurs most often in the cerebral hemispheres, especially in the frontal and temporal lobes. GBM is a devastating brain cancer that typically results in death in the first 15 months after diagnosis. The traditional treatment of GBM is surgical resection followed by radiation therapy and/or chemotherapy. However, the median survival time of GBM patients is still less than 15 months despite surgical resection, radiotherapy, and chemotherapy. Therefore, an accurate preoperative prognosis for GBM patients is desired, as it can provide essential information for planning optimized and personalized treatment.
Recently, various machine learning-based approaches have been developed to predict the prognosis based on preoperative magnetic resonance imaging (MRI) radiomics, which is a new cross-field of medical informatics, aiming to extract quantitative features defined by mathematics from medical images, such as shape, intensity, and texture [
Although all the studies mentioned above have indicated an essential value of brain imaging phenotype for OS prediction, tumors are often heterogeneous in space and time. There are differences in the cell, gene, and microenvironment for different tumor regions at the same time point or at other time points in the same tumor region, which usually requires multiple biopsies to capture the tumor’s molecular heterogeneity, bringing inconvenience and risk to patients. Radiomics can provide a noninvasive way to explore the heterogeneity of tumors [
However, although many studies have applied machine learning algorithms to radiomic feature classification and prognosis prediction [
In our study, two machine learning classification tasks using radiomic features are investigated, which predict tumor grade and patient OS from preoperative MRI scans, respectively. These two tasks use conventional machine learning models constructed with various classifier methods. Feature reduction methods are also applied to increase model performance and decrease computational costs. Models are assessed in terms of their predictive performance and stability using a bootstrap approach based on the 2017 BraTS Challenge’s MRI data. Experimental results show that the classifier choice and dimensionality reduction technique play a significant role in model performance and stability for both tasks. Figure
Proposed workflow for grade/survival classification task.
We utilized the 2017 BraTS Challenge’s Training Dataset [
Examples of the four MRI modalities and the corresponding tumor masks from two randomly selected GBM patients.
This study uses a multitask VNet framework to segment glioma and its different subregions from the multimodal MR image, shown in Figure
The overall architecture of the proposed multitask VNet.
In contrast, the decoder module alternately stacks deconvolutional and convolutional layers to restore image resolution stage by stage from the features extracted by the joint encoder. The model’s loss function is the weighted sum of the categorical focal loss of the mask decoder block and the mean squared error (MSE) loss of the distance transform decoder block. In essence, the distance map prediction regularizes the mask prediction.
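The weighted two-branch loss described above can be sketched as follows. This is a minimal illustration on a flattened list of voxels in plain Python, not the authors' implementation: the focusing parameter `gamma`, the branch weights `w_mask` and `w_dist`, and all function names are assumptions, since the text does not give them.

```python
import math

def categorical_focal_loss(probs, labels, gamma=2.0, eps=1e-7):
    """Mean categorical focal loss over voxels.

    probs:  per-voxel class-probability lists (each sums to 1).
    labels: integer class index for each voxel.
    gamma:  focusing parameter (assumed value, not stated in the text).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = min(max(p[y], eps), 1.0)  # clip to avoid log(0)
        total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / len(labels)

def mse_loss(pred, target):
    """Mean squared error between predicted and reference distance maps."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)

def multitask_loss(probs, labels, dist_pred, dist_true, w_mask=1.0, w_dist=1.0):
    """Weighted sum of the mask-branch loss and the distance-branch loss."""
    return (w_mask * categorical_focal_loss(probs, labels)
            + w_dist * mse_loss(dist_pred, dist_true))

# Toy example: three voxels, two classes (hypothetical numbers).
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
dist_pred = [0.0, 1.5, 0.5]
dist_true = [0.0, 2.0, 1.0]
loss = multitask_loss(probs, labels, dist_pred, dist_true, w_mask=1.0, w_dist=0.5)
```

Because the focal term down-weights voxels that are already classified confidently, most of the mask loss in this toy case comes from the third voxel, whose true class receives a probability of only 0.4.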
Medical images contain a lot of information that can reflect the relationship between human macro performance and microenvironment. Up to now, the analysis and diagnosis of medical images have relied mainly on human judgment, which is qualitative rather than quantitative. Compared with qualitative descriptions based on human experience, quantitative features can reveal more of the latent information in an image. Medical imaging has therefore developed from traditional morphological diagnosis toward quantitative tumor analysis. The main difference is that the latter requires extracting and analyzing more high-order quantitative image features.
Quantitative feature extraction refers to the process of extracting information from images by computer. The performance of a classification model largely depends on the features used. We extracted 16 shape, 19 first-order statistics, 27 gray-level co-occurrence matrix (GLCM), 16 gray-level size zone matrix (GLSZM), and 16 gray-level run length matrix (GLRLM) features from each phenotype region of interest (ROI). The coiflet wavelet transform filter was also applied to each image to extract eight decompositions; for each phenotype, each decomposition’s intensity-based features were calculated. The combination of shape features, first-order features, texture features, and wavelet features yields 718 features for each image phenotype and 2154 features for each sample. Before extracting these features, voxel intensity values were normalized using the
Radiomics produces a large number of informative features for use in predictive modeling. However, when the number of samples is far smaller than the number of features, direct classification has a high computational cost and tends to perform poorly. It may even cause the classification algorithm to fail. Hence, feature selection is needed after image feature extraction to obtain a feature set that performs well.
For machine learning models, there are many methods to reduce the feature space. Common categories of feature selection methods include filter, wrapper, and embedded methods. In addition, compared with the wrapper and embedded methods, the filter methods have the advantages of classifier independence and high computational efficiency [
We utilized four unsupervised dimensionality reduction methods to build machine learning models, that is, principal component analysis (PCA), kernel PCA (KPCA), independent component analysis (ICA), and factor analysis (FA). We chose these methods due to their simplicity, computational efficiency, and easily available implementation. Moreover, these methods were compared with a univariate filter technique, ANOVA
The prediction of tumor grade or overall survival in this paper is a small-sample binary classification problem, for which supervised learning is well suited. Supervised learning uses training data to learn rules that can then be applied to predict new samples. The training data consist of examples represented by a set of input features (radiomic features) and an output value (tumor grade or overall survival class). Once a prediction model has been built from labeled data using a classifier and a feature selection method, it can predict the class of an unlabeled sample.
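As a rough illustration of this workflow, the sketch below chains a dimensionality reduction step and a classifier with scikit-learn and scores each combination by AUC on a held-out set. It assumes synthetic data from `make_classification` as a stand-in for the radiomic feature matrix, a fixed five components, and logistic regression as the only classifier; none of these settings are taken from the study.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA, FastICA, FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a radiomic feature matrix (NOT the BraTS data):
# few samples, many features, binary label.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# The four unsupervised reducers named in the text; 5 components is an
# arbitrary choice for this sketch.
reducers = {
    "PCA": PCA(n_components=5, random_state=0),
    "KPCA": KernelPCA(n_components=5, kernel="rbf"),
    "ICA": FastICA(n_components=5, random_state=0, max_iter=1000),
    "FA": FactorAnalysis(n_components=5, random_state=0),
}

auc_by_reducer = {}
for name, reducer in reducers.items():
    # Scaling and reduction are fitted on the training set only, then applied
    # unchanged to the test set by the pipeline.
    model = make_pipeline(StandardScaler(), reducer,
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    auc_by_reducer[name] = roc_auc_score(y_test, scores)
```

Wrapping the reducer and the classifier in one pipeline keeps the test set out of the fitting step, which matters when samples are scarce relative to features.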
We selected nine conventional machine learning techniques constructed with various classifier methods and two deep learning-based models for comparison, that is, decision trees (DT), random forest (RF), bagging (BAG), boosting (BST), Gaussian naïve Bayes (NB), multilayer perceptron (MLP), support vector machines (SVM), logistic regression (LR),
Models built with various machine learning techniques.
Classifier methods | Dimensionality reduction methods | Feature selection methods |
---|---|---|
Decision trees (DT) | Principal component analysis (PCA) | ANOVA |
Random forest (RF) | Kernel PCA (KPCA) | Max 2D diameter (DIAM) |
Bagging (BAG) | Independent component analysis (ICA) | — |
Boosting (BST) | Factor analysis (FA) | — |
Naïve Bayes (NB) | — | — |
Multilayer perceptron (MLP) | — | — |
Support vector machine (SVM) | — | — |
Logistic regression (LR) | — | — |
k-nearest neighbors (KNN) | — | — |
To analyze our results, a split was made by the patient. For each dataset (Tumor Grade Dataset
To investigate and compare the performance of different dimensionality reduction and classification approaches, a three-dimensional parameter grid for analysis was constructed in this study. For any of the four dimensionality reduction approaches, we took two as the step size (
We used the popular open-source machine learning library scikit-learn for model building and analysis in Python 3.6. The training and testing experiments were performed on an NVIDIA GeForce Titan RTX 24 GB GPU with an Intel Xeon Silver 4210 2.2 GHz CPU. The presented figures were generated using the plotting library Matplotlib. An open-source radiomics toolbox, Pyradiomics, was used for radiomic feature extraction.
There are three main experimental factors in our study that can affect the radiomics-based prediction, that is, the prediction model (RF, NB, DT, BAG, BST, SVM, LR, MLP, KNN, CNN, and DNN), the dimensionality reduction method (PCA, KPCA, ICA, and FA), and the number of dimensions selected (1, 3, 5, …, 15). Multivariate analysis of variance (ANOVA) was used to quantify the impact of these factors and their interactions on AUC scores in each classification task. To compare the variability contributed by each factor, the variance (sum of squares) calculated for each factor was divided by the total variance and multiplied by 100 to yield the percent variance for that factor.
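The percent-variance calculation can be sketched on a toy grid as follows. The AUC values come from a hypothetical additive model invented purely for illustration, so the resulting percentages say nothing about the study's results. Only the nine conventional classifiers are included, and only main effects are computed (the interaction terms are zero by construction here).

```python
from itertools import product
from statistics import mean

classifiers = ["RF", "NB", "DT", "BAG", "BST", "SVM", "LR", "MLP", "KNN"]
reducers = ["PCA", "KPCA", "ICA", "FA"]
dims = list(range(1, 16, 2))  # 1, 3, 5, ..., 15

# Hypothetical additive effects, used only to give the grid some values.
clf_effect = {c: 0.02 * i for i, c in enumerate(classifiers)}
red_effect = {r: 0.01 * i for i, r in enumerate(reducers)}

def toy_auc(clf, red, dim):
    """Made-up AUC for one grid cell (no interaction between factors)."""
    return 0.60 + clf_effect[clf] + red_effect[red] + 0.002 * dim

# Full three-dimensional grid: classifier x reducer x number of dimensions.
grid = list(product(classifiers, reducers, dims))
auc = {cell: toy_auc(*cell) for cell in grid}

grand_mean = mean(auc.values())
ss_total = sum((v - grand_mean) ** 2 for v in auc.values())

def main_effect_ss(factor_index, levels):
    """Sum of squares of one factor's main effect on a balanced grid."""
    ss = 0.0
    for level in levels:
        cells = [v for k, v in auc.items() if k[factor_index] == level]
        ss += len(cells) * (mean(cells) - grand_mean) ** 2
    return ss

# Percent variance = factor sum of squares / total sum of squares * 100.
percent_variance = {
    "classifier": 100 * main_effect_ss(0, classifiers) / ss_total,
    "reduction": 100 * main_effect_ss(1, reducers) / ss_total,
    "dimensions": 100 * main_effect_ss(2, dims) / ss_total,
}
```

On real AUC scores the main effects would not add up to 100 percent; the remainder is attributed to the interaction terms and to residual variation.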
In our study, a total of 2154 features were extracted from the segmented tumor regions of the preoperative MRI scans from the BraTS 2017 glioma dataset. For the Tumor Grade Dataset, the output classes were LGG or HGG, while for the Overall Survival Dataset, the output classes were <12-month or >12-month survival. For both classification tasks, feature selection and classifier training were performed on the training set, whereas the testing set was used to assess performance and stability.
Figure
Predictive performance of feature reduction and classification methods. (a) Grade classification task and (b) survival classification task.
In addition, we repeated the above experiment by varying the number of dimensions. Figures
Predictive performance corresponding to classification methods and the number of dimensions for each dimensionality reduction method for grade classification task.
Predictive performance corresponding to classification methods and the number of dimensions for each dimensionality reduction method for survival classification task.
Four AUC/RSD values corresponding to different dimensionality reduction techniques (PCA, KPCA, ICA, and FA) are generated for each prediction method. We took the median of all four AUC/RSD values for each prediction task as the representative AUC/RSD of a model. Figure
Scatterplots between representative stability and predictive performance of classification methods. (a) Grade classification task and (b) survival classification task.
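A bootstrap estimate of performance and stability, with the median taken across reduction methods, might look like the sketch below. It assumes that RSD is the relative standard deviation of the bootstrapped AUCs (100 × standard deviation / mean), which is the usual definition but is not spelled out in this section. The test-set scores and the four per-method AUCs are placeholder numbers, not results.

```python
import random
from statistics import mean, median, stdev

def auc_score(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_aucs(labels, scores, n_boot=200, seed=0):
    """AUC on n_boot resamples (with replacement) of the test cases."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC is undefined without both classes
            continue
        aucs.append(auc_score(ys, [scores[i] for i in idx]))
    return aucs

def rsd(values):
    """Relative standard deviation in percent (assumed definition of RSD)."""
    return 100.0 * stdev(values) / mean(values)

# Toy test set: positives tend to receive higher scores than negatives.
rng = random.Random(1)
labels = [0] * 30 + [1] * 30
scores = [rng.gauss(0.4 + 0.2 * y, 0.15) for y in labels]

aucs = bootstrap_aucs(labels, scores)
model_auc, model_rsd = mean(aucs), rsd(aucs)

# Representative AUC of one classifier across PCA, KPCA, ICA, and FA
# (placeholder values).
representative_auc = median([0.81, 0.84, 0.79, 0.86])
```

A lower RSD means the AUC changes less from one resample to the next, so a model in the high-AUC, low-RSD corner of such a scatterplot is both accurate and stable.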
To quantify the effect of classification methods, dimensionality reduction methods, and the number of selected dimensions, multivariate ANOVA was performed on AUC scores in this study. In Figure
Variation of AUC explained by experimental factors and their interactions. (a) Grade classification task and (b) survival classification task.
Several studies have built radiomics-based predictive models for various clinical factors such as tumor grade, prognostic outcome, and treatment response. However, to expand the radiomics community, studies utilizing open-source data, tools, and machine learning models, such as those used in our current investigation, are necessary. In a series of papers, Parmar et al. evaluated the predictive performance and stability of computed tomography (CT) radiomic machine learning models constructed with various feature selection filter methods and classifier methods [
Additionally, Zhang et al. performed a similar study on lung CT with unsupervised dimensionality reduction methods and proposed that dimensionality reduction methods have the potential to be superior to filter methods [
ET in T1c MRI scans is often used as a distinctive marker when attempting to distinguish LGG from HGG. However, since we have only used LGG samples that contain ET components, we suggest that radiomics provides novel information about the underlying phenotype that is usually not accessible in the radiological setting. Glioma grade is histopathologically diagnosed; i.e., a biopsy must be taken for classification [
Predictive performance for grade classification is much higher than for survival classification, which is not surprising as each classification task has its own set of optimal radiomic biomarkers linked to underlying biological significance. For example, the combination of shape, first-order statistics, texture, and wavelet features utilized through dimensionality reduction leads to higher predictive performance than diameter features alone for the grade classification task. However, this is not the case for the survival classification task, where using diameter features alone leads to higher predictive performance than dimensionality reduction or filter techniques applied to all radiomic features. Previous studies have shown that it is challenging to gain predictive power from texture features in GBM, with AUC values routinely falling <0.6 [
For both classification tasks, the classifier method was the largest contributor to variability in predictive performance. This trend has commonly been observed in radiomic studies investigating machine learning models using different classifiers and feature selection methods [
Some limitations of our study are as follows. Regarding image preprocessing, we have only utilized a simple method of intensity normalization (
In this study, we investigate two machine learning classification tasks using radiomic features: (i) prediction of tumor grade (higher-grade vs. lower-grade gliomas) and (ii) prediction of overall survival in higher-grade gliomas (<12 months vs. > 12 months). These tasks are attempted using machine learning models constructed with various classifier methods and dimensionality reduction techniques. Models are assessed in terms of their predictive performance and stability using a bootstrap approach. Our results demonstrate that for both classification tasks, among dimensionality reduction methods, FA yielded the highest predictive performance. Similarly, MLP, LR, and KNN produced the highest predictive performance and stability among classifier methods. In addition, DT tended to perform poorly for both classification tasks. This possibly points to an underlying radiomic structure in the BraTS dataset that is preferentially fit by specific machine learning models. Where results start to diverge significantly is in the implementation of the SVM classifier. For the grade classification task, SVMs tend to perform relatively well with all feature selection methods except ICA. For the survival classification task, SVMs tend to perform poorly with all feature selection methods except FA. Interestingly, previous studies in different cancer types have suggested RF to be the best classifier method for radiomics studies. Still, it does not score among the best classifier methods for either task in our research.
The raw/processed data required to reproduce these findings cannot be shared at this time as the data also forms part of an ongoing study.
The authors declare that they have no conflicts of interest regarding the publication of this paper.
This work was funded in part by the National Natural Science Foundation of China (Grant nos. 62072413 and 61602419), in part by the Natural Science Foundation of Zhejiang Province of China (Grant no. LY16F010008), in part by Medical and Health Science and Technology Plan of Zhejiang Province of China (Grant no. 2019RC224), and also in part by the Teacher Professional Development Project of Domestic Visiting Scholar in Colleges and Universities of Zhejiang Province of China (Grant nos. 2020-19 and 2020-20).