Tumour Relapse Prediction Using Multiparametric MR Data Recorded during Follow-Up of GBM Patients

Purpose. We have focused on finding a classifier that best discriminates between tumour progression and regression based on multiparametric MR data retrieved from follow-up GBM patients. Materials and Methods. Multiparametric MR data consisting of conventional and advanced MRI (perfusion, diffusion, and spectroscopy) were acquired from 29 GBM patients treated with adjuvant therapy after surgery over a period of several months. A 27-feature vector was built for each time point, although not all features could be obtained at all time points due to missing data or quality issues. We tested classifiers using LOPO method on complete and imputed data. We measure the performance by computing BER for each time point and wBER for all time points. Results. If we train random forests, LogitBoost, or RobustBoost on data with complete features, we can differentiate between tumour progression and regression with 100% accuracy, one time point (i.e., about 1 month) earlier than the date when doctors had put a label (progressive or responsive) according to established radiological criteria. We obtain the same result when training the same classifiers solely on complete perfusion data. Conclusions. Our findings suggest that ensemble classifiers (i.e., random forests and boost classifiers) show promising results in predicting tumour progression earlier than established radiological criteria and should be further investigated.


Introduction
GBM is the most common and malignant intracranial tumor [1], representing as much as 30% of primary brain tumors with increasing incidence in some geographic regions [2]. The patients have a median survival of only 10 to 14 months after diagnosis with only 3 to 5% of patients surviving more than three years. Recurrence is universal, and, at the time of relapse, the median survival is only five to seven months despite therapy [3].
The current standard of care is surgical resection followed by radiotherapy and concomitant and adjuvant temozolomide (TMZ) chemotherapy [4].
Magnetic resonance imaging (MRI) is the most widely used medical imaging technique for identifying the location and size of brain tumours. However, conventional MRI has a limited specificity in determining the underlying type of brain tumour and tumour grade [5,6]. More advanced MR techniques like diffusion-weighted MRI, perfusion-weighted MRI, and chemical shift imaging (CSI) are promising in the 2 BioMed Research International characterization of brain tumours as they give potentially more physiological information [7][8][9].
Diffusion-weighted imaging (DWI) and diffusion kurtosis imaging (DKI) visualize the tissue structure and are useful for assessing tumour cellularity, because they give information about the movement of the water inside different tissues including biological barriers. Typical parameters related to diffusion are apparent diffusion coefficient (ADC), mean diffusivity (MD), mean kurtosis (MK), and fractional anisotropy (FA). MD is a general parameter that accounts for the mean diffusivity in all directions, MK might be a specific parameter for tissue structure [10], and FA is a general index of anisotropy, with a value of zero corresponding to isotropic diffusion and a value of one corresponding to diffusion only in one direction.
Perfusion-weighted MRI (PWI) provides measurements that reflect changes in blood flow, volume, and angiogenesis. Hypervascularity due to glioma-induced neoangiogenesis may show up as high relative cerebral blood volume (rCBV) while necrosis of different tissues may show up as low rCBV [11].
MR spectroscopy provides information about metabolites present in normal and abnormal tissues [12]. This information can be represented as metabolite maps using CSI.
The focus of our paper is finding a map between the multiparametric MR data acquired during the follow-up of the patients and the relapse of the brain tumour after surgery, as described by the clinically accepted RANO criteria [17]. In order to do this, we test different families of classifiers on multiparametric MR data, starting from simple ones, for example, -nearest neighbours ( -NN) and linear discriminant analysis (LDA), and moving to nonlinear classifiers, for example, random forests and neural networks, using a total of 27 features extracted from PWI, DKI, and CSI data.
Patients that were treated according to the HGG-IMMUNO-2003 protocol are patients with relapsed GBM that received immune therapy as the sole treatment strategy.
Patients that were treated according to the HGG-IMMUNO-2010 protocol are patients with primary GBM that had surgery. For the follow-up treatment after surgery the patients were split into two groups. The first group consisting of 6 patients who received radiochemotherapy and the immune therapy vaccine. The second group consisting of the remaining 7 patients who received just radiochemotherapy for the first six months after surgery, and after those six months all 7 patients received radiochemotherapy plus the immune therapy vaccine. We refer to the first group as "HGG-IMMUNO-2010 vaccine" and to the second group as "HGG-IMMUNO-2010 placebo. " All 29 patients were offered monthly MRI follow-up, but after six months under immune therapy all patients switched to a three-monthly schedule.
The local ethics committee approved this study and informed consent was obtained from every patient before the first imaging time point.
Based on radiological evaluation of the follow-up MRI scans using the current guidelines for response assessment of high grade glioma [17], each patient was assigned to one of two clinical groups: (i) patients with progressive disease during follow-up which exhibit an increase of ≥25% in the sum of the products of perpendicular diameter of enhancing lesions compared to the smallest tumour measurement obtained either at baseline or best response, (ii) patients with complete response with disappearance of all measurable and nonmeasurable disease sustained for at least 4 weeks.
Based on this assessment, each MRI time point for each patient was considered to be labeled or unlabeled as follows: labeled as "responsive" for all time points at and after the moment when the patient was considered as "complete response"; labeled as "progressive" for all time points at and after the moment when the patient was considered as "progressive disease"; or "unlabeled" for all time points preceding the decision moment.

MRI Acquisition and
Processing. Magnetic resonance imaging was performed on a clinical 3 Tesla MR imaging system (Philips Achieva, Best, Netherlands), using a body coil for transmission and a 32-channel head coil for signal reception. The imaging protocol consisted of diffusion kurtosis imaging, dynamic susceptibility weighted contrast-MRI (DSC-MRI), and MR spectroscopy, combined with standard anatomical imaging (T1-weighted MRI after contrast administration, T2-weighted MRI, and FLAIR (fluid attenuated inversion recovery) MR images).
Regions of interest (ROI) were manually drawn around the solid contrast-enhancing region if present, avoiding areas of necrosis (N) or cystic components such as the surgical cavity. A second ROI was manually drawn around the entire lesion (TO), that is, contrast enhancement (CE) and perilesional oedema (ED). The ROI containing the perilesional oedema was obtained by extracting the contrast-enhancing portion from the total lesion. Finally, a separate ROI was drawn around the contralateral normal appearing white matter (NAWM) to standardize the hemodynamic measurements of DSC-MRI.
The manual delineations were drawn by a radiologist (SVC) with 5 years experience of MR imaging of brain tumours. An example of delineations on T1 post contrast image can be seen in Figure 1, where green is the necrosis, red is CE, and blue is ED.

Magnetic Resonance Spectroscopy.
A 2D-CSI short echo time protocol was used as validated in [20]. The volume of interest (VOI) is positioned on the slice of the transverse reconstruction of the T1-weighted 3D-FFE sequence with the largest section of contrast enhancement. The slice thickness of the VOI is 10 mm and the VOI is 80 × 80 × 10 mm 3 , with each voxel being 5 × 5 × 10 mm 3 (16 × 16 voxels in total). If the contrast-enhancing lesion was smaller than 2 cm 3 or the contrast enhancement is located in areas with large susceptibility differences, for example, the basal forebrain or the anterior temporal lobes, a single voxel (SV) technique was performed (TR/TE: 2000/35 msec, minimal volume: 1 cm 3 ).
MR spectra were processed using the MATLAB 2010b environment (MathWorks, MA, USA) with SPID graphical user interface [21] as described in detail in [20].
Sixty-six percent (66%) of all spectroscopic time points are not included in this study. There are two reasons for this: (1) quantification was not possible for all time points (MR spectroscopy data was not acquired for all patients due to patient movement) and (2) the rest of them did not pass the quality control recommended by Kreis [24].

Dynamic Susceptibility Weighted Imaging (DSC-MRI).
Perfusion images were obtained using a standard DSC perfusion MR imaging protocol consisting of a gradient echo-EPI sequence, TR/TE: 1350/30 msec, section thickness/gap: 3/0 mm, dynamic scans: 60, FOV: 200 × 200 mm 2 , matrix: 112 × 109, number of slices: 23, and scan time: 1 minute 26 seconds. EPI data were acquired during the first pass following a rapid injection of a 0.1 mmol/kg body weight bolus of megluminegadoterat (Dotarem, Guerbet, Villepinte, France) via a mechanical pump at a rate of 4 mL/sec, followed by a 20 mL bolus of saline. Preload dosing was performed according to Hu et al. in order to correct for T1-weighted leakage (preload dose 0.1 mmol/kg megluminegadoterat, incubation time 10 min) [26].
The mean values of the considered perfusion parameters were retrieved in the CE, ED, and NAWM regions. We report relative rCBV (rrCBV), relative rCBF (rrCBF), and relative DR (rDR) of tumoural tissue by using the corresponding parameter value in the contralateral NAWM as internal reference.
Although quantification was possible for all time points, after quality assessment done by visual inspection by SVC, 30% of them were not included in this study. (DKI). DKI data were acquired according to the previously described protocol in [18,19] (SE-EPI-DWI sequence with TR/TE: 3200/90 msec, /Δ: 20/48.3 msec; FOV: 240 × 240 mm 2 , matrix: 96 × 96, number of slices: 44, 1 signal average acquired, section thickness/gap: 2.5/0 mm, and -values: 700, 1000, and 2800 sec/mm 2 in 25, 40, and 75 uniformly distributed directions, resp.) [27]. The DKI data were processed as described in [18]. Fractional anisotropy (FA), mean diffusivity (MD), and mean kurtosis (MK) were derived from the tensors [10,28]. A nonlinear registration of the parameter maps to the anatomical MR imaging data was performed to minimize the local misalignment between the EPI distorted DKI data and the anatomical data on which the ROIs were manually positioned. MK, MD, and FA were determined in the CE and ED regions.

Diffusion Kurtosis Imaging
Although quantification was possible for all time points, after quality control according to [27], 44% of them were not included in this study. Out of all 178 measurements, if we extract just the ones with complete features, it will result in a subset of 18 patients with 45 measurements. This implies that more than 75% of the measurements have at least one feature missing. Five features are always present: the three volumes, the parameter for tumour resection, and the parameter for different groups.

Classifiers.
We have used several supervised and semisupervised classifiers, as presented in Table 1, with the goal of testing whether the unlabeled data could have been reliably labeled before the actual labeling was performed in the clinic according to the RANO criteria.
The list of classifiers in Table 1 is representative for the most important families of classification methods, starting from simple classical methods such as linear discriminant analysis (LDA) and -nearest neighbour ( -NN) up to more complex nonlinear classifiers such as random forests and neural networks.
Each classifier is based on a mathematical model, which needs to be optimised on the basis of a training dataset. The training set consists here of labeled data, that is, data at and after a clinical decision has been made. The test set on which we compare classifiers consists of data that have no label, that is, time points before the decision of "progressive" or "responsive" has been made.
All classifiers are implemented in MATLAB R2013a (MathWorks, MA, USA). All classifiers except least squares Table 1: Supervised and semisupervised classifiers tested in this paper.

Supervised classifiers
Handles missing values Random forests support vector machines (LSSVMs) and the semisupervised ones are part of the Statistics Toolbox and Neural Networks Toolbox of MATLAB R2013a.
-NN [29] is one of the basic classifers in machine learning. The class label of a new testing point is given by the most common class among its neighbours. We used the default MATLAB R2013a (Statistics Toolbox) function " nnclassify" to run a grid search for the best combination of number of neighbours ( ) and type of distance. We varied between 1 and 11 and the distance was either "euclidean, " "cityblock, " "cosine, " or "correlation. " We found the best results for the combination of 3 neighbours and the "correlation" distance.
Diagonal LDA (dLDA [30]) is a simple modification of linear discriminant analysis, which implies that we use the pseudoinverse of the covariance matrix instead of the actual inverse. We used the default MATLAB R2013a implementation "classify" from the Statistics Toolbox.
SVMs [31,32] are among the most popular machine learning models because they are easy to understand: given a training set with points that belong to two classes, we try to find the best hyperplane to differentiate between the two types of points. We can try this in the original space or we can map the points to another space by using the kernel trick. We used the default MATLAB R2013a (Statistics Toolbox) implementations "svmtrain" and "svmclassify. " We used different types of kernel: linear, polynomial, radial basis function, and multilevel perceptron.
Classification tree [33] is an algorithm commonly used in machine learning. Like in a real tree there are leaves which represent class labels and branches. At each node of a tree a single feature is used to discriminate between different branches. We used the default MATLAB R2013a (Statistics Toolbox) implementation "classregtree. " Neural networks [34][35][36][37] are built on interconnected layers of artificial "neurons" that try to map an input vector to its specific output. There are three types of layers: input, hidden, and output. The weights between different neurons are trained until a maximum number of iterations or a minimum error is reached. We used the default MATLAB R2013a (Neural Network Toolbox) implementation "net" with 10 hidden neurons. We tested four types of neural networks: pattern net, feed forward net, cascade forward net, and fit net.
Random forests [38,39] are part of the ensemble methods for classification that use a collection of decision trees. Each decision tree learns a rule and then it can classify a new point. The new point is assigned to the class voted by the majority of the decision trees. We used the default MATLAB R2013a (Statistics Toolbox) implementation "TreeBagger" with 100 trees.
Boosting algorithms [40][41][42][43] start with a collection of weak classifiers (e.g., decision trees) and with each iteration they try to improve the overall classification by learning what was misclassified at the previous step. We used the default MATLAB R2013a (Statistics Toolbox) implementation "fitensemble" with 100 trees. We tested seven types of boosting algorithms: AdaBoost, LogitBoost, GentleBoost, RobustBoost, LPBoost, TotalBoost, and RUSBoost.
LSSVMs [44,45] are a powerful machine learning technique. We downloaded LSSVMlab from [46] and followed the instructions from [47] to tune the parameters. We used different types of kernel: linear, polynomial, radial basis function, and also the Bayesian approach on LSSVM.
The semisupervised classifiers used in this paper are low density separation (LDS [48]), squared-loss mutual information regularization (SMIR [49]), and safe semisupervised support vector machine (S4VM [50,51]). In the last years there has been a steady increase in the use and development of semisupervised classifiers, as they take into account information from unlabeled data also, not just from labeled data. This makes them powerful machine learning tools. The implementation for semisupervised classifiers was downloaded from [52][53][54].
Classifiers were tested first with all features described in Section 2.2.5 taken as input, but then also by selecting subsets of the available features as input, that is, only the features pertaining to a single modality (perfusion, diffusion, and spectroscopy). Additionally, classifiers were tested first on the smaller dataset containing 45 time points with a complete set of features and then on the larger dataset containing 178 time points where missing values have been imputed according to Section 2.4, presented below.

In-House Imputation
Method. Some classifiers have built-in strategies of handling missing values, but other classifiers do not handle missing values (see Table 1). This is why we developed our own in-house imputation method, so the handling of missing values will be the same for all classifiers.
Our method is based on the volumes of contrast enhancement and oedema regions, in the sense that if the volume of a tumour region is zero, that missing tissue is considered healthy tissue. If we have values of any modality (perfusion, diffusion, and spectroscopy) that are missing from CE or ED, and the volume of CE or ED corresponding to that measurement is zero, and then we assume that those missing values belong to a normal type of tissue. For perfusion, because we normalize every parameter to the normal appearing white matter value, the missing values will be replaced by 1's. For diffusion and spectroscopy, the missing values will be replaced by the average of the features taken over the measurements which were labeled as responsive, because we consider that these measurements are recorded from a healthy tissue. If we have missing values without association to zero volume for CE or ED, they will be replaced by the average taken over all the labeled measurements.

Performance Indices
Leave One Patient Out (LOPO). Classifiers are trained on labeled data from all patients except one who is the test patient. Each patient in turn is selected as test patient. All time points that belong to the test patient are classified independently. Results for each classifier are averaged per time point over all patients relative to the time point at which the clinical decision was made.
This way of testing is intuitive from a medical point of view and provides us with information about how good is the classification when we approach the decision time. In this way we can look at the temporal evolution of the classification for each patient.
We compute the balanced error rate (BER) at each time point before and after the decision, using the clinical decision assigned to each patient as expected label for all time points of this patient. BER is computed as For each classifier we have a grand total of 17 time points, due to the fact that there are patients with up to 6 time points after the decision time point and there are others with up to 11 time points before the decision. In order to compare the classifiers by using just one error number instead of 17, we compute a weighted average for each classifier's time response. This performance measurement is denoted by "weighted BER (wBER)" in the Results section.
We use two sets of weights: (i) one for the temporal response-the classifier should perform better when we approach the labeling time point and after it:      (ii) one for patient population-the time points with more patients get a higher weight (see Table 14 from the Appendix): The equation of wBER is Table 7 from the Appendix shows how different classifiers perform on complete and on imputed features when using all MR modalities. We selected the best 6 classifiers (marked by bold font in Table 7) and present their detailed BER results for each time point in Table 2. Table 3 shows the detailed BER results for each time point for the best 6 classifiers (marked by bold font in Table 7) when using data with imputed features. Table 4 shows how the best 6 supervised classifiers (marked by bold font in Table 7) perform on complete features when using each MR modality separately. Tables 8,9, and 10 from the Appendix list the performance of the best supervised classifiers (marked by bold font in Table 7) when using, respectively, perfusion, diffusion, or spectroscopy data separately, considering complete features only.  Table 5 shows how the best 6 classifiers (marked by bold font in Table 7) perform on imputed features when using each MR modality separately. Tables 11,12, and 13 from the Appendix list the performance of the best supervised classifiers (marked by bold font in Table 7) when using, respectively, perfusion, diffusion, or spectroscopy data separately, considering imputed features only. 8 BioMed Research International Table 8: Detailed BER results for each time point for the best 6 supervised classifiers when using the leave-one-patient-out method on complete perfusion features. The decision moment marked by bold font. Some time points do not have results because there were no complete perfusion measurements.

In-House Imputation Strategy versus Built-In Imputation
Strategy. Table 6 shows how different classifiers perform with our in-house imputation of missing values (Section 2.4) versus the built-in imputation strategy of missing values for the classifiers marked in Table 1.

Discussion.
A first conclusion that we can draw from a comparative analysis of the different classifiers is that we obtain the lowest error when training classifiers on data with complete features and not on data with imputed features, no matter the imputation method (our in-house method or the built-in method). In order to improve the performance of classifiers, improving the quality of the data would help. The lowest error when using complete features is around 0.14 (SVM-mlp-0.136), while if we use imputed features the lowest error is 0.216 (dLDA). The best classifiers on complete   If we compare the results of single MR modalities when training classifiers on data with complete features, we can say that the use of spectroscopy only leads to the worst results with a minimum error of 0.561. The single use of perfusion generates better results than using only diffusion data, especially when using ensemble methods (random forests, LogitBoost, and RobusBoost), with a minimum error of 0.148 compared to 0.255. When using imputed features, the minimum error almost doubles.
An interesting aspect when looking at detailed measurements on complete features (Table 2) is the fact that we have error equal to zero (perfect classification), one time point before the actual labeling according to the RANO criteria, when using random forests, LogitBoost, or RobustBoost. This means that we can predict the patient outcome (progressive, responsive) with 100% accuracy one time point (i.e., about 1 month in our study) earlier than the actual clinical decision was made. When looking at each MR modality separately (Tables 8, 9, and 10) we notice that the same result could have been obtained by using solely the perfusion data. This is a very important finding, mainly because perfusion is very fast to   [56]) prove that perfusion parameters are strongly correlated with tumour progression and overall survival. The main reason behind this strong correlation is the fact that tumours grow very fast, so they require large amounts of nutrients to develop, which is reflected in the angiogenesis of the tumour. This increase in angiogenesis is visualised and measured using perfusion imaging.
When comparing the two methods of imputing missing values, our in-house method (Section 2.4), and the classifierdependent built-in strategies, the difference between them is not important with respect to the performance of the classifiers.
Using machine learning for classification of brain tumoral tissue is a field with an increasing amount of work.
In [57] Hu et al. use a support vector machine approach on multiparametric MRI (perfusion, diffusion, and anatomical MRI) to automatically differentiate between radiation necrosis voxels and progressive tumour voxels coming from patients with resected GBM. They optimize a one-class SVM based on the area under receiver operator curve from 6000 training voxels manually delineated from 8 patients and then tested on manually delineated voxels from 8 new patients.
Their results show that perfusion and diffusion have a high discrimination rate between radiation necrosis and tumour progression.
In [55] Barajas Jr. et al. use perfusion MR imaging to investigate which parameters can be used to differentiate between recurrent GBM and radiation necrosis. Their study was based on 57 patients, they used Welch test to compare measurements between groups, and they found that all perfusion parameters (relative CBV, peak height, and percentage of signal intensity recovery) are strongly correlated with tumour progression.
In [56] Hu et al. use perfusion metrics on contrast enhancement lesions (CBV mean, mode, maximum, width, and a new thresholding metric called fractional tumor burden (FTB)) to see how they correlate to overall survival (OS). Their study was based on 25 patients with recurrent GBM and found that all parameters are strongly correlated with OS.
In [58] Weybright et al. used chemical shift imaging (CSI) to differentiate voxels with tumour recurrence and radiation injury. Their study was based on 29 patients and they had high quality data for 28 of them (97%). They found that the Cho/NAA and Cho/Cr ratios may be the best numerical discriminators between tumour recurrence and radiation injury.
Although we cannot compare our results directly to the ones from the studies presented before due to different approaches on classifying different tissues, it is becoming more obvious that a learning algorithm based on multiparametric MR data will evolve in the near future and will help clinicians in differentiating between progressive tumoral tissue and other types (necrotic or normal).

Conclusions
In this paper we compare different supervised and semisupervised classifiers. We train them on multiparametric MR data with complete and imputed features. The data was acquired from 29 patients selected from follow-up studies of GBM. We investigate the leave-one-patient-out testing method and come to the conclusion that the same label according to the RANO criteria could have been put earlier with at least one month with 100% accuracy, if we train random forests, LogitBoost, or RobustBoost on data with complete features. More interesting is the fact that the same result is achieved by the same classifiers using only complete perfusion data.
For future work we plan on using the temporal evolution of the features when classifying different MR sessions and also allow updating the class labels in time. Moreover, we are going to try new methods of processing the raw MR data to improve the quality of it.