
Functional magnetic resonance imaging (fMRI) exploits blood-oxygen-level-dependent (BOLD) contrast to map neural activity associated with a variety of brain functions, including sensory processing, motor control, and cognitive and emotional functions. The general linear model (GLM) approach is used to reveal task-related brain areas by searching for linear correlations between the fMRI time course and a reference model. One of the limitations of the GLM approach is the assumption that the covariance across neighbouring voxels is not informative about the cognitive function under examination. Multivoxel pattern analysis (MVPA) represents a promising technique that is currently exploited to investigate the information contained in distributed patterns of neural activity and to infer the functional role of brain areas and networks. MVPA is treated as a supervised classification problem in which a classifier attempts to capture the relationships between spatial patterns of fMRI activity and experimental conditions. In this paper, we review MVPA and describe the mathematical basis of the classification algorithms used for decoding fMRI signals, such as support vector machines (SVMs). In addition, we describe the workflow of processing steps required for MVPA, such as feature selection, dimensionality reduction, cross-validation, and classifier performance estimation based on receiver operating characteristic (ROC) curves.

Functional magnetic resonance imaging (fMRI) exploits blood-oxygen-level-dependent (BOLD) contrast to map neural activity associated with a variety of brain functions, including sensory processing, motor control, and cognitive and emotional functions.

The GLM is normally expressed in matrix formulation as Y = Xβ + ε, where Y is the measured fMRI time course of a voxel, X is the design matrix encoding the experimental conditions, β is the vector of model parameters, and ε is the residual noise.

The parameter estimates of the model, which we denote as β̂, are obtained by ordinary least squares: β̂ = (XᵀX)⁻¹XᵀY.

To generalize this argument, we consider linear functions of the beta estimates, called contrasts: cᵀβ̂, where c is a vector of contrast weights specifying the combination of conditions to be tested.

To test whether the condition combinations specified in c elicit a significant response, a t-statistic is computed as t = cᵀβ̂ / √(σ̂² cᵀ(XᵀX)⁻¹c), where σ̂² is the residual variance, and the resulting statistical map is thresholded.

Thresholded statistical map overlaid on anatomical image.
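To make this concrete, the single-voxel GLM fit and contrast t-test can be sketched in Python with NumPy. The design matrix and BOLD time course below are simulated for illustration only; they are not real fMRI data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design: 100 scans, two task regressors plus a constant term
n = 100
X = np.column_stack([
    rng.random(n),   # regressor for condition A (simulated)
    rng.random(n),   # regressor for condition B (simulated)
    np.ones(n),      # constant (baseline)
])
beta_true = np.array([2.0, 0.5, 1.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)  # noisy "BOLD" time course

# Ordinary least-squares estimates: beta_hat = (X'X)^(-1) X'Y
XtX_inv = np.linalg.pinv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Contrast c'beta_hat testing condition A > condition B
c = np.array([1.0, -1.0, 0.0])
residuals = Y - X @ beta_hat
dof = n - np.linalg.matrix_rank(X)
sigma2 = residuals @ residuals / dof               # residual variance
t_stat = (c @ beta_hat) / np.sqrt(sigma2 * (c @ XtX_inv @ c))
print(beta_hat, t_stat)
```

With the simulated effect sizes above, β̂ recovers the true parameters closely and the contrast yields a clearly significant t-value.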

One of the limitations of the GLM mass-univariate approach is the assumption that the covariance across neighbouring voxels is not informative about the cognitive function under examination. Such covariance is treated as uncorrelated noise and is normally reduced using spatial filters that smooth the BOLD signal across neighbouring voxels. Additionally, the GLM approach is inevitably limited by the model used for statistical inference.

Multivariate and model-free fMRI methods represent promising techniques to overcome these limitations by investigating the functional role of distributed patterns of neural activity without assuming a specific model. Multivariate model-free methods are based on machine learning and pattern recognition algorithms. Nowadays, multivoxel pattern analysis (MVPA) has become a leading technique in the analysis of neuroimaging data, and it has been extensively used to identify the neural substrates of cognitive functions ranging from visual perception to memory processing.

The aim of the current paper is to review the mathematical formalism underlying MVPA of fMRI data within the framework of supervised classification tools. We will review the statistical tools currently used and outline the steps required to perform multivariate analysis.

Multivoxel pattern analysis (MVPA) involves searching for highly reproducible spatial patterns of activity that differentiate across experimental conditions. MVPA is therefore treated as a supervised classification problem in which a classifier attempts to capture the relationships between spatial patterns of fMRI activity and experimental conditions.

More generally, classification consists in determining a decision function f that takes the feature values of a new example x and predicts the class of x.

To obtain the decision function f, the classifier is trained on a set of examples whose class labels are known (the training set).

Support vector machines (SVMs) are among the most widely used supervised classifiers for decoding fMRI signals.

In the simplest linear form of SVMs for two classes, the goal is to estimate a decision boundary (a hyperplane) that separates with maximum margin a set of positive examples from a set of negative examples (see the figure below).

2D space illustration of the decision boundary of the support vector machine (SVM) linear classifier. (a) The hard margin on linearly separable examples, where no training errors are permitted. (b) The soft margin, where two training errors are tolerated to handle data that are not linearly separable. The dotted examples are called the support vectors (they determine the margin by which the two classes are separated).

If we assume that the data are linearly separable, meaning that we can draw a hyperplane in feature space that separates the two classes without error, the decision function can be written as f(x) = sign(wᵀx + b), where w is the weight vector normal to the hyperplane and b a bias term.

SVM attempts to find the optimal hyperplane wᵀx + b = 0, that is, the one that maximizes the margin 2/‖w‖; equivalently, it minimizes ‖w‖²/2 subject to yᵢ(wᵀxᵢ + b) ≥ 1 for every training example (xᵢ, yᵢ).

However, in practice, data are often not linearly separable. To permit training errors and thereby improve classifier performance, slack variables ξᵢ ≥ 0 are introduced, relaxing the constraints to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ.

To control the trade-off between the hyperplane complexity and training errors, a penalty factor C is introduced, and the objective becomes the minimization of ‖w‖²/2 + C Σᵢ ξᵢ.

Effect of the penalty factor C on the decision boundary.
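The effect of C can be illustrated with scikit-learn's linear SVM on simulated two-class data (the toy "activity patterns" below are assumptions for illustration): a small C tolerates training errors and yields a wide margin, while a large C penalizes errors heavily and shrinks the margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two simulated classes of 2D "activity patterns" (illustrative only)
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(40, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(40, 2))])
y = np.array([1] * 40 + [-1] * 40)

# Fit a linear SVM with a small and a large penalty factor C
margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margins[C] = 2.0 / np.linalg.norm(w)   # geometric margin width 2/||w||
print(margins)
```

The margin width for C = 0.01 comes out considerably larger than for C = 100, matching the soft-margin trade-off described above.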

To solve this constrained optimization problem, the method of Lagrange multipliers is used.

Let αᵢ ≥ 0 and μᵢ ≥ 0 denote the Lagrange multipliers associated with the margin constraints and the slack constraints, respectively.

The Lagrangian is L(w, b, ξ, α, μ) = ‖w‖²/2 + C Σᵢ ξᵢ − Σᵢ αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ] − Σᵢ μᵢξᵢ. Setting its derivatives with respect to w, b, and ξᵢ to zero yields w = Σᵢ αᵢyᵢxᵢ, Σᵢ αᵢyᵢ = 0, and αᵢ = C − μᵢ.

Substituting the above results into the Lagrangian, we obtain the dual problem: maximize W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ) subject to 0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.

According to Lagrange theory, in order to obtain the optimum it is enough to maximize W(α) with respect to α; the training examples with αᵢ > 0 are called the support vectors.

Because this dual problem is a convex quadratic programming problem, it can be solved with standard optimization routines, and the solution found is guaranteed to be a global optimum.

In practice, most fMRI experimenters use linear SVMs because they produce linear boundaries in the original feature space, which makes the interpretation of their results straightforward. Indeed, in this case, examining the weight maps directly allows the identification of the most discriminative features.

Nonlinear SVMs are often used for discrimination problems when the data are not linearly separable. Vectors are mapped to a high-dimensional feature space using a nonlinear function Φ, and a separating hyperplane is then sought in that space.

In nonlinear SVMs, the decision function is based on the hyperplane in the mapped space: f(x) = sign(Σᵢ αᵢyᵢΦ(xᵢ)ᵀΦ(x) + b).

A mathematical tool known as the “kernel trick” can be applied to this equation, which depends solely on dot products between vectors: it allows a nonlinear operator to be written as a linear one in a space of higher dimension. In practice, the dot product is replaced by a “kernel function” K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ), so that the mapping Φ never needs to be computed explicitly.

Several types of kernels can be used in SVMs models. The most common kernels are polynomial kernels and radial basis functions (RBFs).

The polynomial kernel is defined by K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ, where d is the degree of the polynomial and c a constant.

The

Decision boundary with polynomial kernel.

The radial basis function (RBF) kernel is defined by K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²), where the parameter γ controls the width of the Gaussian.

Decision boundary with RBF kernel.
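The two kernels above reduce to a few lines of NumPy; a minimal sketch (the sample vectors are arbitrary):

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    # K(x, z) = (x.z + coef0)^degree
    return (np.dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(polynomial_kernel(x, z))  # (1*2 + 2*0 + 1)^2 = 9.0
print(rbf_kernel(x, z))         # exp(-0.5 * 5) ≈ 0.082
```

Note that the RBF kernel always equals 1 when x = z, and decays toward 0 as the two vectors move apart, at a rate set by γ.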

In the fMRI domain, although nonlinear transformations sometimes provide higher prediction performance, their use limits the interpretation of the results when the feature weights are transformed back to the input space.

Although SVMs are efficient at dealing with large high-dimensional datasets, they are, like many other classifiers, affected by preprocessing steps such as spatial smoothing, temporal detrending, and motion correction. LaConte et al. investigated the influence of such preprocessing choices on SVM classification performance.

On the other hand, classifier performance can be improved by reducing the data dimensionality or by selecting a set of discriminative features. Decoding performance has been found to increase when dimensionality is reduced with the recursive feature elimination (RFE) algorithm.
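RFE repeatedly trains a classifier and discards the least-weighted features. A minimal sketch with scikit-learn's `RFE` and a linear SVM, on simulated data in which only the first two of twenty "voxels" carry signal:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# 200 simulated samples of 20 "voxels"; only voxels 0 and 1 are informative
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# RFE repeatedly fits a linear SVM and drops the lowest-weight feature
selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
selected = np.where(selector.support_)[0]
print(selected)
```

On this toy problem, RFE recovers the two informative voxels and discards the eighteen noise features.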

Other studies have compared classifiers in terms of performance or execution time. Cox and Savoy compared linear and nonlinear classifiers for decoding visual object categories from fMRI data.

When dealing with single-subject univariate analysis, features may be created from the maps estimated using a GLM. A typical feature vector will consist of the pattern of parameter estimates (beta or t values) across voxels for each experimental condition.

Several studies have demonstrated the relevance of feature selection. Pearson's correlation coefficient and Kendall's τ have been used to rank voxels and retain the most informative ones.

More recently, novel techniques have been developed to find informative features while ignoring uninformative sources of noise, such as principal component analysis (PCA) and independent component analysis (ICA).
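As an illustration of dimensionality reduction, PCA can project a high-dimensional voxel pattern onto a few components that capture most of the variance. The latent-source structure below is simulated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 50 simulated samples of 100 "voxels" driven by 3 latent sources plus noise
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + 0.1 * rng.normal(size=(50, 100))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)   # low-dimensional features for a classifier
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

Here three components capture nearly all of the variance, so a classifier trained on `X_reduced` works with 3 features instead of 100.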

It is worth mentioning that feature selection can be improved by the use of cross-validation (see the cross-validation section below).

Multivariate classification methods are used to identify whether the fMRI signals from a given set of voxels contain a dissociable pattern of activity according to the experimental manipulation. One option is to analyze the pattern of activity across all brain voxels. In such a case, the number of voxels exceeds the number of training patterns, which makes the classification computationally expensive.

A typical approach is to make assumptions about the anatomical regions of interest (ROIs) suspected to be correlated with the task.

An alternative is to select fewer voxels (e.g., those within a sphere centred at a given voxel) and repeat the analysis at every voxel in the brain. This method, introduced by Kriegeskorte et al., is known as the “searchlight” approach.

2D illustration of the “searchlight” method on simulated activation maps.
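A searchlight sketch on a toy one-dimensional "brain" (the geometry and signal placement below are assumptions for illustration; a real searchlight uses 3D spheres of voxels): a local classifier is trained around each voxel, and its cross-validated accuracy is assigned to that voxel.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Toy 1D "brain" of 30 voxels, 80 samples; voxels 10-14 carry the signal
n_samples, n_voxels = 80, 30
y = rng.integers(0, 2, n_samples)
X = rng.normal(size=(n_samples, n_voxels))
X[:, 10:15] += 2 * y[:, None] - 1   # add class-dependent signal

radius = 2
scores = np.zeros(n_voxels)
for v in range(n_voxels):
    # The "sphere" is here a 1D window of voxels centred at v
    lo, hi = max(0, v - radius), min(n_voxels, v + radius + 1)
    scores[v] = cross_val_score(SVC(kernel="linear"),
                                X[:, lo:hi], y, cv=4).mean()
print(scores.round(2))
```

The resulting accuracy map peaks over the informative voxels and stays near chance (0.5) elsewhere.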

More recently, Björnsdotter et al. proposed a Monte Carlo approximation of the searchlight that reduces its computational cost.

Illustration of the Monte Carlo fMRI brain mapping method for one voxel (in black). Instead of centering the search volume (dashed-line circle) at the voxel, as in the searchlight method, and computing a single performance for it, the voxel is here included in five different constellations with other neighboring voxels (dark gray). In each constellation, a classification performance is computed. In the end, the average performance across all the constellations is assigned to the black voxel.

To ensure unbiased testing, the data must be split into two sets: a training set and a test set. In addition, it is generally recommended to choose a larger training set in order to enhance classifier convergence. Indeed, the performance of the learned classifier depends on how the original data are partitioned into training and test sets and, most critically, on their sizes. In other words, the more instances we leave for testing, the fewer samples remain for training, and hence the less accurate the classifier becomes. On the other hand, a classifier that explains one set of data well does not necessarily generalize to other sets of data, even if they are drawn from the same distribution. In fact, an excessively complex classifier will tend to overfit (i.e., it will fail to generalize to unseen examples). This may occur, for example, when the number of features is too large with respect to the number of examples (i.e., the so-called curse of dimensionality).

Mean accuracy after 4-fold cross-validation to classify the data shown in Figure

In

Leave-one-run-out cross-validation (LORO-CV) and leave-one-sample-out cross-validation (LOSO-CV). A classifier is trained on the training set (in blue) and then tested on the test set (in red) to obtain a performance estimate. This procedure is repeated for each run in LORO-CV and for each sample in LOSO-CV, and the performances are averaged.
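The two schemes can be sketched with scikit-learn's cross-validation utilities. The simulated data below stand in for fMRI samples; `KFold` plays the role of a run-wise split, and `LeaveOneOut` implements LOSO-CV:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)   # class defined by one simulated feature

clf = SVC(kernel="linear")
# 4-fold CV: train on 3/4 of the samples, test on the held-out 1/4, 4 times
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
acc_kfold = cross_val_score(clf, X, y, cv=kfold).mean()
# Leave-one-sample-out: train on n-1 samples, test on the remaining one
acc_loso = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(acc_kfold, acc_loso)
```

In real fMRI analyses, samples from the same run are statistically dependent, so run-wise splitting (LORO-CV) is usually preferred over fully random folds.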

Machine-learning algorithms come with several parameters that can modify their behavior and performance. Evaluation of a learned model is traditionally performed by maximizing an accuracy metric. Considering a basic two-class classification problem, let TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Metrics extracted from the receiver operating characteristic (ROC) curve can be a good alternative for model evaluation, because they allow the dissociation of errors on positive and negative examples. The ROC curve is formed by plotting the true positive rate (TPR) against the false positive rate (FPR), both defined from these counts: TPR = TP/(TP + FN) and FPR = FP/(FP + TN).

Any point on the ROC curve corresponds to the (FPR; TPR) pair obtained at one particular decision threshold.

Generally, the classifier’s output is a continuous numeric value. The decision rule is performed by selecting a decision threshold which separates the positive and negative classes. Most of the time, this threshold is set regardless of the class distribution of the data. However, given that the optimal threshold for a class distribution may vary over a large range of values, a pair (FPR; TPR) is thus obtained at each threshold value. Hence, by varying this threshold value, an ROC curve is produced.

The figure below shows a typical ROC curve: a perfect classifier passes through the upper-left corner (FPR = 0, TPR = 1), while chance-level performance lies along the diagonal.

ROC curve representation.

In order to assess different classifiers' performances, one generally uses the area under the ROC curve (AUC) as an evaluation criterion.

Computing the AUC would require the evaluation of an integral in the continuous case; in the discrete case, however, the area is given by the trapezoidal sum AUC = Σᵢ (FPRᵢ₊₁ − FPRᵢ)(TPRᵢ₊₁ + TPRᵢ)/2 over the points of the curve.

SVMs can be used as classifiers that output a continuous numeric value in order to plot the ROC curve. In fact, in standard SVM implementations, the continuous output f(x) = wᵀx + b is thresholded at zero to produce the class label; using the raw decision value instead allows the threshold to be varied.

Step 2.1. compute FPR and TPR at the current threshold value.

Step 2.2. plot the corresponding point (FPR; TPR).
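The threshold sweep and the discrete AUC can be sketched in plain NumPy; the continuous scores below stand in for hypothetical SVM decision values:

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    order = np.argsort(-np.asarray(scores))        # sort scores descending
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels == 1) / np.sum(labels == 1)
    fpr = np.cumsum(labels == 0) / np.sum(labels == 0)
    # Prepend the (0, 0) point corresponding to the strictest threshold
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    # Sum of trapezoids under the discrete ROC curve
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical continuous classifier outputs (e.g., SVM decision values)
scores = [2.1, 1.3, 0.4, -0.2, -0.8, -1.5]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
print(auc(fpr, tpr))  # 8/9 ≈ 0.889
```

The trapezoidal AUC agrees with the rank interpretation of the AUC: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (here, 8 of the 9 positive/negative pairs are correctly ordered).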

We performed this procedure on the simulated data used for the searchlight analysis. However, the data were unbalanced in order to show the threshold effect (we used four runs, each containing 30 examples: 10 for one condition and 20 for the other).

ROC analysis of the unbalanced simulated data.

A last point worth mentioning is that the classifier performance measures its ability to generalize to unseen data under the assumption that training and test examples are drawn from the same distribution. However, this assumption could be violated when using cross-validation.

Nonparametric permutation test analysis was introduced in functional neuroimaging studies to provide a flexible and intuitive methodology for verifying the validity of the classification results.

Concretely, to verify the null hypothesis that the observed classification performance could have been obtained by chance, the condition labels are randomly permuted a large number of times, the classifier is retrained on each permuted data set, and the observed performance is compared with the resulting null distribution.
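This procedure can be sketched directly (the data are simulated; 100 permutations are used here for speed, whereas real analyses typically use 1,000 or more):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)   # simulated labels carrying a real signal

clf = SVC(kernel="linear")
observed = cross_val_score(clf, X, y, cv=4).mean()

# Null distribution: permuting the labels breaks the pattern/condition link
n_perm = 100
null = np.empty(n_perm)
for i in range(n_perm):
    null[i] = cross_val_score(clf, X, rng.permutation(y), cv=4).mean()

# p-value: fraction of permutations performing at least as well as observed
p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
print(observed, p_value)
```

The `+ 1` terms give the standard conservative permutation p-value, which can never be exactly zero.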

In particular experimental conditions, when the fMRI data exhibit temporal autocorrelation, individual samples are no longer exchangeable, and the permutation scheme must be adapted accordingly (e.g., by permuting whole blocks or runs rather than single samples).

In this paper, we have reviewed how machine-learning classifier analysis can be applied to functional neuroimaging data. We reported the limitations of univariate model-based analysis and presented multivariate model-free analysis as a solution. By reviewing the literature comparing different classifiers, we focused on the support vector machine (SVM), a supervised classifier that can be considered an efficient tool for multivariate pattern analysis (MVPA). We reported the importance of feature selection and dimensionality reduction for the performance of the chosen classifier, and the importance of a cross-validation scheme both for selecting the best classifier parameters and for computing performance. ROC curves provide a more accurate evaluation of classifier performance, while nonparametric permutation tests offer a flexible and intuitive methodology for verifying the validity of classification results.

This work was supported by the Neuromed project, the GDRI Project, and the PEPS Project “GoHaL” funded by the CNRS, France.