Label fusion is used in medical image segmentation to combine several different labels of the same entity into a single discrete label that is potentially more accurate, with respect to the exact, sought segmentation, than the best input element. Using simulated data, we compared three existing label fusion techniques, namely STAPLE, Voting, and Shape-Based Averaging (SBA), and observed that none was consistently superior: the best-performing technique depended on the dissimilarity between the input elements. We thus developed an empirical, hybrid technique called SVS, which selects the most appropriate technique to apply based on this dissimilarity. We evaluated the label fusion strategies on two- and three-dimensional simulated data and showed that SVS is superior to any of the three existing methods examined. On real data, we used SVS to perform fusions of 10 segmentations of the hippocampus and amygdala in 78 subjects from the ICBM dataset. SVS selected SBA in almost all cases, which was the most appropriate method overall.
Label fusion is a process used in medical image segmentation. Its aim is to produce a single, discrete element or
A long-term goal of our research program is to obtain accurate, automated segmentations of neuroanatomical structures, primarily the hippocampus (HC). Our primary motivation stems from our work in Alzheimer’s disease, for which HC volume and atrophy measurements are putative disease markers (see reviews in [
To reach our goal, we thus decided to investigate different fusion processes. To suit our research context, we restricted our analysis to techniques that depend solely on given input labels. We disregarded techniques that depend on intensity images [
Our first objective was to characterize applicable label fusion strategies. The first approach is the Vote method (or sum rule), which has been widely used and described by virtue of its simplicity [
While testing the implementations of these three approaches on simulated data, we observed that the technique whose result was closest to the ground truth varied with the dissimilarity between raters’ input labels, as detailed below. Therefore, the second objective of our study was to propose an empirical, hybrid STAPLE-Vote-SBA (SVS) technique that automatically selects the right label fusion approach based on this dissimilarity.
We report results of comparison tests on the four label fusion methods for simulated two-dimensional (2D) and three-dimensional (3D) data as well as HC and amygdala (AG) labels obtained from magnetic resonance images (MRI). All images used in this study were binary. For the real data, we performed label fusion on HC and AG independently.
Our mathematical notation is as follows. We consider an image of
To evaluate the performance of SVS with respect to STAPLE, Vote, and SBA, we used 2D and 3D simulated data as well as real data.
We created two simulated 2D data sets: one for training SVS and one for testing the label fusion approaches. The SVS version trained with 2D data is hereafter referred to as SVS-2D.
The data consisted of multiple binary images created from a ground-truth object, shown in Figure
(a, b, and c) 2D and (d, e, and f) 3D simulated images showing the ground truth (a, d), and images with
We generated individual, simulated rater images by moving the control points of the ground-truth ellipse and reinterpolating with cubic splines. We moved the control points in random directions, following a uniform distribution, with random distances from their original coordinates. The random distance followed a normal distribution of zero mean with a standard deviation adjusted so that it could be modified by a normalized deformation factor
In other words,
For each of the training and testing sets, we created 625 label fusion tests, each consisting of 10 deformed images, for a total of 6,250 images in the training set and 6,250 different images in the testing set. Each test was created by varying
As for the 2D case, we created two simulated 3D sets: one for training SVS and one for testing the label fusion techniques. The SVS version trained with 3D data is hereafter referred to as SVS-3D. An SVS version was also trained with the combination of the 2D and 3D training sets; it is referred to as SVS-2D&3D.
The 3D data consisted of binary volume images created from a ground-truth ellipsoid. To produce the ground truth, we first created a cubic regular grid volume. This volume was then warped along each axis by dividing each voxel coordinate by its corresponding ground-truth ellipsoid radius, creating a warped grid. By applying this warping transformation, the ellipsoidal space became a spherical space. A ground-truth sphere was created by regularly sampling the angles
To produce the ground-truth image, the control points were projected into a Cartesian space with the following axes:
While appearing complex, this process in fact simplified the creation of the deformed ellipsoid images. We randomly moved the control points of the ground-truth sphere along
As for the 2D sets, the random distance followed a normal distribution of zero mean. The standard deviation was adjusted so that it could be modified by
Figures
The real MRI data consisted of intensity images and segmented left and right HC and AG labels of 78 young, neurologically healthy subjects from the ICBM database [
The ground truth consisted of left and right HC and AG manual labels, presented in a previous study [
The labels available for fusion were obtained using a template-based segmentation algorithm [
The next sections present the three existing label fusion strategies that we used in this study: STAPLE, Vote, and SBA. We implemented all label fusion methods, including SVS, in MATLAB (MathWorks, Natick, MA, USA).
It is important to note that all approaches were applied to the disputed pixels/voxels only. Pixels/voxels on whose label all raters agreed were excluded from the fusion and assigned that label automatically. Working with only disputed pixels/voxels sped up computation for all methods and significantly improved the results given by STAPLE (see [
STAPLE is an expectation-maximization (EM) algorithm that iteratively estimates (1) the true segmentation from the raters’ performance (E-step) and (2) the raters’ performance (sensitivity and specificity) from this true segmentation estimate (M-step). We implemented STAPLE following the mathematical description in [
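The alternation between the two steps can be illustrated with a minimal sketch for binary labels. This is not the MATLAB implementation used in the study: the function name `staple_binary`, the fixed foreground prior of 0.5, the initial sensitivity and specificity of 0.9, and the stopping rule are all assumptions made for illustration. Labels are flat lists of 0/1, one per rater; the caller may pass only the disputed pixels/voxels.

```python
def staple_binary(labels, prior=0.5, n_iter=100, tol=1e-8):
    """Illustrative binary STAPLE-style EM (assumed prior and initial values)."""
    n_raters = len(labels)
    n_pixels = len(labels[0])
    sens = [0.9] * n_raters  # assumed initial sensitivity per rater
    spec = [0.9] * n_raters  # assumed initial specificity per rater
    weights = [prior] * n_pixels
    for _ in range(n_iter):
        # E-step: probability that each pixel is foreground, given the
        # current estimates of the raters' performance.
        new_weights = []
        for i in range(n_pixels):
            fg = prior
            bg = 1.0 - prior
            for j in range(n_raters):
                if labels[j][i] == 1:
                    fg *= sens[j]
                    bg *= 1.0 - spec[j]
                else:
                    fg *= 1.0 - sens[j]
                    bg *= spec[j]
            total = fg + bg
            new_weights.append(fg / total if total > 0 else 0.5)
        # M-step: re-estimate each rater's sensitivity and specificity
        # from the current soft segmentation.
        sum_fg = sum(new_weights)
        sum_bg = n_pixels - sum_fg
        for j in range(n_raters):
            if sum_fg > 0:
                sens[j] = sum(w for w, d in zip(new_weights, labels[j]) if d == 1) / sum_fg
            if sum_bg > 0:
                spec[j] = sum(1.0 - w for w, d in zip(new_weights, labels[j]) if d == 0) / sum_bg
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        if change < tol:
            break
    fused = [1 if w >= 0.5 else 0 for w in weights]
    return fused, sens, spec


# Two raters agree everywhere; the third disagrees on two pixels.
raters = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
]
fused, sens, spec = staple_binary(raters)
```

In this toy case the dissenting rater ends up with lower estimated sensitivity and specificity than the other two, so its votes count for less in the final estimate.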
The Vote method consists of summing for each pixel/voxel
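As a rough illustration, a vote over binary labels can be sketched as below. The function name `vote_fusion`, the strict-majority threshold, and the rule that ties go to background are assumptions, not details taken from the study. Unanimous pixels are assigned directly, mirroring the handling of undisputed pixels/voxels described earlier.

```python
def vote_fusion(labels):
    """Fuse binary labels (flat lists of 0/1, one per rater) by majority vote."""
    n_raters = len(labels)
    n_pixels = len(labels[0])
    fused = []
    for i in range(n_pixels):
        votes = sum(rater[i] for rater in labels)
        if votes == 0 or votes == n_raters:
            # Unanimous pixel: assign the agreed label directly.
            fused.append(labels[0][i])
        else:
            # Disputed pixel: strict majority wins; ties go to background
            # (an assumed tie-breaking rule).
            fused.append(1 if 2 * votes > n_raters else 0)
    return fused


raters = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
fused = vote_fusion(raters)
```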
SBA is a voting scheme where each vote is weighted by the signed Euclidean distance computed for each input label. In this study, SBA is the only method that incorporates spatial information in the label fusion process. We implemented this method following the mathematical description in [
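The idea can be sketched for small binary 2D images as follows. The brute-force distance computation, the sign convention (negative inside the object, positive outside), the pixel-to-nearest-opposite-pixel distance definition, and the function names are assumptions chosen for brevity; they are not taken from the implementation used in the study.

```python
import math


def signed_distance(mask):
    """Brute-force signed Euclidean distance map of a binary 2D mask.

    Assumed convention: negative inside the object, positive outside,
    measured to the nearest pixel of the opposite class.
    """
    rows, cols = len(mask), len(mask[0])
    fg = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c] == 1]
    bg = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c] == 0]
    dist = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            targets, sign = (bg, -1.0) if mask[r][c] == 1 else (fg, 1.0)
            if targets:
                d = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            else:
                d = float(rows + cols)  # no opposite class: use a large distance
            dist[r][c] = sign * d
    return dist


def sba_fusion(masks):
    """Label a pixel as foreground where the summed signed distance is negative."""
    maps = [signed_distance(m) for m in masks]
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[r][c] for m in maps) < 0 else 0 for c in range(cols)]
        for r in range(rows)
    ]


masks = [
    [[0, 1, 1, 1, 0, 0]],
    [[0, 0, 1, 1, 1, 0]],
    [[0, 1, 1, 1, 1, 0]],
]
fused = sba_fusion(masks)
```

Unlike a plain vote, a rater whose boundary lies far from a pixel contributes a larger magnitude there, which is how spatial information enters the fusion.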
SVS is a strategy that selects the most appropriate method among STAPLE, Vote, and SBA, based solely on the input labels and their dissimilarity. We point out that SVS is not limited to these three label fusion methods. It could easily be extended to include further methods.
We developed SVS after observing, during our simulations, that the performance of STAPLE, Vote, and SBA was dependent on the distribution of
Scatter plots showing
We note that
As can be seen, none of STAPLE, Vote, and SBA can be considered superior to the others. The choice of the best method seems to depend on the distribution of
These observations thus suggested that
The measures
We overcame this problem by using the following scheme. For
We then computed, for each rater
For each estimated rater’s probability
This last equation corresponds to a cumulative sum of the upper half of the probability mass function of a binomial distribution. In this study, we used
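For readers who want to see what such a sum looks like numerically, the sketch below adds up the upper half of a binomial probability mass function. The function name `upper_half_binomial`, the parameter names `p` and `n`, and the choice of a strict-majority lower bound (k from floor(n/2) + 1 to n) are assumptions; the exact summation limits used in the study are not reproduced here.

```python
from math import comb


def upper_half_binomial(p, n):
    """Sum the binomial pmf over k = floor(n/2) + 1, ..., n (assumed bounds)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# With n = 10 and p = 0.5, this is the probability of 6 or more successes.
value = upper_half_binomial(0.5, 10)
```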
From (
To estimate
We then summed
From
In Figure
(a, c)
To perform its selection, SVS finds a score
For each label fusion test
After performing label fusion with STAPLE, Vote, and SBA, we first summed, for each label fusion method
Following the last two steps of the training procedure, we had, for each test
This procedure was performed for each of the 2D and 3D training data sets, as well as for the combination of both sets, resulting in three versions of SVS: SVS-2D (trained with 2D data), SVS-3D (trained with 3D data), and SVS-2D&3D (trained with 2D and 3D data). We note that using this scheme, other label fusion methods could be incorporated in SVS, increasing only the number of scoring functions
Figure
(Top) Scoring surface functions in the space
Panels: SVS-2D, SVS-3D, and SVS-2D&3D.
We can now describe the SVS method as follows.
(1) Compute the dissimilarity coefficient
(2) Find the score for each label fusion method using its corresponding scoring surface function.
(3) Select the label fusion method corresponding to the highest score.
If two or more scores are equal, which does not imply identical label fusions, a weighted vote “meta fusion” of the label fusion results obtained with STAPLE, Vote, and SBA is performed using the scores as weights. In practice, this situation is uncommon. We point out that, besides the SVS versions presented here, this “meta fusion” approach, i.e., performing a label fusion of the STAPLE, Vote, and SBA results, has also been tested (results not presented), using each of STAPLE, Vote, and SBA as the “meta fusion” method, with and without score weights for the latter two methods. However, no “meta fusion” outperformed the versions of SVS presented in this study.
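The selection and tie-handling logic can be sketched as follows. Everything named here is hypothetical: `svs_fuse`, the `features` tuple standing in for the dissimilarity measure(s), the toy scoring functions (placeholders for the trained scoring surfaces), and the three stand-in fusion callables, which in practice would be STAPLE, Vote, and SBA. Labels are flat lists of 0/1.

```python
def svs_fuse(labels, features, scoring_functions, fusion_methods):
    """Pick the highest-scoring method; on ties, do a score-weighted vote."""
    scores = {name: score(features) for name, score in scoring_functions.items()}
    best = max(scores.values())
    winners = [name for name, s in scores.items() if s == best]
    if len(winners) == 1:
        return winners[0], fusion_methods[winners[0]](labels)
    # Tie: weighted vote over the results of all methods, scores as weights.
    results = {name: fuse(labels) for name, fuse in fusion_methods.items()}
    total = sum(scores.values())
    fused = []
    for i in range(len(labels[0])):
        weight_fg = sum(scores[name] for name in results if results[name][i] == 1)
        fused.append(1 if 2 * weight_fg > total else 0)
    return "meta", fused


# Hypothetical stand-ins for the three fusion methods.
def majority(labels):
    return [1 if 2 * sum(col) > len(labels) else 0 for col in zip(*labels)]

def union(labels):
    return [1 if any(col) else 0 for col in zip(*labels)]

def intersection(labels):
    return [1 if all(col) else 0 for col in zip(*labels)]

methods = {"method_a": majority, "method_b": union, "method_c": intersection}

# Toy scoring functions, not the trained scoring surfaces.
scoring = {
    "method_a": lambda f: 1.0 - f[0],
    "method_b": lambda f: f[0],
    "method_c": lambda f: 0.5,
}

labels = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
selected, fused = svs_fuse(labels, (0.2,), scoring, methods)
```

Adding a further fusion method under this scheme only requires one more entry in each dictionary, which matches the extensibility noted in the text.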
We also point out that
To measure the performance of the label fusion techniques, we computed
To further characterize our testing sets and ensure the deformation factor
(a, d)
The three existing techniques (STAPLE, Vote, and SBA) as well as the three versions of SVS (SVS-2D, SVS-3D, and SVS-2D&3D) were used to perform the label fusion of the 10 images of each of the 625 tests of the 2D testing set. Figures
Boxplots of (A, C, E, and G)
Scatter plots of the
The experiment described in the last section was also performed on the 3D testing set.
Figure
Boxplots of (A)–(D)
Scatter plots of (a, b)
We showed on a large set of different simulated data that the label fusion method giving the label closest to the ground truth varied with the dissimilarity among the raters.
Regarding robustness, we showed that SVS outperformed any single method among STAPLE, Vote, and SBA, regardless of the training set. Applying SVS-2D (trained with 2D data) and SVS-3D (trained with 3D data) to 3D and 2D data, respectively, we still obtained better performance than STAPLE, Vote, and SBA. In effect, the three versions of SVS showed similar results, explained by similar selection regions (Figure
We also demonstrated that with real data, Vote was not necessarily the method of choice; in our study, SBA was better than Vote and STAPLE. To our knowledge, SBA has not been widely used in the literature, and it might have been underestimated.
The first and obvious limitation of the SVS technique is that its performance is bounded above by that of the best technique (STAPLE, Vote, or SBA) in each case.
Secondly, we used
Thirdly, we did not assess the influence of the number and the selection of input labels on the performance of the label fusion strategies. While these two aspects are important, as reported in some studies [
We proposed a method that automatically selects the most appropriate label fusion method based on the dissimilarity of input labels. Overall, with simulated data, the SVS technique performed better than any individual technique among STAPLE, Vote, and SBA. For real data, SVS selected SBA in almost all cases, and SBA was overall superior to STAPLE and Vote.
Alzheimer’s disease
Amygdala
Arbitrary units
Dice similarity coefficient
Hippocampus
Magnetic resonance imaging
Simultaneous Truth and Performance Level Estimation
Shape-Based Averaging
STAPLE-Vote-SBA
Two dimensional
Three dimensional.
The authors thank Dr. J. C. Pruessner and Dr. D. L. Collins (McGill University, Montréal, Canada), and the International Consortium for Brain Mapping for access to label and MRI data. This work was supported by an operating grant from the Ministère du Développement Économique, de l’Innovation, et de l’Exportation du Québec.