Label Fusion Strategy Selection

Label fusion is used in medical image segmentation to combine several different labels of the same entity into a single discrete label, potentially more accurate, with respect to the exact, sought segmentation, than the best input element. Using simulated data, we compared three existing label fusion techniques—STAPLE, Voting, and Shape-Based Averaging (SBA)—and observed that none could be considered superior depending on the dissimilarity between the input elements. We thus developed an empirical, hybrid technique called SVS, which selects the most appropriate technique to apply based on this dissimilarity. We evaluated the label fusion strategies on two- and three-dimensional simulated data and showed that SVS is superior to any of the three existing methods examined. On real data, we used SVS to perform fusions of 10 segmentations of the hippocampus and amygdala in 78 subjects from the ICBM dataset. SVS selected SBA in almost all cases, which was the most appropriate method overall.


Introduction
Label fusion is a process used in medical image segmentation. Its aim is to produce a single, discrete element or label from a combination of multiple independent inputs. The merged result is potentially more accurate, with respect to the exact, sought segmentation, than each individual input due to the reduction of uncorrelated errors. Labels can be obtained by combining inputs from different raters or automated segmentations [1,2].
A long-term goal of our research program is to obtain accurate, automated segmentations of neuroanatomical structures, primarily the hippocampus (HC). Our primary motivation stems from our work in Alzheimer's disease, for which HC volume and atrophy measurements are putative disease markers (see reviews in [3][4][5][6]). Of the multiple HC segmentation approaches available (see [7] for review), novel template-based paradigms propose the use of template libraries [8]. In such approaches, a single label is found by combining multiple individually segmented HC through label fusion [2].
To reach our goal, we thus decided to investigate different fusion processes. To suit our research context, we restricted our analysis to techniques that depend solely on given input labels. We disregarded techniques that depend on intensity images [9,10], since these images may sometimes be unavailable or noisy. We also ignored techniques that depend on object-specific training, i.e. that have geometric or topological prior.
Our first objective was to characterize applicable label fusion strategies. The first approach is the Vote method (or sum rule), which has been widely used and described by virtue of its simplicity [1,9,[11][12][13]. The second is also a wellknown technique called Simultaneous Truth and Performance Level Estimation (STAPLE), initially proposed by Warfield et al. [14,15], and used in a variety of studies [9,16]. The third approach is referred to as Shape-Based Averaging (SBA), which incorporates spatial information [17].
While testing the implementations of these three approaches on simulated data, we observed that the technique with a result closest to the ground truth was not the same depending on the dissimilarity between raters' input labels, as detailed below. Therefore, the second objective of our study was to propose an empirical, hybrid STAPLE-Vote-SBA (SVS) technique that automatically selects the right label fusion approach based on this dissimilarity. We report results of comparison tests on the four label fusion methods for simulated two-dimensional (2D) and three-dimensional (3D) data as well as HC and amygdala (AG) labels obtained from magnetic resonance images (MRI). All images used in this study were binary. For the real data, we performed label fusion on HC and AG independently.

Materials and Methods
2.1. Mathematical Notation. Our mathematical notation is as follows. We consider an image of N pixels or voxels (x = 1, 2, . . . , N) for which K raters (k = 1, 2, . . . , K) each produces a binary label segmentation e k . To each element of e k , i.e. each pixel/voxel x, is assigned a label i (e k (x) = i) equal to 0 or 1, for background and segmented object, respectively. A decision matrix E is formed with all the e k vectors, E = [e 1 e 2 · · · e K ] with size N × K, and fed to a label fusion algorithm to obtain an estimate of the true segmentation T.

Data.
For evaluating the performance of SVS with respect to STAPLE, Vote and SBA, our data consisted of 2D and 3D simulated data as well as real data.

Two-Dimensional (2D) Simulated Data.
We created two simulated 2D data sets: one for training SVS and one for testing the label fusion approaches. The SVS version trained with 2D data is hereafter referred to as SVS-2D.
The data consisted of multiple binary images created from a ground-truth object, shown in Figure 1(a), which was an ellipse geometry defined by eight control points interpolated with cubic splines.
We generated individual, simulated rater images by moving the control points of the ground-truth ellipse and reinterpolating with cubic splines. We moved the control points in random directions, following a uniform distribution, with random distances from their original coordinates. The random distance followed a normal distribution of zero mean with a standard deviation adjusted so that it could be modified by a normalized deformation factor f σ (between 0 and 1) to create images with a relative difference area v D ranging from 0% to 50%, where v D is given by where V TRUTH corresponds to the area in pixels of the ground-truth ellipse. v k|T represents the number of pixels in the image that are different between decision e k of rater k and the ground truth T: In other words, v k|T is the total number of false positives and false negatives with respect to T. Figures 1(b) and 1(c) show two rater images corresponding to v D values of 25% and 50%, respectively.
For each of the training and testing sets, we created 625 label fusion tests, each consisting of 10 deformed images, for a total of 6,250 images in the training set and 6,250 different images in the testing set. Each test was created by varying f σ of the test images according to a given Gaussian distribution. For each test, different mean and standard deviation were used for f σ , ranging both from 0 to 1 with 25 linearly spaced points each, making a total of 625 Gaussian distributions, one for each test. Negative values of f σ and values higher than 1 were clamped to 0 and 1, respectively. We performed the label fusion of the 10 deformed images in each of the 625 tests of the testing set.

2.2.2.
Three-Dimensional (3D) Simulated Data. As for the 2D case, we created two simulated 3D sets: one for training SVS and one for testing the label fusion techniques. The SVS version trained with 3D data is hereafter referred as SVS-3D. An SVS version was also trained with the combination of 2D and 3D training sets. It is referred as SVS-2D&3D.
The 3D data consisted of binary volume images created from a ground-truth ellipsoid. To produce the ground truth, we first created a cubic regular grid volume. This volume was then warped along each axis by dividing each voxel coordinate by its corresponding ground-truth ellipsoid radius, creating a warped grid. By applying this warping transformation, the ellipsoidal space became a spherical space. A ground-truth sphere was created by regularly sampling the angles θ and φ in the spherical-coordinate space (r, θ, φ), giving a set of 26 control points (r c , θ c , φ c ).
To produce the ground-truth image, the control points were projected into a Cartesian space with the following axes: x = θ, y = φ, and z = r. We transformed the warped grid into spherical coordinates (r g , θ g , φ g ) and performed a cubic interpolation of (θ g , φ g ) on (r c , θ c , φ c ) to find r * at each point (θ g , φ g ). For each grid voxel, if r g < r * , the voxel was considered inside the sphere and was labeled accordingly. The warped grid (spherical space) was then unwarped into the regular grid (ellipsoidal space) to give the desired ground-truth ellipsoid image shown in Figure 1(d).
While appearing complex, this process in fact simplified the creation of the deformed ellipsoid images. We randomly moved the control points of the ground-truth sphere along r, modifying r c , reinterpolated to find r * for the warped grid, performed the labeling by thresholding (i.e. r g < r * ), and unwarped the grid to obtain the deformed ellipsoid.
As for the 2D sets, the random distance followed a normal distribution of zero mean. The standard deviation was adjusted so that it could be modified by f σ to create deformed ellipsoids with relative difference in volume, v D (1), ranging between 0% and 50% with respect to the ground truth.
Figures 1(e) and 1(f) show two examples of deformed images with v D of 25% and 50%, respectively. As for the 2D data, we produced a training set and a testing set, each consisting of 625 label fusion tests. Each test was created as previously described and comprised 10 deformed images. Each of the training and testing sets thus consisted of 6,250 images. We performed the label fusion of the 10 deformed images in each of the 625 tests of the testing set.
International Journal of Biomedical Imaging (c, f). White and black surfaces (e, f) represent, respectively, voxels added to or missing from the ground truth. In 2D, the ground truth was an ellipse geometry of radius 1 AU (arbitrary units) along the x-axis and 0.5 AU along the y-axis, consisting of eight control points, located at constantly separated angles, between which the ellipse was interpolated with cubic splines. We then mapped this geometry on a grid of 256 × 256 pixels between −1.5 and 1.5 AU along both the x-and y-axes. In 3D, the ground-truth image was an ellipsoid geometry of radius 1 AU along the x-axis and 0.5 AU along both the y-and z-axes, consisting of 26 control points. See text for construction details. The geometry was mapped in a grid of 64 × 64 × 64 voxels between −1.5 and 1.5 AU along each of the three axes.

Real MRI Data.
The real MRI data consisted of intensity images and segmented left and right HC and AG labels of 78 young, neurologically healthy subjects part of the ICBM database [18]. Subjects were scanned in Montréal (Québec, Canada) on a Philips Gyroscan 1.5T scanner (Philips Medical, Best, Netherlands) using a T1-weighted fast gradient echo sequence (sagittal acquisition, TR = 18 ms, TE = 10 ms, 1-mm 3 voxels, flip angle = 30 • ). The ground truth consisted of left and right HC and AG manual labels, presented in a previous study [19], with a reported intraclass reliability coefficient of 0.900 and 0.925 for interrater and intrarater reliability, respectively.
The labels available for fusion were obtained using a template-based segmentation algorithm [2]. In this approach, each subject's image is compared in turn to a library of other such images; the 10 images with highest match (e.g., highest normalized mutual information) are selected and then nonlinearly aligned with the original subject image. Given that each image in the library has an associated label, inverse warping allows the transfer of label in the original subject's space, where they must be fused to provide a single object. In our dataset, we received 10 labels for each subject, obtained with this technique, for each of the four following regions: left HC, right HC, left AG, and right AG. Label fusions were then performed independently for each region, giving a total of 312 label fusions (78 subjects × 4 regions).
We assessed the performance of the fusions using the manual segmentations as "ground truths".

Label Fusion
Strategies. The next sections present the three existing label fusion strategies that we used in this study: STAPLE, Vote, and SBA. We implemented all label fusion methods, including SVS, in MATLAB (MathWorks, Natick, MA, USA).
It is important to note that all approaches were applied to the disputed pixels/voxels only. Pixels/voxels for which all the raters unanimously agreed on their label were not considered; the label was automatically assigned. Working with only disputed pixels/voxels speeded up computation for all methods and significantly improved the results given by STAPLE (see [16]).

STAPLE. STAPLE is an expectation-maximization
(EM) algorithm that iteratively estimates (1) the true segmentation from the raters' performance (E-step) and (2) the raters' performance (sensitivity and specificity) from this 40 60 of v D , calculated over the input labels for each test. The centered v D corresponds to v D minus the v D evaluated for Vote. We note that σ(v D )/μ(v D ) better discriminates the label fusion methods than σ(v D ). true segmentation estimate (M-step). We implemented STA-PLE following the mathematical description in [20].

2.3.2.
Vote. The Vote method consists of summing for each pixel/voxel x and label i, the occurrences of label i among the raters, and assigning the most occurring label to x.

SBA.
SBA is a voting scheme where each vote is weighted by the signed Euclidean distance computed for each input label. In this study, SBA is the only method that incorporates spatial information in the label fusion process. We implemented this method following the mathematical description in [17].

Label Fusion Strategy Selection: SVS.
SVS is a strategy that selects the most appropriate method among STAPLE, Vote, and SBA, based solely on the input labels and their dissimilarity. We point out that SVS is not limited to these three label fusion methods. It could easily be extended to include further methods.

Experimental Observations.
We developed SVS after observing, during our simulations, that the performance of STAPLE, Vote, and SBA was dependent on the distribution of v D in the input labels of each label fusion test. This can be observed in the scatter plots of Figure 2  μ(v D ) measures how bad the raters are overall. These measures thus describe, in a way, the dissimilarity in the raters' input labels. As can be seen, none of STAPLE, Vote, and SBA can be considered superior to the others. The choice of the best method seems to depend on the distribution of v D . For low values of σ(v D )/μ(v D ), which better discriminates the label fusion methods than σ(v D ), SBA seems better (i.e. with lower values of v D after label fusion), while, for higher values, STAPLE would be a better choice. Focusing on the results with respect to μ(v D ), STAPLE seems better at lower values, and SBA, at higher values. We also observe that in none of the cases does Vote clearly outperform the other methods.
These observations thus suggested that σ(v D )/μ(v D ) and μ(v D ) could be used to determine the appropriate label fusion method.

Dissimilarity Factors.
The measures σ(v D )/μ(v D ) and μ(v D ) cannot be used in practice since the computation of v D depends on v k|T (1) and V TRUTH , and thus requires to know the ground truth, which is what we try to estimate with label fusion. We thus needed to find estimates for σ(v D )/μ(v D ) and μ(v D ).
We overcame this problem by using the following scheme. For v k|T , we first computed the frequency of occurrence f (x, i), between 0 and 1, of each label i for each pixel/ voxel x over all raters: We then computed, for each rater k and each pixel/voxel x, the estimated probability that rater k misclassifies pixel/ voxel x, i.e. that the assigned label was a false positive or a false negative: For each estimated rater's probability p k (x), we then performed a Bernouilli trial with B experiments to compute the probability P k (x) that a majority of B "virtual" raters misclassified pixel/voxel x, according to p k (x): This last equation corresponds to a cumulative sum of the upper half of the probability mass function of a binomial distribution. In this study, we used B = 99 so that i ranged from 50 to 99. An odd number for B was used to separate the binomial probability mass function equally into a lower and an upper part, the latter corresponding to a clear majority.
From (5), we were able to compute an estimate v k of v k|T by summing P k (x) over all pixels/voxels: To estimate V TRUTH , we used (3) in a similar Bernouilli trial approach. For each pixel/voxel x, we computed a probability that a majority of B = 99 "virtual" raters classifies pixel/voxel x as being part of the segmented region, i.e. with label 1, according to f (x, 1): We then summed F(x) over all pixels/voxels to obtain an estimate V of V TRUTH : From v k and V , we defined two empirical factors: the dissimilarity coefficient d c , estimating σ(v D )/μ(v D ), and the dissimilarity ratio d r , estimating μ(v D ). These factors are respectively given by In Figure 3, we demonstrate the performance of these estimates by showing that d c (a, c) and d r (b, d) match, with a quasi-one-to-one relationship, their theoretical values σ(v D )/μ(v D ) and μ(v D ), respectively, for both the 2D (a, b) and 3D (c, d) training sets. (1) For each label fusion test t of a given training set, we computed d c and d r , according to the approach presented in the last section.
(2) After performing label fusion with STAPLE, Vote, and SBA, we first summed, for each label fusion method m and test t, the number of pixels/voxels v m that were different between the label fusion result T m and the ground truth T, i.e. the number of false positives and false negatives: For each test t, we assigned a score s of 1 to the label fusion method with the lowest v m , corresponding to the best method, 0 to the method with the highest v m , corresponding to the poorest method, and we linearly interpolated the score value for the remaining method.
(3 This procedure was performed for each of the 2D and 3D training data sets as well as the combination of both sets resulting in three versions of SVS: SVS-2D (trained with 2D data), SVS-3D (trained with 3D data), and SVS-2D&3D (trained with 2D and 3D data). We note that using this scheme, other label fusion methods could be incorporated in SVS, increasing only the number of scoring functions s(d c , d r ). Figure 4 presents, for SVS-2D (a), SVS-3D (b), and SVS-2D&3D (c), the scoring surface functions in the space (d c , d r , s) as well as the selection regions in the space (d c , d r ), where each method gives the highest score. The latter images thus correspond to the top views of the firsts. We observe that the three versions of SVS give very similar delimitations between the methods. Interestingly, with SVS-2D&3D, the border between STAPLE and SBA is almost linear in the region of (d c , d r ) covered by the label fusion tests.

SVS Selection.
We can now describe the SVS method as follows.
(1) Compute the dissimilarity coefficient d c and the dissimilarity ratio d r from the raters' input labels, as described in Section 2.4.2.
(2) Find the score for each label fusion method using its corresponding scoring surface function.
(3) Select the label fusion method corresponding to the highest score.
In case of two or more equal scores, which do not imply identical label fusions, a weighted vote "meta fusion" of the label fusion results, obtained with STAPLE, Vote, and SBA, is performed using the scores as weights. In practice, this situation is uncommon. We point out that, besides the SVS versions presented here, this "meta fusion" approach, i.e. performing a label fusion of STAPLE, Vote, and SBA, has also been tested (results not presented), using each of STAPLE, Vote, and SBA as "meta fusion" method with and without score weights for the two latter methods. However, no "meta fusion" outperformed the versions of SVS presented in this study.
We also point out that d c and d r depend on the decision matrix E only, i.e. the input labels. Effectively, this ensures that there are no external parameters to the input data that may affect the sensitivity of the technique. Moreover, since d c and d r are normalized values, we believe that the technique should not be sensitive to the training data. In fact, we observe in Figure 4 that the different training sets gave similar regions.

Performance Measure.
To measure the performance of the label fusion techniques, we computed v D , as well as the Dice similarity coefficient (DSC), an established measure widely reported in the field [1,2,9,11,12,15], between each label fusion image and the ground truth. DSC is given by where |Z| is the area or volume of the segmented region Z.
To further characterize our testing sets and insure the deformation factor f σ reflected its initial intent, we computed the DSC between each deformed image and its ground truth. Figures 5(a)   Group SBA are the tests for which SVS-2D&3D selected STAPLE and SBA, respectively. We see that the SVS boxplots, matching the selected method's, give in both groups higher DSC and lower v D , while each of STAPLE (method a) and SBA (method c) is outperformed in its counterpart group. Regarding Vote (method b), it gives better performance than SBA in Group STAPLE but seems to be the worse method in Group SBA. We also see that the three versions of SVS are similar despite the different training sets. Figure 7 Figure 7(c), the boxplots are almost identical to SBA's. We see that SBA/SVS overall gives the highest DSC and the lowest v D . This is also shown in Figure 9, which presents scatter plots of DSC (a, b) and v D (c, d), centered on the Vote values, as a function of d c (a, c) and d r (b, d) for all the 312 label fusion cases. SBA/SVS is overall superior to STAPLE and Vote, with DSC and v D respectively above and below STAPLE and Vote means for the majority of the label fusion cases.

Findings.
We showed on a large set of different simulated data that the label fusion method giving the label closest to the ground truth was not the same depending on the dissimilarity among the raters. Regarding robustness, we showed that SVS outperformed any single method among STAPLE, Vote, and SBA, regardless of the training set. Applying SVS-2D (trained with 2D data) and SVS-3D (trained with 3D data) on 3D and 2D data, respectively, we still obtained better performance than STAPLE, Vote, and SBA. Effectively, the three versions of SVS showed similar results, explained by similar selection regions (Figure 4). This suggests that SVS is independent of the type of training set, 2D or 3D, and that the delimitations of the selecting regions with SVS-2D&3D could represent what we should really expect since there are more training tests.
We also demonstrated that with real data, Vote was not necessarily the method of choice; in our study, SBA was better than Vote and STAPLE. To our knowledge, SBA has not been widely used in the literature, and it might have been underestimated.

4.2.
Limitations. The first and obvious limitation of the SVS technique is that it is upper-bound limited to the best technique (either STAPLE, Vote, or SBA) in each case.
Secondly, we used DSC and v D in this study as the criteria for assessing the label fusion methods, the first being commonly used in the literature. However, we think that v D gives a better indication of the difference between a rater image and the ground truth. This is demonstrated in Figure 8 for HC left, HC right, and AG right. For these regions, while STAPLE's DSC medians are higher (better) than Vote's, v D medians are higher (worse), meaning that there are more false positives and/or negatives. Also, in Figures 5(c) and 5(f), we show that compared to a point with given v D and DSC, a neighbor point with a higher v D (more false positives and/ or negatives) can still give a higher (better) or similar DSC, especially for high v D . This difference between DSC and v D might be explained by the fact that DSC normalizes by the mean area/volume of the label fusion and ground truth, while v D normalizes by the area/volume of the ground truth only. Therefore, the denominator in v D remains constant, while the denominator in DSC varies between label fusions. The comparison is thus not performed on the same basis. Although we could argue on which measure is the most appropriate, this questions the validity of DSC as a performance measure for label fusion if the ground truth is available. We thus keep in mind for future work that DSC is not necessarily the best criterion in this case and that v D should be used instead.
Thirdly, we did not assess the influence of the number and the selection of input labels on the performance of the label fusion strategies. While these two aspects are important, as reported in some studies [2,12], our objectives were primarily to characterize three existing label fusion strategies and to propose a selection method based on our observations. We will confront these aspects in future work.

Conclusion
We proposed a method that automatically selects the most appropriate label fusion method based on the dissimilarity of input labels. Overall, the SVS technique performed better with simulated data compared to either individual technique among STAPLE, Vote, and SBA. For real data, SVS selected SBA for almost all cases, which was overall superior to STAPLE and Vote.