Practitioners often consult multiple metrics in image processing, but here we take the further step of mining the values of batteries of metrics computed on image processing results. We present a case for extending image processing methods to incorporate automated mining of multiple image metric values. By a metric we mean any image similarity or distance measure; in this paper we consider intensity-based and statistical image measures and focus on registration as the image processing problem. We show how to develop meta-algorithms that evaluate different image processing results with a number of different metrics and mine the results automatically so as to select the best ones. We show that mining multiple metrics offers a variety of potential benefits for many image processing problems, including improved robustness and validation.
Every year many articles are published in the area of biomedical image registration that introduce new metrics for biomedical images, covering both distance/difference measures and similarity measures. There are many reasons for this interest in metrics. However, the abundance of methods creates a basic dilemma for practitioners seeking high-performance imaging systems: which metric should be used?
This paper reports on a five-year effort at UCLA to study this question and to develop schemes that use multiple methods and multiple evaluation metrics to obtain better image processing results. In much the same way that ensemble methods yield better results in data mining, this effort explored software combinations of metrics that yielded improved methods for registration in neuroimaging.
In this paper we consider two families of image similarity metrics: intensity-based metrics (metrics of the intensity or luminosity values of voxels) and statistical metrics (metrics of their distributions). There are at least three reasons why use of multiple metrics can be important in image processing. Metrics are performance measures, so awareness of them is a prerequisite for good performance. Although it is common to commit to a single metric, there are inherent limits to image processing performance. From this perspective, image processing methods are little more than optimizers that rest on assumptions about prior distributions of images, with validation serving as experimental verification of these distributions. However, if metric values can be treated as samples of prior distributions on performance measures, we can mitigate some of these limits. The key point of this paper is that the results of different image processing algorithms and parameter settings can be evaluated under multiple metrics, and the metric values can then be analyzed with data mining to identify the best results. Tracking metric values permits investigation of which image processing methods give better results for images from a given source. It also permits flexible on-demand analysis of arbitrary performance measures.
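As a concrete illustration, scoring one result under a small battery of metrics might look like the following sketch. The metric choices and the `evaluate` helper are illustrative, not part of any particular implementation; one metric is intensity-based (mean absolute voxel difference) and one is statistical (Pearson correlation of intensities).

```python
import math

def mean_abs_diff(a, b):
    """Intensity-based metric: mean absolute voxel difference (lower is better)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Statistical metric: Pearson correlation of the two intensity lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def evaluate(result, reference, metrics):
    """Score one image processing result under a battery of metrics."""
    return {name: fn(result, reference) for name, fn in metrics.items()}

battery = {"mad": mean_abs_diff, "pearson": pearson}
row = evaluate([1, 2, 3, 4], [2, 3, 4, 5], battery)
```

Note that the two metrics disagree here: the shifted result is perfectly correlated with the reference (`pearson` is 1) yet has nonzero intensity difference, which is exactly the kind of disagreement that motivates consulting several metrics.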
Every metric has strengths and weaknesses when applied to particular categories of image modalities. In fact, some metrics are designed for, or biased towards, specific categories and therefore cannot encompass some images in real-world applications; no algorithm can be better than the metric used to evaluate it. Put another way, proper evaluation of the performance of an algorithm can require consideration of multiple metrics.
Having multiple metric values is also important for the development of image processing methods themselves.
Every year many articles are published in the area of medical image registration that introduce new metrics. In the ideal case this multitude of options could be condensed into a set of metrics that is effective, comprehensive, and compact. We have implemented an initial approximation of this ideal. The metrics we consider here can be broadly divided into intensity-based metrics, which rely solely on the intensities of voxels, and statistical metrics, which are based on distributions of these intensities. These categories are simple, and many other metrics exist, but our implementation is open and representative and can in principle accommodate any metric.
Table
Metrics often depend on the application itself and on the modalities of the input images. Both intensity- and morphology-based metrics have been widely employed in the implementation of registration algorithms to address different needs, including the comparison of images of different modalities.
The metrics in Table
This parallel coordinates plot is a visual representation of our eleven metric values for 186 different variants of the image E4863S4I, produced by four image registration tools: AIR Warp (blue), AIR Linear (red), FSL FLIRT (green), and MINC Tracc (purple). Higher metric values are “better” (higher similarity or lower distance). Each trajectory across the plot gives the row of 11 metric values obtained by an image; altogether there are 186 such trajectories, so an entire table of 186 × 11 metric values can be viewed at once.
There are many notions of similarity. This set of intensity-based and statistical metrics in Table
Image metrics can involve image features (and therefore both feature detection and feature matching) as well as models (and therefore model estimation, image resampling, image transformation, and numerical optimization) [
Consistency among these metrics can be visualized with a parallel coordinates plot of the data (Figure
Heat map representation of the correlation matrix for the table of metric values for the 186 variants of input image E4863S4I, with metrics clustered into a hierarchy by rough similarity. The nine last metrics are consistent in the sense that they are highly correlated, with all pairwise correlation values above 0.644. There is nontrivial disagreement between these nine and the
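The pairwise correlations underlying such a heat map can be computed directly from the table of metric values. A minimal sketch, treating the table as a list of rows (one row of metric values per result):

```python
import math

def pearson(a, b):
    """Pearson correlation between two sequences of metric values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var = sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var)

def correlation_matrix(table):
    """Pairwise correlations between the metric columns of a value table."""
    cols = list(zip(*table))  # transpose: one tuple per metric
    return [[pearson(c1, c2) for c2 in cols] for c1 in cols]

# Toy table: 3 results x 2 metrics; the second metric runs opposite the first.
m = correlation_matrix([[1, 10], [2, 8], [3, 6]])
```

A matrix like this, computed over all eleven metrics, is what the hierarchical clustering in the heat map summarizes.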
Since metrics can be computed automatically, evaluating a set of them gives us not only an inexpensive way of assessing multiple aspects of similarity but also a strategy for eliminating poor results and a basis for machine learning. Automation will never eliminate the need for expert opinion, but it can help eliminate distractions and improve productivity.
In the course of this development we have refined its implementation, in the choice of metrics, in recording of results (with a database), and in various performance enhancements for increasing parallelism and reducing file movement. Specifically, we used a database system to record all metric values obtained by each run and also metadata about program execution. This permits use of other data analysis tools for evaluating the resulting tables of metric values and execution information.
In our implementation, a backend PostgreSQL database is used to record all metric values for later analysis. Using a database to store this information provides three important benefits. First, it endows our meta-algorithm with the ACID properties (atomicity, consistency, isolation, and durability) provided by database systems. This is significant in a world in which tools crash or are unreliable, as is unfortunately the case in neuroimaging. It might be possible to provide some of these properties in an ad hoc way, but there is little apparent gain in reimplementing these hard-won database features. Second, it allows our meta-algorithm to operate effectively in parallel computing environments, where many concurrent runs must log results independently; a database is an elegant way to meet this need. Third, it allows ad hoc extraction and analysis of data from these executions. Although a given set of executions may not be that large (186 runs in our example), having a database makes this information much easier to work with.
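A minimal sketch of such a logging schema, using Python's `sqlite3` as a self-contained stand-in for the PostgreSQL backend (the table and column names here are hypothetical, not those of the actual implementation):

```python
import sqlite3

# Hypothetical schema: one row per run, plus one row per (run, metric) value.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE runs (
    run_id    INTEGER PRIMARY KEY,
    algorithm TEXT,
    params    TEXT,
    seconds   REAL)""")
conn.execute("""CREATE TABLE metric_values (
    run_id INTEGER REFERENCES runs(run_id),
    metric TEXT,
    value  REAL)""")

# Each run logs its metadata and metric values independently.
conn.execute("INSERT INTO runs VALUES (1, 'FSL FLIRT', 'dof=12', 41.2)")
conn.executemany("INSERT INTO metric_values VALUES (?, ?, ?)",
                 [(1, "pearson", 0.93), (1, "mad", 4.1)])

# Ad hoc analysis: rank runs by one metric.
best = conn.execute("""SELECT r.algorithm, v.value
                       FROM runs r JOIN metric_values v USING (run_id)
                       WHERE v.metric = 'pearson'
                       ORDER BY v.value DESC""").fetchone()
```

The long-format `metric_values` table keeps the schema open: adding a twelfth metric requires no schema change, only new rows.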
Managing information about metric values in a database presents interesting possibilities for data mining. For example, one not only can determine which algorithms and parameter settings give better results for images from a given source, but also analyze execution times and even differences in performance by different versions of a given algorithm.
We show in this section how image processing methods can be extended by augmenting them with multiple metric computation coupled with data analysis methods from machine learning and data mining. As mentioned earlier, tracking metric value information (such as in a database) permits investigation of which algorithms and parameter settings give better results for images from a given source and permits analysis of execution times and even differences in performance by different versions of a given algorithm.
Augmentation with metric evaluation is a natural evolutionary direction for image processing methods. Given a set of images
Image processing methods can then be augmented with a final data analysis phase. This analysis can yield deeper understanding of the method's behavior under the various metrics. As long as performance can be formalized in terms of metrics, we believe that this extension with learning and data mining methods can improve any scientific computational method, because it can rise above assumptions about input data that are tacit in development.
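The augmentation described above can be sketched as a loop over algorithms and parameter settings, followed by metric evaluation of every result. The `meta_algorithm` function and the toy "registration" below are illustrative stand-ins, not the actual IRMA code:

```python
def meta_algorithm(image, reference, algorithms, metrics):
    """Run every (algorithm, parameter setting) variant, score each result
    under every metric, and return the full table of metric values."""
    table = []
    for name, (run, param_grid) in algorithms.items():
        for params in param_grid:
            result = run(image, **params)
            row = {m: fn(result, reference) for m, fn in metrics.items()}
            row["algorithm"], row["params"] = name, params
            table.append(row)
    return table

# Toy stand-ins: "registration" just adds an intensity offset, and the
# single metric is mean absolute difference (lower is better).
shift = lambda img, offset: [v + offset for v in img]
mad = lambda a, b: sum(abs(x - y) for x, y in zip(a, b)) / len(a)

table = meta_algorithm(
    [10, 20, 30], [12, 22, 32],
    algorithms={"shift": (shift, [{"offset": o} for o in (0, 1, 2, 3)])},
    metrics={"mad": mad})
best = min(table, key=lambda r: r["mad"])
```

The returned table (here 4 rows, in general one row per run) is exactly the object that the final data analysis phase mines.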
Essentially, image registration is the problem of aligning two images. Since this alignment generally requires measurement of image similarity and optimization of a transformation so as to maximize it, registration is a canonical image processing problem requiring the consideration of multiple metrics.
Let
When assessing registration, it is natural to investigate how the edges from the template image are mapped to the corresponding edges in the reference image. In good registrations the mapped and reference edges are perfectly superimposed or very close in shape and space. The same applies to surfaces in three dimensions. This is the morphological view of registration.
Registration also can be approached from an information-theoretic point of view where image intensities are viewed as probability distributions. The analysis of similarity between distributions and intensities governs assessment of how well a registration algorithm performs. This perspective is natural for medical imaging; using a collection of metrics is useful for assessing the quality of registration methods, taking distributions and luminosities into account.
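For instance, mutual information, a standard statistical registration metric, can be estimated from a joint histogram of the two images' intensities. The following is a simplified sketch; the binning scheme and intensity range are illustrative assumptions:

```python
import math
from collections import Counter

def mutual_information(a, b, bins=8, lo=0, hi=256):
    """Mutual information between two intensity lists, estimated from a
    joint histogram (the information-theoretic view of registration)."""
    def bin_of(v):
        return min(bins - 1, int((v - lo) * bins / (hi - lo)))
    joint = Counter((bin_of(x), bin_of(y)) for x, y in zip(a, b))
    n = len(a)
    px = Counter(i for i, _ in joint.elements())  # marginal counts of a
    py = Counter(j for _, j in joint.elements())  # marginal counts of b
    mi = 0.0
    for (i, j), c in joint.items():
        pxy = c / n
        # pxy / (p(i) p(j)) = c * n / (px[i] * py[j])
        mi += pxy * math.log(pxy * n * n / (px[i] * py[j]))
    return mi
```

A perfectly registered pair of identical images attains the maximum (the entropy of the image), while statistically independent intensities give zero, which is why maximizing mutual information drives the alignment.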
IRMA is a meta-algorithm for image registration that was developed with the metrics above in mind [
Figure
Parameter spaces of the four algorithms used in a representative configuration of IRMA, along with the resulting number of runs of each algorithm.
| Algorithm | Options/parameters used | Runs |
|---|---|---|
| AIR warp | Airwarp model | 90 |
| AIR linear | Blur | 30 |
| FSL FLIRT | Interpolation | 60 |
| MINC Tracc | dof | 6 |
Some intensity-based and statistical image metrics. In the Difference metrics, the index
[Equations (1)–(11): formal definitions of the eleven metrics.]
The values of all metrics were computed for the result of each run, and the tabulated results are shown in Figure
Figure
Figure
IRMA demonstrates how dimensionality reduction methods can be used to mine tables of metric values. Specifically, IRMA uses robust PCA [
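A plain-PCA stand-in for this ranking step can be sketched with power iteration (IRMA itself uses robust PCA; the code below is an illustrative simplification and assumes every metric is oriented so that higher values are better):

```python
import math

def first_pc(rows, iters=200):
    """Leading principal component of a metric-value table (rows = results,
    columns = metrics), found by power iteration on X^T X after centering."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        s = [sum(xi[j] * v[j] for j in range(d)) for xi in x]          # X v
        w = [sum(x[i][j] * s[i] for i in range(n)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w)) or 1.0
        v = [c / norm for c in w]
    return means, v

def rank_results(rows):
    """Rank results by their score along the first principal component."""
    means, v = first_pc(rows)
    if sum(v) < 0:  # orient the axis so larger metric values score higher
        v = [-c for c in v]
    scores = [sum((r[j] - means[j]) * v[j] for j in range(len(v)))
              for r in rows]
    return sorted(range(len(rows)), key=lambda i: -scores[i])

# Three toy results under two (correlated) metrics; result 0 is best.
ranking = rank_results([[0.9, 0.8], [0.2, 0.1], [0.6, 0.5]])
```

When the metrics largely agree, as the correlation analysis above suggests they do, the first component captures a consensus "quality" axis, and projecting onto it yields a single ranking of the 186 results.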
The table of metric values shown in Figure
Some of the 186 registration results produced by IRMA for the input image E4863S4I, illustrating the wide variation in result quality that can be produced by changing algorithms and parameter settings. The first two images show the image projected into ICBM space (the target image) and the template (from the ICBM Atlas). Subsequent images (top-to-bottom, left-to-right) show the results produced by IRMA with ranks 1 (top ranked), 62, 88, 93, 124, and 186 (bottom ranked). Notice the significant diversity of result quality produced by different registration algorithms.
Many dimensionality reduction methods are available—including alternative PCA methods and multidimensional scaling [
If the performance of a given image processing method can be formalized in terms of the similarity metrics considered here, however, the multiple-metric approach provides a more formal and more robust framework for validation. We can then extend the method to include a validation stage, which records computed metric values (e.g., in a database) and analyzes them (e.g., with PCA). Taking multiple metrics as objectives formalizes the notion of performance and avoids instabilities due to quirks of individual metrics.
By integrating data mining into our meta-algorithm we can increase the sophistication of image processing algorithms. For example, IRMA’s evaluation process can be extended to learn about the strengths and weaknesses of image processing methods and about the kinds of images encountered. IRMA also gains robustness from not relying on any single method or metric.
We have argued that many image processing methods can be beneficially extended to a meta-algorithm with standardized computation and data mining of multiple metric values. Although a metric can be any figure of merit useful in evaluating the performance of a method, we have considered the situation where each metric is an image similarity measure. In this approach, basic image processing algorithms are used to produce a collection of results (e.g., for a variety of alternative parameter settings); these results are evaluated with multiple metrics, and a data mining postprocessing phase is used to extract good results. The approach described here could lead to more formal and robust image processing methods that exploit machine learning, leading to better understanding of performance in many dimensions.
As a demonstration, in this paper we have described the IRMA image registration meta-algorithm. IRMA is a neuroimaging module in the LONI Pipeline workflow environment [
The authors declare that there is no conflict of interest.
The authors thank reviewers for their valuable comments. This work was supported by NIH Grant 1U54RR021813 (Center for Computational Biology).