Fast Retrieval Algorithm for EarthMover ’ s Distance Using EMD Lower Bounds and a Skipping Algorithm

The earth mover’s distance (EMD) is a measure of the distance between two distributions, and it has been widely used in multimedia information retrieval systems, in particular, in content-based image retrieval systems. When the EMD is applied to image problems based on color or texture, the EMD reflects the human perceptual similarities. However, its computations are too expensive to use in large-scale databases. In order to achieve efficient computation of the EMD during query processing, we have developed “fastEMD,” a library for high-speed feature-based similarity retrievals in large databases. This paper introduces techniques that are used in the implementation of the fastEMD and performs extensive experiments to demonstrate its efficiency.


Introduction
The earth mover's distance (EMD) was introduced in computer vision as an improved distance measure between two distributions, and it has been widely used in multimedia databases.In particular, in the area of content-based image retrieval (CBIR), where color images are retrieved from multimedia databases, it is important to apply improved distance measures.Several CBIR models have been proposed based on the histogram approach, such as the query by image content (QBIC) method proposed by Faloutsos et al. [1], which maps each image to a vector based on encoded attributes such as color distribution, texture, and shape.However, the histogram approaches using Euclidean distance and quadratic forms apply bin-to-bin distance functions; hence, these approaches are sensitive to minor shifts in feature value distribution.
The similarity distance in the EMD is calculated as the amount of changes necessary to transform one image feature into another.Rubner and Tomasi [2] have proposed a CBIR model using the EMD, and its high quality of image-similarity searches has been demonstrated.Shishibori et al. [3] have proposed a music-retrieval model using the EMD, and a query-by-humming system presents a more robust approach than the conventional methods.
The EMD is defined as a linear programming problem that can be solved using the simplex method.However, computation using the EMD and the simplex method is too complex for its application in interactive multimedia database systems.In order to achieve efficient retrieval processing in large-scale multimedia databases based on the EMD, we propose a fast retrieval algorithm for the EMD.
In order to reduce the computation times of the original distance, the proposed method uses the lower bounding distance described in Cohen and Guibas [4] and Assent et al. [5].The lower bounding distance of the EMD is a simple approximation in terms of computation.In the case of the k-nearest-neighbor search task, this method calculates the original EMD for the first k images, where the largest distance is set to the threshold.This method calculates the lower bounding distance for k + 1 images.This method does not calculate the original EMD if the lower bounding distance exceeds the threshold, because the true EMD cannot be lower than the first k images.In all other cases, this method Advances in Multimedia calculates the original EMD, and if the true EMD is lower than the first k images, a new image is registered and the threshold is reset.
The remaining sections in this paper are organized as follows: Section 2 introduces the outline of the earth mover's distance.Section 3 explains the lower bounds of the EMD and proposes the algorithm of the calculation skip using the lower bounds.Section 4 evaluates the effectiveness of the presented method by using experimental results.Section 5 describes conclusion and future works.

Conventional Dissimilarity Measures.
The L p distance is a generalized measure of the similarity between two histograms x = {x i } and y = {y i }.The L p distance is defined as where p > 0.
The Minkowski distance assumes different names depending on the value of p.If p = 1, the distance is called a Manhattan distance.The Euclidean distance can be computed from (1) by selecting p = 2.The Euclidean distance is one of the simplest and most popular distance measures.
The L p distance is a representative of bin-by-bin dissimilarity measures, and it has been used as the the dissimilarity measures of multimedia information retrieval systems.A major disadvantage of this method is that it accounts only for the correspondence between bins with the same index, and it does not use information across bins.This is illustrated in Figure 1, which shows two pairs of one-dimensional grayscale histograms as described in Rubner and Tomasi [2].
Although the two histograms on the left are the same except for a shift by one bin, the L 1 distance between them is larger than the L 1 distance between the two histograms on the right.As shown in Figure 1(a), the L p distance does not match the perceptual dissimilarity.This can be fixed using the correspondences between the bins in the two histograms and the ground distance between them, as shown in Figure 1(b).As shown in Figure 1(b), the perceptual dissimilarity is based on the correspondence between the bins in the two histograms.
The quadratic-form distance is a representative of the cross-bin dissimilarity measures.Niblack et al. [6] suggested the use of this distance for color-based retrieval.The quadratic-form distance does not enforce a one-to-one correspondence between the mass elements in the two histograms.The same mass in a given bin of the first histogram is made to correspond to the masses contained in different bins of the other histogram.However, the image-retrieval results using the quadratic-form distance include some false positives, because the quadratic-form distance tends to overestimate the mutual similarity of color distributions, as mentioned in Stricker and Orengo [7].

Earth Mover's Distance.
The EMD is based on the following linear programming problem.Let P = {(p 1 , w p1 ), . . ., (p m , w pm )} be the first signature with m clusters, where p i is the cluster representative and w pi is the weight of the cluster.Let Q = {(q 1 , w q1 ), . . ., (q n , w qn )} be the second signature with n clusters.Let D = [d i j ] be the ground-distance matrix, where d i j = d(p i , q j ) is the ground distance between cluster p i and q j .We assume that the signature P denotes "supply places," the signature Q denotes "demand places," and the ground distance d i j denotes the transportation cost between each supply and demand place.Our aim is to find a flow F = [ f i j ] such that f i j lies between p i and q j , and it minimizes the following overall cost: ( Equation ( 2) is obtained with the following constraints: Constraint (3) allows the movement of supplies from P to Q and not vice versa.Constraint (4) limits the amount of supplies that can be sent by the clusters in P to their weight.Constraint (5) limits the clusters in Q to so that the supplies do not exceed the weights Constraint (6) forces the maximum possible amount of supplies possible to be moved.This amount is called the total flow.Once the transportation problem is solved, and we have found the optimal flow F, the EMD is defined as the resulting work normalized by the total flow An example of a two-dimensional EMD is shown.The signature A as supplies is shown in Table 1, and the signature B as demands is shown in Table 2.We assume that the ground distance is the Euclidean distance.The transportation costs between signatures A and B are calculated as shown in Table 3.By solving the transportation problem under the above condition, we can obtain the optimized transportation flow and weight, as shown in Figure 2.
In    The computational complexity of the transportation problem is based on the simplex algorithm, which has an exponential worst case.a sample of retrieval results using the EMD, where this sample is retrieval results from 10,000 image dataset.On the other hand, Figure 5 shows a sample of retrieval results using the L 2 from the same dataset.From these results, it can  be found that CBIR systems using the EMD can reflect the human perceptual similarities.

EMD Lower Bound
3.1.Basic Definitions and Notations.We denote a signature x as follows: where Here, d is the dimension of the cluster x i ∈ R d , and n is the number of clusters.The total weight of the signature x is In the case of the k-nearest neighbor search task, let r be the distance between the query image and the the kth candidate data during the search process.

Centroid-Based Lower Bounds.
The centroid x of the signature x = (X, w) ∈ D d,n is defined as Theorem 1. Suppose that x = (X, w) ∈ D d,m and y = (Y , u) ∈ D d,n are signatures of the equal total weight w Σ = u Σ .Then, Here, the ground distance • is any L p norm used to measure d(x i , y j ), as described in Cohen and Guibas [4].

Calculation Skip
Using the EMD CB .In the calculation of EMD CB , the method described in Ajiok et al. [8] can be applied to skip its calculation steps, because EMD CB is the L 2 distance between the centroid of each signature.This method regards the calculation of each centroid as each dimension in the sigunature vector.An example is shown in which the signature A (Table 1) is the query, the distance between the signature C (Table 4) and the query is calculated, and r is set to 1.5.Then, the following centroids are obtained by (12): First, the L 2 distance of the first dimensional centroid is calculated as below: The distance of the first dimension is larger than r 2 ; hence, the calculation steps after the second dimension can be eliminated.(16)

Independent Minimization Lower
where f i j is the flow from x i to y j , d i j = d(x i , y j ) is the ground distance between cluster x i and y j , and M = m i=1 n j=1 f i j = w Σ = u Σ .

Calculation Skip Using EMD IM .
The calculation cost of the EMD IM can be reduced by weakening the constraints of the transportation problem.In Figure 2, the total flow equals the total (supplies or demands) weight, because the constraints of (4) and ( 5) are satisfied.The difference between the EMD IM and the normal EMD is the constraint on the amount of flows that one cluster may receive.While the normal EMD requires the sum of flows to a certain cluster to be equal to the weight of the cluster (18), the EMD IM only ensures that for any single cluster, the incoming flows do not exceed the weight of the cluster (19); that is, the total flow to a cluster can exceed its weight.The final flows of EMD IM (A, B) are shown in Table 5 in which the total flow to B 2 exceeds w B2 .
From this characteristic, the flows can be calculated for each supply cluster.To calculate EMD IM , the method described in Ajiok et al. [8] can be applied to skip the calculation step of each supply cluster.This method regards the flow calculation of each supply cluster as each dimension in the signature vector.
An example of the calculation skip of the EMD IM (A, B) is shown, where r is set to 1.5.First, a minimum value of total transportation costs from A 0 is obtained, and its value is compared with r, as shown below: For the first k image data, the original EMD is calculated Data p which has the largest distance from the query is specified The largest distance is set to the threshold (r) Results Database  The value of A 0 is smaller than r; the total transportation costs from A 1 are added to A 0 as follows: where the total value is larger than r; hence, the calculation of A 2 and A 3 can be eliminated.

Fast Retrieval Algorithm for the EMD
4.1.Algorithm Using the Lower Bounds.In the case of the k-nearest-neighbor search task, the retrieval algorithm for the first k image data is shown in Figure 6, and the retrieval algorithm from the k+1 image data is shown in Figure 7.This method calculates the original EMD for the first k images in the database, where the largest distance is set to the threshold r.This method calculates the lower bounding distance for succeeding data from k+1 image.This method does not calculate the original EMD if the lower bounding distance exceeds the threshold r, because the true EMD cannot be lower than the first k images.In all other cases, this method calculates the original EMD, and if the true EMD is lower than the first k images, a new image is registered, and the threshold r is reset.

Algorithm
Using the Priority Queue.In a nearestneighbor search based on a linear search, k search results are obtained when k features close to a given query are found.When a newly found result has a distance to the search feature that is smaller than the largest of the current k results, the results are updated.The data with the largest distance must be deleted from the candidate list, and the new result must be added.The efficient execution of this procedure is provided by using a data structure called a priority queue.
There are various implementations of the priority queue, one of them being heap.The heap structure is shown in Figure 8.The diagram shows an array as a binary tree, with a key specifying the two lower keys.Every key is larger than its two child keys.Such a data structure is called a complete binary tree; the heap is defined as a complete binary tree, which is represented as an array in which all the nodes satisfy the heap condition.Here, the heap condition is that the key in any node is larger than (or equal to) the keys in its child nodes.If this heap condition is satisfied, then the largest key is always in the root node.As shown in Figure 8, the root is numbered as 1, its children as 2 and 3, their children as 4, 5, 6, and 7, and so on.In such a representation, the parent of a node a[ j] is a[ j/2] (truncated at the decimal point if j is odd), and its children are a[2 j] and a[2 j + 1]; thus, the parents and children can be identified easily for every node.When a new search result offers a distance smaller than the largest value among the current results, the heap root (which has the largest value in the candidate list) is replaced with this newly found value.When this procedure causes a violation of the heap condition, the root value is exchanged with the largest of its children to restore the condition.If the heap condition is violated at the next level, a similar operation is applied; the procedure is terminated if a parent's value is smaller than that of both children, or when the heap bottom is reached.This processing ensures that the data structure can be repaired so that every node meets the heap condition.

Experimental Results.
In order to evaluate this method, the number of color images in the database was increased from 5,000 to 50,000 (in increments of 5,000), and the retrieval time (CPU time) to search the top 10 images was evaluated.To evaluate the number of dimensions of the cluster representative x i , we prepared 5-and 11-dimensional  .In these graphs, "none" indicates the normal EMD calculation, "cb" indicates the corresponding result obtained using the centroid-based lower bounding distance, "im" indicates the corresponding result using the independent minimization lower bounding distance, and "cb im" indicates the corresponding result by combining the "cb" and "im" methods.In addition, "noskip * " represents the value obtained by each method without the calculationskipping algorithm.
From the experimental results, it can be seen that when the number of color images is 50,000 and has 5 dimensions, it takes approximately 2.5 seconds to retrieve the result for a query image by using the normal EMD calculation.The combination method of the lower bounds could obtain the same result as the normal EMD in approximately 0.15 second.On the other hand, the method using the calculation-skipping algorithm took only 0.1 second.

Related
Works.Some derivation algorithms of the EMD have been proposed.Pele and Werman [9] proposed the fast algorithm of EMD, and Ling and Okada [10] proposed the fast algorithm of EMD L 1 .The EMD can be obtain by replacing the ground distance of the EMD with the thresholded distance.EMD L 1 is a replacement of the ground distance with the L 1 distance.Both of EMD and EMD L 1 are different from the original EMD.And they only indicate high-speed calculation methods in the case that the thresholded distance  and L 1 distance were used as the ground distance.On the other hand, by using our algorithm, the original EMD can be calculated efficiently.The characteristic of our algorithm is that unnecessary calculations can be skipped by using the EMD lower bounds, and the viewpoint of speedup is different from their methods.

Future Works.
The proposed method selects first k image data in the database as the initial retrieval results (Figure 6).However, if there are some outliers k image data, the effect of speedup will become less, since the threshold r is much larger.Then, the smaller the distance between the query and first k image data is, the more the retrieval efficiency improves.In order to solve this problem, we are planning to use the hierarchical clustering index to select the initial data.VP-tree [11] and M-tree [12], which are famous metric space indices, can be used as the hierarchical clustering index.By using these indices, appropriate initial data as first k image data can be obtained from the leaf node which is reached for the first time on the retrieval process.
To be sure, these data are not true correct, but the possibility including outliers becomes low considerably.And it seems that the distance between the query and first k image data becomes small.The above improvement will be done as the future works.
In order to validate that the hierarchical indexing tree can help to reduce the effects of outliers in the first k images, the following preliminary experiment was done.First, the max distance between the query and top 100 image data is calculated.This distance is called distance top and set to the initial threshold r on the proposed method.Next, we obtain the max distance between the query and about 100 image data in the leaf node, which is reached for the first time by using the VP-tree.This distance is called by distance vp , and set to the initial threshold r on the improved method by using the VP-tree.These distances were calculated for 5 queries and the image database, in which was registered 10,000 image data.The preliminary experimental result is shown in Table 6, where the Euclid distance is used as the distance measure, and the number of dimensions is 48.From the preliminary experimental results, it is found that the initial threshold can be set to smaller value by using the VPtree.

Conclusion
In this paper, a fast retrieval method for the earth mover's distance in multimedia databases is proposed.This method uses the lower bounding distance of EMD and combines it with a calculation-skipping algorithm.Moreover, the validity of the proposed method is supported by empirical observations.For future works, various multimedia information retrieval systems using the fast EMD should be implemented.

Figure 2 ,
the white circles indicate supply places, and the black circles indicate demand places.The numbers inside the circle denote the weight of the supplies or demands.The arrows show the transportation flow, and the numbers beside the arrow denote the transportation weight.The total flow cost of Figure 2 is obtained as follows:

2. 3 .
Image Retrieval Using the EMD.Color distribution data applied for CBIR models is extracted from color segmentation images.As shown in Figure3, the distribution consists of the weight, flow cost, and image feature.The weight is the number of pixels in each region.The flow cost of each distribution can be obtained by calculating the Euclidean distance between the image features in each region.The image feature of each region consists of the center position (x, y) and the average color signature (RGB).

Figure 5 :
Figure 5: Sample of retrieved results using the L p .

Theorem 2 .≤
Bounds.EMD CB denotes the lower bounds based on the L p distance.On the other hand, Assent et al.[5] have proposed independent minimization lower bounds based on cross-bin dissimilarity measures.Suppose that x = (X, w) ∈ D d,m and y = (Y , u) ∈ D d,n are signatures of the equal total weight w Σ = u Σ .Then, EMD IM x, y = min EMD x, y .

Figure 6 :
Figure 6: Algorithm for the first k image data.

Figure 7 :
Figure 7: Algorithm from the k + 1 image data.

Table 1 :
An example of the signature A.

Table 2 :
An example of the signature B.

Table 3 :
Cost matrix of signature A and B.
Figure 4: Sample of retrieved results using the EMD.

Table 4 :
An example of the signature C.

Table 5 :
Flow in EMD IM .
Equation (16) is obtained with the following constraints:

Table 6 :
A preliminary experimental result.