Markov Models for Image Labeling

A Markov random field (MRF) is a widely used probabilistic model for expressing the interaction of different events. One of its most successful applications is solving image labeling problems in computer vision. This paper provides a survey of recent advances in this field. We give the background, basic concepts, and fundamental formulation of the MRF. Two distinct kinds of discrete optimization methods, namely, belief propagation and graph cut, are discussed. We further focus on the solutions of two classical vision problems, namely, stereo and binary image segmentation, using the MRF model.


Introduction
Many tasks in computer vision and image analysis can be formulated as a labeling problem in which the correct label has to be assigned to each pixel or clique. The label of a pixel represents some property of the real scene, such as membership of the same object or the disparity. Such problems can be naturally represented with the Markov random field (MRF) model. The MRF was first introduced into vision by S. Geman and D. Geman [1] in 1984 and has been widely used in both low-level and high-level vision in recent years.
Basically, humans understand a scene mainly by using the spatial and visual information assimilated through their eyes. Conversely, given one or more images, information such as boundaries or objects, derived mainly from contextual constraints, is essential for scene interpretation. We would therefore like to model a vision problem so that it captures the full interaction between pixels. On the other hand, because of sensor noise and the complexity of the real world, exact interpretation is rather difficult for computers. As a result, researchers have realized that vision problems should be solved using optimization methods. As one of the most popular models for gridded, image-like data, the MRF provides a body of mathematical theory for finding such optimal solutions under the contextual visual information in the images. Context-dependent objects in digital images can be modeled in a convenient and consistent way through MRF theory. This is achieved by characterizing mutual influences among such entities using conditional MRF distributions [2]. Besides, captured images are usually piecewise smooth, which can be encoded as a prior distribution. Thus, we can use an MRF model whose negative log-likelihood is proportional to a robustified measure of image smoothness [3]. Moreover, we may have external knowledge of the scene, for example, that a specific object might exist in the environment. With these priors, we can obtain a more reliable understanding of the images.
In the last two decades, the MRF model has enjoyed a renaissance in computer vision thanks to powerful energy minimization algorithms. Many inference algorithms have been developed to solve MRF optimization problems, such as graph cut [4], belief propagation [5], tree-reweighted message passing [6], dual decomposition [7], fusion move [8], iterated conditional modes, and their extensions. In [9], Szeliski et al. gave a set of energy minimization benchmarks and used them to evaluate the solution quality and runtime of several energy minimization algorithms. Felzenszwalb and Zabih [10] reviewed dynamic programming and graph algorithms and then discussed their applications in computer vision. A review of linear programming approaches to the max-sum problem was given in [11]. On the other hand, a framework for learning image priors for the MRF was introduced by Roth and Black [12]. Schmidt et al. [13] revisited the generative aspects of the MRF and analyzed the quality of common image priors in a fully application-neutral setting. New models based on the MRF, such as the MPF, were also proposed. It was proved that the convex energy MPF can be used to encourage arbitrary marginal statistics [14]. Some excellent books about MRF models in image analysis, such as [2], are also available.
The MRF has been successfully applied to image analysis tasks such as restoration, matting [15], and segmentation, as well as to two-dimensional (2D) fields such as stereo matching, super resolution [16], optical flow, image inpainting [17], motion estimation, and 2D-3D registration [18]. The MRF has also been used to solve high-level vision problems such as object classification [19, 20], face analysis [21], face recognition [22], and text recognition [23]. Many optimization problems can be formulated with the MRF, for example, color-to-gray transformation [24], feature detection scale selection [25], and so forth. Additionally, Boykov and Funka-Lea [26] presented a survey of various energy-based techniques for binary object segmentation. S. Geman and D. Geman [1] first applied the MRF to image restoration. Sun et al. [27] used the belief propagation algorithm and combined it with occlusion handling to solve the stereo problem. Detry et al. [28] proposed an object representation framework that encodes probabilistic spatial relations between 3D features and organized this framework as an MRF.
In the remainder of the paper, Section 2 gives a sketch of the MRF and related concepts. Section 3 presents the two most frequently used inference algorithms for the MRF. Section 4 briefly introduces two labeling applications of the MRF in low-level vision. Section 5 summarizes the contribution and outlines future work on the topic.

Problem Formulation with MRF
As a branch of probability theory [2], the MRF is an undirected graphical model in which a set of random variables has a Markov property. To cast a specific computer vision problem involving pixel interaction and partially observed information as an optimization problem using the MRF model, we first go over graphical models, which visualize the structure of probabilistic models using diagrammatic representations. A graph consists of nodes and edges. Each node denotes an event, and each edge represents the relationship between two events. The MRF is used to find the optimal label configuration.
For a labeling problem, we need to specify a set of nodes, labels, and edges. Without loss of generality, let M be a set of indexes $M = \{1, \ldots, m\}$, and let $P = \{p_1, \ldots, p_m\}$ be a set of observed nodes. In a vision problem, a node often represents the pixel intensity or some other image feature. Let L be a set of labels. L can be continuous or discrete, but in most cases the labels are discrete: $L = \{l_1, \ldots, l_m\}$. As stated above, a label denotes some quantity of the real scene. The simplest case is the binary form, where $L = \{0, 1\}$. Such a black-and-white model is often used to separate the foreground and background regions in an image. In general cases, the label value carries more meaning. For example, in stereo and image restoration problems, a larger label value corresponds to depth information or to a lighter pixel intensity. Additionally, L can also consist of unordered labels whose values have no semantic meaning, as in object classification.
$N = \{N_i \mid \forall i \in M\}$ represents the neighbor system, which indicates the interrelationship between nodes, or the order of the MRF. Edges are added between a node $p_i$ and its neighbors $N_i$. Usually, the neighbor system should satisfy the following [2]: (1) a site does not neighbor itself: $i \notin N_i$; (2) the neighboring relationship is mutual: $i \in N_j \Leftrightarrow j \in N_i$. The definition of the neighbor system is important because it reflects how far the contextual constraint reaches. For regular array data, as in Figure 1, the neighbors of i are defined as the set of sites within a radius of $\sqrt{r}$ from i, that is, $N_i = \{ j \in M \mid [f(p_i, p_j)]^2 \le r,\ i \ne j \}$, where $f(\cdot, \cdot)$ is the distance between two sites and r is the order of the neighbor system. Another concept here is the "clique." A clique is a subset of P which plays a role similar to that of the neighbor system. However, the nodes in a clique are ordered, which means that $(p_i, p_j)$ is different from $(p_j, p_i)$. Figure 2 shows some examples of clique types.
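The neighbor system on a regular pixel grid can be sketched in a few lines. This is only an illustration under stated assumptions: sites are `(row, column)` pairs, the distance is Euclidean, and the function name `neighbors` and its parameters are hypothetical rather than taken from the surveyed works.

```python
import math

def neighbors(site, shape, r):
    """Return all sites j != site whose squared Euclidean distance to `site`
    is at most r, clipped to a grid of the given (height, width) shape."""
    (y, x), (h, w) = site, shape
    reach = math.isqrt(r)  # largest integer offset that can satisfy d^2 <= r
    result = set()
    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            # A site is never its own neighbor, and must lie within the radius.
            if (dy, dx) == (0, 0) or dy * dy + dx * dx > r:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                result.add((ny, nx))
    return result

# r = 1 gives the 4-neighborhood, r = 2 the 8-neighborhood of an interior site.
four = neighbors((2, 2), (5, 5), 1)
eight = neighbors((2, 2), (5, 5), 2)
```

Because the membership test depends only on the distance between two sites, the mutual-neighbor condition holds by construction.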
Though a larger neighbor system provides more statistical information about the problem domain, the computational complexity of the problem also increases exponentially with the size of the neighborhood. In most cases, a 4-neighborhood system is used for simplicity and efficiency. In the random field, each random variable $p_i$ in the set P can take a label from L. Usually, a mapping function $F = \{f_1, \ldots, f_m\}$ with $F : P \to L$ represents this assignment. F is also called a configuration. Denote by $\Pr(f_i(p_i) = l_i)$ the probability of a pixel $p_i$ taking the label $l_i$. Then the configuration has the joint probability $\Pr(F) = \Pr(f_1(p_1) = l_1, \ldots, f_m(p_m) = l_m)$.

Mathematical Problems in Engineering
Note that $\Pr(f_i(p_i) = l_i) > 0$ for all $p_i \in P$. The Markov property holds if the conditional probability satisfies $\Pr(f_i \mid f_{M \setminus i}) = \Pr(f_i \mid f_{N_i})$, where $M \setminus i$ denotes all elements of M other than i, and $N_i$ is the neighbor system of $p_i$. The Markov property means that the state probability of one node depends only on its neighbors rather than on the remaining nodes. A Gibbs random field is a random field whose probability obeys the Gibbs distribution $\Pr(f) = Z^{-1} e^{-E(f)/T}$, where $Z = \sum_{f \in F} e^{-E(f)/T}$ is a normalizing constant called the partition function, and T is a constant which is assumed to be 1 unless otherwise stated. $E(f) = \sum_{c \in C} V_c(f)$ is the energy function, where C is the set of cliques defined on the graph and $V_c(f)$ is the clique potential function.
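To make the Gibbs distribution concrete, the following sketch enumerates every configuration of a tiny three-node chain and normalizes $e^{-E(f)/T}$ by the partition function. The chain topology, the Potts-style pair potential, and the numeric potentials are all illustrative assumptions; exhaustive enumeration is only feasible for toy problems.

```python
import itertools
import math

def energy(f, unary, pair_weight):
    """E(f): single-site potentials plus a constant penalty for each
    pair of adjacent chain nodes that carry different labels."""
    e = sum(unary[i][label] for i, label in enumerate(f))
    e += sum(pair_weight for a, b in zip(f, f[1:]) if a != b)
    return e

def gibbs(unary, pair_weight, n_labels=2, T=1.0):
    """Return {configuration: probability} under Pr(f) = exp(-E(f)/T) / Z."""
    configs = list(itertools.product(range(n_labels), repeat=len(unary)))
    weights = [math.exp(-energy(f, unary, pair_weight) / T) for f in configs]
    Z = sum(weights)  # partition function
    return {f: w / Z for f, w in zip(configs, weights)}

# Hypothetical single-site potentials for three binary nodes.
unary = [[0.0, 1.0], [0.3, 0.7], [1.0, 0.2]]
probs = gibbs(unary, pair_weight=0.6)
most_probable = max(probs, key=probs.get)
```

The most probable configuration is the one with the lowest energy, which is why maximizing the probability and minimizing the energy are interchangeable.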
The Hammersley-Clifford theorem states that if a probability distribution is positive and satisfies the Markov properties on an undirected graph G, the distribution is a Gibbs random field. That is, its probability can be factorized over the cliques of the graph. This theorem provides a simple way to calculate the joint probability using the clique potentials. According to Bayes' rule, the posterior distribution for a given set y and its evidence $p(y \mid x)$, combined with a prior $p(x)$ over the unknowns x, is given by $p(x \mid y) = p(y \mid x)\, p(x) / p(y)$. If we do not know the prior, the maximum likelihood (ML) criterion may be used, where $x^* = \arg\max_x p(y \mid x)$. However, we can often obtain some knowledge about the prior distribution of x. In that case, maximum a posteriori (MAP) estimation is the best way to perform the optimization, where $x^* = \arg\max_x p(y \mid x)\, p(x)$. Figure 3 illustrates the difference between the ML criterion and the MAP criterion. MAP is one of the most popular statistical criteria for optimization, and in fact, it is the first choice in MRF vision modeling.
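The difference between the two criteria can be seen in a two-state toy example. The state names and probability values below are made up purely for illustration; `likelihood[x]` stands for $p(y \mid x)$ at one fixed observation y.

```python
def ml_estimate(likelihood):
    """x* = argmax_x p(y | x): the prior is ignored."""
    return max(likelihood, key=likelihood.get)

def map_estimate(likelihood, prior):
    """x* = argmax_x p(y | x) p(x): the evidence is weighted by the prior."""
    return max(likelihood, key=lambda x: likelihood[x] * prior[x])

# Hypothetical values: the observation slightly favors "bright",
# but "dark" is much more probable a priori.
likelihood = {"dark": 0.40, "bright": 0.60}
prior = {"dark": 0.80, "bright": 0.20}

ml = ml_estimate(likelihood)
map_ = map_estimate(likelihood, prior)
```

Here the ML estimate follows the evidence alone, while the MAP estimate switches to the state favored by the prior because 0.40 × 0.80 exceeds 0.60 × 0.20.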

Taking the logarithm of both sides, we obtain the negative posterior log-likelihood $-\log p(x \mid y) = -\log p(y \mid x) - \log p(x) + \log p(y)$ (2.4), where $\log p(y)$ is a constant used to make the integral of $p(x \mid y)$ equal to 1. To find the MAP solution, we simply minimize (2.4). The clique potential can be rewritten as $E(x, y) = E_d(x, y) + E_s(x)$ (2.5), where $E_d(x, y)$ can be treated as the clique potential whose clique size is 1, and $E_s(x)$ is the remaining clique potential, that is, the prior distribution of the observed image. In most vision problems, the single-site clique potential is also called the unary potential or data energy. Similarly, $E_s(x)$ is called the smooth potential or smooth energy. With (2.1), (2.5) can be rewritten as the energy function given in (2.6) and (2.7). Most vision problems map to the minimization of an energy function over an MRF. To some degree, the energy function can be seen as a mathematical representation of the scene; it should precisely measure the global quality of a solution while remaining easy to minimize globally. When the energy function (2.7) is minimized, the corresponding posterior $\Pr(x \mid y)$ attains its maximum.
To solve a specific problem, we need to determine the form of the energy and the parameters involved. Though there are many types of clique potential functions, there exists a unique normalized potential, called the canonical potential. In the literature, the energy function is expressed in either a parametric form or a nonparametric form [2]. Here, we take the second-order clique potential as an example, which is also called the pairwise model. The pairwise MRF is the most commonly used model, in which each node interacts with its adjacent nodes. It is the lowest-order constraint that conveys contextual information and is widely used due to its simple form and low computational cost [2]. The pairwise MRF models the statistics of the first derivative in the image structure (Figure 4). The corresponding energy function is $E(f) = \sum_{p \in P} E_d(f_p) + \sum_{p \in P} \sum_{q \in N_p} E_s(f_p, f_q)$ (2.8). Usually, $E_d$ is the local evidence of p taking the label $f_p$, such as the intensity or the color value. Equation (2.8) can be rewritten as $E(f) = \sum_{p \in P} E_d(f_p) + \sum_{(p, q) \in N} E_s(f_p, f_q)$ (2.9). In the binary MRF case, $E_s(f_p, f_q) = f_p \times f_q$, where $f_p \in \{0, 1\}$. In the multilabel case, the Potts model is the most widely used one, and it can prevent the edges of objects from being oversmoothed. Usually, the Potts model takes the form $E_s(f_p, f_q) = 0$ if $f_p = f_q$ and $E_s(f_p, f_q) = \alpha$ otherwise, where α may be a constant or $\alpha = |f_p - f_q|$. As illustrated in Figure 5, in the pairwise MRF model, a node is attached to each pixel in the image, while edges are constructed between the node and its four neighbors. With such a model, the corresponding energy function can be efficiently minimized using many inference algorithms. Other graph structures are also used. For example, in image segmentation, an image is partitioned into several regions. Each region can be regarded as a node, and edges may be constructed between adjacent segmented regions. To make the optimization more efficient, a hierarchical MRF model can be used. It mainly relies on a pyramid structure and works in a coarse-to-fine scheme, which uses a coarser solution to initialize a finer solution. It is well known that hierarchical methods can significantly improve the convergence rate and reduce the execution time. In [29-31], a regular pyramid downsampling method was applied, while Zitnick and Kang [32] used an irregular pyramid. Figure 6 illustrates an example.
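As a concrete reading of the pairwise model, the sketch below evaluates a data term plus a Potts smoothness term on a 4-connected grid. It is a minimal illustration, assuming each unordered neighbor pair is counted once and that the data cost is supplied as a callback; the names `potts`, `pairwise_energy`, and `data_cost` are hypothetical.

```python
def potts(fp, fq, alpha=1.0):
    """Potts pair potential: zero for equal labels, a constant alpha otherwise."""
    return 0.0 if fp == fq else alpha

def pairwise_energy(labels, data_cost, smooth=potts):
    """Sum of data costs over all pixels plus smoothness costs over all
    horizontally and vertically adjacent pixel pairs (each pair once)."""
    h, w = len(labels), len(labels[0])
    e = 0.0
    for y in range(h):
        for x in range(w):
            e += data_cost(y, x, labels[y][x])
            if x + 1 < w:   # right neighbor
                e += smooth(labels[y][x], labels[y][x + 1])
            if y + 1 < h:   # bottom neighbor
                e += smooth(labels[y][x], labels[y + 1][x])
    return e

# Hypothetical 2 x 3 binary image; the data cost is the absolute difference
# between the observed value and the candidate label.
image = [[0, 0, 1],
         [0, 1, 1]]
cost = lambda y, x, label: abs(image[y][x] - label)

e_copy = pairwise_energy(image, cost)                       # labels copy the image
e_flat = pairwise_energy([[0, 0, 0], [0, 0, 0]], cost)      # perfectly smooth labeling
```

In this toy case, copying the image pays only smoothness cost and the constant labeling pays only data cost, which illustrates the trade-off that the energy function encodes.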
Although most MRFs use the pairwise model due to its simplicity, schemes with more complex interactions, for example, an 8-neighborhood or a larger number of pairwise terms, are also sometimes used. A 26-neighborhood is commonly used in 3D volumetric image or video analysis. Higher-order clique potentials can capture more complex interactions of random variables. For example, calculating the curvature of an object requires the interaction of at least three nodes. However, the computational time for the clique potential increases exponentially with the size of the clique, which makes energy minimization difficult. Recently, there have been many attempts to go beyond the pairwise MRF. One approach is to transform the higher-order problem into a pairwise problem by adding auxiliary variables. In one such work, high-order cliques were decomposed into hierarchical auxiliary nodes, and hierarchical gradient nodes were used to reduce the computational complexity. Another way is to perform the computation directly using a factor graph representation [37]. Kwon et al. [38] proposed a nonrigid registration method using the MRF with a higher-order spatial prior. Experiments show that using high-order potentials significantly improves the performance of image denoising, as shown in Figure 7.

Inference Methods
Over the years, a large number of inference algorithms have been developed. They can be mainly classified into two categories, namely, message passing algorithms such as loopy belief propagation and move-making algorithms such as graph cuts. In this section, we briefly introduce two classic inference methods for approximating energy minima, namely, belief propagation and graph cut.

Graph Cut
Graph cut (GC) was first applied in computer vision by Greig et al. [40]; the term describes a large family of MRF inference algorithms based on solving min-cut/max-flow problems. Given a computer vision problem that can be formulated in terms of an energy function, GC can find the minimum energy configuration, which corresponds to the MAP estimate. Suppose that $G = (V, E)$ is a directed graph with nonnegative edge weights, where V represents the vertices and E denotes the edges. The graph has two special terminal vertices, namely, the source s and the sink t. A cut $C = (S, T)$ is a partition of V. An s-t cut is a cut that places the source and the sink in different subsets, with $s \in S$ and $t \in T$. According to graph theory, the potential of a cut is measured by the sum of the weights of the edges crossing the cut. Finding a minimum s-t cut is equivalent to computing the maximum flow from the source s to the sink t. The maximum flow is the maximum "amount of water" that can be sent from the source to the sink when graph edges are interpreted as directed "pipes" with capacities equal to the edge weights. As illustrated in Figure 8, the GC algorithm is designed to solve this max-flow problem.
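The min-cut/max-flow equivalence can be checked on a small graph. The sketch below uses the textbook Edmonds-Karp augmenting-path method, chosen here only for brevity; it is not claimed to be the solver used in the surveyed graph cut work, and the example graph and its capacities are invented.

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity graph {u: {v: c}}.
    Returns (flow value, set S of nodes on the source side of a minimum cut)."""
    residual = {s: {}, t: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)  # reverse edge for undoing flow
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)  # nodes still reachable from s form S
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical graph with terminals "s" and "t".
graph = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
value, source_side = max_flow(graph, "s", "t")
cut_weight = sum(c for u, edges in graph.items() for v, c in edges.items()
                 if u in source_side and v not in source_side)
```

The summed weight of the edges leaving the source side equals the flow value, which is the equivalence the paragraph above relies on.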
It was reported that GC can obtain the exact solution in the binary label case. In the multilabel case, GC requires solving a series of related binary inference problems and then obtains an approximate global optimum. Two of the most popular GC algorithms are α-β swap and α-expansion. In the α-β swap algorithm, a swap move takes some subset of the nodes currently labeled α and assigns them the label β, and vice versa. The α-expansion algorithm expands the set of nodes labeled α by assigning α to nodes that currently carry other labels. When no swap or expansion move lowers the energy any further, a local minimum has been found. Comparing the two algorithms, α-expansion is more accurate and efficient, and it can also produce a result with lower energy. However, the condition for α-expansion is stricter. When using α-expansion, the interaction potential V must be a metric, that is, $V(\alpha, \beta) = 0 \Leftrightarrow \alpha = \beta$, $V(\alpha, \beta) = V(\beta, \alpha) \ge 0$, and $V(\alpha, \beta) \le V(\alpha, \gamma) + V(\gamma, \beta)$ for all labels α, β, γ. For α-β swap, it must be a semimetric, that is, it must satisfy only the first two conditions. More details about α-β swap and α-expansion can be found in [4]. In addition, Kolmogorov and Rother [41] wrote a survey of graph cut and pointed out that GC can be applied to both submodular and nonsubmodular functions. Other more recent developments in GC include order-preserving GC [42] and combination GC [43].
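Whether a given interaction potential is a metric or only a semimetric can be checked exhaustively over a small label set. The sketch below is illustrative; the three potentials and their truncation values are common textbook examples chosen as assumptions, not parameters from the surveyed works.

```python
from itertools import product

def is_semimetric(V, labels):
    """Symmetry, nonnegativity, and V(a, b) = 0 exactly when a = b."""
    return all(
        V(a, b) == V(b, a) and V(a, b) >= 0 and ((V(a, b) == 0) == (a == b))
        for a, b in product(labels, repeat=2)
    )

def is_metric(V, labels):
    """A semimetric that also satisfies the triangle inequality."""
    return is_semimetric(V, labels) and all(
        V(a, b) <= V(a, c) + V(c, b) for a, b, c in product(labels, repeat=3)
    )

def potts(a, b):
    return int(a != b)

def truncated_linear(a, b):
    return min(abs(a - b), 2)

def truncated_quadratic(a, b):
    return min((a - b) ** 2, 4)

labels = range(4)
summary = {V.__name__: (is_semimetric(V, labels), is_metric(V, labels))
           for V in (potts, truncated_linear, truncated_quadratic)}
```

The truncated quadratic fails the triangle inequality (for labels 0, 1, 2 the direct cost 4 exceeds 1 + 1), so under the conditions above it is suitable for α-β swap but not for α-expansion.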

Belief Propagation
Belief propagation (BP) is a powerful inference tool originally developed for tree-structured Bayesian networks [45]. It has more recently been extended to graphs with cycles, such as the MRF. Although on such graphs BP is only guaranteed to converge to a stationary point of the Bethe free energy [46], it obtains reasonable results in practice. In standard BP on a pairwise MRF, a variable $m_{ij}(x_j)$ can be treated as a "message" from a node i to its neighbor j, which contains information about what state node j should be in. The message is a vector whose dimension equals the number of possible labels. The value of each component indicates how likely the corresponding label is for the node.
Let $\phi_{ij}(x_i, x_j)$ be the pairwise interaction potential of $p_i$ with $p_j$, and let $\phi_i(x_i)$ be the "local evidence" of $p_i$. Usually, the message must be nonnegative. A large value of the message means that the node "believes" the posterior probability of $x_j$ is high. The message updating rule is $m_{ij}^{t}(x_j) = \sum_{x_i} \phi_i(x_i)\, \phi_{ij}(x_i, x_j) \prod_{k \in N_i \setminus j} m_{ki}^{t-1}(x_i)$ (3.3), where $t = 1, \ldots, T$ is the iteration index, as shown in Figure 9.
The belief is the product of the "local evidence" of the node and all messages sent to it, $b_i(x_i) = k\, \phi_i(x_i) \prod_{j \in N_i} m_{ji}(x_i)$ (3.4), where k is a normalizing constant. The standard BP described above is also called sum-product BP. There is another variant of BP which is simpler to use, namely, max-product (or max-sum in the log domain). In max-product BP, the sum in (3.3) is replaced by a maximization, $m_{ij}^{t}(x_j) = \max_{x_i} \phi_i(x_i)\, \phi_{ij}(x_i, x_j) \prod_{k \in N_i \setminus j} m_{ki}^{t-1}(x_i)$, and the belief (3.4) is used to select the label $x_i^* = \arg\max_{x_i} b_i(x_i)$. A sketch of this process is given in Figure 10. Several speed-up techniques have been proposed, for example, distance transforms, checkerboard updating, and multiscale BP [5], so that belief propagation can converge efficiently. Alternatively, Yu et al. [47] used predictive coding, linear transform coding, and the envelope point transform to improve the efficiency of BP.
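On a graph without loops, the message and belief rules can be verified against brute-force enumeration. The sketch below runs one forward and one backward sweep on a chain, with `combine=sum` for sum-product and `combine=max` for max-product. The chain topology, the shared pairwise table `psi`, and all numeric potentials are assumptions made for illustration; loopy grids would need iterated updates instead.

```python
import itertools

def chain_bp(phi, psi, combine=sum):
    """Beliefs on a chain MRF. phi[i][a] is the local evidence of node i,
    psi[a][b] the pairwise potential between adjacent nodes."""
    n, k = len(phi), len(phi[0])
    fwd = [[1.0] * k for _ in range(n)]  # fwd[i]: message from node i-1 to i
    bwd = [[1.0] * k for _ in range(n)]  # bwd[i]: message from node i+1 to i
    for i in range(1, n):
        fwd[i] = [combine(phi[i - 1][a] * psi[a][b] * fwd[i - 1][a]
                          for a in range(k)) for b in range(k)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [combine(phi[i + 1][a] * psi[b][a] * bwd[i + 1][a]
                          for a in range(k)) for b in range(k)]
    beliefs = []
    for i in range(n):
        # Belief = local evidence times all incoming messages, normalized.
        b = [phi[i][a] * fwd[i][a] * bwd[i][a] for a in range(k)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs

def brute_force(phi, psi):
    """Exact marginals and MAP configuration by enumerating every labeling."""
    n, k = len(phi), len(phi[0])
    joint = {}
    for x in itertools.product(range(k), repeat=n):
        p = 1.0
        for i in range(n):
            p *= phi[i][x[i]]
        for a, b in zip(x, x[1:]):
            p *= psi[a][b]
        joint[x] = p
    z = sum(joint.values())
    marginals = [[sum(p for x, p in joint.items() if x[i] == a) / z
                  for a in range(k)] for i in range(n)]
    return marginals, max(joint, key=joint.get)

phi = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]   # hypothetical local evidence
psi = [[2.0, 1.0], [1.0, 2.0]]               # favors equal neighboring labels
sum_beliefs = chain_bp(phi, psi, combine=sum)
max_beliefs = chain_bp(phi, psi, combine=max)
map_labels = tuple(b.index(max(b)) for b in max_beliefs)
exact_marginals, exact_map = brute_force(phi, psi)
```

On this chain the sum-product beliefs coincide with the exact marginals, and the per-node maxima of the max-product beliefs recover the MAP labeling, since the MAP configuration is unique here.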
Although BP is an efficient inference algorithm for MRFs with loops, it can only converge to a stationary point of the Bethe approximation of the free energy. Recently, the generalized belief propagation (GBP) algorithm proposed in [48] has received more attention due to its better convergence properties compared with BP. It can converge to a more accurate stationary point, that of the Kikuchi free energy [46]. More details about the GBP algorithm can be found in [48].
BP and graph cut are both good optimization techniques which can find "global" minima over cliques and produce plausible results in practice. A comparison between the two approaches for stereo vision was described in [49]. GC obtains lower energy, but the performance of BP is comparable to that of GC relative to the ground truth.
In addition to these two typical methods, many other inference algorithms have been proposed in the last few years. Fusion move [8] was proposed for the multilabel MRF. By employing QPBO graph cut, the fusion move can efficiently combine two proposal labelings in a theoretically sound way, and the combination is in practice often globally optimal. Alahari et al. [50] improved the computational and memory efficiency of algorithms for solving multilabel energy functions arising from discrete MRFs by recycling, reducing, and reusing. Kumar et al. [51] provided an analysis of the linear programming relaxation, the quadratic programming relaxation, and the second-order cone programming relaxation for obtaining the maximum a posteriori estimate of a general discrete MRF. Komodakis and Tziritas [52] proposed an exemplar-based framework and used priority BP to find MRF solutions. Ishikawa [53] introduced a method to exactly solve a first-order MRF optimization problem in more generality than previous ones. Cho et al. [54] used a patch transform representation to manipulate images in the patch domain. The patch transform is posed as a patch assignment problem on an MRF, where each patch should be used only once, and neighboring patches should fit together to form a plausible image.

Applications
Here, we present MRF solutions for two typical problems in computer vision, namely, stereo matching and image segmentation. These problems require labeling each pixel with a value that represents the disparity or the foreground/background membership, respectively. They can be easily modeled using the MRF and solved by energy minimization.

Stereo Matching
Stereo matching has always been one of the most challenging and fundamental problems in computer vision. Comprehensive research has been done in the last decade [32, 55-58]. A recent evaluation of these methods can be found in [59]. In the last few years, as shown in [44], global methods based on the MRF have reached the top performance.
For MAP estimation, let P be the set of pixels in the image pair, and let L be the set of disparities. The initial data cost, computed with a truncated linear function that is robust to noise and outliers, is defined in (4.1), where λ is the cost weight, which determines the portion of the whole energy contributed by the data cost, and T is the truncation value. The parameters can be set to empirical values obtained from experiments. $I^L_c(p)$ represents the intensity of p in channel c of the left image; $I^R_c(p)$ is defined similarly. Birchfield and Tomasi's pixel dissimilarity is used to improve the robustness against image sampling noise. The smooth cost, which expresses the compatibility between neighboring variables and is based on the truncated linear model, is defined as $E_s(f_p, f_q) = \min(|f_p - f_q|, K)$ (4.2), where K is the truncation value. The smooth cost based on the truncated linear model is also referred to as a discontinuity-preserving cost since it can prevent the edges of objects from being oversmoothed. The corresponding energy function used here is the most common one and is defined as $E(f) = \sum_{p \in P} E_d(f_p) + \sum_{(p, q) \in N} E_s(f_p, f_q)$ (4.3), where N contains the edges of the four-connected neighborhood system.
The objective is to find a solution which minimizes (4.3). The solution represents the depth information of the scene. Figure 11 shows the results on the "Tsukuba" data set using different energy minimization methods available in [44]. In the past decades, segment-based stereo methods [32] have flourished, as they perform well in reducing the ambiguity associated with textureless regions and in enhancing noise tolerance by aggregating over pixels with homogeneous properties. Usually, these algorithms first segment the source image. The matching cost is then computed over each entire segment, and a plane fitting method is applied to refine the result.
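The following sketch evaluates a truncated-linear stereo energy on a single scanline. Several details are assumptions rather than the exact form used in the survey: grayscale intensities instead of color channels, a plain absolute difference instead of Birchfield and Tomasi's dissimilarity, the convention that pixel x in the left image matches pixel x − d in the right image, a full truncated penalty when the match falls outside the image, and only horizontal neighbor pairs.

```python
def data_cost(left, right, x, d, lam=1.0, T=20):
    """Weighted, truncated absolute intensity difference for disparity d."""
    if x - d < 0:                      # match falls outside the right image
        return lam * T
    return lam * min(abs(left[x] - right[x - d]), T)

def smooth_cost(fp, fq, K=2):
    """Truncated linear smoothness cost between neighboring disparities."""
    return min(abs(fp - fq), K)

def stereo_energy(left, right, disp, lam=1.0, T=20, K=2):
    """Data cost summed over pixels plus smooth cost over adjacent pairs."""
    e = sum(data_cost(left, right, x, d, lam, T) for x, d in enumerate(disp))
    e += sum(smooth_cost(a, b, K) for a, b in zip(disp, disp[1:]))
    return e

# Hypothetical scanlines: the left line is the right line shifted by one pixel.
right = [10, 10, 50, 50, 90, 90]
left = [10, 10, 10, 50, 50, 90]
e_shift = stereo_energy(left, right, [1] * 6)   # correct constant disparity
e_zero = stereo_energy(left, right, [0] * 6)    # wrong constant disparity
```

The correct constant disparity has the lower energy; the truncation keeps the two mismatched pixels in the wrong labeling from dominating the total.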

Binary Image Segmentation
Binary image segmentation is widely used in medical image analysis and object recognition. Here, each pixel is assigned a label l with 0 ≤ l ≤ 1. In the simplest case, we have $l \in \{0, 1\}$, where 0 indicates that the pixel belongs to the background and 1 that it belongs to the foreground. The segmentation result should be accurate and fine enough for applications such as object categorization, photo editing, and image retrieval. Although segmentation is regarded as one of the most difficult problems due to the complexity of real scenes and noise corruption, the MRF model can often deal successfully with this challenging problem.
The corresponding energy function has the same form as (2.9). The data cost $E_d$ represents whether the pixel property is consistent with the statistical distribution of a candidate region. A simple choice is the absolute difference between the pixel intensity and the mean gray level of the region. Alternatively, a more complex data term often leads to better results. For example, in [60], the data cost uses a color data model, namely, the log-likelihood of a pixel, modeled with two separate Gaussian mixture models. The smoothness term is a simple Potts model, given in (4.4). In (4.4), Dis(m, n) is the Euclidean distance between pixel m and pixel n, and $[f_p \ne f_q]$ denotes the indicator function taking the values 0 and 1. K is a constant. If K = 1, the smoothness term recovers the Ising model, which encourages smoothness everywhere. K determines how coherent similar grey levels in a region are. Recently, user interaction was proposed to refine the results in [60-62]. Usually, the user first marks some pixels to indicate the background and foreground. With those labeled pixels, we can obtain the corresponding region statistics.
GC is the most common optimization tool for the binary MRF combining color and texture information with edge information. Further, the marked pixels can be used as seeds in the cut-based algorithm. A graph cut extension, GrabCut [60], was proposed for iterative minimization of the energy.
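The interplay of the data term and the smoothness term in binary segmentation can be illustrated on a six-pixel strip. The sketch assumes the simple data cost mentioned above (absolute difference to a region mean), a constant Potts penalty K between adjacent pixels, and exhaustive search in place of a graph cut; the pixel values and region means are invented.

```python
from itertools import product

def segment(pixels, mean_bg, mean_fg, K=30):
    """Exhaustive minimum-energy labeling of a 1D strip of gray levels.
    Label 0 = background, label 1 = foreground."""
    means = (mean_bg, mean_fg)

    def energy(f):
        data = sum(abs(v - means[label]) for v, label in zip(pixels, f))
        smooth = sum(K for a, b in zip(f, f[1:]) if a != b)
        return data + smooth

    return min(product((0, 1), repeat=len(pixels)), key=energy)

# Hypothetical strip: dark background, bright foreground, one ambiguous pixel (110).
strip = [12, 20, 110, 18, 190, 175]
with_prior = segment(strip, mean_bg=15, mean_fg=185, K=30)
data_only = segment(strip, mean_bg=15, mean_fg=185, K=0)
```

Without the smoothness term the ambiguous pixel is labeled foreground on its own; with K = 30 the two extra label changes cost more than the data term gains, so the pixel is absorbed into the background.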

Conclusion
It is now acknowledged that the MRF is one of the most successful approaches for solving labeling problems in computer vision and image analysis. The main challenge for MRF models is to develop efficient inference algorithms that find low-energy configurations.
First, computer vision problems involve a very large number of nodes. For example, consider two frame images with the size of a × b. If each node takes N possible labels, the size of the computation space is $N^{a \times b}$. Clearly, the inference algorithm must be efficient enough to overcome this difficulty. Secondly, constructing a reasonable MRF also plays a key role, especially for new vision applications. For instance, there are many different grid topologies and nonlocal topologies. Thirdly, the parameters of the MRF model should be efficiently learned from images instead of being chosen manually or empirically. Furthermore, future studies can focus on energy functions which cannot be efficiently minimized by state-of-the-art methods.
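The growth of the label space is easy to quantify. In the sketch below, the function names are hypothetical, and the 384 × 288 image size with 16 labels is only an assumed example of a small stereo problem.

```python
import math

def configurations(n_labels, a, b):
    """Exact number of labelings of an a x b grid with n_labels per node."""
    return n_labels ** (a * b)

def log10_configurations(n_labels, a, b):
    """Base-10 logarithm of the same count, usable for realistic image sizes."""
    return a * b * math.log10(n_labels)

tiny = configurations(2, 3, 3)                 # a 3 x 3 binary image
digits = log10_configurations(16, 384, 288)    # assumed small stereo problem
```

Even a 3 × 3 binary image has 512 labelings, and the count for the assumed stereo problem has more than a hundred thousand decimal digits, which is why exhaustive search is out of the question.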

Figure 1: An example of the 5th-order neighbor system.
Here $f(a, b)$, as used in the definition of the neighbor system $N_i$, measures the Euclidean distance between a and b.

Figure 2: (a) Horizontal pair-site cliques, (b) vertical pair-site cliques, (c) and (d) diagonal pair-site cliques.

Figure 3: Schematic comparison of ML and MAP methods.

Figure 4: The model for a 4-neighbor MRF (i.e., the pairwise MRF). The dashed circles are the observed nodes, and the white circles are the unobserved labels.

Figure 5: (a) Standard pairwise MRF model with image-grid data, where each circle represents a node or a pixel. The black circles are neighbors of the white one; (b) an example of an MRF used in segmentation, where the nodes of neighboring segments are connected by applying the Delaunay triangulation method.

Figure 6:

Figure 7: An example of image denoising. (a) is the original noisy image. (c) is the denoising result by BP using pairwise interaction [5]. (b) is the result by BP with the learned 2-by-2 model. With the higher-order cliques, the results no longer exhibit any piecewise constancy: not only are edges preserved, but smoothly varying regions are also preserved better than with the pairwise model [39].

Figure 8: An example of min-cut/max-flow graph cut. The gray circles represent the nodes, and the solid lines are the edges between the nodes. The curve of each "flow" is connected to the source terminal or the sink terminal. The potential of a flow is indicated by the width of the line. The dotted line indicates a cut that partitions the graph.

Figure 9: Message passing in BP. $m_{ij}^{t}(x_j)$ is a message from node i to its neighbor j and indicates what state node j should be in.

Figure 10: (a) Messages passing from node s to its neighbor u. (b) The belief of node s is calculated according to its neighbors' messages.

Figure 11: Comparison of different optimization algorithms. (a) is the original left image. (b) is the result of BP. (c) is the result of α-expansion. (d) is the result of TRW [44].

Figure 12: Comparison of different optimization algorithms for binary segmentation of the image "sponge." (a) is the original image. (b) is the result of BP. (c) is the result of α-expansion. (d) is the result of TRW. There is only a slight difference among the three results. Besides, since this is a binary labeling problem, α-expansion finds the global optimum in a single iteration [44].
Figure 12 shows the results of binary segmentation with different methods using identical parameters [44]. Considering multilabel segmentation, Micusik and Pajdla [63] formulated single-image multilabel segmentation into regions coherent in texture and color as a max-sum problem. As a region merging method, Mignotte [64] used an MRF fusion model that combines several segmentation results to achieve a more reliable and accurate result. More recently, Panda and Nanda [65] proposed an unsupervised color image segmentation scheme using the homotopy continuation method and a compound MRF model. Chen et al. [66] proposed an image segmentation method based on MAP or ML estimation. Li [67] introduced a multiresolution MRF approach to texture segmentation problems. Rivera et al. [68] presented a new MRF model for parametric image segmentation. Some other works [69-71] addressed learning of the prior distribution. The MRF is also widely used in medical image segmentation. Zhang et al. [72] proposed the segmentation of brain MR images through a hidden MRF. Scherrer et al. [73] used expectation maximization to segment images in an MRF model. Anguelov et al. [74] segmented 3D scanned data into objects using GC. Hower et al. [75] investigated the MRF in the context of neuroimaging segmentation. As a low-level vision problem, segmentation is often applied to object classification. Honghui et al. [76] proposed a robust supervised label transfer method for semantic segmentation of street scenes. Feng et al. [77] recently proposed a method to optimize the MRF which can automatically determine the number of labels, balancing accuracy and efficiency.