Conditional Random Fields for Image Labeling

With the rapid development and application of CRFs (conditional random fields) in computer vision, many researchers have made notable progress in this domain, because CRFs solve the classical version of the label bias problem with respect to MEMMs (maximum entropy Markov models) and HMMs (hidden Markov models). This paper reviews the research development and status of object recognition with CRFs and introduces two main discrete optimization methods for image labeling with CRFs: graph cut and mean field approximation. Graph cut is described briefly, while mean field approximation is treated in more detail, since it offers substantially faster inference and has been widely studied in recent years.


Introduction
Recognizing and labeling objects and properties in a given image is an important task in computer vision. The goal of image labeling is to label every pixel or group of pixels in the image with one of several predetermined semantic object or property categories, for example, "dog," "building," and "car." Human beings perform object recognition effortlessly, but it is not straightforward for a computer to do so. Researchers [1][2][3][4] are still trying to improve image labeling techniques in terms of both speed and accuracy. Figure 1 is an example of image labeling.
Image labeling usually involves several steps: first we set up a model and train it; then we infer the labeling of a new image. State-of-the-art algorithmic solutions to image labeling have yet to reach a satisfactory level, especially for the inference step. The graph cut method [5][6][7][8] was popular previously, but it is very slow, especially when there are many labels. In [1], Vineet et al. achieve remarkable speed-ups and improvements in accuracy over graph cut based inference techniques, the baseline method, in both joint stereo-object labeling and object class segmentation. However, their method [9] has two limitations: the first is that the mean field approximation assumes complete factorization over the individual variables; the second relates to the form of the pairwise weights in the formula, which are a linear combination of Gaussian kernels. See Section 3.2 for more details of these two limitations.
Naturally, human beings understand a scene mainly by using the spatial and visual information assimilated through their eyes. Conversely, given one or several images, this information, such as boundaries or objects, is essential for scene interpretation. What we hope for is to capture the full interaction between pixels. Because of sensor noise and the complexity of the real world, exact interpretation is out of reach for computers, and researchers have realized that the solution of vision problems can instead be cast as an equivalent optimization process.
In the early history of computer vision, the Markov random field (MRF) was widely used in both low-level and high-level vision perception after it was first introduced into vision by S. Geman and D. Geman in 1984 [10]. The MRF provides a mathematical framework for finding optimal solutions by using the contextual visual information in the images. Recently, the MRF model regained attention in the field of computer vision thanks to progress in powerful energy minimization algorithms [3] such as graph cut [6], belief propagation [11], dual decomposition [12], fusion move [13], and iterated conditional modes. The MRF has been applied to image problems such as restoration, matting [14], segmentation, optical flow, object classification [15,16], face recognition [17], and text recognition [18].
Figure 2: Some examples of labeling problems in computer vision. For stereo matching, the goal is to find the corresponding pixel in one image given a pixel in another image. Its label set is the differences (disparities) between corresponding pixels. For image segmentation, the goal is to partition an image into multiple disjoint regions, with region IDs as the label set. Image restoration tries to "compensate for" or "undo" defects which degrade an image, and its label set is restored intensities or colors.
Object classification can be formulated as a pixel labeling problem; that is, the correct label is to be assigned to each pixel or clique, where the label of a pixel represents some property in the real scene, such as the same object or disparity. In [3], Chen et al. introduced the background, basic concepts, and fundamental formulation of image labeling with MRFs. They discussed two distinct types of discrete optimization method, belief propagation and graph cut, and applied them to two classical vision problems using the MRF model: stereo and binary image segmentation. Figure 2 shows some examples of labeling problems in computer vision.
It was later recognized that the image labeling problem can be naturally described with a conditional random field (CRF) model [1]. The CRF model was first proposed by John Lafferty et al. [19]. The use of CRFs was originally restricted to the area of information extraction [22][23][24][25], in which, given a dataset, the problem is to extract relevant information that belongs to some predefined types. Since the datasets are mostly linguistic and text is inherently sequential, imposing a chain structure on the texts is both effective in capturing temporal relations and efficient in inference and learning. Therefore, CRFs were quickly adopted in a wide range of text processing applications, such as part-of-speech (POS) tagging, chunking [26,27], and semantic role labeling [28]. Later on, the application of CRFs was expanded to word alignment [29], question answering [30], and document summarization [31].
Recently, research on the CRF model in computer vision has been very active, as the model can be solved by efficient energy minimization algorithms. The efficiency of inference is a critical issue for CRFs, both in training and in predicting the labels of new inputs. When training a CRF model, marginal distributions over subsets of labels are computed in order to estimate the parameters of the model. Once trained, the model can be used to predict the labels of a new input, such as a new image, by choosing the most likely labels. Many inference algorithms have been deployed to solve CRF optimization problems, such as iterated conditional modes [32], Monte Carlo methods [33], graph cut methods [5][6][7][8], and message passing methods, of which mean field inference [1,34] and belief propagation [35] are the two most popular; many extensions of these methods have also been developed.
Local information is well captured by the standard form of a CRF [6,36]. The standard form is not effective for modeling global information, however, and often fails to capture global consistency in image recognition, so research on how to capture global image information in CRFs of different forms [5][6][7]37] has become a hot area. Capturing both local and global information makes learning and inference very difficult; we must consider not only the accuracy of the method but also its efficiency, which degrades quickly as the size of the input grows, for example, the dimension of the captured features or the number of input images. Many methods [38][39][40][41] have been proposed to address this problem. Recently, a number of cross bilateral Gaussian filter-based methods have been proposed for problems such as object class segmentation [34], denoising [42], and stereo and optical flow [2]; all of these permit substantially faster inference while maintaining or improving accuracy. On the basis of [6], Vineet et al. [1] show how higher-order terms can be formulated such that filter-based inference remains possible and demonstrate their techniques on joint stereo and object labeling problems, as well as object class segmentation. In fact, they show that they are able to speed up inference in these models around 10-30 times with respect to competing graph cut methods.
In this paper, we review the progress in inference for image labeling with CRF models. As mentioned above, a good inference algorithm is critical both in predicting the labels of a new input and in learning the parameters of the model, so as to satisfy the two main goals that we pursue: accuracy and efficiency.
Section 2 gives the model of CRFs and their extensions. In Section 3, we introduce two inference methods that have been widely used in recent years: graph cut and mean field approximation. We conclude the paper in Section 4.

The Model of CRFs
A CRF is a discriminative undirected probabilistic graphical model that can represent relationships between different variables [20,43]. The structure of a CRF model helps to estimate the unobserved variables given the observed ones. The classical CRF model is described as follows [34].
Denote by $X$ the input variable and by $Y = (Y_1, Y_2, \ldots, Y_n)$ the joint output variable. The input variable $X$ represents our knowledge about the domain, such as color and texture. The output $Y$ can be continuous or discrete, but, in most cases, all the labels we set are discrete.
We would like to model the mapping from $X$ to $Y$ via the conditional distribution $P(Y \mid X)$. As a result, we are only interested in the output structure conditioned on the input. CRFs approach the modeling of $P(Y \mid X)$ by representing $Y$ as a Markov random field. More precisely, let $G = (V, E)$ be an undirected graph, where $V$ is the set of nodes in the graph, each node corresponds to a variable $Y_i$, and $E$ is the set of edges. Let $n = |V|$ denote the number of nodes in the graph. Define $X$ as the set of input random variables and $Y = \{Y_v\}_{v \in V}$ as the set of output random variables, where $V = X \cup Y$ and each $Y_v$ ($v \in V$) takes a value from a range of possible discrete labels. In a conditional random field, we assume that each random variable $Y_v$ obeys the Markov property when conditioned on $X$, such that the conditional probability distribution of $Y_v$ given its adjacent nodes is independent of the rest of the nodes in the graph. That is, if $G$ is a graphical model such that
$$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \in N_v),$$
where $N_v$ is the set of adjacent nodes of $v$, then $(X, Y)$ is a conditional random field (CRF). Let $N = \{N_v \mid \forall v \in V\}$ represent the neighbor system, which indicates the interrelationship between nodes or the order of the CRF. Edges are added between a node $Y_v$ and its neighbors $N_v$. Usually, the neighbor system should satisfy the following:
(1) A site does not neighbor with itself: $v \notin N_v$.
(2) The neighboring relationship is mutual: $v \in N_{v'} \Leftrightarrow v' \in N_v$.
The definition of the neighbor system is important because it reflects how far the contextual constraint reaches. For regular data, as in Figure 3, the neighbors of $v$ are defined as the set of sites within a radius of $\sqrt{r}$ from $v$, where $r$ is the order of the neighbor system. One has
$$N_v = \{v' \in V \mid [\mathrm{dis}(v, v')]^2 \le r,\ v' \neq v\},$$
where $\mathrm{dis}(v, v')$ measures the Euclidean distance between $v$ and $v'$.
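To make the neighbor system concrete, the following is a minimal sketch for a regular pixel grid, assuming sites are (row, column) pairs and that a site's neighbors are all other sites whose squared Euclidean distance is at most the order of the system. The function name `neighbors` and its parameters are illustrative and do not come from the cited works.

```python
import math


def neighbors(site, height, width, order):
    """Return the sites whose squared Euclidean distance to `site` is <= order.

    `site` is a (row, col) pair on a height x width grid (an assumed layout).
    The site itself is excluded, so a site never neighbors with itself.
    """
    row, col = site
    reach = math.isqrt(order)  # largest integer offset that can be in range
    result = []
    for d_row in range(-reach, reach + 1):
        for d_col in range(-reach, reach + 1):
            if d_row == 0 and d_col == 0:
                continue
            r, c = row + d_row, col + d_col
            inside = 0 <= r < height and 0 <= c < width
            if inside and d_row * d_row + d_col * d_col <= order:
                result.append((r, c))
    return result


# Order 1 gives the 4-connected system and order 2 the 8-connected system.
first_order = neighbors((2, 2), 5, 5, 1)
second_order = neighbors((2, 2), 5, 5, 2)
```

Because the distance test is symmetric, the mutual-neighbor condition holds by construction; sites on the image border simply have fewer neighbors.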
In object recognition problems, the observations $X$ are often the image data themselves, or extracted visual features, and $Y$ corresponds to the outputs of the vision system, for example, the possible class labels of the image to be classified, as shown in Figure 4.
To keep the concepts clear, we only consider the case in which each variable in $Y$ takes a value from a range of possible discrete labels, although in general the variables can be either continuous or discrete. We describe the model from two points of view: probabilistic and energy function.
Under the probabilistic view, let $\Lambda$ be the set of all maximal cliques of $G$, and let $x$ and $y$ denote the values assigned to the variables $X$ and $Y$, respectively. The conditional probability distribution of a CRF can be written as
$$P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in \Lambda} \Psi_c(y_c, x), \tag{2}$$
where the so-called potential function or compatibility function $\Psi_c$ is a nonnegative function defined over $c$, a maximal clique in $G$. $Z$ is a normalization factor, also called the partition function, which depends on the observed values of the input variable $X$ and is defined as
$$Z(x) = \sum_{y} \prod_{c \in \Lambda} \Psi_c(y_c, x).$$
We also assume that the conditional distribution over the graph $G$ belongs to an exponential family [44]; thus we require each potential function $\Psi_c$ to have the form
$$\Psi_c(y_c, x) = \exp\Big\{ \sum_{k} \theta_{ck} f_{ck}(y_c, x) \Big\},$$
where $\theta_c$ is a real-valued parameter vector and $\{f_{ck}\}$ is a set of feature functions defined on the potential $\Psi_c$.
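As a small illustration of the clique factorization and the partition function, the sketch below evaluates $P(y \mid x)$ by brute-force enumeration on a tiny graph. It is only feasible for a handful of nodes, and the clique list, the feature function `features`, and the weights `theta` are hypothetical choices made for the example.

```python
import math
from itertools import product


def crf_distribution(n_nodes, n_labels, cliques, theta, features, x):
    """Return {labeling: P(y | x)} by enumerating every labeling.

    cliques  : list of tuples of node indices
    theta    : dict mapping a clique to its weight vector
    features : function (clique, y_c, x) -> list of feature values
    Each clique potential is exp(theta_c . f_c(y_c, x)).
    """
    scores = {}
    for y in product(range(n_labels), repeat=n_nodes):
        log_score = 0.0
        for c in cliques:
            y_c = tuple(y[i] for i in c)
            log_score += sum(t * f for t, f in zip(theta[c], features(c, y_c, x)))
        scores[y] = math.exp(log_score)
    z = sum(scores.values())  # partition function Z(x)
    return {y: s / z for y, s in scores.items()}


def features(c, y_c, x):
    """Two toy features: label agreement inside the clique, agreement with x."""
    agree = float(y_c[0] == y_c[1])
    match = float(sum(label == x[i] for label, i in zip(y_c, c)))
    return [agree, match]


cliques = [(0, 1), (1, 2)]  # a three-node chain
theta = {c: [0.5, 1.0] for c in cliques}
dist = crf_distribution(3, 2, cliques, theta, features, x=(1, 1, 1))
best = max(dist, key=dist.get)
```

The enumeration makes the cost of the partition function explicit: the sum runs over every joint labeling, which is why approximate inference is needed for real images.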
To simplify the solution of (2), one can take the negative logarithm of both sides of (2), so that the problem of maximizing the conditional probability becomes an energy minimization problem. In practice, we usually model structures using pairwise constraints, since inference is easier in this case and the model parameters are easy to learn. For example, in computer vision problems, we often see CRFs with maximal cliques of size 2. In this case we can write down the energy as
$$E(y \mid x) = \sum_{i \in V} \psi_u(y_i, x) + \sum_{(i,j) \in E} \psi_p(y_i, y_j, x),$$
where we call $\psi_u$ the unary potential and $\psi_p$ the pairwise potential. Occasionally we also use high-order cliques, and there are special types of high-order clique potentials that are useful in a few applications. Probabilistic models need to be normalized properly and in many cases require evaluating intractable integrals over the space of all possible variable configurations. Energy functions, by contrast, have no such normalization requirement and thus provide more flexibility in designing the architecture of the underlying graphical model.
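A pairwise energy of this kind is straightforward to evaluate for a given labeling. The sketch below assumes a 4-connected pixel grid, a table of unary costs, and a Potts pairwise term that charges a constant for every pair of neighboring pixels with different labels; these choices are illustrative rather than taken from a specific cited model.

```python
def pairwise_energy(labels, unary, pairwise_weight):
    """Energy of a labeling on a 4-connected grid.

    labels          : 2D list of integer labels
    unary           : unary[r][c][l] is the cost of label l at pixel (r, c)
    pairwise_weight : Potts penalty for each neighboring pair that disagrees
    """
    height, width = len(labels), len(labels[0])
    energy = 0.0
    for r in range(height):
        for c in range(width):
            energy += unary[r][c][labels[r][c]]
            # Count each edge once by looking only down and to the right.
            if r + 1 < height and labels[r][c] != labels[r + 1][c]:
                energy += pairwise_weight
            if c + 1 < width and labels[r][c] != labels[r][c + 1]:
                energy += pairwise_weight
    return energy


# Every pixel prefers label 0 (cost 0) over label 1 (cost 1).
unary = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
smooth = pairwise_energy([[0, 0], [0, 0]], unary, 0.5)
one_outlier = pairwise_energy([[0, 0], [0, 1]], unary, 0.5)
```

Since $P(y \mid x) \propto \exp\{-E(y \mid x)\}$, the labeling with the lowest energy is also the most probable one, and no partition function is needed to compare two labelings.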
The standard form [1,25] of a CRF is good for modeling local information. We can write down the form of the standard CRF as follows:
$$P(L \mid X) = \frac{1}{Z} \exp\Big\{ \sum_{i \in S} \phi_l(l_i, X) + \alpha \sum_{i \in S} \sum_{j \in N_i} \psi_l(l_i, l_j, X) \Big\},$$
where $X$ is an input image, $L = \{l_i\}_{i \in S}$ represents a labeling, and $l_i$ is the category label at site $i$. $S$ is the set of sites in the image, $N_i$ is the set of neighbors of $i$, $\alpha$ is a coefficient that modulates the effects of the potentials, and $Z$ is the partition function.
The unary potential $\phi_l$ represents relations between labels and local image features. It predicts the label $l_i$ based on the local features at site $i$. The pairwise potential $\psi_l$ represents relationships between the labels of neighboring sites. If neighboring sites have similar image features, $\psi_l$ favors the same category label for them; if not, they may be assigned different category labels. The pairwise potential $\psi_l$ therefore performs data-dependent smoothing. What is important is that both potentials represent only local information; as a result, global information is lost and some counterintuitive mistakes can happen; for example, a "dog" might appear in the water [43]. Using global information, some classification mistakes in image labeling can be avoided, as shown in Figure 5.
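One common way to realize data-dependent smoothing is a contrast-sensitive Potts term, sketched below. This particular functional form, the name `contrast_sensitive_potts`, and the parameter `beta` are assumptions made for illustration; the cited models may define their pairwise potentials differently.

```python
import math


def contrast_sensitive_potts(label_i, label_j, feat_i, feat_j, beta=1.0):
    """Cost of giving two neighboring sites the labels label_i and label_j.

    Equal labels cost nothing. Different labels are penalized strongly when
    the features are similar and only weakly when the features differ a lot,
    so label boundaries are encouraged to follow image edges.
    """
    if label_i == label_j:
        return 0.0
    squared_distance = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    return math.exp(-beta * squared_distance)


# Similar colors: a label change is expensive.
similar = contrast_sensitive_potts(0, 1, (0.2, 0.2, 0.2), (0.2, 0.2, 0.2))
# Very different colors: a label change is cheap.
different = contrast_sensitive_potts(0, 1, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
```

Here the potential is written as a cost, so it would enter an energy with a positive sign.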
Later on, the multiscale CRF [43] (mCRF) was introduced to use regional and global label features that encode particular label patterns at local and global scales. The mCRF multiplicatively combines component conditional distributions that capture statistical structure at different spatial scales $s$:
$$P(L \mid X) = \frac{1}{Z} \prod_{s} P_s(L \mid X).$$
Although the mCRF uses regional and global label features, it has a very large number of variables and parameters to be estimated. It also involves inefficient stochastic sampling for learning and label inference. Large datasets and large numbers of classes therefore limit its practical application.
The boosted random fields [37] model long-range interactions learned by using a boosting algorithm [45]. The hierarchical CRF [23] (hCRF) uses a hierarchical structure of CRFs to model long-range interactions (e.g., relative configurations of objects or regions) and short-range interactions (e.g., pixel-wise label smoothing) in a tractable manner. Its two-layer formulation, which exploits different levels of contextual information in images for robust classification, is general enough to be applied to different domains ranging from pixel-wise image labeling to contextual object detection. Neither of these two methods incorporates global information about the image, so the labeling remains highly dependent on local information. The random field model proposed by Toyoda and Hasegawa [46] explicitly models both local and global information in a conditional random field. The method extracts global image features as well as local ones and uses them to predict the scene of the input image. Its form extends the standard CRF with two further terms, a global unary potential $\phi_g$ and a global pairwise potential $\psi_g$; $\alpha$, $\beta$, and $\gamma$ are coefficients that modulate the effects of the potentials, and $Z$ is the partition function for normalization. The global unary potential $\phi_g$ represents relationships between labels and global image features. It predicts the spatial configuration of labels according to the scene of the input image. The global pairwise potential $\psi_g$ represents the compatibility of all pairs of labels. This method not only incorporates local and global information but also enables rapid processing by using the global image features. However, it does not classify well when there are too many classes (there are only 7 classes in their experiments), because the relationships between classes become substantially more complex.
Some researchers [47][48][49] have turned their attention to higher-order cliques. Most energy minimization methods for solving computer vision problems assume that the energy can be represented in terms of unary and pairwise clique potentials. This assumption severely restricts the representational power of these models, making them unable to capture the rich statistics of natural scenes [50], whereas higher-order clique potentials can model complex interactions of random variables and could thus overcome this problem. The initial work with high-order potentials [36,[50][51][52] has been quite promising, but their use has been limited by the unavailability of efficient algorithms for minimizing the resulting energy functions. Kohli et al. [49] extend the class of energy functions for which the optimal $\alpha$-expansion and $\alpha\beta$-swap moves can be computed in polynomial time. In that paper, they propose the $P^n$ Potts model, for which the optimal move can be found by solving an st-mincut problem. They define the $P^n$ Potts model potential for cliques of size $n$ as
$$\psi_c(x_c) = \begin{cases} \gamma_k & \text{if } x_i = l_k,\ \forall i \in c, \\ \gamma_{\max} & \text{otherwise}, \end{cases}$$
where $\gamma_{\max} > \gamma_k$, $\forall l_k \in L$. For a pairwise clique this reduces to the $P^2$ Potts model potential, defined as $\psi_{ij}(a, b) = \gamma_k$ if $a = b = l_k$ and $\gamma_{\max}$ otherwise. The Gibbs energy of the CRF with high-order cliques in that paper is
$$E(x) = \sum_{i \in V} \psi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j) + \sum_{c \in C} \psi_c(x_c),$$
where $c$ is a clique which represents the patch $x_c = \{x_i, i \in c\}$ of the frame and $C$ is the set of all cliques. The example in the paper demonstrates the importance of enforcing label consistency over homogeneous regions for object class segmentation. However, inference is slow compared with the mean field inference method.
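The $P^n$ Potts model potential is a particular case of the pattern-based potentials [48], which are defined as
$$\psi_c^{\mathrm{pat}}(x_c) = \begin{cases} \gamma_{x_c} & \text{if } x_c \in P_c, \\ \gamma_{\max} & \text{otherwise}, \end{cases}$$
where $P_c \subset L^{|c|}$ is a set of recognized patterns (i.e., label configurations for the clique), each associated with an individual cost $\gamma_{x_c}$, while a common cost $\gamma_{\max}$ is applied to all other patterns. If we set $P_c$ to be the $|L|$ configurations with constant labels, we obtain the $P^n$ Potts model as described.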
Co-occurrence relations capture global information about which classes tend to appear together in an image and which do not. To model object class co-occurrence statistics, a new term $C(x)$ is added to the energy:
$$E(x) = \sum_{c \in C} \psi_c(x_c) + C(x).$$
Torralba et al. [53] proposed the use of additional unary potentials to capture scene-based occurrence priors. Their costs took the form
$$C(x) = \sum_{i \in V} \phi(x_i).$$
Although the complexity of inference over such potentials scales linearly with the size of the graph, they are prone to overcounting costs and require an initial hard decision on the scene type before inference. Rabinovich et al. [54,55] proposed co-occurrence as a soft constraint that took the form
$$C(x) = \sum_{i, j \in V} \phi(x_i, x_j),$$
where $\phi$ is some potential which penalizes labels that should not occur together in an image. This form can capture global information; however, because it is based on a fully connected graph, the memory requirements of inference scale badly with the size of the graph, growing with complexity $O(|V|^2)$ rather than $O(|V|)$.
To improve these methods, Ladicky et al. [40] proposed a new form of $C(x)$:
$$C(x) = C(L(x)),$$
where $L(x) = \{l \in L : \exists x_i = l\}$, which guarantees invariance to the size of an object, and $C(L(x))$ can be seen as a particular higher-order potential defined over a clique which includes the whole of $V$, that is, $\psi_V(x)$. The restriction is placed on $C(L(x))$ that it should be nondecreasing with respect to the inclusion relation; that is, $L_1, L_2 \subseteq L$ and $L_1 \subseteq L_2$ imply that $C(L_1) \le C(L_2)$. By incorporating these potentials, they obtained quantitatively better and visually more coherent labelings, but at a comparatively higher computational cost than mean field inference. Similar to Ladicky et al.'s form of $C(x)$, Vineet et al. [47] proposed a cost of the form $C(\Lambda(x))$, defined over the labels present in the labeling and expressed with indicator terms, where $[\cdot]$ is 1 for a true condition and 0 otherwise. They used filter-based mean field inference to minimize the energy with higher-order terms and showed that they are able to speed up inference in the relevant models about 10-30 times with respect to competing graph cut methods [43].
Joint optimization for object class segmentation is another important area of research in image labeling, such as combining objects and attributes for image segmentation [56], or joint optimization for object class segmentation and dense stereo reconstruction [4]. In [57], Farhadi et al. proposed a method to shift the goal of recognition from naming to description; for example, we not only recognize a basketball as a basketball but also describe its attributes, such as being round. The method therefore allows them not only to name a familiar object but also to report unusual aspects of a familiar object and to learn how to recognize new objects with few or no visual examples. The attributes in the paper are of two kinds: semantic and discriminative. Since the concepts of objects and attributes are both important for describing images precisely, in [57], they formulated the problem of joint visual attribute and object class image segmentation as a dense multilabeling problem, where each pixel in an image should be associated with both an object class and a set of visual attribute labels. In the paper, they proposed a factorial multilabel CRF model which combines the multiclass CRF model and the multilabel model.
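The multiclass CRF for objects can be defined in terms of an energy function:
$$E^O(x) = \sum_{i \in V} \psi_i^O(x_i) + \sum_{(i,j) \in N} \psi_{ij}^O(x_i, x_j),$$
where $\psi_i^O$ and $\psi_{ij}^O$ are unary and pairwise potential functions, respectively, and $N = \{(i, j) \mid i, j \in V,\ i \neq j\}$. The multilabel CRF for attributes is defined as
$$E^A(y) = \sum_{i \in V} \psi_i^A(y_i) + \sum_{(i,j) \in N} \psi_{ij}^A(y_i, y_j),$$
where $Y = \{Y_1, Y_2, \ldots, Y_N\}$ is a set of random variables and $A = \{a_1, a_2, \ldots, a_M\}$ is a set of attribute labels.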
Rather than taking values directly in $A$, though, the $Y_i$'s take values in $\mathcal{P}(A)$, where $\mathcal{P}$ denotes the power-set operator. They also defined a joint CRF in terms of a pairwise energy over the $z_i$ ($z_i = (x_i, y_i)$):
$$E^J(z) = \sum_{i \in V} \psi_i^J(z_i) + \sum_{(i,j) \in N} \psi_{ij}^J(z_i, z_j),$$
where the joint potentials combine the object and attribute terms. Using a two-level hierarchical model, in which labeling object classes and attributes is done not only at the pixel level but also at a regional level, they extended this energy with region-level terms.
It was recognized that the problems of dense stereo reconstruction and object class segmentation can both be cast as a CRF-based labeling problem, in which every pixel in the image is assigned a label corresponding to either its disparity or an object class. This inspired [4,46] to provide an energy minimization framework that unifies the two problems. In their paper, object class segmentation is written as a CRF energy $E^O(x)$ over object labels $x$, dense stereo reconstruction as a CRF energy $E^D(y)$ over disparity labels $y$, and the energy of the CRF for joint estimation is defined over the combined variables $z_i = [x_i, y_i]$. Using the fact that certain objects occupy a certain range of real-world heights, they coupled the unary potentials through
$$\psi_i^C(x_i, y_i) = -\log H\big(h(y_i, i) \mid x_i\big),$$
where $h(y_i, i)$ is the corresponding height above the ground plane and $H(h \mid l)$ is a histogram-based measure of the naive probability that a pixel taking label $l$ has height $h$ in the training set. The combined unary potential can then be written as
$$\psi_i^J([x_i, y_i]) = w_u^O \psi_i^O(x_i) + w_u^D \psi_i^D(y_i) + w_u^C \psi_i^C(x_i, y_i),$$
where $w_u^O$, $w_u^D$, and $w_u^C$ are the corresponding weights. For pairwise interactions, an object class boundary is more likely to occur where the disparities of two neighboring pixels differ significantly. Taking this into account, they chose tractable pairwise potentials of the form
$$\psi_{ij}^J([x_i, y_i], [x_j, y_j]) = w_p^O \psi_{ij}^O(x_i, x_j) + w_p^D \psi_{ij}^D(y_i, y_j) + w_p^C \psi_{ij}^O(x_i, x_j)\, \psi_{ij}^D(y_i, y_j).$$
Although the two models described above have more parameters to learn, which makes learning and inference more complicated, they achieved better scene understanding than earlier models.

Inference Methods
Over the years, a large number of inference algorithms have been developed. Because exact inference in such CRFs is intractable, much attention has been paid to developing fast approximation algorithms, including graph cut approaches [6], variants of belief propagation [11,35,50], and a number of Gaussian filter-based methods [1,39]. In this section, we briefly introduce two inference methods for approximating energy minima: the classical method, graph cut, and mean field approximation, which has been popular in recent years.

Graph Cut
Greig and Porteous [59] first applied graph cut in computer vision; the term describes a large family of MRF inference algorithms based on solving a min-cut/max-flow problem. If a computer vision problem can be formulated in terms of an energy function, then we can use graph cut to obtain the minimum energy configuration, which corresponds to the MAP estimate. Figure 6 is an example of a min-cut graph cut.
In this method, we set up a directed weighted graph $G = (V, E)$, which consists of a set of nodes $V$ and a set of directed edges $E$ with nonnegative edge weights. The nodes correspond to pixels in the image labeling problem. There are two additional nodes, called terminals: the source $s$ and the sink $t$. In computer vision, terminals correspond to the set of labels that can be assigned to pixels. All

The Mean Field Approximation.
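Recently, a number of mean field approximations have been proposed in computer vision for problems such as object class segmentation [8,9,11,34]. The mean field algorithm finds the distribution $Q$ that is closest to the exact distribution $P$ by minimizing the KL-divergence $D(Q \,\|\, P)$ within the class of distributions representable as a product of independent marginals, $Q(y) = \prod_i Q_i(y_i)$ [63]. Although approximating $P$ by a fully factored distribution is likely to lose a lot of the information in the distribution, this approximation is computationally attractive. The mean field approximation can be formulated as follows:
$$\max_{Q}\ F[\tilde{P}, Q] \quad \text{subject to} \quad Q(y) = \prod_i Q_i(y_i), \quad \sum_{y_i} Q_i(y_i) = 1 \ \ \forall i,$$
where $F$ is the energy functional and $\tilde{P}$ is the unnormalized distribution. Since $\ln Z = F[\tilde{P}, Q] + D(Q \,\|\, P)$, maximizing $F$ is equivalent to minimizing the KL-divergence. See [63] for more details.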
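The relation between the KL-divergence and the energy functional can be verified numerically on a toy model. The sketch below assumes $P(y) \propto \exp\{-E(y)\}$, takes the energy functional to be $F = \mathbb{E}_Q[-E] + H(Q)$, and checks the standard identity $\ln Z = F + D(Q \,\|\, P)$ by enumeration. The toy energy and the chosen marginals are arbitrary illustrative values.

```python
import math
from itertools import product


def mean_field_quantities(energy, Q, n_labels):
    """Return (log Z, energy functional F, KL(Q || P)) by enumeration.

    energy : function mapping a labeling tuple to its energy
    Q      : list of per-variable marginals, Q[i][l]
    """
    configs = list(product(range(n_labels), repeat=len(Q)))
    z = sum(math.exp(-energy(c)) for c in configs)
    functional, kl = 0.0, 0.0
    for c in configs:
        q = math.prod(Q[i][label] for i, label in enumerate(c))
        if q == 0.0:
            continue  # zero-probability terms contribute nothing
        p = math.exp(-energy(c)) / z
        functional += q * (-energy(c) - math.log(q))  # E_Q[-E] + H(Q)
        kl += q * math.log(q / p)
    return math.log(z), functional, kl


def toy_energy(c):
    """Three binary variables on a chain with a Potts smoothness term."""
    unary = [(0.0, 1.0), (0.5, 0.2), (1.0, 0.0)]
    data = sum(unary[i][label] for i, label in enumerate(c))
    return data + sum(1.0 for a, b in zip(c, c[1:]) if a != b)


Q = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]
log_z, functional, kl = mean_field_quantities(toy_energy, Q, 2)
```

Because $\ln Z$ does not depend on $Q$, any change to the marginals that raises `functional` lowers `kl` by the same amount, which is what the mean field updates exploit.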
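The approach of [34] provides a filter-based method for performing fast approximate maximum posterior marginal (MPM) inference, that is, finding a solution that satisfies $y_i^{\mathrm{MPM}} \in \arg\max_l \sum_{\{y \mid y_i = l\}} P(y \mid x)$, in multilabel CRF models with fully connected pairwise terms, where the pairwise terms have the form of a weighted mixture of Gaussian kernels. We can express the fully connected pairwise CRF as
$$E(y \mid x) = \sum_{i} \psi_u(y_i) + \sum_{i < j} \psi_p(y_i, y_j),$$
where $E(y \mid x)$ is the energy associated with a configuration $y$ conditioned on $x$, and $\psi_u$ and $\psi_p$ are unary and pairwise potential functions, respectively. In [34], the pairwise potentials take the form of a weighted mixture of Gaussian kernels:
$$\psi_p(y_i, y_j) = \mu(y_i, y_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(f_i, f_j), \tag{30}$$
where $\mu$ is a label compatibility function, $k^{(m)}(\cdot, \cdot)$, $m = 1, \ldots, M$, are Gaussian kernels evaluated on the feature vectors $f_i$ and $f_j$ of pixels $i$ and $j$, and $w^{(m)}$, $m = 1, \ldots, M$, are the corresponding kernel weights. The iterative update equation is
$$Q_i(y_i = l) = \frac{1}{Z_i} \exp\Big\{ -\psi_u(l) - \sum_{l'} \mu(l, l') \sum_{m=1}^{M} w^{(m)} \sum_{j \neq i} k^{(m)}(f_i, f_j)\, Q_j(l') \Big\},$$
where $Z_i$ normalizes $Q_i$ so that it sums to one over the labels. The marginal $Q_i(y_i)$ that we need is found by minimizing a Lagrangian that consists of all terms in $D(Q \,\|\, P)$ plus Lagrange multipliers ensuring that the marginals $Q_i(y_i)$ are probability distributions. Setting the derivative of this Lagrangian with respect to $Q_i(l)$ to zero yields the update above. The inner sum over $j$ can be evaluated for all pixels at once by filtering:
$$\sum_{j \neq i} k^{(m)}(f_i, f_j)\, Q_j(l') = [k_m \otimes Q(l')]_i - Q_i(l'),$$
where $k_m$ is a Gaussian kernel corresponding to the $m$th component of (30), and $\otimes$ is the convolution operator. The following algorithm is the one used in [34].
Figure 8: Results of [1] on Leuven dataset. From (a-e): input image, ground truth, object labeling for [4] (using graph cut + range-moves for inference), object labeling, and stereo outputs from dense CRF with higher-order terms and extended cost-volume filtering [1].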
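Algorithm 1 (mean field in fully connected CRFs).
Initialize $Q_i(l) \leftarrow \frac{1}{Z_i} \exp\{-\psi_u(y_i = l)\}$ for all $i$ and $l$
while not converged do
  $\tilde{Q}_i^{(m)}(l) \leftarrow \sum_{j \neq i} k^{(m)}(f_i, f_j)\, Q_j(l)$ for all $m$ (message passing)
  $\hat{Q}_i(l) \leftarrow \sum_{l'} \mu(l, l') \sum_{m} w^{(m)} \tilde{Q}_i^{(m)}(l')$ (compatibility transform)
  $Q_i(l) \leftarrow \exp\{-\psi_u(y_i = l) - \hat{Q}_i(l)\}$ (local update)
  normalize $Q_i(l)$
end while
Here $f_i$ denotes the feature vector of pixel $i$, $k^{(m)}$ the Gaussian kernels, $w^{(m)}$ their weights, and $\mu$ the label compatibility function.
In [34], the permutohedral lattice [39] was used for the filter-based inference; the more recently proposed domain transform filtering approach [58] has certain advantages over the permutohedral lattice. Since the domain transform filtering approach does not subsample the original signal, its complexity is independent of the filter size, whereas with the permutohedral lattice the complexity and the filter size are inversely related. In [47], it was demonstrated that the domain transform approach achieves even faster inference times than the permutohedral lattice for accurate object/stereo labeling. On the basis of [34,47], the mean field approximation was further applied to inference in models with higher-order terms.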
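Algorithm 1 can be sketched naively, without any fast filtering, to make the three steps of each iteration explicit. The version below is a simplified illustration under several assumptions: a single Gaussian kernel with bandwidth `theta` and weight `weight`, a Potts label compatibility function, a fixed number of parallel updates instead of a convergence test, and direct $O(N^2)$ summation in place of the permutohedral lattice or domain transform. The function name and parameters are hypothetical.

```python
import math


def mean_field(unary, feats, weight=1.0, theta=1.0, n_iters=5):
    """Naive mean field inference in a fully connected pairwise CRF.

    unary : unary[i][l] is the cost of label l at pixel i
    feats : feats[i] is the feature vector of pixel i
    Returns the approximate marginals Q[i][l].
    """
    n, n_labels = len(unary), len(unary[0])

    def normalize(scores):
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def kernel(fi, fj):
        squared = sum((a - b) ** 2 for a, b in zip(fi, fj))
        return weight * math.exp(-squared / (2.0 * theta ** 2))

    # Precompute kernel values; the diagonal is zero because j != i.
    K = [[0.0 if i == j else kernel(feats[i], feats[j]) for j in range(n)]
         for i in range(n)]
    Q = [normalize([-u for u in unary[i]]) for i in range(n)]

    for _ in range(n_iters):
        updated = []
        for i in range(n):
            # Message passing: kernel-weighted sum of the other marginals.
            msg = [sum(K[i][j] * Q[j][l] for j in range(n))
                   for l in range(n_labels)]
            # Compatibility transform (Potts): pay for every other label.
            total = sum(msg)
            pairwise = [total - msg[l] for l in range(n_labels)]
            # Local update followed by normalization.
            updated.append(normalize([-unary[i][l] - pairwise[l]
                                      for l in range(n_labels)]))
        Q = updated
    return Q


# Five pixels on a line; the middle one weakly prefers label 1.
unary = [[0.0, 2.0], [0.0, 2.0], [0.6, 0.4], [0.0, 2.0], [0.0, 2.0]]
feats = [[0.0], [1.0], [2.0], [3.0], [4.0]]
Q = mean_field(unary, feats)
map_labels = [max(range(2), key=lambda l: q[l]) for q in Q]
```

In this toy example the middle pixel is pulled toward the label of its neighbors, which is the smoothing effect of the fully connected pairwise term. The filter-based methods discussed above replace the double loop over pixels with approximate high-dimensional Gaussian filtering.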
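In [47] the pattern-based potentials $\psi_c^{\mathrm{pat}}(y_c)$, which are described in Section 2, were added to the energy function; the required expectation for the mean field updates (39) can be calculated as
$$\sum_{\{y_c \mid y_i = l\}} Q_{c \setminus i}(y_{c \setminus i})\, \psi_c^{\mathrm{pat}}(y_c) = \sum_{p \in P_{c \mid i = l}} \Big( \prod_{j \in c,\, j \neq i} Q_j(y_j = p_j) \Big) \gamma_p + \gamma_{\max} \Big( 1 - \sum_{p \in P_{c \mid i = l}} \prod_{j \in c,\, j \neq i} Q_j(y_j = p_j) \Big),$$
where $P_{c \mid i = l}$ is the subset of patterns in $P_c$ for which $p_i = l$.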
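The expectation of a pattern-based potential under a factorized $Q$ only needs a loop over the recognized patterns: each pattern that agrees with the fixed label contributes its cost times its probability, and all remaining probability mass is charged the common cost. The sketch below checks this closed form against brute-force enumeration; the function names, the clique, and the numbers are illustrative assumptions.

```python
from itertools import product


def pattern_expectation(Q, clique, i, label, patterns, gamma_max):
    """Expected pattern cost over `clique` given that site i takes `label`.

    Q        : Q[j][l], factorized marginals
    patterns : dict mapping label tuples (ordered as `clique`) to costs
    """
    pos = clique.index(i)
    expected, pattern_mass = 0.0, 0.0
    for pattern, cost in patterns.items():
        if pattern[pos] != label:
            continue
        prob = 1.0
        for k, j in enumerate(clique):
            if j != i:
                prob *= Q[j][pattern[k]]
        expected += prob * cost
        pattern_mass += prob
    # Everything that is not a recognized pattern pays gamma_max.
    return expected + gamma_max * (1.0 - pattern_mass)


def brute_force_expectation(Q, clique, i, label, patterns, gamma_max, n_labels):
    """Same expectation computed by enumerating every clique configuration."""
    pos = clique.index(i)
    expected = 0.0
    for config in product(range(n_labels), repeat=len(clique)):
        if config[pos] != label:
            continue
        prob = 1.0
        for k, j in enumerate(clique):
            if j != i:
                prob *= Q[j][config[k]]
        expected += prob * patterns.get(config, gamma_max)
    return expected


Q = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
clique = (0, 1, 2)
patterns = {(0, 0, 0): 0.0, (1, 1, 1): 0.5, (0, 1, 0): 1.0}
value = pattern_expectation(Q, clique, 0, 0, patterns, gamma_max=2.0)
```

The closed form costs time proportional to the number of patterns times the clique size, whereas the enumeration grows exponentially with the clique size, which is what keeps the higher-order updates tractable.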
A particular case of the pattern-based potential is the $P^n$-Potts model, for which the required expectations can be expressed as
$$\sum_{\{y_c \mid y_i = l\}} Q_{c \setminus i}(y_{c \setminus i})\, \psi_c^{\mathrm{Potts}}(y_c) = \gamma_l \prod_{j \in c,\, j \neq i} Q_j(y_j = l) + \gamma_{\max} \Big( 1 - \prod_{j \in c,\, j \neq i} Q_j(y_j = l) \Big).$$
The paper [1] also added co-occurrence potentials (see [47] for more details), which are defined over the entire image clique, and tested the approach on object class segmentation. As a result, they showed substantial improvements in inference speed with respect to graph cut based methods, particularly when using recent domain transform filtering techniques, while also observing similar or better accuracies. Figures 8 and 9 show the results of [1] in both stereo and image labeling. All the experiments in [1] were run on an Intel Xeon 3.33 GHz processor, and the number of full mean field update iterations was fixed to 5 for all models.
In Figure 8, [1] applied their model to the Leuven dataset, which consists of stereo images of street scenes, with ground truth labeling for 7 object classes and manually annotated ground truth stereo labeling quantized into 100 disparity labels. In their model they used JointBoost classifier responses to form the object unary potentials. A truncated $\ell_2$-norm of the intensity differences is used to form the disparity potentials. For the densely connected pairwise terms, identical kernels and weightings were used, with an Ising model for the label compatibility function. For the $P^n$-Potts potentials, $\gamma_l = 0$ was set for all $l = 1, \ldots, L$, and $\gamma_{\max}$ was set by cross-validation. Figure 9 shows the results of [1] on the PascalVOC-10 dataset.
Table 1: Quantitative comparison on Leuven dataset of [1].The table compares the average time per image and performance (object and stereo labeling accuracy) of joint object and stereo algorithms, using graph cut + range-move (GC + Range ()), an extension of cost-volume filtering, and [1]'s dense CRF with higher-order terms and filter-based inference (with and without cost-volume filtered unary, and using different approaches).HO means higher-order terms of [1]   From Figure 8 and Table 1, we note that the densely connected CRF with higher-order terms (Dense + HO) achieves comparable accuracies to [4], and that the use of domain transform filtering methods [58] permits an extra speed-up, with inference being almost 12 times faster than the least accurate setting of [4] and over 35 times faster than the most accurate.The Dense + HO + CostVol approach achieves the best overall stereo accuracies.Although the improved stereo performance appears to generate a small decrease in the object labeling accuracy in [1]'s full model, the former remains at an almost saturated level.
Figure 9 and Table 2 compare timing and performance of [1]'s approach (final 2 lines) against two baseline.The importance of higher-order information is confirmed by the better performance of all algorithms compared to the basic dense CRF of [34].Further, the filter-based inference is able to improve substantially on the inference time and classaverage performance of the AHCRF [40], with   -Potts and cooccurrence potentials each giving notable gains.
Although the mean field algorithm is a simple approximation method, it still has several limitations. As mentioned in [9], the first limitation is that the mean field approximation assumes complete factorization over the individual variables. As a result, although the simplified model makes learning and inference efficient and tractable, mean field inference methods are usually sensitive to initialization. Another limitation relates to the form of the pairwise weights in (30), which are a linear combination of Gaussian kernels. In fact, they allow each Gaussian component to take only zero mean and use the same combination of Gaussian kernels for each label pair. Although these are improved in [9], they still lead to unsatisfactory results. Therefore, in the future, we hope to find other methods that offer not only fast inference but also high accuracy.
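The two limitations above can be seen in a small sketch of mean field inference for a fully connected CRF. This is an illustrative example under stated assumptions, not the filter-based implementation of [1]: it builds the dense kernel matrix explicitly (quadratic in the number of pixels), assumes a Potts label compatibility, and uses synchronous updates. All function names and the toy data are hypothetical. The factorized distribution is the per-pixel array `q`, the pairwise weights are a weighted sum of zero-mean Gaussian kernels shared by every label pair, and the optional `q_init` argument exposes the dependence on initialization.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def gaussian_kernel(feats, bandwidth):
    """Zero-mean Gaussian kernel on feature differences; self-pairs are excluded."""
    diff = feats[:, None, :] - feats[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    k = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(k, 0.0)
    return k


def mean_field(unary, kernels, weights, n_iters=10, q_init=None):
    """Naive mean field for a dense CRF with Potts compatibility.

    unary:   (N, L) array of unary costs (lower is better).
    kernels: list of (N, N) kernel matrices.
    weights: list of scalar weights, one per kernel.
    Returns the (N, L) factorized marginals Q_i(l).
    """
    # Pairwise weights: one linear combination of Gaussian kernels,
    # shared by all label pairs.
    combined = sum(w * k for w, k in zip(weights, kernels))
    q = softmax(-unary) if q_init is None else np.array(q_init, dtype=float)
    for _ in range(n_iters):
        msg = combined @ q  # msg[i, l] = sum_j K_ij * Q_j(l)
        # Potts compatibility: pay for neighbor mass on every other label.
        pairwise = msg.sum(axis=1, keepdims=True) - msg
        q = softmax(-unary - pairwise)
    return q


# Toy 1-D "image": six pixels, two labels. Pixel 2 weakly prefers label 1,
# all other pixels strongly prefer label 0.
positions = np.arange(6, dtype=float).reshape(-1, 1)
unary = np.array([[0.0, 2.0]] * 6)
unary[2] = [0.6, 0.4]

q = mean_field(unary, [gaussian_kernel(positions, 1.0)], [1.0])
labels = q.argmax(axis=1)
```

In this toy case the unary term alone would give pixel 2 the label 1, while the dense pairwise term pulls it to agree with its neighbors. Passing a different `q_init` is a simple way to probe how the fixed point depends on the starting distribution.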

Conclusion
CRFs have recently been accepted as one of the popular approaches for solving the image labeling problem in computer vision and image analysis. An important issue in CRF models is to develop an efficient inference algorithm that finds the most appropriate labels, especially when considering the global information of an image.
In this paper we review the research development and status of object recognition with CRFs, especially the two main discrete optimization methods for image labeling with CRFs: graph cut and mean field approximation. We describe graph cut briefly and introduce mean field approximation in more detail, since it offers substantially faster inference and has been popular in recent years. Compared to the graph cut method, mean field inference improves speed substantially because of its simplified model.
In applications of the image labeling problem in computer vision, one typical problem is that there are too many nodes. For example, for an image of size $M \times N$, supposing each node takes $L$ possible labels, the computation space is $L^{M \times N}$. Thus the computation space expands exponentially as the image size grows. Clearly, the inference algorithm plays a very important role in these problems. Another key issue is to construct reasonable CRF models, as Section 2 introduces. Learning the parameters of a CRF model efficiently from images, rather than choosing them manually or empirically, is also an important issue, though it is not the focus of this paper.
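As a rough illustration of the exponential growth described above, the sketch below counts the joint labelings of an image, assuming every pixel independently takes one of a fixed number of labels. The function names and the example sizes are hypothetical and chosen only for illustration.

```python
import math


def num_labelings(height: int, width: int, num_labels: int) -> int:
    """Number of distinct joint labelings of a height x width image."""
    return num_labels ** (height * width)


def log10_num_labelings(height: int, width: int, num_labels: int) -> float:
    """Base-10 logarithm of the count, usable for realistic image sizes."""
    return height * width * math.log10(num_labels)


tiny = num_labelings(2, 2, 2)    # 2 labels on a 2 x 2 image
small = num_labelings(3, 3, 3)   # 3 labels on a 3 x 3 image
# Number of decimal digits needed for a 320 x 240 image with 21 labels.
digits = log10_num_labelings(240, 320, 21)
```

Even the 3 x 3 case already has 19683 labelings, which is why exhaustive search is impractical and approximate inference is needed.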
Nowadays, many tasks in computer vision and image analysis can be formulated as a labeling problem in which the correct label has to be assigned to each pixel or clique. However, training remains computationally expensive because inference must be performed repeatedly during the training process. In the future, we hope to improve the accuracy of mean field inference for image labeling while maintaining its efficiency. Solving these problems will greatly influence technologies such as driverless cars. In addition, with the development of depth-capturing devices such as Kinect, the depth information of an image can be obtained as easily as color features. It is therefore worth considering combining these properties with CRF models and efficient inference approaches for image labeling and stereo reconstruction in 3-dimensional space. Moreover, applying these theories to facial action labeling research may be another direction.

Figure 1: An example of image labeling. An image in (a) is a set of pixels $\mathcal{P}$ with observed intensities $I_p$ for each $p \in \mathcal{P}$. A labeling $L$ shown in (b) assigns some label $L_p \in \{0, 1, 2\}$ to each pixel $p \in \mathcal{P}$. Such labels can represent depth (in stereo), object index (in segmentation), original intensity (in image restoration), or other pixel properties. Thick lines in (b) show labeling discontinuities between neighboring pixels [5].
in 2001. In their work they present iterative parameter estimation algorithms for Conditional Random Fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data. The CRF model was brought to image labeling by Shotton et al., Peng and McCallum, and Kristjansson et al. [20-22].

Figure 3: An example of a 5th-order neighbor system.

Figure 5: (a, b) Two small image patches that are difficult to label based on local information. (c, d) Images containing the patches. We will usually make mistakes in such classification problems if we use only the local information, because the color and texture features in these two patches are too similar. However, as shown in (c, d), the global context makes it clear what the patches are ((a, c) water; (b, d) sky) [43].

Figure 6: An example of min-cut graph cut. The circles represent the pixels, and the lines, including curves, represent the edges between nodes, including $n$-links and $t$-links. The dotted line indicates a cut of graph partition [3].
Figure 4: The model of CRF in image labeling. $y_i$ represents the label of the $i$th pixel, and $x_i$ is the features of the corresponding pixel, such as color and texture. The red lines in this figure only connect neighboring pixels, which means each random variable $y_i$ obeys the Markov property.