A Stochastic-Variational Model for Soft Mumford-Shah Segmentation

In contemporary image and vision analysis, stochastic approaches demonstrate great flexibility in representing and modeling complex phenomena, while variational-PDE methods enjoy enormous computational advantages over Monte Carlo or other stochastic algorithms. In combination, the two can lead to much more powerful novel models and efficient algorithms. In the current work, we propose a stochastic-variational model for soft (or fuzzy) Mumford-Shah segmentation of mixture image patterns. Unlike the classical hard Mumford-Shah segmentation, the new model allows each pixel to belong to each image pattern with some probability. Soft segmentation can lead to hard segmentation, and is hence more general. The modeling procedure, mathematical analysis of the existence of optimal solutions, and computational implementation of the new model are explored in detail, and numerical examples on both synthetic and natural images are presented.


INTRODUCTION: SOFT VERSUS HARD SEGMENTATION
Segmentation is the key step towards high-level vision modeling and analysis, including object characterization, detection, and classification. There have been some recent developments indicating that certain high-level visual tasks such as global scene interpretation might be able to bypass segmentation [1,2]. Nevertheless, segmentation still remains perhaps the most important and inspiring task to date in low- or middle-level vision analysis and image processing. The segmentation problem can be formulated as follows. Given an image I ∈ L²(Ω) on a 2-dimensional (2D) domain Ω (assumed to be bounded, smooth, and open), one seeks a closed "edge set" Γ and all the connected components Ω_1, ..., Ω_K of Ω \ Γ, such that by some suitable visual measure (e.g., textural or photometric), the image I is discontinuous along Γ while smooth or homogeneous on each segment Ω_i. Each image patch I_i = I|_{Ω_i} is also called a pattern, and Ω_i is its support.
We will call this most common practice "hard" segmentation. A hard segmentation partitions the image domain Ω along a definitive edge set Γ, and outputs nonoverlapping pattern supports Ω_1, ..., Ω_K.
The present work introduces the notion of "soft" segmentation. Mathematically, a hard segmentation amounts to a partition of unity by indicator functions:

1 ≡ Σ_{i=1}^K 1_{Ω_i}(x), a.e. (in Lebesgue) x = (x_1, x_2) ∈ Ω. (1)

A soft segmentation seeks out instead a softer partition of unity:

1 ≡ Σ_{i=1}^K p_i(x), a.e. x ∈ Ω, (2)

where the p_i's are continuous or smoother functions. Formally, each p_i could be considered as a mollified version of 1_{Ω_i}(x).
In the stochastic literature of image analysis and modeling, the above notion of soft segmentation is closely connected to mixture image models (e.g., [3]). Suppose a given image I is composed of K unknown patterns, so that at each pixel x the observed value I(x) is a sample from pattern ω(x), where ω denotes the pattern label variable. At each pixel x ∈ Ω, ω(x) ∈ {1, ..., K} becomes a random variable. Then the p_i's in (2) carry the natural stochastic interpretation

p_i(x) = Prob(ω(x) = i), x ∈ Ω, i = 1 : K.

For this reason, each p_i will be called the ownership of pattern i, following Jepson and Black [3]. (Some authors also prefer to call it the membership [4].) Instead of the exclusive ownership in a hard segmentation, a soft one allows each pattern to "own" a pixel with some likelihood. Soft segmentation is more general, since it leads to a natural hard segmentation under the maximum-likelihood (ML) principle. Given a soft segmentation {p_i(x) : i = 1 : K}, one can define for each pixel x ∈ Ω its unique owner ω*(x) by

ω*(x) = arg max_{ω ∈ 1:K} p_ω(x), (5)

and if the maxima are nonunique, accept the largest index from the arg max pool. The segments are then defined by

Ω_i = {x ∈ Ω : ω*(x) = i}, i = 1 : K, (6)

which leads to a natural hard segmentation. Formulae (5) and (6) are called the hardening formulae.

Figure 1: Natural images often do not have clear-cut "hard" boundaries between different patterns. Along the arrow, for example, one only observes that the sand pattern gradually becomes a grass pattern. Such a "soft" view is the stochastic view on the segmentation problem.

Soft segmentation has been motivated by practical analysis of natural images. Patterns in natural scenes often do not have clear-cut boundaries. In Figure 1, for example, there does not seem to exist a "hard" boundary between the grass and sand areas. If one draws an oriented line as shown in the figure, it makes more sense to state that along the arrow, the pattern transits from being "more" sand-like to being "more" grass-like. Such consideration favors the stochastic view that along the arrow, the ownership Prob(ω(x) = grass) increases, while Prob(ω(x) = sand) decreases.
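As a quick illustration, the hardening formulae (5) and (6) take only a few lines of NumPy. The 3-channel ownership field below is hypothetical, and ties are broken toward the largest index, as prescribed above:

```python
import numpy as np

# Hypothetical 3-channel ownership field on a 2x2 image, shape (K, H, W).
P = np.array([
    [[0.6, 0.2], [1/3, 0.1]],   # p_1
    [[0.3, 0.5], [1/3, 0.2]],   # p_2
    [[0.1, 0.3], [1/3, 0.7]],   # p_3
])

def harden(P):
    """Hardening formulae (5)-(6): owner(x) = arg max_i p_i(x);
    ties go to the LARGEST index, per the convention in the text."""
    K = P.shape[0]
    # np.argmax returns the first (smallest) index on ties; reversing the
    # channel axis makes it return the largest original index instead.
    owner = K - 1 - np.argmax(P[::-1], axis=0)
    segments = [owner == i for i in range(K)]   # Omega_i = {x : owner(x) = i}
    return owner, segments

owner, segments = harden(P)
assert owner[1, 0] == 2   # the all-1/3 tie pixel goes to the largest index
```

The pixel at position (1, 0) has p_1 = p_2 = p_3 = 1/3, so the tie-breaking rule assigns it to the last pattern.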
In the present work, we propose a new stochastic-variational soft segmentation counterpart of the following celebrated Mumford-Shah model [5,6]:

E_ms[u, Γ | I] = (λ/2) ∫_Ω (u − I)² + (α/2) ∫_{Ω\Γ} |∇u|² + γ H¹(Γ), (8)

for suitable weights λ, α, γ > 0, where H¹ stands for the 1D Hausdorff measure [7], which is simply the length when Γ is regular enough. For notational conciseness, the default area-element symbol dx = dx_1 dx_2 will be omitted in most integral formulae.
As stated in the abstract, the stochastic softness induces more flexibility and universality in modeling, while the variational-PDE approach facilitates rigorous mathematical analysis as well as more efficient computational implementations compared with purely stochastic approaches including, for example, the Monte Carlo method or Gibbs' sampling [8][9][10][11].
The paper has been organized as follows. Section 2 builds up the soft Mumford-Shah (SMS) model under the Bayesian rationale and the MAP estimator [12,13], which are the formal stochastic foundations of the present model. In Section 3, the prior energy on the ownerships p i 's is developed based on the celebrated work of Modica and Mortola [14] on phase-field modeling and Γ-convergence approximation in material sciences and phase transitions. In Section 4, we analyze the main mathematical properties of the proposed SMS model, including the admissible space, hidden symmetry and symmetry breaking via weak supervision, and the existence theorems. In Section 5, we then derive the system of Euler-Lagrange equations of the SMS model for which the role of the probability simplex constraint is discussed in detail. Section 5 also introduces the alternating-minimization algorithm to compute the Euler-Lagrange equations. Finally, the numerical performance of the SMS model is demonstrated in Section 6 via both synthetic and natural test images that are sufficiently representative and generic.
Throughout the manuscript, the notation F[X, Y | Z] in the deterministic setting always denotes a quantity (often a functional, or an energy) F that depends on X, Y, and Z, but with Z given or fixed. Similarly, F[X | Y, Z] is understood up to an additive term C(Y, Z) that is often unimportant as far as the optimization on X (given Y and Z) is concerned. These notations have been inspired by conditional probabilities in the stochastic setting (formally under the Gibbs' correspondence E[·] = −log Prob(·)).

Bayesian rationale
Segmentation can be done in some feature spaces such as gradient-like highpass features or Gabor features (e.g., [11,15,16]). The Mumford-Shah model easily extends to such general features (e.g., [15]), even though it was originally formulated only for intensity fields. For maximal clarity in exposing the core ideas of the current work, we will also focus only on the latter, while leaving it as a canonical exercise to adapt the new model to any given feature distribution.
Let K be the total number of intended patterns. As in [10,11], K could also be treated as an unknown to be optimally estimated, which however does not add much to the most significant contribution (i.e., the modeling and computation of the "softening" procedure) of the present work.
Given an image input I = I(x) on a bounded, regular, and open domain Ω, the primary goal of soft segmentation is to compute the ownerships

p_i(x) = Prob(ω(x) = i), x ∈ Ω, i = 1 : K.

Define P(x) = (p_1(x), p_2(x), ..., p_K(x)), and

Δ^{K−1} = {p_1 e_1 + ··· + p_K e_K : p_i ≥ 0, Σ_{i=1}^K p_i = 1},

where (e_i | i = 1 : K) denotes the canonical Cartesian basis of R^K. Δ^{K−1} is often called the canonical (K−1)-simplex, or the probability simplex in R^K. Then

P(x) ∈ Δ^{K−1}, a.e. x ∈ Ω,

meaning that the total ownerships always add up to 100% at any pixel x ∈ Ω. Associated with each pattern label ω = i is a smooth function u_i(x) ∈ H¹(Ω), similar to the original Mumford-Shah model. Here the Sobolev space H¹(Ω) is defined by [17]

H¹(Ω) = {u ∈ L²(Ω) : ∇u ∈ L²(Ω, R²)}.

Define U(x) = (u_1(x), u_2(x), ..., u_K(x)). Then the goal of soft segmentation is to estimate the optimal vectorial pair of ownerships and patterns given an image I: Prob(P, U | I).
By the Bayesian formula [12,13], the posterior given I is expressible via

Prob(P, U | I) = Prob(I | P, U) Prob(P) Prob(U) / Prob(I), (14)

assuming that the mixture patterns U and the mixture rules P are independent (as two vectorial random fields). We will call the first term a "mixture generation" model, since it reveals how the image data should look given the information of the patterns and their ownerships. By taking the logarithmic likelihood E[·] = −log Prob(·), or formally the Gibbs' energy in statistical mechanics [18,19], one attains the soft segmentation model in its "energy" form:

E[P, U | I] = E[I | P, U] + E[P] + E[U], (15)

up to the additive constant E[I]. Assuming that all the pattern channels are independent of each other, one has

E[U] = Σ_{i=1}^K E[u_i].

That is, we assume that U as a random vector field has independent scalar components. This has been motivated by the facts that 2D images are the optical projections of 3D scenes and that different objects in 3D are independently positioned at different ranges or depths. For Sobolev-regular patterns, that is, functions whose gradients are square integrable, one may impose the homogeneous Sobolev energies:

E[u_i] = (α/2) ∫_Ω |∇u_i|², i = 1 : K, (17)

for some scalar weight α that models the visual sensitivity to intensity roughness. Unlike the original Mumford-Shah model, the energy for each channel has been defined on the entire image domain Ω instead of on each "hard-cut" patch Ω_i. Thus the energy form (17) must carry out extrapolation in practical applications. Long-range extrapolations are, however, often unimportant after being weighed down by their negligible ownerships p_i's.

Gaussian mixture with smooth mean fields
In this section we discuss the mixture generation model Prob(I | P, U). Assume that the patterns are all Gaussian with mean fields u_1, u_2, ..., u_K. For simplicity, also assume that they share the same variance σ² (which readily generalizes to the case of varying variances). Then at any given pixel x ∈ Ω,

I(x) | (ω(x) = i, u_i) ∼ N(u_i(x), σ²).

Define the Gaussian probability density function (pdf)

G_σ(t) = (1/√(2πσ²)) exp(−t²/(2σ²)).

The pdf of the mixture image I at any pixel x is then given by

g(I(x) | P(x), U(x)) = Σ_{i=1}^K p_i(x) G_σ(I(x) − u_i(x)).

Thus ideally the "energy" for the mixture generation model should be given by

E[I | P, U] = −μ ∫_Ω log( Σ_{i=1}^K p_i G_σ(I − u_i) ), for some μ > 0, (21)

provided that, given the two fields P and U on Ω, the image values are conditionally independent across pixels, that is, for any two disjoint and finite sets of pixels X and Y, the restrictions I|_X and I|_Y are independent given P and U. (We also must emphasize that the above derivation should be considered motivational rather than rigorous, due to the continuum setting.) In the current work, we will adopt a reduced form of the complex formula (21), which is simpler and easier to manage both in theory and in computation. Assume that each soft ownership p_i(x) stays close to either 0 or 1 most of the time, as encouraged by the double-well prior of the next section. Then, pixelwise,

−log Σ_i p_i G_σ(I − u_i) ≈ Σ_i p_i (I − u_i)² / (2σ²) + const,

where the additive constant only depends on σ and K. This suggests the following convenient energy form for the mixture generation model:

E[I | P, U] = (λ/2) Σ_{i=1}^K ∫_Ω p_i (I − u_i)², (24)

which amounts to a weighted least-square energy [20]. The weight λ reflects visual sensitivity to synthesis errors. In combination with (15), (17), and (24), the new soft segmentation model takes the form of minimizing

E[P, U | I] = (λ/2) Σ_i ∫_Ω p_i (u_i − I)² + (α/2) Σ_i ∫_Ω |∇u_i|² + E[P].

Notice that here the ownership distribution P "softens" the "hard" segmentation boundary Γ in the original Mumford-Shah model (8). To complete the modeling process, it suffices to properly define the prior or regularity energy E[P], which is the main task of the next section.
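To make the reduced generation energy concrete, here is a small numerical sketch of the weighted least-square form (24); the 1D "image", patterns, and ownerships are all made up for illustration. Correctly assigned ownerships score a lower synthesis energy than swapped ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D "image" sampled from a 2-pattern mixture (illustrative values only).
n = 100
I = np.where(np.arange(n) < n // 2, 1.0, 3.0) + 0.05 * rng.standard_normal(n)

# Candidate patterns and ownerships (K = 2), chosen by hand for illustration.
U = np.stack([np.full(n, 1.0), np.full(n, 3.0)])           # u_1, u_2
P_good = np.stack([np.arange(n) < n // 2,
                   np.arange(n) >= n // 2]).astype(float)  # correct ownerships
P_bad = P_good[::-1]                                       # channels swapped vs. U

def generation_energy(P, U, I, lam=1.0, h=1.0):
    """Discrete version of (24): (lambda/2) * sum_i sum_x p_i (u_i - I)^2 * h."""
    return 0.5 * lam * np.sum(P * (U - I) ** 2) * h

assert generation_energy(P_good, U, I) < generation_energy(P_bad, U, I)
```

The weighting by p_i means each channel only pays for synthesis errors on pixels it actually owns, which is exactly the softening of the hard-cut patches described above.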

MODICA-MORTOLA'S PHASE-FIELD MODEL FOR OWNERSHIP ENERGY
To generalize but not to deviate too far from classical hard segmentation, it is natural to impose the following two constraints: (a) each pattern ownership p i (x) has almost only two phases: on (corresponding to p i = 1) and off (to p i = 0), and the transition band in between is narrow; (b) the soft boundaries, or equivalently the transition bands, are regular, instead of being zigzag.
In combination, one imposes the following Modica-Mortola type of energy with a double-well potential [14]:

E_ε[p_i] = ∫_Ω ε |∇p_i|² + (1/ε) p_i²(1 − p_i)², (26)

where 0 < ε ≪ 1 controls the transition bandwidth. Since ε ≪ 1, the second term necessarily demands p_i ≈ 0 or 1 to lower the energy, which resonates well with the expectation in (a). The first term, weighted by the small parameter ε, amounts to a regularity condition on each p_i, which meets the requirement in (b).
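The two effects (a) and (b) can be checked numerically on a 1D grid. The discretization below is only a sketch, comparing a hypothetical two-phase profile with a narrow sigmoidal transition against a profile stuck at the unstable middle value p = 1/2:

```python
import numpy as np

def mm_energy(p, eps, h):
    """1D discretization of (26): sum of eps*|p'|^2 + (1/eps)*p^2(1-p)^2."""
    dp = np.diff(p) / h
    grad_term = eps * np.sum(dp ** 2) * h
    well_term = (1.0 / eps) * np.sum(p ** 2 * (1 - p) ** 2) * h
    return grad_term + well_term

n = 400
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
eps = 0.02

# A two-phase profile with a narrow sigmoidal transition at x = 0.5 ...
p_sharp = 1.0 / (1.0 + np.exp(-(x - 0.5) / (3 * eps)))
# ... versus a profile that never leaves the top of the double well.
p_flat = np.full(n, 0.5)

assert mm_energy(p_sharp, eps, h) < mm_energy(p_flat, eps, h)
```

The polarized profile pays a small, O(1) price concentrated in its transition band, while the undecided profile pays O(1/ε) everywhere, so polarization is strongly favored as ε → 0.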
Energies in the form of (26) are very common in material sciences, including the theories of liquid crystals and phase transitions [21,22]. Mathematically, they have been well studied in the framework of Γ-convergence [23], of which we now give a brief introduction in the present context. We also refer the reader to the works of Ambrosio and Tortorelli [24,25] on the Γ-convergence approximation to the classical Mumford-Shah segmentation model.
Recall that for any q(x) ∈ L¹(Ω), its total variation as a Radon measure is defined by [7,26,27]

∫_Ω |Dq| = sup { ∫_Ω q ∇·g : g ∈ C_c¹(Ω, R²), g(x) ∈ B₂ for all x ∈ Ω },

where B₂ stands for the unit disk centered at the origin in R².
(The TV measure was first introduced into image processing by Rudin et al. [28].) Define, for any q ∈ L¹(Ω), the candidate limit energy

E₀[q] = (1/3) ∫_Ω |Dq| if q(x) ∈ {0, 1} a.e., and E₀[q] = +∞ otherwise.

As a result, a finite energy E₀[q] necessarily implies that q has two phases only, and the two-phase class

BV(Ω, {0, 1}) = {q ∈ BV(Ω) : q(x) ∈ {0, 1} a.e.}

is taken to be a subspace of L¹(Ω) (as a metric space). Then Modica and Mortola's well-known results in [14] readily lead to the following theorem.

Theorem 1 (Γ-convergence). As ε → 0, E_ε Γ-converges to E₀ in L¹(Ω): (i) for any family q_ε → q in L¹(Ω), lim inf_{ε→0} E_ε[q_ε] ≥ E₀[q]; and (ii) for any q ∈ L¹(Ω), there exists a family q*_ε → q in L¹(Ω) such that lim sup_{ε→0} E_ε[q*_ε] ≤ E₀[q].
We refer the reader to Modica and Mortola [14] for a proof (with some necessary modification). Here we only point out that the "tight" sequence (q*_ε | ε) in (ii) can be constructed using a smooth sigmoid transition across the hard boundary of a given two-phase function q. Recall, as in the theory of neural networks [29], that a sigmoid transition between 0 and 1 is achieved by

σ(t) = 1 / (1 + e^{−t}).

The scaling parameter ε participates in the transition in the form of σ(t/(3ε)). In particular, ε indeed corresponds to the width of the transition band when t is a distance function. This theorem reveals the close connection of the particular choice of E_ε[p_i] in (26) with the original Mumford-Shah model.

Proposition 1.
Suppose that the p_ε's "optimally" (i.e., via the above sigmoidal transitions) converge to a given 2-phase pattern q = 1_Σ, where Σ ⊂ Ω has a regular boundary Γ = ∂Σ ∩ Ω. Then

lim_{ε→0} E_ε[p_ε] = (1/3) H¹(Γ).

Similar results have appeared in the earlier influential works of Ambrosio and Tortorelli [24,25] on the Γ-convergence approximation to the Mumford-Shah model. The technique has also been extensively applied in image computation and modeling [30-35] to overcome the difficulty in representing and computing the free boundary Γ.
To summarize this section, we propose the following energy model for the ownership distribution P(x) = (p_1(x), p_2(x), ..., p_K(x)):

E_ε[P] = Σ_{i=1}^K E_ε[p_i] = Σ_{i=1}^K ∫_Ω ε |∇p_i|² + (1/ε) p_i²(1 − p_i)². (35)

One must realize, however, that the different ownerships are not decoupled by this energy, though it may appear so. The energy E_ε[P] must be coupled with the constraint of the probability simplex:

P(x) ∈ Δ^{K−1}, a.e. x ∈ Ω.

In particular, for small ε, although (35) implies that each ownership p_i tends to polarize to 0 or 1 independently, the channels have to cooperate with each other under the above simplex constraint to optimally share the ownerships.

The model and admission space
Combining the preceding two sections, we have developed the complete formulation of soft Mumford-Shah segmentation with K patterns, that is, to minimize

E[P, U | I] = (λ/2) Σ_{i=1}^K ∫_Ω p_i (u_i − I)² + (α/2) Σ_{i=1}^K ∫_Ω |∇u_i|² + Σ_{i=1}^K ∫_Ω ε |∇p_i|² + (1/ε) p_i²(1 − p_i)², (37)

with the constraint that

P(x) ∈ Δ^{K−1}, a.e. x ∈ Ω,

that is, p_i ≥ 0, i = 1 : K, and Σ_{i=1}^K p_i = 1. As discussed previously, it is this simplex constraint that induces coupling among the different channels in the seemingly decoupled model (37).
Besides the simplex constraint, the last term in the energy (37) requires p_i ∈ H¹(Ω) for i = 1 : K. Similarly, the second term requires each pattern u_i ∈ H¹(Ω). Then, under the assumption that the given image I ∈ L²(Ω), E[P, U | I] is well defined and finite for any admissible patterns U and pattern ownership distribution P in

adm_K = {(P, U) : p_i, u_i ∈ H¹(Ω), i = 1 : K; P(x) ∈ Δ^{K−1}, a.e. x ∈ Ω}. (40)
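For concreteness, a direct 1D discretization of the energy (37) might look as follows; the grid spacing, forward differences, and the hand-made two-phase test configuration are all illustrative choices, not part of the model:

```python
import numpy as np

def grad_sq_1d(f, h):
    # squared forward-difference gradient, summed over the grid
    return np.sum((np.diff(f) / h) ** 2) * h

def sms_energy(P, U, I, lam, alpha, eps, h):
    """Discrete 1D version of the full energy (37):
       (lam/2)   sum_i int p_i (u_i - I)^2
     + (alpha/2) sum_i int |grad u_i|^2
     +           sum_i int eps |grad p_i|^2 + (1/eps) p_i^2 (1 - p_i)^2."""
    fid = 0.5 * lam * np.sum(P * (U - I) ** 2) * h
    sob = 0.5 * alpha * sum(grad_sq_1d(u, h) for u in U)
    own = sum(eps * grad_sq_1d(p, h) + (1 / eps) * np.sum(p**2 * (1 - p)**2) * h
              for p in P)
    return fid + sob + own

# Sanity check on a hand-made 2-phase configuration (illustrative only).
n = 200
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
I = np.where(x < 0.5, 0.0, 2.0)
U = np.stack([np.zeros(n), np.full(n, 2.0)])
P = np.stack([1 / (1 + np.exp((x - 0.5) / 0.03)),
              1 / (1 + np.exp(-(x - 0.5) / 0.03))])   # rows sum to 1
E = sms_energy(P, U, I, lam=1.0, alpha=1.0, eps=0.01, h=h)
```

Pairing the ownerships with the wrong patterns (swapping the rows of P but not U) strictly raises the fidelity term while leaving the other two unchanged, so the energy correctly prefers the matched assignment.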

Breaking the hidden symmetry via weak supervision
Let S_K denote the permutation group of {1, ..., K}. Each permutation σ ∈ S_K is a 1-to-1 map of {1, ..., K} onto itself. For P = (p_1, ..., p_K) and U = (u_1, ..., u_K), write P_σ = (p_σ(1), ..., p_σ(K)) and U_σ = (u_σ(1), ..., u_σ(K)).

Theorem 2 (hidden symmetry of SMS). For any σ ∈ S_K,

E[P_σ, U_σ | I] = E[P, U | I].

In particular, suppose that (P*, U*) = arg min_{(P,U) ∈ adm_K} E[P, U | I] is an optimal pair. Then for any σ ∈ S_K, (P*_σ, U*_σ) is a minimizer as well.
The proof is straightforward and thus omitted. Such symmetry not only worsens the nonuniqueness of the minima of the nonconvex energy functional in (37), but can also jitter intermediate solutions in iterative computational schemes (i.e., cause hysteresis-like transitions in the admissible space).
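The permutation symmetry is easy to verify numerically. The following sketch uses a minimal 1D discretization of (37) as a stand-in for E[P, U | I] (all data random and illustrative) and checks that relabelling the channels leaves the energy unchanged:

```python
import numpy as np

def sms_energy(P, U, I, lam=1.0, alpha=1.0, eps=0.05, h=1.0):
    """Minimal discrete stand-in for E[P, U | I] in (37) (1D, forward differences)."""
    fid = 0.5 * lam * np.sum(P * (U - I) ** 2) * h
    sob = 0.5 * alpha * np.sum((np.diff(U, axis=1) / h) ** 2) * h
    own = eps * np.sum((np.diff(P, axis=1) / h) ** 2) * h \
        + (1 / eps) * np.sum(P**2 * (1 - P)**2) * h
    return fid + sob + own

rng = np.random.default_rng(1)
K, n = 3, 50
I = rng.standard_normal(n)
U = rng.standard_normal((K, n))
P = rng.random((K, n))
P /= P.sum(axis=0)                      # normalize columns onto the simplex

sigma = np.array([2, 0, 1])             # a permutation of {0, 1, 2}
E0 = sms_energy(P, U, I)
E1 = sms_energy(P[sigma], U[sigma], I)  # relabel both fields consistently
assert np.isclose(E0, E1)
```

Since the energy is a plain sum over channels, any consistent relabelling of (P, U) merely reorders the summands, which is exactly the hidden symmetry of Theorem 2.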
To break the permutation symmetry, we turn to a weak supervision scheme in which a user specifies K distinct, pairwise disjoint domain patches

Q_1, Q_2, ..., Q_K ⊂ Ω,

and imposes the symmetry-breaking conditions

p_i|_{Q_j} = δ_{ij}, i, j = 1 : K, (46)

where δ_{ij} denotes Kronecker's delta. That is, a user requires each given patch Q_i to be a "pure" pattern exclusively labelled by i. Computationally, this weak supervision process can be automated based on multiscale patch statistics, as in the contemporary works on scene recognition [1,2], or more generally, learning theory [36,37].

Existence theorems for nonsupervision and supervision
In this section, we briefly state the existence theorems for the soft Mumford-Shah segmentation model (37), without and with the supervision (46). The detailed proof has been moved to the appendix at the suggestion of one of the referees; skipping this section causes no serious difficulty in comprehending or implementing the models. Mathematically, the existence issue has special appeal for model developers, especially for models that are highly nonconvex. Nonconvex variational models normally fail to guarantee the uniqueness of optimal solutions, and existence is hence often the best one can hope to establish theoretically [38]. Notice that most interesting variational models in contemporary image and visual analysis are nonconvex, including, for example, the original Mumford-Shah model [6], various image restoration models (e.g., deblurring and dejittering) [7,39], as well as most optical-flow models [38].
The theoretical proof in the appendix, however, does reveal an important behavior of the model (37) that carries practical implications. If a certain channel i becomes dumb in the limiting process of the proof (i.e., the limit p_i* ≡ 0 for a minimizing sequence), it has often been introduced unnecessarily in the first place, and the associated optimal pattern u_i* could be any featureless constant image. The related issue of determining an optimal class number K (i.e., containing no dumb channels while missing no visually important ones) is also intrinsically driven by the complexity theory of natural images, in particular multiscale complexity [40]. Theoretically, K could be any integer, ranging from zero to infinity, as one zooms into the details of a continuum image from the atomic scale up to the ordinary observational scales of the naked eye. Thus ideally, K itself could be introduced as a random variable taking values 0, 1, 2, ..., and become part of the model itself. This idea has already been explored in purely stochastic settings; see, for example, Tu and Zhu [10].
For the supervised scenario motivated earlier, the following existence theorem still holds.

Theorem 4.
Suppose that I ∈ L²(Ω). Then an optimal pattern-ownership pair must exist for the soft Mumford-Shah segmentation model (37) with supervision (46), provided that each patch Q_i has a positive Lebesgue measure |Q_i| > 0.
The proof is almost identical to that of the unsupervised case in the appendix, and simplifies substantially upon noticing that no channel can become dumb under supervision. Furthermore, the functions ρ_i's in the previous proof can be directly set to

ρ_i = 1_{Q_i} / |Q_i|, i = 1 : K,

without the necessity of turning to Lemma 2.

Mixture of homogeneous Gaussians
When each pattern i is a homogeneous Gaussian N(m_i, σ²) with a distinct mean value m_i, the patterns u_i ≡ m_i become unknown constants, the Sobolev terms vanish, and the energy (37) reduces to

E[P, m | I] = (λ/2) Σ_{i=1}^K ∫_Ω p_i (m_i − I)² + Σ_{i=1}^K ∫_Ω ε |∇p_i|² + (1/ε) p_i²(1 − p_i)², (49)

with m = (m_1, ..., m_K).

Theorem 5. Suppose that I ∈ L²(Ω). Then a minimizer pair (P*, m*) of E[P, m | I] exists in both the unsupervised and supervised cases.
The proof can be derived readily from the preceding general cases and is hence left out. When K = 2, a similar model was proposed earlier by Shen [33] under a symmetrization transform of the two ownerships; the model (49) may be viewed as its generalization to general K.
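Minimizing (49) over each m_i with P fixed is an elementary quadratic problem, whose solution is the ownership-weighted average m_i = ∫_Ω p_i I / ∫_Ω p_i. A sketch (with a hand-made two-pattern signal for illustration):

```python
import numpy as np

def update_means(P, I, h=1.0):
    """Optimal means for the homogeneous-Gaussian model (49) with P fixed:
    m_i = int p_i I / int p_i (ownership-weighted averages)."""
    num = np.sum(P * I, axis=tuple(range(1, P.ndim))) * h
    den = np.sum(P, axis=tuple(range(1, P.ndim))) * h
    return num / den

# Illustration: two patterns with known means 1 and 3, exact ownerships.
I = np.concatenate([np.full(50, 1.0), np.full(50, 3.0)])
P = np.stack([np.r_[np.ones(50), np.zeros(50)],
              np.r_[np.zeros(50), np.ones(50)]])
m = update_means(P, I)
assert np.allclose(m, [1.0, 3.0])
```

With soft (rather than 0/1) ownerships, the same formula simply blends pixels in proportion to how much each pattern owns them.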

Euler-Lagrange equations on (K − 1)-simplex
To minimize the energy E[P, U | I] of soft Mumford-Shah segmentation, one resorts to its gradient-descent flow or Euler-Lagrange equations. In this section, we discuss these equations and their practical computational schemes. The first-order partial variation on U given P leads to, for i = 1 : K,

−α Δu_i + λ p_i (u_i − I) = 0 in Ω, ∂u_i/∂n = 0 along ∂Ω,

where n stands for the outer normal vector field along ∂Ω. Thus the Euler-Lagrange equations on the patterns are all linear Poisson equations with variable-coefficient fields:

−α Δu_i + λ p_i u_i = f_i,

with Neumann adiabatic boundary conditions, where the source terms are f_i(x) = λ p_i(x) I(x). The first-order variation on the ownerships P is carried out on the probability (K−1)-simplex Δ^{K−1}, which is a compact manifold (with boundary) of codimension 1 embedded in R^K. Chan and Shen [42] developed a general framework for modeling and computing image features that "live" on general manifolds, especially those embedded in R^K. We will follow the approach there.
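Before turning to the constrained variation on P, note that each pattern equation above can be integrated by any standard elliptic solver; a minimal sketch is a Jacobi sweep on a 1D grid with reflecting (Neumann) end conditions. All parameters here are illustrative:

```python
import numpy as np

def solve_pattern(p, I, u0, lam=1.0, alpha=1.0, h=1.0, iters=200):
    """Jacobi sweep for the pattern equation (a sketch of one possible solver):
        -alpha * Lap(u) + lam * p * u = lam * p * I,   du/dn = 0 on the boundary,
    discretized on a 1D grid; 'edge' padding realizes the Neumann condition."""
    u = u0.copy()
    for _ in range(iters):
        up = np.pad(u, 1, mode="edge")    # ghost cells: du/dn = 0 at both ends
        nbr = up[:-2] + up[2:]            # u[j-1] + u[j+1]
        u = (alpha * nbr / h**2 + lam * p * I) / (2 * alpha / h**2 + lam * p)
    return u

# Sanity check: with p = 1 and a constant image, the exact solution is u = I.
n = 50
u = solve_pattern(np.ones(n), np.full(n, 2.0), u0=np.zeros(n))
assert np.allclose(u, 2.0, atol=1e-6)
```

Where p_i is small, the equation is dominated by the Laplacian, which is precisely the long-range extrapolation behavior discussed after (17).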
Without the simplex constraint on the ownerships, for any given U, the first-order variation of the soft energy E under P → P + δP is given by

δE = ∫_Ω V · δP + ∫_∂Ω v · δP dH¹,

where H¹ is the 1D Hausdorff measure along ∂Ω, and

V_i = (λ/2)(u_i − I)² − 2ε Δp_i + (2/ε) p_i (1 − p_i)(1 − 2p_i), v_i = 2ε ∂p_i/∂n, i = 1 : K. (55)

Define V = (V_1, ..., V_K) and v = (v_1, ..., v_K). The variation above holds for any free variation of P in R^K, and one writes, in the free-gradient form,

∂E/∂P = V in Ω, with the natural boundary condition v = 0 along ∂Ω.

In reality, P ∈ Δ^{K−1}. Let T_P Δ^{K−1} denote the tangent space of Δ^{K−1} at any point P ∈ Δ^{K−1}, and let Π denote the orthogonal projection of R^K onto this tangent space.
Since the normal direction of the tangent plane is given by n_Δ = (1, 1, ..., 1)/√K, the projection operator is explicitly given by, for any w ∈ R^K,

Π w = w − (w · n_Δ) n_Δ, that is, (Π w)_i = w_i − (1/K) Σ_{j=1}^K w_j.

The constrained gradient of E on Δ^{K−1} is therefore given by Π V. In particular, the system of Euler-Lagrange equations on P given U is

V_i − (1/K) Σ_{j=1}^K V_j = 0 in Ω, with (Π v)_i = 0 along ∂Ω, (62)

for i = 1 : K. The coupling among the different channels is evident from these two formulae.
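The projection Π amounts to subtracting the channel mean; a small sketch (with a random gradient field as illustration) verifies that projected gradients are tangent to the simplex, so a descent step P → P − dt·Πw preserves Σ_i p_i = 1:

```python
import numpy as np

def project_tangent(w):
    """Orthogonal projection of w onto the tangent space of the probability
    simplex: (Pi w)_i = w_i - (1/K) sum_j w_j (channel-mean removal)."""
    return w - np.mean(w, axis=0)

rng = np.random.default_rng(2)
V = rng.standard_normal((4, 10))     # a free gradient field with K = 4 channels
g = project_tangent(V)

# The projected gradient sums to zero across channels ...
assert np.allclose(g.sum(axis=0), 0.0)
# ... and the projection is idempotent, as any orthogonal projection must be.
assert np.allclose(project_tangent(g), g)
```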
Notice that, along ∂Ω, the projected boundary condition Π v = 0 is in fact equivalent to the ordinary Neumann conditions.

Proof. This is obtained by direct computation: at any z ∈ ∂Ω, since Σ_{j=1}^K p_j ≡ 1,

Σ_{j=1}^K v_j(z) = 2ε ∂(Σ_j p_j)/∂n (z) = 0, and hence Π v = v − (1/K)(Σ_j v_j)(1, ..., 1) = v.

As a result, the boundary conditions in (62) simplify to the ordinary Neumann conditions ∂p_i/∂n = 0, i = 1 : K. Combining all the above derivations, we have established the following theorem.

Theorem 6 (Euler-Lagrange equations). The system of Euler-Lagrange equations of E[P, U | I] is given by, for i = 1 : K,

−α Δu_i + λ p_i (u_i − I) = 0, V_i − (1/K) Σ_{j=1}^K V_j = 0, (64)

on Ω, with Neumann conditions ∂u_i/∂n = ∂p_i/∂n = 0 along ∂Ω, where the V_i's are as in (55). Under the supervision (46), the ownerships must in addition satisfy the interpolation conditions p_i|_{Q_j} = δ_{ij}; or equivalently, the equations on the p_i's in (64) hold on Ω \ (∪_{i=1}^K Q_i) with Neumann conditions along ∂Ω and Dirichlet conditions (46) on the patches Q_i.

Similarly, one has the following result for the piecewise constant SMS model (49), which carries much lower complexity compared with the full SMS model: the optimal means satisfy m_i = ∫_Ω p_i I / ∫_Ω p_i, and the ownerships satisfy the projected system in (64) with

V_i = (λ/2)(m_i − I)² − 2ε Δp_i + (2/ε) p_i (1 − p_i)(1 − 2p_i), (66)

with Neumann conditions for all the ownerships p_i's along ∂Ω.

Computation of the Euler-Lagrange equations
Computationally, as is well practiced in multivariate optimization problems, (64) and (66) can be solved via the algorithm of alternating minimization (AM) [30,39]. The AM algorithm is closely connected to the celebrated expectation-maximization (EM) algorithm in statistical estimation problems with hidden variables [3,43]. In the current context, the ownership distributions p_i's can be treated as the hidden variables. Like EM, the AM algorithm is progressive. Given the current (t = n) best estimation of the patterns U^n = (u_i^n | i = 1 : K), by solving

V_i^n − (1/K) Σ_{j=1}^K V_j^n = 0, i = 1 : K, (68)

with Neumann boundary conditions, one obtains the current best estimation of the ownerships P^n = (p_i^n | i = 1 : K). Subsequently, based on P^n, by solving

−α Δu_i^{n+1} + λ p_i^n u_i^{n+1} = λ p_i^n I, i = 1 : K, (70)

with Neumann boundary conditions, one completes a single round of pattern updating U^n → U^{n+1}. The same procedure applies to the piecewise constant SMS equations in (66). Since the system (70) is linear and decoupled, the main computational complexity resides in the integration of (68), which is coupled and nonlinear due to the simplex constraint and the double-well potential in the energy. Define e_i(x) = (u_i(x) − I(x))² and e = (e_i | i = 1 : K). In order to solve

V_i − (1/K) Σ_{j=1}^K V_j = 0, i = 1 : K, (71)

given e and V = V(P, U) = V(P, e) (see (55)), first notice that the Laplacian contributions cancel in the channel average,

(1/K) Σ_j V_j = (1/K) Σ_j [ (λ/2) e_j + (2/ε) p_j (1 − p_j)(1 − 2p_j) ],

since Σ_i p_i = 1 and Δ(Σ_i p_i) = 0. One may further split the double-well force p_i(1 − p_i)(1 − 2p_i) in (71) into a part evaluated at the current iterate and a part linear in the updated unknown, so that the nonlinear equation (71) can be solved iteratively, P^(k) → P^(k+1), by a linearization procedure in which each step only requires the solution of a system of linear Poisson equations with Neumann adiabatic boundary conditions for all the channels i = 1 : K. Such systems can be conveniently integrated using any elliptic solver. The detailed numerical analysis of the convergence behavior of the entire algorithm, however, is still an open problem and well deserves systematic investigation.

Figure 3: Synthetic image of a T-junction: hard segmentation from the SMS model via the "hardening" formulae (5) and (6).
The 120-degree regularization behavior at the junction point is also well known in the classical Mumford-Shah model [6].
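Putting the pieces together for the piecewise constant model (49), the following sketch alternates exact mean updates with explicit projected-gradient steps on P. It deliberately replaces the semi-implicit linearization described above with a plain explicit step, and the quantile initialization of the means is a heuristic assumption, not part of the model:

```python
import numpy as np

def am_segment(I, K=2, lam=4.0, eps=0.5, dt=0.1, sweeps=300):
    """AM sketch for the piecewise constant SMS model (49) on a 1D grid (h = 1):
    exact mean updates alternate with explicit projected-gradient steps on P."""
    n = I.size
    m = np.quantile(I, np.linspace(0.1, 0.9, K))     # heuristic initial means
    P = np.full((K, n), 1.0 / K)                     # start fully undecided
    for _ in range(sweeps):
        e = (m[:, None] - I) ** 2                    # e_i(x) = (m_i - I)^2
        pad = np.pad(P, ((0, 0), (1, 1)), mode="edge")   # Neumann ends
        lap = pad[:, :-2] - 2.0 * P + pad[:, 2:]
        # V_i from (55) with u_i = m_i: fidelity + diffusion + double-well force
        V = 0.5 * lam * e - 2.0 * eps * lap \
            + (2.0 / eps) * P * (1 - P) * (1 - 2 * P)
        g = V - V.mean(axis=0)                       # project onto simplex tangent
        P = np.clip(P - dt * g, 0.0, 1.0)
        P /= P.sum(axis=0)                           # stay on the simplex
        m = (P * I).sum(axis=1) / P.sum(axis=1)      # m_i = int p_i I / int p_i
    return P, m

I = np.concatenate([np.zeros(60), np.ones(60)])      # clean two-level signal
P, m = am_segment(I)
owner = np.argmax(P, axis=0)                         # hardening formula (5)
assert owner[0] != owner[-1]                         # the two halves separate
```

On this clean two-level signal the ownerships polarize and the recovered means approach the two true levels 0 and 1, up to blending across the narrow transition band.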

COMPUTATIONAL EXAMPLES
In this section, we present the computational results of the proposed soft Mumford-Shah model. Notice that the extension of the above SMS models to color images is straightforward by having the gray values u_i's replaced by RGB vectors. (We, however, must remind the reader that perceptually RGB may not be the most ideal representation of colors compared with other nonlinear approaches, e.g., brightness-chromaticity [42] and HSV [44].) Figures 3 and 4 illustrate the performance of the SMS model on two synthetic images with multiple phases. Figure 3 shows a typical T-junction and Figure 4 shows a 3-phase image with a narrow bottleneck. Plotted in the figures are the hard segments obtained from the SMS model via the hardening formulae (5) and (6).
Plotted in Figures 5 and 6 are the hardened segments of two MRI brain images computed by the soft Mumford-Shah segmentation model via formulae (5) and (6). For this application, a user specifies three small patches (three rectangles in both examples) Q 1 , Q 2 , and Q 3 , and the SMS model proceeds with the extra interpolation conditions in (46) for the ownerships. Notice in the second example that the detailed branching of the complex boundary is well resolved by the model.
In Figure 7, another example of a natural image is segmented via the SMS model and the "hardening" formulae (5) and (6). A user supervises with three patches Q_1, Q_2, and Q_3, and designates the two on the body to a pattern ownership p_body and the third (from the ocean) to p_ocean. If the three are treated as distinct patterns, the SMS model still works, but one needs an extra step of high-level vision processing (e.g., based on Grenander's graph models [46]) to group the skin-tone and the purple-shirt patterns in order to capture the entire body faithfully.

Figure 4: Synthetic image of a narrow bottleneck: hard segmentation from the SMS model via the "hardening" formulae (5) and (6). The thickening regularization at the bottleneck junction can be explained similarly by the classical Mumford-Shah model, for which minimum-surface or "soap-foam" behavior arises due to the surface tension energy. Also see the recent work by Kohn and Slastikov [45] for the singularity analysis of a similar problem arising from micromagnetism.
Finally, plotted in Figures 8 and 9 are the ownerships from the SMS model based on the 3-phase and 4-phase supervisions separately in Figure 2. The stochastic nature of the outcomes (i.e., the softly transiting ownerships p i 's instead of hard segmentation) is closer to the way a human subject may perceive such a natural scene. In particular, the SMS model seems to be consistent with the most recent theory that hard pattern segments may not be absolutely necessary for natural scene recognition [1,2].

CONCLUSION
In this work, we have improved the celebrated Mumford-Shah segmentation to allow stochastic fuzziness of individual patterns. The proposed model outputs the ownership (or membership) probability distributions for all the patterns, from which the classical hard segmentation can be obtained via stochastic decision rules such as the principle of maximum likelihood or the Bayesian classifier.

Figure 6: A brain image with low noise: hard segmentation from the SMS model via the "hardening" formulae (5) and (6). Notice how the detailed branching of the gray matter has been successfully resolved by the model.
The key component of the new model is an ensemble of regularized double-well potentials inspired by the literature of material sciences and variational calculus. The model is nonconvex and the existence of optimal soft segmentation has been proven. A preliminary algorithm has been proposed and implemented, but without convergence analysis. Several generic numerical examples have demonstrated the flexibility and performance of the new model.
Our future work will mainly focus on (1) automating the weak supervision process based on statistical patch analysis, as inspired by the recent work of Li and Perona [1], and (2) developing a comprehensive framework for the effective computation of such a nonconvex and multivariate variational model (with Alan Yuille).

PROOF OF THE EXISTENCE THEOREM 3
We will need the following lemma for the proof.

Lemma 2. Suppose that 0 ≤ p_n ≤ 1 on Ω and p_n → p* a.e. as n → ∞, with ∫_Ω p* > 0. Then there exist a weight function ρ ≥ 0 with ∫_Ω ρ = 1, a constant B > 0, and an integer N, such that for any n > N and any w ∈ L²(Ω),

⟨w, ρ⟩²_{L²} ≤ B ∫_Ω p_n w².

Proof. Denote the Lebesgue measure of a measurable set W by |W|. Since p* ≥ 0 and ∫_Ω p* > 0, there must exist some c > 0 such that V = {x ∈ Ω : p*(x) > 2c} has a finite but positive measure. By Egorov's theorem [47], one may further choose a measurable subset W ⊆ V with |W| > 0 on which the convergence p_n → p* is uniform.
In particular, there exists some N such that for any n > N, p_n > c on W. Define

ρ(x) = 1_W(x) / |W|.

Figure 7: Original noisy image and its hard segmentation via (5) and (6), based on a 2-phase supervision. Denote the two rectangles on the body by Q_1 and Q_2, and the third by Q_3. Supervision provides the ownership interpolation condition: p_body = 1 on Q_1 ∪ Q_2 and 0 on Q_3, while p_ocean = 1 on Q_3 and 0 on Q_1 ∪ Q_2. Patch selection can also be automated based on multiscale patch statistics (e.g., see Li and Perona [1]).
Then ∫_Ω ρ = 1, and for any n > N and w ∈ L²(Ω),

∫_Ω p_n w² ≥ c ∫_W w².

Thus by the Schwarz inequality (or E[X]² ≤ E[X²] in probability theory),

⟨w, ρ⟩²_{L²} = ( (1/|W|) ∫_W w )² ≤ (1/|W|) ∫_W w² ≤ (1/(c|W|)) ∫_Ω p_n w².

The lemma holds if one defines B = 1/(c|W|). We are now ready to prove Theorem 3.
Proof. Take the special pattern distribution P ≡ e_1 (i.e., p_1 ≡ 1 and p_i ≡ 0 for i ≥ 2) with U ≡ 0; then E[P, U | I] = (λ/2) ∫_Ω I² < ∞. Thus the infimum of the energy must be finite. Let (P^n, U^n | n) ⊆ adm_K (see (40)) be a minimizing sequence for the soft Mumford-Shah energy (37), that is, E[P^n, U^n | I] converges to inf_{P,U} E[P, U | I].
Due to the third term in the energy and the simplex constraint, for each channel i, (p_i^n | n) must be bounded in H¹(Ω). By L²-weak compactness, there must exist some P* ∈ L²(Ω, R^K) and a subsequence of (P^n | n), which after relabelling will still be denoted by (P^n | n) for convenience, such that for each i,

p_i^n ⇀ p_i* and ∇p_i^n ⇀ ∇p_i*, weakly in L², n → ∞.

Then by the lower semicontinuity of the Sobolev energies under weak L² convergence,

∫_Ω |∇p_i*|² ≤ lim inf_{n→∞} ∫_Ω |∇p_i^n|², i = 1 : K.

Furthermore, with possibly another round of subsequence refinement, one can assume that

P^n(x) → P*(x), a.e. x ∈ Ω, n → ∞. (A.10)

Since the probability simplex Δ^{K−1} is closed and P^n(x) ∈ Δ^{K−1}, one concludes that P*(x) ∈ Δ^{K−1}, a.e. x ∈ Ω. And by Fatou's lemma [47,48], one has

∫_Ω (p_i*)²(1 − p_i*)² ≤ lim inf_{n→∞} ∫_Ω (p_i^n)²(1 − p_i^n)². (A.12)

(In fact, equality holds by Lebesgue's dominated convergence [48].) After the above subsequence selection on the P^n's, one naturally has an associated subsequence of (U^n | n), which for convenience is still denoted by (U^n | n) after relabelling. For each specific channel i, we then consider two scenarios separately.
Suppose first that p_i*(x) ≡ 0, a.e. x ∈ Ω. We then define for that channel u_i* ≡ 0. Such a channel is called a "dumb" channel. Otherwise, one must have ∫_Ω p_i* > 0, and from the first term in (37), along the minimizing sequence,

∫_Ω p_i^n (u_i^n − I)² ≤ C

for some constant C independent of n.

Figure 8: Soft segmentation of the natural image in Figure 2(a). Plotted here are the three ownership distributions p_1(x), p_2(x), and p_3(x). Due to "under"-supervision, namely that the number K of specified patterns is less than that of the visually meaningful ones, the grass pattern has "absorbed" the ocean pattern due to the greenish color they happen to share.
Since ∫_Ω I² p_i^n ≤ ∫_Ω I², by the triangle inequality,

∫_Ω p_i^n (u_i^n)² ≤ C_i

for some constant C_i independent of n, and hence, by Lemma 2, the means ⟨u_i^n, ρ_i⟩_{L²} are uniformly bounded. Then by the generalized Poincaré inequality [48,49] on Ω,

∥w − ⟨w, ρ_i⟩∥_{L²} ≤ A_i ∥∇w∥_{L²}, (A.18)

where A_i = A_i(ρ_i, Ω) is independent of w ∈ H¹(Ω), one concludes that

∥u_i^n∥_{H¹} ≤ D_i

for some constant D_i. As a result, (u_i^n | n) must be bounded in H¹(Ω). By the L²-weak compactness of bounded H¹ sequences, there is a subsequence of (u_i^n | n), for convenience still denoted by (u_i^n | n) after relabelling, such that

u_i^n → u_i* ∈ L²(Ω), n → ∞, (A.20)

converging in the sense of both L² and almost everywhere. Then by lower semicontinuity,

∫_Ω |∇u_i*|² ≤ lim inf_n ∫_Ω |∇u_i^n|², and ∫_Ω p_i* (u_i* − I)² ≤ lim inf_n ∫_Ω p_i^n (u_i^n − I)².

Combining both cases analyzed above, we have established that

E[P*, U* | I] ≤ lim inf_{n→∞} E[P^n, U^n | I] = inf_{(P,U) ∈ adm_K} E[P, U | I],

and hence (P*, U*) must be a minimizer. (We must caution the reader that, since index relabelling has been performed a couple of times to simplify notation, this last sequence (P^n, U^n) is not the one we started with originally.) This completes the proof.

ACKNOWLEDGMENTS
The author is very grateful to Professor Alan Yuille for an enlightening discussion after the current work was first presented. For their generous teaching and continual inspiration, the author is always profoundly indebted to Professors Gil Strang, Tony Chan, Stan Osher, David Mumford, Jean-Michel Morel, and Stu Geman. The author must thank his wonderful former teacher, Professor Dan Kersten at the Psychology Department of the University of Minnesota, for first introducing him to mixture image models and stochastic visual processing several years ago. The author also thanks the Institute for Mathematics and its Applications (IMA) and the Institute for Pure and Applied Mathematics (IPAM) for their persistent role in supporting this new emerging field. Finally, the author would like to dedicate this paper to his dear friends Yingnian Wu and Song-Chun Zhu for the unique friendship cultivated by the intellectually rich soil of vision and cognitive sciences. The referees' generous help is also gratefully acknowledged. This work has been partially supported by the NSF (USA) under Grant no. DMS-0202565.