A Bayesian Generative Model for Surface Template Estimation

3D surfaces are important geometric models for many objects of interest in image analysis and Computational Anatomy. In this paper, we describe a Bayesian inference scheme for estimating a template surface from a set of observed surface data. In order to achieve this, we use the geodesic shooting approach to construct a statistical model for the generation and the observations of random surfaces. We develop a mode approximation EM algorithm to infer the maximum a posteriori estimation of initial momentum μ, which determines the template surface. Experimental results of caudate, thalamus, and hippocampus data are presented.


Introduction
3D surfaces are important geometric models for many objects of interest in image analysis and Computational Anatomy. For example, they are often used to represent the shape of 3D objects, the surface of human faces, and the boundaries of brain structures or of other human organs. Most data analysis methods in this domain are templatecentered, and a proper estimation of a template plays an important role to obtain high quality results. This paper is devoted to the description of statistically supported template estimation method which is adapted to surface data sets.
Our approach will be to build a generative statistical shape model in which the template is a parameter. We will then estimate it using maximum likelihood. This model relies on the very natural setting for which an observed surface is a noisy version of a random deformation of the template. This is the most generic and most basic approach of the deformable template paradigm, even if we add a small refinement by including a prior distribution on the template, based on what we will call a hypertemplate. Even with this global scheme which is fairly simple, we will see that implementing it in the context of surfaces will constitute a significant theoretical and numerical challenge.
At the exception of the recent work of [1], this approach significantly differs from what has been mostly proposed in the literature, in which most of the methods compute templates as averages over specific common parametrizations of the surfaces (using, for example, the sphere as a parameter space [2]). Parametric representations, however, are limited by the fact that, because they are defined a priori and independently for each object, they cannot be assumed to suitably align important features in a given data set of surfaces (i.e., give similar coordinates to similar features in the surfaces). This usually results in oversmoothed template surfaces (which is the equivalent of getting blurry template images in the case of image averaging). In [1], a similar diffeomorphic transformation model is used, but, as we will see, our Bayesian construction will provide a wellspecified template whereas [1] needs to rely to topologically unconstrained approximations to end up with a manageable template.
In addition to the references above, there have been several publications addressing the issue of shape averaging 2 International Journal of Biomedical Imaging over a dataset, although most of them work with 3D volume data or landmark points set. In several cases, the average is based on metric properties of the space of shapes [3][4][5][6][7], and the template is computed as an intrinsic average, minimizing the sum of square distances to each element of the set (Fréchet mean). Such methods have been implemented in the context of diffeomorphic deformation models (which are also models of interest in the present paper) for landmark matching [8], for 3D average digital atlas construction [9], and to quantify variability in heart geometry [10]. Other definitions of the average, adapted to situations in which the data is corrupted by noise, have been proposed [11][12][13], based on variational approaches (but not relying on a generative statistical model).
Our approach to build a generative statistical shape model is reminiscent of the construction developed in [14] for linear models of deformations, and in [15] for large diffeomorphic deformations in 3D volume image averaging. Adapting these ideas to surfaces will however require new algorithms and numerical developments.
In order to present our model, we need to first provide some background materials and notation, describing in particular the geodesic shooting equations that we will use to generate deformed surfaces. We will then introduce a random statistical model describing the generation and the observations of random surfaces. We then develop a Mode Approximation EM algorithm for surface averaging, to estimate the template from observations. In the optimization part, we derive and implement a new variational scheme, which is also applicable to surface matching, providing an alternative approach to the one originally proposed in [16,17]. Finally, we present and discuss experimental results on caudate, thalamus, and hippocampus data.

EPDiff for Surface Evolution
We will base our random shape model on the socalled EPDiff equation, which describes the evolution of deformable structures (like images, surfaces, or landmarks) under the action of groups of diffeomorphisms. It is a geodesic equation for a Riemannian metric on diffeomorphisms, and describes a momentum conservation law in the associated Hamiltonian system. The reader interested by the theory behind this equation can refer to [18][19][20], but most of this background will not be needed for the present paper, in which we will only use the specific form of the equations for surface evolution. The term EPDiff comes from its determination as an Euler-Poincaré equation in the group of diffeomorphisms, as introduced in [21]. One of its main interests here is that it provides a numerically stable, easily described, Hamiltonian evolution over diffeomorphisms, which will constitute an ideal support for our shape models.
The EPDiff equations describe the combined time evolution of a diffeomorphism, denoted φ(t, ·) and of what can be interpreted as a momentum, denoted p(t, ·). The initial conditions are always φ(0, x) = x for φ, and some initial value, p 0 , for p. This initial momentum will be a key component of the statistical model that will be built later on.
Let us start with the simplest form of the equation, which assumes that p 0 is a vector-valued function over R d , that is, Letting ∇ 1 K denote the gradient of K with respect to its first variable, the corresponding EPDiff equation is (2) Here, the notation a · b refers to the usual dot product between vectors in R 3 . If K is smooth enough, this system has solutions for arbitrary large t, and the mapping x → φ(t, x) is a diffeomorphism at all times.
The interesting fact about these equations is that they can have singular variants that are described in a similar way and have the same existence properties. The simplest way to relate the variants to the previous equation is to replace the Lebesgue's measure in the integrals by another, possibly singular, measure. For example, taking a surface S 0 in R 3 , we can use the volume form on S 0 as a reference measure and obtain the equations (in which p 0 and p(t, ·) need only to be defined on S 0 ) where dω S0 is the volume form on S 0 . Note that the first equation is defined for x ∈ R 3 , but it suffices to use it for x ∈ S 0 to obtain an equation for the evolving surface We can write a discrete form of the equations by replacing dy by a sum of Dirac measures (at points x 1 , . . . , x L in R 3 ), which gives, letting a l (t) = p(t, x l ), Similarly to (3), the first equation is valid for all x ∈ R 3 , but it suffices to solve it for x = x l , l = 1, . . . , L to obtain the evolution of the point set International Journal of Biomedical Imaging 3 Also, (5) can be seen as a discretization of (3) in which x 1 , . . . , x L are the vertices of a triangulation of S 0 , and a l (t) = p(t, x l )δσ S0 (x l ), where δσ S0 (x l ) is the area of a surface element around x l . The evolution of point sets is the most important from a practical point of view, since it is an ODE that can be easily implemented. Assuming a radial kernel K(x, y) = γ( x − y 2 ) like in (1), and denoting γ kl = γ( x k − x l 2 ), and γ kl = γ ( x k − x l 2 ), (5) can be rewritten as Once the initial position of the vertices, x(0) = (x 1 (0), . . . , x L (0)), and the initial momentum, a(0) = (a 1 (0), . . . , a L (0)), are provided, the evolution of the point set is uniquely determined.

Random Triangulated Surfaces.
If a triangulated template surface T with vertices x (T) is given, and we solve, until time t = 1, (5) initialized with x(0) = x (T) and a random initial momentum a(0) = α, the displaced vertices provide a random perturbation of the initial surface that will be denoted by T α . This is stated in the following definition. . Let α ∈ (R 3 ) L be a collection of L vectors in R 3 . Let (x(t), a(t)) be the solution of (5) with initial condition x(0) = x (T) and a(0) = α. One defines T α to be the triangulated surface with vertices x (Tα) = x(1) and the same topology as T.
By letting α be random, we build T α as a random deformation of T. This will form the "ideal", unobserved, surface, of which only a noisy version is observable (the noise process will be described in the next section).
Following [15], we will use a Bayesian formulation in which T is itself represented as a random deformation T 0,μ := (T 0 ) μ , where T 0 is a fixed surface that we will call the hypertemplate, and μ is a prior initial momentum shooting from T 0 to T (same notation as in Definition 1). One of the main interests of using a hypertemplate is to fix the topology of T so that it belongs, by construction, to the same class of objects as T 0 .
So, if N surfaces are observed, we need to model the probability distribution of the prior momentum, μ (starting at T 0 ), which specifies T = T 0,μ and of N deformation momenta α (1) , . . . , α (N) which specify the surfaces T α (1) , . . . , T α (N) . We now provide a statistical model for the joint probability distribution of μ, α (1) , . . . , α (N) . We first introduce some notation. Letting K be the kernel introduced in the previous section to define the geodesic shooting equations, we let Γ T be the 3L by 3L matrix formed with the 3 by 3 blocks K(x (T) k , x (T) l )Id R 3 . We define, for a triangulated surface T with L vertices x (T) , and α ∈ R 3L , We define the joint distribution of μ, α (1) , where λ is a fixed parameter regulating the weight on the hypertemplate.
There is a technical difficulty here, which is that one must make sure that this probability can be normalized (Z exists), which requires that the exponential is integrable. That this is true is not straightforward, and we have not been able to find a proof that works with any choice of the kernel K. One way to deal with this is to introduce a constant A μ (which can be chosen arbitrarily large so that it does not interfere with the algorithms that will follow), and add to the model the constraint that μ T0 is smaller than A μ . Under such an assumption, one obtains (after integrating out the α's) This is finite, since, for any given μ, the transformation x (T0) → x (T0,μ) is the restriction of a diffeomorphism to the vertices of T 0 (as seen from (5)). This implies that Γ T0,μ is nonsingular, and its determinant is bounded away from 0 when μ is restricted to a compact space. In fact, the choice A μ = ∞ can be proved to be acceptable for a large class of kernels. Those are kernels for which the smallest eigenvalue of Γ T decreases at a speed which is at most polynomial in the minimal distance, h T , between the vertices. A list of kernels satisfying this property can be found in [22]. For such kernels, we find that (L being fixed) log det Γ T = O(log h T ). Just sketching the argument here, one can prove, using elementary properties of dynamical systems, that h T = O(exp(−C μ T0,μ ) for some constant C, so that the log determinant in (9) is linear in μ T0,μ and Z is well defined, even with A T0,μ = ∞. For very smooth kernels, including the Gaussian, bounds on the smallest eigenvalue of Γ T are much worse (with a decay which is exponential in (−1/h 2 T )), and the previous argument does not work. Since the bounds in [22] hold uniformly with respect to the number of points, a polynomial bound may still be valid for a fixed L, although we were unable to discover it.
Notice that, conditionally to the template, the momenta α are independent and follow a Gaussian distribution with inverse covariance matrix given by Γ T . An example of simulated random deformations obtained using such a model is provided in Figure 1. The six surfaces following the template are independent realization of the model described in (9).

Observation and Noise.
The second part of our generative model is to describe the observation process, which takes an ideal surface T α generated according to the model above, and returns the noisy observable.
Modeling such a noise process is a tricky issue. Obvious choices (like adding noise to the vertices of T α ) do not work because one cannot assume that the observed surfaces are discretized consistently with the template. In this paper, we will work around this issue by assimilating the observation of a surface that of a singular Gaussian process.
For this, we consider that surfaces in R 3 are not observable directly, but only via their action on test functions, that we will call sensors. We define a sensor to be a smooth vector field w over R 3 (typically with small support). Given an oriented surface S, define where N S is the normal to S. The real number, (S, w) is the measurement of S through the sensor w. Now, modeling noisy surfaces will result in assuming that, given any w, the measurement (S, w) is a random variable. We will assume that it is Gaussian, and more generally, that, given m sensors, w 1 , . . . , w m , the random vector ((S, w j ), j = 1, . . . , m) is Gaussian.
S, via its action on sensors, is therefore modeled as a Gaussian random field. Given an ideal surface T α , we will assume that its mean is given by and the process is thereafter uniquely characterized by its covariance operator We will assume that this covariance is associated to a symmetric operator L obs so that (The apparently redundant parameter σ 2 , which could have been included in L obs , appears here because it can be easily estimated by the algorithm, with the operator L obs remaining constant.) To finalize our model, it remains to describe how S is discretized, that is, to make explicit a finite family of sensors through which S is measured. Let (z j , j ∈ J) form a regular grid of points in Ω. Let γ s be a radial basis function (a Gaussian, e.g.,) and define, for j ∈ J and d = 1, 2, 3 where e d is the dth vector of the standard basis of R 3 (this therefore specifies 3|J| sensors). The resulting observed variables are where N (d) S = N S · e d is the dth coordinate of N S . These variables are, by assumption, jointly Gaussian, with means m j,d = (T α , w j,d ) and covariance matrix Assuming that L obs is translation invariant, the resulting expression is a function of z i − z i that we will denote Let R obs = (r (obs) i j ) be the inverse matrix of the one with coefficients (γ obs (z j − z j ), j, j ∈ J). The log likelihood of the process will include error terms taking the form International Journal of Biomedical Imaging 5 Replacing y j,d by its expression in (16), we have Let us analyze the first term. We have with the notation Treating the other terms similarly, we can rewrite the error term in the form and we will abbreviate this (introducing a notation for the right-hand side) as E obs = (1/σ 2 ) S − T α 2 obs . Thus, we can write We have the following important proposition.

Proposition 1.
Assume that γ s is taken equal to the Green function of L obs . Then, when the grid J becomes finer and K obs is given by (22), one has See the appendix for a proof of this proposition (which requires some background on the theory of Hilbert spaces with a reproducing kernel) and possible extensions. For practical purposes, we will use γ obs instead of K obs in · obs , therefore assuming that the sensors are chosen according to the proposition. It is interesting to notice that the resulting norm in this case is precisely the norm that has been introduced in [16] to compare surfaces, when they are considered as elements of a reproducing kernel Hilbert space of currents. We refer to [16] for details on the mathematical construction.
In the rest of the paper, we develop a parametric procedure to estimate the template T from the observation of i.i.d. surfaces S (1) , S (2) , . . . , S (N) generated as described above. This includes in particular N hidden deformation momenta α (1) , α (2) , . . . , α (N) , such that the complete distribution of observed and unobserved variables is where we have written for short T = T 0,μ . We now discuss the estimation of the parameter μ, and of the associated template T 0,μ , which is the main purpose of this paper. This will be implemented with a mode approximation of the EM algorithm, as described in the next section.
The EM algorithm is an iterative method that updates a current estimation of μ using the following two Eand Msteps.
M-step: maximize this expression with respect to μ and replace the current estimation of μ by the obtained maximizer.
The conditional expectation can be expanded as Considering the highly nonlinear relation between α (n) and the deformed surface T α (n) , an explicit computation in (27) is impossible. We will therefore rely on the classic mode approximation in the EM which replaces the conditional distribution by a Dirac measure taken at the conditional mode, yielding where α (n) = arg min This results in a maximum a posteriori procedure that maximizes alternatively in μ and in the α (n) 's. Like in [15], we will refer to it as a mode approximation to the EM (MAEM) rather than a MAP algorithm, in order to strengthen the fact that it is an approximation of the maximum a posteriori procedure relying on the likelihood of the observed data only. As illustrated in [14], the MAEM can be biased (leading to inexact estimation of the template, even with a very large number of samples), especially when the noise is important, but it is obviously more feasible than the exact EM. Notice that the MAEM method is a special form of EM algorithm, and as such optimizes a lower bound of the log likelihood of the observed data.
We summarize the two steps of the ith iteration of the MAEM in our case. Suppose μ and the α (n) 's are the current variables to be updated. Then the next iteration is as follows.
M step: with α (n) fixed, update μ with the minimizer of We now discuss how each of these steps can be implemented.

MAE
Step. Our goal in this section is to optimize (30). We work with fixed n and drop it from the notation to simplify the expressions. The objective function is This problem is equivalent to the surface matching algorithm considered in [16], with a slightly different formulation since [16] optimize an energy with respect to a time-dependent momentum instead of just the initial momentum (i.e., they solved simultaneously the geodesic estimation and the matching problems). These two formulations are equivalent when using continuous time (they produce the same minima), but they yield different results when discretized. In our setting, formulating the problem as in (32) is natural, and focuses on the modeled random variable, α. We need to compute the variation of E with respect to α. This computation will be useful for the M-step also. We first discuss the discretization of the error term, which follows [16].
Then a discrete approximation of S − S 2 obs is where S is considered as fixed and U is therefore considered as a function of the vertices, x (S ) , of S . With this notation, we can write International Journal of Biomedical Imaging 7 We want to compute the gradient of E and for this, apply the chain rule to the transformations α → T α → U(x (Tα) ), yielding where A * is the transpose matrix of A.
The gradient of U with respect to x (S ) has been computed in [16], and is given as follows. We denote by F (S) l the set of faces (triangles) that contain a vertex x l in a triangulated surface S. For f ∈ F (S) l , we let e l ( f ) denote the edge opposed to x l in f , positively oriented in f . With S = T α , we have where a g = 1 if g ∈ F (S ) and a(g) = −1 if g ∈ F (S) . Now let us derive the variation of x (Tα) with respect to the initial momentum, α. We know that x (Tα) = x (1), where x and a evolve according to the system (7) dx k dt with x(0) = x (T) and a(0) = α (and γ kl , γ kl are short for γ( x k − x l 2 ) and γ ( x k − x l 2 )). Now an infinitesimal variation α → α + δα in the initial condition induces infinitesimal variations a + δa and x + δx over time, and the pair (δx, δa) obeys the following differential system, that can be obtained from a formal differentiation of (7): with γ kl = γ ( x k − x l 2 ). One can rewrite it in the matrix form: where J(t) = Jxx Jxa Jax Jaa with (42) Solving this system with initial condition δx(0) = 0 and δa(0) = δα provides what we have denoted One does not need to compute all the coefficients of the matrix dx (Tα) /dα using this equation in order to apply the transpose in (36). This is fortunate because this would constitute a computationally demanding effort given that this matrix is 3L by 3L with L large. The right hand side of (36) can be in fact computed directly using a single dynamical system, given by d dt where J(t) is defined in (39). If (44) is solved from time t = 1 to time t = 0 with η x (1) = ∂ x U and η α (1) = 0, then This is a simple consequence of the theory of linear differential systems (a proof is provided in the Appendix for completeness). Note that the matrix J(t) depends on the solution of (7) computed with initial conditions x(0) = x (T) and a(0) = α. To emphasize this dependency, we will denote it J(t) = J (T,α) (t) in the following. Given this, we see that a variation α → α + δα induces a first-order variation δE of the energy given by 8 International Journal of Biomedical Imaging where the T-dot product is δα , α T = δα · (Γ T α) and Γ T is the matrix formed with 3 by 3 blocs γ kl Id R 3 . We choose to operate the gradient descent with respect to this dot product and therefore choose a variation proportional to δα = −(2α + Γ −1 T η α (0)). So, the algorithm to compute an optimal α is the following.
Algorithm 1 (MAE Step for Surface Template Estimation).
This algorithm has to be applied N times (for all α k , k = 1, . . . , N) in the MAE step. Remark 1. The matrix Γ T being typically very badly conditioned, we numerically compute Γ −1 T η α after adding a small positive number to the diagonal of Γ T . The inversion itself is computed using conjugate gradient.

M
Step. There are many similarities between the M-step and the E-step variational problems, so that we will be able to only sketch the detail of the computation here. We need to minimize Let us consider the variation of each term in the sum (fixing n, that we temporarily drop from the notation). Since we can write The function U being defined as before, we see that the derivative of the second term is given by applying the chain rule again, this time in the form Like in the previous section, the transpose of the differential applied to the gradient of U can be computed by solving a dynamical system backward in time. In fact, it is the same system as with the variation in α, namely, d dt still initialized with η x (1) = ∂ x (Tα) U and η α (1) = 0, but the relevant result now is η x (0). The gradient of U is then Once this is computed, the next step is to compute (reintroducing n in the notation) This follows a similar procedure, using (44), with T 0 instead of T and μ instead of α. This requires solving d dt initialized with η x (1) = ∂ x (T) U and η μ (1) = 0. The variation of E associated to an infinitesimal variation of μ is then We summarize the M step in the following algorithm.
Algorithm 2 (M-Step Algorithm for Surface Template Estimation).
(ii) With α (n) 's fixed, apply Algorithm 2 to obtain a new value for μ.
(j) Hypertemplate T 0 (k) Estimated template T (iii) Solve (7) initialized with the hypertemplate T 0 and the newly obtained μ to update the estimated template, T.

Result and Discussion
We applied the algorithm to surface data of human brain's caudate, thalamus, and hippocampus. All data are courtesy of Center for Imaging Science at Johns Hopkins University. Each surface has around 5−10 thousand triangle cells. We randomly chose one as the hypertemplate and the others as observed surfaces. In these experiments, we set λ = 1.0 and σ 2 = 1.0. Figures 2 and 3 are the template estimation result for caudate and thalamus, respectively.
We also applied the algorithm to 101 hippocampus surface data in the BIRN Project (Biomedical Informatics Research Network). In Figure 4, Panels (a)−(h) are 8 examples of the 101 observations. Panel (i) is the hypertemplate and Panel (j) is the estimated template.
The result is visually satisfying in the sense that the estimated template is found to agree with a qualitative representation of the population. For example, in the caudate experiment, the estimated template has an obviously narrower upper tip than the hypertemplate. This captures the population characteristic since most observed data have narrower upper tips. Notice that the obtained template does not look smoother than the rest of the population, as would typically yield template estimation methods that average over fixed parametrizations. Figure 5 shows how the energy in (31) changes with the iteration in the caudate experiment. This confirms the effectiveness and convergence of the algorithm. One can see the energy drops quickly in the first twenty iterations, then gradually slows down. After the 35th iteration, the energy changes little and the estimated template remains stable.
In our model, the hypertemplate can be provided by an atlas obtained from other studies, although we here simply choose one of the surfaces in the population. Actually, as Figure 6 shows, different choices for the hypertemplate yield very similar results.

Conclusion
In this paper, we have presented a Bayesian approach for surface template estimation. We have built, for this purpose, a generative statistical model for surfaces: the construction first applies a random initial momentum to the template surface, then assumes an observation process using test functions and noise. The template is assumed to be generated as a random deformation of a hypertemplate, completing the Bayesian model. We used a mode approximation EM scheme to estimate the surface template, and introduced for this purpose a novel surface matching algorithm optimizing with respect to the initial momentum. The procedure has been tested with caudate, thalamus, and hippocampus surface data, showing its effectiveness and convergence, and also experimentally proved to be robust to variations in the choice of the hypertemplate.

A. Proof of Proposition 1
Let us assume that L obs γ s = δ 0 , that is, γ s is the Green Kernel of the operator L obs . This assumption implies, in particular, that γ s = γ obs . Given a smooth function f , denote Then f J can be identified to the orthogonal projection of f on V j , because r (obs) is the inverse of the Gram matrix (matrix of dot products) of (γ obs (· − z j ), j ∈ J), which are the generators of V J , and f z j = f , γ obs ·, z j V (A.4) by assumption. Now, let J m be a sequence of progressively finer grids with resolution m tending to 0 when m tends to infinity. We prove that f Jm − f V → 0, for which it suffices to prove that V Jm is dense in V . This is equivalent to the fact that no nonzero g in V can be orthogonal to all the V Jm 's. But g orthogonal to V Jm is only possible when, for all j ∈ J m , g z j = g, γ w · − z j V = 0. (A.5) Since the z j s form an arbitrarily fine grid in V and functions in V are continuous, this implies g = 0. So, f J → f in V , which implies, for example, pointwise convergence as soon as V is embedded in the set of continuous functions (that is, if γ obs is continuous). This directly implies Proposition 1 in the case γ s = γ obs , by taking f (x) = γ obs (x − y) for a given y.
Extensions of this result is when γ s is obtained by applying some linear operator, say A, to γ obs . One then has j, j γ obs · − z j γ s x − z j ⎞ ⎠ (A.6) and passing to the limit in the sum (for which one needs γ s ∈ V and A continuous on V ), one gets K obs (x, x) → A 2 γ obs .

B. Transposing Linear Differential Equations
We here justify the procedure described in Section 4, and prove the following fact: consider the solution, z(t) ∈ R p , of the ordinary differential equation where J is a time-dependent operator (a p by p matrix). Then, for any u ∈ R p , we have where η is the solution of the differential equation