Numerical Solution of Diffusion Models in Biomedical Imaging on Multicore Processors

In this paper, we consider nonlinear partial differential equations (PDEs) of diffusion/advection type underlying most problems in image analysis. As case study, we address the segmentation of medical structures. We perform a comparative study of numerical algorithms arising from using the semi-implicit and the fully implicit discretization schemes. Comparison criteria take into account both the accuracy and the efficiency of the algorithms. As measure of accuracy, we consider the Hausdorff distance and the residuals of numerical solvers, while as measure of efficiency we consider convergence history, execution time, speedup, and parallel efficiency. This analysis is carried out in a multicore-based parallel computing environment.


Introduction
High-quality images are crucial to accurately diagnose a patient or determine treatment. In addition to requiring the best images possible, safety is a crucial consideration. Many imaging systems use X-rays to provide a view of what is beneath a patient's skin. X-ray radiation levels must be kept at a minimum to protect both patients and staff. As a result, raw image data can be extremely noisy. In order to provide clear images, algorithms designed to reduce noise are used to process the raw data and extract the image data while eliminating the noise. In video imaging applications, data often have to be processed at rates of 30 images per second or more. Filtering noisy input data and delivering clear, highresolution images at these rates require tremendous computing power. This gave rise to the need of developing highend computing algorithms for image processing and analysis which are able to exploit the high performance of advanced computing machines.
In this paper, we focus on the computational kernels which arise as basic building blocks of the numerical solution of medical imaging applications described in terms of partial differential equations (PDEs) of parabolic/hyperbolic type. Such PDEs arise from the scale-space approach for description of most inverse problems in imaging [1]. One of the main reasons for using PDEs to describe image processing applications is that PDE models preserve the intrinsic locality of many image processing operations. Moreover, we can rely on standard and up-to-date literature and software about basic computational issues arising in such case (such as the construction of suitable discretization schemes, the availability of a range of algorithmic options, and the reuse of software libraries that allow the effective exploitation of highperformance computing resources). Finally, PDEs appear to be effectively implemented on advanced computing environments [2].
We consider two standard discretization schemes of nonlinear time-dependent PDEs: semi-implicit scheme and fully implicit scheme [3]. The former leads to the solution of a linear system at each time (scale) step, while the computational kernel of the fully implicit scheme is the solution of a nonlinear system, to be performed at each time (scale) step. Taking into account that we aim to solve such problems on parallel computer in a scalable way, in the first case, we 2 International Journal of Biomedical Imaging use, as linear solver, Krylov iterative methods (GMRES) with algebraic multigrid preconditioners (AMG) [4,5]. Regarding the fully implicit scheme, we use the Jacobian-Free Newton-Krylov (JFNK) method as nonlinear solver [6].
In recent years, multicore processors are becoming dominant systems in high-performance computing [7]. We provide a multicore implementation of numerical algorithms arising from using the semi-implicit and the implicit discretization schemes of nonlinear diffusion models underlying most problems in image analysis. Our implementation is based on parallel PETSc (Portable Extensible Toolkit for Scientific Computation) computing environment [8]. Parallel software uses a distributed memory model where the details of intercore communications and data managements are hidden within the PETSc parallel objects.
The paper is organized as follows. In Section 2, an overview of the PDE model equation used in describing some of inverse problems in imaging applications will be given. Then, the segmentation problem of medical structures is discussed. Numerical approach will be introduced in Section 3. Section 4 is devoted to the discussion of numerical algorithms based on semi-implicit and implicit numerical schemes. In Section 6, we describe the experiments that we carried out to show both the accuracy and the performance of these algorithms, while Section 7 concludes the work.

Diffusion Models Arising in Medical Imaging
The task in medical imaging is to provide in a noninvasive way information about the internal structure of the human body. The basic principle is that the patient is scanned by applying some sort of radiation and its interaction with the body is measured. This result is the data, whose origin has to be identified. Hence, we face an inverse problem. Most medical imaging problems lead to ill-posed (inverse) problems in the sense of Hadamard [9][10][11]. A standard approach for dealing with such intrinsic instability is to use additional information to construct families of approximate solution. This principle characterizes regularization methods that, starting from the milestone Tikhonov regularization [12], are now one of the most powerful tools for solution of inverse ill-posed problems. In 1992, Rudin et al. introduced the first nonquadratic regularization functional (i.e., the total variation regularization) [13] to denoise images. Moreover, the authors derive the Euler-Lagrange equations as a time-dependent PDE. In the same years Perona and Malik introduced the first nonlinear multiscale analysis [14].
Scale-space theory has been developed by the computer vision community to handle the multiscale nature of image data. A main argument behind its construction is that if no prior information is available about what are the appropriate scales for a given data set, then the only reasonable approach for a vision system is to represent the input data at multiple scales. This means that the original image u(x), x ∈ R 2 should be embedded into a one-parameter family of derived images, in which fine-scale structures are successively suppressed: A crucial requirement is that structures at coarse scales in the multiscale representation should constitute simplifications of corresponding structures at finer scales-they should not be accidental phenomena created by the method for suppressing fine-scale structures. A main result is that if rather general conditions are imposed on the types of computations that are to be performed, then convolution by the Gaussian kernel and its derivatives is singled out as a canonical class of smoothing transformations [15,16]. A strong relation between regularization approaches and the scale-space approach exists via the Euler-Lagrange equation of regularization functionals: it consists of a PDE of parabolic/hyperbolic (diffusion/advection) type [17], defined as follows.
Nonlinear Diffusion Models. Let x = (x, y) ∈ R 2 and u(x, τ), defined in [0, T] × Ω, be the scale-space representation of the brightness function image u(x) defined in Ω ⊂ R 2 describing the real (and unknown) object and u 0 (x) the observed image (the input data). Let us consider the following PDE problem: [0, T] is the scale (time) interval; g(v) is a nonincreasing real valued function (for v > 0) which tends to zero as v → ∞. Initial and boundary conditions will be provided according to the problem to be solved (denoising, segmentation, deblurring, registration, and so on). Equations in (2) describe the motion of a curve (a moving front) with a speed depending on a local curvature. Such equations, known as level set equations, were first introduced in [18]. The original idea behind the level set method was a simple one. Given an interface Γ in R n of codimension one (i.e., its dimension is n − 1), bounding an (perhaps multiply connected) open region Ω, we wish to analyze and compute its subsequent motion under a velocity field v. This velocity can depend on position, time, the geometry of the interface (e.g., its normal or its mean curvature), and the external physics. The idea, as devised in 1988 by Osher and Sethian, is merely to define a smooth (at least Lipschitz continuous) function φ(x, t), that represents the interface Γ as the set where φ(x, t) = 0. Thus, the interface is to be captured for all later time, by merely locating the set Γ(t) for which φ vanishes. The motion is analyzed by convecting the φ values (levels) with the velocity field v. This elementary equation is International Journal of Biomedical Imaging 3 Actually, only the normal component of v is needed: and the motion equation becomes Taking into account that the mean curvature of Γ(t) is equation (5) describes the motion of Γ(t) under a speed v N proportional to its curvature cur (Mean Curvature Motion, MCM equation) [18][19][20]. This basic model has received a lot of attention because of its geometrical interpretation. Indeed, the level sets of the image solution or level surfaces in 3D images move in the normal direction with a speed proportional to their mean curvature. In image processing, equations like (5) arise in nonlinear filtration, edge detection, image enhancement, and so forth, when we are dealing with geometrical features of the image-like silhouette of object corresponding to level line of image intensity function. Finally, the level set approach instead of explicitly following the moving interface itself takes the original interface Γ and embeds it in higher dimensional scalar function u, defined over the entire image domain. The interface Γ is now represented implicitly as the zeroth level set (or contour) of this function, which varies with space and time (scale) using the partial differential equation in (2), containing terms that are either hyperbolic or parabolic. The theoretical study of the PDE was done by [21] which proved existence and uniqueness of viscosity solutions.

A Case Study: Image Segmentation.
In this paper, we use equations (2) for image segmentation. The task of image segmentation is to find a collection of nonoverlapping subregions of a given image. In medical imaging, for example, one might want to segment the tumor or the white matter of a brain from a given MRI image. The idea behind level set (also known implicit active contours, or implicit deformable models) for image segmentation is quite simple. The user specifies an initial guess for the contour, which is then moved by image-driven forces to the boundaries of the desired objects. More precisely, the input to the model is a user-defined point-of-view u 0 , centered in the object we are interested in segmenting. The output is the function u(x, τ). Function u(x, τ) in (2) is the segmentation function, u 0 represents the initial contour (initial state of the segmentation function), and the image to segment is I 0 . Moreover, as proposed in [22], instead of following evolution of a particular level set of u, the PDE model follows the evolution of the entire surface of u under speed law dependent on the image gradient, without regard to any particular level set. Suitably chosen, this flow sharpens the surface around the edges and connects segmented boundaries across the missing information. In [22,23], the authors formalized such model as the Riemannian mean curvature flow where the variability in the parameter also improves the segmentation process and provides a sort of regularization. Thus, (2) becomes accompanied with initial condition u 0 and zero Dirichlet boundary conditions. Regarding u 0 , it is usually defined as a circle completely enclosed inside the region that one wish to segment. The term g(v), called edge detector, is a nonincreasing real function such that g(v) → 0 while v → ∞, and it is used for the enhancement of the edges. Indeed, it controls the speed of the diffusion/regularization: if ∇u has a small mean in a neighborhood of a point x, this point x is considered the interior point of a smooth region of the image and the diffusion is therefore strong. If ∇u has a large mean value on the neighborhood of x, x is considered an edge point and the diffusion spread is lowered, since g(v) is small for large v. A popular choice in nonlinear diffusion models is the Perona and Malik function [14]: g(v) = 1/(1 + v 2 /β), β > 0. In many models, the function g(|∇u|) is replaced by its smoothed version g(|∇G σ * u|), where G σ is a smoothing kernel, for example, the Gauss function, which is used in presmoothing of image gradients by the convolution. For shortening notations, we will use abbreviation In conclusion, we use the Riemannian mean curvature flow, as model equation of the segmentation of medical structures: given I 0 , the initial image and u 0 equals to a circle contained inside an object of the image I 0 , we are interested in segmenting, we compute u(x, τ) by solving (7). The level sets of u(x, τ), at steady state, provide approximations of the contour to detect.

Numerical Schemes
Nonlinear PDE in (7) can be expressed in a compact way as ∂u x, y, τ ∂τ = F u x, y, τ, ∇u x, y, τ , I 0 , where Scale Discretization. That is discretization with respect to τ. If [0, T] is the scale interval and nscales is the number of scale steps, we denote by τ i the ith scale-step for all i = 1, . . . , nscales, so that τ i+1 = τ i + Δτ, where Δτ = T/nscales is the step-size. Using the Euler forward finite difference scheme to discretize the scale derivative on the left hand side of (9), we get or, equivalently Let us denote as u i = u(x, y, τ i ), i = 1, . . . , nscales, the function u evaluated at τ i . Equation (12) is rewritten as Depending on the collocation value, used to evaluate u(x, y, τ) with respect to the parameter τ, inside the F function on the right hand side of (12) three iterative schemes derive: , that is, we use u i to discretize the numerator |∇u| of the fraction ∇u/ 2 + |∇u| 2 . Other quantities are evaluated at u i−1 ; (iii) implicit scheme: In summary, the difference between the semi-implicit and the implicit scheme relies on the scale discretization of the term |∇u| at the numerator of ∇u/ 2 + |∇u| 2 inside the function F. This term controls the diffusion process, and it plays the role of edge-enhancement. If we consider the three-dimensional (3D) domain Ω T = Ω × [0, T], the semiimplicit scheme employs a sort of 2D + 1 discretization of Ω T proceeding along nscales two dimensional (2D) slices each one obtained at τ ≡ τ i , while the fully implicit scheme uses a fully 3D discretization of Ω T . This difference suggests that the fully implicit scheme may provide a more accurate edge detection than the semi-implicit scheme. This difference is highlighted by considering their discretization errors.
Space Discretization. That is discretization with respect to (x, y). If Ω is the space domain, we introduce a rectangular uniform grid on Ω consisting of N x × N y (for simplicity we assume that Ω is a rectangular of dimension 1 × 1; this means that h x = 1/N x and h y = 1/N y ), nodes (x i , y j ) = (lΔx, mΔy),l = 1, . . . , N x , m = 1, . . . , N y , and we use finite volumes to discretize the partial derivatives of u, as in [24,25]. (14) be the vector obtained from the scale-space discretization of the function u, we have the following iteration formulas.
(i) Explicit scheme: where, for each i, the matrix y is the unit matrix and N SI is the scale steps number.
(ii) Fully-implicit scheme: where N I is the scale steps number and [A] l,m i , for each i, is a nonlinear vector operator on R N 2 International Journal of Biomedical Imaging 5 In particular, we apply the Crank-Nicholson scheme [3] which uses the average of the forward Euler method at step i − 1 and the backward Euler method at step i:

Algorithms and Their Computational Complexity
The effectiveness of these schemes depends on a suitable balance between accuracy (scale-space discretization error), number of flop/s per iteration (algorithm complexity), and the total execution time needed to reach a prescribed accuracy (software performance). Let us denote the discretization error E d . It is Explicit scheme is accurate at the first order both with respect to scale and space, that is, p = q = 1; anyway, it is the one straightforwardly computable. The computational kernel is a matrix-vector product, at every scale step. This scheme requires very small time steps in order to be stable (CFL (Courant-Friedrich-Levy) condition that guarantees the stability of the evolution), and its use is limited rather by its stability than accuracy. This constraint is practically very restrictive, since it typically leads to the need for a huge amount of iterations [3]. Semi-implicit scheme is absolutely stable for all scale steps. The accuracy, in terms of discretization error with respect to both scale and space, is of the first order, because p = 1, q = 2 [24,25]. Crank-Nicholson provides a discretization error of second order, that is, p = q = 2, but it requires extra computations, leading to a nonlinear system of equations, at every time step, while stability is ensured for all scale steps [3]. In the following, we collect these results: Then, the fully implicit scheme provides an order of accuracy greater than that provided by the others. This difference may be important in those applications of image analysis where the edges are fundamental to recognize some pathologies.
Algorithm complexity of these schemes depends on the choice of the numerical solver. Concerning the semi-implicit scheme, we employ Krylov subspace methods, which are the most effective approaches for solving large linear systems [5]. In particular, we use Generalized Minimal RESidual method (GMRES) equipped with Algebraic multigrid (AMG) preconditioner. Such techniques are convenient because they require as input only the system matrix corresponding to the finest grid. In addition, they are suitable to implement in a parallel computing environment. For the fully implicit scheme we use the Jacobian Free Newton Krylov Method (JFNK) [6]. JFNK methods are synergistic combinations of Newton-type methods for superlinearly convergent solution of nonlinear equations and Krylov subspace methods for solving the Newton correction equations. The link between the two methods is the Jacobian-vector product, which may be probed approximately without forming and storing the elements of the true Jacobian.
Let us briefly describe the numerical algorithms that we are going to implement, which are based on the semi-implicit and the implicit discretization schemes, together with their complexity.
Algorithm SI (Semi-Implicit Scheme). For all i = 1, . . . , N SI solution of with respect to u l,m i . HS l,m i−1 , for each i, is a matrix ∈ R N 2 x ×N 2 y . As space derivative we use the 2nd order finite covolume discretization scheme (see [24,25] for convergence, consistence and stability). By this way, matrix [A] l,m i−1 is a block pentadiagonal matrix with tridiagonal blocks along the main diagonal and diagonal blocks along the upper and lower diagonals.
with respect to u l,m i . HI l,m i , for each i, is a nonlinear vector operator on R N 2  Algorithm SI. For each scale step, to solve the linear system (21), we employ GMRES iterative method (see Algorithm 1). Computational kernel of GMRES is a matrix-vector product. Taking into account the structure of the coefficient matrix (we assume that N x = N y = N, then h = 1/N = h x = h y ), the computational cost of GMRES is where k SI GMRES is the maximum iterations of GMRES (over the scale steps). Computational complexity of Algorithm SI is Algebraic multigrid (AMG) method follows the main idea of (geometric) multigrid (MG), where a sequence of grids is constructed from the underlying geometry with corresponding transfer operators between the grids [26]. The main idea of MG is to remove the smooth error, that cannot be eliminated by relaxation on the fine grid, by coarse-grid correction. The solution process then as usual consists of presmoothing, transfer of residuals from fine to coarse grids, interpolation of corrections from coarse to fine levels, and optional postsmoothing. In contrast to geometric multigrid, the idea of AMG is to define an artificial sequence of systems of equations decreasing in size. We call these equations coarse-grid equations. The interpolation operator P l,m lv and the restriction operator R l,m lv define the transfer from finer to coarser grids and vice versa. Finally, the operator on the coarser grid at level lv + 1 is defined by The AMG method consists of two main parts, the setup phase and the solution phase. During the setup phase, the coarse-grids and the corresponding operators are defined. The solution phase consists of a multilevel iteration. The number of recursive calls, which is the number of levels lv, depends on the size and structure of the matrix. For our case, we use the V-cycle pattern with the FALGOUT-CLJP coarse grid selection [27]. Looking at the Algorithm SI in Table 1, the preconditioner P is just A lv+1 . Computational cost of each iteration of GMRES is that of the AMG preconditioner plus the matrix-vector products: then, we get: Following picture shows a schematic description of Algorithm SI that emphasizes its main steps and the most time consuming operation, that is, the matrix vector products needed at each step of GMRES.
Algorithm I. For each scale step, to solve the nonlinear equations (22), we employ the Jacobian-Free Newton-Krylov (JFNK) method. JFNK is a nested iteration method consisting of at least two and usually four levels. The primary levels, which give the method its name, are the loop over the Newton method:  Outside of the Newton loop, a globalization method is often required. We use line search.
Forming each element of J which is a matrix of dimension N 2 × N 2 requires taking derivatives of the system of equations with respect to u. This can be time consuming. Using the first order Taylor series expansion of HI l,m i (u l,m n + ρv), it follows that where ρ is a perturbation. JFNK does not require the formation of the Jacobian matrix; it instead forms a result vector that approximates this matrix multiplied by a vector. This Jacobian-free approach has the advantage to provide the quadratic convergence of Newton method without the costs of computing and storing the Jacobian. Conditions are provided on the size of ρ that guarantee local convergence. Algorithm 3 shows a schematic description of Algoritm I that emphasizes its main steps and the most time consuming operation, that is, evaluations of the nonlinear operator HI l,m i at each innermost step. Algorithm complexity of JFNK is where N New is the maximum number of Newton's steps, over the scale steps, k I GMRES is the maximum number of GMRES iterations (computed over Newton's steps and scale steps)), f is the number of evaluations of HI l,m i . Finally, we get A straightforward comparison between the algorithm complexity of these algorithms shows that Algorithm SI asymptotically seems to be comparable with respect to Algorithm SI. Of course, the performance analysis must also take into account the efficiency of these two schemes in a given computing environment. Next section describes the PETScbased implementation of these algorithms that we have developed in a multicore computing environment.

The Multicore-Based Implementation
The software has been developed using the high-level software library PETSc (Portable Extensible Toolkit for Scientific Computations) (release 3.1, March 2010) [8]. PETSc provides a suite of data structures and routines as building blocks for the implementation of large-scale codes to be used in scientific applications modeled by partial differential equation. PETSc is flexible: its modules, that can be used in application codes written in Fortran, C, and C++, are developed by means of object-oriented programming techniques. The library has a hierarchical structure: it relies on standard basic computational (BLAS, LAPACK) and communication (MPI) kernels and provides mechanism needed to write parallel application codes. PETSc transparently handles the moving of data between processes without requiring the user to call any data transfer function. This includes handling parallel data layouts, communicating ghost points, gather, scatter and broadcast operations. Such operations are optimized to minimize synchronization overheads.
Our parallelization strategy is based on domain decomposition: in particular, we adopt the row-block data distribution, which is the standard PETSc data distribution. Rowblock data distribution means that blocks of dimensions N 2 / p × N of contiguous rows of matrices of dimension N 2 × N 2 are distributed among contiguous processes. By the same way, vectors of size N are distributed among p processors as blocks of size N/P. Such partitioning has been chosen because overheads, due to redistribution before the solution of the linear systems, are avoided. Further, rowblock data distribution introduces a coarse grain parallelism which is best oriented to exploit concurrency of multicore multiprocessors because it does not require a strong cooperation among computing elements: each computing element has to locally manage the blocks that are assigned to it.
The computing platform that we consider is made of 16 blades (1 blade consisting of 2 quad core Intel Xeon E5410@2.33 GHz) Dell PowerEdge M600, equipped with IEEE double precision arithmetic. Because high performance technologies can be employed in medical applications only to the extent that the overall cost of the infrastructure is 8 International Journal of Biomedical Imaging affordable, and because we consider single images of medium size, we show results obtained by using 1 blade, that is, we run the parallel algorithms on up to p = 8 cores of a single blade. Of course, in case of multiple images or sequences of images, the use of a greater number of cores may be interesting.

Experiments
In this section we present and discuss computational results obtained by implementing Algorithm I and Algorithm SI in a multicore parallel computing machine. Before illustrating experimental results, let us briefly describe the choice of (1) test images, (2) comparison criteria, (3) parameters selection.
(1) Test Images. we have carried out many experiments in order to analyze the performance of these algorithms. Here we show results concerning the segmentation of a (malignant) melanoma (see Figure 7) [28]. Epiluminescence microscopy (ELM) has proven to be an important tool in the early recognition of malignant melanoma [29,30]. In ELM, halogen light is projected onto the object, thus rendering the surface translucent and making subsurface structures visible. As an initial step, the mask of the skin lesion is determined by a segmentation algorithm. Then, a set of features containing shape and radiometric features as well as local and global parameters is calculated to describe the malignancy of the lesion. In order to better validate computed results and to analyze the software performance, we first consider a synthetic test image simulating the object we are interested in segmenting (see Figure 1).
(2) Comparison Criteria. We compare the algorithms using the following criteria: (a) distance from original solution. As measure of the difference between two curves, we use the Hausdorff distance measured between the computed curve and the original one. It is well known that the Hausdorff distance is a metric over the set of all closed bounded sets (see [31]), here we restrict ourselves to finite point sets because that is all that is necessary for segmentation algorithms [32]. Given two finite point sets C 1 and C 2 , the Hausdorff distance d H between the sets C 1 and C 2 is defined as follows: where · is the euclidean norm. It identifies the point a ∈ C 1 that is fastest from any point of C 2 and vice versa, then it keeps the maximum, (3) Parameters Selection. We set K = 1.0 and = 1.0. Let us explain how we select the values of the scale step size and the number of scale steps. Regarding Δτ, its value is chosen according to that required to E d . Taking into account that, in Algorithm SI, E d is accurate at the first order with respect to Δτ SI , while it is accurate at the second order with respect to Δτ I , in Algorithm I, by requiring that the discretization error is about the same, we get Finally, in Algorithm SI, the stopping criterion of the linear solver (GMRES) uses the tolerance while the preconditioner AMG uses the tolerance TOL = 10 −7 and 25 as maximum number of AMG-levels. In the Algorithm 2, the stopping criterion of the nonlinear solver uses the tolerance International Journal of Biomedical Imaging Regarding the number of scale steps (N SI and N I ), taking into account that its choice depends on Δτ SI,I and on the value of τ ≡ T, that is, the value of the scale parameter corresponding to steady state of the segmentation function u(x, τ), solution of the PDE model. To check the steady state, we require that the residuals, corresponding to different scale steps, reach the tolerance We found that this corresponds to T = 0.4 (see Figure 2) for Test 1 and to T = 2 for Test 2.
Test 1: Synthetic Image. In Tables 1, 2, and 3, we show the Hausdorff distance and execution time by requiring that discretization error is of the first, second, and third order, that is, Moreover, concerning the number of scale steps, it follows that Note that while in the first case, that is, if we require E d = O(10 −1 ), these two algorithms are quite numerically equivalent, both in terms of execution time and of the computed result; as discretization error decreases, Algorithm I appears to be more robust than Algorithm SI, in the sense that Algorithm I reaches the steady state with high accuracy (the Hausdorff distance is of 95%), while the computed results of Algorithm SI are less accurate, even though the execution time of Algorithm I sometimes slightly increases. These results suggest that if it needs to get an accurate and reliable result, Algorithm I should be preferable. Figure 3 show segmentation results. Finally, note that the execution time of these two algorithms asymptotically is the same, as stated by the analysis of computational cost carried on in Section 4.
Convergence History. Convergence history is illustrated by showing the behavior of relative residuals versus the scale steps (see Figures 4,5,and 6), and by reporting iteration number of the GMRES and of Newton's method, respectively, (see Tables 4, 5, 6, and 7). We consider Δτ SI = 0.16, 0.016, and Δτ I = 0.04. In the following, we show results corresponding to the segmentation of a melanoma (see Figure 7). As expected, because this is a real image, the steady state is reached at a         and Figure 8, compare results of Algorithm SI and Algorithm I.  Parallel Performance. We show the performance of the multicore-based parallel algorithms and their scalability as the number of cores increases. We run the parallel algorithms using up to p = 8 cores of the parallel machine.
Finally, we report execution time, speedup, and efficiency of Algorithm I, at scale step Δτ = 0.4 (i.e., Figures 15,16,and 17) and Δτ = 0.04 (i.e., Figures 18, 19, and 20), respectively. Note that parallel efficiency of both algorithms always is, at least, of 60% and, on average, of about 80%. In particular, parallel efficiency of Algorithm I is about 90%. Execution time of both algorithms reduces to about 2 seconds on eight cores in the first case (i.e., E d = (10 −1 ), and to about 10 seconds in the second case (E d = O(10 −2 )). This means that, in a multicore computing environment, both algorithms provide the requested solution within a response time that can be considered quite acceptable in medical imaging applications and, in particular, that Algorithm I is competitive with Algorithm SI.        Finally, we show results on scalability of parallel algorithms. Let T p (N) be the execution time of the parallel algorithm running on p cores for segmenting an image of size N 2 × N 2 . We measure the scalability of these algorithms by measuring T p (N) as N varies, once p is fixed, and by measuring T p (N) as N and p grow. We note that, in case of Algorithm SI, the scaling factor is while, for Algorithm I, we get as scaling factor   This means that Algorithm I scales better than Algorithm SI. In Figures 27 and 28 (for Algorithm SI) and Figure 29 and 30 (for Algorithm SI), we report T p (N) as N and p varies. In particular, each point of the graph refers to the execution time of the parallel algorithm at (p = k · p 1 , N = k · N 1 ), where p 1 = 1, N 1 = 210, and k = 1, 2, 3, 4.

Conclusions
A straightforward comparison between the semi-implicit and the fully implicit discretization schemes of nonlinear PDE of parabolic/hyperbolic type states that fully implicit discretization usually leads to too expensive algorithms.
In this paper, we provide a multicore implementation of two numerical algorithms arising from using these two discretization schemes: semi-implicit (Algorithm SI) and fully implicit (Algorithm I). Taking into account that we aim to solve such problems on parallel computer in a scalable way, in the first case, we use, as linear solver, Krylov iterative methods (GMRES) with algebraic multigrid preconditioners (AMG). Regarding the fully implicit scheme, we use the Jacobian-Free-Newton Krylov (JFNK) method as nonlinear solver.
We compare these two algorithms using different metrics measuring both the accuracy and the efficiency. We note that if we require that the discretization error E d is E d = O(10 −1 ), these two algorithms are quite numerically equivalent, both in terms of execution time and of the computed result; while, as discretization error decreases, Algorithm I appears to be more robust than Algorithm SI, in the sense that International Journal of Biomedical Imaging Algorithm I reaches the steady state with high accuracy (the Hausdorff distance is of 95%), while the computed results of Algorithm SI are less accurate, even though the execution time of Algorithm I sometimes slightly increases. These results suggest that if it needs to get accurate and reliable results, Algorithm I should be preferable.
The parallel efficiency of both algorithms always is, at least, of 60% and, on average, of about 80%. In particular, parallel efficiency of Algorithm I is of about 90%. Execution time of both algorithms reduces to about 2 seconds on eight cores if E d = (10 −1 ) and to about 10 seconds if E d = O(10 −2 ). This means that, in a multicore computing environment, Algorithm I is competitive with Algorithm SI.
In conclusion, our results suggest that if it is required high accuracy of the computed solution in a suitable turnaround time, using a multicore computing environment fully implicit scheme provides an accurate and reliable solution within a response time of few seconds, quite acceptable in medical imaging applications, such as computer-aided-diagnosis.