Two Simulated Annealing Optimization Schemas for Rational Bézier Curve Fitting in the Presence of Noise

Fitting curves to noisy data points is a difficult problem arising in many scientific and industrial domains. Although polynomial functions are usually applied to this task, there are many shapes that cannot be properly fitted by using this approach. In this paper, we tackle this issue by using rational Bézier curves. This is a very difficult problem that requires computing four different sets of unknowns (data parameters, poles, weights, and the curve degree) strongly related to each other in a highly nonlinear way. This leads to a difficult continuous nonlinear optimization problem. In this paper, we propose two simulated annealing schemas (the all-in-one schema and the sequential schema) to determine the data parameterization and the weights of the poles of the fitting curve. These schemas are combined with least-squares minimization and the Bayesian Information Criterion to calculate the poles and the optimal degree of the best fitting Bézier rational curve, respectively. We apply our methods to a benchmark of three carefully chosen examples of 2D and 3D noisy data points. Our experimental results show that this methodology (particularly, the sequential schema) outperforms previous polynomial-based approaches for our data fitting problem, even in the presence of noise of low-medium intensity.


Introduction
(1) Motivation.This paper deals with the problem of fitting a collection of noisy data points by means of a rational curve.This problem arises in several scientific and applied fields.It is an important issue, for instance, in regression analysis for statistics and machine learning and in approximation theory for numerical analysis.It also plays a key role in several industrial fields, most prominently in reverse engineering.In its most comprehensive meaning, reverse engineering consists of obtaining a digital replica of an already existing physical object or component.This is a typical procedure in medical and health areas, where noninvasive techniques such as magnetic resonance imaging (MRI) or computer tomography (CT) are commonly used to visualize inner organs or different parts of the human body for medical check, diagnosis, and therapy.Reverse engineering is also a common practice in consumer products, microchips, and other electronic components for different purposes.For instance, to analyze how a new device or machine in the market is built or how a particular component works.Another relevant application arises in automotive and aerospace industries, where prototypes are built on clay, foam rubber, wood, and other materials to help designers and engineers explore new ideas about shape or size and get a visual insight of a new part to be designed.This is a challenging task, since nowadays designs are becoming more and more organic in shape, making them more difficult to be replicated by computer from scratch.A typical approach in this regard is to obtain a set of measurements of the object or workpiece and then reconstruct it as a 3D model.Typical ways to measure include scanning technologies such as 3D laser scanners, touch scanners, coordinate measuring machines, light digitizers, and industrial computer tomography.Typically, the process yields a cloud of data points, which have to be fitted in order to recover the topological information of the original model.

Mathematical Problems in Engineering
When data come from accurate sources (i.e., with minimal measurement errors) and smooth shapes, curve reconstruction can be done through well-known curve interpolation techniques such as spline and Bézier curve interpolation.This approach is characterized by imposing the parametric curve to pass through all data points.However, most reverse engineering applications retrieve data from scanning, digitizing [1], or some mechanical device, such as pressure machines in the shoemaking industry [2], thus being subjected to measurement errors.In case of high-intensity noise, interpolation methods tend to fail since they force the curve to pass through all points, including the noisy outliers.This undesirable situation can be overcome by data approximation, where the fitting curve is only required to be sufficiently close to the original data according to some prescribed metrics.Owing to its ability to cope with noise inherent to data in real-world settings, in this paper we will use a curve approximation method.
Several families of functions can be used for this curve approximation task.Among them, the free-form parametric functions (such as Bézier and B-splines) are the most common in CAD/CAM (Computer-Aided Design/ Manufacturing), computer graphics and animation, virtual reality, and other fields.In particular, Bézier curves have been intensively used in automotive industry for decades.Mathematically, they are given by a linear combination of basis functions called the Bernstein polynomials with vector coefficients called poles or control points.The curve follows approximately the shape of its control polygon (the collection of segments joining the poles), and, hence, it reacts to the movement of its poles by following a push-pull effect.This nice feature was fundamental for the popularization of free-form curves for interactive design.Although nowadays Bézier curves have been mostly deprecated in that field, being overtaken by the more powerful B-splines, they are still in use in many other areas.For instance, True Type fonts use composite curves comprised of quadratic Bézier curves.Similarly, all Postscript font outlines are defined in terms of cubic and linear Bézier curves.Other recent applications include, for instance, robot path planning [3] and the determination of the airfoil geometry from a given   -distribution [4].
A recent paper by the authors addressed the problem of curve approximation with polynomial Bézier curves [5].The method was based on a hybrid scheme combining a popular single-particle metaheuristic called simulated annealing for global search and a local search optimizer for further refinement of the global solution.This hybrid scheme was applied to compute all relevant parameters of the approximating function with satisfactory results.
Although the polynomial representation is very easy to deal with (and, therefore, very convenient for many purposes) it is still limited in some ways.For example, it is well known that the polynomial-based schemes cannot represent accurately some important shapes such as the conics (e.g., circles, ellipses, and hyperbolas).A way to overcome this drawback is to consider the polynomial curve in homogeneous coordinates, leading to rational functions (i.e., functions that are quotients of two polynomials).In this sense, it is important to remark that conics can be canonically described as rational functions.Many other complicated shapes can be dealt with more efficiently if the (more limited) polynomial scheme is extended to a rational one.It is not a trivial task, however, since the rational scheme includes additional degrees of freedom that have also to be computed.As a consequence, our previous method is no longer valid and must be substantially improved.This is actually the goal of the present contribution.
(2) Aims and Structure of This Paper.In this paper we focus on obtaining an accurate fitting of a given set of noisy data points by using a rational Bézier curve.To this aim, we consider a modification of the standard simulated annealing (SA), a popular probabilistic method for global optimization based on thermodynamical principles.In our approach, the basic SA algorithm is further improved by considering two larger global optimization schemas.They are applied to compute all relevant parameters of our curve fitting problem.A typical shortcoming in this regard is that the accuracy of the model increases as the number of poles increases, eventually leading to overfitting.To prevent this undesirable effect, we also aim to seek a suitable trade-off between data fidelity and model complexity.To attain this goal we apply the Bayesian Information Criterion (BIC), a model-selection technique widely used in the context of information sciences.The basic idea of BIC is to combine a penalty term with the fitness function.This penalty term increases with the number of free parameters of the model, thus penalizing any unnecessarily large complexity of the model.
The structure of this paper is as follows: Section 2 summarizes briefly the previous work in the field.Some basic mathematical concepts and definitions and the curve fitting problem from noisy data points with rational Bézier curves are described in Section 3.Then, the simulated annealing algorithm is described in detail in Section 4. Section 5 describes all steps of our proposed methodology along with the two schemas used in this paper.The experimental results of the application of our method to three illustrative examples are reported in Section 6.The section also discusses the robustness of the method in the presence of noise.The paper concludes with the main conclusions of this work and some future lines of research.

Previous Work
The problem of data approximation with free-form curves has been a classical subject of research for many years.The first research works addressing this issue were published in the 60s and 70s and were mostly based on numerical procedures [6][7][8].The classical methods of this period select the free parameters by inferring some geometric properties from data in order to meet specific constraints [1,9,10].A major problem in this regard is to perform data parameterization, that is, to determine suitable parameter values associated with the data points.Typical choices are the uniform parameterization and the arc-length parameterization.Unfortunately, classical numerical methods are not able to obtain optimal solutions in the general case.Subsequent attempts to address this problem during the next decades provided solutions for several particular cases but were not successful in solving the general problem [9,11].More recent methods apply alternative approaches such as line use error bounds [12], curvature-based squared distance minimization [13], or dominant points [14].Although they perform well, they are still limited to particular cases and/or require strong constraints in terms of differentiability or noiseless data, assumptions that are not reasonable for real-world instances.
Mathematically speaking, curve fitting is usually addressed as a continuous optimization problem [6,8,9].However, traditional mathematical optimization techniques have had little success in solving this problem.In the last few years, the field has taken a new impetus with the application of several powerful artificial intelligence and soft computing techniques.Some of these methods are based on neural networks, such as the standard neural networks [15], Bernstein basis function networks [16], and Kohonen's SOM (Self-Organizing Maps) nets [17].In some other cases, the neural approach is combined with partial differential equations [18].This scheme has been extended to the more general functional networks in [19,20].More recently, powerful metaheuristic techniques have been applied to this problem, including swarm intelligence [21], artificial immune systems [22], genetic algorithms [23,24], physics-inspired thermodynamics [5], cuckoo search [25], support vector machines [26], firefly algorithm [27], and hybrid techniques [28,29].These methods obtain remarkable results for polynomial curves, but more complicate shapes (e.g., conics) are still not properly fitted.This is a good evidence of the interest and difficulty of this data fitting problem, which is described in detail in the next section.

Basic Concepts and Definitions.
In this section we assume that the reader is familiar with the concept of parametric curves.Bézier curves are a particular case of free-form parametric curves, introduced in the 60s for interactive design in the automotive industry.A nonrational (i.e., polynomial) parametric Bézier curve of degree  in R  is given by where {b  }  =0 ⊂ R  are vector coefficients (usually called poles or control points) and    () are the Bernstein polynomials of index  and degree , given by Note that in this paper vectors are denoted in bold.The polynomial representation is not powerful enough to represent a variety of shapes, particularly the conics (e.g., circles, ellipses, and hyperbolas).One way to overcome this limitation is to use homogeneous coordinates (see [10,30] for details).The basic idea is to consider the projection of the standard polynomial Bézier curve in R +1 , with new poles b ℎ  .The resulting curve in R  is called a rational Bézier curve.Mathematically, this curve can be described as a quotient of two polynomials, or as a linear combination of rational basis functions: where the rational basis functions are defined by where  is the curve degree and   is the last coordinate of the homogeneous control point b ℎ  .This set of new scalar parameters {  }  =0 , called weights, provides us with additional degrees of freedom for better shape approximation.They also increase the model complexity, however, as we introduce a new set of parameters that have to be computed as well.

The Fitting Problem. Let now {Q 𝑖 } 𝑚
=1 be a set of data points in R  .The problem consists of obtaining the rational Bézier curve, C(), of a certain degree  providing the best least-squares fitting of the data points.This leads to a minimization problem of the least-squares error Θ defined as the weighted sum of squares of the residuals: where {  }  =1 are scalar numbers used in situations when it may be reasonable to assume that sampled data should not be treated equally.In order to reflect faithfully the most common situation in real-world problems, in this paper we will assume that no information about the problem is available beyond the data points.This means that all data points must be treated equally; that is,   = 1, for all .Note, however, that our method is independent on the values of   .To represent the geometrical distribution of the data we need to associate a parameter   for each input point Q  .Therefore, our goal is to obtain the three sets of parameters {  }  =1 , {  }  =0 , and {b  }  =0 .It is obvious that since each blending function in (2) and ( 4) is nonlinear in , system (5) is also nonlinear.As a consequence, we have to deal with a multivariate continuous nonlinear minimization problem.In this paper we solve this problem by applying two different schemas of the simulated annealing optimization method, which is described in the next section.

Simulated Annealing
One of the major trends in global optimization during the last few decades has been to build algorithms trying to mimic certain efficient optimization patterns observed in natural processes.As a result, a series of very powerful Mathematical Problems in Engineering nature-inspired optimization algorithms (e.g., particle swarm optimization, genetic algorithms, or ant colony optimization) have been devised.Very often, they provide better solutions than previous traditional mathematical algorithms to several hard optimization problems.Although they are very diverse, all of them share two common features: to be inspired by realworld observation and to search for solutions in a stochastic way.Most of them are also derivative-free, meaning that they can be applied to problems where it is not possible to compute the derivatives of the objective function (or they are very expensive computationally).In this paper we apply simulated annealing, one of the most popular nature-inspired optimization algorithms.

4.1.
Background.The simulated annealing (SA) is a natureinspired metaheuristic optimization algorithm introduced by Kirkpatrick et al. in the context of combinatorial optimization [31].The algorithm is inspired in the annealing process in metallurgy, where metals are heated at very high temperatures and then slowly cooled down to reach a state of lower energy.During the process, atoms tend to move to configurations that minimize the system energy even if during such migration certain configurations rise the system overall energy (when it stabilizes for a fixed temperature, we call it thermal equilibrium).Such moves are more prominent at the beginning of the process than at the end, when the particles loose thermal mobility in order to polish the system inner structure to finally produce a better metal.As a result, the metals become stronger and with better properties, specially if the process is conducted several consecutive times (a process called reannealing).
The original SA algorithm is an advanced interpretation of the Metropolis-Hastings algorithm [32] to generate sample states of a thermodynamic system, showing the deep connections between statistical mechanics and combinatorial optimization.Given an initial (usually random) state in the solution domain, the algorithm iteratively perturbs it.Whenever a better solution is found, the change is always accepted; otherwise, it is accepted only with a certain probability.This probability is higher at the beginning (mimicking what happens in the thermodynamic process at high temperatures) than at the end.In other words, this idea of slow cooling is translated as a slow decrease of the probability of accepting such worse solutions.So essentially the system evolves from a free exploration of the search space at initial stages to a stochastic hill-climbing at latter stages.
Since its publication the algorithm has received a lot of attention from the scientific community, with many realworld applications in the most diverse fields, ranging from the classical NP-hard combinatorial travelling salesman problem [33] to the minimization of highly multimodal real-valued functions [34]; see [35] for an in-depth review of several simulated annealing applications.

The Algorithm.
The simulated annealing algorithm was originally designed to compute a good approximation of the global optimum of a fitness function (usually called the system energy) within a problem domain D, assumed to be continuous in this paper.Each point x ∈ D is a state of some physical system.Given an initial (usually random) state x 0 , the goal of the SA is to obtain the state with the minimum energy (associated with the best solution of the optimization problem).In our case, we have a real-valued function  : D ⊆ R  → R that we want to minimize.The algorithm performs an iterative process; at each iteration step, a new state x new is generated from the current one, x old , through a neighborhood function, denoted by N : D → D, that is, x new = N(x old ).Let now  old ≡ (x old ),  new ≡ (x new ) be their associated energies, respectively.The algorithm probabilistically decides between moving the system to the new state x new or staying in current state x old .This new state is chosen with a probability function P : D × D → [0, 1], called the acceptance function, which depends on two factors as follows: (i) on one hand, on the difference between their energy values, Δ =  old −  new , (ii) on the other hand, on a global parameter called temperature, denoted by , which varies according to a strictly decreasing function T : R + → R + called the cooling function.
The probability function is not arbitrary, but must comply with certain requirements.One of them is that P > 0 if Δ < 0, meaning that the system may move to the new state even if it is worse (has a higher energy) than the current one.The primary goal of this condition is to prevent stagnation (when the system gets trapped in the neighborhood of local optima, leading to premature convergence).Under the same condition Δ < 0, we also impose that P → 0 as  → 0, while P →  > 0 otherwise.Basically, these conditions state that, for sufficiently small values of , the system will increasingly promote "downhill" changes (i.e., changes leading to lower energy values) and avoid "uphill" changes.
In other words, the lower the temperature is, the easier it is to reject a worse solution.In fact, in the particular case  = 0, the procedure will only allow downhill moves, meaning that the algorithm reduces to a greedy search algorithm.Another desirable feature is to promote small uphill moves over large ones.This effect can be obtained by modulating the probability as a function of the parameter Δ so that P decreases as Δ increases.Under these conditions, the temperature  becomes a critical parameter in describing the evolution of the system, as its value determines the sensitivity of the system to energy variations.
In this paper, we consider the classical acceptance function derived from the Metropolis-Hastings sampling algorithm, first introduced by Metropolis et al. in [32] as a Monte-Carlo method to simulate the creation of new states in a thermodynamic system and given by where  represents the system temperature at the iteration where x new has been generated.(6) meets all required conditions for a proper acceptance function indicated above.It also provides an adequate tradeoff between exploration and exploitation: at initial stages (i.e., higher temperatures) the algorithm explores the search space while at latter stages it resembles a hill-climbing algorithm, with the difference that now there is always the possibility to accept a worse state.The corresponding pseudocode of the simulated annealing algorithm used in this paper is shown in Algorithm 1.A typical execution begins with a randomly chosen state (x 0 ,  0 ) and an initial (very high) temperature  0 .The algorithm generates new states, according to N, at each iteration and probabilistically decides whether or not the new state is accepted according to the probability P. According to the pseudocode, the algorithm can be summarized as the interaction between two cycles: the external one controls the temperature updating and the inner one performs the Metropolis procedure for a given outer iteration.The temperature  is updated through the cooling function T only when the thermal equilibrium is reached.This workflow is repeated until a stopping criterion is met.Classical stopping criteria are as follows: the system reaches a state good enough for the specific application, the method reaches a prescribed number of iterations, or the solution does not improve after a prescribed number of consecutive iterations.Note that both the cooling schedule criterion (line 24) and the stopping criterion (line 26) can be defined in many different ways, depending on the problem at hand.In this sense, these criteria can be either variables or functions or even rules.In our implementation it is assumed that they are Boolean functions that return true or false depending on whether or not the given conditions for each particular problem are met.The specific conditions used for our problem are explained in Section 5.

Overview of the Method.
As discussed above, our problem consists of reconstructing the underlying shape of a given set of noisy data points by using a rational Bézier curve.This implies solving a nonlinear least-squares minimization problem while simultaneously minimizing the required number of free parameters.Solving this problem requires computing four different sets of unknowns: data parameters, poles, weights (represented in this section by vectors p, w, and b, resp.), and the curve degree, .Our approach to tackle this issue is a hybrid strategy combining classical methods (least-squares minimization), modern stochastic methods (simulated annealing), and information science metrics (Bayesian Information Criterion (BIC)).
Before explaining how our method works, let us introduce the following notation: from now on, we will use the subindex (⋅)  when searching for the curve weights, (⋅)  when searching for the curve parameters, and (⋅) , when

Data parameterization and weight computation
All-in-one scheme searching for both parameters and weights.Thus, SA  stands for simulated annealing applied to parameter search,  , means the objective function with domain of definition y = (w, p) ∈ W × D ⊂ R +1 × [0, 1]  , and so on.The superindex (⋅)  denotes the search for a nonrational Bézier curve and (⋅)  stands for a rational one.Hereafter, consider W = (0, 100] +1 and D = [0, 1]  .Without loss of generality, we can also assume that  0 =   = 1,  1 = 0, and   = 1. Figure 1 shows the main steps of our method.Basically, it consists of four major tasks: data parametrization, weight computation, data fitting, and degree determination.Upper part of this figure summarizes the method: we initially set a range for the curve degree ; then, for each value of this parameter  within that range, we apply a combination of simulated annealing and least-squares optimization (box SA-LSQ) to perform the first three tasks and compute the data parameters, weights, and poles of the best fitting rational Bézier curve for this value of .Then, the BIC value of the resulting curve (corresponding to the last task, degree determination) is obtained (box BIC).At its turn, the SA-LSQ can be decomposed into two steps (middle layer of Figure 1): SA performs data parameterization and weight computation (see Section 5.2 for details), while LSQ is used to compute the poles (see Section 5.3).This combination of SA and LSQ is repeated iteratively until no further improvement is reached.In this paper, we introduce two different SA schemas, shown graphically in the lower layer of Figure 1: the sequential schema and the all-in-one schema.They are explained in detail in the next section.

Data Parameterization and Weight Computation.
In this step we perform two different (but intertwined) tasks: data parameterization and weight computation.The former consists of obtaining a discrete association between the set of parameters {  }  =1 and the noisy data points {Q  }  =1 to be fitted, while the latter computes the weights.Both tasks are performed simultaneously by using the simulated annealing approach described in the previous section.The input for the SA method is given by the following: (i) the curve degree, , (ii) initial random parameter vectors p 0 , w 0 , and b 0 , (iii) the energy function, given by ( 5), (iv) a neighborhood function (described in Section 5.2.1), (v) a cooling schedule (described in Section 5.2.2), (vi) a stopping criterion (described in Section 5.2.3).
In this work, two different simulate annealing schemas are considered: the sequential schema and the all-in-one schema.Basically, the former calculates the different sets of unknowns of our optimization problem in a sequential way (i.e., only some parameters are initially computed and subsequently used to compute the remaining parameters), while the latter computes all unknowns at once.Let us analyze them in detail.
Using the notation introduced above, the sequential schema (SEQ) begins by finding the best nonrational Bézier fitting curve through SA   , then it performs the following sequence iteratively: until there is no further improvement in the final solution of either SA procedure in comparison with the previous one.Each routine takes the preceding solution as an input parameter.The schema can be summarized as follows (see also the lower layer of Figure 1, where the numbers in red indicate the different steps of the algorithm): (1) Apply the SA   with a random initial guess p 0 ∈ D to find a non-rational Bézier curve that fits the data better.Let p  be the resulting solution.
(2) Search for a rational Bézier curve through SA   with a random initial guess w 0 ∈ W and fixed parameters (p  ).Let (w  , p  ) be the resulting solution.
(3) Apply the SA   optimization with p  as the initial guess and fixed weights w  .Let (w  , p  ) be the resulting solution.
In general, the energy function for the simulated annealing procedures SA  in steps (2) and (3) above is that in (5).However, a different energy function is required for the nonrational case in step (1), SA  , given by that corresponds to the case   = 1 for all .The all-in-one schema (AIO) corresponds to the minimization of functional where for each state vector (w, p) we compute (w, p) as the least-squares solution of (5) through SA  (,) .

Neighborhood Function.
The neighborhood function is one of the key components of the SA algorithm.Furthermore, its role becomes even more important for multimodal optimization problems, where the objective function is of the many local peaks surrounded by deep valleys type.This is exactly what happens in this paper.There are several alternatives for the neighborhood function.A very popular choice is the fast neighborhood function, which builds the new solution by modifying the previous one in steps proportional to the system temperature: where k ∈ D is a random vector holding ‖k‖ 2 = 1.This is one of the tested functions in [5] but it required another support function (a local search method) in order to exploit the neighborhood of a solution.In this paper we remove the supporting local method by maintaining two sets of controlling parameters: firstly, the global real temperature following the classical SA temperature.Then, the algorithm builds a second set of virtual temperatures (one for each spatial dimension) that are restarted at the beginning of each inner cycle.The resulting equation becomes where ⊙ represents the Hadamard product of two vectors and T has all its components set to the current system temperature at the beginning of each outer cycle.Thus, whenever x new is accepted through the acceptance criterion, we compute the absolute difference between the old and new solutions.For each component that behaves better than the previous one in the comparison, we update the corresponding component on T according to the cooling schedule.The proposed method is based on the Adaptive Simulated Annealing (ASA) algorithm but there is one key difference: our method does not require computing the gradient of the function.A further analysis about the performance of this neighborhood function for the third example of our benchmark is described in Section 6.3.
The reader is kindly referred to that section for further details on this issue.

Cooling Schedule.
By the cooling schedule we refer to a triplet ( 0 , , T) accounting for the selection of the initial temperature, the temperature parameter, and the cooling function (T), respectively, along with the thermal equilibrium criterion.The cooling schedule governs the pace at which the temperature is updated during the execution of the SA.Therefore, its choice is of primary importance for the good performance of the algorithm.Because of this reason, numerous cooling schedules have been proposed over the years, such as the linear schedule, the logarithmic schedule, and many others.A very common one is the power schedule, given by  new =    0 , where typically 0.9 <  < 0.99 and  > 0 is a scalar parameter.Another popular schedule is the Boltzmann schedule, given by  new =  0 / log().Finally, we also consider the fast schedule, governed by the law  new =  0 /, which provided a good balance among simplicity, speed, and good performance for other data fitting problems [5].In this paper we considered initially the power, Boltzmann, and fast schedules.However, our computational experiments showed that the fast schedule provides more visually appealing results and runs faster than the other two alternatives.Since these are two particularly valuable features in the context of data fitting, we eventually selected the fast schedule as the best choice for this work.The cooling schedule needs to accomplish two additional goals.On one hand, the starting temperature needs to be sufficiently high in order to let the algorithm freely explore the search space D but not so much that the system behaves like a random search for a large number of iterations.On the other hand, the cooling function must perform a slow reduction of the system temperature to prevent premature convergence.These goals are achieved by letting the acceptance function to initially accept worse states frequently, which in turn requires high temperatures.There is not known general way to select a good starting temperature for all problems.A common practice consists of selecting a very high temperature to rapidly cool the system after a certain proportion of worse solutions are accepted (this is called the acceptance ratio); then, begin to slowly cool the system, thus beginning with the proper SA algorithm.Another common technique consists of spending most of the computation time in running fast sample runs of the entire algorithm for different starting temperatures in order to determine the ideal temperature: one that promotes better solutions but still manages to explore the space avoiding local minima.This last practice was proposed in [36] as a way to avoid wasting computation time when running the SA.In [35] there is a brief, although insightful, overview of the different approaches on how to select the initial temperature.
In order to select the initial temperature we have discarded the time-consuming restarting strategy used in our previous work in favor of the method presented in [37].It consists of setting an acceptance ratio,  0 , and determining, through an iterative algorithm, a compatible starting temperature.Among the several possible ways to define  0 , in this work we choose the classical one of the quotient between the number of bad transitions accepted and the attempted ones, already proposed in [38].According to this definition, we set  0 = 0.8, a typical value in previous literature in the field.

Stopping Criteria.
Similar to the choice of the initial temperature, there is not a general set of stopping conditions suitable for all problems.There are, however, two common practices: the first one is to set a maximum number of iterations; the second one is to stop the execution of the algorithm when the system is frozen, that is, when no new solutions (either better or worse states) are accepted for a predefined number of iterations.Since the former can waste a lot of computation time with no further improvements on the solution, in practice the stopping criterion is often a combination of the two.In our implementation, the algorithm stops whenever one of the following conditions is met: either  eval > 1000 ⋅  or ♯[(Δ global )] = 10 ⋅ , where  eval denotes the total number of evaluations of the energy function,  denotes the number of free parameters of the system, and (⋅) denotes the lack of changes of the mean variation of the energy function.

Further Improvements.
In addition to the previous SA components, three other important modifications of the original SA algorithm have been included in our approach: (i) We improve the memory capacities of the method through elitism: the best state from the current iteration is encoded as a vector (x  best ,   best ) and stored in a temporal buffer.Obviously, this "best so far" solution is updated whenever a better solution is achieved during SA execution.We remark however that this (x  best ,  (iii) Finally we improve the cooling schedule with extra conditions for the thermal equilibrium.In our implementation, the inner cycle stops if the value of  0 is reached after  iterations.
These new features improve the performance of our approach significantly in terms of computational time and quality of results.

Data Fitting.
With the parameterization and weights calculated in previous steps, we compute the curve coefficients {b  }  =0 .Using (3), ( 5) can be rewritten as where R  = (   ( 1 ), . . .,    (  ))  represents the column vector of rational basis functions given by ( 4) at the best parameter values, Q = (Q 1 , . . ., Q  ) is a row vector, and (⋅)  represents the transpose of a vector or matrix.System ( 12) is overdetermined, meaning that no analytical solution can be obtained.Instead, we solved it numerically through least-squares minimization.If R + denote the generalized inverse (also known as Moore-Penrose pseudoinverse) of R = (R  )  , P = R + Q is the least-squares solution of this data fitting problem.From a computational point of view, it can be obtained by either LU decomposition or singular value decomposition (SVD).In this work, we have used the SVD solver from the popular numerical program Matlab.5.4.Degree Determination.The three previous steps assumed a given degree for the rational fitting curve.However, the optimal degree is problem-dependent, and therefore we need a method to compute it.This is not an easy task.It is well known that increasing the number of poles increases the accuracy of the model, because we have more degrees of freedom to adjust the data.However, this process can eventually lead to overfitting.In order to prevent this undesirable effect, some kind of trade-off between these two competing factors (the accuracy of the fitting and the complexity of the model) is required.To address this issue, in this paper we compute the Bayesian Information Criterion (BIC) for the resulting model.The BIC is an information criterion providing a good compromise between data fidelity and model complexity [39].This is done by introducing a penalty term for complex models into the target function: where Θ, , and  refer to the energy function, number of sampled points, and the number of free parameters, respectively.Note that, fixing  and , the BIC behaves like the error function and if we fix a value of Θ and , the criterion penalizes those models with a higher number of parameters.
Therefore, BIC provides us with a powerful procedure to compute the optimal value for the degree of the fitting curve.It always corresponds to the parameter value  with the smallest value for the BIC.

Experimental Results
In this section, we analyze the performance of our method by applying it to a benchmark of three illustrative examples of 2D and 3D noisy data points.These examples have been carefully chosen so that they exhibit challenging features such as selfintersections or strong changes of slopes and curvatures.
The first two examples correspond to real-world instances so that we can replicate the usual conditions of real-world applications, including the presence of noise of low-medium intensity.The last one is an academic example designed to analyze the effect of different levels of noise on our method.
For each example, we report the results of three different schemas used for comparative purposes: nonrational, rational all-in-one, and rational sequential.Finally we analyze the robustness of the method in the presence of noise by comparing each schema against the same data perturbed with different levels of noise for the last example.
Regarding the implementation issues, all the experiments were run on a AMD-FX-4100 Quad-Core Processor at 3600 Mhz with 8 GB DDR3 RAM running Linux 3.14.xLTS kernel and MATLAB 2012a.For each dataset and schema, the experiment has been executed 26 times.Then, 20 executions are finally selected (after removing the three bast and three worst executions) to provide statistical evidence for the results presented and assert the experiment reproducibility.
Table 1 reports our results for each dataset and experiment.The following items are arranged (in columns): dataset examined, the type of curve reconstructed (NR: nonrational; R: rational), the schema executed for rational curves (AIO: all-in-one; SEQ: sequential), the total number of calls to the energy function in the best case (represented by  eval ), the best and average BIC, the number of poles for the best BIC (represented by  pol ), and the relative mean error for each component ( mean ,  mean , and  mean ).When used, the acronym N.A. stands for not applicable.The logo, shoe, and torus names refer to the scanned logo, shoe profile, and curve on a torus datasets described below, respectively.The number after torus refers to the signal-to-noise ratio (SNR) applied.For the nonrational examples, a local search was performed in order to refine the SA solution.Our results show the good performance of the method even in highly noisy situations, which are the common case in real-world applications.

Example 1:
A Scanned Logo.The first example corresponds to the shape of a digitally scanned logo.The dataset consists of a set of 190 noisy 2D data points, represented by black × symbols in Figure 2. The figure shows our experimental results for the three cases analyzed: nonrational case (top), the all-in-one rational case (middle), and the sequential rational case (bottom).The figures on the left show the reconstructed points, represented as red empty circles.On the right, the best fitting curve is displayed as a red solid line.This example has been chosen because it represents a common real-world scenario: a scanned figure with the typical noise introduced during the scanning process.In addition to the high-intensity noise (clearly visible in all instances of Figure 2), this shape is also challenging because it includes difficult geometric features, such as several selfintersections and strong changes of slope and curvature.
As the reader can see from the figures, the method is able to recover the general shape of the data points with good accuracy.This is a very remarkable result because the original data points are highly noisy.Best results correspond to the sequential rational schema, while the nonrational and the AIO rational schemas perform almost similarly in this case.This fact is clearly visible in the topmost loop of the figures in right column, best fitted through the sequential schema (bottom figure).These visual results are in good agreement with the numerical results reported in Table 1.The best and average BIC for the sequential schema are approximately −770 and −701, respectively, while they are both −691 for the nonrational schema and −661 and −658 for the AIO rational schema.Note also that the three methods obtain the same optimal number of poles  pol = 13.In other words, the good accuracy of our method is not at the expense of a very large number of free variables.

Example 2: Shoe Profile.
The second example corresponds to a shoe profile obtained from a pressure-mechanical method without filtering, leading to a set of 400 threedimensional noisy data points.Figure 3 shows our experimental results.The interpretation of this figure is similar to that of previous example and, hence, it is omitted here to avoid redundant information.Once again, the best fitting is obtained with the sequential rational schema, although in this case, the visual and numerical results are closer for the two rational schemas and significantly worse for the nonrational one.Note also that we obtained a similar parameter value,  pol = 24, for the number of poles in all cases.

Example 3:
Curve on a Torus.Last instance in our benchmark corresponds to an academic example.It has been The simulation results with our method for the resulting 10 datasets are reported in the last 10 horizontal blocks of Table 1.Some important observations can be obtained from the numerical data in the table.The most important one is that the sequential rational schema outperforms the others in terms of BIC value, meaning that it provides the best tradeoff between accuracy and complexity for all instances in this example.According to our results, the optimal number of (a) Figure 2: Experimental results for Example 1: left: reconstructed points (red circles); right: best fitting curve (red solid line); top: nonrational case; middle: AIO rational case; bottom: sequential rational case.In all cases,  pol = 13.
poles for this example is  pol = 20 in all cases, except for the instance torus10, which corresponds to a case of very high noise intensity.In other words, this schema is able to capture the optimal number of poles in cases of noise of low and medium intensity.Furthermore, the method fits the data points very accurately.For instance, the relative fitting error for the noiseless case is as good as 10 −6 for each coordinate.These striking results are visually confirmed in Figure 4. Note, for instance, the very good fitting for the sequential rational schema (bottom row).
On the other hand, as expected, the BIC increases as the noise level increases, meaning that the method is affected by the noise intensity, but not drastically.In fact, the method is very resilient against noise, as it still yields very reasonable relative fitting errors of order 10 −2 for high-intensity noise (for instance, of SNR = 10) and 10 −3 for SNR = 30.For example, the visual quality of the fitting is clearly visible for the case SNR = 50, as shown in Figure 5.We remark, however, that in this case, the nonrational and AIO rational schemas require extra parameters to obtain their best fitting.Note, for instance, that  pol = 25 for the nonrational schema in this example.This effect can be explained by the fact that the nonrational curve has less degrees of freedom because no weights are available.As a consequence, more poles are usually required to compensate this limitation.But even in this case, the value of this parameter is lower or equal to 25.This result is a clear indication of the effectiveness of our proposal to use BIC to keep the dimension of the problem as low as possible and to prevent overfitting.
This third example has also been used to illustrate the good performance of our neighborhood function, described in Section 5.2.1.Figure 7 shows two graphical examples of the evolution of the BIC for the rational all-in-one schema versus the number of evaluations of the fitness function, given by the parameter  eval .The pictures display the examples torus and torus50 from Table 1, corresponding, respectively, to the noiseless case (top) and the noisy case with SNR = 50 (bottom).Both pictures show the evolution of the maximum, mean, and minimum BIC in a color-coded representation (in blue, green, and red, resp.).These BIC values have been obtained with our method from 20 executions out of 26 executions after removing the three best and three worst results for each case.As the reader can see, our neighborhood function allows the method to escape from local minima, a situation that happens particularly at earlier stages of the evolution, associated with an intensive exploration of the search space.Two temporal windows have also been included in the pictures to enlarge these initial stages by zooming for better visualization.After this initial period, the BIC decreases slower and the fitting error reaches a plateau where the exploitation phase becomes dominant.Finally, convergence to the optimal values (marked by the vertical magenta line) is achieved and the fitting error does no longer improve.These pictures clearly show that the method is well suited for multimodal problems, being able to escape from local minima, thus preventing premature convergence to happen.

Conclusions and Future Work
In this paper we introduce two new simulated annealing schemas for continuous optimization.They are applied to obtain the rational Bézier curve that fits better a given set of noisy data points in the least-squares sense.This is a very difficult problem that requires computing four different sets of unknowns: data parameters, poles, weights, and the curve degree.Besides, these free variables are strongly related to each other in a highly nonlinear way.This leads to a difficult continuous nonlinear optimization problem that cannot be decomposed into several independent subproblems.To address this challenging issue, we propose an optimization method combining classical methods (least-squares minimization), modern stochastic methods (simulated annealing), and information science metrics (Bayesian Information Criterion (BIC)).The simulated annealing algorithm is applied to perform data parameterization and to determine the weights of the poles of the fitting curve.This is done by using two different schemas: the all-in-one schema, which computes both sets of unknowns together simultaneously, and the sequential schema, which computes each set of unknowns in sequence, using the previous set of computed variables as the new input.The least-squares minimization is used to calculate the poles of the fitting curve.Finally, we apply the BIC to determine the optimal degree of the best rational Bézier fitting curve.This methodology has been applied to a benchmark of three illustrative examples of 2D Our computational experiments on the proposed benchmark show that our method is very suitable for the given data fitting problem.From the two schemas proposed, the sequential one performed particularly well in all instances, although the AIO schema also provides very good fitting results for our examples.These good numerical results are not obtained at the expense of a large number of variables.On the contrary, the application of the BIC allows to obtain models with a remarkable low number of free variables, even for difficult shapes with complicated features and noise.We also compared our results with our previous method with nonrational curves in [5].The new rational method clearly outperforms our previous approach for all instances of our benchmark.Our experiments with different levels of noise on the last example also show that the method is robust against noise of low to medium intensity.
The main limitations of our approach concern its performance in situations of high-intensity noise.Figure 6 shows our results for the third example with noise of SNR = 10.In general, our method is able to capture the tendency of the data even under these strongly adverse conditions, but some problems may arise in the neighborhood of the initial and last poles of closed curves.In particular, the continuity of such curves at that point cannot be assured.This situation is not critical at all; it can readily be avoided by introducing additional constraints in our problem.However, as expected, Figure 7: Evolution of the maximum, mean, and minimum value of BIC (in blue, green, and red color, resp.)versus the number of function evaluations for the curve on a torus example: top: noiseless case; bottom: noisy case (with SNR = 50).The inner boxed pictures also show a zoom of the initial exploration stages in both cases for better visualization.

Figure 1 :
Figure1: Workflow of the proposed method described in three layers (from upper to lower layer): the general workflow; decomposition of the SA-LSQ procedure; the two different SA schemas introduced in this paper: all-in-one schema (top) and sequential schema (bottom).
best ) vector is not used to drive the SA execution; instead, it is only used as a memory effect, with the role to (possibly) improve the convergence rate with respect to the standard (nonelitist) version of SA. (ii) We add a new operator, related to the domain of the problem, to work in combination with the neighborhood function.This extra functionality checks whether a new generated solution goes outside the search domain of the problem and sends it back into the search space whenever it goes away.To this purpose, we apply the classical cast back operator, a widely accepted routine in numerical methods.Suppose that the SA returns a new solution x new outside the problem domain D, obtained from a previous solution x old within D. The cast back procedure replaces x new by a new value x cb given by the convex combination: x cb = Proj(x new ) + (1 − )x old , where Proj(⋅) is the operator that projects any point outside the domain onto its closest point on the boundary of D and  is a uniform random number in the interval (0, 1).This procedure returns a new point x cb that is well within the search domain while simultaneously ensuring that the probability of the boundary is not increased by this operator.

Table 1 :
Experimental results on the three examples in our benchmark for the simulated annealing schemas discussed in this paper.
which corresponds to a curve on a torus.We consider a set of 202 three-dimensional data points with uniform sampling in the interval domain [0, 2].This dataset, labelled as torus in

Table 1 ,
is then perturbed with additive white noise of different intensities, modulated by a signal-to-noise ratio (SNR) ranging from 10 (very high intensity) to 90 (low intensity), with step-size 10.The corresponding datasets are labelled as torusN, where  indicates the SNR intensity.