An Automated Profile-Likelihood-Based Algorithm for Fast Computation of the Maximum Likelihood Estimate in a Statistical Model for Crash Data

Numerical computation of maximum likelihood estimates (MLE) is one of the most common problems encountered in applied statistics. Even though many algorithms are regarded as efficient, they can fall short in some cases on one or more of the following criteria: global convergence (the capacity of an algorithm to converge to the true unknown solution from any starting guess), numerical stability (ascent property), implementation feasibility (for example, algorithms requiring matrix inversion cannot be implemented when the matrices involved are not invertible), low computation time, low computational complexity, and capacity to handle high-dimensional problems. The reality is that, in practice, no algorithm is perfect, and for each problem, it is necessary to find the best-performing of all existing algorithms or even to develop new ones. In this paper, we consider the computation of the maximum likelihood estimate of the parameter vector of a statistical model of crash frequencies. We split the parameter vector into two blocks, and we develop a new estimation algorithm using the profile likelihood principle. We provide an automatic starting guess for which convergence and numerical stability are guaranteed. We study the performance of our new algorithm on simulated data by comparing it to some of the most famous and modern optimization algorithms. The results suggest that our proposed algorithm outperforms them.


Introduction
Let ℓ(θ) be a log-likelihood function, where θ ∈ ℝ^d is a parameter vector whose structure will be specified later. In computing the maximum likelihood estimate (MLE), which is the point θ̂ at which ℓ(θ) attains its maximum, the very first algorithm one may try is the Newton-Raphson (NR) algorithm, which needs no introduction. The success of this algorithm comes from an enviable and unequalled property: quadratic convergence when the starting guess is well chosen (i.e., close to the unknown solution). However, it also has drawbacks: it is not globally convergent (i.e., its success depends on the starting guess), and it can be costly or impossible to implement in high-dimensional problems because it requires inverting the Hessian matrix at each iteration. Many other algorithms can be used whenever NR cannot. We refer the reader to [1][2][3][4] for a comprehensive review of these algorithms. Among them, we can mention quasi-Newton methods (which use approximations of the inverse of the Hessian matrix rather than inverting it), Fisher scoring (a purely statistical method which consists in replacing the Hessian matrix with its mathematical expectation in order to ensure the ascent property and therefore numerical stability), block optimization, and derivative-free optimization algorithms. We can also mention minorization-maximization (MM) algorithms [5,6], which have made an extraordinary breakthrough in computational statistics and are increasingly used. The MM philosophy for maximizing a function ℓ(θ) is to define, in the first M step, a minorizing function for the objective function and to maximize, in the second M step, that minorizing function with respect to the parameter vector θ.
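To fix ideas, the Newton-Raphson update mentioned above can be sketched in a few lines. The likelihood below is a hypothetical one-dimensional Poisson example chosen for illustration, not the crash model studied in this paper.

```python
# A minimal one-dimensional Newton-Raphson sketch for MLE computation.
# The model and data are made up (a simple Poisson likelihood).
def newton_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - score(theta) / hessian(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Poisson log-likelihood l(t) = sum(x) * log(t) - n * t (up to a constant);
# its maximizer is the sample mean.
x = [2, 3, 5, 4, 6]
n_obs, total = len(x), sum(x)
score = lambda t: total / t - n_obs        # l'(t)
hessian = lambda t: -total / t ** 2        # l''(t)
print(newton_mle(score, hessian, theta0=1.0))   # ≈ 4.0, the sample mean
```

As the text notes, this fast convergence is conditional on a starting guess close to the solution; starting far from it, the iteration can diverge.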
Even if all these algorithms are considered efficient, they can fall short in some cases on one or more of the following criteria: global convergence (the capacity of an algorithm to converge to the true unknown solution from any starting guess), numerical stability (ascent property), implementation feasibility (for example, algorithms requiring matrix inversion cannot be implemented when the matrices involved are not invertible), low computation time, low computational complexity, and capacity to handle high-dimensional problems. In practice, no algorithm is perfect, and for each problem, it is necessary to find the best-performing of all existing algorithms or even to develop new ones.
In this paper, we consider computing the maximum likelihood estimate (MLE) of the parameter vector of a statistical model of crash frequencies proposed by [7]. The parameter vector has the form θ = (α, βᵀ)ᵀ, where α > 0 is the parameter of interest and β is a vector of secondary parameters. Although secondary, the subvector β can contain a large number of components (up to several hundred) depending on the structure of the data. Thus, the classical algorithms mentioned above may need a great number of iterations (and thus converge slowly) or simply fail to converge. For the model considered in this paper, Newton-Raphson and minorization-maximization (MM) algorithms have been implemented by [7,8], but the numerical study of these algorithms has been limited to simple cases. The NR algorithm is known to be fast in simple cases and inefficient (or even impossible to implement) for high-dimensional problems. Moreover, its convergence depends on the starting guess (the initial solution θ^(0) from which the algorithm starts). The MM algorithm, on the other hand, often requires a large number of iterations before converging to the solution.
The main contribution of this paper is to build an estimation algorithm which converges faster than the other algorithms and whose convergence is guaranteed and unaffected by the dimension (the number of components of θ). For this purpose, we exploit the splitting of the parameter vector into two blocks, and we apply the profile likelihood principle (see for example [9], p. 231) to reduce the search for θ to the search for the parameter of interest α alone. Then, we develop a new estimation algorithm for which we provide an automatic starting guess from which convergence and numerical stability (ascent property) are guaranteed. This automation of the starting guess proves to be a true advantage of our algorithm over the others because it circumvents the difficulty of finding an adequate starting guess. We show using simulated data that our algorithm outperforms the minorization-maximization (MM) and Newton-Raphson algorithms, two of the most famous and modern optimization algorithms.
The remainder of this paper is organized as follows. In Section 2, we describe the data and the estimation problem. The statistical model and the constrained maximum likelihood estimation of the parameters are also presented. In Section 3, we present the profile likelihood principle and then use it to design our new profile-likelihood-based algorithm (PLBA) for computing the MLE of θ. We also provide an automatic starting guess, and we prove that it guarantees convergence of the proposed algorithm. In Section 4, we prove that the proposed algorithm satisfies the ascent property. In Section 5, we report the results of the comparison of the PLBA with MM and NR algorithms. The paper finishes with some discussions and concluding remarks in Section 6.

Data, Statistical Model, and Problem Setup
The framework of this paper is the statistical analysis of crash data before and after the implementation of a road safety measure at s (s > 0) experimental sites (called treated sites) where crashes are classified by severity into r (r > 0) levels. This analysis aims at estimating the mean effect α > 0 of the safety measure simultaneously on all s sites. This mean effect must be understood in the multiplicative sense and therefore must be compared to 1. A value α < 1 indicates a positive effect of the measure (an average reduction of 100 × (1 − α)% in the number of crashes), while a value α > 1 indicates a negative effect of the measure (an average increase of 100 × (α − 1)% in the number of crashes), and α = 1 indicates that the measure had no significant influence on the number of crashes. Let x (Equation (1)) be a matrix of order s × (2r) where x_{1jk} (respectively, x_{2jk}) is the number of accidents of severity level j (j = 1, ⋯, r) which occurred at site k in the period before (respectively, after) the implementation of the measure. Also, let z (Equation (2)) be a matrix of order s × r where z_{jk} (j = 1, ⋯, r, k = 1, ⋯, s) is the ratio of the number of accidents of severity j in the "after" period to the number of accidents of the same severity in the "before" period at a control site (a site where the measure has not been implemented) paired with treated site k.
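The data layout just described can be illustrated concretely. The counts and ratios below are made up for illustration; the variable names `x`, `z`, and `n` simply mirror the notation above.

```python
import numpy as np

# Illustrative layout of the crash data (numbers are invented): s = 2 treated
# sites, r = 3 severity levels. Row k of `x` stacks the "before" counts
# x_{1jk} and the "after" counts x_{2jk}; `z` holds the control-site ratios.
s, r = 2, 3
x = np.array([[10, 6, 4,  7, 5, 2],    # site 1: before (j=1..3) | after (j=1..3)
              [12, 8, 5,  9, 6, 3]])   # site 2
z = np.array([[0.9, 1.1, 1.0],
              [0.8, 1.0, 1.2]])        # z_{jk}: after/before ratio at the paired control site

assert x.shape == (s, 2 * r) and z.shape == (s, r)
n = x.sum(axis=1)   # n_k: total number of crashes observed at treated site k
print(n)            # [34 43]
```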
Let S^{r−1} = {(p_1, ⋯, p_r) ∈ [0, 1]^r : ∑_{j=1}^r p_j = 1} and let n_k = ∑_{i=1}^2 ∑_{j=1}^r x_{ijk} be the total number of accidents observed at treated site k. N'Guessan et al. [7] proposed the following multinomial-based probability distribution with parameter vector θ for x:

Journal of Applied Mathematics

where θ = (α, βᵀ)ᵀ is the parameter vector, α > 0 is the mean effect of the measure, and x_{•jk} = x_{1jk} + x_{2jk} (see [7] for more details).
In this paper, we are interested in computing the maximum likelihood estimate (MLE) θ̂ of the unknown parameter vector θ, defined as the maximizer of the log-likelihood over the constrained parameter space (Equation (6)). In the case s = 1, [10] built an algorithm to solve Equation (6). In the next section, we present a profile-likelihood-based algorithm (PLBA) for computing the MLE in the general case s ≥ 1. Our proposed PLBA is automated since we provide an automatic starting guess which guarantees convergence.

The Automated Profile-Likelihood-Based Algorithm (PLBA) for Computing the MLE

3.1. A Brief Reminder on Profile Likelihood. Let us rewrite the log-likelihood as ℓ(θ) = ℓ(α, β). If, for a given α, the MLE β̂ of β can be written as a function β̂(α) of α, then the profile likelihood function (see for example [9], p. 231) is a function of α only. Maximizing ℓ_p(α) is equivalent to maximizing ℓ(α, β) [11].
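The profile-likelihood idea can be sketched on a toy problem. The joint log-likelihood, the closed-form inner maximizer `beta_hat`, and the grid search below are all illustrative stand-ins, not the paper's crash model.

```python
# Toy sketch of the profile-likelihood principle: for each value of the
# interest parameter alpha, the nuisance parameter beta is profiled out via
# its closed-form maximizer beta_hat(alpha), leaving a one-dimensional
# function l_p(alpha) to maximize.
def loglik(alpha, beta):
    # hypothetical joint log-likelihood, concave in (alpha, beta)
    return -(alpha - 1.0) ** 2 - (beta - 2.0 * alpha) ** 2

def beta_hat(alpha):
    # closed-form inner maximizer: d/dbeta loglik = 0  =>  beta = 2 * alpha
    return 2.0 * alpha

def profile_loglik(alpha):
    return loglik(alpha, beta_hat(alpha))   # l_p(alpha) = -(alpha - 1)^2

# Maximizing l_p over a grid recovers the joint MLE (alpha = 1, beta = 2).
alphas = [i / 1000.0 for i in range(0, 3001)]
a_hat = max(alphas, key=profile_loglik)
print(a_hat, beta_hat(a_hat))   # 1.0 2.0
```

The same two-stage structure (profile out β, then search over α alone) is what the PLBA exploits below, with the grid search replaced by a Newton iteration.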

Computation of β̂ in Closed Form
Proof. Given α > 0, [7] proved that for all k = 1, ⋯, s, the components of β̂_k satisfy a system of r equations (Equation (10)). Thus, for all k = 1, ⋯, s, we apply Theorem 3.5 of [12] to Equation (10) and obtain the stated expression, where M_{α,k} is a block-defined square matrix of order r + 1 whose inversion uses the Schur complement of Δ_{α,k} in M_{α,k} (see for example [13], p. 34, for a reminder on the use of the Schur complement for inverting a block matrix). The proof is thus completed.
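The block-inversion step via the Schur complement can be sketched generically. The matrices below are arbitrary examples, not the paper's M_{α,k} and Δ_{α,k}; the code shows only the standard Schur-complement mechanics the proof relies on.

```python
import numpy as np

# Solving M y = b for a block matrix M = [[A, B], [C, D]] via the Schur
# complement of A: S = D - C A^{-1} B. M is invertible iff A and S are.
def schur_solve(A, B, C, D, b1, b2):
    """Solve [[A, B], [C, D]] [y1; y2] = [b1; b2] without forming M^{-1}."""
    Ainv_b1 = np.linalg.solve(A, b1)
    Ainv_B = np.linalg.solve(A, B)
    S = D - C @ Ainv_B                          # Schur complement of A in M
    y2 = np.linalg.solve(S, b2 - C @ Ainv_b1)
    y1 = Ainv_b1 - Ainv_B @ y2
    return y1, y2

A = np.array([[4.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[0.0, 1.0]])
D = np.array([[2.0]])
b1, b2 = np.array([1.0, 2.0]), np.array([1.0])
y1, y2 = schur_solve(A, B, C, D, b1, b2)

# Verify against the assembled block matrix.
M = np.block([[A, B], [C, D]])
print(np.allclose(M @ np.concatenate([y1, y2]), np.concatenate([b1, b2])))  # True
```

When only one block of the inverse is needed, as in the closed-form expression for β̂_k, this route avoids inverting the full (r + 1) × (r + 1) matrix.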

Profile Likelihood
Theorem 2. The profile likelihood function is defined, up to an additive constant independent of α, by Equation (14).

Proof. Expression (5) is equivalent to the following.

For all k = 1, ⋯, s, Equation (9) yields the corresponding expression, and relationships (9) and (16) enable us to write the expanded form. After some manipulations on the first, second, and fourth terms, removing the third and sixth terms and the constants (the first and fifth terms), we get (14).

Design of the PLBA
Lemma 3. The equation F(α) = 0, where F is defined by Formula (19), has a unique solution, which is the MLE α̂.

Proof.
(i) On the one hand, F is a one-to-one decreasing function (it is continuous, and its derivative F′(u) is strictly negative for all u ≥ 0). Since its limiting values satisfy −x_{1••} < 0 < x_{2••}, the equation F(α) = 0 has a unique solution.
(ii) On the other hand, the MLE α̂, if it exists, is the solution to the optimization problem (Equation (21)). The profile log-likelihood ℓ_p(α) being differentiable for every α > 0, the MLE α̂ is then a solution to the equation ℓ_p′(α̂) = 0, with ∑_{k=1}^s ∑_{j=1}^r x_{•jk} = x_{1••} + x_{2••}. Thus, the equation ℓ_p′(α̂) = 0 is equivalent to F(α̂) = 0, and α̂ is the unique root of F. The equation F(u) = 0 seems fairly complicated to solve in closed form for any s > 1 and must therefore be solved numerically. There are, of course, many root-finding algorithms (see for example [14], Chapter 3, or [15], Chapter 3). Here, we propose a numerical approximation of α̂ using the one-dimensional Newton-Raphson (NR) root-finding iteration (23), where the starting guess α^(0) would normally be chosen in ℝ₊ by the user. Our choice of the NR algorithm is motivated by the fact that it converges quadratically to the solution when α^(0) is chosen near the unknown solution. To overcome the difficulty of choosing α^(0), we prove that, if we set α^(0) = 0 as an automatic starting guess, then the convergence of the NR iterations (23) is always guaranteed. To prove Theorem 4, we need the following Lemma 5.
Proof of Lemma 5.
(i) For all u ≥ 0, φ is differentiable, and its derivative φ′(u) can be factored so that, for all u ∈ [0, +∞), the sign of φ′(u) is the same as that of F(u). As F is a decreasing function and F(α̂) = 0, it follows that φ is increasing on [0, α̂] and decreasing on [α̂, +∞).
(ii) The equation φ(u) = u is equivalent to F(u) = 0, where F is defined by Formula (19). By Lemma 3, this equation has α̂ as its unique solution; hence, α̂ is the unique fixed point of φ.
We may now prove Theorem 4.
Proof of Theorem 4. Assume that α^(0) = 0. Then, α^(0) ≤ α̂ and, by items (iii) and (iv) of Lemma 5, α^(0) ≤ φ(α^(0)) = α^(1) ≤ α̂. It can easily be proved by induction that this ordering holds at every iteration (Equation (30)). Thus, the sequence (α^(m)) is increasing and bounded; hence, it converges to the unique fixed point of φ, which is α̂.

Remark 6. Actually, from the proof of Theorem 4, it is clear that any starting value α^(0) ∈ [0, α̂] will guarantee convergence of the sequence (α^(m)) generated by the NR iterations (23). However, since α̂ is unknown, it is difficult to exhibit such a value other than α^(0) = 0.
We therefore propose an algorithm (see Algorithm 1) starting from α^(0) = 0. The MLE α̂ is computed using the NR iterations (23); afterwards, β̂ is computed from α̂.
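The overall shape of Algorithm 1 can be sketched as follows. Stage 1 runs the one-dimensional Newton-Raphson iterations (23) on F from the automated starting guess α^(0) = 0; stage 2 recovers β̂ from α̂ in closed form. The functions `F`, `dF`, and `beta_cf` below are hypothetical stand-ins with the qualitative properties established above (F continuous, strictly decreasing, F(0) > 0), not the model's actual expressions.

```python
# Two-stage sketch of the PLBA under stated assumptions.
def plba(F, dF, beta_closed_form, tol=1e-12, max_iter=100):
    alpha = 0.0                                  # automated starting guess
    for _ in range(max_iter):
        new = alpha - F(alpha) / dF(alpha)       # NR iteration (23)
        if abs(new - alpha) < tol:
            alpha = new
            break
        alpha = new
    return alpha, beta_closed_form(alpha)

x1, x2 = 30.0, 20.0                              # made-up totals with -x1 < 0 < x2
F = lambda u: x2 - x1 * u / (1.0 + u)            # decreasing; unique root at u = 2
dF = lambda u: -x1 / (1.0 + u) ** 2
beta_cf = lambda a: [a / (1.0 + a), 1.0 / (1.0 + a)]   # hypothetical closed form
alpha_hat, beta_hat = plba(F, dF, beta_cf)
print(round(alpha_hat, 6))                       # 2.0
```

Started at 0, the iterates increase monotonically toward the root, as guaranteed by Theorem 4 for the paper's F.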

Ascent Property
The ascent property of Algorithm 1 (the fact that the profile log-likelihood is increased monotonically by the algorithm) is given by Theorem 7.
Theorem 7. The sequence (α^(m)) generated by Algorithm 1 increases the profile log-likelihood ℓ_p(α) monotonically, that is, ℓ_p(α^(m)) ≤ ℓ_p(α^(m+1)) for every iteration m.

Proof. From (22) and (29), we deduce that, for all α > 0, the function ℓ_p is increasing on the interval [0, α̂] and decreasing on [α̂, +∞). From (30), we know that the sequence (α^(m)) is increasing and remains in the interval [0, α̂]. Then, for every iteration m, we have α^(m) ≤ α^(m+1) and ℓ_p(α^(m)) ≤ ℓ_p(α^(m+1)) because ℓ_p is increasing on [0, α̂]. The proof of Theorem 7 is then completed.
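The ascent property can be checked numerically on a hypothetical profile log-likelihood whose derivative F is decreasing with a unique positive root, as in the paper; none of the formulas below are the model's own.

```python
import math

# Toy l_p chosen so that l_p'(a) = F(a) with F decreasing and F(0) > 0.
x1, x2 = 30.0, 20.0
lp = lambda a: x2 * a - x1 * (a - math.log(1.0 + a))
F = lambda a: x2 - x1 * a / (1.0 + a)
dF = lambda a: -x1 / (1.0 + a) ** 2

alphas = [0.0]                       # automated starting guess alpha^(0) = 0
for _ in range(30):
    a = alphas[-1]
    alphas.append(a - F(a) / dF(a))  # NR iterations (23)

# Along the iterates, l_p never decreases (Theorem 7).
vals = [lp(a) for a in alphas]
print(all(v1 <= v2 + 1e-12 for v1, v2 in zip(vals, vals[1:])))   # True
```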

Simulation Study
We compare the performance of our PLBA with that of the Newton-Raphson (NR) and MM algorithms in the R software [16]. We implemented the NR algorithm using the R package pracma [17] and the MM algorithm using Theorem 3.3 of [8]. The choice of these two comparison algorithms is motivated by the following factors: (a) the MM and NR algorithms are two of the most used algorithms in statistics for parameter estimation; (b) other algorithms such as quasi-Newton and derivative-free algorithms showed very low convergence proportions, and their results are not reported here; (c) the results obtained for the particular case s = 1 in [10] suggest that the MM and NR algorithms are much more efficient than quasi-Newton and derivative-free algorithms.
For the different scenarios considered, the number of parameters (1 + sr) is given in Table 1.
For the n_k's (k = 1, ⋯, s), we chose two common values: a low value (n = 50) and a large value (n = 5000). Except for the proposed algorithm, whose starting guess is automated, the other algorithms were given randomly generated starting guesses.

Results
Tables 2-7 present the average results obtained for the different scenarios over 1000 replications (i.e., 1000 repetitions of the data generation and computation of θ̂). In these tables, CPU times are given in seconds, and CPU time ratios are calculated as the ratio of the mean CPU time of a given algorithm to that of the PLBA. Thus, the CPU time ratio of the PLBA is always equal to 1. The mean square error (MSE) measures the average squared distance between θ̂ and the true parameter vector. Due to the increasing number of parameters, the MLE θ̂ has only been included for Scenario 1 (Table 2).
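A plausible reading of these simulation summaries can be sketched as follows; the exact formulas here are assumptions, not taken from the paper's tables.

```python
import numpy as np

# Hypothetical summaries: MSE averages the squared distance between each
# replication's estimate and the true parameter vector; the CPU-time ratio
# divides each algorithm's mean CPU time by the PLBA's.
def mse(estimates, theta_true):
    est = np.asarray(estimates)               # shape: (replications, dim)
    return np.mean(np.sum((est - theta_true) ** 2, axis=1))

def cpu_ratio(mean_times, baseline="PLBA"):
    return {alg: t / mean_times[baseline] for alg, t in mean_times.items()}

theta = np.array([0.8, 0.1, 0.9])             # made-up "true" parameter vector
ests = np.array([[0.9, 0.1, 0.9],
                 [0.7, 0.1, 0.9]])            # two made-up replications
print(mse(ests, theta))                       # ≈ 0.01
print(cpu_ratio({"PLBA": 0.02, "MM": 0.2, "NR": 0.3}))   # PLBA ratio is 1.0
```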
From these tables, it can be seen that the PLBA and MM algorithms always converged, while the convergence proportion of the NR algorithm decreases (from 55.9% to 19.6%) with the number of parameters (see Figure 1). As far as the MSE is concerned, the trend for all the algorithms is that the MSE decreases when the sample size n increases. Looking at the CPU time ratios, we notice that those of the MM and NR algorithms are all well above 1. This means that the PLBA is significantly faster than these two algorithms. As shown in Figures 2 and 3, the CPU time ratios of the MM and NR algorithms increase with the number of parameters. The PLBA is on average 8 to 50 times faster than MM and 6 to 74 times faster than NR.

Analysis of the Results

It appears that our proposed PLBA outperforms the minorization-maximization (MM) and the full Newton-Raphson (NR) algorithms. It always converges, and it requires few iterations and little computation time. The number of iterations varies only slightly and seems to stabilize around six iterations whatever the starting guess and the number of parameters to be estimated. The MM algorithm also has a convergence rate of 100%, but its number of iterations is quite high, which may indicate some sensitivity to the starting guess. The NR algorithm has a convergence proportion at most slightly above 50%, and this proportion decreases as the number of parameters increases. This is not surprising because, at each iteration of the NR algorithm, a square matrix of order 1 + sr is numerically inverted. This numerical inversion can be difficult or impossible when the matrix to be inverted is ill-conditioned or singular. The fact that NR does not converge for all the replications is also not surprising, since it is well known that NR may converge to bad values, or may not converge at all, if the starting guess is far from the true parameter vector.
The good results obtained by our proposed PLBA can be explained by several factors. First, the profile likelihood principle reduces the search for 1 + sr solutions to the search for the single solution α̂, from which the remaining sr estimates β̂_{jk} (j = 1, ⋯, r, k = 1, ⋯, s) can be obtained by a simple formula. This also reduces the computation time required by our proposed PLBA to converge. Second, the fact that the estimation of α relies on a one-dimensional NR iteration enables our algorithm to enjoy the quadratic convergence of the NR algorithm. Third, the automated starting guess ensures the convergence of our proposed algorithm, and the ascent property ensures its numerical stability.

Conclusion
In this article, we have built a profile-likelihood-based algorithm (PLBA) to compute, under constraints, the maximum likelihood estimate (MLE) of the parameter vector of a statistical model used in the analysis of a road safety measure applied to s sites with a total of r accident severity levels. The parameter vector of the model is of the form θ = (α, βᵀ)ᵀ, where α is the parameter of interest and β is a vector of sr secondary parameters. Using the likelihood equations, we obtained closed-form expressions of the components of β as functions of the main parameter α and then used the profile likelihood principle to express the log-likelihood as a function of α only. We then built an algorithm combining a one-dimensional Newton method, which computes the estimate of the parameter of interest α, with the computation of the secondary parameters from their closed-form expressions. The starting guess of our proposed algorithm is automated in such a way as to guarantee its convergence towards the MLE θ̂. The numerical studies suggest that our PLBA outperforms the minorization-maximization (MM) and the full Newton-Raphson (NR) algorithms in terms of computation time. Our PLBA converges to estimates close to the true values of the parameter vector even for small sample sizes. The studies also suggest that the problem addressed in this paper is difficult for the full NR algorithm, whose convergence proportion is at most slightly above 50%.

Data Availability
This study uses simulated data, and the data generation process is described in the paper.

Conflicts of Interest
The author declares that he/she has no conflicts of interest.