Maximum Likelihood Estimation of the VAR(1) Model Parameters with Missing Observations

Missing-data problems are extremely common in practice. To achieve reliable inferential results, this feature of the data must be taken into account. Suppose that the univariate data set under analysis has missing observations. This paper examines the impact of selecting an auxiliary complete data set, whose underlying stochastic process is to some extent interdependent with the former, to improve the efficiency of the estimators of the relevant parameters of the model. The Vector AutoRegressive (VAR) Model has proven to be an extremely useful tool for capturing the dynamics of bivariate time series. We propose maximum likelihood estimators for the parameters of the VAR(1) Model based on a monotone missing data pattern. The precision of the estimators is also derived. Afterwards, we compare the bivariate modelling scheme with its univariate counterpart: the univariate data set with missing observations is modelled by an AutoRegressive Moving Average (ARMA(2,1)) Model, and we also analyse the behaviour of the AutoRegressive Model of order one, AR(1), due to its practical importance. We focus on the mean value of the main stochastic process. Through simulation studies, we conclude that the estimator based on the VAR(1) Model is preferable to those derived in the univariate context.


Introduction
Statistical analyses of data sets with missing observations have long been addressed in the literature. For instance, Morrison [1] deduced the maximum likelihood estimators of the multinormal mean vector and covariance matrix for the monotone pattern with only a single incomplete variate; the exact expectations and variances of the estimators were also derived. Dahiya and Korwar [2] obtained the maximum likelihood estimators for a bivariate normal distribution with missing data, focusing on the estimation of the correlation coefficient as well as the difference of the two means. Following this line of research, and bearing in mind that most empirical studies are characterised by temporal dependence between observations, we generalise these studies by introducing a bivariate time series model to describe the relationship between the processes under consideration.
The literature on missing data has expanded in the last decades, focusing mainly on univariate time series models [3-7], but developments in the vector context are still lacking.
This paper analyses the main properties of the estimators obtained from data generated by one of the most influential models in empirical studies, the first-order Vector AutoRegressive (VAR(1)) Model, when the data set from the main stochastic process, designated by {Y_t}_{t∈Z}, has missing observations. We assume that a suitable auxiliary stochastic process, denoted by {X_t}_{t∈Z}, is also available and is to some extent interdependent with the main stochastic process; the data set obtained from this auxiliary process is complete. In this context, a natural question arises: is it possible to exchange information between the two data sets to increase knowledge about the process whose data set has missing observations, or should we analyse the univariate stochastic process by itself? The goal of this paper is to answer this question.
Throughout this paper, we assume that the incomplete data set has a monotone missing data pattern. We follow a likelihood-based approach to estimate the parameters of the model; it is worth pointing out that, in the literature, likelihood-based estimation is widely used to handle the problem of missing data [3, 8, 9]. The precision of the maximum likelihood estimators is also derived.
In order to answer the question raised above, we must verify whether the introduction of an auxiliary variable for estimating the parameters of the model increases the accuracy of the estimators. To accomplish this goal, we compare the precision of the estimators just cited with those obtained from modelling the dynamics of the univariate stochastic process {Y_t}_{t∈Z} by an AutoRegressive Moving Average (ARMA(2,1)) Model, which corresponds to the marginal model of the bivariate VAR(1) Model [10, 11]. The behaviour of the AutoRegressive Model of order one, AR(1), is also analysed due to its practical importance in time series modelling. Simulation studies allow us to assess the relative efficiency of the different approaches. Special attention is paid to the estimator of the mean value of the stochastic process about which the available information is scarce. This is a reasonable choice given the importance of the mean function of a stochastic process in understanding the behaviour of the time series under consideration.
The paper is organised as follows. In Section 2, we review the VAR(1) Model and highlight a few statistical properties that will be used in the remaining sections. In Section 3, we establish the monotone pattern of missing data and factorise the likelihood function of the VAR(1) Model. The maximum likelihood estimators of the parameters are obtained in Section 4, where their precision is also deduced. Section 5 reports the simulation studies evaluating different approaches to estimating the mean value of the stochastic process {Y_t}_{t∈Z}. The main conclusions are summarised in Section 6.

Brief Description of the VAR(1) Model
In this section, a few properties of the Vector AutoRegressive Model of order one are analysed. These features will play an important role in determining the estimators of the parameters when there are missing observations, as we will see in Section 4.
Hereafter, the stochastic process underlying the complete data set is denoted by {X_t}_{t∈Z}, while the process with missing observations is represented by {Y_t}_{t∈Z}. The VAR(1) Model under consideration takes the form

    X_t = α_0 + α_1 X_{t-1} + ε_t,
    Y_t = δ_0 + δ_1 Y_{t-1} + δ_2 X_{t-1} + u_t,      (1)

where ε_t and u_t are Gaussian white noise processes with zero mean and variances σ²_ε and σ²_u, respectively. The correlation between the error terms is different from zero only at the same date t, that is, Cov(ε_{t-i}, u_{t-j}) = σ_εu for i = j and Cov(ε_{t-i}, u_{t-j}) = 0 for i ≠ j, i, j ∈ Z. Exchanging information between both time series might introduce some noise into the overall process; therefore, transfer of information from the shorter series to the longer one is not allowed here, which is why Y_{t-1} does not enter the equation for X_t.
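As a point of reference for the developments below, the following minimal sketch simulates a realisation of model (1); it assumes the triangular specification written above, and all parameter values and variable names are illustrative choices, not values taken from the paper.

```python
# Minimal simulation sketch of the VAR(1) Model (1); parameter values
# are illustrative only and satisfy |alpha1| < 1, |delta1| < 1.
import numpy as np

rng = np.random.default_rng(42)

alpha0, alpha1 = 1.0, 0.6                 # X equation
delta0, delta1, delta2 = 0.5, 0.4, 0.3    # Y equation
sigma2_e, sigma2_u, sigma_eu = 1.0, 1.0, 0.5

n = 500
cov = np.array([[sigma2_e, sigma_eu], [sigma_eu, sigma2_u]])
errors = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

x = np.zeros(n)  # x[0] = y[0] = 0 are arbitrary starting values
y = np.zeros(n)
for t in range(1, n):
    eps_t, u_t = errors[t]  # contemporaneously correlated white noise
    x[t] = alpha0 + alpha1 * x[t - 1] + eps_t
    y[t] = delta0 + delta1 * y[t - 1] + delta2 * x[t - 1] + u_t
```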
We introduce the restrictions |α_1| < 1 and |δ_1| < 1. They ensure not only that the underlying processes are ergodic for the respective means but also that the stochastic processes are covariance stationary (see Nunes [12, ch. 3]). Hereafter, we assume that these restrictions are satisfied.
Next, we overview some relevant properties of the VAR(1) Model (1). Theoretical details can be found in Nunes [12, ch. 3].
The mean values of X_t and Y_t are, respectively, given by

    μ_X = α_0 / (1 - α_1),    μ_Y = (δ_0 + δ_2 μ_X) / (1 - δ_1).      (2)

Concerning the covariance structure of the process X_t,

    Cov(X_t, X_{t-h}) = α_1^{|h|} σ²_ε / (1 - α_1²),  h ∈ Z.      (3)

For α_1 ≠ δ_1, the autocovariance function of the stochastic process Y_t is given by the closed-form expression (4); considering that α_1 = δ_1, we have the corresponding limiting expression (5). In regard to the structure of covariance between the stochastic processes Y_t and X_t, the closed-form expressions (6) and (7) cover the cases α_1 ≠ δ_1 and α_1 = δ_1, respectively, and hold for all t, h ∈ Z.
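Although the explicit covariance expressions are lengthy, they all derive from a single moving-average representation; the following is a reconstruction consistent with model (1) and the restrictions |α_1| < 1 and |δ_1| < 1, from which (3)-(7) follow by summing the resulting geometric series.

```latex
% Moving-average representation of model (1); a reconstruction under the
% stated stationarity assumptions, not copied from the original paper.
\begin{aligned}
X_t - \mu_X &= \sum_{j \ge 0} \alpha_1^{\,j}\,\varepsilon_{t-j},\\
Y_t - \mu_Y &= \sum_{k \ge 0} \delta_1^{\,k}\,u_{t-k}
  \;+\; \delta_2 \sum_{h \ge 0}
  \frac{\alpha_1^{\,h+1}-\delta_1^{\,h+1}}{\alpha_1-\delta_1}\,
  \varepsilon_{t-1-h},
  \qquad \alpha_1 \neq \delta_1 .
\end{aligned}
```

The case α_1 = δ_1 is obtained by replacing the ratio with its limit (h + 1) α_1^h.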
By writing out the stochastic system (1) in matrix notation, the bivariate stochastic process Z_t = [X_t Y_t]ᵀ can be expressed as

    Z_t = c + Φ Z_{t-1} + a_t,    c = [α_0 δ_0]ᵀ,    Φ = [ α_1  0 ; δ_2  δ_1 ],      (8)

where a_t = [ε_t u_t]ᵀ is the 2-dimensional Gaussian white noise random vector. Hence, at each date t, t ∈ Z, the conditional stochastic process Z_t | Z_{t-1} = z_{t-1} follows a bivariate Gaussian distribution,

    Z_t | Z_{t-1} = z_{t-1} ~ N(c + Φ z_{t-1}, Σ_a),    Σ_a = [ σ²_ε  σ_εu ; σ_εu  σ²_u ],      (9)

where the two arguments of (9) are, respectively, the two-dimensional conditional mean value vector and the variance-covariance matrix. Straightforward computations lead us to the following factoring of the probability density function of Z_t | Z_{t-1}:

    f(x_t, y_t | z_{t-1}) = f(x_t | z_{t-1}) f(y_t | x_t, z_{t-1}).      (10)

Thus, the joint distribution of the pair X_t and Y_t conditional on the values of the process at the previous date t - 1, Z_{t-1}, can be decomposed into the product of the marginal distribution of X_t | Z_{t-1} and the conditional distribution of Y_t | X_t, Z_{t-1}. Both densities follow univariate Gaussian probability laws:

    X_t | Z_{t-1} = z_{t-1} ~ N(α_0 + α_1 x_{t-1}, σ²_ε).      (11)

Also, Y_t | X_t = x_t, Z_{t-1} = z_{t-1} follows a Gaussian distribution with mean

    E[Y_t | X_t = x_t, Z_{t-1} = z_{t-1}] = ψ_0 + ψ_1 y_{t-1} + ψ_2 x_{t-1} + β_1 x_t,      (12)

where β_1 = σ_εu / σ²_ε or, for interpretive purposes, β_1 = (σ_u / σ_ε) ρ_εu. The parameter β_1 describes, thus, a weighted correlation between the error terms ε_t and u_t, the weight corresponding to the ratio of their standard deviations. Moreover,

    ψ_0 = δ_0 - β_1 α_0,    ψ_1 = δ_1,    ψ_2 = δ_2 - β_1 α_1.

The variance has the following structure:

    ψ_3 = Var(Y_t | X_t, Z_{t-1}) = σ²_u (1 - ρ²_εu) = σ²_u - σ²_εu / σ²_ε.      (13)

The conditional distribution of Y_t | X_t, Z_{t-1} can thus be interpreted as a straight-line relationship between Y_t and X_t, X_{t-1}, and Y_{t-1}. Additionally, it is worth mentioning that if ρ_εu = ±1 or σ²_u = 0, the above conditional distribution degenerates into its mean value. Henceforth, we discard these particular cases, which means that ψ_3 ≠ 0.
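As a quick numerical illustration of factorisation (10) under the parameterisation just introduced, the following sketch checks that the joint conditional density equals the product of the marginal and conditional densities; the parameter and conditioning values are illustrative assumptions.

```python
# Numerical check of (10): f(x_t, y_t | z_{t-1}) =
# f(x_t | z_{t-1}) * f(y_t | x_t, z_{t-1}); illustrative values only.
import numpy as np
from scipy.stats import multivariate_normal, norm

alpha0, alpha1 = 1.0, 0.6
delta0, delta1, delta2 = 0.5, 0.4, 0.3
sigma2_e, sigma2_u, sigma_eu = 1.0, 1.0, 0.5

x_prev, y_prev = 2.0, 1.0   # conditioning values z_{t-1}
x_t, y_t = 2.5, 1.4         # evaluation point

# Joint conditional density of Z_t | Z_{t-1}, as in (9).
mean = [alpha0 + alpha1 * x_prev,
        delta0 + delta1 * y_prev + delta2 * x_prev]
cov = [[sigma2_e, sigma_eu], [sigma_eu, sigma2_u]]
f_joint = multivariate_normal(mean, cov).pdf([x_t, y_t])

# Marginal (11) times conditional (12)-(13).
beta1 = sigma_eu / sigma2_e
psi0 = delta0 - beta1 * alpha0
psi1 = delta1
psi2 = delta2 - beta1 * alpha1
psi3 = sigma2_u - sigma_eu**2 / sigma2_e   # conditional variance
f_marg = norm(mean[0], np.sqrt(sigma2_e)).pdf(x_t)
f_cond = norm(psi0 + psi1 * y_prev + psi2 * x_prev + beta1 * x_t,
              np.sqrt(psi3)).pdf(y_t)

assert np.isclose(f_joint, f_marg * f_cond)
```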

Factoring the Likelihood Based on a Monotone Missing Data Pattern
We focus here on the theoretical background for factoring the likelihood function of the VAR(1) Model when there are missing values in the data. Suppose that we have the following monotone pattern of missing data:

    (x_0, y_0), (x_1, y_1), ..., (x_{m-1}, y_{m-1}), x_m, x_{m+1}, ..., x_{n-1}.      (14)

That is, there are n observations available from the stochastic process {X_t}_{t∈Z}, whereas, due to some uncontrolled factors, it was only possible to record m (m < n) observations from the stochastic process {Y_t}_{t∈Z}. In other words, there are n - m missing observations from Y_t.
Let the observed bivariate sample of size n with missing values,

    z_0, z_1, ..., z_{m-1}, x_m, x_{m+1}, ..., x_{n-1},      (15)

denote a realisation of the random process Z_t = [X_t Y_t]ᵀ, t ∈ Z, which follows a vector autoregressive model of order one. The likelihood function, L(θ), is given by

    L(θ) = f(z_0, z_1, ..., z_{m-1}, x_m, ..., x_{n-1}; θ),      (16)

where θ = [α_0 α_1 δ_0 δ_1 δ_2 σ²_ε σ²_u σ_εu]ᵀ is the 8-dimensional vector of population parameters. To lighten notation, we omit the conditioning of the arguments of the above probability density functions on the values of the processes at date t - 1 whenever no confusion arises. The likelihood function becomes

    L(θ) = f(x_0, y_0) ∏_{t=1}^{m-1} f(x_t, y_t | z_{t-1}) ∏_{t=m}^{n-1} f(x_t | x_{t-1}).      (17)

Two points must be emphasised. First, the maximum likelihood estimators (m.l.e.) of the unknown vector of parameters are obtained by maximising the natural logarithm of the above likelihood function. Second, a worthwhile simplification of the function to be maximised is to determine the conditional maximum likelihood estimators, regarding the first pair of random variables, Z_0 = [X_0 Y_0]ᵀ, as deterministic and maximising the log-likelihood function conditioned on the values X_0 = x_0 and Y_0 = y_0. The loss of efficiency of the estimators obtained from such a procedure is negligible when compared with the exact maximum likelihood estimators computed by iterative techniques: even for moderate sample sizes, the first pair of observations makes a negligible contribution to the total likelihood. Hence, the exact m.l.e. and the conditional m.l.e. have the same large sample properties, Hamilton [13]. Hereafter, we restrict the study to the conditional log-likelihood function.
Despite the above simplifications, some difficulties remain: the log-likelihood equations are intractable. To overcome this problem, we factorise the conditional likelihood function. From (17), and using factorisation (10) together with the fact that X_t | Z_{t-1} depends only on X_{t-1}, we get

    L(θ | z_0) = ∏_{t=1}^{n-1} f(x_t | x_{t-1}) × ∏_{t=1}^{m-1} f(y_t | x_t, z_{t-1}).      (18)

So as to work out the analytical expressions for the estimators of the unknown parameters under study, we have to decompose the entire likelihood function (18) into easily manipulated components.
For Gaussian VAR processes, the conditional maximum likelihood estimators coincide with the least squares estimators [13]. Therefore, we may find a solution to the problem just raised in a geometrical context. The identification of such components relies on two of the most famous theorems in Euclidean space: the Orthogonal Decomposition Theorem and the Approximation Theorem [14, Volume I, pages 572-575]. Based on these tools, it is straightforward to establish that the estimation subspaces associated with the conditional distributions X_t | X_{t-1} and Y_t | X_t, X_{t-1}, Y_{t-1} are, by construction, orthogonal to each other. This means that each element belonging to one of those subspaces is uncorrelated with each element of its orthogonal complement. Hence, events that happen in one subspace provide no information about events in the other subspace.
The aforementioned arguments guarantee that the decomposition of the joint likelihood into two components can be carried out with no loss of information for the whole estimation procedure. From (18) we can thus decompose the conditional log-likelihood function as follows:

    ℓ(θ | z_0) = ℓ_1 + ℓ_2.      (19)

Henceforth, ℓ_1 denotes the log-likelihood from the marginal distribution of X_t, based on the whole sampled data of dimension n, that is, x_0, x_1, ..., x_{n-1}. The function ℓ_2 represents the log-likelihood from the conditional density of Y_t | X_t, Z_{t-1}, computed from the bivariate sample of size m:

    (x_0, y_0), (x_1, y_1), ..., (x_{m-1}, y_{m-1}).      (20)

The components ℓ_1 and ℓ_2 of (19) will be maximised separately in Section 4.1.
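The following sketch computes the factored conditional log-likelihood (19) under the reconstructed parameterisation; the function name and signature are illustrative assumptions, and `x` carries n observations while `y` carries the first m (m < n), following the monotone pattern above.

```python
# Factored conditional log-likelihood l = l1 + l2 of (19); a sketch under
# the assumptions stated in the text, not the paper's own code.
import numpy as np

def conditional_loglik(theta, x, y):
    """theta = (alpha0, alpha1, sigma2_e, psi0, psi1, psi2, beta1, psi3)."""
    a0, a1, s2e, p0, p1, p2, b1, p3 = theta
    n, m = len(x), len(y)
    # l1: marginal AR(1) log-likelihood of X on all n observations, t = 1..n-1.
    e = x[1:] - a0 - a1 * x[:-1]
    l1 = -0.5 * (n - 1) * np.log(2 * np.pi * s2e) - np.sum(e**2) / (2 * s2e)
    # l2: conditional log-likelihood of Y | X on the m bivariate pairs, t = 1..m-1.
    r = y[1:m] - p0 - p1 * y[:m - 1] - p2 * x[:m - 1] - b1 * x[1:m]
    l2 = -0.5 * (m - 1) * np.log(2 * np.pi * p3) - np.sum(r**2) / (2 * p3)
    return l1 + l2
```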

Maximum Likelihood Estimators for the Parameters
In Section 4.1, the m.l.e. of the parameters of the VAR(1) Model with fragmentary data are deduced. The precision of the estimators is examined in Section 4.2.

Analytical Expressions.
The theoretical developments carried out in this section rely on solving the log-likelihood equations obtained from the factored log-likelihood given by (19). Before proceeding with theoretical matters, we introduce some relevant notation. Let

    x̄_k^{(j)} = (1/k) Σ_{t=1}^{k} x_{t-j}

represent the sample mean lagged j time units, j = 0, 1; the subscript k, k = 1, ..., n - 1, identifies the number of observations taking part in the computation of the sample mean. A similar notation, ȳ_k^{(j)}, is used to denote the sample mean of the random sample y_0, ..., y_k, for k = 1, ..., m - 1. According to this definition, the sample variance of each univariate random variable based on k observations and lagged j time units is denoted by γ̂_k^{(j)}.
(i) Maximising the log-likelihood function ℓ_1: based on (11), ℓ_1 is the Gaussian conditional log-likelihood of an AR(1) Model fitted to the complete sample x_0, ..., x_{n-1}. Its maximisation yields

    α̂_1 = Σ_{t=1}^{n-1} (x_t - x̄_{n-1}^{(0)})(x_{t-1} - x̄_{n-1}^{(1)}) / Σ_{t=1}^{n-1} (x_{t-1} - x̄_{n-1}^{(1)})²,
    α̂_0 = x̄_{n-1}^{(0)} - α̂_1 x̄_{n-1}^{(1)},
    σ̂²_ε = S*_{n-1} / (n - 1).

(ii) Maximising the log-likelihood function ℓ_2: based on (12) and (13) we get the log-likelihood function

    ℓ_2 = -((m - 1)/2) ln(2πψ_3) - (1/(2ψ_3)) Σ_{t=1}^{m-1} (y_t - ψ_0 - ψ_1 y_{t-1} - ψ_2 x_{t-1} - β_1 x_t)².      (25)

We readily find that the m.l.e. (ψ̂_0, ψ̂_1, ψ̂_2, β̂_1) solve the least squares normal equations of the regression of y_t on (1, y_{t-1}, x_{t-1}, x_t), t = 1, ..., m - 1, and that

    ψ̂_3 = S*_{m-1} / (m - 1),

where S*_k denotes the corresponding residual sum of squares.
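A compact implementation sketch of the two separate maximisations, exploiting the coincidence of conditional m.l.e. and least squares for Gaussian VAR processes noted in Section 3; the function name is an illustrative assumption.

```python
# Closed-form conditional m.l.e. via least squares, mirroring the two
# separate maximisations of l1 and l2; a sketch under the stated assumptions.
import numpy as np

def fit_fragmentary_var1(x, y):
    n, m = len(x), len(y)
    # (i) Maximise l1: regress x_t on (1, x_{t-1}), t = 1, ..., n-1.
    X1 = np.column_stack([np.ones(n - 1), x[:-1]])
    alpha = np.linalg.lstsq(X1, x[1:], rcond=None)[0]   # (alpha0, alpha1)
    resid1 = x[1:] - X1 @ alpha
    sigma2_e = np.sum(resid1**2) / (n - 1)
    # (ii) Maximise l2: regress y_t on (1, y_{t-1}, x_{t-1}, x_t), t = 1, ..., m-1.
    X2 = np.column_stack([np.ones(m - 1), y[:m - 1], x[:m - 1], x[1:m]])
    psi = np.linalg.lstsq(X2, y[1:m], rcond=None)[0]    # (psi0, psi1, psi2, beta1)
    resid2 = y[1:m] - X2 @ psi
    psi3 = np.sum(resid2**2) / (m - 1)                  # residual SS / (m - 1)
    return alpha, sigma2_e, psi, psi3
```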
Using the results from Section 2, we get the following estimators for the original parameters:

    δ̂_1 = ψ̂_1,    δ̂_0 = ψ̂_0 + β̂_1 α̂_0,    δ̂_2 = ψ̂_2 + β̂_1 α̂_1,
    σ̂_εu = β̂_1 σ̂²_ε,    σ̂²_u = ψ̂_3 + β̂_1² σ̂²_ε.      (27)

Thus, the analytical expressions for the estimators of the mean values, variances, and covariances of the VAR(1) Model are given by

    μ̂_X = α̂_0 / (1 - α̂_1),    μ̂_Y = (δ̂_0 + δ̂_2 μ̂_X) / (1 - δ̂_1),      (28)

together with the estimators σ̂²_X, σ̂²_Y, and σ̂_XY obtained by evaluating the covariance expressions (3)-(7) at the estimated parameters (at the same date t, t ∈ Z).
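The back-transformation in (27)-(28) can be written directly, continuing the sketch above (again, names are illustrative assumptions):

```python
# Back-transformation to the original parameters and the mean estimators
# in (28); a sketch following the reconstructed mappings in the text.
def back_transform(alpha, sigma2_e, psi, psi3):
    alpha0, alpha1 = alpha
    psi0, psi1, psi2, beta1 = psi
    delta1 = psi1
    delta0 = psi0 + beta1 * alpha0
    delta2 = psi2 + beta1 * alpha1
    sigma_eu = beta1 * sigma2_e
    sigma2_u = psi3 + beta1**2 * sigma2_e
    mu_x = alpha0 / (1 - alpha1)                    # mean of X
    mu_y = (delta0 + delta2 * mu_x) / (1 - delta1)  # mean of Y
    return delta0, delta1, delta2, sigma2_u, sigma_eu, mu_x, mu_y
```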
These estimators will play a central role in the following sections.

Precision of the Estimators.
In this section, the precision of the maximum likelihood estimators underlying equations (28) is derived. The analysis is separated into three stages. First, we study the statistical properties of the vector Θ̂ = [Θ̂_1ᵀ Θ̂_2ᵀ]ᵀ, with Θ̂_1 = [α̂_0 α̂_1 σ̂²_ε]ᵀ and Θ̂_2 = [ψ̂_0 ψ̂_1 ψ̂_2 β̂_1 ψ̂_3]ᵀ. For notational consistency, the unknown parameter β_1 is also denoted by ψ_4; that is, ψ_4 ≡ β_1. Secondly, we derive the precision of the m.l.e. of the original parameters of the VAR(1) Model (see (1)). Finally, we focus our attention on the estimators of the mean vector and the variance-covariance matrix at lag zero of the VAR(1) Model with a monotone pattern of missingness.
A few points are worth mentioning. From Section 3 we know that there is no loss of information in maximising separately the log-likelihood functions ℓ_1 and ℓ_2 of (19). As a consequence, the variance-covariance matrix associated with the whole set of estimated parameters is a block diagonal matrix. For a sufficiently large sample size, the distribution of the maximum likelihood estimator is accurately approximated by the following multivariate Gaussian distribution:

    Θ̂ ~ N(Θ, diag(I_1^{-1}, I_2^{-1})),      (29)

where I_1 and I_2 denote the Fisher information matrices from the components ℓ_1 and ℓ_2 of the log-likelihood function, respectively (see (19)). There is an asymptotic equivalence between the Fisher information matrix and the Hessian matrix (see [8, ch. 2]). Moreover, since Θ̂ → Θ, there is also an asymptotic equivalence between the Hessian matrices computed at the points Θ̂ and Θ. Henceforth, the Fisher information matrices in (29) are estimated, respectively, by the negative Hessians of ℓ_1 and ℓ_2 evaluated at Θ̂:

    Î_1 = -H_{ℓ_1}(Θ̂_1),    Î_2 = -H_{ℓ_2}(Θ̂_2).      (30)

To lighten notation, from now on we suppress the "hat" from the consistent estimators of the information matrices.
The variance-covariance matrix of Θ̂_1 takes the following form:

    Σ̂_Θ̂_1 = I_1^{-1},    I_1 = [ (n-1)/σ̂²_ε           Σ x_{t-1}/σ̂²_ε    0 ;
                                  Σ x_{t-1}/σ̂²_ε        Σ x²_{t-1}/σ̂²_ε   0 ;
                                  0                      0                 (n-1)/(2σ̂⁴_ε) ],      (31)

with sums taken over t = 1, ..., n - 1. We stress that there is orthogonality between the error and the estimation subspaces underlying the log-likelihood function ℓ_1.
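Anticipating the block structure of I_2 obtained next, the following sketch assembles the estimated information matrices and the block diagonal covariance in (29); it continues the earlier code blocks and relies on the Gaussian regression structure, where the coefficient block is XᵀX divided by the error variance and the variance entry is k/(2·variance²).

```python
# Estimated Fisher information blocks (30) and the block diagonal
# covariance of (29); a sketch under the assumptions stated in the text.
import numpy as np
from scipy.linalg import block_diag

def information_blocks(x, y, sigma2_e, psi3):
    n, m = len(x), len(y)
    X1 = np.column_stack([np.ones(n - 1), x[:-1]])
    I1 = block_diag(X1.T @ X1 / sigma2_e, (n - 1) / (2 * sigma2_e**2))
    X2 = np.column_stack([np.ones(m - 1), y[:m - 1], x[:m - 1], x[1:m]])
    I2 = block_diag(X2.T @ X2 / psi3, (m - 1) / (2 * psi3**2))
    # Approximate covariance of the full estimator: block diagonal inverse.
    return block_diag(np.linalg.inv(I1), np.linalg.inv(I2))
```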
Calculating the second derivatives of the log-likelihood function ℓ_2 results in the following approximate information matrix:

    I_2 = [ (1/ψ̂_3) Σ_{t=1}^{m-1} w_t w_tᵀ    0 ;  0ᵀ    (m-1)/(2ψ̂_3²) ],    w_t = [1  y_{t-1}  x_{t-1}  x_t]ᵀ.      (32)

Once again, we mention that there is orthogonality between the error and the estimation subspaces underlying the log-likelihood function ℓ_2. The matrix I_2 can be written in a compact form:

    I_2 = [ I_21   0 ; 0ᵀ   i_22 ],      (33)

where the (4 × 4) submatrix I_21 and the scalar i_22 are, respectively, defined as

    I_21 = (1/ψ̂_3) Σ_{t=1}^{m-1} w_t w_tᵀ,    i_22 = (m-1)/(2ψ̂_3²).      (34)

Using the above partition of I_2, it is rather simple to compute the inverse matrix. In fact,

    I_2^{-1} = [ I_21^{-1}   0 ; 0ᵀ   1/i_22 ],      (35)

with

    Σ̂_Θ̂_2 = I_2^{-1}.      (36)

Unfortunately, there is no explicit expression for the inverse matrix I_21^{-1}. As a result, there are no explicit expressions for the approximate variances and covariances of the m.l.e. of the vector of unknown parameters Θ_2.

Now, we have to analyse the precision of the m.l.e. of the original parameters of the VAR(1) Model, that is, Υ = [α_0 α_1 σ²_ε δ_0 δ_1 δ_2 σ²_u σ_εu]ᵀ. Recalling from Section 2, the one-to-one monotone functions that relate the two vectors of parameters are

    δ_0 = ψ_0 + β_1 α_0,    δ_1 = ψ_1,    δ_2 = ψ_2 + β_1 α_1,
    σ²_u = ψ_3 + β_1² σ²_ε,    σ_εu = β_1 σ²_ε.      (37)

The parameters α_0, α_1, and σ²_ε remain unchanged. A key assumption in the following developments is that neither the estimates of the unknown parameters nor their true values fall on the boundary of the allowable parameter space.
The variance-covariance matrix of the m.l.e. of the vector of parameters Υ is obtained by a first-order Taylor expansion at Υ. We also use the chain rule for derivatives of vector fields (for details, see [14, Volume II, pages 269-275]).
Writing the vector of parameters Υ as a function of the vector Θ, the respective first-order partial derivatives can be gathered in the following partitioned matrix:

    D = [ D_1   D_2 ; D_3   D_4 ],      (38)

where the (3 × 3) submatrix D_1 corresponds to the first-order partial derivatives of the vector [α_0 α_1 σ²_ε]ᵀ with respect to itself, which means that D_1 is nothing but the identity matrix of order 3, D_1 = I_3. On the other hand, this also means that the derivatives of these parameters with respect to ψ_0, ψ_1, ψ_2, ψ_4, and ψ_3 are zero; in other words, the (3 × 5) submatrix D_2 is the null matrix, D_2 = 0.
The (5 × 3) submatrix D_3 and the (5 × 5) submatrix D_4 are composed of the first-order partial derivatives of each component of the vector [δ_0 δ_1 δ_2 σ²_u σ_εu]ᵀ with respect to [α_0 α_1 σ²_ε]ᵀ and [ψ_0 ψ_1 ψ_2 ψ_4 ψ_3]ᵀ, respectively. From (37), their structures are given by

    D_3 = [ β_1  0    0 ;
            0    0    0 ;
            0    β_1  0 ;
            0    0    β_1² ;
            0    0    β_1 ],
    D_4 = [ 1  0  0  α_0        0 ;
            0  1  0  0          0 ;
            0  0  1  α_1        0 ;
            0  0  0  2β_1σ²_ε   1 ;
            0  0  0  σ²_ε       0 ].      (39)

To find the approximate variance-covariance matrix of the maximum likelihood estimators of the unknown vector of parameters Υ, it is only necessary to pre- and postmultiply the variance-covariance matrix arising from expressions (29), (31), and (36) by the matrix D and its transpose Dᵀ, respectively. More precisely,

    Σ̂_Υ̂ = D diag(Σ̂_Θ̂_1, Σ̂_Θ̂_2) Dᵀ.      (40)

Hence,

    Σ̂_Υ̂ = [ Σ̂_Θ̂_1         Σ̂_Θ̂_1 D_3ᵀ ;
             D_3 Σ̂_Θ̂_1    D_3 Σ̂_Θ̂_1 D_3ᵀ + D_4 Σ̂_Θ̂_2 D_4ᵀ ],      (41)

with Σ̂_Υ̂ denoting the variance-covariance matrix of the m.l.e. of the vector of unknown parameters Υ. A more detailed analysis of the variance-covariance matrix (41) can be found in Nunes [12, ch. 3, p. 91-92].
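A delta-method sketch for (40)-(41): the Jacobian entries below follow the reconstructed mappings (37), and the function name is an illustrative assumption.

```python
# Jacobian D of (38)-(39) and the delta-method covariance of (40)-(41);
# a sketch, not the paper's code.
import numpy as np

def jacobian_D(alpha0, alpha1, sigma2_e, beta1):
    # Rows of the lower block: delta0, delta1, delta2, sigma2_u, sigma_eu.
    # Columns: alpha0, alpha1, sigma2_e, psi0, psi1, psi2, beta1, psi3.
    D = np.zeros((8, 8))
    D[:3, :3] = np.eye(3)                                      # D1 = I3
    D[3, 0] = beta1; D[3, 3] = 1.0; D[3, 6] = alpha0           # delta0
    D[4, 4] = 1.0                                              # delta1 = psi1
    D[5, 1] = beta1; D[5, 5] = 1.0; D[5, 6] = alpha1           # delta2
    D[6, 2] = beta1**2; D[6, 6] = 2 * beta1 * sigma2_e; D[6, 7] = 1.0  # sigma2_u
    D[7, 2] = beta1; D[7, 6] = sigma2_e                        # sigma_eu
    return D
```

Pre- and postmultiplying the block diagonal covariance from the previous sketch by `jacobian_D(...)` and its transpose then yields the approximate covariance of the original parameters, as in (40).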
We can now deduce the approximate variance-covariance matrix of the maximum likelihood estimators of the mean vector and the variance-covariance matrix at lag zero of the VAR(1) Model with a monotone pattern of missingness, represented by

    Ξ = [α_0 α_1 σ²_ε μ_X μ_Y σ²_X σ²_Y σ_XY]ᵀ.

The first-order partial derivatives of the vector Ξ with respect to the vector Υ are placed in a matrix denoted by F. In accordance with the partition of the matrix D into four blocks (expression (38)), it takes the following partitioned form:

    F = [ F_1   F_2 ; F_3   F_4 ].      (42)

The (3 × 3) submatrix F_1 corresponds to the partial derivatives of α_0, α_1, and σ²_ε with respect to themselves; as a consequence, F_1 is the identity matrix of order 3, F_1 = I_3. As regards the (3 × 5) submatrix F_2, its elements correspond to the partial derivatives of α_0, α_1, and σ²_ε with respect to δ_0, δ_1, δ_2, σ²_u, and σ_εu; therefore, F_2 = 0. The partial derivatives of μ_X, μ_Y, σ²_X, σ²_Y, and σ_XY with respect to α_0, α_1, and σ²_ε are gathered in the (5 × 3) submatrix F_3, expression (43). The 5-dimensional square submatrix F_4 corresponds to the partial derivatives of μ_X, μ_Y, σ²_X, σ²_Y, and σ_XY with respect to δ_0, δ_1, δ_2, σ²_u, and σ_εu, with its nonnull elements given by expressions (44) and (45), obtained by differentiating (2) and the covariance expressions (3)-(7).

Straightforward calculations pave the way to the desired partitioned variance-covariance matrix, called here Σ̂_Ξ̂, with its submatrices defined by

    Σ̂_Ξ̂^{11} = Σ̂_Θ̂_1,    Σ̂_Ξ̂^{12} = Σ̂_Θ̂_1 Gᵀ,    Σ̂_Ξ̂^{21} = G Σ̂_Θ̂_1,    Σ̂_Ξ̂^{22} = G Σ̂_Θ̂_1 Gᵀ + H Σ̂_Θ̂_2 Hᵀ,      (46)

where the matrix G, defined as G = F_3 + F_4 D_3, corresponds to the first-order partial derivatives of the composite functions that relate μ_X, μ_Y, σ²_X, σ²_Y, and σ_XY to the vector of parameters Θ_1, and the elements of the matrix H = F_4 D_4 are the first-order partial derivatives of the composite functions that relate μ_X, μ_Y, σ²_X, σ²_Y, and σ_XY to the vector of unknown parameters Θ_2. The 3-dimensional square submatrix Σ̂_Ξ̂^{11} corresponds to the approximate covariance structure of the m.l.e. of the parameters α_0, α_1, and σ²_ε. The (3 × 5) submatrix Σ̂_Ξ̂^{12} is composed of the approximate covariances between the m.l.e. just cited and μ̂_X, μ̂_Y, σ̂²_X, σ̂²_Y, and σ̂_XY; its transpose is denoted by Σ̂_Ξ̂^{21}. Finally, Σ̂_Ξ̂^{22} is formed by the variances and covariances of the m.l.e. of μ_X, μ_Y, σ²_X, σ²_Y, and σ_XY.

The main point of this section is to study the variances and covariances that take part in the submatrix Σ̂_Ξ̂^{22}, so it is of interest to further explore its analytical expression. The matrix G takes a cumbersome form; the most efficient way to deal with it is to consider its partition rather than the whole matrix at once. Let

    G = [ G_11   G_12 ; G_21   g_22 ],      (48)

where the (4 × 2) submatrix G_11 collects the derivatives of μ_X, μ_Y, σ²_X, and σ²_Y with respect to α_0 and α_1, with the element g_43 defined by (45); the 4-dimensional column vector G_12 contains their derivatives with respect to σ²_ε; the 2-dimensional row vector G_21 contains the derivatives of σ_XY with respect to α_0 and α_1; and the scalar g_22 is its derivative with respect to σ²_ε. On the other hand, we can also make the following partition of the matrix H = F_4 D_4:

    H = [ H_11   H_12 ; H_21   h_22 ],      (53)

where the submatrix H_11 corresponds to the first-order partial derivatives of the vector [μ_X μ_Y σ²_X σ²_Y]ᵀ with respect to the vector [ψ_0 ψ_1 ψ_2 ψ_4]ᵀ, whereas their derivatives with respect to the parameter ψ_3 constitute the submatrix H_12. The submatrix H_21 is composed of the first-order partial derivatives of σ_XY with respect to each component of the vector [ψ_0 ψ_1 ψ_2 ψ_4]ᵀ. Finally, the scalar h_22 = ∂σ_XY/∂ψ_3 = 0.
The desired variance-covariance matrix can therefore be written in the following partitioned form:

    Σ̂_Ξ̂ = [ Σ̂_Θ̂_1        Σ̂_Θ̂_1 Gᵀ ;
             G Σ̂_Θ̂_1     G Σ̂_Θ̂_1 Gᵀ + H Σ̂_Θ̂_2 Hᵀ ].      (54)

Conclusions

In this paper, we have proposed maximum likelihood estimators for the parameters of the VAR(1) Model based on a monotone missing data pattern. The precision of the estimators has also been derived. Special attention has been given to the estimator of the mean value of the stochastic process whose sampled data has missing values, μ_Y. We have compared the performance of the estimator of μ_Y based on the VAR(1) Model with a monotone pattern of missing data with those obtained from both the ARMA(2,1) Model and the AR(1) Model. By simulation studies, we have shown that the estimator derived in this article based on the VAR(1) Model performs better than those derived in the univariate context. It is essential to emphasise that, even numerically, it was quite difficult to compute the precision of the latter estimators, as we have shown in Section 4.2.
A compelling question remains unresolved. From an applied point of view, it would be extremely useful to develop estimators for the dynamics of the stochastic processes. More precisely, we would like to obtain estimators of the correlation and cross-correlation matrices, as well as their precision, when there are missing observations in one of the data sets. It was not possible to achieve this goal based on maximum likelihood principles: as shown in Section 4.2, we have only developed estimators for the covariance matrix at lag zero. In future research, we will try to solve this problem within the framework of the Kalman filter.