Identification and Forecasting in Mortality Models

Mortality models often have inbuilt identification issues challenging the statistician. The statistician can choose to work with well-defined freely varying parameters, derived as maximal invariants in this paper, or with ad hoc identified parameters which at first glance seem more intuitive, but which can introduce a number of unnecessary challenges. In this paper we describe the methodological advantages from using the maximal invariant parameterisation and we go through the extra methodological challenges a statistician has to deal with when insisting on working with ad hoc identifications. These challenges are broadly similar in frequentist and in Bayesian setups. We also go through a number of examples from the literature where ad hoc identifications have been preferred in the statistical analyses.


Introduction
Mortality models are commonly used in a wide range of fields such as actuarial sciences, epidemiology, and sociology. They are often used in important decisions such as how to deal with unisex legislation in the pension industry; see Ornelas et al. [1] and Jarner and Kryger [2]. However, such models do often have inbuilt identification issues stemming from overparametrisation. While identification issues are omnipresent in statistical modelling, this paper focuses on mortality modelling, where estimated parameters are treated as time series and extrapolated to give forecasts of future mortality. The underlying theme of this paper is to provide strategies of avoiding arbitrariness resulting from the identification process. We suggest two ways forward. First, we can reparametrise the model in terms of a freely varying parameter, which therefore has to be of lower dimension than the original parameter. Secondly, we can work with an identified version of the original parameter as long as we keep track of the consequences of the identification choice. That way we ensure that two researchers making different identification choices get the same statistical inferences and forecasts.
A simple example is the age-period model for an ageperiod array of mortality rates. It is well-known that the levels of the age-and period-effects cannot be determined from the likelihood representing the overparametrisation of the model. When the estimated age-and period-effects are treated as time series and subjected to plotting and extrapolation, then our approach ensures that the statistical analysis is the same for two researchers identifying the above model in two different ways. Whereas this issue is relatively simple for the age-period model, identification becomes more tricky for complicated models such as the age-period-cohort model and the model of Lee and Carter [3], let alone twosample situations.
Mortality models are built as a combination of age, period, and cohort-effects, but the likelihood only varies with a surjective function of these time effects. The time effects can be divided into two parts. One part that moves the likelihood function and another part which does not induce variation in the likelihood function. We will argue that all inferences and forecasts should be concerned primarily with the part of the parameter that moves the likelihood function. This does not preclude the researcher from working with the time effects, but it gives some limitations on what can be done. 2 The Scientific World Journal This is important because the motivation and the intuition of mortality models typically originate in the time effects. For instance, in the context of an age-period-cohort model linear trends cannot be identified so time series plots of the time effects need to be invariant to linear trends and extrapolations of time effects must preserve the arbitrary linear trend in the time effects. This applies regardless of whether the identification issue is dealt with in a frequentist manner or by Bayesian methods.
To formalise the discussion slightly return to the ageperiod example. Denote the predictor for the age-period data array by . The age-period model then determines how the predictor varies with a vector summarising age and period effects. That vector is split into two components and so that the predictor only depends on through but not on which cannot be identified by statistical analysis. In the age-period example could reflect the contrasts and the overall level of the predictor , whereas reflects the level of the age effect. The more principled solution is then to work exclusively with and simply consider as a motivation rather than the objective of the analysis. Another solution is to ad hoc identify based on a notion of mathematical convenience or based on a particular purpose given the substantive context.
Once an ad hoc identification of is chosen the identification problem appears to go away, because the likelihood analysis can now go through. The reason is that the variation of is now reduced to the variation of precisely because is fixed. Suppose two researchers choose the same likelihood and the same parametrisation of but different ad hoc identifications † and ‡ . Which of their conclusions will be the same and which will be different? As the likelihood only depends on the fits of the two researchers will be identical. But differences might arise if the statistical inference or forecasting or any other statistical analysis involves in some way.
Indeed, with many extrapolation methods forecasts will be invariant to the choice of . But, there will also be extrapolation methods where this is not the case. Examples arise in the age-period-cohort model, where linear trends have to be handled with care.
We will start by analysing linearly parametrised models at a rather general level. We do this with two aspects in mind. First, we need to step back to a point in the analysis before ad hoc identification is made. Secondly, we also want to avoid the discussion of how to choose and , which tend to be specific to the mortality model in question. Working at the general level we can focus on the mappings between different parametrisations and the invariance properties coming from these mappings. It is then seen that the parameter arises as a maximal invariant. The general setting also allows the formulation of a series of results discussing different types of ad hoc identification, first in a frequentist fashion and then in a Bayesian fashion.
Subsequently, we will consider the age-period-cohort model in detail, both for one-and two-sample situations. Using the general results it becomes easier to see that a number of popular methods inadvertently include features that are not invariant to ad hoc identification. These include the "intrinsic estimator" advocated by Yang et al. [4], the "mixed model approach" by Yang and Land [5], the Bayesian approach by Berzuini and Clayton [6], and the two-sample analysis by Riebler and Held [7]. Finally, we consider the nonlinearly parametrised model of Lee and Carter [3]. The nonlinearity gives a further complication since the mapping from the time effects to the mortality predictor is nondifferentiable. As it turns out the mortality predictor varies in a smooth space, so the nondifferentiability is avoided by working directly with the mortality predictor instead of the original time effects. Instead, a Lee-Carter application should consider whether a certain matrix has rank of unity or zero. Apart from that the analysis is similar to that of linearly parametrised models. Likewise a theory is given for twosample situations.
Throughout the paper our concern rests exclusively with the identification problem and the consequences of ad hoc identification for estimation, plots, inference, and forecasting. In practice, important additional concerns are how to choose appropriate models and forecasting methods. We would like to refer to Girosi and King [8], Pitacco et al. [9] for general discussions of these issues, and also to Kuang et al. [10] and Coelho and Nunes [11] for discussions of forecast methods in the light of structural breaks. Instead, the aim of the paper is to present an overall framework that can help streamlining the identification discussion that has appeared in so many papers in so many fields over so many years. Section 2 of this paper considers standard linear statistical models, which lend themselves to a relative straightforward analysis based on linear algebra. Any ad hoc identification splits the time effect into two components. The first component is an arbitrary component, which is not needed for the identification of the likelihood. The other component is necessary and sufficient to identify the model and hence sufficient for statistical analysis. In Section 3 it is outlined how to analyze the statistical model when the latter component is ad hoc identified. It is argued that this can cause difficulties for estimation, interpretation, and forecast. In Section 4 it is shown that Bayesian analysis shares the same challenges as the frequentist approach. In Sections 5 and 6 we study the two particular examples: the omnipresent age-period-cohort and Lee-Carter mortality models. All proofs are collected in the Appendix.

Statistical Models with Linear Parametrisations
In this section we present the identification problem in a linear framework. The problem is solved by analysing the mapping from the original time effect to the predictor which, in turn, leads to standard statistical analysis. In Section 6 we show how these ideas transfer to a nonlinear context. This contrasts with Section 3 in which we illustrate the analytical challenges and inconveniences arising from ad hoc identification. In Section 2.1 we present the overparametrized linear model for the mortality predictor. The identification problem The Scientific World Journal 3 is defined in Section 2.2 via the likelihood. In an overparametrized linear model two different parameters might produce the same likelihood. In Section 2.3 we analyze the mapping from the overparametrised parameter to the predictor. This mapping enables us to split the overparametrised parameter into two. One arbitrary parameter and one parameter identify the model without being overparametrised. This latter parameter is shown to be a maximal invariant parameter. In Section 2.4 it is demonstrated how any statistical analysis can be based on this maximal invariant parameter alone. In particular we comment that visual data representations, hypothesis testing, and forecasting are simple and well defined. This in turn leads to standard statistical analysis.
The analysis of the linearly parametrised involves projections on linear or affine spaces and on their orthogonal complements. It is therefore convenient to introduce the following notation. A matrix has full column rank if is invertible. In this case the orthogonal complement ⊥ is a matrix so ⊥ = 0 and ( , ⊥ ) is invertible. Thus, when itself is invertible then ⊥ is the empty matrix. It is not difficult to calculate ⊥ in practise, an explicit construction of ⊥ follows from a singular value decomposition of , choosing ⊥ as the eigenvectors associated with the zero eigenvalues. Moreover, let = ( ) −1 so that is the identity matrix, while ⊥ = ⊥ ( ⊥ ⊥ ) −1 .

The Model.
Think of the time effect as our preferred intuitive, but unidentified parameter, and think of the predictor as some function of specifying the model at hand. In a Poisson type model, where the mean specifies the distribution, could be the log of that mean. Such Poisson models are omnipresent in mortality models. We will often think of as containing some time effects. Often forecasting is carried out simply by isolating and extrapolating such a time effect.
Consider a data vector of dimension . This could, for instance, be the vector consisting of the stacked mortality rates for a rectangular age-period array of dimension × in which case = . The statistical model for could be a generalized linear model. This involves an appropriately chosen distribution and a link function, which links the expected mortality rate to an -dimensional predictor, which is denoted by . Taken together this defines a likelihood function L( ; ).
The model for the predictor is constructed in terms of, for instance, age, period, and cohort time effects. These time effects are summarized in a vector , which is of dimension < . Therefore is a surjective function of . For the moment the specification of the predictor is assumed linear so that for some design matrix ∈ R × . We refer to this specification as the mortality model, while the space Θ is the time effect space. The time effect space is chosen as an unrestricted real space in accordance with the starting point of most mortality analyses.
The parameter space for the likelihood function and therefore for the statistical model is given by the range of variation for the predictor ; that is, The likelihood function is assumed uniquely identified on this space in the sense that for all pairs of predictors so † ̸ = ‡ then the likelihood of † , ‡ differ; that is, for in a set with positive probability.

The Identification
Problem. The identification problem of mortality models arises when the mapping from the time effect space Θ to the parameter space is surjective but not injective. With a linear parametrisation this arises when the design matrix has reduced column rank < so is singular. In this situation there exists time effects † ̸ = ‡ with the same likelihood: for all data . Then the time effect space Θ is not useful as parameter space for the statistical model.

Analysing the Mapping
→ . When analysing the mapping from our intuitively preferred parametrisation into the linear predictor , we will be able to rewrite as a sum of two components: one is a function of the predictor and the other is the arbitrary part varying with , but not with the predictor. We provide two methods for analysis.
The first method is to find a basis ∈ R × with full column rank for the design . The design matrix of the mortality model can then be expressed as = for some matrix ∈ R × with full column rank . Introduce a new -dimensional parameter: The parameter space can then be written more parsimoniously as The mapping from to is bijective, so the statistical model can just as well be parametrised in terms of ∈ Ξ = R . Alternatively, the identification problem can be expressed through an invariance argument. This argument relates to the parameterization but resembles the classical invariance argument for reduction of data; see Cox and Hinkley [12, page 157]. With a linear parametrisation the argument involves the orthogonal complement to the matrix . That is a matrix ⊥ ∈ R ×( − ) which has the properties that ⊥ = 0 and that ( , ⊥ ) is invertible. The mortality model (1) is defined by the mapping from Θ = R to . This mapping is surjective in that two different values of may result in the same and therefore the same likelihood. These equivalence classes in the time effect space can be described by the group of transformations acting on Θ for arbitrary ∈ R − . Indeed, it holds that and ( ) will result in the same . The mapping (7) is therefore invariant to the group . We will argue that the parameter = is a maximal invariant to the group acting on Θ, which provides a link with (6). It has to be argued that for any † , ‡ so that † = † equals ‡ = ‡ then ‡ = ( † ), see Cox and Hinkley [12, page 159]. For this argument use the orthogonal projection identity to write In applications it can be difficult to find a basis for the design . It can be easier to find a group and hence ⊥ and then use this information to construct and a candidate basis = , noting that = . This argument leaves it to be proven that is a basis, or equivalently, that the suggested group actually describes the equivalence classes of the mapping from to .
It is useful to note that in the choices of , only the spaces spanned by them are unique since = −1 for any invertible ∈ R × . Likewise, the maximal invariant is only unique up to bijective transformations. This lack of uniqueness has no impact on the analysis of the likelihood albeit it influences interpretations.

Statistical Analysis Using the Maximal Invariant Param-
eter. The statistical model parametrised with the maximal invariant parameter can be analysed by standard statistical techniques. This contrasts to a range of problems that arise when working with an ad hoc identified time effect . In the following the relatively simple standard statistical analysis of the model parametrised by is discussed with respect to likelihood theory, interpretation, plots, hypothesis testing, forecasting, and Bayesian analysis. In Sections 3 and 4 we give an overview of the much more complicated theory underpinning models parametrised by the ad hoc identified time effect . Age-period-cohort examples follow in Section 5.

Exponential Family Theory.
Suppose the likelihood is drawn from a generalized linear model based on an exponential family. Then the model is actually a regular exponential family where the maximal invariant parameter is the canonical parameter since it is freely varying in a real space; see Barndorff-Nielsen [13, page 116]. This opens up for a wealth of convenient statistical properties such as a likelihood equation with a simple expression and explicit conditions for a unique solution. In contrast, ad hoc identified parameters are based on an injective mapping of the canonical parameter into ; see Sections 3.1 and 3.2. It is then more difficult to fully exploit the exponential family theory.

Interpretation and Plots.
The maximal invariant parameter varies freely in R . It can therefore be interpreted as the parameter of any standard statistical model. Since is freely varying the coordinates of can be interpreted independently. When is a collection of time effects then can be organised as a collection of time series. Since the coordinates of are freely varying the time series plots of the components of have the usual interpretation of time series. In contrast, ad hoc identified estimators are constrained to a -dimensional subspace Θ of Θ = R , which is often affine but can be more complicated. A consequence is that plots are complicated to evaluate; see Section 3.4.1.

Hypothesis Testing.
Hypotheses are easily formulated and analysed when using the maximal invariant parametrisation. An affine hypothesis that restricts to vary in adimensional affine subspace can be formulated as = for known matrices ∈ R ×( − ) , ∈ R − . This implies a restriction on the predictor = of (6). Form the orthogonal complement ⊥ and recall the orthogonal projection identity = + ⊥ ⊥ so that = + ⊥ ⊥ . Introduce a -dimensional parameter = ⊥ , a design matrix = ⊥ , and an offset = . The restricted parameter space is In an exponential family context both the unrestricted model and the restricted model form regular exponential families. A variety of nice properties then follow for the estimators and the test statistics from the exponential family theory. Examples are given in Sections 5.3 and 5.5.3. In contrast, the hypothesis derived from restrictions on ad hoc identified parameters and the resulting degrees of freedom are complicated to analyse; see Section 3.4.2.

Forecasting.
Most often the objective of a mortality study is to forecast the future mortality. In the linear context, = , this is done by extending the design and by extrapolating .
It is usually easy to extend the design into the forecast horizon. This involves the construction of a triangular block matrix with an appropriate number of extra rows corresponding to the data over the forecast horizon as well as extra columns representing the extra parameters that would be needed: Extrapolating into a vector̃then gives the forecast The Scientific World Journal 5 The extrapolation of the parameter can be done as follows. The estimated parameter, or part of it, can be thought of as a time series. Any forecast techniques from the time series literature applied directly to can be used, subject to the usual contextual considerations. Ad hoc identified time effects can be extrapolated in a similar way; see Section 3.4.3. This may, however, result in avoidable arbitrary effects in the forecast. Necessary and sufficient conditions for this eventuality are given for age-periodcohort models in Section 5.4.3. The practical examples are mainly Bayesian in nature and are discussed next.

Bayesian
Analysis. The introduction of the canonical parameter shows that the likelihood, in Bayesian notation, is of the form ( | ) = ( | ) where is freely varying. A purist Bayesian analysis can simply introduce a prior on the canonical parameter, ( ). This is updated in a straight forward way, resulting in the posterior ( | ) = ( | ) ( )/ ( ).
In contrast, introducing a prior on ad hoc identified parameters gives various difficulties. Only parts of the prior are updated by the likelihood, so that it becomes unclear which information arises from the data and which information arises from the ad hoc identification. Moreover, avoidable arbitrariness is introduced in the forecast; see Section 4. Introduction of hyperparameters exacerbates the issue. Examples are given in Sections 5.4.4, 5.5.2, and 6.1.6.

Working with the Time Effects
In Section 2 we considered the situations where estimation, hypothesis testing a hypothesis, or forecasting is carried out using the canonical parameter. However, there might be situations, where the original time effect parametrisation is preferred, perhaps because it is felt that this parametrisation is particularly helpful in guiding the intuition. This requires ad hoc identification of the time effect . In this section we will guide the considerations a statistician that has to go through when insisting on an analysis based on some nonunique parametrisations. As in Section 2 we focus on linearly parametrised models. Specific examples follow in Sections 5 and 6.
In Section 3.1 ad hoc identification is defined. As an example we consider a least squares estimation problem with collinear regressors in Section 3.2. For the age-period-cohort model reviewed in Section 5 it is common to ad hoc identify in two steps: first identifying levels then the linear trends. We consider such two-step ad hoc identification in Section 3.3. The consequence of ad hoc identification is considered in Section 3.4. Indeed, when forecasting the time effect, we do not want the forecast to depend on the identification scheme. The same applies to graphical visualisation of our data, where the eye may extract patterns that depend on the identification scheme. Likewise, confusion may arise when formulating a hypothesis directly on the time effect parameters.

Ad Hoc Identification.
In this section the time effect parametrisation is considered. An identification scheme has to be introduced when working with the time effects. This may rest on mathematical convenience or it may be chosen for a particular purpose given the substantive context. We therefore call it ad hoc identification. Here we consider a simple identification scheme but turn to a more common two-step identification scheme in Section 3.3.
Once the canonical parameter has been estimated there is often a wish to return to the original time effect . The two are linked through the surjective mapping from Θ = R to Ξ = R . Indeed, since is constructed as a function of the notation for is often chosen to reflect . The canonical parameter does, however, only give partial information about . The remaining part, say , of will have to be chosen by the researcher and combined with .
A linear ad hoc identification of comes about by the researcher choosing a constraint = (14) for some known ∈ R − and some matrix ∈ R ×( − ) chosen so the square matrix ( , ) is invertible. The time effect space Θ is now reduced to an affine subspace Given we can find , through (13) and (14) as ( , ) = ( , ) . At the same time, given values of , and the invertibility of ( , ), the ad hoc identified time effect is found through In this notation a subindex is introduced to avoid confusion with the time effect in the original mortality model. Indeed, there are now four different parameters in play, namely, the original time effect ∈ Θ, the predictor ∈ , the maximal invariant parameter ∈ Ξ and the ad hoc identified time effect ∈ Θ , each of which has a different interpretation. The mapping from to each of , , and is surjective, while there are bijective mappings between the latter three. The interpretations of the time effect and the canonical parameter will inevitably be different. For a start they have different dimensions. Endowing the spaces with Euclidean norms shows that distances in the two spaces Θ and Ξ will be judged differently. The time effect and the ad hoc identified time effect will similarly have different interpretations. Although they have the same dimensions the Euclidean norms on Θ and Θ will be rather different. Confusion may arise in the interpretation of a mortality analysis if there is no clear distinction between and . In addition an unnecessary arbitrariness may arise when making inference on or extrapolating ∈ Θ . We will return to these issues in Section 3.4.
It is perhaps interesting to note that despite the linear parametrisation the ad hoc identification need not be done 6 The Scientific World Journal in a linear fashion as in (14). Indeed it is common for Poisson models with a log link to ad hoc identify through the original multiplicative scale. That means that the ad hoc identification is done nonlinearly through The fit of the model is unaffected by the ad hoc identification. Indeed the fit is measured in terms of the estimate of the predictor = where = . Since the identification is made so = ; the estimated predictor reduces tô regardless of the choice of ad hoc identification.

A Least Squares Example.
As an illustration of estimation in the presence of ad hoc identification consider a normal likelihood. Different, but equivalent, expressions can be found depending on the parametrisation. The likelihood of the predictor is Rewriting it in terms of the canonical parameter it is while introducing the time effect parameter gives L ( , 2 ; ) The likelihood (20) of the canonical parameter can be analysed by the least squares method since the design has full column rank. The maximum likelihood estimator for and the predictor for the data arê Along with the residual variance this is all the information that is given by the likelihood. The likelihood (21) of the time effect only depends on through = . The lack of identification means that the maximum likelihood estimator for has an arbitrary element, so that it is a set valued estimator. Based on (16) this can be expressed byΘ for any so ( , ) is invertible and for any ∈ R − . The fit, however, remains the same and (18) becomeŝ In order to compute actual estimates then , have to be chosen, which amounts to ad hoc identification. For instance, with the ad hoc identifying restrictions = ⊥ and = 0 then̂can be thought of as the least squares estimator of on using the Moore-Penrose generalised inverse for the singular matrix ; see Searle [14, page 212]. See Section 5.4.1 for an example.

3.3.
Step-Wise Identification. It is common to ad hoc identify parameter in a step-wise fashion. In the first step the time effect parameter is only partially constrained. The full identification then follows in a second step. An example is given in Section 5.4.1 for an age-period-cohort model in which the levels of the time effects are constrained in the first step leaving the ad hoc identification of the linear trends to the second step.
The first step constraints are affine of the type for known matrices ∈ R ×( − ) , ∈ R − . The constrained time effect space is then Thereby the -dimensional time effect space Θ is reduced to a -dimensional variation. The properties of this partially ad hoc identified parameter space depends on the rank of the matrix ( , ). If the number of constraints, − , is at most equal to the number of unidentified components − , it is possible that ( , ) has full column rank. In that case the constraint implies a partial ad hoc identification without constraining the parameter space of the statistical model. This is shown in Theorem 1; see also Section 5.4.1 for an example, while the proof is given in the Appendix. When ( , ) has reduced rank the parameter space is also constrained; see Section 3.4.2 for a discussion.
for arbitrary The partial ad hoc identification by (25) implies that any time series analysis of the time effects has to happen The Scientific World Journal 7 relative to the constrained space Θ rather than the space Θ. This is awkward as discussed in Section 3.4 below. It is also considerably more complicated than working with the freely varying canonical parameter ; see Section 2.4.2.

Consequences of Ad Hoc
Identification. In the following we will look closer at the consequences of working with the ad hoc identified time effect parameter in the context of a linear mortality model of the form = . We consider the consequences for plotting, hypothesis testing, and forecasting.

Plots of Time Effects.
In the mortality model (1) the time effect is the concatenation of age, period, and cohort effects. It seems natural to think of these individual time effects as time series and to plot them against time. As the time effect varies in the unrestricted space Θ = R this maps the -vector into unrestricted time series.
Estimates of the time effects are constructed by combining an estimate of with an ad hoc chosen value for = , see (14). The resulting estimatêis therefore constrained to the space Θ ⊂ Θ. The interpretation of the estimateî s therefore different from the interpretation of the original time effect . Distances on the spaces Θ and Θ are judged differently and the variability of̂is deduced exclusively from̂through (16). The time series components of̂are now restricted through = . Plots of thê-time series are therefore interpreted differently from the imagined plots of the original -time series and from the plots of the maximal invariant parameter discussed in Section 2.4.2. Indeed, if one were to analyse the estimated̂-time series statistically the linear constraint should be taken into account. This is a bit complicated as illustrated below, but it is the consequence of working with the ad hoc identified parameter rather than the canonical parameter .
Attempts to give intrinsic meaning to will be specific to the index set for the data set at hand. For instance, the requirement that the age effect should be zero on average does not carry over when looking at a subsample or when forecasting. It is not obvious that such an ad hoc identification is any more or less arbitrary than saying that, for instance, the first or the last age effect should have a particular value.
Adding confidence bands to a plot of̂is in itself not difficult. If̂is asymptotically normal with mean and variance Σ, then̂is asymptotically normal with mean and variance ⊥ ( ⊥ . This is a normal distribution on the space Θ . The interpretation of these standard errors will therefore be similar to that of̂itself.
Finally, it may be of interest to analyse the estimated̂time series statistically. Denote this time series by . Its sample space is now Θ . A statistical model on Θ can be built as follows. The starting point could be a time series model for unrestricted variables on the sample space Θ. This gives a joint density for ∈ Θ, which can be reduced by marginalisation to a density for ∈ Θ . Whether one is working with the unrestricted model for ∈ Θ or the restricted model for ∈ Θ inferences that are invariant to must be based on those statistics of or that are invariant to . Thus, inferences must be based on the maximal invariant under . For a general overview of invariant reduction see Cox and Hinkley [12, page 175f], whereas Nielsen [15] gives the argument in some detail for an autoregression with a linear trend.

Hypothesis
for known matrices ∈ R ×( − ) , ∈ R − . The constrained time effect space is then To see how the restriction (28) restricts the predictor space ⊂ R recall that the predictor only depends on through = . Thus, the analysis of the restriction (28) depends on the interplay between the matrices , . Theorem A.3 in Appendix A.3 gives a general result to that effect. It shows that the hypothesis (28) restricts the predictor space to adimensional affine subspace of R in so far as it restricts the canonical parameter . In particular, the degrees of freedom of the hypothesis, − , may in general be different from the dimension reduction of the time effect parameter, − . When this is the case the restriction (28) has an element of ad hoc identifying the time effect.

Forecasts.
Forecasts can be made by extrapolating the ad hoc identified time effects . Two researchers choosing different ad hoc identification schemes, but otherwise making the same analysis, may make different forecasts. This can be avoided if the extrapolation method is chosen with some care.
Following the linear approach outlined in Section 2.4.4 the predictor = = is forecasted by extending the design into Extrapolating the ad hoc identified into a vector ( ,̃) then gives the forecast Often both components ℎ 1 and ℎ 2̃d epend on the ad hoc identification. Nonetheless, these dependencies of 8 The Scientific World Journal ad hoc identification may cancel each other so that the overall forecast̃is invariant to the ad hoc identification. Such invariance would seem desirable in most applications unless there is strong substantial reason for the ad hoc identification scheme. Necessary and sufficient conditions for invariance are presented for the age-period-cohort model in Section 5.4.3 and for a nonlinear model in Section 6.1.5.
In contrast, these considerations are redundant when working with the canonical parameter, ; see Section 2.4.4.

Bayesian Models and Random Effects Models
Mortality analysis is often carried out using either Bayesian methods or random effects methods. The mortality model is then altered through the introduction of a prior distribution on the parameters. One might think that the identification problems become less of an issue or even disappear. This is not the case since the Bayesian method and the random effects method is based on the mortality likelihood which only depends on the time effect through the maximal invariant parameter . Thus, the identification challenges remain. The issue is that a prior on the unidentified part, say , of the time effect amounts to an ad hoc identification. Indeed, the conditional prior of given is not updated by the mortality likelihood. A main difference is that a maximum likelihood analysis of the original mortality likelihood usually prompts the researcher when there is an identification issue, whereas both Bayesian methods and random effects methods allow computations to go through despite an identification issue. In Section 4.1 it is seen that introduction of a conditional prior on given is the Bayesian analogue of ad hoc identification. This leads to the same type of forecasting challenges as in the frequentist settings as is seen in Section 4.2. In Section 4.3 we show how the Bayesian identification issues transfer to random effects models.

Bayesian Estimation.
For Bayesian and random effects models we formulate a likelihood and a prior. Thus, consider a likelihood ( | ) = ( ; ). Replacing by , the identification problem implies that The prior on is factorised as ( ) = ( , ) = ( ) ( | ). In the case of Bayesian estimation the following result emerges. (32). Then (i) the predictive distribution does not depend on the conditional prior for :

Theorem 2. Suppose the likelihood satisfies
(ii) the posterior satisfies (iii) the posterior means satisfy Theorem 2 shows that it suffices to give a prior to and ignore as advocated in Section 2.4.5. Indeed the conditional prior for given is not updated. Theorem 2 appears to be well-known; see Poirier [16,Proposition 2] or Smith [17,Section B].
Due to Theorem 2 the Bayesian analyst faces the complications outlined in Section 3.4. Indeed, suppose that two Bayesian researchers choose the same likelihood ( | , ) = ( | ) and the same prior ( ) for , but different conditional priors for given . Their marginal distributions for the data are identical, but any inferences regarding interpretation or forecasting will differ in so far as they involve the unidentified parameter . A Bayesian researcher should therefore be cautious with inference related to . There will of course be situations where the prior knowledge of given is found to be of substantive relevance. In such situations it seems more fruitful to change the likelihood to include that information.

4.2.
Forecasting. Bayesian forecasts involve integrating an extrapolative distribution. This can be done in two ways, either working exclusively with the identified, maximal invariant parameter as in Section 2.4.4, or working with the time effect = ( , ) as in Section 3.4.3.

Forecasting Using the Maximal Invariant Parameter.
Consider first the case where only the maximal invariant parameter is used. In that case the forecast is computed by sampling from the posterior ( | ) and then extrapolatingũ sing the sampled value using some extrapolative methods, say (̃| , ). In combination this gives the forecast
The concern is now as follows. Suppose a second researcher chooses the same extrapolative method, likelihood, and prior for , but different conditional priors † ( | ). In general, this will result in a different distribution forecast: The Scientific World Journal 9 The question is then under which conditions will (̃| ) = † (̃| ) so that the distribution forecasts are invariant to the choice of conditional prior for given ? A sufficient condition is that the extrapolation method does not depend on so Condition (39) (37) is invariant to the choice of conditional prior for given . The forecast then reduces to (36).
To summarise, the identification issues surrounding Bayesian analysis are similar to those outlined in the previous sections. Examples of the problems that can arise are discussed in Sections 5.4.4, 5.5.2, and 6.1.6. There are two solutions to the identification problem. The first is only to formulate a prior on ; see Section 2.4.5. Incidentally, this is what Bernardo and Smith [18, page 218] do in their discussion of the two-way analysis of variance, albeit without linking it to the considerations of Smith [17]. The prior ( ) can of course be constructed by formulating a prior on and then reduce it to a prior on by marginalisation so ( ) = ∫ ( , ) . The other solution is to work with a prior on but avoid those parts of the posterior that depend on .

Random Effects Models.
It is common to combine mortality models with a random effects approach, which effectively forms a new model. An example is given in Section 5.4.6. We consider the consequence of the lack of identification.
The random effect models are typically constructed as follows. Suppose the density of the data given the time effects = ( , ) is of the form ( | , ) = ( | ) as before; see (32). A prior ( | ) is chosen that now depends on a parameter . The prior can be decomposed as ( | ) = ( | ) ( | , ). Theorem 2 implies that the density of the data given is This in turn is used to form the random effects likelihood of as This, effectively, defines a new model. The random effects likelihood only depends on the prior ( | ) through ( | ). Two researchers choosing the same prior ( | ) but different conditional priors ( | , ) will then get the same random effects likelihood and the same maximum likelihood estimator̂.
In mortality modelling it is common to go one step further and estimate the time effects through the mean of the posterior ( | , ) evaluated at =̂. Then the identification problem may show up. Theorem 2 shows that so that the prior for is updated, while the conditional posterior for given is not updated by the data. Thus, in general the estimate for is based, in part, on a prior which is not updated by the data.

Age-Period-Cohort Models
We will now apply the theoretical considerations to analyse the age-period-cohort model. The methodological literature on this model is large and the consequences of the above theory are wide ranging. In Section 5.1 we present the age-period-cohort model along with the maximal invariant parameter. This maximal invariant parameter is also called the canonical parameter because the age-period-cohort model is usually implemented as an exponential family; see Section 2.4.1. When formulating the model we choose a notation matching the age-periodcohort literature rather than the reserving literature. At the same time the exposition takes it starting point in Kuang et al. [19], but the notation deviates.
The implementation of the canonical parameter depends on the type of data array. In Section 5.2 design matrices are given for age-cohort, age-period, and period-cohort data arrays. While they illustrate interesting differences in the structure for these data arrays, they also provide the basis for an immediate implementation via any generalised linear model software. The age-cohort model is expressed as a hypothesis of the age-period-cohort model in Section 5.3. Time effects and forecasting are considered in Section 5.4, while the two-sample age-period-cohort model is discussed in Section 5.5.

The Model and the Canonical Parameter.
Here the ageperiod model is set up and a quite general identification result is reported.
Consider data indexed by ( , ) ∈ I where is the age and is the period. The index set may be a rectangle given by = 1, . . . , and = 1, . . . , so that the cohort = − + runs from 1 to = + − 1. More generally, the index set could be a generalized trapezoid where two corners are cut off the rectangle so that the cohort runs from 1+ℎ 1 to + −1−ℎ 2 for some ℎ 1 , ℎ 2 ≥ 0. The class of generalized trapezoids includes the three types of Lexis diagrams discussed by Keiding [20]. We will return to those special cases below.
The statistical model is defined by the assumption that the variables are independent with an exponential family distribution with predictor given by 10 The Scientific World Journal The time effect = ( 1 . . . , , 1 , . . . , , ℎ 1 +1 , . . ., + −1−ℎ 2 , ) now varies in some time effect space Θ ∈ R where = + + + 1 − ℎ 1 − ℎ 2 . The model (43) is of the form (1) discussed in Section 2. Specifically, the predictors can be stacked in a vector of dimension = dim I and written as = . Thus, the parameter space for the model is of the form = ( ∈ R : = for ∈ Θ) as outlined in (2). The mapping → from Θ to is surjective and the equivalence classes in the time effect space can be described by a group of transformations that are discussed in (8). This group can be represented as for any , , , and . This is of the form (8) with = ( , , , ) although the definition of the matrix depends on the structure of the index set I.
A first clue for the canonical parametrisation is given by Fienberg and Mason [21] and Clayton and Schiffler [22] who pointed out that, on the multiplicative scale, ratios of relative risks are invariant. On the additive scale this amounts to looking at second differences, such as Δ 2 = − 2 −1 + −2 . A graphical illustration of the double differences is given in Figure 1 (graphics were done using R 3.0.2, see R Development Core Team [23]), which is taken from Miranda et al. [24]. Panel  [19] introduces a parameter formed by these second differences as well as three entries of the predictor; that is, The parameter varies in the space Ξ = R where = − 4. If the three points 1 1 , 2 2 , and 3 3 are chosen not to be linearly related then they define the levels and the linear trends in the predictor. The formal condition is that a certain determinant defined from the indices is nonzero; that is, Theorem 4 (see [19], [25,Corollary 2]). Let satisfy (43). (46) is satisfied then the parameter of (45) satisfies the following:

If the condition
(i) is a function of which is invariant to the group in (44); (ii) is a function of ; (iii) the parametrisation of by is exactly identified in that Theorem 4 therefore shows that varies freely in Ξ = R . Moreover, is a maximal invariant of the mapping from to under the transformations . It should be noted that the choice of maximal invariant is not unique. Indeed, any bijective mapping of can serve as maximal invariant. The choice of is convenient since it becomes the canonical parameter in generalized linear models of the exponential family type.
In itself this theorem does not tell how to express the predictor in terms of the canonical parameter . The link depends on the structure of the index set I. The above mentioned paper gives implicit expressions for generalized trapezoid index sets. In the following we report explicit expressions for the 3 principal Lexis diagrams.

Design Matrices for Lexis Diagrams.
The link between the canonical parameter and the predictor is analysed for the 3 principal Lexis diagrams. We start with age-cohort data arrays, which were the focus of attention in Kuang et al. [19]. Such arrays are easiest to analyse because all three time scales increase from the point where = = = 1. As a consequence the results are relatively easier for these arrays.
Age-cohort arrays are in particular used for reserving in general insurance. In that situation, only the triangle 1 ≤ , , ≤ is observed. The issue is to forecast the other triangle in the square 1 ≤ , ≤ . In the reserving literature these triangles are referred to as the upper and lower triangles, since the cohort axis has reverse order. The two-factor age-cohort model for triangular age-cohort arrays is known as the chainladder model; see England and Verrall [26] for an overview. Zehnwirth [27] introduced an age-period-cohort model for such triangular arrays. The identification issue is analysed in detail in Kuang et al. [19,25]. Subsequently, Kuang et al. [28] analysed the Poisson likelihood, while Kuang et al. [10] give an empirical analysis focusing on forecasting.
The age-period-cohort model for the age-cohort arrays is parametrised by The Scientific World Journal The design matrix linking the canonical parameter in (45) and the predictor is essentially an identity linking the two parameters. A natural choice of the three levels points to the predictors that are 11 , 12 , and 21 . We then get the representation with the convention that empty sums are zero, and recalling that second differences are defined as The identity (49) is crucial to the understanding of the age-period-cohort model. It shows that the predictor has a single level expressed as 11 , which in turn satisfies 11 = 1 + 1 + 1 + . The level 11 is therefore estimable, but the individual levels 1 , 1 , 1 , and are not identifiable from the model. Further, the model has two linear trends, here expressed with slopes 21 − 11 and 12 − 11 in terms of the age and cohort indices. These slopes can be expressed as 21 − 11 = Δ 2 + Δ 2 and 12 − 11 = Δ 2 + Δ 2 . They are estimable, but the individual slopes Δ 2 , Δ 2 , and Δ 2 are not identifiable.
Theorem 5 (see [19,Theorem 1]). Let satisfy (48). The parameter of (50) satisfies the following: (i) is a function of which is invariant to the group in (44); (ii) is a function of , because of (49); (iii) the parametrisation of by is exactly identified in that Theorem 5 in turn implies that the parameter varies freely in Ξ = R , while the design matrix given by (51) has full column rank. Originally, the more general Theorem 4 was proved as a corollary to Theorem 5.

Age-Period
Arrays. An age-period data array is rectangular in the age and cohort indices and given by Consequently, the cohort index = − + varies over = 1, . . . , = + − 1. Keiding [20] refers to this Lexis diagram as the third principal set of death. Age-period arrays are commonly used in epidemiology, in mortality analysis, and in sociology. The analysis of identification issue is largely similar to that of age-cohort arrays. However, the representation of the predictor in terms of differs in an intriguing way, because the third time index, the cohort , is the difference of the other two indices.
The age-period-cohort model for the age-period arrays is parametrised by The time effect = ( 1 , . . . , , 1 , . . . , , 1 , . . . , , ) now varies in Θ = R 2( + ) . A representation of the predictor in terms of the canonical parameter is now 12 The Scientific World Journal The representation (54) differs from that of (49) in a subtle way. The three reference points for the levels of the predictor are chosen in the corner = , = 1. From this corner period and cohort indices increase, while age decreases. Hence, the age double differences Δ 2 are now cumulated backwards. This phenomenon arises because the cohort index is the difference of the principal indices of age and period, whereas for the age-cohort array the period index is the sum of the principal indices of age and cohort. The predictor is now = where, with ℎ( , ) = max( − + 1, 0), The identification relies on Theorem 4. It is specialised to ageperiod arrays as follows.

(i) is a function of which is invariant to the group in (44);
(ii) is a function of , because of (54);

Period-Cohort
Arrays. An age-period data arrays is rectangular in the age and cohort indices and given by Consequently, the age index = − + varies over = 1, . . . , = + − 1. Keiding [20] refers to this Lexis diagram as the second principal set of death. Age-period arrays are commonly used in prospective cohort studies in epidemiology and in sociology. The analysis is similar to that of age-period arrays when swapping the role of age and cohort.
The age-period-cohort model for the age-cohort arrays is parametrised by The time effect = ( 1 , . . . , , 1 , . . . , , 1 , . . . , , ) now varies in Θ = R 2( + ) . A representation of the predictor in terms of the canonical parameter is now Thus, the canonical parameter and the design matrix are given by In parallel with Theorem 6 we then have the following identification result.

(i) is a function of which is invariant to the group in
(ii) is a function of , because of (61);

Expressing the Age-Cohort Model as a Hypothesis.
It is often of interest to test the absence of the period effect. An application to analysing asbestos related mortality can be found in Miranda et al. [24]. The hypothesis is that 1 = ⋅ ⋅ ⋅ = , when expressed in terms of the time effect parameters. The restricted model is given by, with = − + , The identification problem simplifies to a question of determining the levels of and . Therefore the (log) relative risk parameters Δ are identified as pointed out by Clayton The Scientific World Journal 13 and Schifflers [30]. In this model the cohort index is present and keeps the difference of the principal age and period indices. Therefore the representation of the predictor involves backward cumulated age differences as before but with a subtle change of sign, so that (54) reduces to The age-cohort model can also be formulated as a hypothesis on the maximal invariant in the age-period-cohort model following Section 2.4.3. The period effects Δ 2 are set to zero through = 0, where = (0, −2 , 0). Applying this to the expression for in (55) gives since in the absence of period effects; then 1 − −1,1 = Δ − Δ 2 and 2 − 1 = Δ 2 . The double differences cumulate to first differences through ∑ =3 Δ 2 = Δ − Δ 2 , so the above expression is seen to be a linear transformation of ac in (67). In other words the age-cohort model arises from the age-period-cohort model by restricting the maximal invariant parameter.

Working with the Time Effect.
There is a large literature seeking to identify the original time effects , , and of the age-period-cohort model from the predictor. Here we look closer at some of those ad hoc identification proposals.

Ad Hoc Identification of Levels.
For the age-periodcohort model it is popular to impose ad hoc identifications in two steps of the type discussed in Section 3.3. Here the first step is concerned with the level of the time effects and the second step is concerned with the linear trend. Examples are given in Sections 5.4.2 and 5.5.4.
A common first step ad hoc identification is to require that This ad hoc identification is specific to the chosen data range. For instance, the constraint ∑ =1 = 0 is not easily transferable to a different data set drawn from the same population but with a different set of age groups. This aspect would have to be kept in mind if a substantive motivation was to be found for this constraint. Other ad hoc identification schemes such as = = = 0 have similar problems. The constraint (69) is a special case of affine constraints of the form = discussed in Section 3.3. The involved dimensions are = 2( + ) and = − 4, while the number of constrains is − = 3. The matrix ∈ R ( − )× is given by the top left {3 × ( − 1)}-block of ⊥ in (58) padded with a column of zeros, while ∈ R 3 is given by = 0. Theorem 1 shows that = ⊥ ∈ R ( − )×( − ) has full rank. Indeed, and its orthogonal complement are given by where, for instance,̆= −1 ∑ =1 = ( + 1)/2. Thus, the constrained group of equivalence classes (27) is (71)

Ad Hoc Identification of Slopes: The "Intrinsic" Estimator.
The "intrinsic" estimator is a popular estimator in the sociology literature; see Yang et al. [4] and see also O'Brien [31,32] and Fu et al. [33] for a recent discussion of its merits. It has its roots in a suggestion by Kupper et al. [34], with an early critique given by Holford [35]. The "intrinsic" estimator is defined in two steps. In the first step, the levels are identified by the ad hoc constraint (69). Three of the -coordinates are then dropped; that is , , and are dropped. In a second step the linear trend is ad hoc identified using a Moore-Penrose inverse as in (23).
We can analyse these steps using the developed framework. The first step identifies the levels by the ad hoc constraint (69), which is a constraint of the form = 0 for the discussed in Section 5.4.1. This is defined on Θ which is a linear subspace with a dimension deficiency of 3. Introduce a selection matrix ⊥ ∈ R ×( −3) that selects all coordinates of except , , and . Thus ⊥ arises as adimensional with 3 columns deleted corresponding to , , and . This is chosen so that ( , ⊥ ) is invertible. Then ⊥ is freely varying in that ⊥ Θ = R −3 . The skew projection identity = ( )  In the second step the linear trend is ad hoc identified through a time effect parameter of the form (23) with , replaced by , so that ad.hoc = ⊥ ad.hoc where ad.hoc = ⊥ ( ⊥ ) −1 + ( ) ⊥ { ( ) ⊥ } −1 for some scalar and some matrix ⊥ ∈ R ( −3)× .
The "intrinsic" estimator is ad hoc identified through the choices = 0 and ⊥ = , while is chosen by (69). It therefore estimates an "intrinsic" parameter: which depends on the choices of ⊥ , , and ⊥ . However, since we can replace ⊥ by ⊥ for arbitrary invertible matrices ∈ R ( −3)×( −3) without changing intrinsic the expression intrinsic does not depend on the normalisation of ⊥ . The "intrinsic" parameter satisfies the following result.

Theorem 8. The "intrinsic" parameter is an injective mapping of the canonical parameter
where ∈ R −3 is uniquely defined up to a scale by Theorem 8 implies that the "intrinsic" parameter should be interpreted as an object varying in the linear subspace Θ intrinsic rather than in the unrestricted time effect space Θ = R . As outlined in Section 3.4 this has consequences for the interpretation of plots of the time effects, hypothesis testing, and forecasts. A consequence of this argument is that different choices of , ⊥ , , and would lead to other ad hoc identified parameters varying in other affine subspaces of Θ. In other words, the "intrinsic" estimator carries the cost of working with the somewhat complicated linear subspace Θ intrinsic . This effort may be worthwhile if the particular choice of , ⊥ , , and can be made on substantive grounds.

Forecasting.
Forecasting of future mortality rates involves an extrapolation of the time parameters. In Section 2.4.4 it was argued that ad hoc identification may introduce an undesirable arbitrariness in the forecast. When working exclusively with the canonical parameter this arbitrariness is avoided. It is, however, also possible to work with ad hoc identified time effects under specific circumstances that we characterise here for age-period arrays. This builds on the theory developed in Kuang et al. [25] for age-cohort data arrays.
In the context of an age-period data array I ap it is often of interest to forecast ℎ periods ahead. Suppose it is of interest to forecast the mortality at age in period + ℎ, so that the Identification plays a role when extrapolating the estimates obtained on the data array I ap . The identification issues can be ignored if the investigator simply extrapolates Δ 2 and Δ 2 . In the context of ad hoc identified time effects arbitrary linear trends are introduced in the model. The forecast of the predictor , +ℎ is invariant to these if and only if the chosen extrapolation method for , preserves these linear trends so that they can cancel with the arbitrary linear trend in . The next result gives a precise formulation of this statement. It applies both to point forecasts and distribution forecasts.

Theorem 9. Consider the predictor
for , ∈ I ap as given in (53). Suppose the time effects , , and are ad hoc identified. Consider the class of ℎ periods-ahead forecasts over J ap constructed as̃, +ℎ =̂+̃+ ℎ +̃+ +ℎ− +̂, wherẽ +ℎ +̃+ +ℎ− is a function of the ad hoc identified estimatê. Let be the group (57). Invariance of the forecast̃, +ℎ with respect to the ad hoc identification is equivalent to either of the following: To illustrate the use of Theorem 9 consider the extrapolation methods̃+ ℎ =̂and Δ̃+ ℎ = Δ̂. The first forecast is a random walk forecast and it is seen to violate (ii). The second forecast is a cumulated random walk and satisfies (ii). The reason is that +ℎ = + ∑ ℎ ℓ=1 Δ +ℓ . Since Δ̃+ ℓ = Δ̂, theñ+ ℎ =̂+ ℎΔ̂. Further examples of forecasts that are linear trend-preserving as well as some which are not are given Kuang et al. [25, Table 1].
Kuang, Nielsen, and Nielsen [10] apply this to reserving data organised in an age-cohort array I ac and discuss the issue of robustification of forecast with respect to structural breaks at the forecast origin. Miranda et al. [24] give an application to asbestos related mortality using an age-period array I ap .

Bayesian Ad Hoc Identification Using a Dynamic Prior.
A Bayesian ad hoc identification using a dynamic prior does not solve the identification problem as discussed in Section 4 and the same care has to be exercised to avoid the problems outlined in Section 3.4. Berzuini and Clayton [6] suggest such an ad hoc identification approach. On page 831 they write "Identificability problems may be solved by imposing an arbitrary linear constraint on the log-linear trend components of age, period and cohort effects. Happily, such an arbitrary constraint has no effect on the predictions of the model. " The previous analysis suggests that this is far from innocent.
The Berzuini-Clayton suggestion is to ad hoc identify the model (53) through A dynamic prior is chosen so that the double differences Δ 2 , Δ 2 , and Δ 2 are independent zero mean normal with variances = ( 2 , 2 , 2 ) that have 2 -type prior. The purpose of this is in part to facilitate extrapolations Δ 2 , Δ 2 , and Δ 2 for > , > , and > , which is done through further draws from normal distributions. The level/trend effects level = ( 1 , 2 , 1 , 2 , 1 , 2 ) have independent uniform priors on some large intervals.
We will analyse the Berzuini-Clayton model as applied to an age-period data array I ap . Decompose the canonical parameter from (54) into two parts: the slope and level parameters, say = ( 1 , 1 − −1,1 , 2 − 1 ) , and the collection of double differences, say Δ . The assumed prior for Δ is a simple collection of independent normal distributions with variances . The assumed prior for is a linear combination of not only the independent uniform variables level , but also on Δ , since the age double differences Δ 2 are cumulated backwards in (54), but forwards in (77). Thus, the prior for = ( , Δ ) depends on the level construction.
We get a hyper-parameter hyper = ( , ), where is some three-dimensional ad hoc identified level/trend effect dependent on level , Δ . We will argue that the ad hoc identified level/trend effect will wash out in the Berzuini-Clayton model. However, the level/trend parameter is a function of the level construction that is tailored to the ad hoc identification. That construction remains in the analysis.
In the presentation of the posterior Berzuini and Clayton are careful only to consider the double differences Δ and stay clear of the ad hoc identified level/trend effect level . Theorem 2 yields the posterior ( | ) = ( | ) ( )/ ( ). Thus, the marginal posterior for the double differences is This links Δ to and in turn to the level construction.
The extrapolative method is based on double differences so it only depends on hyper through due to Theorem 9 and the subsequent discussion. Thus, the extrapolative method is of the form (̃| , hyper , ) = (̃| , , ). By construction it does not reduce to (̃| , ) so that condition (39) for Theorem 3 is not satisfied. The distribution forecast is of the form which, apart from depending on the level construction, also depends on the conditional prior ( | ), which is not updated by the likelihood. In summary, it appears that the Berzuini-Clayton analysis depends on the level construction as well as the conditional prior ( | ). The dependence on the level construction could of course be addressed by introducing priors directly on , which in turn would be updated by the likelihood. Since the conditional prior ( | ) cannot be updated by the likelihood that its sole justification rests on the substantial context.

A Functional Form Hypothesis.
It is instructive to consider functional form restrictions on the time effects. Such hypotheses can be analysed using the results outlined in Section 3.4.2. As an example restrict the age effect to be quadratic in a similar way to Yang and Land [5] so that This restriction on the time effect can be analysed by writing it on the form = , see (28), and then applying 16 The Scientific World Journal Theorem A.3. Alternatively, in this particular case, we can show that the restriction actually only affects the ad hoc identified time effect through the canonical parameter, so a simpler analysis can be made. A quadratic polynomial has constant second order derivative. Therefore the restriction (79) implies This expression has one free parameter. Thus, it is useful to consider the third order difference: This gives −3 linear restrictions on the canonical parameter. The age time effect then has three remaining parameters, say 1 , 2 , and 3 . These are freely varying since the parameters 0 , 1 , and 2 are freely varying. If the constraint is imposed directly on the canonical parameter, the restricted model is a regular exponential family with the advantages outlined in Section 2.4. However, if the analysis is done with the time effect the levels and trend will have to be ad hoc identified while bearing in mind the issues discussed above.

The "Hierarchical Age-Period Cohort Regression Model".
In some cases a random effects approach can be used to get an overview of the many parameters of the age-period model. When applied to the time effects this implies an ad hoc identification. An example is the "hierarchical age-period cohort regression model" by Yang and Land [5]. In that paper the age effect is given a quadratic structure, but that does not have to be the case. The model is then given by Since random effects are only introduced for some of the time effects, the analysis of Section 4.3 has to modified in a similar way to the analysis in Section 5.4.4. From (80) it is seen that the model restricts Δ 2 = 2 2 . Thus, divide the canonical parameter into three elements: the slope and level parameters, say = ( 1 , 1 − −1,1 , 2 − 1 ) , the age-double differences = (Δ 2 3 , . . . , Δ 2 ), and the remaining double differences , . Here is restricted by the hypothesis 2 2 and , is linear function of the normal random effects, while is a three-dimensional linear function of 2 and of the six-dimensional object ] = ( 0 , 1 , 1 , 2 , 1 , 2 ) . This leaves a three-dimensional ad hoc identified level/slope parameter which is also a function of ] but not entering the likelihood. Let = ( 0 , 1 , 2 , 2 , 2 ).
The random effects likelihood are constructed in three steps. First, we have the usual age-period-cohort likelihood ( | ). Secondly, the random effects distribution for , , , and is multivariate normal, while is deterministic function of . Thus, decompose the prior as ( , , , | ) = ( , , | ) ( | , , , ). Thirdly, following Section 4.3 the random effects likelihood will not depend on ( | , , , ) and it is given by The prior ( | , , , ) is not updated by the data. Plots and inferences based on the posterior ( | , ) will then suffer from the ad hoc identification issues outlined in Section 3.4.

A Two-Sample Age-Period-Cohort
Model. When confronted with two samples for women and for men it may be of interest to apply the age-period-cohort model (43)

Analysis of the Unrestricted Two-Sample Model.
The unrestricted two-sample model is simply analysed as two copies of the one sample model of Section 5.1. The time effects of each copy are only defined up to linear trends. The group of transformations characterizing the identification problem combines two copies of the one sample group (44). The maximal invariant parameter is = ( 1 , 2 ) ∈ R where = 4( + − 2) and each of are of the form (45). The benefits of Section 2 hold when working with that parameter.

Bayesian Ad Hoc Identification Using a Dynamic Model.
An application of the unrestricted two-sample model can be found in Cairns et al. [36]. The two samples are the population of England and Wales and the subpopulation of assured lives, so the substantive question is whether there is a selection effect for the assured lives. A Bayesian model with dynamic prior is used. It shares some features with the Berzuini and Clayton [6] model discussed in Section 5.4.4 although the details of the ad hoc identification of the levels and slopes are slightly different. When it comes to forecasting the extrapolative method appears to depend on the ad hoc identified parameter as well as the hyperparameters. This complicates the analysis of the forecast relatively the discussion in Section 5.4.4.

The Hypothesis of Common Period Parameters.
The two-sample model allows the possibility for adding crosssample restrictions on the parameters. As an example we consider the hypothesis of common period parameters.
Working with the canonical parameter the hypothesis is This is a simple linear restriction as that discussed in Section 2.4.3. It is readily seen that the degrees of freedom of the hypothesis are − = − 2 so the dimension of the restricted model is = 4 + 3 − 6. The canonical parameter under the hypothesis is then = (. . . , 11 , 21 − 11 , 12 − 11 , Δ 2 , Δ 2 , Δ 2 , . . .) .

(86)
The same result arises when writing the hypothesis in terms of time effects so that Such hypotheses on the time effect were discussed in Section 3.4.2. It can be analysed using the general result in Theorem A.3. However, we will take the simpler route of arguing that this only restricts the canonical parameter given a hypothesis of the type (85). The argument relies on noting that analysing the restriction for the predictors 1 and 2 is equivalent to analysing the restriction for the predictors 1 and 2 − 1 , where the cross-sample differenced predictor is of the form Now, the restricted model for the cross-sample differenced predictor 2 − 1 is an age-cohort model: Following the analysis of Section 5.3 the (87) therefore implies the − 2 linear restrictions given by (85). At the same time the predictor for the first sample 1 is left unrestricted by (87). In summary, the restrictions (85) and (87) are equivalent. The restriction has an interesting implication for the interpretation of the involved double differences. For the unrestricted model it was found that only plain double differences, such as Δ 2 , are identified. Under the restriction the cross-sample differenced predictor is of age-cohort form (89) so also the cross-double differences Δ( 2 − 1 ) and Δ( 2 − 1 ) are identified.

Step-Wise Ad Hoc Identification under the Hypothesis.
The analysis of Riebler and Held [7] finds that the difference 2 − 1 is identified under the hypothesis (85). This is not consistent with the above analysis showing that the crosssample differenced predictor is an age-cohort model under the hypothesis, for which levels such as 2 − 1 are identified.
The apparent difference comes about because Riebler and Held follow a step-wise identification approach along the lines of Sections 3.3 and 5.4.1. In a first step the time effects , , and are constrained to have zero-sums as in (69). In a second step the slopes are ad hoc identified using a Bayesian approach similar to that of Berzuini and Clayton [6]; see Sections 4 and 5.4.4 for a discussion of the consequences.
The identification in the first step implies that 2 − 1 has a zero sum. Under the hypothesis (85) this is exactly what is needed to ad hoc identify the levels in the age-cohort model (89). In other words a different level identification in the first step leads to a different level for the difference 2 − 1 .

Models with Nonlinear Parametrisations
Some additional issues arise when looking at models with nonlinear parametrisations. A prominent example is the mortality model proposed by Lee and Carter [3] and which is the current benchmark in mortality studies done by government agencies and pension funds. For this model the time effect space Θ has a nondifferentiability which can actually be avoided by working directly with the parameter space .
We analyze the Lee-Carter model in Section 6.1. In Section 6.2 we turn to a two-sample problem where some additional difficulties can arise when forecasting.

The Lee-Carter
Model. The mortality model proposed by Lee and Carter [3] has predictor of the form The time effects = ( 1 , . . . , , 1 , . . . , , 1 , . . . , ) vary in Θ = R 2 + . Lee and Carter pointed towards two identification issues of the model. If , , and are one solution to (90), then − , , + is also a solution for any scalar , just as , / , and are a solution for any ̸ = 0. Consequently, they proposed the ad hoc identification: This is, however, not the full story about the identification issues. To get at this we follow the outline from the linear parametrised models and start by finding the parameter space for the predictor .
where , , and are vectors concatenating , , and and where = (1, . . . , 1) ∈ R . Postmultiply by the projection identity = + ⊥ ⊥ to get where the orthogonal complement ⊥ can be chosen so that ⊥ = (Δ 2 , . . . , Δ ) but could also be chosen otherwise. Equation (93) shows that the model is composed of two matrices with rank one. Thus, the parameter space is given by 18 The Scientific World Journal Note that does not depend on the normalisation of ⊥ since is freely varying. The space is a manifold since the space of matrices with an upper bound to the rank is a manifold as opposed to the space where has rank of unity. This space can be parametrised parsimoniously by varying in the manifold The is the candidate for the maximal invariant describing the equivalence classes of the mapping from the time effect to the predictor . The next step is to analyse the time effect space Θ. It is convenient to decompose into two disjoint sets depending on the rank of . These sets are Correspondingly, the time effect space Θ can be decomposed into two disjoint sets: Note that = 0 if and only if ∈ Θ 0 . Consider first the time effect space Θ 1 , which is implicitly what Lee and Carter had in mind. The mapping → on Θ 1 to is invariant to the group of transformations: acting on Θ 1 for all ∈ R and all ̸ = 0. The parameter = ( , ) is invariant under 1 acting on Θ 1 . Now, consider the space Θ 0 with deficient rank. Then , , and map into + where = is constant in , so that = ⊥ = 0. This mapping is invariant to the group of transformations: acting on Θ 0 for all ( 1 , . . . , ) ∈ R .
Theorem 10. Let ∈ . The parameter ∈ Ξ of (95) satisfies the following: (i) is a function of ∈ Θ which is invariant to the groups 0 , 1 in (99) and (100); (ii) is a function of ; (iii) the parametrisation of by is exactly identified in the Theorem 10 shows that varies freely on the space Ξ and it gives a unique parametrisation of . As a function of it is invariant to 0 , 1 ; hence it is a maximal invariant.
It is interesting to compare the properties of the spaces , Ξ, and Θ. The spaces and Ξ are spaces of matrices with deficient rank. These are smooth spaces, but they are not vector spaces since the sum of matrices with rank one may have rank larger than one. In contrast Θ is a vector space. The mapping from Θ to will inevitably be nondifferentiable. This nondifferentiability is avoided by working directly with . Likewise, in a Bayesian setting it would seem more difficult to introduce a meaningful prior of Θ with its nondifferentiability than on .

Maximum Likelihood Estimation.
The maximum likelihood estimator for can be derived analytically in the normal case.
Consider a situation where the data array is of age-period form so for ( , ) ∈ I ap . Suppose are independent normal with mean and variance 2 . Organise the data in a matrix . Then the log likelihood is of the form The maximum likelihood estimator is of the following form. Subsequently, this is related to the estimator suggested by Lee and Carter.
Theorem 11. For a normal age-period array parametrised by (94) the maximum likelihood estimators arê where svd 1 (⋅) is the singular value decomposition truncated to one factor.
Thus, is estimated by the row-averages of the data matrix, while is estimated by the singular value decomposition of the row-wise demeaned data matrix. Inserting the maximum likelihood estimators from Theorem 11 yields the estimators proposed by Lee and Carter. However, the disentangling of the singular values and singular vectors of̂is done by the ad hoc identification = 1 and = 0. These estimators are therefore specific to the considered data array and data set in parallel with the discussion in Sections 3.2 and 5.4.1.

Consequences of the Possible Rank Deficiency.
The parameter space was split into spaces 1 and 0 depending on the rank of . The space 0 is a Lebesgue null set relative to . Broadly speaking, there are two consequences of the possible rank deficiency. The first consequence is an estimation problem arising in the vicinity of 0 . The second consequence is that the usual normal asymptotic distribution theory does not apply in the vicinity of 1 . Whether this becomes a problem in practice depends on the data. One solution is to ensure that the time effect really is present when using the Lee-Carter model.
Investigate whether the time effects are present amounts to estimating the rank of . For a given data set two Lee-Carter models can be estimated. The first model with predictor space is the unrestricted model in which rank( ) ≤ 1. The second model has predictor space 0 so = 0. Twice the difference of the likelihood values gives a likelihood ratio test statistic which is asymptotically 2 . If the smaller model, 0 , is accepted this is used in subsequent analysis. However, if the smaller model 0 is rejected then it is likely that the predictor is not located in the vicinity of 0 and it is then safe to work with the predictor space 1 .
The consistency of this step-wise procedure is discussed in a cointegration context by Johansen [37,Section 12]. Even when this procedure points towards working with the parameter space 1 the rank deficiency may still affect inference under 1 . Analysis of simple canonical correlation models suggests that inference under 1 will be nearly similar if the distance to 0 is sufficiently large. A problem is that the distribution for the test statistic will have poor finite sample properties when the parameter value is close to 0 . A simple way to get around this problem is to test for 0 using a test with lower level than the conventional level. A more complicated way to address this is to employ a finite sample correction when seeking to test for 0 . See Nielsen [38,39] for further discussion of this issue in the context of simple canonical correlation models.
The rank deficiency issue is typically not encountered in a standard Lee-Carter analysis. The reason is that the analysis is typically applied to data where there is a marked improvement in mortality rates over time. A Lee-Carter analysis could however run into trouble if it were applied to data without a strong calendar effect. The issue becomes more pertinent when extending Lee-Carter model with a cohort component such as see Renshaw and Haberman [40]. If the cohort effect is modest the latter matrix is nearly rank deficient and the likelihood will be nearly flat in certain directions. This is presumably the reason for the estimation problem noted by Cairns et al. [41].
6.1.5. Forecasting. The purpose of Lee-Carter model is usually to forecast future mortality. This issue is considered for the model with parameter space 1 . The standard approach is to extrapolate , ad hoc identified through, for instance, = 0. The ℎ-step ahead extrapolation of +ℎ based on some forecast methods is denoted bỹ+ ℎ (̂). Combined with the estimateŝ ,̂this gives the overall forecast , +ℎ (̂) =̂+̂̃+ ℎ (̂) .
The identification question is then for which extrapolation methods this equals The condition for avoiding adverse impact of the ad hoc identification is as follows.
An alternative approach to forecasting would consider the predictor of the model for a particular age ground, say . That predictor iŝ= (̂+̂⊥), where is the th unit vector. From this we can generate forecasts̃, +ℎ using any time series method. The resulting forecast will in general depend on̂as well aŝ,̂and it is therefore more general than the forecasts discussed in Theorem 12, which only depends on̂. The forecast for another age group, say † , should be the same up to a linear transformation dictated by the Lee-Carter structure. Thus, the ℎ-step ahead forecasts for the entire array arẽ+ for an indexȋs chosen so that̂⊥̸̆ = 0.

Bayesian Ad Hoc Identification Using a Dynamic Model.
A Bayesian model with dynamic specification of the prior has been suggested by Pedroza [42]. Dynamic priors are presented for the time effects = ( , ) involving a hyper parameter . The ad hoc identification (91) is imposed so that analysis is made for an ad hoc identified time effect . Pedroza presents posteriors for . When evaluating this posterior one should bear in mind that the conditional prior ( | ) is not updated by the data; see Theorem 2. The presented extrapolative method does not depend on . Even so, the forecast will depend on conditional prior ( | ) which is not updated by the data; see Theorem 3.

The Two-Sample Lee-Carter Model.
We now turn to applications of the Lee-Carter model in two-sample problems. Suppose two samples are for women and men. One approach would be to fit separate Lee-Carter models to the two datasets. These Lee-Carter models are of the form = + for , ∈ I ap , = 1, 2.
The objective is now to extrapolate the period effects . Extrapolating the two models separately using separate random walks is often seen to be volatile, so methods that seek to combine information from both estimated serieŝare sought after. The next result describes the invariance problem in forecasting.

Theorem 13. Let
∈ 2 for = 1, 2. The forecast̃, +ℎ,1 for sample = 1 is invariant to ad hoc identification if the extrapolation method̃+ ℎ,1 preserves location/scale for sample 1, but is invariant to location and scale for sample 2. That is for all 1 , 2 ∈ R and all 1 , 2 ̸ = 0; theñ For one sample the standard forecasting technique appears to be the random walk with a drift as in (108). For the two-sample problem a suggestion could be that women and men should share a common random walk with a drift but deviate from this by a stationary process. In econometrics this idea is referred to as cointegration as proposed by Engle and Granger [43]; see also Johansen [37] for a likelihood based vector autoregressive approach. It is tempting to require that the calendar effects should cointegrate with coefficients of unity, so 1 − 2 should be stationary. However, that apparently intuitive choice violates Theorem 13 because the locations and scales of are different and arbitrary. There are two fixes to this problem. The first solution is to work directly with the mortality predictors for an arbitrary age group as outlined for the one-sample case in connection with (109). Since no identification is involved it is permitted to impose that 1 and 2 cointegrate with coefficients of unity. The forecast for age group is then carried over to other age groups. The second solution is to work with the estimated serieŝbut estimate the cointegrating coefficients from the data. In other words, the cointegrating relation̂1 −̂2 − should be zero mean, stationary, with coefficients , estimated from the data. This can, for instance, be done by Johansen's approach for a bivariate vector autoregression; see Hendry and Nielsen [44, Section 17].

Conclusion
Ad hoc identification is intimately linked to interpretation, inference, numerical analysis, and forecasting. The ad hoc identification will often introduce an arbitrary element in the statistical analysis, whether it is based on frequentist or Bayesian methods. This arbitrary element is entirely avoidable and is in our view best avoided unless there is a clear substantial motivation for ad hoc identification. For decades there has been a debate over how it is best to ad hoc identify mortality models. Our proposal is to bypass this discussion by analysing the surjective mapping between the unidentified time effect parameter and the predictor of the model and then deduce a maximal invariant parametrisation. In our experience there are typically two substantial benefits. First, it simplifies estimation and other statistical computations which is what we have focused on here. Secondly and perhaps more importantly, it helps to focus the substantial question that gives rise to the analysis in the first place.
The issue of dealing with two time scales also occurs in other statistical models, such as the Cox regression model; see Cabrera et al. [45] for a recent application. In future research it would be interesting to consider whether the analysis presented here has any bearing on that problem.

A.2. Proofs of Main Theorems
Proof of Theorem 1. Since ( , ) has full column rank then Lemma A.2 shows it has orthogonal complement ⊥ ⊥ so that ( , , ⊥ ⊥ ) is invertible. The orthogonal projection identity shows Premultiply by and to see that intrisic = and intrisic . This does, however, not describe the full variation of intrisic since ( , ) is not a square matrix. We must extend the matrix ( , ) with columns so that it is square and invertible.

22
The Scientific World Journal To see this hold, premultiply by {( , ), ⊥ } to see that an invertible upper triangular matrix arises.
To analyse the properties of intrisic it suffices to analyse {( , ), ⊥ } intrisic , since there is a bijective mapping between the two. Choose = ⊥ ( ⊥ ⊥ )( ⊥ ⊥ ) ⊥ + . Then it holds that ⊥ = ( ⊥ ⊥ )( ⊥ ⊥ ) ⊥ so that ⊥ ⊥ ⊥ = ⊥ and therefore The term T 1 has a minimum of zero if and only if = . In T 2 replace for a moment ⊥ by a matrix with rank of at most one. The altered T 2 is minimised when is the singular value decomposition of ⊥ ⊥ truncated to rank one due; see Golub and van Loan [46, Theorem 2.5.2]. That singular value decomposition has the property that it is zero when multiplied by . Therefore it is also the minimiser of the original problem.
Proof of Theorem A.3. Apply Lemma A.1 (i, iii) to see that rank( , ) = + − is equivalent to rank( ⊥ ) = . We can then write ⊥ = for some matrices ∈ R × and ∈ R × with full column rank.
Exploit the projection identity = ⊥ ⊥ + to get

Conflict of Interests
The authors do not have any conflict of interests.