UNRAVELLING ECOLOGICAL ANALYSIS

Ecological analysis involves analysing aggregate data for groups of individuals to make inferences about relationships at the individual level. Often the results of such analyses give badly biased estimates. This paper considers the sources of bias in linear regression analysis using aggregate data. We consider the role of variation in the individual level relationships between groups, the consequent within-group correlations, and how these are related to auxiliary variables that characterise the differences between groups. A method of adjusting ecological regression for the effects of auxiliary variables is described and evaluated using data from the 1991 Australian Census.


Introduction
Ecological analysis involves analysing aggregate data such as the means of a set of groups to make inferences about individual level relationships. An advantage of ecological analysis is that it uses data that are already available at relatively low cost. In an ecological analysis information from different sources may be brought together using aggregates for the same geographical areas.
Ecological analysis is a potentially valuable statistical tool, but it is subject to the ecological fallacy, which arises when the results are incorrectly assumed to apply to relationships at the individual level. Ecological analysis may produce seriously biased estimates of individual level relationships, which limits its practical use.
In Section 2 we consider the targets of inference and how ecological analysis can be considered as a form of multi-level modelling. In Section 3 we consider ecological linear regression analysis within a multi-level framework and clearly identify the sources of the biases. In Section 4 we describe a method of adjusting ecological regression using individual level data for auxiliary variables. Sections 5 and 6 give an evaluation of the aggregation effects and the adjustment method. Section 7 gives a discussion.

Targets of inference and ecological analysis in multi-level populations
In ecological analysis the population is composed of groups of individual units and has a multi-level structure. Statistical analysis should be based on a statistical model that reflects this structure, that is, a multi-level model. Consider a simple two-level population, the first level being the individual, and a population of $N$ individuals in which the $i$th individual has a vector of response variables $y_i$ and a group indicator $c_i$. The population comprises $M$ groups and the number of individuals in the $g$th group is $N_g$.
A simple two-level model for the population is

$$y_i = \mu + \nu_g + \varepsilon_i, \quad i \in g, \tag{2.1}$$

where $\nu_g$ is a vector of random group level effects and $\varepsilon_i$ is a vector of individual level effects. The standard assumptions are

$$V(\varepsilon_i) = \Sigma^{(1)}, \qquad \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j, \tag{2.2}$$

$$V(\nu_g) = \Sigma^{(2)}, \qquad \mathrm{Cov}(\nu_g, \varepsilon_i) = 0. \tag{2.3}$$

For this model the marginal covariance matrix is $V(y_i) = \Sigma = \Sigma^{(1)} + \Sigma^{(2)}$ and

$$\mathrm{Cov}(y_i, y_j) = \Sigma^{(2)} \quad \text{for } c_i = c_j,\ i \neq j. \tag{2.4}$$

Multi-level models provide a useful framework for any situation in which the process that generated the data involved groups, either through sampling or aggregation, or both. In standard multi-level modelling the targets of inference are the fixed mean parameter $\mu$ and the parameters of the distribution of the random components, $\Sigma^{(1)}$ and $\Sigma^{(2)}$. Taking the group level variance components into account enables efficient estimation of $\mu$ and calculation of appropriate estimates of standard errors. The parameters $\Sigma^{(1)}$ and $\Sigma^{(2)}$ indicate the relative importance of purely individual and purely group level effects. Estimation of the parameters usually requires a sample of groups, a sample of individuals within them, and indicators showing to which group each individual belongs (see Goldstein [3]).
In ecological analysis the targets of inference are at the unit level but the main data available consist of group level means. The assumption is that if a simple random sample of individual units were available, the researcher would be happy to analyse it completely ignoring any groups in the population. The parameters of interest describe the relationships between variables marginal to the groups, which in the model given by (2.1) to (2.3) would be $\mu$ and $\Sigma$. The ecological fallacy occurs when a covariance matrix is calculated from the group means and provides a biased estimate of $\Sigma$ and functions of it, such as regression and correlation coefficients. The marginal relationships may be relevant, for example, if the government is planning a policy that will be applied across the whole population.
In many geographical applications there is no direct interest in individual level relationships. The focus of interest may be at a level above the individual, that is, $\Sigma^{(2)}$, but the area level analysis will include a component due to the individual level relationships in the population, that is, $\Sigma^{(1)}$.
While these three situations have different objectives, they all require estimation of the variance components $\Sigma^{(1)}$ and $\Sigma^{(2)}$. How this can be attempted depends on the data available. The information available for analysis can consist of unit level or aggregate data, or a combination of both. Steel et al. [10] consider how multi-level models can be analysed in a number of different cases of data availability at the individual and group level.
Conventional multi-level modelling is carried out using a unit level data set which has group indicators. We will consider the case of aggregate data consisting of group means. The group means are often based on a census of groups and individuals within them, but can also come from a sample. Assume that there exists a sample data set $s$ of size $n$ and that these individual data have been aggregated to provide a set, $s_1$, of $m$ group means, $\bar y_g$, $g = 1, \ldots, m$, which are available for analysis. The number of sample individuals in each area, $n_g$, is also known. The overall sample mean is $\bar y = \sum_{g \in s_1} n_g \bar y_g / n$.
The source of the ecological bias can be identified from the model given by (2.1) to (2.3). Consider $S^{(1)}_{yy} = \sum_{i \in s} (y_i - \bar y)(y_i - \bar y)'/(n - 1)$, the covariance matrix calculated from the unit level sample data, and $S^{(2)}_{yy} = \sum_{g \in s_1} n_g (\bar y_g - \bar y)(\bar y_g - \bar y)'/(m - 1)$, the covariance matrix calculated from the group level means, using the group sample sizes as weights.
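The two covariance matrices just defined can be computed side by side when unit level data with group labels happen to be available, as in an evaluation study. The following is a minimal sketch, not the authors' code: the function name `covariance_matrices` and the toy data are hypothetical, and NumPy is assumed.

```python
import numpy as np

def covariance_matrices(y, groups):
    """Unit level S1 and size-weighted group level S2 for an (n, k) data array."""
    y = np.asarray(y, dtype=float)
    labels, inverse, sizes = np.unique(groups, return_inverse=True, return_counts=True)
    inverse = np.ravel(inverse)
    n, m = y.shape[0], labels.size
    overall_mean = y.mean(axis=0)                 # equals sum_g n_g * ybar_g / n
    group_sums = np.zeros((m, y.shape[1]))
    np.add.at(group_sums, inverse, y)             # accumulate rows into their group
    group_means = group_sums / sizes[:, None]
    unit_dev = y - overall_mean
    group_dev = group_means - overall_mean
    s1 = unit_dev.T @ unit_dev / (n - 1)                           # S^(1)
    s2 = (sizes[:, None] * group_dev).T @ group_dev / (m - 1)      # S^(2), weighted by n_g
    return s1, s2, sizes

# Hypothetical toy data: 4 groups of unequal size, 2 variables, with a group shift.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), [3, 5, 2, 6])
y = rng.normal(size=(groups.size, 2)) + 0.5 * groups[:, None]
s1, s2, sizes = covariance_matrices(y, groups)
```

In an ecological analysis only the group means and sizes would be available, so only `s2` could be computed; `s1` is included here because the comparison of the two is what reveals the aggregation effect.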
The key relationships are (see Steel and Holt [8])

$$E[S^{(2)}_{yy}] = \Sigma^{(1)} + n^* \Sigma^{(2)}, \qquad E[S^{(1)}_{yy}] = \Sigma^{(1)} + \frac{(m-1)\,n^*}{n-1}\,\Sigma^{(2)}. \tag{2.5}$$

Here $\bar n = n/m$ is the average number of sampled individuals per group in the sample, $n^* = \bar n\,(1 - C_n^2/(m-1))$, and $C_n^2 = \sum_g (n_g - \bar n)^2/(m \bar n^2)$ is the square of the coefficient of variation of the group sample sizes. These results have some important implications. If $\Sigma^{(2)} = 0$ then the group and individual level covariance matrices have the same expectation. However, there will be a large difference in the expectations when $n^*$ is large, even if the elements of $\Sigma^{(2)}$ are much smaller than those of $\Sigma^{(1)}$ but not zero. Census Collection Districts have approximately 500 people in them, and geographical groups with much larger populations are used in ecological analysis. In these cases $S^{(2)}_{yy}$ contains very little contribution from $\Sigma^{(1)}$ and is mainly determined by $\Sigma^{(2)}$. Using $S^{(2)}_{yy}/n^*$ to estimate $\Sigma^{(2)}$ will not give badly biased results if $n^*$ is large. As an estimate of $\Sigma$, the bias of $S^{(1)}_{yy}$ is $O(m^{-1})$, and it will be a reasonable estimate of the marginal individual level relationships provided $m$ is not small. However, using $S^{(2)}_{yy}$ to estimate $\Sigma$ will result in a bias of $(n^* - 1)\Sigma^{(2)}$. The bias arises because the group level covariance matrix has an expectation that is a linear combination of $\Sigma^{(1)}$ and $\Sigma^{(2)}$ with the wrong implicit weights given to the two components (see Holt et al. [4]). To remove the bias requires estimation of $\Sigma^{(1)}$ and $\Sigma^{(2)}$.
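To give a feel for the size of the effect, the sketch below evaluates the two expectations for a hypothetical univariate case with 100 groups of 500 individuals and a group level variance component that is one percent of the individual level one. It assumes the usual one-way analysis of variance form $n^* = (n - \sum_g n_g^2/n)/(m - 1)$; the function names and the numbers are illustrative only.

```python
import numpy as np

def n_star(sizes):
    """Effective group size, assuming n* = (n - sum(n_g^2)/n) / (m - 1)."""
    sizes = np.asarray(sizes, dtype=float)
    n, m = sizes.sum(), sizes.size
    return (n - (sizes ** 2).sum() / n) / (m - 1)

def expected_covariances(sigma1, sigma2, sizes):
    """Expectations of S^(1) and S^(2) under the two-level model (2.1) to (2.3)."""
    sizes = np.asarray(sizes, dtype=float)
    n, m = sizes.sum(), sizes.size
    ns = n_star(sizes)
    e_s1 = sigma1 + (m - 1) * ns / (n - 1) * sigma2
    e_s2 = sigma1 + ns * sigma2
    return e_s1, e_s2

sizes = np.full(100, 500)                  # hypothetical: 100 groups of 500 people
sigma1 = np.array([[1.0]])                 # individual level variance component
sigma2 = np.array([[0.01]])                # group level variance component
e_s1, e_s2 = expected_covariances(sigma1, sigma2, sizes)
marginal = sigma1 + sigma2                 # the target Sigma
```

With these inputs the expectation of the unit level matrix stays very close to the marginal variance, while the expectation of the group level matrix is several times larger, even though the group level component is tiny.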

Explaining biases in ecological linear regression using a multi-level model with auxiliary variables
If individuals are allocated to groups at random there is no ecological fallacy for linear statistics, and parameters such as means, variances, regression and correlation coefficients can be unbiasedly estimated from group level data. Variances of statistics are mainly determined by the number of groups in the analysis (Steel and Holt [9]).
In practice, individuals who live in the same area exhibit positive intra-group correlation for a variety of socio-economic characteristics. The homogeneity within groups is a key factor in the ecological fallacy. Suppose that there is a set of auxiliary variables, $z$, that characterize the way in which individuals are clustered within the groups and, conditional on $z$, the observations for individuals in area $g$ are influenced by random group level effects. The auxiliary variables in $z$ will be called grouping variables; they may have only a small effect on the individual level relationships and may not be of any direct interest. However, because of their strong within-group homogeneity they may affect the ecological analysis greatly. The matrices $z = [z_1, \ldots, z_N]'$ and $c = [c_1, \ldots, c_N]'$ give the values for all units in the population of size $N$. The $i$th individual has a vector of response variables $y_i$ and a vector of explanatory variables $x_i$.
We will focus on the cases when there are aggregate group level data available and when there is also a limited amount of individual level data on a few variables without any group indicators.
Steel and Holt [8] considered the implication of a multi-level model with auxiliary variables for the ecological analysis of covariance matrices and correlation coefficients. They also developed a method for adjusting the analysis of aggregate data to provide less biased estimates of covariance matrices and correlation coefficients. Holt et al. [4] evaluated this method and were able to reduce the biases by about 70 percent by using limited amounts of individual level data for a small set of variables that help characterize the differences between groups. We consider the implications of this model for ecological linear regression analysis.
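The data available consist of group level covariance matrices $S^{(2)}_{yy}$, $S^{(2)}_{xx}$, and $S^{(2)}_{xy}$ calculated using the group sample sizes as weights. These covariance matrices may be combined in $S^{(2)}_{ww}$, the covariance matrix for all of the variables, where $w = (x', y')'$. The ecological regression coefficients relating $y$ to $x$ are estimated by $B^{(2)}_{yx} = (S^{(2)}_{xx})^{-1} S^{(2)}_{xy}$. The model given in (2.1) to (2.3) is expanded to include $x$ and $z$ by assuming the following model conditional on $z$ and the groups used:

$$w_i = \mu_{w|z} + \beta_{wz}' z_i + \nu_g + \varepsilon_i, \quad i \in g, \tag{3.1}$$

where

$$V(\nu_g \mid z, c) = \Sigma^{(2)}_{ww|z}, \qquad V(\varepsilon_i \mid z, c) = \Sigma^{(1)}_{ww|z}. \tag{3.2}$$

This model implies

$$V(w_i \mid z, c) = \Sigma_{ww|z} = \Sigma^{(1)}_{ww|z} + \Sigma^{(2)}_{ww|z}, \qquad \mathrm{Cov}(w_i, w_j \mid z, c) = \Sigma^{(2)}_{ww|z} \ \text{for } c_i = c_j,\ i \neq j. \tag{3.3}$$

The matrix $\Sigma^{(2)}_{ww|z}$ has components $\Sigma^{(2)}_{xx|z}$, $\Sigma^{(2)}_{xy|z}$, and $\Sigma^{(2)}_{yy|z}$, and $\beta_{wz} = (\beta_{xz}, \beta_{yz})$. Assuming $V(z_i) = \Sigma_{zz}$, the marginal covariance matrix is

$$\Sigma_{ww} = \Sigma_{ww|z} + \beta_{wz}' \Sigma_{zz} \beta_{wz}, \tag{3.4}$$

which has components $\Sigma_{xx}$, $\Sigma_{xy}$, and $\Sigma_{yy}$. The target of inference is $\beta_{yx} = \Sigma_{xx}^{-1} \Sigma_{xy}$. Under this model, Steel and Holt [8] showed

$$E[S^{(2)}_{ww} \mid z, c] = \Sigma_{ww} + \beta_{wz}' (S^{(2)}_{zz} - \Sigma_{zz}) \beta_{wz} + (n^* - 1)\,\Sigma^{(2)}_{ww|z}. \tag{3.5}$$

Provided that the variance of $S^{(2)}_{ww}$ is $O(m^{-1})$, the expectation of the ecological regression coefficients can be obtained by replacing $S^{(2)}_{xx}$ and $S^{(2)}_{xy}$ by their expectations, to give, to $O(m^{-1})$,

$$E[B^{(2)}_{yx} \mid z, c] = A^{-1}\left[\Sigma_{xy} + \beta_{xz}' (S^{(2)}_{zz} - \Sigma_{zz}) \beta_{yz} + (n^* - 1)\,\Sigma^{(2)}_{xy|z}\right], \tag{3.6}$$

$$A = \Sigma_{xx} + \beta_{xz}' (S^{(2)}_{zz} - \Sigma_{zz}) \beta_{xz} + (n^* - 1)\,\Sigma^{(2)}_{xx|z}. \tag{3.7}$$

The resulting bias, conditional on $z$ and $c$, can be shown to be (Steel [7]):

$$E[B^{(2)}_{yx} \mid z, c] - \beta_{yx} = \left(E[B^{(2)}_{zx} \mid z, c] - \beta_{zx}\right)\beta_{yz|x} + (n^* - 1)\,A^{-1}\left[\Sigma^{(2)}_{xy|z} - \Sigma^{(2)}_{xx|z}\,\beta_{yx|z}\right]. \tag{3.8}$$

The first term in the bias in (3.8) will disappear if either $\beta_{xz} = 0$ or $\beta_{yz|x} = 0$, that is, if the explanatory and grouping variables are unrelated or if the response variables have no relationship with the grouping variables once the explanatory variables included in the model are taken into account. Since $E[B^{(2)}_{zx} \mid z, c] = A^{-1} \beta_{xz}' S^{(2)}_{zz}$, the first term in (3.8) is due to the bias of $B^{(2)}_{zx}$ in estimating $\beta_{zx} = \Sigma_{xx}^{-1} \beta_{xz}' \Sigma_{zz}$. The second term in the bias in (3.8) will disappear if $\Sigma^{(2)}_{xy|z} = \Sigma^{(2)}_{xx|z} \beta_{yx|z}$, that is, if conditional on the grouping variables, the covariance between the values of the response and explanatory variables for different individuals in the same group is solely due to the covariance of the explanatory variables within the same group and the relationship between $y$ and $x$ for the same individual. This condition is equivalent to the population regression coefficients relating $y$ to $x$, conditional on the grouping variables, being the same at the individual and group level, that is, $(\Sigma^{(2)}_{xx|z})^{-1} \Sigma^{(2)}_{xy|z} = \beta_{yx|z}$. The second term in (3.8) will also disappear if $\Sigma^{(2)}_{xy|z} = 0$ and $\Sigma^{(2)}_{xx|z} = 0$, when there are no random effects conditional on $z$. The second term in the bias involves $n^*$, which can be very large, for example when the group means are based on all individuals in the groups.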
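The two-term form of the bias can be checked numerically. The sketch below builds an arbitrary set of population parameters, computes the expected ecological coefficients from the expected group level covariance matrix, and compares the direct bias with the sum of the two terms. All parameter values are hypothetical and generated at random; the only assumptions are the model of this section and the identities $\beta_{yz} = \beta_{xz}\beta_{yx|z} + \beta_{yz|x}$ and $\beta_{zx} = \Sigma_{xx}^{-1}\beta_{xz}'\Sigma_{zz}$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2                  # numbers of explanatory (x) and grouping (z) variables
n_star = 450.0               # hypothetical effective group size

def random_cov(k, scale=1.0):
    """Arbitrary positive definite matrix, used only to generate test parameters."""
    a = rng.normal(size=(k, k))
    return scale * (a @ a.T + k * np.eye(k))

sigma_zz = random_cov(q)
s2_zz = 30.0 * sigma_zz + random_cov(q)       # group level covariance of z (inflated)
beta_xz = rng.normal(size=(q, p))             # coefficients of x on z
beta_yx_z = rng.normal(size=(p, 1))           # y on x, conditional on z
beta_yz_x = rng.normal(size=(q, 1))           # y on z, conditional on x
sigma_xx_z = random_cov(p)
group_part = random_cov(p + 1, scale=0.01)    # group level components for (x, y) given z
sigma2_xx_z, sigma2_xy_z = group_part[:p, :p], group_part[:p, p:]

# Marginal population quantities implied by the conditional model.
beta_yz = beta_xz @ beta_yx_z + beta_yz_x
sigma_xx = sigma_xx_z + beta_xz.T @ sigma_zz @ beta_xz
sigma_xy = sigma_xx_z @ beta_yx_z + beta_xz.T @ sigma_zz @ beta_yz
beta_yx = np.linalg.solve(sigma_xx, sigma_xy)
beta_zx = np.linalg.solve(sigma_xx, beta_xz.T @ sigma_zz)

# Expected ecological coefficients from the expected group level covariance matrix.
d = s2_zz - sigma_zz
a_mat = sigma_xx + beta_xz.T @ d @ beta_xz + (n_star - 1) * sigma2_xx_z
c_mat = sigma_xy + beta_xz.T @ d @ beta_yz + (n_star - 1) * sigma2_xy_z
e_b_yx = np.linalg.solve(a_mat, c_mat)
e_b_zx = np.linalg.solve(a_mat, beta_xz.T @ s2_zz)

bias_direct = e_b_yx - beta_yx
term1 = (e_b_zx - beta_zx) @ beta_yz_x
term2 = (n_star - 1) * np.linalg.solve(a_mat, sigma2_xy_z - sigma2_xx_z @ beta_yx_z)
```

Setting `beta_yz_x` to zero removes `term1`, and setting the group level components to zero removes `term2`, which mirrors the two conditions under which each part of the bias disappears.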
The effect of aggregation has been considered for some aggregation criterion (see Blalock [1]). In this model this idea can be represented by all the grouping effect operating through the auxiliary variables and there being no group level effects. In this case $A = \Sigma_{xx} + \beta_{xz}' (S^{(2)}_{zz} - \Sigma_{zz}) \beta_{xz}$ and the bias is $[E[B^{(2)}_{zx} \mid z, c] - \beta_{zx}]\,\beta_{yz|x}$, which is entirely due to the effect of aggregation on the implied estimate of $\beta_{zx}$.
The model here allows for group effects in two ways that explain the effect of aggregation. The form of the bias in (3.7) and (3.8) suggests that it will not be possible to reach any general conclusions about the size or likely direction of the biases. However, the general formulas for the bias can be applied to some special cases.
In the case of one explanatory variable and no grouping variables, the expected ecological regression coefficient is a weighted average of the individual level and group level population regression coefficients, with weights that depend on $n^*$ and on the intra-group correlation of $x$, $\delta_{xx} = \Sigma^{(2)}_{xx}/\Sigma_{xx}$. The effect of aggregation is to shift the weight given to the population regression parameters towards the group level. Even a small value of $\delta_{xx}$ can lead to a considerable shift if $n^*$ is large (see Holt et al. [4]). For the case of several explanatory variables and one grouping variable, the bias can be expressed in terms of the population multiple correlation coefficients between $z$ and $x$ and between $x$ and $z$, and $Q_z = S^{(2)}_{zz}/S^{(1)}_{zz}$. If $\Sigma^{(2)}_{xx|z} = 0$ then $E[B^{(2)}_{zx} \mid z, c]$ equals $\beta_{zx}$ multiplied by a factor that will exceed 1 provided $Q_z$ exceeds 1. There is an amplification effect on the contribution of $\beta_{zx}\beta_{yz|x}$. This has been noted before (e.g., Smith [5]) but it relies on the grouping being one dimensional.
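Both special cases can be illustrated numerically. The sketch below is our own working rather than the paper's formulas: for the first case it assumes the weight on the group level slope is $n^*\delta_{xx}/(1 + (n^* - 1)\delta_{xx})$, and for the second it assumes $\Sigma^{(2)}_{xx|z} = 0$ and a scalar $z$, for which a rank-one matrix inversion gives the factor $Q_z/(1 + (Q_z - 1)R^2)$, with $R^2$ the squared multiple correlation of $z$ on $x$ and $Q_z$ taken relative to $\Sigma_{zz}$. All numerical values are hypothetical.

```python
import numpy as np

# Case 1: one explanatory variable, no grouping variables (all quantities scalar).
n_star = 450.0
s1_xx, s2_xx = 0.99, 0.01          # individual and group level variance components of x
s1_xy, s2_xy = 0.50, 0.02          # corresponding covariance components with y
delta_xx = s2_xx / (s1_xx + s2_xx)                 # intra-group correlation of x
beta_1, beta_2 = s1_xy / s1_xx, s2_xy / s2_xx      # individual and group level slopes
beta_yx = (1 - delta_xx) * beta_1 + delta_xx * beta_2          # population target
weight = n_star * delta_xx / (1 + (n_star - 1) * delta_xx)     # weight on group level slope
expected_b2 = (s1_xy + n_star * s2_xy) / (s1_xx + n_star * s2_xx)

# Case 2: several explanatory variables, one grouping variable, no group effects given z.
def amplification(q_z, r2_zx):
    """Factor multiplying beta_zx in E[B_zx]; exceeds 1 whenever q_z > 1 and r2_zx < 1."""
    return q_z / (1.0 + (q_z - 1.0) * r2_zx)

sigma_zz, s2_zz = 1.0, 30.0                        # individual and group level variance of z
beta_xz = np.array([[0.4, -0.2, 0.1]])             # 1 x p coefficients of x on z
sigma_xx = np.diag([1.0, 2.0, 0.5]) + sigma_zz * beta_xz.T @ beta_xz
beta_zx = np.linalg.solve(sigma_xx, beta_xz.T * sigma_zz)
r2_zx = (sigma_zz * beta_xz @ np.linalg.solve(sigma_xx, beta_xz.T)).item()
a_mat = sigma_xx + (s2_zz - sigma_zz) * beta_xz.T @ beta_xz
expected_b2_zx = np.linalg.solve(a_mat, beta_xz.T * s2_zz)
factor = amplification(s2_zz / sigma_zz, r2_zx)
```

In the first case an intra-group correlation of only 0.01 moves most of the weight to the group level slope when groups contain several hundred individuals, which is the shift described in the text.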

An adjusted ecological regression method using auxiliary variables
The discussion above has identified the causes of the ecological fallacy as the grouping effects associated with the auxiliary variables and the remaining group level variance components. We now consider methods to produce estimates of $\beta_{yx}$ from aggregate data. One approach is based on the variance structure for the group means implied by the model when there are no auxiliary variables,

$$V(\bar w_g) = \Sigma^{(1)}_{ww}/n_g + \Sigma^{(2)}_{ww}. \tag{4.1}$$

For example, the IGLS procedure embodied in MLwiN can be used (see Goldstein [3]). This approach relies on there being reasonable variation in the sample sizes between the groups and on the variance structure originally assumed at the individual level leading to variances which have a component that is constant and one which is proportional to $1/n_g$. At each step of the iterative process the method regresses $(\bar w_g - \hat\mu_w)(\bar w_g - \hat\mu_w)'$ against $1/n_g$, where $\hat\mu_w$ is the current estimate of $\mu_w$, and the estimates of $\Sigma^{(1)}_{ww}$ and $\Sigma^{(2)}_{ww}$ are the resulting regression coefficients.
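The regression step described above can be sketched for a single variable as follows. This is a simplified illustration using unweighted least squares, not the IGLS algorithm as implemented in MLwiN; the function names, the truncation of negative estimates at zero, and the inverse-variance update of the mean are all assumptions made for the sketch.

```python
import numpy as np

def regress_on_inverse_size(sq_dev, sizes):
    """Least squares fit of sq_dev_g = sigma1 / n_g + sigma2; returns (sigma1, sigma2)."""
    sizes = np.asarray(sizes, dtype=float)
    design = np.column_stack([1.0 / sizes, np.ones_like(sizes)])
    coef, *_ = np.linalg.lstsq(design, np.asarray(sq_dev, dtype=float), rcond=None)
    return float(coef[0]), float(coef[1])

def estimate_from_means(means, sizes, n_iter=10):
    """Alternate between updating mu and the two variance components (univariate case)."""
    means = np.asarray(means, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    mu = np.average(means, weights=sizes)          # starting value: overall mean
    sigma1 = sigma2 = 0.0
    for _ in range(n_iter):
        sigma1, sigma2 = regress_on_inverse_size((means - mu) ** 2, sizes)
        sigma1, sigma2 = max(sigma1, 0.0), max(sigma2, 0.0)   # keep variances non-negative
        var_g = sigma1 / sizes + sigma2
        if np.all(var_g > 0):
            mu = np.average(means, weights=1.0 / var_g)       # inverse-variance weighted mean
    return mu, sigma1, sigma2

# Exact check of the regression step: inputs constructed to satisfy the variance structure.
sizes = np.array([10.0, 20.0, 50.0, 100.0, 400.0])
sigma1_fit, sigma2_fit = regress_on_inverse_size(4.0 / sizes + 0.25, sizes)
```

The regression can only separate the two components when the group sizes differ, which is why the approach relies on reasonable variation in $n_g$: with equal sizes the two columns of the design matrix are proportional.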
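Another approach is to assume that a set of $z$ variables can be identified that explain much of the aggregation effect on the variables of interest. If individual level data on these variables are available, the aggregation bias due to these $z$ variables may be estimated. Under (3.1), $E[B^{(2)}_{wz} \mid z, c] = \beta_{wz}$, where $B^{(2)}_{wz} = (S^{(2)}_{zz})^{-1} S^{(2)}_{zw}$. If an estimate of the individual level population covariance matrix for $z$ were available, possibly from another source, Steel and Holt [8] proposed the following adjusted estimator of $\Sigma_{ww}$,

$$\hat\Sigma_{ww}(z) = S^{(2)}_{ww} + B^{(2)\prime}_{wz}\,(\hat\Sigma_{zz} - S^{(2)}_{zz})\,B^{(2)}_{wz} = S^{(2)}_{ww|z} + B^{(2)\prime}_{wz}\,\hat\Sigma_{zz}\,B^{(2)}_{wz}, \tag{4.2}$$

where $\hat\Sigma_{zz}$ is the estimate of $\Sigma_{zz}$ calculated from individual level data and $S^{(2)}_{ww|z} = S^{(2)}_{ww} - B^{(2)\prime}_{wz} S^{(2)}_{zz} B^{(2)}_{wz}$. This estimator corresponds to a Pearson-type adjustment (Smith [6]) and for Normally distributed data is the MLE when $\Sigma^{(2)}_{ww|z} = 0$ and $\hat\Sigma_{zz}$ is also the MLE. This estimator removes the aggregation bias due to $z$. Adjusted regression coefficients can then be calculated from $\hat\Sigma_{ww}(z)$, that is,

$$\hat\beta_{yx}(z) = \hat\Sigma_{xx}(z)^{-1}\,\hat\Sigma_{xy}(z).$$

The adjusted estimator replaces the components of the bias in (3.6) and (3.7) due to $\beta_{xz}'(S^{(2)}_{zz} - \Sigma_{zz})\beta_{xz}$ and $\beta_{xz}'(S^{(2)}_{zz} - \Sigma_{zz})\beta_{yz}$ by $\beta_{xz}'(\hat\Sigma_{zz} - \Sigma_{zz})\beta_{xz}$ and $\beta_{xz}'(\hat\Sigma_{zz} - \Sigma_{zz})\beta_{yz}$, while the terms in $(n^* - 1)\Sigma^{(2)}_{xx|z}$ and $(n^* - 1)\Sigma^{(2)}_{xy|z}$ remain. The bias of $\hat\beta_{yx}(z)$ then has the same form as (3.8) with $S^{(2)}_{zz}$ replaced by $\hat\Sigma_{zz}$. Suppose that $\hat\Sigma_{zz}$ is an estimate based on an individual level sample involving $m_0$ first stage units. Then for many sample designs $\hat\Sigma_{zz} = \Sigma_{zz} + O(m_0^{-1})$, and so to $O(1/m_0)$ the bias of $\hat\beta_{yx}(z)$ is

$$(n^* - 1)\left[\Sigma_{xx} + (n^* - 1)\Sigma^{(2)}_{xx|z}\right]^{-1}\left[\Sigma^{(2)}_{xy|z} - \Sigma^{(2)}_{xx|z}\,\beta_{yx}\right].$$

It is not necessary for the individual level data to contain group identifiers, only that they permit estimation of $\Sigma_{zz}$. If $\Sigma^{(2)}_{xx|z} = 0$ then the bias of $\hat\beta_{yx}(z)$ is $O(m_0^{-1})$. The adjusted estimator can be rewritten as

$$\hat\beta_{yx}(z) = B^{(2)}_{yx|z} + \hat\beta_{zx}(z)\,B^{(2)}_{yz|x},$$

where $\hat\beta_{zx}(z) = \hat\Sigma_{xx}^{-1}(z)\,B^{(2)\prime}_{xz}\,\hat\Sigma_{zz}$. Corresponding decompositions apply at the group and individual levels:

$$B^{(2)}_{yx} = B^{(2)}_{yx|z} + B^{(2)}_{zx}\,B^{(2)}_{yz|x}, \qquad \beta_{yx} = \beta_{yx|z} + \beta_{zx}\,\beta_{yz|x}.$$

The adjustment corrects for the bias in the estimation of $\beta_{zx}$ by replacing $B^{(2)}_{zx}$ by $\hat\beta_{zx}(z)$.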
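The adjusted estimator is simple to compute once the group level covariance matrices and an individual level estimate of the covariance matrix of $z$ are in hand. The sketch below is illustrative only: the function names and all numbers are hypothetical, and the check uses population values in place of sample quantities, with no group level effects conditional on $z$, so that the adjustment should recover the marginal covariance matrix exactly.

```python
import numpy as np

def adjusted_covariance(s2_ww, s2_zw, s2_zz, sigma_zz_hat):
    """Adjusted estimator: S_ww + B'(Sigma_zz_hat - S_zz)B with B = S_zz^{-1} S_zw."""
    b_wz = np.linalg.solve(s2_zz, s2_zw)
    return s2_ww + b_wz.T @ (sigma_zz_hat - s2_zz) @ b_wz

def regression_from_covariance(sigma_ww, x_idx, y_idx):
    """Regression coefficients of the y block on the x block of a covariance matrix."""
    return np.linalg.solve(sigma_ww[np.ix_(x_idx, x_idx)], sigma_ww[np.ix_(x_idx, y_idx)])

# Hypothetical population values: two grouping variables, w = (x1, x2, y).
sigma_zz = np.array([[1.0, 0.3], [0.3, 2.0]])
s2_zz = 40.0 * sigma_zz                                   # strong aggregation effect on z
beta_wz = np.array([[0.5, -0.2, 1.0], [0.1, 0.4, -0.3]])  # q x k coefficients of w on z
sigma_ww_z = np.array([[1.0, 0.2, 0.5], [0.2, 1.5, 0.3], [0.5, 0.3, 2.0]])
sigma_ww = sigma_ww_z + beta_wz.T @ sigma_zz @ beta_wz    # marginal covariance matrix

# Expected group level matrices when there are no group effects conditional on z.
s2_ww = sigma_ww_z + beta_wz.T @ s2_zz @ beta_wz
s2_zw = s2_zz @ beta_wz

adjusted = adjusted_covariance(s2_ww, s2_zw, s2_zz, sigma_zz)
beta_adj = regression_from_covariance(adjusted, [0, 1], [2])
beta_eco = regression_from_covariance(s2_ww, [0, 1], [2])     # unadjusted ecological
beta_true = regression_from_covariance(sigma_ww, [0, 1], [2])
```

When group level effects remain after conditioning on $z$, the extra terms in $(n^* - 1)\Sigma^{(2)}_{ww|z}$ would be added to `s2_ww` and the adjustment would no longer recover `sigma_ww`, which is the residual bias discussed in the text.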
The bias due to the conditional variance components $\Sigma^{(2)}_{ww|z}$ remains. The two approaches can be combined. Multi-level modelling with aggregate data can be used to produce estimates of $\Sigma^{(2)}_{ww|z}$ and $\Sigma^{(1)}_{ww|z}$ and maximum likelihood estimates of $\beta_{yz}$ and $\beta_{xz}$. These can be combined to produce an estimate of $\beta_{yx}$ which accounts for the conditional variance components. That is, calculate

$$\hat\Sigma_{ww} = \hat\Sigma^{(1)}_{ww|z} + \hat\Sigma^{(2)}_{ww|z} + \hat\beta_{wz}'\,\hat\Sigma_{zz}\,\hat\beta_{wz} \tag{4.8}$$

and then use the relevant components of $\hat\Sigma_{ww}$, that is, $\hat B_{yx} = \hat\Sigma_{xx}^{-1}\hat\Sigma_{xy}$. However, this approach still relies on the use of purely aggregate data to estimate variance components.

Evaluation of aggregation effects in ecological regression
5.1. The data. An empirical investigation into the effects of aggregation on multiple regression analysis was carried out using data from the Australian 1991 Population Census for the city of Adelaide. Group level data were available in the form of totals for the 1711 census collection districts (CDs). The analysis was confined to people aged 15 or more and there was an average of about 450 such people per CD. To enable an evaluation to be carried out we also used data from the census households sample file (HSF), which is a one percent sample of households and the people within them.
The evaluation concentrated on the dependent variable of personal income. This variable is collected in 14 ranges but was treated as a continuous variable by giving each person the mid point of the range. The following variables were considered as possible explanatory variables: marital status, sex, possessing a degree, employed-manual occupation, employed-managerial or professional occupation, employed-other, unemployed, born in Australia, born in UK and four age categories. The auxiliary variables considered were: age 45 to 59, age 60+, owner occupied, renting from government, housing type.

Aggregation effects on variances and bivariate statistics.
The aggregation effects on the variance of each variable, that is, the ratios of the group level to unit level variance, $Q_a = S^{(2)}_{aa}/S^{(1)}_{aa}$, are given in Table 5.1, along with the associated estimates of the intra-CD correlation $\delta_{aa}$. All the variables experienced some aggregation effect, ranging from 2.65 for sex to 171.1 for renting from government. A small amount of within-group correlation can lead to very large aggregation effects on variances because of the large number of individuals within the areas. The variables considered as potential auxiliary variables generally have the larger aggregation effects. This is one reason for selecting these particular variables. It is usually possible to calculate $Q_a$ for a range of variables since a reasonable idea of the individual level variance can often be obtained from other published data. For a dichotomous variable all that is required is an estimate of the population proportion.
Tables 5.2, 5.3, and 5.4 summarize the effect of aggregation on the analysis of bivariate covariances, correlations and regression coefficients between income and each of the explanatory and auxiliary variables. The CD level correlations are generally of the same sign as, but larger than, the corresponding individual level correlations. In most cases the change is sufficient to affect the substantive interpretation. There are a number of cases in which the correlations have different signs at the two levels. The effect of aggregation on the regression coefficients is similar.
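The link between the aggregation effect and the intra-CD correlation can be made concrete. If the unit level variance is close to the marginal variance, the expectation of the group level variance implies approximately $Q_a \approx 1 + (n^* - 1)\delta_{aa}$. The sketch below inverts this relationship; it is our own illustration, and the value $n^* = 450$ is an assumption based on the average CD size quoted earlier, not a figure taken from Table 5.1.

```python
def intra_group_correlation(q_a, n_star):
    """Approximate intra-group correlation implied by an aggregation effect q_a."""
    return (q_a - 1.0) / (n_star - 1.0)

def aggregation_effect(delta, n_star):
    """Approximate aggregation effect implied by an intra-group correlation delta."""
    return 1.0 + (n_star - 1.0) * delta

def binary_variance(proportion):
    """Individual level variance of a dichotomous variable with the given proportion."""
    return proportion * (1.0 - proportion)

n_star = 450.0                                        # assumed effective CD size
delta_sex = intra_group_correlation(2.65, n_star)     # smallest reported Q_a
delta_rent = intra_group_correlation(171.1, n_star)   # largest reported Q_a
```

Under this approximation the smallest reported aggregation effect corresponds to an intra-CD correlation well below 0.01, which illustrates how a very small amount of within-group correlation is magnified by the group size.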

Multivariate aggregation effects.
The aggregation effect on each of the variables or each pair of variables does not completely characterize the grouping and aggregation in a multivariate situation. Steel and Holt [8] introduced the idea of canonical grouping variables (CGVs) as a way of identifying the important variables associated with the grouping of a population. Suppose unit and CD level covariance matrices $S^{(1)}$ and $S^{(2)}$ have been calculated for a set of variables. The CGVs for CDs are obtained from the eigenvectors $d^{(2)}_1, \ldots, d^{(2)}_p$ of $(S^{(1)})^{-1} S^{(2)}$ with associated eigenvalues $\theta^{(2)}_1, \ldots, \theta^{(2)}_p$. Let $D^{(2)} = [d^{(2)}_1, \ldots, d^{(2)}_p]$; then the CGVs are defined by $U = D^{(2)\prime} Y$ and have covariance matrix $\mathrm{diag}(\theta^{(2)}_l)$ at the CD level and $I_p$ at the individual level. Subject to the constraints of being mutually uncorrelated at the individual and CD level, the CGVs have successively the maximum aggregation effect and therefore the maximum intra-CD correlation.
The matrix $(S^{(1)})^{-1} S^{(2)}$ is an extension of the univariate aggregation effect $Q_a$, and the eigenvalues give the aggregation effect of each of the mutually orthogonal grouping dimensions in the set of variables being considered. Summary measures of the aggregation effects in multivariate data are given by $\bar\theta = \sum_l \theta^{(2)}_l / p = \mathrm{trace}[(S^{(1)})^{-1} S^{(2)}]/p$ and $Q$. The quantity $\sum_{l=1}^{q} (\theta^{(2)}_l - 1)$ is the amount of aggregation effect that can be associated with the first $q$ CGVs and is an upper limit to the aggregation effect that any $q$ adjustment variables can remove.
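The CGV coefficients and their aggregation effects can be obtained from a symmetric eigenproblem. The following sketch uses a Cholesky factor of the unit level matrix so that a standard symmetric eigensolver can be applied; the function name and the test matrices are hypothetical, and the cumulative share computed at the end assumes that the aggregation effect attributed to each CGV is its eigenvalue minus one.

```python
import numpy as np

def canonical_grouping_variables(s1, s2):
    """Eigenvalues (descending) and eigenvectors of inv(s1) @ s2, scaled so d' s1 d = I."""
    chol = np.linalg.cholesky(s1)                  # s1 = chol @ chol.T
    inv_chol = np.linalg.inv(chol)
    sym = inv_chol @ s2 @ inv_chol.T               # symmetric, same eigenvalues as inv(s1) @ s2
    theta, vecs = np.linalg.eigh(sym)
    order = np.argsort(theta)[::-1]                # largest aggregation effect first
    theta, vecs = theta[order], vecs[:, order]
    d = inv_chol.T @ vecs                          # columns are the CGV coefficient vectors
    return theta, d

# Hypothetical covariance matrices for four variables.
rng = np.random.default_rng(2)
a = rng.normal(size=(4, 4))
b = rng.normal(size=(4, 4))
s1 = a @ a.T + 4 * np.eye(4)
s2 = s1 + 10 * (b @ b.T)                           # group level matrix, no smaller than s1
theta, d = canonical_grouping_variables(s1, s2)
theta_bar = theta.mean()                           # average aggregation effect
cumulative_share = np.cumsum(theta - 1) / np.sum(theta - 1)
```

The columns of `d` give the coefficients of each CGV on the original variables, which is what is inspected when a CGV is described as corresponding mainly to one variable or to a contrast between variables.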
Considering all the variables together, that is, $(y, x, z)$, gave $\bar\theta = 27.7$ and $Q = 33.7$. The first four CGVs accounted for 83 percent of the total aggregation effect. The coefficients for the first four CGVs showed that the first corresponds to renting from government, the second to owner occupied and housing type, the third to aged 60+, and the fourth to a combination of income, degree and managerial or professional occupation.
The results of a CGV analysis of $(y, x)$ gave $\bar\theta = 13.6$ and $Q = 14.5$. The first five CGVs accounted for 82 percent of the total aggregation effect. The coefficients for the first five CGVs showed that the first corresponds mainly to aged 60+ contrasted with married, the second is a combination of income, degree and managerial or professional occupation, the third is aged 60+ and born UK, the fourth is married contrasted with born UK, and the fifth is aged 45-59.

Aggregation effects on multiple regression.
Multiple regression models were estimated using the HSF data and the CD data, weighted by CD population size. The results are summarized in Table 5.5. The $R^2$ of the CD level equation, 0.880, is much larger than that of the individual level equation, 0.496. However, the CD level $R^2$ indicates how much of the variation in CD mean income is being explained. Generally the regression coefficients estimated at the two levels are of the same sign, the exceptions being married, which is non-significant at the individual level, and the coefficient for aged 20-29. The values can be very different at the two levels, with the CD level coefficients being larger than the corresponding individual level coefficients in some cases and smaller in others. The differences are often considerable; for example, the coefficient for degree increases from 8471 to 21700. The average absolute difference was 4533.
The difference between the two estimated models can also be examined by comparing their fit at the individual level. The fitted value based on the individual level model is $\hat y^{(1)}_i = B^{(1)\prime}_{yx} x_i$ and that based on the CD level model is $\hat y^{(2)}_i = B^{(2)\prime}_{yx} x_i$. The usual estimate of the residual variance is $\sum_{i \in s} (y_i - \hat y^{(1)}_i)^2/(n - p)$, which was $10351^2$, and this can be compared with $\sum_{i \in s} (y_i - \hat y^{(2)}_i)^2/(n - p)$, which was $12113^2$. Using the CD level equation to predict individual level income gave an $R^2$ of 0.310 compared with 0.496 for the individual level regression equation.
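The comparison of fit can be sketched as follows. The function and the simulated data are hypothetical, and the $R^2$ is assumed to be one minus the ratio of the residual to the total sum of squares. Under that assumption the reported figures are mutually consistent: since both residual variances share the divisor $n - p$, the $R^2$ for the CD level equation is implied by the individual level $R^2$ and the ratio of the two residual variances.

```python
import numpy as np

def individual_level_fit(y, x, coef):
    """Residual variance with divisor n - p, and R^2 = 1 - SSE/SST, for given coefficients."""
    resid = y - x @ coef
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    return sse / (len(y) - x.shape[1]), 1.0 - sse / sst

# Hypothetical unit level data with an intercept and two explanatory variables.
rng = np.random.default_rng(3)
x = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = x @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
coef_unit, *_ = np.linalg.lstsq(x, y, rcond=None)      # individual level estimates
coef_other = coef_unit * np.array([1.0, 1.8, 0.4])     # stand-in for group level estimates
var_unit, r2_unit = individual_level_fit(y, x, coef_unit)
var_other, r2_other = individual_level_fit(y, x, coef_other)

# Consistency check on the reported figures (0.496, 10351^2 and 12113^2).
implied_r2 = 1.0 - (1.0 - 0.496) * (12113.0 / 10351.0) ** 2
```

The value of `implied_r2` agrees with the reported 0.310 to three decimal places.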
Other variables could be added to the model but the $R^2$ obtained was considered acceptable and this sort of model is indicative of what researchers might use in practice. The $R^2$ obtained at the individual level is consistent with those found in other studies of income (e.g., Davies et al. [2]). There are likely to be variables with some explanatory power omitted from the model, but this reflects practical data analysis. We were concerned with looking at the effect of aggregation and the effectiveness of methods for adjusting for aggregation effects when a reasonable but not necessarily perfect statistical model is being used. The log transformation was also tried for the income variable but did not result in an appreciably better fit.
The estimates and associated estimated standard errors obtained at the two levels are different and so is the assessment of their statistical significance. Using a ten percent significance level, the coefficients for married, aged 45-59 and aged 60+ were non-significant in the individual level equation. In the CD level equation the coefficients for unemployed, manual occupation, aged 15-19 and aged 45-50 were non-significant. The estimated standard errors of coefficients at the CD level were between 1.19 and 3.65 times larger than those estimated at the individual level. The changes in the estimated residual mean squared error and the degrees of freedom imply an increase by a factor of 3.23. For all the coefficients except female the increase is less than 3.23, which is due to the effect of aggregation on $S_{xx}$.

An evaluation of the adjusted CD level regression method
The CGV analysis suggests which variables have strong grouping effects. In considering potential adjustment variables we also need to consider those variables for which it is reasonable to expect that individual level data might be available. Because the adjustment relies on obtaining a good estimate of the unit level covariance matrix of the adjustment variables, we need to keep the number of variables small. By choosing variables that characterize much of the difference between CDs we hope to have variables that will perform effectively in a range of situations. Based on these considerations the evaluation concentrated on the following auxiliary variables: owner occupied, renting from government, housing type, aged 45-59 and aged 60+.
To assess how well these variables perform in removing aggregation effects, $\hat\Sigma_{ww}(z)$ was calculated. The resulting adjusted aggregation effects $Q_a(z) = \hat\Sigma_{aa}(z)/S^{(1)}_{aa}$ are given in column five of Table 5.1. The ratio $Q_a(z)/Q_a$ is given in the last column of Table 5.1 and indicates that these adjustment variables remove between 9 and 75 percent of the aggregation effect. For income the reduction is 32 percent and the average reduction across the variables is 52 percent. These values tell us the effect of the adjustment for each variable separately. A CGV analysis based on $(S^{(1)}_{ww})^{-1} \hat\Sigma_{ww}(z)$ gives an overall assessment of the amount of the aggregation effect of the dependent and explanatory variables that is removed by these adjustment variables. Because they are also used as adjustment variables, the explanatory variables aged 45-59 and aged 60+ were not included in this CGV analysis. The reduction in $\bar\theta$ was 51 percent. Examination of the coefficients resulting from the CGV analysis showed that the first CGV remaining after adjustment was mainly associated with income, degree and a managerial or professional occupation. The second CGV was mainly associated with being born in the UK. The first two CGVs accounted for most of the remaining aggregation effects. An analysis of the CGVs based on $(S^{(1)}_{xx})^{-1} \hat\Sigma_{xx}(z)$ gave similar results, with income disappearing from the first CGV. These results suggest that the adjustment variables considered account for about half of the aggregation effects. Comparing the results of the CGV analysis of $(y, x)$ before and after adjustment suggests that the auxiliary variables used have accounted for the first grouping dimension but not the second. Much of the remaining aggregation effect is associated with income and indicators of relatively high socio-economic status such as having a degree or a managerial or professional occupation. For these variables the reduction in the aggregation effects of ...

Discussion
In the example considered, using a limited number of auxiliary variables, it is possible to explain about half the aggregation effects in income and a number of explanatory variables. Using individual level data on these adjustment variables enables the aggregation effects due to these variables to be removed. However, the resulting adjusted regression coefficients are no less biased. This suggests that for this adjustment approach to work well it is necessary to find adjustment variables that account for a very large proportion of the aggregation effects. The CGV analysis shows that after allowing for the auxiliary variables considered there were residual grouping effects that were associated with indicators of higher socio-economic status. We could attempt to find further auxiliary variables that account for these grouping effects and for which it would be reasonable to expect the required individual level data to be available. However, there are always likely to be some residual group level effects and so we need methods that can satisfactorily account for them.
The problems affecting ecological analysis are due to the variation of relationships between groups, which may be related to the explanatory variables, and to the homogeneity of variables within groups. To unravel ecological analysis we first need realistic models at the individual level that reflect these features. Two main avenues for doing this are to include other variables that partly explain the between-group variation and within-group homogeneity, and to use structures for the random components that include group level effects. Methods that use only one of these avenues are unlikely to be successful. Our results for linear regression suggest that including a small number of auxiliary variables can explain a lot of the within-group homogeneity but that a significant amount will always remain. Hence, methods to account for the remaining homogeneity due to group level effects need to be developed.

Table 5.1. Summary of aggregation effects on variances.

Table 5.2. Summary of aggregation effects on covariances between income and other variables.

Table 5.3. Summary of aggregation effects on correlations between income and other variables.

Table 5.4. Summary of aggregation effects on regression between income and other variables.

Table 5.5. Comparison of individual level, CD level and adjusted CD level regression equations.