Application of Multilevel Models to Morphometric Data. Part 1. Linear Models and Hypothesis Testing

Morphometric data usually have a hierarchical structure (i.e., cells are nested within patients), which should be taken into consideration in the analysis. In the recent years, special methods of handling hierarchical data, called multilevel models (MM), as well as corresponding software have received considerable development. However, there has been no application of these methods to morphometric data yet. In this paper we report our first experience of analyzing karyometric data by means of MLwiN – a dedicated program for multilevel modeling. Our data were obtained from 34 follicular adenomas and 44 follicular carcinomas of the thyroid. We show examples of fitting and interpreting MM of different complexity, and draw a number of interesting conclusions about the differences in nuclear morphology between follicular thyroid adenomas and carcinomas. We also demonstrate substantial advantages of multilevel models over conventional, single‐level statistics, which have been adopted previously to analyze karyometric data. In addition, some theoretical issues related to MM as well as major statistical software for MM are briefly reviewed.


Introduction
Morphometric data usually derive from a so-called multistage sampling, or sub-sampling [31]. In the simplest example, one takes a random sample of patients (tumors) and then randomly sub-samples cells (nuclei) within each patient. 1 A hierarchical, multilevel data structure ensues, with cells (level-1 units) being "nested" within patients (level-2 units). The importance of this feature for the morphometric research was recognized already in seventies [28] and further dis-cussed in early eighties [13][14][15]. This discussion was however restricted to a simple decomposition of the total variance of one single morphometric feature into different components attributed to each level of hierarchy. A possibility was also shown to incorporate some factors into the analysis by using nested analysis of variance (ANOVA), for example, to explore differences between nuclear features in benign and malignant lesions [2]. However, due to many limitations of classical nested ANOVA and lack of other methods of analysis, two other approaches to data analysis have commonly been used in the morphometry, as described in our previous paper [33]. The first approach ("pooling" method) treats cells as independent units of analysis, that is to say cells from different patients are pooled, e.g., into benign and malignant "cell populations". In the second approach (summary statistics method), morphometric data are summarized within each patient by calculating mean values, standard deviations, etc., which the subsequent statistical analy-sis is performed on. The aim of both approaches is to eliminate the hierarchy of the data and make use of traditional, "single level" statistical methods. However, both approaches, and especially the "pooling" method, have certain disadvantages and cannot be universally recommended [33].
In recent years, considerable progress in managing multilevel data has been achieved. In particular, an entire class of regression models, most widely known as multilevel models (MM), has been specially developed to handle hierarchical data structures [4,12,18,25,29,32]. A number of user-friendly and well-documented statistical programs for MM are now also available, and overwhelmingly increasing number of examples of using MM can be found, mainly in epidemiological and socio-medical research [3,[7][8][9]21]. However, MM have not yet been used in the morphometry.
In this paper we report our first experience with MM in the analysis of morphometric data. The present study is a logical continuation of our previous work [33]. In Part 1, we focus on linear models, i.e., models where morphometric nuclear features serve principally as dependent variables. A comparison with traditional, single-level statistics is made in order to demonstrate the key advantages of MM. In addition, some theoretical issues related to MM as well as major statistical software for MM are briefly reviewed.

Materials and methods
Seventy-eight follicular neoplasms of the thyroid (34 adenomas and 44 carcinomas, archival material) were analyzed. The diagnoses were based on WHO criteria [17]. All specimens were fixed in buffered 10% formalin and conventionally embedded in paraplast. For karyometry, monolayer preparations of cell nuclei were made according to the method of van Driel Kulker et al. [34]. Briefly, 50 µm thick sections were cut in each neoplasm from a block with representative tumor areas, deparaffinized in xylene, rehydrated and incubated in 0.05% pronase solution for 30 minutes at 37 • C with intermittent vortex mixing. After filtration through 50-µm nylon gauge, centrifugation and re-suspension of the sediment in 2% carbowax solution (MW 1500, Merck), monolayer preparations were obtained using a Shandon cytocentrifuge at 2000 rpm for 15 min. The slides were air-dried and stored at 4 • C. Nuclei were stained after Feulgen in modification thionine-SO 2 , using CAS (Becton Dickinson) staining kits and protocols. Prior to staining, carbowax was Features marked by asterisk * were computed based on the gray values, not on the optical density values, because they much better approximate to the normal distribution. SAD is equivalent to nuclear surface area density as defined in [20] divided by the number of pixels within the nucleus. dissolved in distilled water for 30 min. Measurements were performed by means of a semiautomatic system for image analysis, composed of Eclipse E600 microscope (Nikon, Japan) with 100/1.4 Plan Apochromat oil immersion objective used for measurements, 3CCD video camera DXC-930P (SONY, Japan), IntriguePro image frame grabber (Integral Tech., USA), and personal computer (PII, 266 MHz, 64 MB RAM). The system was controlled by Optimas 6.5 image analysis software, with customized macros written in ALI (Analytical Language of Images) to automate karyometric measurements. A total of 8 features (see Table 1) were identified for 147 to 258 (200 on average) nuclei in each slide. Nuclear contours were determined automatically using a customized combination of two autothresholding algorithms ("minimize variance" and "search for minimum") implemented in Optimas [1]. The system was calibrated with a micrometrical scale in order to obtain NA values in µm 2 . Densitometric and texture features were measured in arbitrary units based on gray levels, and performed according to the recent consensus report [16]. Calibration of illumination as well as background and glare correction was used to avoid any artificial changes in object brightness. For the glare correction we used our own, improved algorithm of pixel-based pre-correction (not published).

Some theoretical considerations and multilevel software
Discussion of the multilevel theory is beyond the scope of this paper. For comprehensive review of the topic we refer the reader to a number of books [12,18,29] and journal articles [4,25,32]. Here, we just address some main issues. We will consider 2-level models only.
First of all, it has to be stressed that we are speaking about modeling morphometric data. It means that we want to explain the level and variation of a certain nuclear feature (outcome) in dependence of some factors (effects, predictors, covariates). The background behind such modeling is to get a biological understanding of certain lesions. No classificatory models, i.e. discriminating between benign and malignant conditions are considered here. This is currently the topic of our ongoing research, and the results will be published in the near future.
Two kinds of effects exist in statistical modeling: fixed and random. An effect is fixed, if it affects the population mean, and random, if it affects the variance of the dependent variable. Fixed effects are represented in the model by coefficients (intercept and slopes), and random effects by corresponding variance components attributed to these effects. Note that a variable can represent both fixed and random effects simultaneously. For example, a very common hypothesis is that malignant tumors have, on average, larger nuclei in comparison with benign. Tumor dignity is considered a fixed effect here. However, one can also hypothesize that malignant tumors demonstrate more pronounced anisokaryosis than do benign tumors. That is, tumor dignity might also affect the variance of the outcome variable, which is a random effect. Corresponding specification of the model is necessary to account for both effects.
There are also variables that represent purely random effects. In particular, subject effects associated with the sub-sampling in the morphometry contribute only to the variance of nuclear features and thus are random [30,37]. That means that prior to adding any covariates in the model, the hierarchy of the data must be accounted for by obligatory inclusion of a random effect representing level-2 (e.g., patient's ID). As a result, the between-patient variance is separated form the remaining, within-patient variance. This is essentially the way that MM work: the variance of the dependent variable is split into components corresponding to the levels of hierarchy. Thus, even the "empty" two-level model, i.e. a model, with just a constant term and "no" additional covariates, actually does include a covariate identifying level-2 units. Having this basic structure, one can add other covariates to account for some effects that are either fixed or random at different levels of hierarchy. Note that assumptions, residuals, and interpretation of the results are all level-related in multilevel modeling, which may offer additional challenges. Some further issues (variable interactions, correlated errors from different levels) can cause MM to be extremely complex. While presenting our results, we will step through the major types of MM in the ascending complexity, in order to show which information can be gained from them and how the model parameters should be interpreted.
There are different methods of estimation model parameters, which can roughly be subdivided into simulation and non-simulation techniques.

Non-simulation techniques:
• IGLS (iterated generalized least squares), • RIGLS (restricted iterative generalized least squares), Simulation techniques: For normal responses, IGLS is equivalent to ML and RIGLS is equivalent to REML. RIGLS and REML produce slightly better variance estimates than IGLS and ML [12,25,29]. Simulation methods are more precise, especially as far as random parameters are concerned [12,25], but they are also much slower and computationally intensive. In our study, we used RIGLS estimation.
Like all linear models, linear MM have assumptions of normality, linearity and homoscedasticity [4,12,18,25,29,32]. All these assumptions apply to every level of hierarchy and consequently must be checked for each level separately. In addition, there are two assumptions specific to multilevel designs: independence of level-1 and level-2 residuals and no autocorrelation at level-1 [25,32]. Normality, homoscedasticity, independence and autocorrelation of residuals can be inspected after a model is fitted using residual plots. MM allow an easy estimation and plotting of residuals for each level. Linearity assumption can be verified by including quadratic, cubic, etc. terms of corresponding covariates in the model. There are both general statistical packages with some multilevel procedures included and dedicated software for multilevel modeling. In Table 2, we listed all such programs that we are currently aware of. They differ considerably in the number of modeling possibilities, estimation methods and user-friendliness. Some of these packages have been reviewed in the literature in rather detail [6,30,36,37,39]. There is, however, no single program where all the estimation methods listed above would be implemented. To develop MM in our present work, we used MLwiN 1.10 [27], which presently seems to be the most advanced and, at the same time, very user-friendly program [6,36,39].

Modeling hierarchy in "empty" model
An "empty" two-level model includes a constant term (as a fixed effect) and two variance components corresponding to level-1 (within-tumor) and level-2 Table 3 "Empty" model for NA and log NA Notes: −2LL -negative doubled log likelihood. SE -standard error. Significance levels lower than 1.0E-99 are presented as 0. (between-tumor). In MLwiN, the model is specified by allowing the constant term to vary at both levels [27]. Note that level-2 variance represents the random effect associated with nesting (subject effect). The "empty" model is essentially equivalent to the variance component model available in all major statistical packages like SPSS or SAS [30,37]. It can be useful in three different ways (see Tables 3 and 4 for the results). 1. The most interesting feature for us is the possibility to explore the hierarchy itself -i.e., to judge how strong the clustering effects within the level-2 units are and whether we really do need to account for this. Table 3 shows a complete example of an "empty" model for the variable NA. As can be seen, the level-2 vari-ance in this model is highly significant, which means the hierarchical structure of the data cannot be ignored. The between-patient variance was highly significant also for all other karyometric features measured ( Table 4). This agrees with our results reported previously [33] and stresses once more the great importance of objects' hierarchy for analysis of morphometric data.
We can further get an idea about the magnitude of the clustering effects in our data -i.e., the degree of similarity of cells within each tumor. For this, we compute the intra-class correlation coefficient (ICC) using the known formula σ 2 p /(σ 2 p + σ 2 c(p) ) [10,12,29], where σ 2 p is the between-patient, and σ 2 c(p) the within-patient variance. As demonstrated in Table 4, the obtained ICC values were mostly between 0.5 and 0.6, which is rather high -higher than our previous results for oxyphilic thyroid tumors [33].
2. Constant term in the "empty" model is interpreted as the mean NA value in the entire population of follicular thyroid tumors. Table 3 shows that mean NA computed in MM as well as in both single-level approaches was the same. However, the SE of mean in the "pooling" method was greatly (about 10-fold) underestimated, whereas in the summary statistics method it was correct. This fully agrees with our previous data [33].
3. The assumption of normality can be checked already in the "empty" model by examining plots of standardized residuals vs. normal scores. Figure 1A,C shows an example of NA, for which normality assumption was violated at both levels, especially at level-1. After log transform, only minor deviance from the normal distribution was seen (Fig. 1B,D). Note, however, that interpretation of the model coefficients after log transform is different. Constant term of 3.78 (see Table 3) means that the mean NA in the population is e 3.78 = 43.8 µm 2 . This is lower than the mean estimated on the raw data, because the distribution of Table 5 Model for NA and log NA with DIAGNOSIS as fixed coefficient covariate Notes: −2LL -negative doubled log likelihood. SE -standard error. Significance levels lower than 1.0E-99 are presented as 0.
NA is left-skewed. Note also that the relation between level-2 and level-1 variance has changed, and ICC increased from 0.47 on the raw data (184.7/(184.7 + 206.8) = 0.47) to 0.54 on the log scale (see Table 4).
Violation of normality assumption was also detected for IOD, so the log transform was required here, too. All other nuclear features showed nearly normal distribution.

Exploring simple research hypothesis
Now we can extend the "empty model" to explore some research hypotheses. In our particular example, let us formulate the question of interest as that about magnitude and significance of the differences in nuclear features between follicular adenomas and carcinomas. Thus, we add a new term, DIAGNOSIS (coded as 0 for adenomas and 1 for carcinomas) as a fixed effect to the "empty" model created at the previous step. The full example of the model for NA is shown in Table 5, and the results are interpreted as follows.
2. Constant term shows the value of NA at 0 value of DIAGNOSIS. Since in our data set adenomas were coded as 0 and carcinomas as 1, the value of 41.1 corresponds now to the mean NA for adenomas only. For log transformed data, the estimated average NA is e 3.66 = 38.86 µm 2 .
3. The coefficient for DIAGNOSIS indicates that nuclei in carcinomas are, on average, 10.9 µm 2 larger than in adenomas, and thus, the mean NA in carcinomas is 41.1 µm 2 + 10.9 µm 2 = 52.0 µm 2 . As for log transformed data, the coefficient for DIAG-NOSIS means that NA in carcinomas is, on average, e 0.22 = 1.246 times as high as that in adenomas and is, thus, 48.42 µm 2 . Note that, again, both single-level approaches yield correct mean values, but the standard error in the "pooling" method is greatly underestimated, leading to an extreme inflation of the significance value for DIAGNOSIS. 4. Level-2 variance remained highly significant, but, in comparison with the "empty" model, considerably decreased. By contrast, level-1 variance remained exactly the same. This is due to the fact that DIAGNOSIS is uniform within each level-2 unit and is, therefore, a level-2 covariate. In some other situations, e.g., while studying gynecologic smears with a mixture of benign and malignant cells on each slide, DIAGNOSIS can be level-1 covariate, so both level-1 and level-2 variances would change. Table 6 Model for log NA with DIAGNOSIS as random coefficient covariate Notes: −2LL -negative doubled log likelihood. SE -standard error. Parameters marked by asterisk * were pre-constrained to 0, because they were redundant [27]. Significance levels lower than 1.0E-99 are presented as 0.

Complex variance structure
Until now, the coefficient for DIAGNOSIS was modeled as fixed. This implies that both between-and within-tumor variance is the same for adenomas and carcinomas. It is, however, logical to hypothesize that due to usually higher cellular atypia in malignant tumors (a) variation of nuclear size (and also other nuclear features) might be higher within carcinomas than within adenomas, and (b) carcinomas might differ more from each other in their (average, predominant) nuclear size (and other nuclear characteristics) than do adenomas.
To check it, we allow the coefficient for DIAGNO-SIS to vary at both levels. In other words, we consider now DIAGNOSIS to represent not only a fixed effect, but also two random effects: one corresponding to the question (a) above and affecting level-1 variance, and another one corresponding to the question (b) and thus affecting level-2 variance. Using terminology common to multilevel modeling, the coefficient for DI-AGNOSIS is specified as random 2 at both levels [12,2 The term "random coefficient" is often used in text books on multilevel modeling [12,29] and designates, in fact, a variable representing an admixture of a fixed effect and at least one random effect at any of the levels. 29,27]. Due to additional random effects, a complex variance structure at each level results (see below). Table 6 shows model parameters for log NA. The interpretation is as follows: 1. −2LL decreased drastically (by 175, at 2 additional degrees of freedom), which means a great improvement of the model fit.
2. The fixed model coefficients (intercept and slope) remained practically unchanged, and their meaning is the same as in the previous step.
3. At each of the levels, we have 3 parameters now instead of 1: intercept variance, slope variance (variance due to the random effects of DIAGNOSIS) and covariance of intercept and slope. This allows us to compute the variance of NA separately for adenomas and carcinomas. Note that for this purpose, any two out of the three parameters are sufficient. This is why the variance of the slope (DIAGNOSIS) was preconstrained to 0 [27]. The variance for adenomas has already been computed: it is the variance of the intercept. The variance for carcinomas is the variance for intercept plus twice the covariance between intercept and slope and is, thus, 0.068 for level-2 and 0.072 for level-1. Note, however, that the level-2 covariance is non-significant; the level-1 covariance is, by contrast, highly significant. We can thus state that the withintumor variation of NA (i.e., nuclear polymorphism) is significantly higher in carcinomas than in adenomas.
For the between-tumor variance of NA, this tendency is also observed, but is not significant.
The summary of the models with complex variance for all nuclear features measured is presented in Table 4. Differences in mean values between adenomas and carcinomas were not significant for SDGV and IOD and highly significant for all other nuclear features. Complex variance at level-2 was not significant for all features, except KurtGV. At level-1, however, the situation was completely different. As can be seen, carcinomas showed much higher within-tumor variation of nuclear size and form factor, DNA amount and chromatin coarseness, and lower variation in nuclear staining intensity and its derivates.
Having separate variances for adenomas and carcinomas, we can further compute separate ICC for each tumor group. We can even construct confidence intervals for these ICC values, and thus compare them. In our study, however, ICC were not significantly different between carcinomas and adenomas, mostly due non-significant covariances at level-2, with an exception being, again, KurtGV (data not shown).

Exploring complex research hypothesis
In a similar way to the previous steps we can incorporate in the model some further covariates, which may be relevant to our research question. Again, their coefficients can be modeled either as fixed or random at each of the levels, depending on the nature of the features and the interpretability of the model parameters.
As an example, we present in Table 7 a model for NA as dependent variable, and DIAGNOSIS and IOD as covariates with random coefficients at both levels. The reason for constructing such a model is the hypothesis that changes of NA might be secondary to the changes in nuclear DNA amount. Thus, we want to explore the difference of NA between carcinomas and adenomas while controlling for IOD. Both NA and IOD were log transformed to meet the assumption of normality. Furthermore, log IOD was then centered (by subtraction) around the value of 2.48, which corresponded to the nearly-diploid peaks on the DNA histograms. This greatly improved both convergence of the model and interpretability of the results.
1. The dramatic change of −2LL (by 8630 at 7 degrees of freedom) indicates a crucial improvement of model fit and confirms our surmise that NA is closely related to the nuclear DNA amount.
2. Constant term corresponds to the NA at 0 value of DIAGNOSIS (i.e., adenomas) and 0 value of log IOD (i.e., nearly-diploid DNA amount, due to the centering) and is equal to e 3.60 = 36.6 µm 2 .
3. The coefficient for DIAGNOSIS shows that, for the same DNA amount, malignant nuclei are e 0.26 = 1.3 times as large as benign nuclei. Thus, nearlyeuploid nuclei in carcinomas are, on average, 36.6 µm 2 × 1.3 = 47.6 µm 2 in size. Note that, in Table 7 Model for log NA with DIAGNOSIS and IOD as random coefficient covariates Notes: −2LL -negative doubled log likelihood. SE -standard error. Parameters marked by asterisk * were pre-constrained to 0, because they were redundant [29]. Significance levels lower than 1.0E10-99 are presented as 0. comparison with the previous model (Table 6), the coefficient for DIAGNOSIS increased, whereas its standard error remarkably decreased. This means that differences in nuclear size between adenomas and carcinomas became even more apparent if we control for nuclear DNA amount.
4. The coefficient for IOD indicates that NA is directly related to the nuclear DNA amount. The association function is of the power type; however, the coefficient is very close to 1, so the relationship is nearly linear, at least within the observed range of IOD values. For example, within the same tumor type (i.e., either adenomas or carcinomas), triploid nuclei are e 0.95×ln(1.5) = 1.5 0.95 = 1.47, and tetraploid nuclei e 0.95×ln(2) = 2 0.95 = 1.93 times as large as diploid nuclei.
5. The random parameters of the model allow us to evaluate the between-and within-tumor variance separately for adenomas and carcinomas in dependence on log IOD. This variance function is comfortably computed and plotted in MLwiN (see Fig. 2). As can be seen, the lowest NA variance corresponds to nearly diploid DNA amount, and the more is the deviance from the euploidy, the higher is NA variance. This is true for both levels, but is more pronounced at the level-2. Another point to note is that at high ploidy levels (beginning approximately from triploid), carcinomas show significantly higher nuclear pleomorphism (within-tumor variation of NA) than do adeno-mas (in Fig. 2B, the confidence limits do not overlap). Between-tumor variance seems to be equal for both tumor types, irrespective of nuclear DNA amount.

Discussion
In the present study, we could demonstrate some substantial advantages of MM over usual, single-level statistics in application to morphometric data. On the basis of the models fitted, we could draw a number of interesting conclusions: (a) Data hierarchy in the morphometry is a very important factor, with ICC values ranging mainly between 0.5-0.6 (see Table 4), which is rather high -higher than in our previous study [33] and in other research fields with published ICC data [10]. (b) There are significant differences between follicular adenomas and carcinomas in nuclear size and form factor, staining intensity, and certain chromatin properties. In particular, malignant nuclei, in comparison with benign, are larger, more irregularly-shaped, stain paler, and have coarser chromatin. The magnitude of these differences is described by corresponding coefficients (see Table 4), and the example of their interpretation was given above.
(c) There seems to be no difference in nuclear DNA amount between adenomas and carcinomas. This finding is consistent with most other reports [5,23,26,38]. Furthermore, SDGV (which describes the variation of staining intensity across different points in a nucleus) appears also to be the same for both tumor types. (d) Within-tumor variation of nuclear features is different for adenomas and carcinomas (see Table 4). Malignant tumors show higher nuclear pleomorphism, higher variation of nuclear DNA amount and higher variation of chromatin coarseness. By contrast, variation of nuclear staining intensity and its derivates is lower. 3 The latter finding is especially interesting with regard to the lower average staining intensity typical for malignant cells. It seems that malignant transformation affects most cells within the tumor in a uniform way in that they loose their tincture properties. (e) Between-tumor variation of nuclear features, excepting KurtGV, appears to be the same for adenomas and carcinomas. (f) NA is directly related to nuclear DNA content.
When controlled for tumor dignity, this relationship is nearly linear -e.g., tetraploid tumors are almost 2 times as large as diploid. Note, however, that this holds for monolayer preparations used in our study, where nuclei are very much flattened. In paraffin sections, this relationship may be of other magnitude. Many of these conclusions would be impossible using single-level statistics, or we would end up with wrong conclusions at all. The "pooling" method leads to a large bias due to reasons considered in our previous work [33] and is, therefore, unacceptable. Sum-mary statistics method is precise and effective enough for simple hypotheses, as can be seen on many examples in the literature [22,24,35]. However, it is ineffective for complex research questions [4]. In particular, estimation and comparison of within-and betweentumor variances is still possible (although very uncomfortable), if there is, at most, one categorical covariate (e.g., DIAGNOSIS). If we want to control for other, and especially continuous, covariates and explore the complex variance-covariance structure, like in our last model, MM are necessary.
The advantage of MM becomes even more apparent if there are more than 2 levels of nesting. For example, in a multicenter study, the centers (institutions) would represent the third level of the hierarchy. Or one can account for measurement error by taking repeated measurements from the same nuclei and treating them as the lowest-level units. It is clear that the "pooling" approach becomes even more imprecise, and summary statistics approach even more ineffective in this setting.
There are some extensions of traditional statistical methods, e.g., nested ANOVA and mixed-effect general linear models (GLM) capable of handling random effects. It is thus possible to manage certain kinds of multilevel data using GLM procedure available, for instance, in SPSS or SAS [30,37]. However, GLM produce estimates based on sums of squares and therefore achieve their optimal performance only on completely balanced designs. On unbalanced data, a bias occurs, requiring a careful choice of weighting [25,30]. GLM also treat all (even random!) effects as fixed and construct F statistics based on the ratio of the appropriate sums of squares. On the contrary, variance components, especially those attributed to random effects, are not estimated directly, and no standard errors are produced for them [30,37]. Consequently, it is in fact impossible to correctly test the significance of random effects using GLM. In contrast, MM use maximum likelihood estimation, which is asymptotically efficient even on unbalanced data [30]. Variance parameters are estimated directly in MM, together with corresponding standard errors [25,30]. Variance can also be modeled as a function of other covariates, like in our last model. This is impossible in GLM. In addition, MM generate −2 Log Likelihood and other indices like AIC (Akaike's Information Criterion) and BIC (Schwarz's Bayesian Information Criterion), which describe the model fit and can be used to search for "the best" model [12,29,30,37]. Finally, MM provide a natural way to handle multivariate outcomes [25], and we are going to demonstrate one of possible uses of these models in Part 2.
On the other hand, MM are much more complicated than conventional, single-level models. MM, due to their nature, split all variance components into different levels. Correspondingly, one has to check assumptions and interpret variance parameters separately for each of the levels. The general interpretation rule (on example of karyometric data) is that level-1 variance/covariance refers to cells, or "within-tumor" variation, and level-2 variance/covariance to tumors, or "between-tumor" variation of the dependent variable. If two or more covariates are present in the model, the meaning of the parameters for each variable becomes conditional on the presence of other variables in the model, which increases even more interpretation challenges. In addition, all the problems common to biostatistical modeling (e.g., sufficient number of study objects and adequate ratio "observations-to-variables", multiple comparisons problem, selection of the "best" model, checking assumptions, etc.) are no less and often even more acute in multilevel designs.
A particular issue to be addressed is the robustness of multilevel estimates against small samples and violation of assumptions, especially the one of normality. When using MM, one should consider sample size at both levels, i.e. number of level-2 units and number of level-1 units per level-2 unit. In the morphometry, a sufficient number of level-1 units per level-2 unit (e.g., several hundreds of cells per tumor) is usually easy to achieve. It is also desirable to have more than 100 level-2 units in a sample, especially accounting for the great magnitude and importance of the between-subject variation [13][14][15]28,33]. Yet, quite often the number of level-2 units is rather small, for example, due to the rarity of a particular tumor. As a result, it is difficult or even impossible to rule out non-normality using plots of level-2 residuals. It is, however, reasonable to assume a normal distribution of level-2 variables due to the central limit theorem [11]. In addition, the most recent study by Maas and Hox [19] based both on a comprehensive literature review and own simulations shows that multilevel estimates are generally robust to moderate non-normality and small sample sizes. Samples with about 50 level-2 units produce unbiased estimates and standard errors, except for the standard errors of variance components at level-2 [19]. To eliminate this bias, bootstrap or simulation techniques can be effectively applied [12,25,27]. MLwiN offers very comfortable facilities for parametric and non-parametric bootstrap, Gibbs sampling and MCMC.
There are also some purely technical, computational problems related to MM and/or to the corresponding computer program. Quite frequently, we experienced numerical errors and even program crashes while running MLwiN, even if the model was specified correctly. This was particularly typical for models with complex variance structure. Newer versions of the software might remedy the problem. According to our experience, if severe numerical errors occur or coefficients do not converge, switching between different estimation options (IGLS/RIGLS, negative variances at each level allowed/not allowed) may solve the problem. In all situations, centering the data around the grand mean as well as transforming them in order to improve normality are helpful. However, the interpretation of the models fitted to transformed and/or centered data is different and somewhat more complicated, as demonstrated in our study.
In conclusion, MM represent a very powerful and flexible way for modeling and analyzing morphometric data. They are superior over other statistical approaches that have been used in the morphometry until now. However, MM are more complex than common single-level statistics, and require very careful and thorough specification and interpretation. We hope that our experience described in this paper will be useful for those morphometrists who are going to apply MM in their research. For systematic reading on MM, a number of sources can be recommended [4,12,18,25,29,32].