Primary Unit for Statistical Analysis in Morphometry: Patient or Cell?

In a series of 16 oxyphilic follicular neoplasms of the thyroid (8 adenomas and 8 carcinomas), three different approaches for the analysis of morphometric data were evaluated. It was shown that the statistical design of morphometric studies is by nature nested due to subsampling of cells within each patient. Therefore, the most appropriate analysis would be to account for this hierarchical structure. However, related statistical methods are not at present well established, especially as far as classification rules are concerned. Therefore, the nested design is converted into the simple factorial one by considering only one kind of statistical unit – either patients or cells. The results of the study presented indicate that ignoring the patient as unit of analysis leads to a substantial error in statistical output, regardless of the particular procedure applied. Moreover, the size of the error can be neither diminished nor controlled. Choosing patients as primary units assures accurate results and also has an advantage of gaining some additional information by calculating several distributional estimates in each patient. However, this approach often requires a reduction of dimensions and, furthermore, is not encouraged in certain fields of quantitative cytology. Advantages and disadvantages of all approaches have been summarized and practical recommendations for their use have been worked out.

Primary unit for statistical analysis in morphometry: patient or cell?
In a series of 16 oxyphilic follicular neoplasms of the thyroid (8 adenomas and 8 carcinomas), three different approaches for the analysis of morphometric data were evaluated. It was shown that the statistical design of morphometric studies is by nature nested due to subsampling of cells within each patient. Therefore, the most appropriate analysis would be to account for this hierarchical structure. However, related statistical methods are not at present well established, especially as far as classification rules are concerned. Therefore, the nested design is converted into the simple factorial one by considering only one kind of statistical unit -either patients or cells. The results of the study presented indicate that ignoring the patient as unit of analysis leads to a substantial error in statistical output, regardless of the particular procedure applied. Moreover, the size of the error can be neither diminished nor controlled. Choosing patients as primary units assures accurate results and also has an advantage of gaining some additional information by calculating several distributional estimates in each patient. However, this approach often requires a reduction of dimensions and, furthermore, is not encouraged in certain fields of quantitative cytology. Advantages and disadvantages of all approaches have been summarized and practical recommendations for their use have been worked out. Keywords: Computer assisted image analysis, statistical analysis, nested design, thyroid, oxyphilic neoplasms

Introduction
Recent advances in computer science have led to a substantial progress in morphometry [27,43]. However, numerous methodological problems still remain, statistical issues among them [27,43]. We came across many of these problems while performing our recent work on thyroid neoplasms [44]. In our hands, one of the most challenging problems deserving special attention was an appropriate choice of the primary unit for statistical analysis.
The problem starts with the fact that usually two different sampling units exist in morphometry: patients (tumours) and cells (nuclei). 1 With regard to the statistical evaluation, three different methods of analysis ensue, which all can be illustrated by corresponding examples from the literature. The first approach (approach I) takes into account both units, i.e., a socalled "nested" design is considered [6,19], where patients represent "higher-level" units and cells "lowerlevel" units ( Fig. 1A). Some special techniques, such as mixed-effect analysis of variance (ANOVA), have been adopted for this design [6,8,13,24].
The second approach (approach II), which is widely used in combination with discriminant analysis, is to treat separate cells as primary units [8,15,22,24,34,35,40], thereby ignoring patients as units of analysis. That is, nuclei from all tumours are pooled, for instance, into "malignant" and "benign" groups ( Fig. 1B), and then, a multivariate "cell classifier" is developed. As a result, each individual nucleus is assigned by the program to a particular diagnostic group. However, to achieve practically meaningful results a subsequent classification of tumours has to be made [12]. This is usually based either on the mere prevalence (50%) of cells assigned to a given diagnostic group [22] or on the use of more flexible cut-off values [12,34,35]. Thus, in approach II a full run of multivariate analysis is carried out exclusively on separate cells and only thereafter a tumour is regarded as a whole.
In the third approach (approach III) patients are considered as primary units (Fig. 1C). Commonly, average values and standard deviations of nuclear features are determined in every tumour [3,21,25,37,38,41,44,45]. A number of other distributional indices (such as skewness, kurtosis, Gaussian and Fourier components) can be calculated as well [5]. Values obtained comprise the "final" data set, on which all further statistical analysis, including the development of classification models, is carried out.
The correct choice among these approaches is difficult. To our surprise, we failed to find any discussion of this problem in the literature while doing the previous work on non-oxyphilic thyroid neoplasms [44]. We therefore decided to carry out a methodological investigation on an independent exemplary data set derived from 16 oxyphilic thyroid tumours. This is also the reason why only papers devoted to surgical pathology of the thyroid are cited above for presentation of the approaches. Nevertheless, the same situation can readily be observed in any other field of diagnostic morphometry, and the problem is thus of general charac-ter. Hence, the major aim of our study was to compare the described ways of analysis and to work out a "safe choice" strategy, which can help to avoid some erroneous conclusions in future morphometric investigations.

Case selection and measurements
Sixteen oxyphilic follicular neoplasms of the thyroid (8 adenomas and 8 carcinomas, archival material) were analyzed. The diagnoses were based on WHO criteria [29]. All specimens were fixed in buffered 10% formalin and embedded in paraplast. In every case, new 5-µm thick Feulgen stained sections were made from a block with representative tumour areas. Measurements were performed by means of a semi-automatic system for image analysis (CIRES, KONTRON, Germany), as described previously [44]. In each tumour, 200 to 350 randomly selected nuclei were measured using the following seven parameters [14]: nucleus area (NA); mean optical density within a nucleus (MOD); standard deviation of optical density within a nucleus (SDOD); skewness in the frequency distribution of op-tical density within a nucleus (SkewOD); excess in the frequency distribution of optical density within a nucleus (ExcOD); minimal optical density within a nucleus (MinOD); maximal optical density within a nucleus (MaxOD).

Statistical analysis
Statistical analysis was performed using SPSS for Windows, version 6.1.3a (SPSS Inc., USA). To simplify the calculations, we created a completely balanced design, i.e., we left for the analysis only 200 nuclei per patient by a random elimination of all surplus cells. Two methods, analysis of variance (ANOVA) and linear discriminant analysis (DA), were used to evaluate possible differences between the tumour groups. The underlying assumptions (normality and equality of covariance matrices) were always checked graphically, both before analysis (on frequency histograms) and after (on residual plots) [1,2,18,39]. The three approaches, which we will continue to call "approach I", "approach II" and "approach III", were explored independently from each other as follows.
Approach I. To account for the nested structure of the data, a two-way ANOVA with mixed effects was applied. In this model (model 1), patients represented the first, random effect (i.e., a factor with infinite number of levels), which was nested within the second, fixed factor, i.e., diagnosis [6,13] (Fig. 1A). As for DA, no well-established adaptations of this technique to nested design seem to exist, so this kind of analysis was not attempted.
Approach II. After pooling cells within each diagnostic group, there were k × n = 1600 observations in the "benign" and 1600 in the "malignant" group ( Fig. 1B). For hypothesis testing, a one-way ANOVA (model 2) was performed. In DA, stepwise selection of variables, based on minimization of Wilks' lambda (F for entry 3.84, F for removal 2.71), was adopted [18,39]. The fit of the final discriminant function was estimated on the same data set using the "jack-knife" (leave-one-out) method. That is, nuclei belonging to a particular patient were excluded from the analysis and then classified by the discriminant function computed from remaining observations, with repetition of the procedure for each patient in turn [18,39]. The subsequent tumour classification as well as probabilities of the predicted diagnoses were based on the prevalent percentage of cells assigned by the model to a given category.
Approach III. In this approach (Fig. 1C), values of all parameters were averaged separately for each patient. For comparison with the first two approaches, we subjected the mean values to one-way ANOVA tests (model 3). To extract more information from the data, several other indices (standard deviation, skew, kurtosis) characterizing the shape of frequency distribution for every feature within a given tumour were also calculated [5]. Thus, the final data set contained four times as many variables (28) as the original one. However, we could not perform multivariate ANOVA on such a data set directly, because there were more variables than observations and the covariance matrices were therefore redundant. For DA, the ratio observations vs. variables (ROV) was even more critical. Indeed, for discriminant function to be accurate enough, the ROV value must be at least 2 (optimally 5-10 or more) [4,18]. It must be stressed that the calculation of ROV involves not the general number of observations, but only the number in the smallest group [4,18]. Hence, the initial ROV in our study (8 : 28) was unacceptable. To overcome this problem, we performed a reduction of dimensions not involving the outcome variable [28,36]. At first, a hierarchical clustering of variables using Ward's method was performed. As a dissimilarity measure, the squared Euclidean distance was specified. To equalize effects of differently scaled variables, all values in the data set were standardized to a range of 0 to 1 [39]. As shown in Fig. 2, all variables fell into three clusters of about equal size. Each separate cluster was then subjected to factor analysis. The number of extracted principal components was determined on scree plots (Fig. 3). To save maximum information, the beginning of the "scree" was always chosen as the last downright "knee" of the plot, at eigenvalues lower than 1 (arrows in Fig. 3); afterwards, all factors lying before this point were extracted. For the same reason, a non-orthogonal rotation of the factor matrix (Oblimin, delta = 0) was adopted [39]. As a result, 10 new variables, which accounted for an overwhelming part of the initial variance (94.7, 99.4 and 89.9% in the first, second and third clusters, respectively), were calculated. The multivariate ANOVA was already possible (model 4), but in DA, a further dimensionality reduction had to be performed, because the ROV value was still too small (8 : 10). This was achieved by a stepwise selection of variables (for settings, see approach I). With the two selected variables (fac-2.3 and fac-3.1), the ROV was quite reasonable (8 : 2). To validate the model, the "jack-knife" method (with subsequent exclusion of every patient) was used.

ANOVA
Results of ANOVA tests are given in Table 1. As can be seen, the first two models yielded quite different results. Indeed, model 2 demonstrated highly significant differences between "benign" and "malignant" cell populations in both multi-and univariate tests, but model 1 indicates that most of the differences can be explained by patient-to-patient variability (all tests for the patient level were highly significant), so that the overall contribution of the second factor -diagnosis -was non-significant. The controversy was especially striking in the multivariate tests. On the contrary, models 1 and 3 appeared almost identical with regard to the diagnosis level, since they produced practically equal F -and p-values. As for model 4, it incorporated a good deal of extra information brought in with the additional indices calculated. This informa-tion turned out to be important for separation of the tumour groups, because the results of multivariate tests in model 4 were statistically significant, in contrast to those in model three.

Discriminant analysis
The primary output of DA is presented in Table 2. Both discriminant functions computed were highly significant. However, the Wilks' lambda in approach II was noticeably large, suggesting a low practical efficiency of the corresponding discriminant function [4,18,39]. In reality, this flaw expressed itself through rather inconclusive probabilities for correct diagnoses when classifying individual tumours in the "jack-knife" procedure ( Fig. 4). Indeed, most probability values were between 0.6 and 0.7, which is fairly low. On the contrary, the discriminant function in approach III had a relatively small Wilks' lambda ( Table 2) and classified the patients rather unequivo- Fig. 3. Scree plots obtained in factor analysis procedure: A, in the first cluster; B, in the second cluster; C, in the third cluster. Arrows denote the beginning of the "scree". cally (Fig. 4). A comparative performance of the classification models in terms of misclassification rates and some related indices (i.e., sensitivity and specificity for malignancy detection) after the "jack-knife" validation are given in Table 3. Here, the advantage of the approach III is not as striking, but still evident.

Discussion
As mentioned in the introduction, there are two main sampling units in morphometry, patients and cells, which together build a hierarchically organized structure 2 (Fig. 1A). This structure is usually called "nested design" [6] and occurs as a result of the corresponding sampling procedure: first, one carries out a random sampling of patients in each diagnostic group, and then a subsampling of cells within each tumour. It should be noted that the subsampling of observations within the same objects is not rare in medicine. For instance, in periodontal research scientists also have to deal with hierarchically organized data, since many teeth (sites) per patient are usually measured. Unlike morphometry, the statistical issues in periodontology have been discussed in detail [10,16,17,20,32,33], and we shall widely use results alluded to in the discussion in our paper.
It is striking, that different ways of analyzing essentially the same data set produce quite different results. We shall therefore discuss the underlying causes for this discrepancy, considering each of the approaches used.

Approach I
Since this approach accounts exactly for the innately nested structure of karyometric data, it would by far be the best way of statistical analysis in morphometry. Unfortunately, the range of statistical methods, which have been adopted to the nested design, is to date extremely narrow. In fact, the only well-established procedure, which completely fits this purpose is an ANOVA with mixed effects, or the variance component model [6,13,19], but even this is often absent in popular statistical packages [19]. As for other procedures, including DA and similar techniques, they were extended to the nested design only recently. In periodontal research, for example, methods such as generalized estimating equations, weighted least-squares analysis or regression models for correlated observations have been suggested [16,33]. There is also an entire class of multilevel statistical models, developed precisely for analyzing hierarchical data [26]. Theoretically, it is possible to fit discriminatory multilevel models which allow classification of both tumours and cells [26]. In practice, however, none of the methods mentioned can be utilized by non-professionals, especially when classification rules are concerned. A further adaptation of these techniques for morphometric research is necessary.
It is probably this lack of easy-to-use methods that led to a wide implementation of two other approaches (approach II and approach III) in karyometry. In essence, both of them represent an attempt to simplify the nested design, i.e., to convert it into a factorial one, by removing either patients (approach II) or cells (approach III) from the analysis. It is, however, very important to know prior to such a conversion, how the precision and correctness of statistical analysis will be influenced. It can be stated that this increase in significance is artificial and occurs due to a violation of one of the basic assumptions required by usual statistical methods, namely the assumption of independence of analytical units. An explanation for this statement, borrowed mainly from similar studies in periodontology, follows below.

Approach II
As mentioned earlier, a nested design in morphometry occurs as a natural consequence of the corresponding sampling procedure. This procedure can be defined as a two-stage cluster sampling [42], where tumours represent naturally occurring clusters of cells. The natural origin of clusters here implies that their members tend to be rather alike. This similarity can be expressed quantitatively by an intraclass correlation coefficient (ICC) as follows [26,32,33]: where σ 2 p denotes the between-patient variance, and σ 2 c(p) the within-patient variance of cells; both these values are obtainable from the standard variance components table (see, for example, model 1 in Table 1, where hypothesis SS corresponds to σ 2 p and error SS to σ 2 c(p) ). If ICC is greater than zero, the objects (cells) represent dependent observations [10,16,17,20,26,32,33]. Under this condition, the traditional statistical formulas used in approach II underestimate the standard error of the mean [10,26,32]. The size of the underes- timation is often referred to as a deflation factor (F d ) and can be calculated as follows: where k is the number of repeated measurements taken on each patient [32]. As a consequence, F -values in ANOVA, for instance, become F d times as high as the true ones, which greatly reduces the corresponding p-levels [10,32]. We calculated ICC and F d for each karyometric feature in our study 3 (see Table 4). The obtained ICC values were comparable with those in periodontology [20], but the F d was always much higher due to the greater number of repeated measurements per patient [32].
The bias described has its consequences not only in ANOVA. In DA, for example, it leads to erroneous results in stepwise variable selection. As can be seen from Table 2, inclusion of each additional variable into analysis after MOD caused only a very slight decrease of the Wilk's lambda. In other words, the contribution of these variables to the discrimination between the tumour groups was extremely small, and yet they all entered the analysis due to the "highly significant", but, in fact, just inflated F -values. By analogy, the significance level of the discriminant function itself is also strongly biased.
A further question arising in this situation is to which extent our conclusions can be generalized, i.e., whether the error will always be so large on different tissue/tumour types. As can be seen from Eq. (2), F d approximates 1 if either ICC or k is close to zero. However, in a morphometric study k should be at least 100, otherwise cell samples are not representative of their tumours [23]. As for ICC, its magnitude presumably may vary across different tissue types. It can be suggested from the analysis of Eq. (1) that ICC will be relatively large, for example, in highly differentiated neoplasms (such as those in our study), due to a low intra-tumour variability of cells (σ 2 c(p) ). On the contrary, in poorly differentiated tumours with marked nuclear atypia, where σ 2 p is much larger than σ 2 c(p) , ICC will probably decrease. However, ICC can never be equal to zero, because some patient-to-patient variability exists in any case [42], thus the denominator in Eq. (1) is always more than 0. Moreover, it is well known in statistics that this type of variability (between primary sampling units) is usually the largest [30]. Indeed, with only a few exceptions [7], the contribution of the patient level was highly significant not only in our study, but also in many other karyometric investigations, regardless of the type of material used [8,9,24,31]. But even if the contribution of the patient level has been found to be non significant and the ICC value is very low (say, 0.05), the resulting error will still be considerable (F d of app. 6 at k = 100) due to the large number of cells required for measurement in each patient.
Thus, it can be concluded that hypothesis tests in approach II unavoidably produce falsely lowered pvalues, and although the size of the error may, probably, vary from tissue to tissue, it can hardly be neglected. The problem in approach II is, however, much worse than simply inaccurate hypothesis testing. The described bias will remain virtually the same in all common statistical methods even if they do not involve hypothesis testing at all (e.g., factor or principal component analysis), because they all require the analytical units to be statistically independent and, in addition, imply very similar computational routines [4,32,33,39]. We would like to emphasize that the primary cause of this problem is the disregard of the patient as an analytical unit, which ". . .creates a target population statistically, that literally does not exist in nature" [33]. Because of this, an accurate statistical inference becomes a priori impossible, and it does not matter whether hypothesis testing was used or not [26]. Furthermore, the situation becomes especially piquant when, together with karyometric features, some other variables such as gender, age or tumour size must be considered. Indeed, all these parameters refer not to cells but to patients, i.e., to the higher level units [26], and it is unclear how they should be treated then.

Approach III
In this approach, the nested design is converted into a simple factorial one by averaging karyometric features across each tumour. In other words, nuclear parameters derived from separate cells are regarded as repeated measurements taken from corresponding objects (tumours). Such a definition appears to be quite sensible and, as the results of our study show, this approach produces rather accurate results: both models 1 and 3 yielded almost identical F -and p-values 4 (see Table 1). Evidently, this method of analysis leads to a correct statistical inference and, furthermore, allows the easy incorporation of any additional tumour and/or patient characteristics. Although, in comparison with approach I, it is not possible here to test the significance of inter-tumour variation, an intra-tumour variability of karyometric features (i.e., nuclear pleomorphism) can be evaluated instead. This is done by computing, in addition to the mean, several auxiliary distributional estimates [5]. In our study, these newly calculated variables turned out to be important for distinguishing between the tumour groups (see Table 1, model 4 vs. model 3). Thus, approach III also has an advantage of gaining some additional information, which can be of biological importance.
There are, however, two substantial disadvantages of this approach, which limit its use to some extent. The first problem is the usually small number of patients in each diagnostic group. The matter in question is not even the patient number itself, but rather the ratio of observations vs. variables (ROV, see above), which becomes particularly small when all the aforementioned distributional indices have been calculated. As a result, the covariance matrices in MANOVA become redundant, and discriminant function(s) computed in DA extremely unreliable [4,13,18]. Of course, the best solution would be to encompass more patients in the study, but in practice this often cannot be done, for example, due to the rarity of a disease [2]. Consequently, one usually has to lower the ROV from the opposite side, i.e., to abridge the number of dimensions (variables) in the analysis. It should be stressed at this point that no methods of variable screening based on statistical significance (e.g., forward or backward variable selection) are suitable for this purpose, because they involve a multiple comparison problem and, moreover, may lead to a substantial loss of practically impor-tant information [18,28]. Other procedures, which do not utilize the outcome variable, must be used [28,36]. Marshal et al. [36] have found that the best method is a nearest-neighbor clustering of variables followed by principal component analysis with orthoblique rotation. We also used this technique (slightly modified) in our study and, indeed, achieved a three-times dimensionality reduction without any substantial loss of information. Only after that, a limited stepwise variable selection may be carried out [28]. For further discussion of methods of data reduction we refer the reader to a number of excellent articles [18,28,36].
The second problem is that with taking average values in every tumour, a unique information characterizing each cell as a separate vector of all karyometric features gets irretrievably lost. In this respect, a clear distinction should be made between two types of pathological material to be measured. The first type comprises virtually all histologic and some kinds of cytologic specimens (for example, fine needle thyroid aspirates), where all cells selected for measurement belong to one and the same class (e.g., either benign or malignant). In this case, approach III is appropriate, as also shown by the results of our study. The second type of material comprises a major part of cytological preparations (exfoliative cytology, pleural and abdominal effusions), where varying proportions of cells of different types (from normal through dysplastic to malignant) may be present. Typically, the cells of interest (usually malignant) are rather scant here, but, ideally, even a single such cell should be reliably detected [7,9,31,43]. Under this condition, taking average values can obscure biologically important events, so approach III is not encouraged in this situation.

Approach II vs. approach III
As discussed above, the ANOVA-model in approach II produced wrong significance levels; on the contrary, approach-III supplied accurate results. Apart from hypothesis testing, however, it is also interesting to compare these approaches with respect to the classification models developed. As our results show (see Fig. 4 and Table 3), approach III tends to produce more confident and reliable grouping of the tumours than approach II. It should be noted, however, that an exact evaluation of the model performance requires a larger sample size and stricter evaluation techniques, such as "hold-out" method or bootstrapping [18,28]. Furthermore, creation of a classification model is, in contrast to ANOVA, a complicated interactive process depending both on statistical settings and biological properties of the material. It is therefore difficult to predict whether changing tissue type will always influence both approaches to the same degree. Thus, our finding may not necessarily hold in all instances and requires further study.

Conclusions and final remarks
The advantages and disadvantages of the approaches discussed above are summarized in Table 5. We also included in this table our suggestions concerning the use of these approaches with the hope that they could be useful for future morphometric investigations. On the whole, approach I seems to be the most appropriate for morphometry, but the corresponding statistical techniques are at present not well developed. If choosing between approaches II and III, the latter should be preferred wherever possible. The use of approach II must be restricted only to those situations, where approach III is inappropriate and there is no other way except to violate the assumption of independence. However, the researcher should be aware of the drawbacks of such analysis and ". . . consider their possible effects on the results and their interpretation (rather) than. . . ignore them in the hope that they will not be noticed" [1].
Finally, the practical meaning of the revealed differences between oxyphilic thyroid adenomas and carcinomas shall briefly be discussed. Certainly, the reclassification results obtained in the approach III (see Table 3) are very encouraging and, moreover, they are consistent with the results obtained in our previous series [44]. However, we are not inclined to make any definitive conclusions on this basis, mainly because the patient number was far too small in the present study, even with regard to the dimensionality reduction adopted. Besides, it is highly advisable to use more objective procedures for validation of multivariate models, as mentioned above. Thus our investigation should be considered in this respect as a feasibility study only. Further work is necessary to verify the efficacy and stability of the model created.