The Reproducibility of Nuclear Morphometric Measurements in Invasive Breast Carcinoma

The intraobserver and interobserver reproducibility of computerized nuclear morphometry was determined in repeated measurements of 212 samples of invasive breast cancer. The influence of biological variation and the selection of the measurement area was also tested. Morphometrically determined mean nuclear profile area (Pearson’s r 0.89, grading efficiency (GE) 0.95) and standard deviation (SD) of nuclear profile area (Pearson’s r 0.84, GE 0.89) showed high reproducibility. In this respect, nuclear morphometry is on a par with other established methods of quantitative pathology and exceeds the results of subjective grading of nuclear atypia in invasive breast cancer. A training period of eight days was sufficient to produce clear improvement in the consistency of nuclear morphometry results. By estimating the sources of variation it could be shown that the variation associated with the measurement procedure itself is small. Instead, sample-associated variation is responsible for the majority of variation in the measurements (82.9% in mean nuclear profile area and 65.9% in SD of nuclear profile area). This study shows that, when standardized methods are applied, computerized morphometry is a reproducible and reliable method of assessing nuclear atypia in invasive breast cancer. For further improvement, special emphasis should be placed on the sampling rules for selecting the microscope fields and measurement areas.


Introduction
Intuitively, quantitation is connected with the idea of high accuracy, reproducibility and reliability of the results. However, quantitative methods also involve sources of variation which influence the results and the conclusions. Such sources of variation may make it difficult to adopt a method in clinical practice as decision support in differential diagnostics and in prognostically problematic cases.
Histological grading of breast cancer is widely used, but it could be made more useful with better standardisation of the method [7]. As part of our effort to produce a quantitative grading system for breast cancer, the purpose of this investigation is to examine the reproducibility of nuclear morphometry. Special emphasis is placed on intraobserver and interobserver variation of nuclear morphometry, on training in the morphometric measurement technique, and on the selection of the measurement area.

Material
The study comprises 212 histologically verified cases of invasive breast cancer diagnosed at Turku University Hospital during the years 1990-1991 (Table 1). Of the samples 110 were fixed in buffered formalin and embedded in paraffin whereas 102 samples were first frozen and embedded in paraffin after frozen section diagnosis. Sections were cut at 5 µm and stained with haematoxylin and eosin. According to the original routine histopathologic diagnosis 174 cases were of ductal, 23 cases of lobular and 15 cases of special types of infiltrative breast cancer. Histological grades [6] were: grade 1, in 59 cases; grade 2, in 114 cases; and grade 3, in 39 cases.

Measurement of nuclear profile area
Measuring instrument. The nuclear profile area of the samples was measured using an image overlay drawing system run by the Prodit morphometry program (Prodit 3.1, Promis Inc, Almere, The Netherlands). The system includes a microscope, a personal computer (Compaq Deskpro 386/20e; Compaq Computer Corporation, Houston, TX, USA), a video camera (JVC TK-870U; JVC, Japan) and a digitizer board (PIP-512B video digitizer board; Matrox Electronic Systems, Dorval, Quebec, Canada). Digitized images of nuclear profiles were outlined on the monitor screen with a computer mouse [27]. At the beginning of each measuring session the system was calibrated with a micrometer slide. Measurements were performed with a ×40 objective which, combined with the ×10 video ocular and ×2 internal magnification, resulted in an image of ×2500 magnification on the monitor screen.
Sampling rule. The analysis of each sample began with selection of the measurement area at the clearly invasive border of the tumour in which the most cellular area was chosen for analysis. Necrotic and inflammatory areas were excluded. All distinguishable tumour cell nuclei in a microscopic measurement field were systematically selected starting from the upper-left corner of the measurement field. Altogether 6-15 adjacent fields were analysed until a total of 50 nuclei were measured.
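The sampling rule above can be illustrated with a minimal sketch. It assumes the outlined nuclear profile areas are already available as one list of values (in µm²) per microscope field, in the order the fields were analysed. The function names, the input layout and the use of the sample standard deviation (n - 1 denominator) are assumptions made for illustration and are not taken from the Prodit program.

```python
from statistics import mean, stdev

def sample_nuclei(fields, target=50):
    """Collect nuclear profile areas field by field until `target` nuclei are reached.

    `fields` is a list of lists; each inner list holds the areas (in µm²)
    measured in one microscope field, in measurement order (hypothetical layout).
    """
    areas = []
    for field in fields:
        for area in field:
            areas.append(area)
            if len(areas) == target:
                return areas
    # Fewer than `target` nuclei were available in the supplied fields.
    return areas

def summarize(areas):
    """Return (mean, SD) of nuclear profile area; SD is the sample SD (assumption)."""
    return mean(areas), stdev(areas)

# Made-up example values, not data from the study.
example_fields = [[30.0, 40.0], [50.0, 60.0], [70.0]]
selected = sample_nuclei(example_fields, target=4)
mean_area, sd_area = summarize(selected)
```

The two summary values correspond to the two features followed in the study: mean nuclear profile area and SD of nuclear profile area.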

Training of the observers
Observers 1 and 2 (CH and ÜT) were originally unfamiliar with the measuring method. The measuring method was described to the observers step by step, after which they rehearsed the measuring procedure thoroughly. The trainer (PK) always selected the measurement area and marked it with ink. The measuring performance was regularly supervised by the trainer, and conversations concerning the histological interpretation and measuring technique took place daily. Towards the end of the first training week (primary training phase) the observers worked together, each measuring eight unselected samples twice. The results of the primary training phase were analysed on the fifth day of the rehearsal period. At that time, means to reduce the influence of the sources of variation in the measurements were thoroughly discussed. In the next training phase of three days (final training phase) the observers worked independently. Applying the sampling rules agreed on earlier, they carried out repeated measurements of an additional 12 unselected samples. The results of the training phase were analysed after eight days of training.

Analysis of reproducibility of morphometry results
After the training period observers 1 and 2 measured nuclear profile areas of 192 samples to analyse the reproducibility associated with the morphometric measuring performance. The observers measured the nuclei from the same measurement areas selected and marked by the trainer. The measurement design of observers 1 and 2 is summarized in Table 1.

Analysis of the sources of variation
In a morphometric system the sources of variation are intraobserver variation (Vm), interobserver variation (Vo) and inter-area variation (Va) (Table 2) [10]. When variation is expressed as variances the figures are additive, and the total variation in a histopathology laboratory (Vt) can be calculated as Vt = Vm + Vo + Va. For estimation of Vm, one pathologist (PK) measured 20 unselected samples three times. Vo was estimated after measurements of the same 20 samples by three observers. For estimation of Va the three pathologists (YC, TK, PK) studied the same 20 cases and marked the measurement area of their choice with an ink mark encircling an area of approximately 3 mm in diameter on each slide according to the criteria presented earlier. The cases were assessed independently and the ink marks were always wiped off before the slides were passed on to the next pathologist. In each of the chosen areas one pathologist (PK) carried out the morphometric measurements.
Table 2 Analysis of sources of variation in measuring nuclear profile areas. A total of 20 samples were measured to estimate the size of intraobserver (Vm), interobserver (Vo), and inter-area (Va) variation

Variation source               Number of observers   Number of measurement sessions
Intraobserver variation (Vm)   1                     3
Interobserver variation (Vo)   3                     1
Inter-area variation (Va)*     1                     3

* The measurement area was independently selected by three pathologists.

The variances Vm, Vo and Va were determined by estimating the average coefficient of variation (CV = SD/mean) of the measurements in each situation and then calculating the corresponding SD for a theoretical situation in which the mean nuclear area was 40 µm² and the SD of nuclear area was 10 µm². Variances were the squares of these SDs. The true interobserver variation of the system is calculated by subtracting the intraobserver variation from the total variation after measurements by three different observers from the same fields. The true inter-area variation of the system is determined by subtracting the intraobserver variation from the total variation after measurements from three measurement areas selected by different pathologists. As a consequence, the true total variance (Vt) is the sum of the intraobserver variation, the true interobserver variation and the true inter-area variation. This approach allows us to determine the influence of each source of variation as a fraction of the total variation.
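The variance bookkeeping described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the CV values in the example are invented placeholders, not the values of Table 6. The reference value of 40 µm² is the theoretical mean nuclear area named in the text.

```python
def variance_from_cv(cv, reference):
    """Convert an average CV (SD/mean) into a variance at a reference value."""
    sd = cv * reference
    return sd ** 2

def decompose(cv_intra, cv_observers_total, cv_areas_total, reference):
    """Split total variance into intraobserver, true interobserver and true inter-area parts.

    cv_intra:           CV of repeated measurements by one observer
    cv_observers_total: CV of measurements by three observers from the same fields
    cv_areas_total:     CV of measurements from areas chosen by three pathologists
    """
    v_m = variance_from_cv(cv_intra, reference)
    # "True" components: subtract the intraobserver variance from each total.
    v_o = variance_from_cv(cv_observers_total, reference) - v_m
    v_a = variance_from_cv(cv_areas_total, reference) - v_m
    v_t = v_m + v_o + v_a
    return {"Vm": v_m, "Vo": v_o, "Va": v_a, "Vt": v_t}

def shares(parts):
    """Express each component as a fraction of the total variance Vt."""
    return {name: parts[name] / parts["Vt"] for name in ("Vm", "Vo", "Va")}

# Placeholder CVs chosen for illustration only (NOT the study's results).
parts = decompose(0.05, 0.07, 0.13, reference=40.0)
fractions = shares(parts)
```

With these placeholder inputs the inter-area component dominates the total, which mirrors the kind of pattern the analysis is designed to reveal; the actual proportions must be taken from the measured CVs.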

Statistical analysis
Pearson's correlation coefficients and Spearman's rank-order correlation coefficients were used to compare the measurements of observers 1 and 2. Grading efficiency (GE) [11,13,14,28] was calculated to estimate the fraction of samples which can be correctly graded by the method under study. The GEs were determined from 2 × 2 and 3 × 3 tables, which summarize the results of two observers in classifying samples into two and three groups, respectively. In a similar approach, kappa coefficients were calculated to relate the observed agreement between the measurements to the agreement expected by chance. Kappa values gave an estimate of the internal consistency of the method applied [26,31,38].
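As an illustration of the agreement statistics, the sketch below classifies two observers' measurements into groups by thresholds and computes the proportion of concordant classifications together with Cohen's kappa. Two points are assumptions: the proportion of concordant classifications is used here only as a simple stand-in for GE, whose exact definition is given in the cited references, and a value equal to a threshold is assigned to the upper group. The function names and the example values are hypothetical.

```python
from collections import Counter

def classify(value, thresholds):
    """Return the group index: the number of thresholds the value reaches or exceeds."""
    return sum(value >= t for t in thresholds)

def observed_agreement(groups_a, groups_b):
    """Proportion of samples placed in the same group by both observers."""
    return sum(a == b for a, b in zip(groups_a, groups_b)) / len(groups_a)

def cohen_kappa(groups_a, groups_b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(groups_a)
    p_obs = observed_agreement(groups_a, groups_b)
    count_a, count_b = Counter(groups_a), Counter(groups_b)
    # Chance agreement from the marginal group frequencies of each observer.
    p_exp = sum(count_a[g] * count_b[g] for g in set(count_a) | set(count_b)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)  # undefined when p_exp == 1

# Made-up mean nuclear profile areas (µm²) for six samples, two observers.
obs1 = [28, 33, 36, 41, 45, 52]
obs2 = [29, 36, 34, 43, 44, 50]

two_groups_1 = [classify(v, [35]) for v in obs1]      # two groups, threshold 35 µm²
two_groups_2 = [classify(v, [35]) for v in obs2]
three_groups_1 = [classify(v, [30, 40]) for v in obs1]  # three groups, 30 and 40 µm²
three_groups_2 = [classify(v, [30, 40]) for v in obs2]

agreement_2x2 = observed_agreement(two_groups_1, two_groups_2)
kappa_2x2 = cohen_kappa(two_groups_1, two_groups_2)
```

Kappa is lower than the raw agreement because it discounts the agreement two observers would reach by chance given how often each of them uses each group.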

Results
The results of the measurements in the material of 192 breast cancer cases are outlined in Fig. 1 for mean nuclear profile area, and in Fig. 2 for SD of nuclear profile area. The averages of the results were identical in the measurements of both observers: mean nuclear profile area 41.9 µm² (median 36.9 µm²) and standard deviation of nuclear profile area 12.9 µm² (median 11.5 µm²). The correlation coefficients (Pearson's r and Spearman's rho) are presented in Table 3. The results show a clear improvement in intra- and interobserver reproducibility during the training phase. The results classified into two and three groups are presented in Tables 4 and 5 in terms of GEs and kappa coefficients and demonstrate a corresponding trend. Figure 3 shows the distribution of GEs for mean nuclear profile area and Fig. 4 the distribution of GEs for SD of nuclear profile area at different cut-off points. The minimum GE was 0.93 for mean nuclear profile area (at the cut-off point of 34 µm²) and the minimum GE for the SD of nuclear profile area was 0.89 (at the cut-off point of 10 µm²).
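The idea of examining agreement across a range of cut-off points and reporting the least favourable one can be sketched as follows. As a simplifying assumption, agreement at a cut-off is taken as the proportion of samples that both observers place on the same side of it, which stands in for GE here; the function names and data are hypothetical and do not reproduce the study's figures.

```python
def agreement_at(cutoff, obs_a, obs_b):
    """Proportion of samples that both observers place on the same side of `cutoff`."""
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(obs_a, obs_b))
    return same / len(obs_a)

def minimum_agreement(obs_a, obs_b, cutoffs):
    """Return (cutoff, agreement) for the cut-off with the lowest agreement."""
    scores = {c: agreement_at(c, obs_a, obs_b) for c in cutoffs}
    worst = min(scores, key=scores.get)
    return worst, scores[worst]

# Made-up measurements (µm²) by two observers; cut-offs scanned in steps of 5 µm².
first = [28, 33, 36, 41, 45, 52]
second = [29, 36, 34, 43, 44, 50]
worst_cutoff, worst_score = minimum_agreement(first, second, range(30, 51, 5))
```

Reporting the minimum over all cut-offs gives a conservative figure: agreement at any other cut-off in the scanned range is at least as high.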
In Table 6 the variations are expressed as coefficients of variation (CV). The CV values were the lowest for intraobserver variations and by far the highest for inter-area variations. The true interobserver variation, expressed in variances, is the total variance of the measurements by three observers from the same fields minus the intraobserver variance (Vm); the true inter-area variation is obtained correspondingly.

Table 3 Correlation coefficients (Pearson's r) and rank-order correlation coefficients (Spearman's rho) of the morphometrically determined nuclear sizes (mean nuclear profile area; standard deviation of nuclear profile area in brackets). In the material of 192 breast cancer samples the two observers (Obs 1 and 2) performed the measurements once. Before the measurement of the above cases, 20 breast cancer cases were used for training in two phases. The latter measurements were performed twice by each observer

Columns: single measurements by two observers (interobserver reproducibility); repeated measurements by one observer.

Table 4 Grading efficiency of the morphometrically determined mean nuclear profile area (standard deviation of nuclear profile area in brackets). In the material of 192 breast cancer samples the two observers performed the measurements once. The thresholds for classifying the material into two groups were 35 µm² for mean nuclear profile area and 10 µm² for standard deviation of nuclear profile area. The corresponding thresholds for classifying the material into three groups were 30 µm² and 40 µm², and 7 µm² and 13 µm² for mean nuclear area and standard deviation of nuclear profile area, respectively. Before the measurement of the above cases, 20 breast cancer samples were used for training in two phases. The latter measurements were performed twice by each observer

Columns: single measurements by two observers (interobserver reproducibility); repeated measurements by one observer.

Table 5 Kappa coefficients of the morphometrically determined mean nuclear profile area (standard deviation of nuclear profile area in brackets). In the material of 192 breast cancer samples the two observers performed the measurements once. The thresholds for classifying the material into two groups were 35 µm² for mean nuclear profile area and 10 µm² for standard deviation of nuclear profile area. The corresponding thresholds for classifying the material into three groups were 30 µm² and 40 µm², and 7 µm² and 13 µm² for mean nuclear area and standard deviation (SD) of nuclear profile area, respectively. Before the measurement of the above cases, 20 breast cancer samples were used for training in two phases. The latter measurements were performed twice by each observer

Table 6 Variation of nuclear morphometric measurements by one observer in one measurement area, by one observer in three different measurement areas (selected by three pathologists), and by three measurements from the same area. This experiment was done to estimate the influence of field selection on the measurement results. 40 µm² has been used to represent the theoretical average nuclear area, and 10 µm² to represent the SD of nuclear area of all

Conclusions
Barry and Sharkey [3] studied observer reproducibility of nuclear morphometry in 96 mammary ductal carcinoma samples embedded in plastic, sectioned at 1 µm and measured with a 63× oil-immersion objective. The correlation coefficients between measurements of two observers ranged from 0.840 to 0.910, and between repeated measurements of one observer from 0.889 to 0.915 (Pearson's r). In a similar morphometric measurement setting, the results of nuclear morphometry in lymphocytic diffuse malignant lymphoma cells [8], however, showed still better intraobserver consistency, with Pearson's r ranging from 0.977 to 0.995. Our results were well in line with the former study even though we used standard histological processing. This suggests that detailed measurements of individual nuclei do not markedly decrease variation in morphometry. On the other hand, the cell size in lymphocytic lymphoma is very uniform, which explains the extremely high correlation observed. Concerning other methods of quantitative pathology, the reproducibility of the assessment of mitotic activity has been most thoroughly investigated [1,5]. Kujari and co-workers [28] studied the reproducibility of the volume fraction-corrected mitotic index (M/V index, also called standardized mitotic index, SMI) [12] in 144 breast cancer specimens by four independent observers and two methods of analysing the epithelial fraction. The resulting correlation coefficients among methods ranged from 0.568 to 0.677 and between methods from 0.484 to 0.734. The corresponding mean interobserver GE varied between 0.90 and 0.93 (minimum 0.83). In another study on mitotic activity [19] the reproducibility of the assessment of MAI (mitotic activity index) in terms of Pearson's correlation coefficient ranged from 0.81 to 0.96. The latter results were based on mitotic counts in 13 pathology laboratories.
Also, the inter-laboratory agreement of flow cytometric DNA analysis in breast cancer is at the level of the present results on nuclear morphometry [2,4,20,21,29,30,32,33,37]. The conclusion is that morphometric measurements are accurate enough for creating a morphometric grading system which, when biologically relevant, could compete with other well-established methods of quantitative pathology.
In a study by Delides and co-workers [17], perfect agreement among six pathologists on the subjectively assessed final grade was observed in only 14.5% of the cases. In Scarff-Bloom-Richardson gradings of breast cancer, agreement among six observers was found for 74.3% of all specimens [15,16,23-25,35-37]. Data on the reproducibility of the subjective assessment of nuclear pleomorphism in breast carcinoma have not been available. In our unpublished material Pearson's r of subjective nuclear gradings of invasive breast cancer ranged from 0.56 to 0.68 (mean 0.62) in repeated gradings by one observer. The corresponding mean GE and kappa were 0.86 and 0.58, respectively. Between observers, Pearson's r ranged from 0.44 to 0.59 (mean 0.53), and GE and kappa were 0.77 and 0.39, respectively. The results of the present paper favour the morphometric approach over subjective assessment for accurate grading.
In the early phases of starting morphometry, training clearly improved the reproducibility. The reproducibility of the SD of nuclear area improved the most. By the end of the rehearsal, however, the assessments of the nuclear area and of nuclear area variation were found to be almost equally reproducible. During the training period interobserver reproducibility was almost systematically higher than intraobserver reproducibility. In the primary training phase this can be explained by the fact that this part of the rehearsal was teamwork, with the observers discussing the possible applications of the measuring criteria in each sample. In part, the relatively low intraobserver reproducibilities were caused by the observers constantly correcting their measuring techniques to produce measurement results that were as uniform as possible between them. In the results, it may also seem confusing that the improvement in reproducibility does not continue throughout the whole experiment but that reproducibility decreases while measuring the 192 cases. This finding may partly be explained by the biological variation of the analysed feature and partly by increasing intraobserver variance in a larger material [3,8]. The fact that the reproducibility of results did not markedly improve during the latter part of the rehearsal also suggests that the training period of eight days can be considered sufficient.
Collan et al. [9] and Van Diest et al. [18] summarize the sources of variation in the nuclear morphometric method, which involve factors related to the sample as well as tissue processing and instrumentation. According to our results, intraobserver and interobserver variation for both mean nuclear area and SD of nuclear area were low, ranging from 7.3% to 28.9%. The inter-area variation, in turn, was responsible for the majority of the total variation associated with the measuring method (83.0% in mean nuclear profile area and 54.8% in SD of nuclear area). In light of these figures, choosing the measurement areas was the most important factor influencing the reproducibility of the measurements. The consistency of the selection of the measurement areas in this study can be considered good. In fact, in 16 out of the 20 samples the three pathologists chose and marked the very same areas of the specimen for measurement, and in the rest of the samples the selected areas were located along the same invasive border of the sample. Thus the main part of the measurement-associated variation does not seem to be due to the selection of the measurement area but instead is associated with the inconsistency between the microscopic fields analysed in the marked measurement area. The main conclusion is that the biological variation present within the sample is responsible for the majority of the total variation. This suggests that the reproducibility of morphometrically determined nuclear size cannot be markedly improved by further efforts to standardise the measuring method. Rather, stress should be put on sampling methods associated with choosing the measurement area and on consistent selection of nuclei from the chosen area. Sampling rules including systematic sampling from many fields over the whole section could result in more stable results and a higher range of features associated with nuclear size variation.
For this, many alternatives are available [22] including the application of stereology [34], but the disadvantage of laborious implementations is obvious especially to diagnostic pathologists. The development of measurement support programs may soon make these approaches more attractive.