Tissue Counter Analysis of Histologic Sections of Melanoma: Influence of Mask Size and Shape, Feature Selection, Statistical Methods and Tissue Preparation

Background: Tissue counter analysis is an image analysis tool designed for the detection of structures in complex images at the macroscopic or microscopic scale. As a basic principle, small square or circular measuring masks are randomly placed across the image and image analysis parameters are obtained for each mask. Based on learning sets, statistical classification procedures are generated which facilitate an automated classification of new data sets. Objective: To evaluate the influence of the size and shape of the measuring masks as well as the importance of feature selection, statistical procedures and technical preparation of slides on the performance of tissue counter analysis in microscopic images. As the main quality measure of the final classification procedure, the percentage of elements that were correctly classified was used. Study design: H&E‐stained slides of 25 primary cutaneous melanomas were evaluated by tissue counter analysis for the recognition of melanoma elements (section area occupied by tumour cells) in contrast to other tissue elements and background elements. Circular and square measuring masks, various subsets of image analysis features, and two statistical alternatives (classification and regression trees and linear discriminant analysis) were used. The percentage of elements that were correctly classified by the various classification procedures was assessed. In order to evaluate the applicability to slides obtained from different laboratories, the best procedure was automatically applied to a test set of another 50 cases of primary melanoma derived from the same laboratory as the learning set and to two test sets of 20 cases each derived from two different laboratories, and the measurements of melanoma area in these cases were compared with conventional assessment of vertical tumour thickness.
Results: Square measuring masks were slightly superior to circular masks, and larger masks (64 or 128 pixels in diameter) were superior to smaller masks (8 to 32 pixels in diameter). As far as the subsets of image analysis features were concerned, colour features were superior to densitometric and Haralick texture features. Statistical moments of the grey level distribution were of least significance. CART (classification and regression tree) analysis turned out to be superior to linear discriminant analysis. In the best setting, 95% of melanoma tissue elements were correctly recognized. Automated measurement of melanoma area in the independent test sets yielded a correlation of r=0.846 with vertical tumour thickness (p < 0.001), similar to the relationship reported for manual measurements. The test sets obtained from different laboratories yielded comparable results. Conclusions: Large, square measuring masks, colour features and CART analysis provide a useful setting for the automated measurement of melanoma tissue in tissue counter analysis, which can also be used for slides derived from different laboratories.


Introduction
Though automated image segmentation procedures often work well in cytological preparations [6,8,10], histologic specimens often pose problems that require interactive measurements [14]. In order to avoid segmentation prior to obtaining measurements, tissue counter analysis has been designed for complex digital scenes, particularly at the histologic level [11,12]. Instead of attempting an a priori discrimination of certain structures and subsequent measurements of these structures, images are overlaid with regularly distributed measuring masks of equal size and shape, representing circular or square partial areas, each of them representing a tissue element. Image analysis parameters obtained for each element are recorded and stored in a database. In the learning step, an expert interactively classifies each element as belonging to a particular class of elements, for example "collagen of the reticular dermis" or "other tissue elements". Subsequent statistical analysis provides rules and thresholds which characterize each class. This information can finally be used for the user-independent classification of new elements and can be implemented into a fully automated image analysis system [11,12]. As a final result, the procedure "counts" the number of elements falling into each class. Therefore, the term "tissue counter analysis" was coined.
Previous studies have shown that tissue counter analysis can be applied to histologic specimens of normal and diseased skin [7] including diagnostic assessment of benign and malignant melanocytic skin lesions [12] and also to clinical and surface microscopic slides [9,11]. In the case of melanocytic skin tumours, tissue elements obtained from images of common nevi and of malignant melanoma were subjected to multivariate discriminant analysis in order to facilitate a distinction between "benign" and "malignant" elements [12]. In that study, 85.6% of all elements were correctly classified as being derived from either a benign or a malignant lesion, and based on the relative proportion of "benign" and "malignant" elements a correct diagnostic classification was achieved in 40 cases each of common nevi and malignant melanoma. In the studies concerning tissue counter analysis, various shapes and sizes of measuring masks have been used, and linear discriminant analysis as well as CART (classification and regression tree) analysis have been applied as alternatives.
The present study examines the influence of various factors on the performance of tissue counter analysis in histologic images: shape of the measuring masks, size of the measuring masks, subsets of image analysis features, and statistical procedures are evaluated with histologic sections of cutaneous malignant melanoma obtained from three different laboratories serving as a target example.

Specimens
75 specimens of cutaneous melanomas were consecutively sampled from the Dermatopathology files of the Department of Dermatology, University of Graz, Austria. Inclusion criteria were that the lesion had been completely excised, and that vertical tumour thickness was at least 1 mm. Doubtful lesions and lesions associated with pre-existent nevi were excluded from the study. Mean age of the patients was 63 ± 16 years (range: 28 to 91 years), with 50.3% females and 49.7% males. Mean vertical tumour thickness measured from the granular layer down to the lowermost melanoma cell in the depth of the skin section (so-called Breslow index [4]) was 2.22 ± 1.57 mm (range: 1.0 to 8.0 mm). 4 µm sections were prepared and stained with hematoxylin and eosin with an automated staining device (DRF 701, Fakura, Japan). Furthermore, 20 cases each of malignant melanoma with the same inclusion criteria were obtained from the Department of Dermatology and Venereology, University of Luebeck, Germany, with a Breslow index of 2.78 ± 2.06 (1.0-9.0 mm) stained manually according to a standard protocol, and from the "Dermatohistopathologische Gemeinschaftspraxis", Friedrichshafen, Germany, with a Breslow index of 2.87 ± 1.48 (1.0-7.0 mm), stained automatically using a Medite Linearstainer (Medite, Burgdorf, Germany). From each specimen, the section showing the largest vertical diameter was selected as index slide and used for further evaluation.

Image analysis system
Slides were examined with an Axioskop 2 Imaging microscope (Zeiss, Oberkochen, Germany) equipped with a motor-driven scanning stage, an automated focusing device, and a three chip colour video camera (Sony, Tokyo, Japan). Images were fed into a KS 400 3.0 image analysis system (Zeiss Vision, Hallbergmoos, Germany) which also served as a control system for the motorized functions of the microscope. Examinations were performed with a 10× objective, yielding a final resolution of 1.3 µm per pixel. The size of individual images was 764 × 573 pixels.

Scanning procedure
For each slide, the corners of a meander were defined which included the whole melanoma area of the particular section. Subsequently, the meander was automatically scanned. In the learning procedure, 12 randomly selected fields were evaluated in each case. In the automated test procedure, the total meander was scanned. In very large specimens the number of fields evaluated was limited to 200 by restricting the analysis to every second or third field. Each field of vision was automatically focused. Illumination was kept constant at a grey level of 195 ± 10 in a white background field, and additive shading correction was performed for each image with a white, 10 × 10 mean-filtered background image. No further image enhancement steps were carried out.
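The shading correction step can be illustrated with a minimal sketch. It assumes one common form of additive correction (subtract the smoothed white reference, then add a constant target level); the exact formula used by the KS 400 system is not given in the text, and the function names and the target level of 195 as an offset are illustrative assumptions.

```python
def mean_filter(img, size=10):
    """Box (mean) filter with a size x size window, clipped at the image borders."""
    h, w = len(img), len(img[0])
    half = size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - half), min(h, y - half + size))
            xs = range(max(0, x - half), min(w, x - half + size))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def additive_shading_correction(img, white_ref, target=195):
    """Assumed additive scheme: corrected = image - smoothed reference + target.

    img and white_ref are single-channel images (lists of rows of grey values).
    Results are rounded and clipped to the 8-bit range 0..255.
    """
    smooth = mean_filter(white_ref)
    return [
        [min(255, max(0, round(p - s + target))) for p, s in zip(row, srow)]
        for row, srow in zip(img, smooth)
    ]


# Usage: a blank field imaged under slightly dim, uniform illumination
reference = [[180] * 12 for _ in range(12)]
blank_field = [[180] * 12 for _ in range(12)]
corrected = additive_shading_correction(blank_field, reference)
```

With this scheme a blank field is mapped to the target background level, so that grey value thresholds learned on one slide remain comparable on the next.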

Learning procedure
25 melanoma specimens were used to generate a learning data set. A learning data set comprises a table of elements along with the image analysis parameters of each element and a user-defined class label, which serves as the "gold standard" of classification. For this purpose, 12 fields per case were each overlaid with a grid of 20 regularly distributed square measuring masks of 8 × 8 pixels. Each element was interactively classified as belonging to one of three classes: (1) background of the slides outside the section; (2) melanoma tissue; (3) other tissue. An element was considered to represent melanoma tissue when the measuring mask contained tumour cells. Technically, the user selects one class after the other and performs a single mouse click on each element belonging to the particular class. For convenience, the frame of each classified element is highlighted in a class-specific colour in the image overlay [1]. Besides the 8 × 8 measuring mask used for interactive classification, the image was also overlaid with square measuring masks of 16, 32, 64 and 128 pixels in diameter, as well as with circular masks of the same set of diameters. All these masks were centred on the originally classified 8 × 8 square mask, and the class label of this original mask was assigned to all other masks at the same location. This means that a mask was classified as belonging to a particular class when the 8 × 8 pixel wide centre of the mask was occupied by elements of that class, even when the larger mask also contained other structures. By this procedure, all masks, including masks with more than one component, were used for the learning set, because in automated procedures all masks, homogeneous or not, will also have to be evaluated.
For each measuring mask, a set of densitometric, colour and texture features and statistical moments (the latter describing statistical features of the grey level histogram [1]) was assessed (Table 1) and stored along with the interactively defined class label. For each type of measuring mask, a learning data set comprising 6000 elements (25 cases × 12 fields × 20 masks) was thus created.
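The nesting of masks around one labelled location can be sketched as follows. This is a minimal illustration and not the KS 400 implementation: the function names are hypothetical, the pixel-centre rule used for circular masks is an assumption, and only two simple features (mean grey value and sum of squared grey values of one channel) stand in for the full feature set of Table 1.

```python
def mask_pixels(cx, cy, diameter, shape="square"):
    """Pixel coordinates of a mask of the given diameter centred on (cx, cy).

    Assumption: for circular masks, a pixel belongs to the mask when its
    centre lies within diameter / 2 of the mask centre.
    """
    half = diameter // 2
    coords = []
    for y in range(cy - half, cy + half):
        for x in range(cx - half, cx + half):
            if shape == "circle":
                dx, dy = x + 0.5 - cx, y + 0.5 - cy
                if dx * dx + dy * dy > (diameter / 2) ** 2:
                    continue
            coords.append((x, y))
    return coords


def mask_features(channel, coords):
    """Two illustrative features of one colour channel inside the mask."""
    vals = [channel[y][x] for x, y in coords]
    return {"mean": sum(vals) / len(vals), "sum_sq": sum(v * v for v in vals)}


def nested_elements(channel, cx, cy, label, diameters=(8, 16, 32, 64, 128)):
    """One learning-set row per mask shape and size, all sharing the label
    that the expert assigned to the central 8 x 8 square mask."""
    rows = []
    for shape in ("square", "circle"):
        for d in diameters:
            feats = mask_features(channel, mask_pixels(cx, cy, d, shape))
            rows.append({"shape": shape, "diameter": d, "label": label, **feats})
    return rows


# Usage: a uniform 128 x 128 test channel, labelled location at its centre
channel = [[10] * 128 for _ in range(128)]
rows = nested_elements(channel, 64, 64, "melanoma")
```

Each call yields ten rows (two shapes, five diameters) with the same class label, which mirrors how one interactive click produces entries in all ten learning data sets.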

Statistics
The learning sets were submitted to the following statistical tools in order to provide algorithms for the recognition of the three classes of elements: On the one hand, multidimensional stepwise linear discriminant analysis [5] was performed using the SPSS software package (SPSS Inc., Sunnyvale, USA). This procedure leads to linear combinations of subsets of variables which finally yield canonical variables with discriminant values facilitating a discrimination of the classes of elements. On the other hand, CART (classification and regression tree) analysis using the CART 3.6 program (Salford Systems, San Diego, USA) was used [3,15]. In brief, classification and regression tree analysis tries to separate certain classes of elements by searching the data sets for features providing optimal binary splits in separating the database into groups with a predominance of one or the other class of elements. These subgroups are termed "nodes", and each node is tested for further split criteria in order to create two daughter nodes. When in a particular node no further split criterion is found, the node is called a "terminal node". To create a reliable tree model, the program randomly divides the data into a preliminary learning set and a preliminary test set, and repeats the whole procedure ten times. Only splits which are reproduced in all trees enter the final classification tree, thus providing a reliable classification [3,15]. In all classification procedures, the percentage of correctly labelled elements was assessed as a measure of the quality of the procedure.
Relationships of measuring conditions on the one hand and the percentage of correctly classified elements on the other were evaluated by Wilcoxon's matched pairs signed rank test and by Spearman's rank correlation test [5] where appropriate.
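The search for an optimal binary split that underlies CART analysis can be sketched in a few lines. This is a simplified illustration of a single split step, not the CART 3.6 program: Gini impurity is assumed as the split criterion (the text does not state which criterion was used), and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Find the feature and threshold giving the purest two daughter nodes.

    rows is a list of feature tuples, labels the matching class labels.
    Returns (weighted impurity, feature index, threshold); elements with
    feature value <= threshold go to the left daughter node.
    """
    n = len(rows)
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds lie midway between neighbouring feature values
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[f] <= thr]
            right = [l for r, l in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, thr)
    return best


# Usage: the second feature separates the two classes, the first does not
rows = [(1, 50), (2, 10), (3, 60), (4, 20)]
labels = ["other", "melanoma", "other", "melanoma"]
score, feature, threshold = best_split(rows, labels)
```

A full tree is obtained by applying the same search recursively to each daughter node until no further useful split is found, at which point the node becomes a terminal node.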

Test procedure
The measuring procedure which had turned out to yield the best classification was selected for further application. The split criteria defining the melanoma tissue elements were implemented into an automated measuring program. This program was designed to scan whole sections, thereby recognizing the elements (test areas defined by measuring masks) classified as melanoma elements based on the criteria of the learning set. The elements recognized as melanoma elements are shown in an overlay image, and finally a zoomed image of the whole measuring area, again with the melanoma elements in the overlay, is displayed. The amount of tissue classified as melanoma elements is given in mm² section area.
The automated measuring procedure was applied to 50 cases of melanoma from the same laboratory as the learning set, and to 20 cases each obtained from two different laboratories, which served as test sets. The procedure was used to measure the section area occupied by melanoma tissue without user interaction. Finally, the estimates of melanoma area (given in mm²) were tested for correlation with the Breslow index (given in mm) of the same lesions by calculating the correlation coefficient between both values using linear regression analysis [2].

General observations
The interactive classification producing the learning sets takes about 5 min per case. Since the 8 × 8 square masks are very small, they can be unambiguously labelled as belonging to a particular class of elements. In the rare instances where the mask was lying on the border of, e.g., melanoma and other tissue, it was labelled according to the structure which comprised the majority of the contents of the mask. Automated measurement took between 2 and 15 min, depending on the size of the section evaluated, with user interaction limited to about 1 min for defining the corners of the meander and the white background image.
Table 2 Percentage of correctly classified elements of melanoma tissue, other tissue components and background in haematoxylin-eosin-stained slides. Influence of image analysis features, size and shape of elements, and statistical procedures (n = 6000 elements per test; LDA: linear discriminant analysis; CART: classification and regression tree). The best result was obtained with a 64 × 64 pixel square mask, using all features evaluated by CART analysis

Effects of the measurement and classification settings
The percentage of correctly classified elements ranged from 28%, when only the statistical moments were considered in circular masks with a diameter of 16 pixels and linear discriminant analysis, to 95% when all parameters were taken into account and a square mask of 64 × 64 pixels was used with CART analysis. As far as the size of the measuring mask is concerned, the percentage of correctly classified elements increased with the size of the measuring mask (Table 2). The comparison of square and circular masks of equal diameter usually showed a slight advantage of the square masks (Table 2). This fact, however, could also be due to the larger area of square masks of the same diameter. In multivariate analysis taking into account mask area and mask shape simultaneously, the latter did not significantly correlate with the classification results. CART analysis turned out to be slightly superior to multidimensional linear discriminant analysis (Table 2). As far as the various subsets of measuring features were concerned, the colour features were almost as good as the whole data set, while densitometric features and Haralick texture features produced less reliable results. The least significant contribution came from the statistical moments of the grey value distribution.
When for each element the data derived from two measuring masks of different size were combined (Table 3), no marked advantage over the use of a single type of measuring mask was found.
Table 3 Percentage of correctly classified elements of melanoma tissue, other tissue components and background in haematoxylin-eosin-stained slides. Influence of the combination of measuring elements of different size (n = 3000 elements per test; LDA: linear discriminant analysis; CART: classification and regression tree). There is no advantage compared to the use of measuring elements of a single size
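The difference in area between square and circular masks of equal diameter follows from elementary geometry. For a diameter (side length) of d pixels, treating the circle as ideal and ignoring pixel discretization:

```latex
A_{\text{square}} = d^{2}, \qquad
A_{\text{circle}} = \frac{\pi d^{2}}{4}, \qquad
\frac{A_{\text{square}}}{A_{\text{circle}}} = \frac{4}{\pi} \approx 1.27
```

For d = 64, for example, the square mask covers 4096 pixels and the circular mask about 3217 pixels, so a square mask contains roughly 27% more image information than a circular mask of the same diameter.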

Automated melanoma area measurements
For further evaluation, the setting with a square measuring mask of 64 × 64 pixels in size, with all measuring features, and CART analysis, was used. The binary criteria defining the melanoma elements were (MEANDG > 84.205 AND SUMQR > 8.19055E+007 AND SUMQR ≤ 9.44269E+007) OR (SUMQR ≤ 8.19055E+007), with MEANDG denoting the mean grey value in the green image and SUMQR denoting the sum of the squared grey values in the red image (Fig. 2). In the test set of 50 cases of melanoma, the area measurements based on these CART criteria revealed a median of 11.6 mm² (range: 0.5 to 118.6 mm², interquartile range 4.8 to 23.6 mm²). Spearman's rank correlation analysis with Breslow index yielded a highly significant relationship between the automatically obtained area measurements on the one hand and vertical tumour thickness on the other (r = 0.834; p < 0.001). For the two test sets from different laboratories, melanoma area was 30.6 ± 50.9 mm² (range: 2.3 to 225.3 mm²) and 20.6 ± 12.2 mm² (range: 3.3 to 41.7 mm²), with a correlation with Breslow index of r = 0.910 (p ≤ 0.001) and r = 0.913 (p ≤ 0.001), respectively. In a subset of 10 cases the area measurements were carried out twice on different occasions and turned out to be highly reproducible (r = 0.999; p ≤ 0.001).
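The binary criteria above translate directly into a short classification rule. The following sketch is an illustration only: the function names are hypothetical, and the area estimate assumes that the 64 × 64 pixel masks tile the scanned fields without overlap or gaps, which the text does not state explicitly. The pixel size of 1.3 µm is taken from the image analysis system description.

```python
PIXEL_SIZE_MM = 1.3e-3   # 1.3 µm per pixel
MASK_SIDE_PX = 64        # side length of the square measuring mask


def is_melanoma(meandg, sumqr):
    """CART criteria for melanoma elements as quoted in the text.

    meandg: mean grey value in the green image (MEANDG)
    sumqr:  sum of squared grey values in the red image (SUMQR)
    """
    return (meandg > 84.205 and 8.19055e7 < sumqr <= 9.44269e7) or sumqr <= 8.19055e7


def melanoma_area_mm2(elements):
    """Section area covered by melanoma elements, in mm².

    elements is an iterable of (MEANDG, SUMQR) pairs, one per mask.
    Assumption: non-overlapping masks, so area = count x mask area.
    """
    area_per_element = (MASK_SIDE_PX * PIXEL_SIZE_MM) ** 2
    count = sum(1 for meandg, sumqr in elements if is_melanoma(meandg, sumqr))
    return count * area_per_element


# Usage: three elements, two of which satisfy the melanoma criteria
elements = [(90.0, 9.0e7), (80.0, 9.0e7), (50.0, 8.0e7)]
area = melanoma_area_mm2(elements)
```

Under these assumptions each element corresponds to about 0.0069 mm² of section area, so the reported median of 11.6 mm² would correspond to roughly 1700 melanoma elements per section.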

Discussion
Our study shows that the process of tissue counter analysis can be successfully applied to histologic sections of malignant melanoma in order to detect the melanoma component. The settings of the procedure, however, have significant influence on the reliability of the classification. First, the parameters must carry sufficient information to achieve a useful classification. In the H&E sections used in this study, colour features turned out to be the most useful subset of measuring features. When all measuring features were combined, the results improved only marginally. The crucial importance of colour features in the present example raises the question of how the results may depend on the staining procedure. With the two other laboratories tested, however, satisfactory results were obtained. The fact that texture criteria were of minor importance may be due to the magnification used in this study. Haralick parameters in particular focus on the relationship of neighbouring pixels and are likely to miss texture changes at a larger scale.
The second important point is the size of the measuring mask. There seems to be an advantage of large masks over small masks, probably due to the larger information content in the larger elements. Whether square or circular masks are used seemed to be of minor importance, but at equal diameters square masks were usually slightly superior to circular masks.
Among the two statistical classification processes used, CART analysis performed somewhat better than linear discriminant analysis. Besides this slight advantage as to the percentage of correctly classified elements, CART yields simple binary criteria defining certain subsets, while linear discriminant analysis provides large formulae with linear combinations of terms which are more difficult to implement in the KS 400 3.0 image analysis system.
It is remarkable that the simultaneous combination of the information obtained with a small and a large mask did not improve the results significantly. Obviously the main information is found in the large mask, and the particular features of the central area of the mask do not add to the classification process.
Once a reliable classification process is defined, it can be used for automated classification of new tissue elements. The application of the detection of melanoma elements to a test set of melanoma slides yielded highly reproducible results, with a correlation coefficient close to 1 between measurements taken twice on different occasions. When the relationship between the area measurements and vertical tumour thickness was tested, correlation coefficients between 0.83 and 0.91 were found. This is slightly higher than in previous studies, which yielded correlation coefficients between tumour thickness and area assessed by conventional methods of r = 0.770 [13] or r = 0.760 [16].
There are several limitations to the present study: All material had been prepared by standard protocols and automated staining facilities, thus variability in staining intensities and hue has been limited to a minimum within each laboratory, but the slides were obtained from different institutions. The study included only melanoma lesions of at least 1 mm tumour thickness, and doubtful cases were excluded. Furthermore, the study was focused on the detection of melanoma tissue on haematoxylin-eosin stained sections, and other target structures or other staining procedures would probably favour other feature subsets or differently sized measuring masks.
In conclusion, our results show that the exact setting of the measuring procedure, including size and shape of the test area, image analysis parameter set, and statistical classification tool, has a marked influence on the performance of tissue counter analysis. Furthermore, the usefulness of tissue counter analysis for the detection of tissue components is demonstrated, even when slides derived from different laboratories with some variability of the staining procedure are used.