Shape and Boundary Similarity Features for Accurate HCC Image Recognition

Nucleus morphology is of great importance in conventional pathological diagnosis of cancer because it visually reveals differences between normal and abnormal nuclei. This paper therefore proposes two novel kinds of features, shape similarity and boundary similarity, for distinguishing normal from hepatocellular carcinoma (HCC) nuclei. First, a fixed-size patch is obtained for each individual nucleus using the center-proliferation segmentation (CPS) method. Then, a nucleus shape library is constructed from nuclei manually selected by pathologists and is used to measure nucleus shape similarity via the Dice, Jaccard, precision, and recall coefficients. Meanwhile, boundary similarity is evaluated through triangles formed by boundary feature points of each nucleus. Finally, a conventional random forest (RF) is trained and tested as the classification model for HCC nucleus recognition. Cross-validation tests are used to select the optimal feature set, and comparison experiments demonstrate that the proposed morphological features are more beneficial for classification than other traditional characteristics.


Introduction
Cancer is a leading cause of death worldwide. In less developed countries in particular, liver cancer is the second most common cancer, and the majority of primary liver cancers arise from liver cells and are called hepatocellular carcinoma (HCC). The probability of a successful cure is greatly increased when liver cancer is treated in its early stages. However, the symptoms of early liver cancer are not obvious to patients or doctors. Thus, early detection and diagnosis are of great significance for effectively decreasing the mortality of HCC.
Generally, a common method to confirm the diagnosis of HCC is through needle biopsy, which extracts some cells or a small piece of tissue from the affected area of the liver for analysis under a microscope. However, this diagnosis process is subjective, laborious, and time-consuming for operators. As is well-known, diagnosis from pathology images remains the "gold standard" for most cancers [1]. Therefore, the computer-aided diagnosis (CAD) for pathology image analysis has become a research hotspot in which the recognition of nucleus is regarded as a prerequisite. The accurate classification results could provide objective quantitative evaluations and facilitate the final diagnosis.
With the development of machine learning, several CAD models have been developed for pathology image processing. They mainly consist of three parts: nucleus segmentation, feature extraction, and cell classification. For nucleus segmentation, Jung et al. [2] addressed overlapping nuclei with an unsupervised Bayesian classification scheme. The distance transform, topographic surface, and expectation-maximization (EM) algorithm were employed, and the regular shape of clumped nuclei was used as a priori knowledge. Vink et al. [3] proposed an efficient nucleus detector that merged a large feature set with a modified AdaBoost and used a globally optimal active contour algorithm. The method improved computational efficiency and also refined the borders of the detected nuclei. In the feature extraction stage, Huang and Lai [4] used a dual morphological grayscale reconstruction to delineate nuclear shapes accurately. Fourteen features were extracted and an SVM-based decision-graph classifier was proposed for HCC classification. Liu et al. [5] regarded moments, Daubechies wavelets, and Gabor wavelets as three features of vital importance for the classification of cells. As for cell classification, Lorenzo-Ginori et al. [6] showed that cell classification based only on nucleus characteristics could also be effective. A combination of morphological characteristics and Haralick texture features was obtained from the gray-level co-occurrence matrix of the nucleus. A new heuristic search algorithm, Maximum Minimum Backward Selection (MMBS), was proposed in [7]. The Weighted Discernibility of Feature Subsets (WDFS) evaluation criteria were defined as the evaluation strategy of MMBS to handle unbalanced samples, which contributed to a better feature subset. The experimental results showed good classification performance on liver pathological images.
Recently, Gautam and Bhadauria [8] used four features of white blood cell nuclei and stored the maximum and minimum value of each feature for every class of white blood cell. If the feature values of a particular nucleus lay between the maximum and minimum values stored for a particular class, the segmented nucleus was assigned to that class. Qi et al. [9] extracted 128-dimensional SIFT features, called RootSIFT, from thousands of large patches densely sampled at multiple scales. PCA was applied to the RootSIFT features, and IFV encoding with prelearned GMM parameters was then applied to the PCA-reduced features for a better classification result. Xia et al. [10] defined three atypia features and provided some shape features, fractal dimension features, several gray features, and Tamura features. Using an HCC image classification model based on random forests combined with VRRF, the method showed good performance. Gallegos et al. [11] proposed an alternative method called feature subset selection (FSS). By using the set of typical testors to remove irrelevant or redundant features, FSS reduced the number of features, which decreased the cost of acquiring data and made the classification model easier to understand.
However, accurate recognition of normal and HCC nuclei remains a significant challenge for two main reasons. First, hematoxylin and eosin (H&E) pathology images vary distinctly in color, which may reduce the effectiveness of most texture features. Second, precise morphological measurement of each nucleus requires accurate nucleus segmentation as a prerequisite. To address these issues, this paper proposes two novel kinds of features for distinguishing normal from hepatocellular carcinoma (HCC) nuclei. In contrast to other methods that use features of different image channels, our goal is to classify normal and HCC nuclei based only on the binary segmentation results of the nuclei. In addition, to segment nuclei accurately, the center-proliferation segmentation (CPS) method [10] is used to segment each individual nucleus. The main contributions of this paper are as follows. First, a nucleus shape library containing normal and HCC nuclei is constructed based on manual selection by pathologists. Then, all nuclei are adjusted into a uniform standard space with the same center, area, and orientation. Dice, Jaccard, precision, and recall coefficients are calculated to measure the shape similarity among nuclei. Second, 12 boundary feature points are determined according to the same interval angle, 220

Features.
Cell image features are one of the most obvious attributes for classification. Good features can not only influence the performance of classification but also improve the accuracy. Three kinds of characteristics are presented in Table 1. Intensity features are mainly obtained by computing the pixel value of the whole image [14]. Morphology features express the spatial relative position of each pixel [15]. Texture is the most important group of features for classification [16][17][18].

Random Forest Classifier.
Random forest (RF) [19] is an ensemble prediction model composed of multiple decision trees, which can be used as an efficient and effective classification model. The principle of the classifier is to build a forest of multiple decision trees that are constructed randomly and independently of one another. For each tree, the training data are resampled by the bootstrap method over rows (samples) and columns (features). When a new sample arrives, every decision tree in the forest judges which class it belongs to, and the final prediction is generated by voting. The flowchart of RF is shown in Figure 1. RF increases the diversity among the individual classification models by constructing different training sets and has the advantages of resisting overfitting and noise. After training, every decision tree has a vote on the classification result, as shown in the following formula:

$$H(x) = \arg\max_{Y} \sum_{i=1}^{k} I\left(h_i(x) = Y\right), \tag{1}$$

where $H(x)$ indicates the final classification result, $h_i(x)$ is the classification result of the $i$th decision tree, $k$ is the number of decision trees, $Y$ is the output target, and $I(\cdot)$ is the indicator function.
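The voting rule described above can be illustrated with a minimal Python sketch. It assumes each tree has already produced a label for the sample; the function name `majority_vote` and the tie-breaking behaviour (the label seen first wins) are illustrative assumptions, not part of the original method description.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    tree_predictions: list of labels, one per decision tree.
    Ties go to the label that appears first in the list (an assumption).
    """
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list [(label, count)].
    return counts.most_common(1)[0][0]


# Five hypothetical trees vote on one nucleus patch.
votes = ["HCC", "normal", "HCC", "HCC", "normal"]
print(majority_vote(votes))  # prints HCC
```

In practice a library implementation of random forests would perform both the bootstrap training and the voting; the sketch only makes the final aggregation step explicit.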

Dataset Description and Nucleus Segmentation.
The experimental data come from a renowned hospital in Shenyang, China. One hundred and twenty-seven (127) liver pathology cases, together with the labels of the corresponding images, were provided under the supervision of professional pathologists. According to discussions with the pathologists, the experimental data were obtained as follows. First, tissue slices are acquired through paraffin embedding. Then, the slices are cut at 4 μm thickness by a microtome and stained with hematoxylin and eosin for 7.5 min. Finally, fast slide scanners are used to capture digital images at 20x magnification. Each image has a resolution of 0.35 μm/pixel. Figure 2 shows normal and HCC pathology images: Figure 2(a) is a normal image and Figure 2(b) is an HCC image. Because this paper aims to extract morphological features, namely, shape and boundary similarity features, we employ center-proliferation segmentation (CPS) to obtain the segmentation result of each nucleus. The specific steps are as follows.
Step 1. Choose a suitable threshold to obtain a coarse binary segmentation of the nuclei, and extract the connected regions from the coarse segmentation result.
Step 2. Select connected regions by their circularity, with the circularity threshold set to a value larger than 0.85.
Step 3. Locate and acquire the center of each nucleus through the selected connected region and map the centers' coordinates into the corresponding H&E pathology images.
Step 4. Acquire × pixel nucleus patches from each center to the four directions.
Following this method, the segmentation result of each nucleus is acquired. To ensure that every nucleus appears at a consistent scale and position within its image, we resize each image to 100 × 100 pixels in the experiments. Figure 3 shows examples of nucleus image patches. The first row shows the segmented patches of the H&E image and the second row shows the corresponding binary segmentation results. To alleviate the influence of color differences, our experiments use only the binary patches to extract the morphological features.
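Two of the CPS steps above, the circularity filter and the patch extraction, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the text does not give the circularity formula, so the common definition 4πA/P² is assumed; the threshold is taken as exactly 0.85; and the patch half-width of 50 pixels and the function names are hypothetical.

```python
import math


def circularity(area, perimeter):
    """Assumed definition 4*pi*A / P^2: 1.0 for a circle, lower otherwise."""
    return 4.0 * math.pi * area / (perimeter ** 2)


def keep_region(area, perimeter, threshold=0.85):
    """Keep a connected region only if it is sufficiently circular."""
    return circularity(area, perimeter) > threshold


def crop_patch(image, center, half=50):
    """Crop up to (2*half) x (2*half) pixels around center=(row, col).

    image is a list of rows; the crop is clipped at the image border.
    """
    r, c = center
    rows = image[max(r - half, 0): r + half]
    return [row[max(c - half, 0): c + half] for row in rows]


# A circle of radius 10 passes the filter; a 10 x 10 square does not.
print(keep_region(math.pi * 100, 2 * math.pi * 10))  # prints True
print(keep_region(100, 40))                          # prints False
```

A real pipeline would obtain the area, perimeter, and centroid of each connected region from an image-processing library rather than passing them in by hand.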

3.2. Methods.
The diagram of our proposed nucleus recognition framework is shown in Figure 4. It first decomposes an H&E pathology image into fixed-size patches using the CPS method, such that each patch contains one nucleus. Then, the corresponding binary masks are obtained via morphological operations. Next, two novel kinds of features, the shape similarity feature (Section 3.2.2) and the boundary similarity feature (Section 3.2.3), are extracted to construct the feature vectors. Finally, a conventional random forest (RF) is used to recognize each nucleus as normal or HCC.

Classification Steps.
The proposed classification model is shown in Figure 5 and is composed of the following steps. Note that the red rectangular frames represent the novel contributions of this paper.
Step 1. For each liver pathology image, the center-proliferation segmentation (CPS) method [10] is used to obtain the segmentation result (binary result) of each individual nucleus. For effective image processing, each segmented nucleus is placed in a patch of the same size, 100 × 100 pixels.
Step 2. Under the guidance of the pathologists, 160 segmented nuclei, comprising 80 normal nuclei and 80 HCC nuclei, are manually selected to construct the nucleus shape library. Then, a shape alignment method is employed to adjust these nuclei into a uniform standard space.
Step 3. In order to calculate shape similarity accurately, the remaining nuclei are also adjusted into the uniform standard space with the same center, area, and orientation. Dice, Jaccard, precision, and recall indexes are calculated between each nucleus and all nuclei of shape library. Based on these indexes, shape similarity features for each nucleus are formed with 640 dimensions in total.
Step 4. According to the lengths of the major axis and minor axis of each original segmented nucleus, the corresponding ellipse is obtained and we select 12 boundary points on it at the same angular interval (π/6). The initial point lies in the positive direction of the horizontal axis and the rotational direction is anticlockwise. Following these 12 points of each ellipse, 12 boundary feature points of the corresponding nucleus are determined using the minimum Euclidean distance between the nucleus boundary points and each of the 12 ellipse points. The 12 boundary feature points of a nucleus form 220 different triangles, and we use the similarity between triangles to measure boundary similarity. Note that the boundary similarity features for each nucleus have 220 dimensions in total.
Step 5. Finally, the conventional RF classifier is trained and tested on all nucleus images. To remove redundancy among the features, 10-fold cross-validation experiments are conducted; they show that feeding the selected boundary similarity features (56 dimensions) together with the Jaccard shape similarity features (80 dimensions) into the classifier achieves the best results in terms of ACC, SEN, and SPE.
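The feature dimensions quoted in the steps above follow from simple counting, which the short Python check below reproduces. The variable names are illustrative. The remark that 56 equals the number of triangles spanned by 8 points is only an observation; the steps above do not state how many boundary points were retained.

```python
from math import comb

library_size = 160   # nuclei in the shape library (80 normal + 80 HCC)
n_metrics = 4        # Dice, Jaccard, precision, recall
n_points = 12        # boundary feature points per nucleus

shape_dims = n_metrics * library_size    # 4 * 160 = 640
boundary_dims = comb(n_points, 3)        # triangles from 12 points = 220
full_dims = shape_dims + boundary_dims   # 640 + 220 = 860

# Selected subset: Jaccard features for 80 library nuclei + 56 triangles.
selected_dims = 80 + 56                  # 136

print(shape_dims, boundary_dims, full_dims, selected_dims)
# Observation only: 56 is also the number of triangles among 8 points.
print(comb(8, 3))
```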

Shape Similarity Feature.
Generally, nuclei with the same label (normal or HCC) are similar in shape, boundary, and appearance. A basic approach is to use normal and HCC nuclei to establish statistical shape models and then measure the difference from each model to classify a nucleus as normal or HCC. However, the statistical shape model for each class is unreliable. To address this issue, we select 160 nuclei, comprising 80 normal nuclei and 80 HCC nuclei, to construct a nucleus shape library under the guidance of the pathologists. Our intention in this section is to extract the shape similarity features by measuring the similarity between each nucleus and all nuclei of the shape library. Further, to reduce the influence of shape differences caused by translation, rotation, and scale, a simple registration method maps each segmented individual nucleus into a uniform standard space. A new point is located using the transformation

$$p' = sRp + T, \tag{2}$$

where $p$ and $p'$ are an original point and the corresponding transformed point, respectively, $s$ is the scaling factor, $R$ is the rotation matrix, and $T$ denotes the translation vector.
In our experiments, each nucleus needs to be adjusted to the uniform standard space defined as follows.
Translation $T$. Align each nucleus centroid to the patch center and calculate the coordinate translation vector:

$$T = (\Delta x, \Delta y) = (x_c - x_0,\; y_c - y_0), \tag{3}$$

where $(\Delta x, \Delta y)$ represents the coordinate translation vector between the patch center $(x_c, y_c)$ and each nucleus centroid $(x_0, y_0)$.
Scaling Factor $s$. Zoom or shrink all nucleus areas to approximately 1000 pixels:

$$s = \sqrt{\frac{1000}{\text{area}}}, \tag{4}$$

where $s$ is the scaling factor and area is each nucleus area represented by the number of pixels.
Rotation Matrix $R$. Rotate the principal axes of all nuclei onto the horizontal axis:

$$R = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}, \tag{5}$$

where $\theta$ denotes the angle between the principal axis of each nucleus and the positive direction of the horizontal axis. Following this method, all segmented individual nuclei are adjusted into the uniform standard space with the same center, area, and direction. Some examples illustrating this alignment method are shown in Figure 6. Figures 6(a), 6(b), and 6(c) are the segmented nuclei. Figures 6(d), 6(e), and 6(f) present the corresponding registration results.
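The three alignment steps above (translation, scaling, rotation) can be sketched on a nucleus given as a list of pixel coordinates. This is one plausible reading, not necessarily the authors' implementation: the principal-axis angle is estimated from second-order central moments, the linear scale is chosen so that the area becomes roughly 1000 pixels, and the patch center (50, 50) is assumed from the 100 × 100 patch size. The function name `align` is hypothetical, and the sketch returns real-valued coordinates without resampling them onto a pixel grid.

```python
import math


def align(points, center=(50.0, 50.0), target_area=1000.0):
    """Map nucleus pixel coordinates (x, y) into a standard space."""
    n = len(points)                      # area in pixels
    cx = sum(x for x, _ in points) / n   # centroid
    cy = sum(y for _, y in points) / n
    # Second-order central moments give the principal-axis orientation.
    mu20 = sum((x - cx) ** 2 for x, _ in points) / n
    mu02 = sum((y - cy) ** 2 for _, y in points) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    # Linear scale factor: areas scale with its square.
    s = math.sqrt(target_area / n)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    aligned = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Rotate by -theta so the principal axis becomes horizontal.
        rx = cos_t * dx + sin_t * dy
        ry = -sin_t * dx + cos_t * dy
        aligned.append((center[0] + s * rx, center[1] + s * ry))
    return aligned, theta, s


# A thin diagonal "nucleus": its principal axis is at 45 degrees.
diagonal = [(i, i) for i in range(10)]
aligned, theta, s = align(diagonal)
print(round(math.degrees(theta), 6), round(s, 6))
```

After alignment the diagonal points all lie on the horizontal line through the patch center, which is the behaviour the registration step is meant to produce.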
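Next, four commonly used similarity metrics are considered:

$$\mathrm{DI}(\mathrm{SR},\mathrm{TR}) = \frac{2\,\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}} \cap \mathrm{pixel}_{\mathrm{TR}})}{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}}) + \mathrm{Num}(\mathrm{pixel}_{\mathrm{TR}})}, \tag{6}$$

$$\mathrm{JI}(\mathrm{SR},\mathrm{TR}) = \frac{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}} \cap \mathrm{pixel}_{\mathrm{TR}})}{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}} \cup \mathrm{pixel}_{\mathrm{TR}})}, \tag{7}$$

$$\mathrm{P}(\mathrm{SR},\mathrm{TR}) = \frac{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}} \cap \mathrm{pixel}_{\mathrm{TR}})}{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}})}, \tag{8}$$

$$\mathrm{R}(\mathrm{SR},\mathrm{TR}) = \frac{\mathrm{Num}(\mathrm{pixel}_{\mathrm{SR}} \cap \mathrm{pixel}_{\mathrm{TR}})}{\mathrm{Num}(\mathrm{pixel}_{\mathrm{TR}})}, \tag{9}$$

where Num(⋅) represents the number of pixels in a nucleus region, and $\mathrm{pixel}_{\mathrm{SR}}$ and $\mathrm{pixel}_{\mathrm{TR}}$ are the pixels in the segmentation regions (SR) and ground truth regions (TR), respectively. DI(SR, TR) is the Dice index, which describes the degree of similarity directly. JI(SR, TR) denotes the Jaccard index, which measures the difference between the segmentation results and the ground truth. P(SR, TR) and R(SR, TR) represent precision and recall, respectively. Finally, the shape similarity features for each nucleus are formed by calculating (6), (7), (8), and (9) against all nuclei of the shape library. The number of features for each metric is therefore the same as the number of nuclei in the shape library. In this paper, we denote the DI, JI, P, and R shape similarity features as the DI feature, JI feature, P feature, and R feature, respectively. Each of these four kinds of shape similarity features has 160 dimensions.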
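As an illustration of how these four indexes turn into a feature vector, the Python sketch below represents each aligned binary nucleus as a set of pixel coordinates. The function names and the tiny example regions are hypothetical, and both regions are assumed to be non-empty.

```python
def shape_similarity(sr, tr):
    """Dice, Jaccard, precision, and recall between two pixel sets."""
    sr, tr = set(sr), set(tr)
    inter = len(sr & tr)
    union = len(sr | tr)
    dice = 2.0 * inter / (len(sr) + len(tr))
    jaccard = inter / union
    precision = inter / len(sr)
    recall = inter / len(tr)
    return dice, jaccard, precision, recall


def shape_features(nucleus, library):
    """Concatenate DI, JI, P, and R against every library nucleus."""
    di, ji, p, r = [], [], [], []
    for ref in library:
        d, j, pr, rc = shape_similarity(nucleus, ref)
        di.append(d)
        ji.append(j)
        p.append(pr)
        r.append(rc)
    return di + ji + p + r


# Two overlapping 2 x 2 squares sharing one column of two pixels.
sr = {(0, 0), (0, 1), (1, 0), (1, 1)}
tr = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(shape_similarity(sr, tr))
```

With a library of 160 nuclei, `shape_features` would return 4 × 160 = 640 values, matching the dimension stated for the full shape similarity feature set.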

Boundary Similarity Feature.
The boundary of an object carries a wealth of geometric information, and many classification tasks have achieved remarkable results using boundary information. One example is the average value or standard deviation of the distances between all boundary points and the center point, which can be regarded as a statistical characteristic. Another distinctive boundary feature is concavity-convexity, which has demonstrated significant superiority by measuring the monotonicity variation of the boundary curve. However, it is cumbersome to express the curve analytically because of its irregularity. To address this issue, we propose a novel boundary similarity feature using the triangles formed by the boundary points. The concrete steps are as follows.
Step 1. The Canny operator is utilized to extract the boundary for each nucleus. Note that, for the sake of accurate calculation, all nuclei used for boundary similarity feature need to be adjusted into the same center and direction using (3) and (5).
Step 2. For each nucleus, the corresponding ellipse is delineated via the major axis and minor axis of the nucleus. Twelve points on the ellipse boundary are determined at polar angles with the same interval. Hence, the coordinates of the points can be calculated as follows:

$$x = a\cos\theta, \qquad y = b\sin\theta, \tag{10}$$

where $a$ and $b$ represent the major axis and minor axis' lengths of each nucleus, respectively, $(x, y)$ are the Cartesian coordinates, and $\theta$ denotes the polar angle. We define that the initial angle's value is from 0 to $\pi/6$. The angle interval is fixed as $\pi/6$.
Step 3. Then, we calculate the Euclidean distance between each nucleus boundary point and the corresponding 12 points of the ellipse boundary. The 12 boundary feature points of each nucleus are determined according to the minimum values of the Euclidean distance.
Step 4. Based on these 12 boundary feature points for each nucleus, we can construct 220 different triangles. We propose to represent the boundary similarity features of a nucleus by measuring the angles of the 220 triangles. Specifically, given a triangle with three control points $A$, $B$, and $C$, the shape of the triangle can be represented by storing just two of its angles (e.g., $\angle A$ and $\angle B$), since the sum of the three angles of a triangle is equal to $\pi$. Finally, the boundary similarity feature BF is calculated from these angles for each triangle, so this feature has 220 dimensions.
Following this approach, the boundary similarity features for each nucleus are obtained, and two examples are shown in Figure 7. Figure 7(f) presents a triangle formed by three different control points, named $A$, $B$, and $C$.
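The geometric part of the steps above can be sketched in Python as follows. The sketch places 12 points on an origin-centred ellipse, snaps each one to the nearest nucleus boundary point, and computes two interior angles for every one of the 220 triangles. It treats `a` and `b` as semi-axis lengths and starts at angle 0, both of which are assumptions, and it assumes the selected boundary points are distinct. How the two angles of each triangle are reduced to a single BF value is not reproduced here, so the sketch stops at the angle pairs. All function names are hypothetical.

```python
import math
from itertools import combinations


def ellipse_points(a, b, n=12):
    """n points on an origin-centred ellipse, spaced by 2*pi/n."""
    step = 2.0 * math.pi / n
    return [(a * math.cos(k * step), b * math.sin(k * step))
            for k in range(n)]


def boundary_feature_points(boundary, a, b, n=12):
    """For each ellipse point, pick the nearest nucleus boundary point."""
    return [min(boundary, key=lambda p: math.dist(p, e))
            for e in ellipse_points(a, b, n)]


def triangle_angles(pa, pb, pc):
    """Interior angles at pa and pb (radians) via the law of cosines."""
    ab, bc, ca = math.dist(pa, pb), math.dist(pb, pc), math.dist(pc, pa)
    cos_a = (ab ** 2 + ca ** 2 - bc ** 2) / (2.0 * ab * ca)
    cos_b = (ab ** 2 + bc ** 2 - ca ** 2) / (2.0 * ab * bc)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    angle_a = math.acos(max(-1.0, min(1.0, cos_a)))
    angle_b = math.acos(max(-1.0, min(1.0, cos_b)))
    return angle_a, angle_b


def all_triangle_angles(points):
    """Angle pairs for every triangle formed by three of the points."""
    return [triangle_angles(*tri) for tri in combinations(points, 3)]


# Toy boundary: a circle of radius 10 sampled every degree.
boundary = [(10 * math.cos(math.radians(d)), 10 * math.sin(math.radians(d)))
            for d in range(360)]
points = boundary_feature_points(boundary, 10, 10)
print(len(points), len(all_triangle_angles(points)))  # prints 12 220
```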
So far, the complete feature set for each nucleus consists only of the shape and boundary similarity features. In our experiments, a conventional support vector machine (SVM), an extreme learning machine (ELM), and RF are used to train and test on all feature data.

Results and Comparisons
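The classification performance is measured by accuracy (ACC), sensitivity (SEN), and specificity (SPE):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP},$$

where TP and FN are the numbers of HCC image patches that are correctly classified and incorrectly classified, respectively. TN and FP are the numbers of normal image patches that are correctly classified and incorrectly classified, respectively. ACC is the overall classification accuracy. Sensitivity (SEN) indicates the proportion of HCC image patches that are correctly classified, and specificity (SPE) indicates the proportion of normal image patches that are correctly classified.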
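The three measures follow directly from the four counts defined above. The Python sketch below computes them; the function name and the example counts are made up purely for illustration and are not results from the experiments.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (ACC, SEN, SPE) from confusion-matrix counts.

    tp/fn: HCC patches classified correctly/incorrectly.
    tn/fp: normal patches classified correctly/incorrectly.
    """
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return acc, sen, spe


# Illustrative counts only: 100 HCC patches and 100 normal patches.
acc, sen, spe = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(acc, sen, spe)  # prints 0.85 0.9 0.8
```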

4.3. Results.
Performance results of the shape and boundary similarity features are presented in this section. Note that we test our method 10 times, each time randomly selecting the training and testing data according to Table 2. Figure 8 shows the average ACC, SEN, and SPE for HCC versus normal classification using all shape and boundary similarity features (860 dimensions for each nucleus) with three types of classifier: RF, SVM, and ELM. These features perform best with the RF classifier. Further, each of the five features (DI, JI, P, R, and BF) is used to train and test the RF classifier. Figure 9 presents the corresponding average classification ACC, SEN, and SPE, and shows that the BF feature achieves the best effect. In addition, to find the best feature combination for accurate classification, we also combine different shape and boundary similarity features to train and test the RF classifier. Figures 10, 11, and 12 show the corresponding ACC, SEN, and SPE results. In conclusion, Figures 8, 9, 10, 11, and 12 demonstrate that JI + BF with the RF classifier performs much better than the other features or feature combinations.

Feature Selection Results.
Generally, the number of nuclei in the shape library and the number of boundary feature points are determined by experience. However, this may generate some redundancy and reduce the classification accuracy. To address this issue, 10-fold cross-validation and grid search are adopted in our experiments to select the number of nuclei in the shape library and the number of boundary feature points. We first test the performance of each nucleus of the shape library separately. The ACC results of each nucleus, ranked from high to low, are shown in Figure 13. Next, we take the first 30 nuclei as the new shape library and successively add 10 nuclei at a time, training and testing at each step. The corresponding average ACC results for different numbers of nuclei in the shape library are presented in Figure 14. The shape library composed of the first 80 nuclei is the most beneficial for HCC nucleus recognition. For the boundary similarity feature, the 12 boundary feature points constitute 220 different triangles, and we also test the performance of each triangle separately. According to the individual test results, the importance of the 12 boundary feature points, ranked from high to low, is determined through the number of occurrences of each feature point in the triangles. Figure 15 shows the average ACC results for the different boundary similarity features.
For a fair comparison, the methods of [10], [12], and [13] and our method are all performed on our data. The intention of [10] is to propose some atypia features (auxiliary circularity, amendment circularity, and cell symmetry) and a voting ranking random forest to establish the classification model. The core of [12] is to use the concave-convex variations of nucleus boundaries to improve the performance of each classifier. Finally, the intention of [13] is to propose a simple but robust local descriptor without any quantization for local patch representation.
Figure 16 presents the average ACC, SEN, and SPE results using 10-fold cross-validation. Note that the comparison experiments use the optimal 136-dimension feature set, and our method is slightly superior to [10,12,13].
4.6. Shape Library Selection. As previously mentioned, the nucleus shape library is composed of 160 nuclei, including 80 normal nuclei and 80 HCC nuclei, which are selected by the pathologists. To further reduce manual work, we randomly select 10 groups of nuclei as shape libraries, and the remaining nucleus patches are used to examine the effect of the different shape libraries. As before, each group contains 80 normal nuclei and 80 HCC nuclei. The results for the different shape libraries are presented in Table 3. An interesting finding is that random selection of the nucleus shape library has little influence on the final classification results.
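The random library selection described above can be sketched in Python as follows. The identifiers, the pool of 500 patches per class, and the use of the group index as the random seed are assumptions made for illustration; only the 80 + 80 composition and the 10 groups come from the text.

```python
import random


def draw_shape_library(normal_ids, hcc_ids, per_class=80, seed=None):
    """Randomly pick per_class normal and HCC nuclei as a shape library.

    Returns (library, remaining), where remaining holds every patch
    that was not chosen and can be used for training and testing.
    """
    rng = random.Random(seed)
    library = rng.sample(normal_ids, per_class) + rng.sample(hcc_ids, per_class)
    chosen = set(library)
    remaining = [i for i in normal_ids + hcc_ids if i not in chosen]
    return library, remaining


# Hypothetical patch identifiers: 500 normal and 500 HCC nuclei.
normal_ids = [f"n{i}" for i in range(500)]
hcc_ids = [f"h{i}" for i in range(500)]

# Ten independent random libraries, one per seed.
groups = [draw_shape_library(normal_ids, hcc_ids, seed=g) for g in range(10)]
print(len(groups), len(groups[0][0]), len(groups[0][1]))  # prints 10 160 840
```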

Discussions
For our shape similarity feature, different similarity measures are used to establish the feature set. According to Figure 9, the JI feature, which is based on the Jaccard index, achieves more effective classification results than the other similarity measurements. This shape similarity feature measures the similarity and difference between each nucleus and all nuclei of the shape library. The shape registration method adjusts all nuclei to the same center, scale, and direction, which yields more accurate shape similarity features. In addition, Table 3 presents the effects of different shape libraries. Random selection of the shape library has little influence on the final classification results, so this feature can be considered robust. Regarding our boundary similarity feature, Figure 9 shows that the BF feature achieves relatively ideal results. The BF feature is calculated via the similarity of triangles constructed from boundary feature points, and the boundary feature points are determined by the corresponding ellipse template. Unlike other similarity measurements of curves, this calculation method is simple and effective. In addition, Figure 10 demonstrates that the JI + BF feature combination with the RF classifier performs best, and this paper treats JI + BF as the optimal feature combination. Further, we determine the optimal number of JI + BF features using 10-fold cross-validation (see Figures 13, 14, and 15). Finally, Figure 16 shows that our proposed method outperforms other related methods in terms of ACC, SEN, and SPE, demonstrating the performance superiority of our two proposed morphological features.

Conclusions
In this paper, we propose two novel kinds of features, shape similarity and boundary similarity, for normal and hepatocellular carcinoma (HCC) nucleus recognition. First, the shape similarity feature is extracted by calculating the Jaccard index between each nucleus and all nuclei of the shape library. Then, the boundary similarity feature is computed through the similarity of triangles constructed from boundary feature points. Next, the combined JI and BF features (136 dimensions) are used as the feature set for image patches. Finally, the conventional RF classifier is used to obtain the best classification results. Experiments with 9720 patches demonstrate that our proposed morphological features (JI + BF) with the RF classifier are beneficial and robust, achieving satisfactory results in terms of ACC, SPE, and SEN.

Conflicts of Interest
The authors declare that they have no conflicts of interest.