DNA ploidy analysis of cells is an automated technique applied in pathological diagnosis. It is important for this technique to classify various nucleus images accurately. However, the lack of overlapping-nuclei images in the training data (imbalanced training data) results in low recognition rates for overlapping-nuclei images. To solve this problem, a new method is proposed which synthesizes overlapping-nuclei images from single-nucleus images. Firstly, sample selection is employed to make the synthesized samples representative. Secondly, random functions are used to control the rotation angles of the nuclei and the distance between their centroids, increasing sample diversity. Then, the Beer-Lambert law is applied to reassign the pixels of the overlapping parts, making the synthesized samples very close to real ones. Finally, all synthesized samples are added to the training sets for classifier training. The experimental results show that images synthesized by this method can solve the data set imbalance problem and improve the recognition rate of DNA ploidy analysis systems.
In recent years, the incidence rate of cervical cancer has risen year by year, making it a social problem that threatens women's lives. According to a survey report released by the World Health Organization in 2012, cervical cancer is the second largest killer of women in less developed areas [
As an automatic screening technique, DNA ploidy analysis has developed rapidly in recent years [
Most classifiers trained with imbalanced data perform poorly when classifying samples from classes with few training samples: the samples of the minority classes are overwhelmed by those of the majority classes. Many methods have been proposed to solve this problem, and they can be divided into two main categories. The first category works at the data level, including resampling [
However, these methods can only learn from the existing samples and cannot obtain class information beyond what those samples contain. For the imbalanced data in cell classification, on the one hand, overlapping cells are formed from single cells; on the other hand, a large number of single-cell images can be collected easily. If we can simulate the process of generating overlapping images in the image domain, we can generate enough overlapping-cell images close to real ones for feature extraction and model training. Therefore, we present a new method to synthesize overlapping-nucleus images from single-cell images.
In this paper, we present a new method to synthesize overlapping-nucleus images by exploiting prior knowledge of how overlapping nuclei form. The method first selects two single-cell images and then synthesizes a new overlapping-nucleus image after rotation and segmentation. To make the synthesized cells as close as possible to real ones, we consider three aspects. To ensure that the synthesized cells are representative, we select typical single-cell images as source images. To avoid excessive clustering of the synthesized data, we introduce randomness into the rotation angle and the overlapping length of the cells. To make the overlapping parts close to real ones, we reconstruct the pixels of the overlapping parts according to the Beer-Lambert Law [
As large numbers of single-cell images are available, we can synthesize two-cell images from two single-cell images; similarly, three-cell images can be synthesized from a two-cell image and a single-cell image. In this way, we can always synthesize a (
The procedure of image synthesis is shown in Figure
Synthesis scheme.
The synthesizing procedure is shown in Figure
The synthesizing procedure
Randomness is employed to ensure the diversity of the generated overlapping cells. Firstly, rotation angles are randomly generated. Then, the overlapping length is randomly produced within an expected range. A uniform random number is generated by the linear congruential method [
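The randomness step can be sketched as follows. The paper does not give the generator's constants, so the glibc-style parameters and the angle/overlap ranges below are illustrative assumptions.

```python
class LCG:
    """Linear congruential generator: x_{n+1} = (a*x_n + c) mod m."""

    def __init__(self, seed=20, a=1103515245, c=12345, m=2**31):
        self.state, self.a, self.c, self.m = seed, a, c, m

    def uniform(self, low=0.0, high=1.0):
        """Advance the state and map it to a uniform value in [low, high)."""
        self.state = (self.a * self.state + self.c) % self.m
        return low + (high - low) * self.state / self.m

rng = LCG(seed=20)
angle = rng.uniform(0.0, 360.0)    # random rotation angle in degrees
overlap = rng.uniform(0.2, 0.8)    # overlap length as a fraction of the nucleus size (assumed range)
```

A fixed seed makes the synthesized data set reproducible while still spreading the samples over the whole range.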
Image selection [
Image rotation refers to rotating an image about its centroid. Given two cell images, different overlapping-cell images are generated when different rotation angles are used, so the synthesized overlapping-cell images can cover more conditions and ensure diversity.
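A minimal sketch of centroid-centered rotation, assuming grayscale images with a white (255) background and nearest-neighbour resampling; the paper does not specify its interpolation scheme.

```python
import numpy as np

def rotate_about_centroid(img, angle_deg, fill=255):
    """Rotate a grayscale image about its intensity centroid.

    Sketch with nearest-neighbour resampling; a white background is assumed,
    so darker pixels carry more "mass" when computing the centroid.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    mass = 255.0 - img                               # nucleus pixels dominate
    cy = (ys * mass).sum() / mass.sum()
    cx = (xs * mass).sum() / mass.sum()
    t = np.deg2rad(angle_deg)
    # inverse mapping: for each output pixel, find where it came from
    sy = cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)
    sx = cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)
    out = np.full_like(img, fill)
    ok = (sy >= 0) & (sy <= h - 1) & (sx >= 0) & (sx <= w - 1)
    out[ok] = img[np.round(sy[ok]).astype(int), np.round(sx[ok]).astype(int)]
    return out

img = np.full((8, 8), 255, dtype=np.uint8)
img[2:6, 2:6] = 50                                   # a dark square stands in for a nucleus
rotated = rotate_about_centroid(img, 45.0)
```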
The original images used for synthesis contain a background, which should be removed before synthesizing. In this paper, the threshold segmentation method is used to locate the cell area: pixels whose gray value is less than a threshold are assigned to the nucleus region, and all other pixels are assigned to the background region. The segmentation formula is presented as (
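The thresholding step can be sketched as follows; the threshold value of 128 is illustrative, as the paper does not state the value it uses.

```python
import numpy as np

def segment_nucleus(gray, threshold=128):
    """Threshold segmentation sketch: pixels darker than the threshold are
    labelled nucleus (1), the rest background (0)."""
    return (gray < threshold).astype(np.uint8)

img = np.array([[200, 90], [40, 210]], dtype=np.uint8)
mask = segment_nucleus(img)   # → [[0, 1], [1, 0]]
```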
After image segmentation, nucleus contours are obtained. The cell region is extracted by removing the background of the cell image. This process is shown in Figure
The process of removing background.
The overlapping cells have a common area. We use overlapping length to describe the degree of overlapping. When one nuclear region is tangent to the second nuclear region (as shown in Figure
The random overlapping results.
The nonoverlapping regions of the cells remain unchanged after the overlapping operation. However, the overlapping region is too dark, which does not accord with real cell images. Therefore, the gray values of the overlapping region must be reconstructed. First of all, we need to locate the overlapping area. The specific steps are as follows:
First, the smallest region covering the two nucleus contours is found to limit the searching area. Every pixel of the nonwhite part within the searching area is then traversed: if a pixel lies within the first contour and at the same time within the second one, it is marked as a pixel that needs to be reconstructed. All such pixels together form the reconstruction pixel set. Each pixel in this set is then given a new value via (
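The search for the reconstruction pixel set can be sketched with boolean masks; `mask_a` and `mask_b` are assumed to be binary nucleus masks already placed on the shared background canvas (the bounding-region optimization is omitted for brevity).

```python
import numpy as np

def overlap_pixels(mask_a, mask_b):
    """Return the coordinates of pixels lying inside BOTH nucleus regions,
    i.e., the reconstruction pixel set."""
    both = mask_a.astype(bool) & mask_b.astype(bool)
    ys, xs = np.nonzero(both)
    return list(zip(ys.tolist(), xs.tolist()))

a = np.array([[1, 1, 0], [1, 1, 0]])
b = np.array([[0, 1, 1], [0, 1, 1]])
pts = overlap_pixels(a, b)    # → [(0, 1), (1, 1)]
```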
Nuclei overlapping region.
Since the two source images are placed on one background image, positions in the background and in a source image are related by a coordinate transformation. As shown in Figure
Coordinate conversion.
With (
According to the Beer-Lambert Law [
When two points of the two cells overlap, their optical densities satisfy the additive relation OD = OD1 + OD2.
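Under the Beer-Lambert law, optical density is OD = -log10(g / g0) for background intensity g0, and additivity of the two densities gives the reconstructed gray value g = g1 * g2 / g0. A numerical sketch, assuming a white background of 255:

```python
import numpy as np

def reconstruct_overlap(g1, g2, g0=255.0):
    """Reassign the gray value of an overlapping pixel.

    OD = -log10(g / g0); additivity OD = OD1 + OD2 implies g = g1 * g2 / g0.
    A white background (g0 = 255) is assumed here.
    """
    od = -np.log10(g1 / g0) - np.log10(g2 / g0)
    return g0 * 10.0 ** (-od)

new_gray = reconstruct_overlap(200.0, 150.0)   # ≈ 117.6, darker than either source pixel
```

The overlap thus comes out darker than both source pixels, but not as dark as the naive approach of keeping the minimum or summing the darkening of both nuclei.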
As shown in Figure
Comparison of constructing overlapping parts.
A DNA ploidy analysis system is mainly used for the identification and analysis of diseased and cancerous cells. To obtain real data, samples were collected by the staff of the Heilongjiang Maria Maternity Hospital from 300 patients. The cells of each patient were smeared on a slide and Feulgen stained. The slide was then placed under a microscope, which automatically captured cell images. The DNA ploidy analysis system segmented these images into single-cell or overlapping-cell images. Finally, cell pathology doctors manually classified each cell image into one of 8 categories, namely, single typical epithelial cell, single atypical epithelial cell, two epithelial cells, three epithelial cells, four or more epithelial cells, single lymphocyte, single centriole, and two or more centrioles.

These cell images form a typical imbalanced data set. The numbers of single-cell images in classes 1, 2, 6, and 7 are very large, while those of the overlapping classes 3, 4, 5, and 8 are very small. Our task is to synthesize overlapping samples for classes 3, 4, 5, and 8 from single-cell images, so we first select representative samples from classes 1, 2, 6, and 7 as source images.

The original data in the experiment are extremely unbalanced. To show the influence of the imbalance on the accuracy rate, the number of testing samples is fixed at 2000 in each class. Experiments are performed by gradually adding synthesized cell data to the minority classes (i.e., classes 3, 4, 5, and 8) to make the training set more and more balanced.
Three popular classifiers, namely, the multilayer perceptron (MLP), the support vector machine (SVM), and the Gaussian mixture model (GMM), are chosen to evaluate the proposed method. The classifiers are trained with the new training sets of different sizes, and their performance is compared. In the neural network, the number of hidden nodes is 100 and the number of iterations is 200. The minimum training error is set to 0.1, the number of transformation characteristics is 5, and the random seed used to initialize the multilayer perceptron is 20. In the SVM classifier, the number of transformation parameters is 80, the kernel type is RBF, and the classifier mode is one-versus-one. In the Gaussian mixture classifier, the preprocessing type is normalization, the preprocessing parameter (the number of transformation characteristics) is 100, and the seed of the random generator is 42.
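The three classifier configurations can be sketched with scikit-learn; the exact toolkit used in the paper is not stated, and the synthetic two-class data below only stands in for the 28-dimensional cell features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# stand-in for the 28-dimensional cell features: two well-separated classes
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 28)), rng.normal(3, 1, (50, 28))])
y = np.array([0] * 50 + [1] * 50)

# 100 hidden nodes, 200 iterations, seed 20, matching the settings above
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=20).fit(X, y)
# RBF kernel, one-versus-one decision mode
svm = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
# GMM fitted with one component per class, seed 42
gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
```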
Based on the features of cell images, 45-dimensional features are first extracted and then 28-dimensional features are selected for classification. The selected features include 20 morphologic features [
For each cell image, the 28 features extracted for classification are shown in Table
Features of each class of cell images.
Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
IOD | 89.610977 | 71.606436 | 180.15888 | 272.85886 | 470.77552 | 76.777616 | 86.473958 | 940.99801 |
Area | 837 | 447 | 1341 | 1705 | 1805 | 289 | 299 | 2910 |
Circularity | 0.7921 | 0.480011 | 0.461493 | 0.382338 | 0.479377 | 0.920065 | 0.654806 | 0.510143 |
Roundness | 0.903653 | 0.727129 | 0.68578 | 0.640621 | 0.732710 | 0.944292 | 0.812886 | 0.722639 |
Radius | 18.718845 | 17.566048 | 30.722118 | 37.70215 | 33.625638 | 10.377715 | 12.087826 | 41.920397 |
Deviation | 0.056678 | 0.096081 | 0.080454 | 0.090358 | 0.173582 | 0.159820 | 0.180099 | 0.18441 |
Mean | 0.642317 | 0.57354 | 0.6061 | 0.573193 | 0.482879 | 0.470737 | 0.457682 | 0.430258 |
Sigma | 1.475920 | 3.135578 | 6.373434 | 8.491296 | 6.312004 | 0.474985 | 1.634735 | 8.020041 |
Contrast | 1.593787 | 6.369128 | 2.674124 | 2.777126 | 9.373961 | 16.356402 | 15.341137 | 12.716838 |
Convexity | 0.974389 | 0.959227 | 0.90303 | 0.830088 | 0.827602 | 0.969799 | 0.934375 | 0.812395 |
Bulkiness | 1.002132 | 1.001983 | 1.051246 | 1.223468 | 1.161857 | 1.000329 | 1.045995 | 1.132826 |
StructureFactor | 0.283446 | 1.125497 | 1.343892 | 1.689330 | 0.973999 | 0.049896 | 0.463667 | 0.815918 |
 | 71551.46 | 33796.029 | 335417.10 | 622132.67 | 511789.63 | 6978.0203 | 10412.977 | 1223692.3 |
 | 43622.82 | 7510.4405 | 67471.275 | 128759.63 | 177297.55 | 6334.6855 | 5318.0129 | 476218.75 |
 | −0.018070 | −0.054505 | −0.073719 | 0.084856 | 0.049427 | 0.001029 | 0.007442 | −0.039142 |
 | 0.073786 | 0.140186 | 0.122783 | 0.129873 | 0.119612 | 0.083408 | 0.060474 | 0.120763 |
 | 0.073786 | 0.066543 | 0.101258 | 0.128430 | 0.091892 | 0.075985 | 0.115486 | 0.079980 |
Energy | 0.023546 | 0.008648 | 0.011273 | 0.011569 | 0.003225 | 0.004424 | 0.005000 | 0.002741 |
Correlation | 0.941807 | 0.920241 | 0.950388 | 0.959477 | 0.962070 | 0.925055 | 0.943140 | 0.954548 |
Homogeneity | 0.601926 | 0.313502 | 0.511144 | 0.515849 | 0.315213 | 0.239775 | 0.322485 | 0.306934 |
Entropy | 5.706993 | 6.248901 | 6.224919 | 6.415742 | 7.111077 | 6.711657 | 6.795570 | 7.259227 |
Anisotropy | −0.525847 | −0.510334 | −0.537583 | −0.514221 | −0.514942 | −0.490626 | −0.499718 | −0.499788 |
Compactness | 1.108596 | 1.311802 | 1.633726 | 1.912168 | 1.913056 | 1.028384 | 1.198749 | 2.272140 |
ContLength | 107.9827 | 85.840620 | 165.92388 | 202.40916 | 208.30865 | 61.112698 | 67.112698 | 288.24978 |
Diameter | 36.359318 | 34.132096 | 60.440053 | 74.404301 | 64.412732 | 19.646883 | 23.021729 | 82.800966 |
Rectangularity | 0.801250 | 0.800937 | 0.793462 | 0.646409 | 0.707792 | 0.804348 | 0.771812 | 0.678742 |
Distance | 15.318871 | 11.491077 | 20.283375 | 23.627692 | 23.614855 | 8.526281 | 8.736569 | 28.915494 |
Sides | 4.261805 | 2.606224 | 2.438174 | 2.288309 | 2.631792 | 5.520584 | 3.114689 | 2.586205 |
NumRuns | 32.000000 | 32.000000 | 52.000000 | 62.000000 | 67.000000 | 20.000000 | 18.000000 | 91.000000 |
MeanLength | 26.156250 | 13.968750 | 25.788462 | 27.500000 | 26.940299 | 14.450000 | 16.611111 | 31.978022 |
For multiple class problems, we suppose that the classes have been labelled
Confusion matrix for multiple class classification problems.
Actual \ Predicted | Class 1 | … | Class k
---|---|---|---
Class 1 | n_11 | … | n_1k
⋮ | ⋮ | ⋱ | ⋮
Class k | n_k1 | … | n_kk
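From such a confusion matrix, the overall accuracy and the G-mean reported below can be computed. The G-mean is taken here as the geometric mean of the per-class recalls, a common choice for imbalanced problems, which is assumed to match the paper's definition.

```python
import numpy as np

def accuracy_and_gmean(cm):
    """Overall accuracy and G-mean from a k x k confusion matrix
    (rows = actual classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    acc = np.trace(cm) / cm.sum()
    recalls = np.diag(cm) / cm.sum(axis=1)          # per-class recall
    gmean = recalls.prod() ** (1.0 / len(recalls))  # geometric mean of recalls
    return acc, gmean

acc, g = accuracy_and_gmean([[90, 10], [30, 70]])   # acc = 0.8, g ≈ 0.794
```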
The image number for training in each class is shown in Table
The number of cell images with different conditions.
Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Imbalance ratio | Condition
---|---|---|---|---|---|---|---|---|---
20,000 | 20,000 | 200 | 200 | 200 | 20,000 | 20,000 | 200 | 100.0 | 1 |
20,000 | 20,000 | 500 | 500 | 500 | 20,000 | 20,000 | 500 | 40.0 | 2 |
20,000 | 20,000 | 1000 | 1000 | 1000 | 20,000 | 20,000 | 1000 | 20.0 | 3 |
20,000 | 20,000 | 2000 | 2000 | 2000 | 20,000 | 20,000 | 2000 | 10.0 | 4 |
20,000 | 20,000 | 4000 | 4000 | 4000 | 20,000 | 20,000 | 4000 | 5.0 | 5 |
20,000 | 20,000 | 6000 | 6000 | 6000 | 20,000 | 20,000 | 6000 | 3.3 | 6 |
20,000 | 20,000 | 8000 | 8000 | 8000 | 20,000 | 20,000 | 8000 | 2.5 | 7 |
20,000 | 20,000 | 10,000 | 10,000 | 10,000 | 20,000 | 20,000 | 10,000 | 2.0 | 8 |
20,000 | 20,000 | 12,000 | 12,000 | 12,000 | 20,000 | 20,000 | 12,000 | 1.7 | 9 |
20,000 | 20,000 | 14,000 | 14,000 | 14,000 | 20,000 | 20,000 | 14,000 | 1.4 | 10 |
20,000 | 20,000 | 16,000 | 16,000 | 16,000 | 20,000 | 20,000 | 16,000 | 1.3 | 11 |
20,000 | 20,000 | 18,000 | 18,000 | 18,000 | 20,000 | 20,000 | 18,000 | 1.1 | 12 |
20,000 | 20,000 | 20,000 | 20,000 | 20,000 | 20,000 | 20,000 | 20,000 | 1.0 | 13 |
The results with the range of conditions.
Condition | Imbalance ratio | MLP accuracy (%) | SVM accuracy (%) | GMM accuracy (%) | MLP G-mean | SVM G-mean | GMM G-mean
---|---|---|---|---|---|---|---
1 | 100.0 | 75.58 | 71.68 | 62.05 | 0.7280 | 0.6932 | 0.5486 |
2 | 40.0 | 77.24 | 74.02 | 64.29 | 0.7496 | 0.7021 | 0.5776 |
3 | 20.0 | 79.73 | 75.18 | 65.01 | 0.7799 | 0.7318 | 0.5841 |
4 | 10.0 | 80.77 | 77.61 | 69.73 | 0.7912 | 0.7598 | 0.6570 |
5 | 5.0 | 81.49 | 78.57 | 74.15 | 0.7994 | 0.7696 | 0.7229 |
6 | 3.3 | 82.33 | 79.64 | 74.58 | 0.8102 | 0.7820 | 0.7272 |
7 | 2.5 | 82.47 | 79.92 | 75.28 | 0.8106 | 0.7845 | 0.7363 |
8 | 2.0 | 82.30 | 79.93 | 74.79 | 0.8086 | 0.7846 | 0.7314 |
9 | 1.7 | 82.43 | 80.70 | 75.69 | 0.8097 | 0.7816 | 0.7404 |
10 | 1.4 | 82.88 | 80.31 | 75.49 | 0.8167 | 0.7903 | 0.7379 |
11 | 1.3 | 83.38 | 81.03 | 75.98 | 0.8225 | 0.7979 | 0.7427 |
12 | 1.1 | 83.93 | 80.47 | 76.33 | 0.8290 | 0.7913 | 0.7470 |
13 | 1.0 | 83.87 | 80.65 | 76.39 | 0.8292 | 0.7931 | 0.7484 |
As shown in Table
The results over the range of minority-class distribution ratios.
Four methods, namely, the proposed method, upsampling [
The proposed method, the undersampling method, and the upsampling method use the MLP classifier, while the adaboost method uses the adaboost algorithm. In the adaboost classifier, the number of iterations is 50 and the learning rate is 1.0. The confusion matrix shows the relationship between the predicted results and the original cell classes. The classification precision of the 4 methods, assessed with confusion matrices, is shown in Figure
The comparison of the confusion matrix by 4 single methods.
As can be seen from Figure
According to the literature, good performance can be obtained when a resampling method is combined with a learning algorithm. Therefore, we choose the randomForest algorithm [
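Such a hybrid of resampling and ensemble learning can be sketched as follows, using naive random oversampling and scikit-learn's RandomForestClassifier as stand-ins; the synthetic data and the 10:1 imbalance ratio are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, (200, 28))   # majority class
X_min = rng.normal(3, 1, (20, 28))    # minority class (10:1 imbalance)

# naive random oversampling: draw minority samples with replacement until balanced
idx = rng.integers(0, len(X_min), size=len(X_maj))
X = np.vstack([X_maj, X_min[idx]])
y = np.array([0] * len(X_maj) + [1] * len(X_maj))

# random forest trained on the rebalanced set
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
```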
As can be seen from Figure
The comparison of the confusion matrix by 4 hybrid methods.
Even though the confusion matrix indicates the accuracy rate of each type of cell in detail, it cannot directly show the overall accuracy, G-mean, and so on. Figure
The comparison of 8 methods (ada means the adaboost method, rf refers to the randomForest method).
As can be seen from Figure
In fact, the classifier can perform better by adjusting the parameters of the learning algorithm. Table
The results with varying classifier parameters.
Adaboost iterations | Learning rate | Accuracy | G-mean | RandomForest iterations | Accuracy | G-mean
---|---|---|---|---|---|---
50 | 1 | 0.6327 | 0.4080 | 10 | 0.8016 | 0.7885 |
80 | 1 | 0.6485 | 0.5595 | 30 | 0.8142 | 0.8022 |
100 | 1 | 0.6354 | 0.5609 | 40 | 0.8977 | 0.8951
120 | 1 | 0.5902 | 0.5255 | 50 | 0.8935 | 0.8914 |
150 | 1 | 0.5849 | 0.4602 | 60 | 0.8973 | 0.8948 |
80 | 0.8 | 0.7045 | 0.6241 | 80 | 0.8928 | 0.8894 |
80 | 0.7 | 0.7406 | 0.6214 | 100 | 0.8940 | 0.8916 |
80 | 0.6 | 0.7232 | 0.5777 | 120 | 0.8249 | 0.8157 |
80 | 0.5 | 0.7682 | 0.6260 | 150 | 0.8949 | 0.8923 |
80 | 0.4 | 0.7767 | 0.6601 | 200 | 0.8937 | 0.8910 |
As can be seen from Table
In conclusion, we proposed a new method that synthesizes overlapping-cell images to solve the imbalanced data problem. The method simulates the generation of overlapping cells by making use of prior knowledge: representative images are first chosen, the images are rotated randomly, two segmented cell parts are overlapped, and finally the overlapping parts are reconstructed. Sample selection and randomness are introduced to make the synthesized images more representative and diverse. The new images are added to the training samples for model training. Experiments show that the proposed method greatly improves the accuracy of cell classification: the accuracy is improved from 75.58% to 83.93% and the G-mean from 0.7280 to 0.8292. When the synthesis method is combined with the randomForest algorithm, the accuracy reaches around 89.7% and the G-mean about 0.895. With the proposed method, a large number of images can be generated, so selecting synthesized samples according to classification performance is an interesting topic. In the future, we will focus on selecting representative synthesized samples with active learning methods.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.