An Overlapping Cell Image Synthesis Method for Imbalance Data

DNA ploidy analysis of cells is an automated technique applied in pathological diagnosis. It is important for this technique to classify various nuclei images accurately. However, the lack of overlapping nuclei images in the training data (imbalanced training data) results in low recognition rates for overlapping nuclei images. To solve this problem, a new method that synthesizes overlapping nuclei images from single-nucleus images is proposed. Firstly, sample selection is employed to make the synthesized samples representative. Secondly, random functions are used to control the rotation angles of the nuclei and the distance between their centroids, increasing the sample diversity. Then, the Beer-Lambert law is applied to reassign the pixels of overlapping parts, making the synthesized samples close to the real ones. Finally, all synthesized samples are added to the training sets for classifier training. The experimental results show that images synthesized by this method can solve the data set imbalance problem and improve the recognition rate of DNA ploidy analysis systems.


Introduction
In recent years, cervical cancer with its incidence rate rising year by year has become a social problem which threatens women's lives. According to a survey report released by the World Health Organization in 2012, cervical cancer is the second largest killer of women in less developed areas [1,2]. Cervical cancers can be detected at an early stage, and early diagnosis and early treatment are effective ways to deal with this problem. Currently, cervical smear is the most popular method for the screening of cervical cancer. In this method, human cervical exfoliated cells were first collected from patients and DNA contained in cells was stained. Then the stained specimen was placed under a microscope and observed by experienced pathologists to make a diagnosis. However, with the outbreak of cancers, this technique cannot meet the demand of practical applications. On the one hand, it requires great amounts of manpower and material resources; on the other hand, it often causes errors because of subjectivity and visual fatigue of pathologists. Therefore, automatic screening techniques become more and more important.
As an automatic screening technique, DNA ploidy analysis has developed rapidly in recent years [3]. In this technique, cell specimens are first collected from patients and the DNA contained in the cells is stained. Next, the specimens are placed under a microscope and images of the nuclei are taken using a high-resolution camera. Then, the nuclear images are classified and recognized by machine learning methods. Finally, the relative content of DNA in the cells is measured and abnormal cells are identified to provide information for diagnosis. It is important for DNA ploidy analysis to analyze overlapping nuclei, where cancer cells are often found. However, it is difficult to collect enough overlapping nuclei images to learn a good classifier, because samples of overlapping nuclei are few. As a result, the number of overlapping nuclei images is far smaller than that of the other categories, which leads to an imbalanced training data set [4,5].
Most classifiers learned from imbalanced data perform poorly on samples from the classes with few training data, because the samples from the minority classes are overwhelmed by those from the majority classes. Many methods have been proposed to solve this problem, and they can be divided into two main categories. The first category works at the data level and includes resampling [6,7] and feature selection [8]; the second works at the algorithm level and includes cost-sensitive learning [9] and single-class learning [10]. Resampling includes undersampling (removing samples from the majority class) and upsampling (creating new samples for the minority class). The best-known upsampling method is the synthetic minority oversampling technique (SMOTE) [11], which interpolates among existing minority class examples to generate new minority class samples. However, traditional SMOTE generates new samples blindly and therefore cannot solve the imbalance problem. Many improvements have been made by follow-up researchers, for example, SDSMOTE [12], GASMOTE [13], ECO-Ensemble [14], and WK-SMOTE [15]. At the algorithm level, cost-sensitive learning methods consider the costs associated with misclassifying samples, for example, the cost-sensitive adaboost algorithm [16] and AdaCost [17]. Ensemble learning methods combine the strengths of individual learners and handle the class imbalance problem at both the individual and the ensemble level, for example, the boosting algorithm [18,19] and the bagging algorithm [20,21]. In addition, researchers have combined resampling with learning algorithms to deal with class-imbalanced data sets, as in PcBoost [22], the CSFSG algorithm [23], the HSDD method [24], and the GADBSM method [25]. There are also methods that use active learning to solve the class imbalance problem, such as the Bayesian active learning method [26] and the KA-SVM method [27].
However, these methods can only learn from the existing samples, but cannot obtain class information beyond what is contained in the existing samples. For the imbalance data in cell classification, on the one hand, overlapping cells are formed by single cells; on the other hand, we can collect a large number of single-cell images easily. If we can simulate the process of generating overlapping images in the image data domain, we can generate enough overlapping cell images close to real images for feature extraction and model training. Therefore, we present a new method to synthesize overlapping nucleus images by using single-cell images.
In this paper, we present a new method to synthesize overlapping nucleus images by making use of the prior knowledge of forming overlapping nucleus images. This method first selects two-cell images and then synthesizes new overlapping nucleus images after rotation and segmentation. In order to make the synthesized cells as close as possible to the real ones, we consider three aspects in the proposed method. To ensure that synthesized cells are representative, we select typical single-cell images as source images. In order to avoid the excessive accumulation of the synthesized data, we introduce randomness for the rotation angle and the overlapping length for cells. To make the overlapping parts close to real ones, we reconstruct the pixels of the overlapping parts according to the Beer-Lambert Law [28,29]. Experimental results show that after adding synthesized overlapping cell images to minority categories, the accuracy is improved on the three classifiers, including the multilayer perceptron (MLP, also called artificial neural network) [30], support vector machine (SVM) [31], and Gaussian mixture model (GMM) [32]. The proposed method also outperforms four typical methods (undersampling [33], upsampling [11], adaboost [34], and randomForest [35]) which are popular in solving the imbalance problem.

The Methods
As large numbers of single-cell images are available, we can synthesize two-cell images from two single-cell images; likewise, three-cell images can be synthesized from a two-cell image and a single-cell image. In general, we can always synthesize an (i + j)-cell image from an i-cell image and a j-cell image.
The procedure of image synthesis is shown in Figure 1. In the selection module, representative samples are chosen to avoid redundancy. Then each of the two selected images is rotated by a random angle. Next, the two cell images are segmented and the cell background is removed. Finally, the two segmented parts are overlapped to form an overlapping image, with the pixels of the overlapped part reconstructed according to the Beer-Lambert Law.
An example of the synthesizing procedure is shown in Figure 2. In order to obtain a 4-cell image, a single-cell image and a 3-cell image are chosen. After rotation, segmentation, and contour extraction, the two cell parts are overlapped to yield a new overlapping cell image.

Randomness Introduction.
Randomness is employed to ensure the diversity of the generated overlapping cells. Firstly, rotation angles are randomly generated. Then, the overlapping length is randomly produced within an expected range. A uniform random number is generated by the linear congruential method [36]. The basic recursive formula is presented as (1):

$$x_n = (\alpha x_{n-1} + c) \bmod M, \qquad \lambda_n = x_n / M, \qquad n = 1, 2, \ldots \tag{1}$$

where $x_0$ is the initial value, $\alpha$ is the multiplier, $c$ is the increment, and $M$ is the modulus. They are all nonnegative integers.
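The recursion in (1) can be sketched in a few lines of Python. This is only an illustration: the default constants `a`, `c`, and `m` and the helper `random_angle` are assumptions chosen for the example, since the text does not state which multiplier, increment, and modulus were used or how the random number is mapped to an angle.

```python
def lcg_uniform(seed, count, a=1103515245, c=12345, m=2**31):
    """Return `count` pseudo-random numbers in [0, 1) from a linear congruential generator.

    The default constants are illustrative only (assumed, not taken from the paper).
    """
    x = seed
    values = []
    for _ in range(count):
        x = (a * x + c) % m   # x_n = (alpha * x_{n-1} + c) mod M
        values.append(x / m)  # lambda_n = x_n / M
    return values


def random_angle(lam):
    """Map a uniform number in [0, 1) to a rotation angle in [0, 360) degrees (assumed mapping)."""
    return 360.0 * lam


# Small hand-checkable example with alpha = 5, c = 3, M = 16, x_0 = 7.
print(lcg_uniform(7, 3, a=5, c=3, m=16))  # [0.375, 0.0625, 0.5]
```

Because every value is $x_n / M$ with $0 \le x_n < M$, the output always lies in [0, 1), so it can be rescaled to any expected range.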

Image Selection.
Image selection [37] is aimed at ensuring that the selected images are representative. One feasible method is to prevent similar images from being used more than once. When selecting cell images to generate new overlapping images, representative samples which accurately reflect the larger entity should be chosen. In order to make the synthesized samples more representative, sample selection of cell images is necessary. Algorithm 1 is used for image selection. In Algorithm 1, n is the feature dimension of the cell image, P is the initial sample set, and Q is the sample set after selection. T is the threshold value, which is the mean distance between two samples in P, and $d_i$ denotes the Euclidean distances between the samples. $feature_\alpha$ denotes the feature vector of sample $\alpha$.
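Since the listing of Algorithm 1 is not reproduced here, the following Python sketch shows one plausible reading of the description, not the authors' exact procedure. It assumes a greedy rule: a sample from P is added to Q only if its Euclidean distance to every sample already in Q is at least T, with T defaulting to the mean pairwise distance in P. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal dimension n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_pairwise_distance(samples):
    """Mean distance over all unordered pairs of samples (needs at least 2 samples)."""
    pairs = list(combinations(samples, 2))
    return sum(euclidean(u, v) for u, v in pairs) / len(pairs)


def select_representatives(P, T=None):
    """Greedy selection (assumed rule): keep a sample only if it is at least T
    away from every sample already kept, so near-duplicates are used once."""
    if T is None:
        T = mean_pairwise_distance(P)
    Q = []
    for candidate in P:
        if all(euclidean(candidate, kept) >= T for kept in Q):
            Q.append(candidate)
    return Q


# Two tight pairs of feature vectors: one sample from each pair survives.
P = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1)]
print(select_representatives(P))  # [(0.0, 0.0), (10.0, 0.0)]
```

A greedy pass depends on the order of P; shuffling P first would be one way to avoid always favoring the earliest samples.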
Image Rotation and Segmentation.
Image rotation refers to rotating an image with its centroid as the center point. Given two cell images, different overlapping cell images are generated when different rotation angles are used, so the synthesized overlapping cell images cover more conditions and are more diverse. The original images for synthesis contain a background, which should be removed before synthesizing. In this paper, the threshold segmentation method is used to locate the cell area. In this method, pixels whose gray value is less than a threshold belong to the nucleus region; otherwise, the pixels belong to the background region. The segmentation formula is presented as (2), where $T$ is the segmentation threshold, $f(x, y)$ is a gray value in the image, and $F(x, y)$ is the corresponding gray value after segmentation. The valley point of the histogram is set as the initial threshold. After image segmentation, nucleus contours are obtained. The cell region is extracted by removing the background of the cell image. This process is shown in Figure 3.
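The threshold rule described above can be illustrated with a short Python sketch on a gray image stored as nested lists. The background value of 255 (white) is an assumption made for the example; the text only states that pixels at or above the threshold belong to the background. Both function names are hypothetical.

```python
BACKGROUND = 255  # assumed background gray value (white); not specified in the text


def segment_nucleus(image, threshold, background=BACKGROUND):
    """Keep pixels darker than the threshold and overwrite all others with the background."""
    return [[g if g < threshold else background for g in row] for row in image]


def nucleus_mask(image, threshold):
    """Boolean mask that is True where a pixel belongs to the nucleus region."""
    return [[g < threshold for g in row] for row in image]


image = [
    [200, 90],
    [80, 210],
]
print(segment_nucleus(image, 128))  # [[255, 90], [80, 255]]
print(nucleus_mask(image, 128))     # [[False, True], [True, False]]
```

In practice the threshold would be initialized from the valley point of the gray-level histogram, as stated in the text; the fixed value 128 here is used only to keep the example small.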

The Random Overlapping Length.
The overlapping cells have a common area. We use the overlapping length $d$ to describe the degree of overlapping. When one nuclear region is tangent to the second nuclear region (as shown in Figure 4(a)), the value of $d$ is zero. Here, the overlapping length is subject to $0 \le d \le \tfrac{1}{2} R_{\min}$, where $R_{\min}$ refers to the minimum width of the two nuclear regions. The overlapping length of the two black rectangles (as shown in Figure 4(b)) is a random value generated by (1).
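As a small illustration of the constraint on the overlapping length, the sketch below maps a uniform number in [0, 1) to the interval from 0 to half the smaller width. The linear scaling and the function name are assumptions; the text only says that the value is generated by (1) within the stated range.

```python
def overlap_length(lam, width_a, width_b):
    """Scale a uniform number lam in [0, 1) to an overlapping length d
    with 0 <= d <= R_min / 2, where R_min is the smaller of the two widths."""
    r_min = min(width_a, width_b)
    return lam * r_min / 2.0


# Two nuclear regions that are 40 and 60 pixels wide: d lies in [0, 20).
print(overlap_length(0.5, 40, 60))  # 10.0
```

With `lam = 0` the two regions are tangent, which matches the case shown for d = 0.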

Pixel Reconstruction of Overlapping Regions.
The nonoverlapping regions of cells remain unchanged after the overlapping operation. However, the overlapping region is too dark, which is not in accordance with real cell images. Therefore, it is necessary to reconstruct the gray values of the overlapping region. First of all, we need to locate the overlapping areas. The specific steps are as follows:
(1) Find the smallest horizontal coordinate x and vertical coordinate y and the largest horizontal coordinate X and vertical coordinate Y among the coordinates of all points in the cell area. The points (x, y) and (X, Y) are the upper-left and lower-right corners of the minimum bounding rectangle of the nuclear region. The minimum bounding rectangle of the other nuclear region is obtained in the same way (the two black rectangles in Figure 5), and the two rectangles intersect at points a and b (as shown in Figure 5).
(2) The length of ab with 2 points added is taken as the width, and the height of the rectangle is taken as the height; with these, a new searching area is constructed (the red rectangle in Figure 5).
(3) Every pixel in the nonwhite part of the searching area is traversed. If a point is within the first contour and at the same time within the second one, it is marked as a point that needs to be reconstructed.
(4) All pixels that need to be reconstructed are collected, and together they form the reconstruction pixel set.
(5) Each pixel in the reconstruction pixel set is given a new value via (7).
Since the two pictures are operated on in one background image, the positions in the background and in the source image require a coordinate transformation. As shown in Figure 6, the background is rectangle B and the source image is rectangle A. (X, Y) is the position of point P in B, (a, b) is the position of point P in B, and (x, y) is the position of point P in A. The formula used for position transformation is presented as (3). With (3), the coordinates of a point in the overlapping region are obtained in the source image, and then the corresponding pixel values can be obtained.
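Steps (1) to (4) can be sketched as follows. This is a simplified illustration rather than the exact procedure: each nuclear region is represented as a set of (x, y) pixel coordinates in the background image, the searching area is taken to be the intersection of the two minimum bounding rectangles (an assumption that replaces the construction from points a and b), and `to_source_coords` assumes that (a, b) is the upper-left corner of the source image inside the background, which is one possible reading of the coordinate transformation. All function names are hypothetical.

```python
def bounding_box(pixels):
    """Minimum bounding rectangle (x_min, y_min, x_max, y_max) of a pixel set."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def search_area(box_a, box_b):
    """Intersection of two bounding rectangles, or None if they do not meet."""
    left, top = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    right, bottom = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    if left > right or top > bottom:
        return None
    return left, top, right, bottom


def reconstruction_set(region_a, region_b):
    """Pixels of the searching area that lie inside both nuclear regions."""
    area = search_area(bounding_box(region_a), bounding_box(region_b))
    if area is None:
        return set()
    left, top, right, bottom = area
    return {
        (x, y)
        for x in range(left, right + 1)
        for y in range(top, bottom + 1)
        if (x, y) in region_a and (x, y) in region_b
    }


def to_source_coords(X, Y, a, b):
    """Background position (X, Y) -> source-image position, assuming the
    source image's upper-left corner sits at (a, b) in the background."""
    return X - a, Y - b


# Two 5 x 5 square regions that share a 2 x 3 block of pixels.
region_a = {(x, y) for x in range(0, 5) for y in range(0, 5)}
region_b = {(x, y) for x in range(3, 8) for y in range(2, 7)}
print(len(reconstruction_set(region_a, region_b)))  # 6
```

Restricting the traversal to the searching area avoids testing every pixel of the background image against both contours.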
According to the Beer-Lambert Law [28,29], we can infer the pixel gray value in the overlapping areas. Gray values cannot be added directly in overlapping cell images. However, since the absorbance represents the amount of cellular material, the absorbance of the overlapping part can be superimposed. Therefore, the gray value of a point is first converted to optical density, the optical densities are accumulated, and the accumulated optical density is finally converted back to a gray value. For the two overlapping cells, the relationship between gray values and optical density can be modeled as follows:

$$A_1 = \lg \frac{I_0}{I_1} \tag{4}$$

$$A_2 = \lg \frac{I_0}{I_2} \tag{5}$$

where $I_0$ is the average gray value of the background ($I_0$ is the threshold), $I_1$ denotes a gray value in the first cell, and $I_2$ denotes a gray value in the second cell. $A_1$ and $A_2$ are the corresponding optical density values. When two points in the two cells overlap, the optical density satisfies the following additive relation:

$$A = A_1 + A_2 = \lg \frac{I_0}{I_s} \tag{6}$$

where $A$ is the new optical density at the overlapping point, and $I_s$ is the new gray value at the overlapping point. According to (6), the new gray value can be computed with

$$I_s = \frac{I_1 I_2}{I_0} \tag{7}$$

As shown in Figure 7, the synthesized overlapping region is darker than the real overlapping region before reconstruction; after reconstruction, the overlapping region looks more natural.
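The conversion from gray value to optical density and back can be sketched as below. The sketch assumes the standard definition of optical density, A = log10(I0 / I), which makes the optical densities of two overlapping points additive; the clamping of gray values to at least 1 is an added safeguard against taking the logarithm of zero and is not part of the description above.

```python
import math


def optical_density(gray, background):
    """Optical density A = log10(I0 / I); gray is clamped to >= 1 (assumed safeguard)."""
    return math.log10(background / max(gray, 1))


def overlap_gray(i1, i2, i0):
    """New gray value of an overlapping pixel: add the two optical densities,
    then convert the sum back to a gray value relative to the background i0."""
    a_total = optical_density(i1, i0) + optical_density(i2, i0)
    return i0 / (10 ** a_total)


# Background 200, both nuclei have gray value 100 at the overlapping point.
value = overlap_gray(100, 100, 200)
print(round(value, 6))  # 50.0
```

For positive gray values the result equals i1 * i2 / i0, so a pixel that is background in one image (gray value equal to i0) keeps the gray value of the other image.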

Experiments.
A DNA ploidy analysis system is mainly used for the identification and analysis of diseased cells and cancer cells. In order to obtain real data, the samples were collected by the staff of the Heilongjiang Maria Maternity Hospital. The cell samples were collected from 300 patients. The cells of each patient were smeared on a slide and then Feulgen stained. After that, the slide was placed under a microscope and the microscope automatically took cell images. Then, the DNA ploidy analysis system segmented the cell images into single-cell images or overlapping cell images. Experiments are performed by adding synthesized cell data to the minority classes (i.e., classes 3, 4, 5, and 8) to make the data more balanced. In the experiments, the synthesized data are added to the training set gradually so that it becomes more and more balanced. Three popular classifiers, that is, the multilayer perceptron (MLP), support vector machine (SVM), and Gaussian mixture model (GMM), are chosen to evaluate the proposed method. The classifiers are trained with the new training sets of different sizes and their performance is compared. In the neural network training, the number of hidden nodes is 100, and the number of iterations is 200. The minimum error in training is set as 0.1. The number of transformation characteristics is 5. The random seed value used to initialize the multilayer perceptron is 20. In the SVM classifier, the number of transformation parameters is 80. The kernel type is rbf, and the mode of the classifier is one-versus-one. In the GMM classifier, the preprocessing type is normalization, and the preprocessing parameter (the number of transformation characteristics) is 100. The seed value of the random number generator is 42.
The selected features include 20 morphologic features [38] and 8 texture features [39]. They are essential for distinguishing the 8 types of cell images in classification.
The 20 morphologic features are used to describe the shape and size of cells, including area, circularity, distance, sigma, sides, roundness, convexity, $I_a$ (centroid coordinate on the x axis), $I_b$ (centroid coordinate on the y axis), $M_{11}$, $M_{02}$, $M_{20}$, compactness, ContLength, diameter, radius, rectangularity, anisotropy, bulkiness, and StructureFactor [38]. The 8 texture features consist of contrast, energy, homogeneity, correlation, entropy, anisotropy, mean, and deviation [39]. Some typical morphologic features can be defined by (8), (9), (10), (11), (12), and (13), and two typical texture features, that is, the mean and the deviation, can be expressed by (14) and (15).
In these equations, $g_0$ represents the mean gray value of the pixels in the cell area, $g(x, y)$ is the gray value of pixel (x, y) in the cell area, and Num is the number of pixels. For each cell image, the 28 features extracted for classification are shown in Table 1.
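The two texture features named above, the mean and the deviation of the gray values, can be sketched as follows. Because the defining equations are not reproduced here, the deviation is assumed to be the population standard deviation over the Num pixels of the cell area; the function names are hypothetical.

```python
import math


def gray_mean(pixels):
    """Mean gray value g_0 over the Num pixels of the cell area."""
    return sum(pixels) / len(pixels)


def gray_deviation(pixels):
    """Deviation of the gray values around g_0 (assumed: population standard deviation)."""
    g0 = gray_mean(pixels)
    return math.sqrt(sum((g - g0) ** 2 for g in pixels) / len(pixels))


cell_pixels = [2, 4, 4, 4, 5, 5, 7, 9]
print(gray_mean(cell_pixels))       # 5.0
print(gray_deviation(cell_pixels))  # 2.0
```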

Evaluation Criteria.
For multiple class problems, we suppose that the classes have been labelled $C_0, C_1, C_2, \ldots, C_k$ ($k > 2$), where the order of the labels does not reflect any intrinsic order of the classes. The results of classification are assessed according to the confusion matrix shown in Table 2. The total accuracy is computed via (16), the recall rate of each class via (17), and the G-mean via (18).
In these equations, $C_i$ represents the label of class i, $n_{ii}$ is the number of samples from class i that are predicted to be class i, and k is more than 2.
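The three criteria can be computed from a confusion matrix as sketched below. The sketch uses the commonly used definitions, which are assumed to match (16) to (18): accuracy is the sum of the diagonal entries divided by the total number of samples, the recall of class i is $n_{ii}$ divided by the sum of row i, and the G-mean is the geometric mean of the per-class recalls. Rows are taken to be true classes and columns predicted classes.

```python
import math


def accuracy(cm):
    """Total accuracy: correctly classified samples divided by all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def recalls(cm):
    """Recall of each class: n_ii divided by the number of samples in row i."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def g_mean(cm):
    """Geometric mean of the per-class recalls (requires Python 3.8+ for math.prod)."""
    r = recalls(cm)
    return math.prod(r) ** (1.0 / len(r))


# Rows are true classes, columns are predicted classes.
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 5, 5],
]
print(round(accuracy(cm), 4))  # 0.7333
print(recalls(cm))             # [0.8, 0.9, 0.5]
print(round(g_mean(cm), 4))    # 0.7114
```

Because the G-mean multiplies the recalls, a single poorly recognized minority class pulls it down sharply even when the total accuracy stays high, which is why both values are reported.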

Results.
The number of training images in each class is shown in Table 3. The synthesized cell images are added to the training data to make them more balanced. The accuracy of the three classifiers is compared in Table 4, which lists the conditions, the imbalance ratio (the ratio of the number of samples of the largest class to that of the smallest class), the accuracy rate, and the G-mean, and compares the results of inadequate training with those of full training with synthesized cells. The entries in the table are sorted in descending order of imbalance ratio; that is, the training data gradually become balanced. As shown in Table 4, when the imbalance ratio is 100, the original data without synthesized samples are used for training, and the accuracy rates obtained by the three classifiers are the lowest. As the imbalance ratio decreases, the accuracy increases. When the imbalance ratio is 1, that is, when the sample numbers of all classes are the same, the three classifiers achieve their best performance: compared with condition 1, the accuracy rates are increased by 8.29%, 8.97%, and 14.34%, respectively, and the G-mean reaches 0.8292, 0.7931, and 0.7484, respectively. Figure 8 shows how the accuracy rate and the G-mean change with the proportion of minority-class samples.
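The imbalance ratio defined above is simple to compute. The short sketch below applies the definition to the class counts of the first two conditions listed in Table 3; the function name is hypothetical.

```python
def imbalance_ratio(class_counts):
    """Ratio of the number of samples in the largest class to that in the smallest class."""
    return max(class_counts) / min(class_counts)


# Class counts for classes 1 to 8 under conditions 1 and 2 of Table 3.
condition_1 = [20000, 20000, 200, 200, 200, 20000, 20000, 200]
condition_2 = [20000, 20000, 500, 500, 500, 20000, 20000, 500]
print(imbalance_ratio(condition_1))  # 100.0
print(imbalance_ratio(condition_2))  # 40.0
```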

Comparison with Other Methods.
Four methods, namely, the proposed method, upsampling [11], undersampling [33], and the adaboost method [34], are compared. The proposed method can be treated as an upsampling method which simulates the process of generating overlapping images in the image data domain. In the upsampling method, new feature vectors are generated in feature space with SMOTE [11]. In the undersampling method, the training data are divided into clusters. Then, in view of the ratio of majority class samples to minority class samples, representative majority class samples are selected from each cluster. Adaboost is an iterative algorithm which places different weights on the training distribution in each iteration. After each iteration, the weights associated with the incorrectly classified examples are increased and the weights associated with the correctly classified examples are decreased. This forces the learner to focus more on the incorrectly classified examples in the next iteration.
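The reweighting step of adaboost can be illustrated with one round of the classical binary (discrete) AdaBoost update. This is a textbook sketch and is not claimed to be the multiclass variant or the implementation used in the experiments; the function name is hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """One round of discrete AdaBoost reweighting.

    weights: current sample weights; correct: True where the weak learner was right.
    Returns the learner weight alpha and the normalized updated sample weights.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples are scaled up, correctly classified ones scaled down.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print([round(w, 4) for w in new_weights])  # [0.1667, 0.1667, 0.1667, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.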
The proposed method, the undersampling method, and the upsampling method use the MLP classifier, while the adaboost method uses the adaboost algorithm. In the adaboost classifier, the number of iterations is 50 and the learning rate is 1.0. The confusion matrix shows the relationship between the predicted results and the original cell classes. The confusion matrices of the 4 methods are compared in Figure 9.
As can be seen from Figure 9, in the proposed method, three epithelial cells (class 4) and four or more epithelial cells (class 5) have the lowest accuracy rates, 62.2% and 66.3%, respectively. In comparison, the accuracy rates of class 4 and class 5 are only 40.3% and 52.1% in the undersampling method, 43.9% and 55.1% in the upsampling method, and 53.3% and 76.4% in the adaboost method. The images of class 4 and class 5 are difficult to classify because of the diverse overlapping situations and numbers of overlapping cells. In conclusion, the proposed method achieves the best performance, while the adaboost method gets the worst performance.
According to the literature, good performance can be obtained when a resampling method is combined with a learning algorithm. Therefore, we choose the randomForest algorithm [35] to train models. randomForest is an ensemble learning method which fits a number of decision tree classifiers on various subsamples of the data set and uses averaging to improve the predictive accuracy and control overfitting. We combine the upsampling method with the randomForest method, the proposed method with the adaboost method, and the proposed method with the randomForest method. In the randomForest classifier, the number of iterations is 60, the maximum depth of each tree is 3, the minimum number of samples per leaf is 20, and the maximum features is "sqrt." As can be seen from Figure 10, the combinations of two methods have a higher accuracy than a single method. In the randomForest method, the accuracy rate of class 3 is only 10.5%, which is extremely abnormal, and the accuracy rate of class 8 just reaches 50%, which is relatively low compared with the other 6 classes. However, in the upsampling + randomForest method, the accuracy rate of class 3 is 95.8% and that of class 8 is 78.8%. According to the confusion matrix of the proposed + adaboost method, the adaboost method is not suited to the balanced data generated by the proposed method. Finally, in the proposed + randomForest method, the accuracy rate of each class is good, and the lowest accuracy among the 8 classes is 80.3%. Therefore, the proposed + randomForest method achieves the best performance among the 4 hybrid methods.

Table 2: Confusion matrix for multiple class classification problems.

                Predicted classes
True classes    C_1     C_2     …     C_k
C_1             n_11    n_12    …     n_1k
C_2             n_21    n_22    …     n_2k
…               …       …       …     …
C_k             n_k1    n_k2    …     n_kk
Even though the confusion matrix can indicate the accuracy rate of each type of cells in detail, it cannot directly show the overall correctness, G-mean, and so on. Figure 11 shows the results of all 8 methods.
As can be seen from Figure 11, the accuracy of the randomForest method is the highest, but its G-mean is far below its accuracy. It is obvious that the randomForest method is not suitable for dealing with imbalanced data: it pays more attention to the samples of the majority classes and ignores the samples of the minority classes. In contrast, the proposed method effectively solves the imbalance problem. As for the proposed + randomForest method, its accuracy is close to its G-mean, and both are higher than those of the other methods except for the randomForest method. The accuracy and G-mean of the proposed method are lower than those of the proposed + randomForest method. In the proposed + adaboost method, the accuracy is high but the G-mean is relatively low, so it also performs worse on the imbalanced data. In summary, judging by all the evaluation criteria, the proposed + randomForest method achieves the best performance. In fact, a classifier can perform better when the parameters of the learning algorithm are adjusted. Table 5 shows the range of results when the parameters are varied. The data used in the adaboost and randomForest experiments are synthesized by the proposed method.

Table 3: The number of cell images with different conditions.

Conditions   Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Class 7   Class 8   Imbalance ratio
1            20,000    20,000    200       200       200       20,000    20,000    200       100.0
2            20,000    20,000    500       500       500       20,000    20,000    500       40.0
As can be seen from Table 5, in the adaboost algorithm, the accuracy and G-mean tend to decrease as the iteration number increases. When the learning rate is 0.4, the accuracy is the highest. However, the algorithm can cause more errors when the learning rate is low. When the iteration number is 80 and the learning rate is 0.8, the classifier performs best. The accuracy and G-mean of the randomForest method show an upward tendency as the iteration number increases from 10 to 60. The accuracy decreases when the iteration number increases from 60 to 80. When the iteration number is 100, the randomForest algorithm performs best, because the values of the accuracy and G-mean are both relatively high.

Conclusion
In conclusion, we proposed a new method to synthesize overlapping cell images to solve the imbalanced data problem. This method simulates the generation of overlapping cells by making use of prior knowledge. In this method, representative images are first chosen, and then the images are rotated randomly. After that, the two segmented cell parts are overlapped, and finally the overlapping parts are reconstructed. Sample selection and randomness are introduced to make the synthesized images more representative. The new images are added to the training samples for model training. Experiments show that the proposed method greatly improves the accuracy of cell classification. The accuracy is improved from 75.58% to 83.93% and the G-mean is improved from 0.7280 to 0.8292. When we combine the synthesis method with the randomForest algorithm, the accuracy reaches around 89.7% and the G-mean reaches about 0.895. With the proposed method, a large number of images can be generated. It is an interesting topic to select synthesized samples according to the performance of classification. In the future, we will focus on selecting representative synthesized samples with the active learning method.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.