Sparse Contribution Feature Selection and Classifiers Optimized by Concave-Convex Variation for HCC Image Recognition

Accurate classification of hepatocellular carcinoma (HCC) images is of great importance in pathological diagnosis and treatment. This paper proposes a concave-convex variation (CCV) method to optimize three classifiers (random forest, support vector machine, and extreme learning machine) for more accurate HCC image classification. First, in the preprocessing stage, hematoxylin-eosin (H&E) pathological images are enhanced using a bilateral filter, and each HCC image patch is obtained under the guidance of pathologists. Then, after extracting the complete features of each patch, a new sparse contribution (SC) feature selection model is established to select the beneficial features for each classifier. Finally, a concave-convex variation method is developed to improve the performance of the classifiers. Experiments using 1260 HCC image patches demonstrate that the proposed CCV classifiers improve greatly on each original classifier and that CCV-random forest (CCV-RF) performs best for HCC image recognition.


Introduction
Liver cancer is a malignant tumor with high morbidity and remains a long-term challenge. Research demonstrates that early diagnosis can greatly reduce the incidence of cancer. Computer-aided diagnosis is important for early diagnosis and can largely improve the accuracy of cancer diagnosis. With the rapid development of machine learning, many algorithms are available for image classification, such as k-nearest neighbor (kNN), support vector machine (SVM), extreme learning machine (ELM), random forests (RF), and artificial neural network (ANN). These methods achieve considerable results for pathological image classification.
Recently, different kinds of classification models have been proposed for pathological image classification. The paper [1] proposed a novel voting ranking random forests method for HCC image classification. A k-nearest neighbor classifier was proposed in [2] based on an aggregated distance function, which combined several features (surface description, LBP, heraldic features, and SIFT). The paper [3] presented a two-phase hyperspectral image classification method combining Elastic Net Regression and spectral spatial bilateral filtering. The paper [4] created a complete, fully automatic, and efficient clinical decision support system for breast cancer malignancy grading, which made use of both image processing and machine learning techniques to analyze biopsy slides. The paper [5] used 4 different types of support vector machine classifiers (linear SVM, quadratic SVM, cubic SVM, and Gaussian SVM) based on 6 features extracted from raw gait data. A new structural statistical matrix, a variant of the gray level size zone matrix texture descriptor, was presented in [6]. The paper [7] proposed two nuclear- and L2,1-norm regularized 2D neighborhood preserving projection methods for extracting representative 2D image features. The paper [8] put forward feature importance in nonlinear embedding, an extension of the PCA-based feature scoring method to KPCA, as well as several NLDR algorithms that can be cast as variants of KPCA.
However, accurate recognition of cells in pathological images based on hand-crafted features with a classifier model remains a challenging task because of two main issues. One is that redundant features not only influence classification results but also increase computing cost; the other is how to optimize classifiers in a new form for more accurate results.
Table 1 (feature groups).
Morphology: area, perimeter, diameter, area overlap ratio, center of mass, minor axis, major axis, smoothness, symmetry, and so on.
Texture: gray level cooccurrence matrix, local binary pattern, scale-invariant feature transform, Tamura, fractal, Markov random field, wavelets, Haar-like features, Gabor, run-length, and so on.
To solve these two problems, this paper proposes a concave-convex variation (CCV) method to optimize three classifiers (random forest, support vector machine, and extreme learning machine) for more accurate HCC image classification. The main contributions of this paper are as follows.
First, to remove features with redundancy and low contribution value, a new sparse contribution (SC) feature selection method is proposed. Second, the CCV model computes weights according to the geometrical characteristics of the cell nucleus and optimizes the three classifiers using these weights.
The remainder of the paper is organized as follows. In Section 2, we introduce the basic background knowledge related to this paper. The steps of our proposed model are described in Section 3. Sections 4 and 5 explain the mechanisms of feature selection and CCV, respectively. Section 6 presents our experimental results and analysis. Finally, conclusions are summarized in Section 7.

Features.
Cell image features directly influence classifier performance and are generally divided into three groups: intensity, morphology, and texture. The three groups of features are shown in Table 1. Intensity features are mainly obtained by computing the pixel values of the whole image [9]. In the diagnosis of carcinomas, pathologists perform a semiquantitative analysis of a small set of morphological features to determine the cancer's histologic grade [10]. Texture is the most important group of features for classification, and there are many applications of texture analysis to image classification [11][12][13].

Classifiers
(1) Support Vector Machine. Support vector machine (SVM) is based on statistical learning theory and seeks a hyperplane that separates two classes. Note that the intention of this paper is to optimize classifiers for recognizing normal and HCC images. Therefore, we use the original binary SVM to verify our proposed method, and a parallel hyperplane is used for SVM. In the two-class case, the training data are $(x_i, y_i)$, $i = 1, 2, \ldots, l$, $x_i \in \mathbb{R}^n$, $y_i \in \{\pm 1\}$. The hyperplane is described as $(w \cdot x) + b = 0$, which satisfies the constraint
$$y_i[(w \cdot x_i) + b] \geq 1, \quad i = 1, 2, \ldots, l.$$
The Lagrange function is used to solve the constrained optimization problem:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i \left\{ y_i[(w \cdot x_i) + b] - 1 \right\}.$$
The optimal solution is $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_l^*)^T$, where $\alpha$ is the Lagrange multiplier. Then, the optimal value $w^*$ and bias $b^*$ are computed as
$$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^{l} y_i \alpha_i^* (x_i \cdot x_j),$$
where $j \in \{ j \mid \alpha_j^* > 0 \}$. The optimal hyperplane is $(w^* \cdot x) + b^* = 0$ and the classification function is
$$f(x) = \operatorname{sgn}\{(w^* \cdot x) + b^*\}.$$
(2) Random Forest. Random forest (RF) is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes, or the mean prediction, of the individual trees. It is also a powerful machine learning classifier that has been described as relatively unknown in land remote sensing and not thoroughly evaluated by the remote sensing community compared to more conventional pattern recognition techniques [14]. The flow of RF is shown in Figure 1. First, samples of equal size are drawn from the original training set using the bootstrap method. Then, a decision tree model is built for every sample to obtain the result of each decision tree. The final result is obtained by voting over all decision trees.
RF increases the diversity among classification models by constructing different training sets. After training, there is a sequence of classification models $\{h_1(x), h_2(x), \ldots, h_k(x)\}$. The final classification result is generated by simple voting over this sequence. The decision is
$$H(x) = \arg\max_{Y} \sum_{i=1}^{k} I(h_i(x) = Y),$$
where $H(x)$ is the combined classification model, $h_i$ is a single decision tree, $Y$ is the output variable, and $I(\cdot)$ is the indicator function.
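The simple voting rule can be illustrated with a minimal sketch. The function name `majority_vote` and the tie-breaking behavior (the label seen first wins) are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    tree_predictions holds one label per decision tree h_i(x).
    Ties go to the label encountered first (an arbitrary, assumed choice).
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    # most_common sorts by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]


votes = ["HCC", "normal", "HCC", "HCC", "normal"]
print(majority_vote(votes))
```

With three of five trees voting "HCC", the ensemble output for this sample is "HCC".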
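(3) Extreme Learning Machine. Extreme learning machine (ELM) trains a single-hidden-layer feedforward network (SLFN). The output equations of the SLFN can be expressed in matrix form,
$$H\beta = T,$$
where $H$ is the output of the hidden layer nodes, $\beta$ is the output weight, and $T$ is the expected output. For training the SLFN, $\hat{w}_i$, $\hat{b}_i$, and $\hat{\beta}$ are sought such that
$$\left\| H(\hat{w}_1, \ldots, \hat{w}_L, \hat{b}_1, \ldots, \hat{b}_L)\hat{\beta} - T \right\| = \min_{w_i, b_i, \beta} \left\| H(w_1, \ldots, w_L, b_1, \ldots, b_L)\beta - T \right\|,$$
where $i = 1, 2, \ldots, L$ and $L$ denotes the number of hidden nodes. After randomly generating the hidden node parameters, training an SLFN is equivalent to finding a least-squares solution of $H\beta = T$, and $\beta$ is determined by
$$\hat{\beta} = H^{\dagger} T,$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.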

Hematoxylin and Eosin Pathological Image.
Hematoxylin and eosin (H&E) staining is ubiquitous in pathology practice and research. It facilitates pathologist interpretation of microscopic slides by enhancing the contrast between cell nuclei and other histological structures, which allows pathologists to visually identify cellular components, extracellular structures, and lumen with relative ease [18]. This paper therefore uses H&E pathological images as the experimental data set, and each pathological image was selected by pathologists.

Classification Model and Steps
As shown in Figure 2, our proposed classification model is divided into two phases, training and testing.
In the training stage, the whole slide images (WSI) are first preprocessed using bilateral filtering. Then, normal and HCC cells are selected from the WSI under the guidance of a pathologist, and single cell images with a size of 64 × 64 are obtained by cell segmentation. Next, three groups of features (intensity, morphology, and texture) are extracted from the gray-scale images. A sparse contribution (SC) feature selection model is developed to select features. Subsequently, three classifiers are trained using the selected features. Meanwhile, the concave-convex variation (CCV) model is established by calculating the CCV of each binary image; this model is used to optimize the three original classifiers during testing.
During the testing stage, the testing images are similarly preprocessed with bilateral filtering. Next, gray-scale and binary images of the segmented cells are obtained. Features are extracted from the testing images and selected in the same way as in the training stage. After that, initial classification results are obtained from the trained original classifiers. Finally, as the contribution of this paper, a more accurate result is calculated using the classifier optimized with the CCV model.

Feature Extraction.
Feature extraction is an important step of image classification. As described in Section 2.1, this paper extracts three groups of features. The intensity features are computed from the gray value of every pixel of the whole image, so this group represents the global characteristics of each image. In addition, this paper regards density and nucleocytoplasmic ratio as morphology features. The group that contributes most in this paper is texture, and the specific steps for the texture features are explained in the following.
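(1) Gray Level Cooccurrence Matrix. Gray Level Cooccurrence Matrix (GLCM) was one of the earliest techniques used for image texture analysis [19]. Several typical parameters of GLCM are used to represent image texture. Angular second moment (ASM) reflects the uniformity of the image gray distribution and the textural detail; contrast (CON) reflects the contrast between image pixel values; entropy (ENT) reflects the complexity and heterogeneity of the image gray distribution; correlation (COR) reflects the local gray correlation in the image; inverse different moment (IDM) reflects the homogeneity of the image texture. The five parameters are computed as
$$\mathrm{ASM} = \sum_{i}\sum_{j} P(i,j)^2, \qquad \mathrm{CON} = \sum_{i}\sum_{j} (i-j)^2 P(i,j), \qquad \mathrm{ENT} = -\sum_{i}\sum_{j} P(i,j)\log P(i,j),$$
$$\mathrm{COR} = \frac{\sum_{i}\sum_{j} i\,j\,P(i,j) - \mu_x \mu_y}{\sigma_x \sigma_y}, \qquad \mathrm{IDM} = \sum_{i}\sum_{j} \frac{P(i,j)}{1 + (i-j)^2},$$
where $P(i,j)$ is the element in the $i$th row and $j$th column of the GLCM, and $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ are the means and standard deviations of its row and column marginal distributions.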
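As an illustration of how such GLCM parameters can be computed, the following is a minimal sketch using the standard textbook definitions of ASM, CON, ENT, and IDM (COR is omitted for brevity). The single pixel offset, the non-symmetric matrix, the natural logarithm, and the function names `glcm` and `glcm_features` are assumptions for this example and are not confirmed by the paper.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalised co-occurrence matrix for pixel pairs offset by (dx, dy).

    image is a 2-D list of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[v / total for v in row] for row in counts]


def glcm_features(P):
    """ASM, CON, ENT and IDM of a normalised co-occurrence matrix P."""
    n = len(P)
    cells = [(i, j, P[i][j]) for i in range(n) for j in range(n)]
    asm = sum(p * p for _, _, p in cells)
    con = sum((i - j) ** 2 * p for i, j, p in cells)
    # Zero entries are skipped because p*log(p) tends to 0 as p tends to 0.
    ent = -sum(p * math.log(p) for _, _, p in cells if p > 0)
    idm = sum(p / (1 + (i - j) ** 2) for i, j, p in cells)
    return {"ASM": asm, "CON": con, "ENT": ent, "IDM": idm}


toy = [[0, 0],
       [1, 1]]
features = glcm_features(glcm(toy, levels=2))
print(features)
```

In practice several offsets and directions are usually combined; the single horizontal offset here only keeps the example short.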
(2) Local Binary Pattern. Local binary pattern (LBP) describes the local texture of images, as used by Zhao et al. [20]. LBP has the properties of rotational invariance and gray-scale invariance. The steps of LBP in this paper are as follows:
(a) Divide the input image into cells with a size of 16 × 16.
(b) For every pixel in each cell, compare its gray value with those of its 8 neighboring pixels. If the center pixel's gray value is less than the neighboring pixel's gray value, mark the neighboring pixel as 1; otherwise, mark it as 0. Each 3 × 3 neighborhood thus generates an 8-bit binary code, which is used as the LBP value of the neighborhood's center pixel.
(c) Compute each cell's histogram from the frequency of each LBP value (read as a decimal number) and then normalize the histogram.
(d) Obtain the LBP textural feature of the whole image by concatenating the histograms of all cells into a feature vector.
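Steps (b) and (c) can be sketched as follows. The clockwise neighbor ordering starting at the top-left pixel, the skipping of border pixels, and the function names are assumptions for illustration; the paper does not specify them.

```python
def lbp_code(patch):
    """8-bit LBP code of the centre pixel of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Assumed convention: clockwise from the top-left neighbour.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in offsets:
        # A neighbour is marked 1 when the centre value is less than it.
        bit = 1 if center < patch[r][c] else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(cell):
    """Normalised 256-bin histogram of LBP codes over the interior of a cell."""
    hist = [0] * 256
    rows, cols = len(cell), len(cell[0])
    n = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            patch = [row[c - 1:c + 2] for row in cell[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
            n += 1
    return [h / n for h in hist] if n else hist


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
print(lbp_code(patch))  # binary 01011000
```

Concatenating the histograms of all 16 × 16 cells would then give the image-level LBP feature vector described in step (d).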
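(3) Scale-Invariant Feature Transform. Scale-invariant feature transform (SIFT) is an image descriptor for image-based matching and recognition, described by Lindeberg [21], that is invariant to rotation, scale, and angle of view. The first step of SIFT is to build the scale space, which uses the Gaussian convolution kernel as the linear kernel. The scale space and the Gaussian function are defined as
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)},$$
where $I(x, y)$ is the input image and $*$ denotes convolution. Once the scale space has been built, SIFT detects and locates the extreme points of the scale space. Then, direction parameters are assigned to each key point; they are defined as
$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2},$$
$$\theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)},$$
where $m(x, y)$ is the gradient magnitude and $\theta(x, y)$ is the gradient orientation. The last step of SIFT is to generate the descriptor of the key points.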
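The direction parameters of a key point are commonly computed from finite differences of the smoothed image. The sketch below follows that standard definition; the use of `math.atan2` (which covers the full angular range, unlike a plain arctangent) and the `L[y][x]` indexing are assumptions for this example.

```python
import math


def gradient_magnitude_orientation(L, x, y):
    """Finite-difference gradient magnitude and orientation at (x, y).

    L is a 2-D list of smoothed intensities indexed as L[y][x];
    (x, y) must not lie on the image border.
    """
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.hypot(dx, dy)
    orientation = math.atan2(dy, dx)  # radians, in (-pi, pi]
    return magnitude, orientation


smoothed = [[0, 0, 0],
            [0, 0, 4],
            [0, 3, 0]]
m, theta = gradient_magnitude_orientation(smoothed, 1, 1)
print(m, theta)
```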
(4) Tamura. Tamura features include contrast, coarseness, directionality, line-likeness, regularity, and roughness, of which contrast, coarseness, and directionality are the most important for image identification. This paper uses these first three features; their definitions and computation methods are given in [22]. Following these feature extraction methods, this paper extracts a 463-dimensional feature vector for each patch, covering intensity, morphology, and texture. Table 2 presents the number of features in each class.

Sparse Contribution Feature Selection Model.
After the complete features of HCC are obtained, feature selection is normal practice. In recent years, several types of feature selection methods [23][24][25][26] have been proposed, such as filter, wrapper, and embedded methods. Feature selection aims to find a beneficial subset of the original features that contributes more to the final recognition. The purposes of feature selection are thus summarized as follows: (1) increasing the accuracy of prediction, (2) building a fast and efficient model, and (3) being better suited to the recognition model. Following these purposes, a new sparse contribution (SC) feature selection model is developed to select a beneficial feature subset for HCC image classification. The concrete procedures are as follows.
Step 1. Let fs denote the complete HCC image feature set, which is formulated as
$$\mathrm{fs} = \begin{bmatrix} f_1^1 & f_2^1 & \cdots & f_n^1 \\ f_1^2 & f_2^2 & \cdots & f_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ f_1^m & f_2^m & \cdots & f_n^m \end{bmatrix},$$
where $m$ is the number of extracted features and $n$ is the number of training samples. $f_j^i$ represents the value of the $i$th feature of the $j$th sample. First, the feature data of each row are normalized to the range from 0 to 1. Next, the Pearson correlation coefficient between every two features is calculated. The closer the absolute value of the correlation coefficient is to 1, the stronger the redundancy. Finally, based on this rule, the redundant features are removed.
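The redundancy removal in Step 1 could be sketched as below. The cutoff of 0.95, the greedy keep-first strategy, and the function names are assumptions chosen for illustration; the paper does not state a threshold or which feature of a correlated pair is dropped.

```python
import math


def min_max_normalize(row):
    """Scale one feature row to the range [0, 1]."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [0.0] * len(row)
    return [(v - lo) / (hi - lo) for v in row]


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant row: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(features, cutoff=0.95):
    """Indices of rows kept after greedy removal of highly correlated rows.

    A row is kept only if |r| with every previously kept row is below cutoff
    (the cutoff value is a hypothetical choice).
    """
    kept = []
    for idx, row in enumerate(features):
        if all(abs(pearson(row, features[k])) < cutoff for k in kept):
            kept.append(idx)
    return kept


fs = [[1, 2, 3, 4],    # feature 0
      [2, 4, 6, 8],    # feature 1: a scaled copy of feature 0
      [4, 1, 3, 2]]    # feature 2: weakly correlated with feature 0
normalized = [min_max_normalize(row) for row in fs]
print(drop_redundant(normalized))
```

Because the Pearson coefficient is invariant to positive linear rescaling, the min-max normalization does not change which rows are flagged as redundant.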
Step 2. After normalizing the data, a contrast mapping is performed to map the feature data of each row to the range from −1 to 1. The mapped value $x_i^{*}$ is calculated from the feature value $x_i$ ($i = 1, 2, \ldots, n$), the average value $\mu$ and standard deviation $\sqrt{\sigma}$ of its row, and a control parameter $\lambda$. For each row of fs, a suitable $\lambda$ is selected so that $x^{*}$ ranges from −1 to 1 and approximately obeys a normal distribution. The updated fs is thus represented as
$$\begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_n^1 \\ x_1^2 & x_2^2 & \cdots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_1^m & x_2^m & \cdots & x_n^m \end{bmatrix},$$
where $x$ represents the normalized data from −1 to 1. $\mathrm{fs}_1$ is selected from this matrix and contains $m_1$ features, where $m_1 < m$.
Next, we design a unit circle mapping to calculate the contribution value of each feature. For notational simplicity, let $(x_1, x_2, \ldots, x_n)$ be a row of $\mathrm{fs}_1$, which represents one feature over all training samples. The training samples are divided into two categories, normal and abnormal. $\bar{x}_{\text{normal}}$ and $\bar{x}_{\text{abnormal}}$ represent the average values of $(x_1, x_2, \ldots, x_n)$ over the samples of each category according to the training labels. If $\bar{x}_{\text{normal}} \cdot \bar{x}_{\text{abnormal}} \geq 0$, the feature is removed. Otherwise, the contribution value of the feature is obtained through a unit circle mapping. Figure 3 shows an example of the calculation of the contribution value. If $\bar{x}_{\text{normal}} > 0$, the normal data with $x_i > 0$ have positive contribution values for HCC image classification; on the contrary, the normal data with $x_i < 0$ have negative contribution values. In Figure 3, $x_1$ is a normal datum and $y_1$ is the corresponding vertical coordinate on the unit circle, so $(1 - y_1)$ represents the contribution value of $x_1$. $x_2$ is also a normal datum but $x_2 < 0$, so $-(1 - y_2)$ represents the contribution value of $x_2$. The contribution value of each feature, $C(f)$, is obtained by summing the contribution values of the normal data, $C(x_{\text{normal}})$, and of the abnormal data, $C(x_{\text{abnormal}})$. A threshold $T_1$ is selected: if the contribution value of a feature is less than $T_1$, the feature is removed.
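A minimal sketch of the unit circle mapping is given below, under stated assumptions: a datum in [−1, 1] is paired with the vertical coordinate of the upper unit semicircle, the abnormal class is treated as the mirror image of the normal class (a datum contributes positively when it lies on the same side of 0 as its class mean), and labels are encoded as 0 for normal and 1 for abnormal. The function names are hypothetical.

```python
import math


def datum_contribution(x, class_mean):
    """Contribution of one mapped value x in [-1, 1] for its class."""
    x = max(-1.0, min(1.0, x))      # guard against values slightly outside [-1, 1]
    y = math.sqrt(1.0 - x * x)      # vertical coordinate on the unit circle
    magnitude = 1.0 - y
    # Positive when x has the same sign as its class mean, negative otherwise.
    return magnitude if x * class_mean > 0 else -magnitude


def feature_contribution(values, labels):
    """Contribution value of one feature row, or None if the feature is removed.

    labels: 0 for normal, 1 for abnormal (an assumed encoding).
    """
    normal = [v for v, l in zip(values, labels) if l == 0]
    abnormal = [v for v, l in zip(values, labels) if l == 1]
    mean_n = sum(normal) / len(normal)
    mean_a = sum(abnormal) / len(abnormal)
    if mean_n * mean_a >= 0:
        return None  # class means do not have opposite signs
    return (sum(datum_contribution(v, mean_n) for v in normal)
            + sum(datum_contribution(v, mean_a) for v in abnormal))


row = [0.6, 0.8, -0.6, -0.8]
labels = [0, 0, 1, 1]
print(feature_contribution(row, labels))
```

Values far from 0 on the correct side of their class mean contribute most, which matches the intuition that such a feature separates the two classes well.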
Step 3. Based on the first two steps, a new feature set is represented as
$$\mathrm{fs}_2 = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_n^1 \\ x_1^2 & x_2^2 & \cdots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{m_2} & x_2^{m_2} & \cdots & x_n^{m_2} \end{bmatrix},$$
where $x_n^{m_2}$ represents the selected features of all training data. To improve computational efficiency, the entries $x_j^i$ in $\mathrm{fs}_2$ that are close to 0 ($|x_j^i| \leq T_2$) are set to 0. Finally, the feature set of the training HCC images is expressed as the $m_2 \times n$ matrix $\mathrm{fs}_{\text{final}}$ obtained from $\mathrm{fs}_2$ by this thresholding, where $m_2 < m_1 < m$; $\mathrm{fs}_{\text{final}}$ is thus a sparse feature matrix.
The proposed feature selection method has three important characteristics with respect to the HCC image feature set. First, $\mathrm{fs}_{\text{final}}$ removes the features with redundancy and low contribution value. Second, $\mathrm{fs}_{\text{final}}$ preserves an approximately uniform distribution across all features. Third, for a more straightforward training model, the threshold $T_2$ is used so that $\mathrm{fs}_{\text{final}}$ satisfies lifetime sparsity [27]. Besides the selected feature set, the optimized classifiers also influence classification performance significantly, which is investigated in the next section.
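The thresholding that produces the sparse feature matrix is simple enough to sketch directly. The threshold value 0.1 and the helper names are hypothetical; the paper does not report the value it used.

```python
def sparsify(matrix, threshold):
    """Set entries whose absolute value is at most threshold to 0."""
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    entries = [v for row in matrix for v in row]
    return sum(1 for v in entries if v == 0) / len(entries)


fs2 = [[0.05, -0.40],
       [0.90, -0.02]]
fs_final = sparsify(fs2, threshold=0.1)
print(fs_final, sparsity(fs_final))
```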

Classifiers Optimized with Concave-Convex Variation
Concavity-convexity is an essential feature in image processing [28]. The contour of a region is defined as convex if the line segment between any two points of the region lies entirely within the region; otherwise, it is defined as concave. For HCC images, according to statistics and the a priori knowledge of pathologists, the outline of a normal nucleus can be regarded as approximately convex and is more regular than that of an abnormal nucleus. Following this a priori knowledge, and different from the concave-convex feature in [28], concave-convex variation (CCV) is proposed in this paper to optimize three classifiers (RF, SVM, and ELM). The inflection points of a curve are calculated through $f''(x) = 0$, where $f(x)$ is the analytic expression of the curve. In general, however, $f(x)$ is unknown because of the irregularity of each nucleus contour. To address this problem, the CCV models of normal and abnormal nuclei are established using a slope approximation method, and these models are used to optimize the classifiers for higher accuracy. The specific steps are as follows.
Step 1. A circumscribed rectangle of the contour is built from two horizontal and two vertical lines. In this way, 4 tangent points are acquired, which are marked as $(x_{\text{left}}, y_{\text{left}})$, $(x_{\text{right}}, y_{\text{right}})$, $(x_{\text{top}}, y_{\text{top}})$, and $(x_{\text{bottom}}, y_{\text{bottom}})$. The contour is divided into 4 curves by these 4 tangent points, which can be efficiently utilized in the next step.
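One way to obtain the four tangent points is to take the extreme points of the contour along each axis, as in the sketch below. The image coordinate convention (y grows downward, so "top" is the smallest y), the choice of the first point when several points tie, and the function name are assumptions for illustration.

```python
def tangent_points(contour):
    """Extreme points of a contour given as a list of (x, y) pixel coordinates.

    These are the points where the axis-aligned circumscribed rectangle
    touches the contour. Ties are resolved by taking the first such point.
    """
    return {
        "left": min(contour, key=lambda p: p[0]),
        "right": max(contour, key=lambda p: p[0]),
        "top": min(contour, key=lambda p: p[1]),     # image coordinates: y grows downward
        "bottom": max(contour, key=lambda p: p[1]),
    }


diamond = [(2, 0), (4, 2), (2, 4), (0, 2)]
print(tangent_points(diamond))
```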
Step 3. For notational convenience, let $k = [k_{1,\text{left}}, k_{2,\text{left}}, \ldots]$. We record the number of sign changes between each two adjacent $\Delta_i$. This number is taken as $\mathrm{CCV}(c)$, where $c$ represents a nucleus contour. For example, if $\Delta = [+, +, -, -, +, -]$, the CCV number is 3. This is the slope approximation method used to describe the CCV of each nucleus; the CCV classifier is detailed in the next steps.
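The counting rule can be sketched in a few lines. The representation of signs as positive and negative numbers, the treatment of zero entries (they never count as a sign change), and the function name `ccv_number` are assumptions for this example.

```python
def ccv_number(deltas):
    """Count sign changes between adjacent entries of a numeric sequence.

    Zero entries never count as a sign change (an assumed convention).
    """
    return sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)


delta = [+1, +1, -1, -1, +1, -1]   # the example [+, +, -, -, +, -] from the text
print(ccv_number(delta))  # 3
```

A nearly convex contour is expected to yield few sign changes, while an irregular contour yields more, which is what lets the CCV number separate normal from abnormal nuclei on average.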
Step 4. After the CCV of each HCC image is obtained, the CCV models of the normal and abnormal classes are established according to the labels of all training data. $\operatorname{mean}(\mathrm{CCV}_{\text{normal}})$ and $\operatorname{mean}(\mathrm{CCV}_{\text{abnormal}})$ are defined as the mean CCV values of the normal data and the abnormal data, respectively.
Step 5. Each original classifier (RF, SVM, and ELM) without CCV produces an initial classification result together with the probability that each nucleus is normal or abnormal. The testing data are thus roughly divided into two parts. For the testing data initially classified as normal, $\mathrm{CCV}_{\text{testing}}(x_{\text{normal}})$ denotes the CCV of each such sample, and the adjusting weight $w_{\text{normal}}$ is set from $\mathrm{CCV}_{\text{testing}}(x_{\text{normal}})$, $c_1 = \operatorname{mean}(\mathrm{CCV}_{\text{normal}})$, and $c_2 = \operatorname{mean}(\mathrm{CCV}_{\text{abnormal}})$. The updated probability of each such sample is $P_{\text{normal}}^{\text{final}} = P_{\text{normal}}^{\text{initial}} \cdot w_{\text{normal}}$, where $P_{\text{normal}}^{\text{initial}}$ represents the initial probability of each sample classified as normal. If $P_{\text{normal}}^{\text{final}} \geq 1/2$, the HCC images initially classified as normal are labeled as normal; otherwise, they are labeled as abnormal. For the testing data initially classified as abnormal, the adjusting weight $w_{\text{abnormal}}$ is defined in terms of $\mathrm{CCV}_{\text{testing}}(x_{\text{abnormal}})$, the CCV of each sample initially classified as abnormal. Similarly, if $P_{\text{abnormal}}^{\text{final}} = P_{\text{abnormal}}^{\text{initial}} \cdot w_{\text{abnormal}} \geq 1/2$, the HCC images initially classified as abnormal are labeled as abnormal; otherwise, they are labeled as normal.
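The final decision rule of Step 5 can be sketched as follows. The adjusting weight is passed in as an argument because its formula is not reproduced here; the function name and the string labels are hypothetical.

```python
def ccv_adjusted_label(initial_label, initial_probability, weight):
    """Apply the CCV decision rule to one testing sample.

    initial_label: "normal" or "abnormal", as given by the original classifier.
    initial_probability: the classifier's probability for initial_label.
    weight: the adjusting weight for this sample (computed elsewhere).
    """
    if initial_label not in ("normal", "abnormal"):
        raise ValueError("initial_label must be 'normal' or 'abnormal'")
    final_probability = initial_probability * weight
    if final_probability >= 0.5:
        return initial_label
    # Below 1/2 the sample is relabelled as the opposite class.
    return "abnormal" if initial_label == "normal" else "normal"


print(ccv_adjusted_label("normal", 0.8, 0.9))    # 0.72, label kept
print(ccv_adjusted_label("normal", 0.6, 0.7))    # 0.42, label flipped
```

Under this rule a confident initial prediction survives a moderate down-weighting, whereas a borderline prediction whose CCV looks atypical for its class is flipped.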
Following this method, the label of each HCC testing image is obtained using the optimized CCV classifiers. To verify the effectiveness of CCV, three common classifiers (RF, SVM, and ELM) are utilized, and the experimental results demonstrate that the proposed CCV classifiers improve classification performance over the corresponding original classifiers.
The experimental data of this paper were obtained from the pathology department of a large hospital in Shenyang, China. In total, 96 hematoxylin-eosin (H&E) pathological images are used in our experiments. The image format is TIFF and the spatial resolution is 1280 × 960. The magnification of the pathology images is 400×. Figure 5 presents the liver tissue pathology images; Figures 5(a) and 5(b) show a normal and an HCC image, respectively.

Results and Discussions
As mentioned in Section 3, this paper segments the WSI to obtain the single cell images. The cell gray-scale images and the cell binary images are shown in Figure 6. The number of training images and testing images is shown in Table 3.

Experimental Evaluative Criteria.
This paper uses accuracy (ACC), sensitivity (SEN), and specificity (SPE) to evaluate classification performance. Sensitivity is the proportion of HCC cell images that are correctly classified, and specificity is the proportion of normal cell images that are correctly classified. F1-Score is also used to compare the performance of classifiers before and after optimization:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP}, \qquad F_1 = \frac{2TP}{2TP + FP + FN},$$
where TP and TN are the numbers of correctly classified HCC and normal cell images, respectively, and FP and FN are the numbers of wrongly classified cell images.
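These criteria follow directly from the confusion matrix counts, as the sketch below shows. The standard definitions are used with HCC as the positive class; the counts in the example are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """ACC, SEN, SPE and F1-Score from confusion matrix counts.

    tp/fn: HCC images classified correctly/incorrectly;
    tn/fp: normal images classified correctly/incorrectly.
    """
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```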

Performance Comparisons.
As introduced in Section 5, CCV can be applied to multiple classifiers; RF, SVM, and ELM are therefore used in this paper. The comparison between the CCV-based and original classifiers is shown in Figure 7: Figure 7(a) shows ELM versus CCV-ELM, Figure 7(b) shows SVM versus CCV-SVM, and Figure 7(c) shows RF versus CCV-RF. Figure 7 indicates that all CCV-based classifiers outperform the original classifiers in terms of ACC, SEN, SPE, and F1-Score. For ELM, the classification accuracy increases from 75.26% to 90.30% and the specificity reaches 81.79%. SVM performs best among the three classifiers before CCV optimization, and CCV still boosts its ACC from 92.12% to 98.46%. Finally, RF performs better than ELM and worse than SVM before optimization. However, CCV-based RF achieves the best performance among the three classifiers, with 98.74% classification accuracy.
This paper also plots the precision-recall (P-R) curve and the receiver operating characteristic (ROC) curve to further test the optimization ability of CCV. The P-R curve is plotted from precision and recall and gives a visual indication of classifier performance. The ROC curve is plotted from the true positive rate (TPR) and the false positive rate (FPR), which are computed as
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$
Figures 8 and 9 show the P-R curves and ROC curves of the classifiers with and without CCV, respectively. Again, the classifiers optimized by CCV perform better than the originals. In Figure 8, the red curves represent the original classifiers and the blue curves represent the classifiers optimized by CCV. As shown in Figure 8(a), CCV is significantly effective for ELM. Figure 8(b) shows that SVM performs well both before and after optimization by CCV. Figure 8(c) compares RF; RF performs poorly without CCV and performs best after optimization. Figure 9 shows the ROC curves of the three classifiers, where the red curves represent the original classifiers and the green curves represent the classifiers optimized by CCV. The same conclusions can be drawn as from Figure 8. To sum up, CCV improves the performance of all three classifiers, among which CCV-RF performs best for HCC image classification in our experiments.
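For reference, the points of an ROC curve can be obtained by sweeping a decision threshold over the classifier scores, as in the sketch below. The thresholding scheme (one threshold per distinct score, highest first), the label encoding (1 for HCC, 0 for normal), and the toy scores are assumptions for this example.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from thresholding at each distinct score, highest first.

    labels: 1 for the positive class (HCC), 0 for the negative class (normal).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy scores, for illustration only.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_points(scores, labels))
```

A curve that rises toward the top-left corner faster corresponds to a better classifier, which is the visual comparison made between the original and CCV-optimized classifiers.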

Comparison with Other Works.
The paper [1] proposed a novel voting ranking random forests (VRRF) method for HCC image classification, which is based on the conventional random forests model and optimizes the voting step. This paper compares the performance of VRRF and CCV-RF in terms of ACC, SEN, SPE, and F1-Score on the same data set used in our paper. The comparison result is shown in Figure 10. According to Figure 10, both VRRF and CCV-RF perform well on HCC image classification: classification accuracies of 97.68% and 98.74% are achieved by VRRF and CCV-RF, respectively. The SEN and F1-Score values of VRRF are also close to those of CCV-RF. However, CCV-RF performs better than VRRF in terms of SPE, reaching 97.48%. In addition, unlike VRRF, our proposed CCV method can be applied to various classifiers.

Feature Selection Performance.
In Section 4.2, the SC feature selection model is developed for more efficient performance; it not only improves classification accuracy but also reduces computing cost. In our experiments, the number of feature dimensions is reduced from 463 to 89. Moreover, to further improve computational efficiency, some elements of the feature matrix are set to 0 to obtain a sparse feature matrix. Table 4 shows the running time of the three classifiers before and after feature selection. Table 5 presents the accuracy of the three classifiers without CCV. Tables 4 and 5 show that, after feature selection, the running time of each classifier is reduced and the ACC of each classifier is increased. In addition, the construction of the sparse feature matrix is a significant factor in reducing computing time.

Discussions.
According to the comparison results and analysis above, our proposed classifiers optimized by CCV improve classification performance compared to the original classifiers. Different from the concavity-convexity feature, the CCV model uses a statistical property, namely the difference between the CCV of each testing sample and the mean CCV of the training data, to adjust the optimized weights. The rule for adjusting the weights is based on the Euclidean distance in the CCV difference model. In addition, instead of optimizing the parameters of each classifier, the CCV model considers the differences among samples rather than the performance of the classifiers.

Conclusions
This paper proposes a concave-convex variation (CCV) method to optimize three classifiers (random forest, support vector machine, and extreme learning machine). A new SC feature selection method is developed to remove redundant features from the complete feature set, which benefits the final classification and reduces the computational cost. Each classifier provides initial classification results and the corresponding probabilities. Then, a CCV statistical model is established from all training data. The final classification results are obtained through the CCV classifiers using the weight adjustment rule. Experiments with 1260 HCC image patches demonstrate that our proposed CCV classifiers perform better than the original classifiers in terms of ACC, SEN, SPE, and F1-Score. Furthermore, CCV-RF performs best for HCC image classification in this paper.

Conflicts of Interest
The authors declare that they have no conflicts of interest.