Three-Class Mammogram Classification Based on Descriptive CNN Features

In this paper, a novel classification technique for large data sets of mammograms using a deep learning method is proposed. The proposed model targets a three-class classification study (normal, malignant, and benign cases). The model comprises two methods, namely, convolutional neural network-discrete wavelet (CNN-DW) and convolutional neural network-curvelet transform (CNN-CT). An augmented data set is generated by using mammogram patches. To enhance the contrast of mammogram images, the data set is filtered by contrast limited adaptive histogram equalization (CLAHE). In the CNN-DW method, enhanced mammogram images are decomposed into four subbands by means of the two-dimensional discrete wavelet transform (2D-DWT), while in the second method the discrete curvelet transform (DCT) is used. In both methods, dense scale invariant feature transform (DSIFT) features are extracted for all subbands. An input data matrix containing these subband features of all the mammogram patches is created and processed as input to a convolutional neural network (CNN). A softmax layer and a support vector machine (SVM) layer are used to train the CNN for classification. The proposed methods have been compared with existing methods in terms of accuracy rate, error rate, and various validation assessment measures. CNN-DW and CNN-CT have achieved accuracy rates of 81.83% and 83.74%, respectively. Simulation results validate the significance and impact of our proposed model as compared to other well-known existing techniques.


Introduction
Recent studies show that in the UK breast cancer is the second leading cause of cancer deaths in women. Every year around 55,000 women in the UK are diagnosed with breast cancer, which is equivalent to one person every 10 minutes. One woman out of eight has a chance of being diagnosed with breast cancer in her lifetime [1]. Similar statistics are reported in the USA, with 231,000 estimated new cases of breast cancer in 2015 [2]. Breast cancer usually takes time to develop and symptoms appear very late. As there is no effective way to cure later stage breast cancer, many lives can be saved if it is detected at an early stage. Therefore, for the early detection of breast cancer, the American Cancer Society (ACS) recommends that every woman who has a high risk factor for breast cancer should take a screening test once a year [2].
In the current technical era, computerized diagnostic systems widely use mammogram screening methods to classify breast tumors. A computer aided diagnosis (CAD) system typically relies on machine learning techniques to detect tumors in digitized mammogram images. Such techniques need to work with discriminant and descriptive features to classify images into multiple classes. In the past decade numerous methods have been proposed to classify mammogram images and to attain better accuracy, efficiency, robustness, and precision. Nevertheless, it is still an open research area due to the intrinsic challenges in mammogram representation and classification.
Many researchers have studied mammogram images for two-class (normal versus abnormal) classification and achieved significant results. Mazurowski et al. proposed a template based recognition algorithm for breast masses [3]. Their data set was based on 1,852 Digital Database for Screening Mammography (DDSM) images and they achieved accuracy up to 83%. Lesniak et al. compared the performance of support vector machine (SVM) based classification with nearest neighbor algorithms [4]. They used a private data set of mammography patches containing 10,397 images. The accuracy of their model was up to 67%. Wei et al. presented a relevance feedback learning method and performed classification using an SVM with a radial kernel on a data set of 2,563 DDSM images [5]. Tao et al. compared the performance of two classifiers, named curvature scale space and local linear embedded metric, using databases of 476 and 415 images, and the accuracy of the two classifiers was 75% and 80%, respectively [6]. Abirami et al. [7] used wavelet features for the two-class classification of digital mammograms; they achieved 93% accuracy on the MIAS data set. Elter and Halmeyer [8] performed classification using an Artificial Neural Network (ANN) and a Euclidean metric classifier, respectively, and achieved a performance over 85%. All of the above researchers used two-class classification, but two-class classification is not enough to avoid unnecessary biopsy because in abnormal cases the tumor can be either benign or malignant. Suckling proposed an Extreme Learning Machine (ELM) method to classify mammograms of the Mammographic Images Analysis Society (MIAS) database [9]. The algorithm outperformed other techniques on the same database [10]. Jasmine et al. performed two-class classification with their proposed method based on wavelet analysis using an ANN [11]. This experiment was performed using the MIAS database of 322 images and achieved accuracies up to 87%.
In [12] Xu et al. compared the performance of three NNs and suggested that Multilayer Perceptron (MLP) performance improved as the number of features increased. They achieved an accuracy of up to 98% by using 120 mammogram images. Deserno et al. used the Image Retrieval in Medical Applications (IRMA) data set containing 2796 images, experimented with 2D principal component analysis (2DPCA), and achieved accuracy up to 80% [13]. However, they used 20 classes in their classification.
In the last few years, deep learning using NNs has achieved state-of-the-art results in many fields of computer vision, such as object detection and classification [14]. Deep learning models have also been applied in various medical imaging fields, such as tissue classification in histopathology and histology images [15]. However, only a limited number of studies in the literature use deep learning for mammogram image classification [16]. In [17], CNNs were used to segment the breast tissue of mammographic texture. Multiscale features and autoencoders were applied to calculate a breast density score [18]. CNNs were used to classify microcalcifications, but the data set was very small [19]. Kallenberg [...] [25], [26], respectively, and achieved significant results on data sets of 216 and 690 images. Uppal and Naseem used a fusion of discrete cosine transform and discrete wavelet transform features to classify mammograms into 3 classes [27]; they used data from the MIAS database and obtained high accuracies of 96.97% and 98.39%, respectively. Deep learning methods can perform well but require large data sets [28][29][30]. Table 1 summarizes the significant work done so far on the classification of mammogram images. It can be seen that significant results have been achieved for two-class classification. However, for three-class (normal, benign, and malignant) classification, there has been little progress because either the available data sets are small and private or the proposed systems have not achieved very promising results.
In this paper, we extend our previous work [31] and propose an improved classification technique for large data sets of mammograms using CNN. Classic approaches, for example, the rotation and scale invariant DSIFT features [32] combined with an SVM classifier with a linear kernel, did not achieve satisfactory performance on either the two-class (normal versus abnormal) or the three-class (normal, benign, and malignant) classification. Therefore, a three-class classification study (malignant, benign, and normal) is carried out by using our proposed model. Example images of these classes are shown in Figure 1. Two different approaches, namely, CNN-DW and CNN-CT, are presented in our proposed model. An augmented data set is produced by using mammogram patches. The data set is filtered by contrast enhancement. In the first method enhanced mammogram images are decomposed into four subbands by means of 2D-DWT, while in the second method the discrete curvelet transform (DCT) is used. In both methods the DSIFT descriptor is used to extract features for all subbands. An input data matrix containing these subband features of all the mammogram patches is created and processed as input to the convolutional neural network (CNN). A softmax layer and an SVM layer are used to train the CNN for classification. A flow chart of the proposed model is given in Figure 2.
The main contribution of this paper is the development of a deep learning method based on a large data set of mammogram images. We have shown that the discriminant and descriptive features can perform well with different wavelets, if these are used according to our proposed model in combination with CNN. We also perform classification with SVM via 10-fold cross-validation presenting more unbiased results.
The remainder of the paper is organized as follows. Section 2 explains the feature extraction and representation steps in this research. Section 3 describes the CNN based classification model and SVM classification. Section 4 presents the simulation results, and the paper concludes in Section 5.

Feature Extraction and Representation
[...] is, height of the local histogram, and thus on the maximum contrast enhancement factor. In this technique, enhancement is done on very small patches, so the overenhancement due to noise or the effect of edge-shadowing is very low as compared to adaptive histogram equalization (AHE) [37].
The CLAHE method was originally developed to reduce the shadow of edges and noise produced in homogeneous areas in medical images [38]. The method has been used for the enhancement of digital mammograms [36][37][38][39][40] and has demonstrated good improvements in the visual quality of mammograms.
An input image with dimensions $M \times N$ is divided into small blocks. CLAHE is then used to enhance the contrast of each block. Finally, bilinear interpolation is used to combine the neighboring blocks back into whole images. The steps in CLAHE are described below [40].
(2) The histogram of each block is calculated.
(3) For contrast enhancement of patches, a histogram clip limit of 0.001 is set.
(4) After clipping the threshold value the histogram is redistributed.
(5) Every block histogram is modified by the following transformation function:
$$ T(i) = \sum_{j=0}^{i} p(j), \qquad (1) $$
where $p(j)$ is the probability density function of the input patch image grayscale value at $j$ and is defined as
$$ p(j) = \frac{n_j}{n}, \qquad (2) $$
where $n_j$ is the number of input pixels with gray scale value $j$ and $n$ is the total number of pixels in a block.
(6) Bilinear interpolation is used to combine the neighboring blocks in each patch. The gray scale value of the patch is also changed according to the new histogram.
In our experiment, we have used the block size of 8 × 8 and clip limit of histogram is defined as 0.001.
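The clipping, redistribution, and transformation steps above can be sketched for a single block as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `equalize_block`, the 256-bin histogram, and in particular the way the normalized clip limit of 0.001 is mapped to a per-bin threshold are assumptions, and the final bilinear interpolation between neighboring blocks is omitted.

```python
import numpy as np

def equalize_block(block, clip_limit=0.001, n_bins=256):
    """Clip, redistribute, and equalize the histogram of one 8-bit block."""
    hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
    p = hist / block.size                        # p(j) = n_j / n
    # Assumption: the normalized clip limit interpolates between the
    # uniform level 1 / n_bins (limit 0) and no clipping at all (limit 1).
    uniform = 1.0 / n_bins
    limit = uniform + clip_limit * (1.0 - uniform)
    excess = np.clip(p - limit, 0.0, None).sum()
    p = np.minimum(p, limit) + excess / n_bins   # spread clipped mass evenly
    cdf = np.cumsum(p)                           # cumulative sum of p(j)
    lut = np.round((n_bins - 1) * cdf).astype(np.uint8)
    return lut[block]                            # map every pixel through the table

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
enhanced = equalize_block(block)
```

Because the lookup table is a cumulative sum, the mapping is monotone: the ordering of gray levels inside a block is preserved while their spacing changes. A small clip limit keeps the mapping close to the identity, which is what limits noise amplification.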

Two-Dimensional Discrete Wavelet Transform.
A two-dimensional DWT consists of downsamplers and digital filter banks. The digital filter banks comprise a low-pass filter and a high-pass filter. The number of banks depends upon the desired resolution of the application [41]. As mammogram images are two-dimensional signals, the DWT can be computed by separable wavelet functions. As shown in Figure 3, the rows and columns of the image are processed separately by the one-dimensional wavelet transform to obtain the two-dimensional DWT. In the frequency domain the enhanced image is decomposed into subband images at resolution $2^{j+1}$.
$A$ is the approximation of the image. $D^d$, $D^h$, and $D^v$ are three detail subband images in the diagonal, horizontal, and vertical directions, respectively.
As a result of wavelet decomposition, the image is decomposed into four subband components, High-High (HH), High-Low (HL), Low-High (LH), and Low-Low (LL). LL is the approximation subimage $A$ and HH is the diagonal detail $D^d$, while HL and LH hold the remaining two directional details $D^h$ and $D^v$, as shown in Figure 3.
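The separable row-then-column processing can be illustrated with a one-level decomposition. The sketch below assumes the Haar wavelet, which the text does not specify, and an image with even height and width; the function name `haar_dwt2` and the naming convention (first letter for the filter applied along rows, second for the filter applied along columns) are likewise assumptions made for illustration.

```python
import numpy as np

def haar_dwt2(image):
    """One-level 2D Haar DWT; returns (LL, LH, HL, HH) at half resolution."""
    x = np.asarray(image, dtype=float)
    # Low-pass and high-pass filtering along each row, then downsampling by 2.
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    # The same filter pair applied along each column of both results.
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

image = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(image)
```

Each subband has half the height and half the width of the input, and because the Haar filters are orthonormal the total energy of the four subbands equals the energy of the input image.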

Discrete Curvelet Transform.
Discrete curvelet transform is an image representation technique used in computer vision. It was proposed by Candes and Donoho [42]. DCT codes image edges more efficiently than wavelet transform [43] and it has useful geometric features that can be used as a feature vector in medical image processing. Eltoukhy et al. [44,45] have used DCT for the mammogram images.
Let $f$ be a function that has a discontinuity across a curve and is smooth otherwise, and consider approximating $f$ from the best $m$ terms in the expansion. The squared error of such an $m$-term expansion obeys [46]
$$ \| f - \tilde{f}_m^{F} \|^2 \asymp \frac{1}{\sqrt{m}}, \quad m \to +\infty, \qquad (3) $$
where $\tilde{f}_m^{F}$ is the approximation from the best $m$ Fourier coefficients.
$$ \| f - \tilde{f}_m^{W} \|^2 \asymp \frac{1}{m}, \quad m \to +\infty. \qquad (4) $$
Equation (4) shows the expansion for wavelets, where $\tilde{f}_m^{W}$ is the approximation from the best $m$ wavelet coefficients.
$$ \| f - \tilde{f}_m^{C} \|^2 \asymp \frac{(\log m)^3}{m^2}, \quad m \to +\infty. \qquad (5) $$
Equation (5) shows the expansion for curvelets, where $\tilde{f}_m^{C}$ is the approximation from the best $m$ curvelet coefficients. Equation (5) shows that the MSE is reduced with DCT. The fast DCT proposed in [47] is described below.
It works in the two-dimensional space $\mathbb{R}^2$ with $\omega$ as the frequency domain variable and $x$ as the spatial variable, and $r$ and $\theta$ are the polar coordinates in the frequency domain. A pair of windows $V(t)$ and $W(r)$ are defined, which will be called the angular window and the radial window, respectively. $V$ takes real arguments and is supported on $t \in (-1, 1)$, and $W$ takes positive real arguments and is supported on $r \in (1/2, 2)$.
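To make the difference between the three approximation rates concrete, the short sketch below evaluates the asymptotic squared-error rates for Fourier, wavelet, and curvelet expansions (proportional to $m^{-1/2}$, $m^{-1}$, and $m^{-2}(\log m)^3$, respectively). All constants are set to one, which is an assumption made purely for illustration, and the function name `error_rates` is hypothetical.

```python
import math

def error_rates(m):
    """Asymptotic squared-error rates (up to constants) for m retained terms."""
    return {
        "fourier": m ** -0.5,
        "wavelet": m ** -1.0,
        "curvelet": m ** -2.0 * math.log(m) ** 3,
    }

rates = error_rates(10_000)
```

The rates are asymptotic statements: for small $m$ the logarithmic factor can make the curvelet bound larger than the wavelet bound, and the ordering curvelet < wavelet < Fourier only emerges as $m$ grows.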

Dense Scale Invariant Feature Transform.
In the next step, the DSIFT descriptor is extracted from all the subband components. Dense SIFT scale-space extrema detection uses the Difference-of-Gaussian (DOG) function to identify potential interest points [48], which are invariant to scale and orientation.
In the key point localization stage, the Hessian matrix is used to compute principal curvatures, which eliminate edge responses, and low contrast points are rejected [48]. The key point descriptor is obtained from a three-dimensional histogram in which two dimensions correspond to the image spatial dimensions and the third dimension corresponds to the image gradient direction, computed in a region centered at the key point.
The DSIFT descriptor is applied to all the subbands with step size 4 and radius size 5. Feature matrices of dimension (128 × 400) are extracted for all the subbands. From the columns of this matrix, six time domain features (kurtosis, mean, skewness, energy, maximum, and standard deviation) are extracted for each subband. The resultant feature matrix has the shape (128 × 6). This matrix is reshaped into a vector of shape (1 × 768). Weighting coefficients are applied to the subband features according to (13) and (14) for the CNN-DW and CNN-CT methods, respectively:
$$ \text{Feature vector} = 3\,\mathrm{LL} + 2\,\mathrm{LH} + 2\,\mathrm{HL}, \qquad (13) $$
$$ \text{Feature vector} = C_1 + 3\,C_2 + 2\,C_3 + 2\,C_4, \qquad (14) $$
where $C_1, \ldots, C_4$ denote the feature vectors of the four curvelet subbands. Equal zero padding is applied to the start and end columns so that the vector has the shape (1 × 784), and the class label is appended as a last column, giving (1 × 785). Enhancement and feature extraction steps are performed on the whole augmented data set, so that we have a data matrix $\tilde{X}(i, j)$ of the shape (22368 × 785), where 22368 is the number of sample images and 784 is the number of features of each sample, and every sample has a last column label that gives its respective patch class.
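The shape bookkeeping of this step can be followed in a short sketch for the CNN-DW weighting in (13). It assumes the six statistics are computed per descriptor dimension across the 400 descriptors, uses population moments for skewness and kurtosis and the sum of squares for energy, and pads 8 zeros on each side (768 + 8 + 8 = 784); these choices and the function names are assumptions, since the exact definitions are not given in the text.

```python
import numpy as np

def subband_features(desc):
    """Reduce a (128, 400) DSIFT matrix to a (768,) vector of six statistics."""
    mean = desc.mean(axis=1)
    std = desc.std(axis=1)
    centered = desc - mean[:, None]
    safe_std = np.where(std > 0, std, 1.0)       # avoid division by zero
    skew = (centered ** 3).mean(axis=1) / safe_std ** 3
    kurt = (centered ** 4).mean(axis=1) / safe_std ** 4
    energy = (desc ** 2).sum(axis=1)
    stats = np.stack([kurt, mean, skew, energy, desc.max(axis=1), std], axis=1)
    return stats.reshape(-1)                     # (128, 6) -> (768,)

def cnn_dw_vector(ll, lh, hl, pad=8):
    """Weighted sum of subband vectors as in (13), zero padded to 784 values."""
    combined = (3 * subband_features(ll)
                + 2 * subband_features(lh)
                + 2 * subband_features(hl))
    return np.pad(combined, (pad, pad))          # zeros at start and end

rng = np.random.default_rng(1)
ll_desc, lh_desc, hl_desc = (rng.random((128, 400)) for _ in range(3))
features = cnn_dw_vector(ll_desc, lh_desc, hl_desc)
```

Appending the class label to each 784-value vector then gives the 785 columns of the data matrix.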

Convolution Neural Network
In the next step we use a CNN to learn features from the data set matrix $\tilde{X}$. CNNs have proved their importance in image classification through significant results. A CNN has a multilayered architecture, consisting of a convolution layer followed by a maximum pooling layer. The number of layers depends upon the designer. The output of the final maximum pooling layer is fed to a fully connected layer that works like an MLP, which is further forwarded to a softmax layer. The convolution layer takes 1D or 2D matrices as input. Equation (15) shows a single output matrix of the convolution layer:
$$ A_j = f\left( \sum_{i=1}^{N} \tilde{X}_i * K_{i,j} + B_j \right), \qquad (15) $$
where $\tilde{X}_i$ is the input matrix that is convolved with the kernel matrices $K_{i,j}$. The bias $B_j$ is added to each element of the output after computing the sum of all convolved matrices. $A_j$ is one output matrix computed by a nonlinear activation function $f$ that is applied to each element. Commonly used activation functions in the convolution layer are the hyperbolic tangent function and the sigmoid function:
$$ f(x) = \tanh(x), \qquad f(x) = \frac{1}{1 + e^{-x}}. \qquad (16) $$
The pooling layer is used for dimensionality reduction after the convolution layer. The most commonly used pooling algorithms are average (mean) pooling and maximum pooling. During training, the dropout algorithm is applied by randomly disabling neurons, with a dropout ratio normally between 0.3 and 0.6. The final layer of the CNN is a softmax layer with one output neuron per class of the problem, each of which is assigned a confidence score.
Figure 4: Convolution neural network model.
The overall network design of the CNN is presented in Figure 4. Two convolutional layers are used, each followed by a max pooling layer with a kernel size of 2 × 2. The convolutional layers have 16 kernels; the first layer uses a kernel size of 7 × 7 and the second layer uses a kernel size of 5 × 5. Then, a fully connected neural layer is used. The dropout ratio in the experiment is 0.55. A softmax layer is used to train the CNN for classification.
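A forward pass through one convolution layer and one max pooling layer can be sketched in NumPy as follows. The sketch assumes that the 784 features of a sample are arranged as a 28 × 28 input matrix, that the convolution is unpadded ("valid") with stride 1, and that pooling windows do not overlap; none of these details are stated in the text, and the helper names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, k):
    """Unpadded 2D convolution of matrix x with kernel k (kernel flipped)."""
    windows = sliding_window_view(x, k.shape)    # (H-kh+1, W-kw+1, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, k[::-1, ::-1])

def conv_layer(inputs, kernels, biases, f=np.tanh):
    """Output maps: f(sum over inputs of input * kernel, plus bias).

    inputs: (N, H, W); kernels: (N, M, kh, kw); biases: (M,).
    """
    n_in, n_out = kernels.shape[:2]
    return np.stack([
        f(sum(conv2d_valid(inputs[i], kernels[i, j]) for i in range(n_in))
          + biases[j])
        for j in range(n_out)
    ])

def max_pool(a, size=2):
    """Non-overlapping max pooling over the last two axes."""
    h = a.shape[-2] // size * size
    w = a.shape[-1] // size * size
    a = a[..., :h, :w]
    a = a.reshape(*a.shape[:-2], h // size, size, w // size, size)
    return a.max(axis=(-3, -1))

sample = np.ones((1, 28, 28))                    # one input map (assumed 28 x 28)
kernels = np.ones((1, 16, 7, 7))                 # 16 kernels of size 7 x 7
maps = conv_layer(sample, kernels, np.zeros(16), f=lambda v: v)
pooled = max_pool(maps)
```

Under these assumptions a 28 × 28 input yields sixteen 22 × 22 maps after the 7 × 7 convolution and sixteen 11 × 11 maps after 2 × 2 pooling.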

Classification with Support Vector Machines.
Recently, many researchers have used SVM as a top layer instead of softmax layer in deep learning and showed improvements in the classification result [49]. In the second experiment we also use SVM layer instead of the softmax layer. All the other settings of the process remain the same as explained above.
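SVMs have been applied to many classification tasks [50,51]. Input data is labeled as $y_i = -1$ for class 1 and as $y_i = 1$ for class 2. For linearly separable data a hyperplane can be defined as
$$ w \cdot x + b = 0, \qquad (17) $$
where $x$ is the input vector, $b$ is a scalar, and $w$ is the $n$-dimensional normal vector of this hyperplane. The perpendicular distance from the origin to this plane is $-b/\|w\|$. The solution of the SVM is based on the optimal hyperplane, which is obtained from the Lagrangian
$$ L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad (18) $$
where $\alpha_i$ is a Lagrangian coefficient and $\alpha_i \geq 0$. Setting the derivatives of (18) with respect to $w$ and $b$ to zero results in
$$ w = \sum_{i} \alpha_i y_i x_i, \qquad \sum_{i} \alpha_i y_i = 0. \qquad (19) $$
Putting (19) into (18), it is redefined as
$$ L_D(\alpha) = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \qquad (20) $$
where $K(x_i, x_j)$ is the kernel function [52], which reduces to the inner product $x_i \cdot x_j$ in the linear case.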
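The relation between the multipliers, the normal vector, and the dual form of the objective can be checked numerically on a toy problem. The sketch below uses the standard hard-margin formulation with a linear kernel; the two-point data set and the multiplier values are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def dual_objective(alpha, X, y, kernel=np.dot):
    """sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

def primal_weights(alpha, X, y):
    """Normal vector w = sum_i a_i y_i x_i (valid for the linear kernel)."""
    return (alpha * y) @ X

X = np.array([[-1.0, 0.0], [1.0, 0.0]])   # one sample per class
y = np.array([-1.0, 1.0])                 # class 1 -> -1, class 2 -> +1
alpha = np.array([0.5, 0.5])              # optimal multipliers for this toy case
w = primal_weights(alpha, X, y)
dual = dual_objective(alpha, X, y)
```

For this example both samples lie exactly on the margin with $b = 0$, and the dual objective equals the primal value $\frac{1}{2}\|w\|^2$, as expected at the optimum.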

Simulation and Results
This section presents the database and validation assessment measures that are used in this experiment. Moreover, the experimental results are presented to show the superiority of proposed methods.

Database.
We have used the IRMA data set [53] for experiments in this study. A total of 2796 patches of the original mammogram images are used for this experiment. Selected IRMA patches consist of four different sources including 2,576 images from DDSM, 150 images from MIAS, [...]
Accuracy is the most commonly used assessment measure for classification; it takes all the cases into account.
The positive predictive value (PPV) is defined as the number of correctly detected positive cases over all detected positive cases.
The negative predictive value (NPV) is defined as the number of true negative cases detected over all detected negative cases.
Sensitivity is defined as the ratio of the detected true positive cases over the actual positive cases. It deals only with positive cases. Unlike sensitivity, specificity deals only with negative cases. It is the ratio of the detected true negative cases over the actual negative cases.
The Matthews correlation coefficient (MCC) is an assessment indicator that is particularly useful when the numbers of detected negative and positive samples are evidently unbalanced. In such cases MCC provides a better assessment than the general accuracy.
The ROC curve is used for measuring the predictive accuracy of the model. It indicates the relation between the true positive rate and false positive rate.
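The assessment measures defined above can all be computed from the four entries of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration, and for the three-class problem the counts would have to be formed per class (for example, one class versus the rest), which is an assumption about how the measures are applied here.

```python
import math

def assessment_measures(tp, tn, fp, fn):
    """Standard binary measures from true/false positive/negative counts."""
    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # MCC is set to 0 when a row or column of the confusion matrix is empty.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

measures = assessment_measures(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC uses all four counts symmetrically, so it stays informative when one class dominates the data set.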

Experimental Results.
In this subsection the proposed methods have been compared with existing methods in terms of accuracy rate, error rate, and various validation assessment measures. Figure 5 shows the result of two-class classification.
It can be observed that in two-class classification the Histogram Oriented Gradient (HOG) method performs best, with an accuracy rate of 83.2%. The other two schemes, Local Configuration Pattern (LCP) and DSIFT, have accuracy rates of 82.26% and 74.6%, respectively. Likewise, Figure 6 shows the result of three-class classification. Here the LCP method performs better than the other two schemes, with the best accuracy of 57.54%, but the results are not promising. This accuracy is improved by our methods, as shown in the rest of the simulation results. In Figure 7, the accuracy rate of the proposed CNN-DW method is presented for different numbers of iterations using the softmax layer. Note that the three-class classification results obtained by the proposed CNN-DW method are better than those of the existing schemes in Figure 6. The CNN-DW method achieved accuracies of 83.14% and 81.18% on the validation data set and test data set, respectively. Furthermore, Figure 8 shows the error rate of the proposed CNN-DW method with the softmax layer at different iterations. With the softmax layer, it has error rates of 16.86% and 18.82% on the validation data set and test data set, respectively. Likewise, the accuracy rate and error rate of the second proposed method, that is, CNN-CT, are shown. Figure 9 shows the accuracy rate of the proposed CNN-CT method with the softmax layer at different iterations. Note that the three-class classification results obtained by the proposed CNN-CT method are better than those of the existing schemes in Figure 6 and of the CNN-DW method as well. The proposed method achieved accuracies of 84.57% and 82.54% on the validation data set and test data set, respectively. Similarly, Figure 10 shows the error rate of the proposed CNN-CT method with the softmax layer at different iterations. With the softmax layer, it has error rates of 15.43% and 17.46% on the validation data set and test data set, respectively.
In the further simulation, the results of our proposed methods using the SVM layer are presented. Figure 11 shows the accuracy rate of the proposed CNN-DW method with the SVM layer at different instants. The proposed CNN-DW method achieved an average accuracy of 81.83%. Likewise, Figure 13 shows the accuracy rate of the proposed CNN-CT method with the SVM layer. The proposed curvelet method achieved an average accuracy of 83.74%. Moreover, the proposed methods are also tested with SVM 10-fold cross-validation. Figure 12 shows the accuracy rate of the proposed CNN-DW method with the SVM layer; it achieved an average accuracy of 81.23% in 10-fold cross-validation. Similarly, Figure 14 shows the accuracy rate of the proposed CNN-CT method with the 10-fold cross-validated SVM layer. It achieved an average accuracy of 83.11%. Table 2 shows the quantitative comparison of existing and proposed schemes. It can be observed that the proposed CNN-DW and CNN-CT methods provide better measure values, especially on a large data set of mammogram images. The proposed CNN WT method outperforms all other methods. Similarly, Table 3 shows the quantitative comparison of the existing and proposed schemes for the SVM classifier with 10-fold cross-validation. It can be observed that the proposed scheme provides better measure values in both cases. Finally, Table 4 provides a summary of the accuracy rates for 3-class classification.
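The 10-fold cross-validation protocol referred to above can be sketched as an index split: the samples are divided into ten folds, and each fold serves once as the test set while the other nine are used for training. The shuffling, the fixed seed, and the function name are assumptions; the text does not state how the folds were formed (for example, whether they were stratified by class).

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)             # k nearly equal parts
    for i in range(k):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, folds[i]

splits = list(k_fold_indices(22368, k=10))
```

The reported cross-validated accuracy would then be the mean of the ten per-fold test accuracies, which makes the estimate less dependent on one particular train/test partition.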

Conclusion
A novel mammogram classification method for breast cancer detection based on CNN is proposed. We have proposed two algorithms; the first algorithm is based on the 2D discrete wavelet transform while the other is based on the curvelet transform. We have found that deep learning methods can be used for breast cancer detection by using data augmentation, and the results show that learning features from the data set before inputting the data to the CNN is more helpful for cancer detection. We have also found that the classification performance can be improved by using the SVM layer instead of the softmax layer. However, the 10-fold cross-validated result of the SVM can reduce the accuracy, because the cross-validated result is more unbiased than a single training and testing process. The proposed method with the curvelet transform has better results than the proposed method with the wavelet transform and other existing methods. In future work, more deep learning techniques can be applied to the detection of breast cancer. Improvement can also be made by using different CNN architectures.