A New Feature Ensemble with a Multistage Classification Scheme for Breast Cancer Diagnosis

A new and effective feature ensemble with a multistage classification is proposed to be implemented in a computer-aided diagnosis (CAD) system for breast cancer diagnosis. A publicly available mammogram image dataset collected during the Image Retrieval in Medical Applications (IRMA) project is utilized to verify the suggested feature ensemble and multistage classification. In achieving the CAD system, feature extraction is performed on the mammogram region of interest (ROI) images which are preprocessed by applying a histogram equalization followed by a nonlocal means filtering. The proposed feature ensemble is formed by concatenating the local configuration pattern-based, statistical, and frequency domain features. The classification process of these features is implemented in three cases: a one-stage study, a two-stage study, and a three-stage study. Eight well-known classifiers are used in all cases of this multistage classification scheme. Additionally, the results of the classifiers that provide the top three performances are combined via a majority voting technique to improve the recognition accuracy on both two- and three-stage studies. A maximum of 85.47%, 88.79%, and 93.52% classification accuracies are attained by the one-, two-, and three-stage studies, respectively. The proposed multistage classification scheme is more effective than the single-stage classification for breast cancer diagnosis.


Introduction
Cancer is a disease in which a group of body cells grows and proliferates abnormally and uncontrollably because of damaged DNA (deoxyribonucleic acid). Such a group of cells, known as a tumor, may be either benign or malignant. Benign tumors are neither cancerous nor life-threatening, as they do not spread to other tissues or organs of the body. In stark contrast, malignant tumors tend to metastasize and can be fatal.
Breast cancer originates in breast tissue. It is the most frequently diagnosed cancer among women, and it is 100 times more common in women than in men [1]. Worldwide, breast cancer is the second major cause of female deaths resulting from cancer [2]. There is no known way to prevent breast cancer, but mortality can be reduced with early diagnosis [3]. Radiological screening is the most important action to take for early diagnosis [4]. Although mammography is known as the most effective radiological screening technique for both breast investigation and diagnosis, the subtle difference in X-ray permeability between normal and abnormal regions makes cancer detection difficult [5]. This difficulty is aggravated as the breast tissue becomes denser. Moreover, human factors heavily affect the interpretation of mammogram images. A computer-aided diagnosis (CAD) system detects and diagnoses cancer without these negative factors [6]. Hence, using a CAD system increases the sensitivity of cancer detection by providing radiologists with a second opinion.
The classification accuracy of CAD systems is directly affected by the detection of regions suspicious for breast cancer, namely regions of interest (ROIs), in whole-breast mammogram images. Besides the low-contrast problem, digitization noise in mammograms also negatively affects the success of ROI detection, and noise reduction is required to improve the image quality [7,8]. Hence, preprocessing is necessary and should be the first of the four stages in a CAD system. Some studies have tried to overcome the problem of low contrast using histogram processing operations [8][9][10][11], morphological operations [12], and statistics theory [13], while unsharp filtering [8], wavelet transform [12,13], and median filtering [14,15] are the most common noise reduction techniques.
In the second stage of a CAD system, the ROIs are detected from entire breast images. ROI detection in the past decade was generally performed using wavelet transforms [16], segmentation algorithms [17,18], and edge operators [19].
Mammogram-based breast cancer diagnosis studies can be categorized as microcalcification detection, mass detection, and mass recognition. Pal et al. presented a multistage system for microcalcification detection [27]. This multistage system first classifies a mammogram image as normal or abnormal; then, for an abnormal image, it detects the regions with microcalcification. The authors extracted statistical features from manually detected ROIs and implemented feature selection and classification using a multilayer perceptron neural network [27]. Lado et al. developed an extended generalized additive model (GAM) involving the interaction of breast tissue factors to reduce the false-positive rate for microcalcification detection [16]. The authors stated that the false-positive rate decreased from 1.46 to 0.74 per image when the breast tissue type was integrated into the GAM. Similarly, Malar et al. studied the effectiveness of breast tissue type integration on microcalcification detection using an extreme learning machine and achieved an accuracy of 94% using wavelet-based features [40]. Since the number of cells with microcalcification is smaller than the number of healthy cells, microcalcification detection is an unbalanced classification problem. Bria et al. proposed a cascaded five-classifier approach to eliminate the predominance of healthy cells [41]. In this approach, the first classifier initially discriminates the normal and abnormal cells, and later benign microcalcification clusters (μCs) and false detections of normal cells are eliminated using a RankBoost classifier. The resultant malignant μCs are evaluated by the next classifier, and the process goes on until the μCs from the last classifier are obtained. Ultimately, the final μCs are selected according to their probability maps with 93% accuracy.
Kekre et al. [17,18] segmented mammogram images using a vector quantization technique for mass detection. They computed the area of each region in the segmented images and classified the region having the largest area as a mass. Hachama et al. used image registration for mass detection [42]. Savitha et al. suggested that analyzing mammogram images in the complex plane will increase the accuracy of mass detection [43]. They mapped the mammogram images into a complex plane and classified them using a fully complex-valued relaxation neural network with an accuracy of 97.84%. Vallez et al. stated that lesion detection and recognition accuracy can be increased by using predefined breast tissue type information [25]. The classification accuracy increased from 78% to 91% in their study. Guliato et al. suggested that the previously proposed polygonal modeling [44] is an effective method for mammogram classification as it helps in noise reduction while preserving the important features [45]. Oliver et al. proposed a knowledge-based approach for the automatic detection of microcalcifications and clusters in mammographic images [37]. In this approach, local features that characterize the morphology of microcalcifications are first extracted by a bank of filters to create a dictionary of visual words. Then, feature selection is accomplished by using a boosted classifier for microcalcification detection. Finally, cluster detection is achieved at 80% sensitivity by locally integrating the individual microcalcification probability images.
In this paper, a new and effective feature ensemble with a multistage classification is proposed to be implemented in a CAD system for breast cancer diagnosis. The result is verified using a publicly available mammogram image dataset collected during the Image Retrieval in Medical Applications (IRMA) project. In the preprocessing stage, contrast enhancement and noise reduction operations are first executed on each mammogram ROI in the database by applying a histogram equalization followed by a nonlocal means (NLM) filtering [46]. The local configuration pattern (LCP) algorithm [47] is then applied to obtain LCP-based feature vectors from the mammographic images. Then, some statistical and frequency-domain features are extracted and concatenated with the LCP-based feature vectors. Finally, these feature vectors formed by LCP-based, statistical, and frequency-domain features are classified as normal, benign, or malignant using eight different popular classifiers via cross-validation. The classification process is performed in three different cases in this study. In the first case, called a one-stage study, the feature vectors are directly classified into three classes. In the second case, called a two-stage study, the feature vectors are initially categorized according to their breast tissue types and are subsequently classified as normal, benign, or malignant. In the third case, called a three-stage study, the feature vectors are first classified according to their breast tissue types. Afterward, they are classified as normal or abnormal. At the third stage of this case, the feature vectors labeled as abnormal are categorized as benign or malignant. Moreover, a classifier combination via a majority voting of the three most successful classifiers is employed for both the two- and three-stage studies.
This paper is organized as follows. The following section describes the preprocessing and feature extraction procedures, the classification methods, and the evaluation metrics. The experimental studies and the obtained results are discussed in Section 3, and the main conclusions are given in the last section.

Materials and Methods
2.1. Database. It is very important to work on images with their ground truths for medical imaging applications [48]. In this study, a publicly available mammogram dataset constructed during the IRMA project is used [49]. This dataset consists of 12 classes defined by the Breast Imaging Reporting and Data System (BI-RADS). There are four breast tissue classes (fatty, fibroglandular, heterogeneously dense, and extremely dense) and three health status classes (normal, benign tumor, and malignant tumor) for each breast tissue type. There are 233 mammogram ROI parts for each class, where an ROI part is a lower-dimensional mammogram image that contains only a healthy or cancerous region of the whole breast; therefore, a total of 2796 parts are available in the dataset [49]. The ROI parts of each class are classified using a cross-validation technique. This means that 210 of the 233 parts (90%) in each class are used for training while the remaining 23 parts (10%) are treated as the test parts. The process is repeated for each fold of the cross-validation, and the average classification accuracy for each classifier is obtained.
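The per-class split described above can be illustrated with a small stratified fold generator. This is only a sketch under stated assumptions: the function names are hypothetical, the labels are synthetic stand-ins for the 12 classes, and the fold count is left as a parameter (k = 10 is used here only because it gives roughly the 90%/10% proportion quoted above).

```python
import random

def stratified_folds(labels, k, seed=0):
    """Return k lists of sample indices with each class spread evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Synthetic labels: 12 classes with 233 ROI parts each (2796 parts in total).
labels = [c for c in range(12) for _ in range(233)]
splits = list(train_test_splits(labels, k=10))
```

Because 233 is not divisible by 10, each test fold holds 23 or 24 parts per class; the average accuracy would then be taken over the folds.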

Preprocessing.
In the preprocessing stage, histogram equalization followed by NLM filtering is applied to the mammogram parts [48]. The NLM filter is an adaptive smoothing filter that changes the window size according to the similarity between the neighborhoods of any two pixels and preserves the fine details by computing a weighting function according to the derivatives in the corresponding search window [46,48]. Given a discrete noisy image $v = \{v(i) \mid i \in I\}$, the filtered value $NL[v](i)$ of any pixel $i$ is computed as
$$NL[v](i) = \sum_{j \in I} w(i,j)\, v(j), \tag{1}$$
where $w(i,j)$ refers to the weight coefficient computed from the similarity between pixels $i$ and $j$ and satisfies the conditions $0 \le w(i,j) \le 1$ and $\sum_j w(i,j) = 1$.
The similarity between pixels $i$ and $j$ is measured as the Gaussian weighted Euclidean distance, $\|v(N_i) - v(N_j)\|_{2,\sigma}^2$, where $\sigma > 0$ is the standard deviation of the Gaussian kernel, whereas $v(N_i)$ and $v(N_j)$ are the neighborhoods of pixels $i$ and $j$ in the similarity window [48]. The weights are computed as
$$w(i,j) = \frac{1}{Z(i)} \exp\left(-\frac{\|v(N_i) - v(N_j)\|_{2,\sigma}^2}{h^2}\right), \qquad Z(i) = \sum_{j} \exp\left(-\frac{\|v(N_i) - v(N_j)\|_{2,\sigma}^2}{h^2}\right). \tag{2}$$
As can be understood by analyzing (2), pixels with larger weights have more similar neighborhoods. $Z(i)$ and $h$ in (2) refer to the normalizing constant and the degree of filtering, respectively.
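To make the weighting scheme concrete, the following is a minimal, unoptimized sketch of NLM filtering on a small image stored as nested lists. It is not the implementation used in the study: the function name, the default radii, the border clamping, and the use of an unweighted (rather than Gaussian-weighted) patch distance are all simplifying assumptions.

```python
import math

def nlm_filter(image, patch_radius=1, search_radius=2, h=10.0):
    """Non-local means on a 2D list of floats (illustrative, not optimized)."""
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Clamp coordinates so patches near the border stay well defined.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return image[r][c]

    def patch_distance(r1, c1, r2, c2):
        # Mean squared difference between the two neighborhoods
        # (assumption: uniform weights instead of a Gaussian kernel).
        total, count = 0.0, 0
        for dr in range(-patch_radius, patch_radius + 1):
            for dc in range(-patch_radius, patch_radius + 1):
                diff = pixel(r1 + dr, c1 + dc) - pixel(r2 + dr, c2 + dc)
                total += diff * diff
                count += 1
        return total / count

    output = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weighted_sum, normalizer = 0.0, 0.0
            for rr in range(max(0, r - search_radius), min(rows, r + search_radius + 1)):
                for cc in range(max(0, c - search_radius), min(cols, c + search_radius + 1)):
                    # Unnormalized weight; dividing by `normalizer` plays the role of Z(i).
                    w = math.exp(-patch_distance(r, c, rr, cc) / (h * h))
                    weighted_sum += w * image[rr][cc]
                    normalizer += w
            output[r][c] = weighted_sum / normalizer
    return output
```

A larger `h` flattens the weights and smooths more strongly, while a small `h` keeps only near-identical neighborhoods in the average.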
The most essential stage in CAD systems, as well as in any pattern recognition problem, is feature extraction, in which data is represented in a low-dimensional space by the most descriptive features that maximize and characterize the interclass differences. In this study, three groups of features are concatenated to construct the feature vectors. The first group is LCP-based features obtained using the LCP algorithm, while the second and third groups are some statistical and frequency-domain features, respectively.

Local Configuration Pattern. The local binary pattern (LBP), a grayscale and rotation-invariant feature extraction technique presented by Ojala et al. [53], has been widely used for face representation and recognition in the past two decades [50][51][52]. The grayscale-independent LBP representation of an image I is obtained by thresholding P neighbors in the circular neighborhood of radius R with the intensity value of the central pixel as given in (4).
$$LBP_{P,R} = \sum_{i=0}^{P-1} s(g_i - g_c)\, 2^i, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{4}$$
The terms $g_i$ and $g_c$ in (4) denote the intensity values of the neighboring pixel $i$ and the central pixel $c$, respectively. The rotation-invariant LBP-based feature vectors are described by the idea of rotating each bit pattern circularly to a minimum value ending up with the maximum value as the last element of the feature vectors. Equation (6) introduces the mathematical representation of this idea where the term $LBP^{riu2}$ refers to the rotation-invariant LBP-based feature vectors.
The quantization of gray-level differences to binary levels sometimes undesirably yields the same LBP representation for neighborhoods that are relatively different. This problem is solved by computing the local variance VAR of each pattern as in (7), and the joint histogram O is formed.
$$VAR_{P,R} = \frac{1}{P} \sum_{i=0}^{P-1} (g_i - \mu)^2, \qquad \mu = \frac{1}{P} \sum_{i=0}^{P-1} g_i \tag{7}$$
$\mu$ in (7) refers to the average intensity of the neighboring pixels.
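The thresholding, circular rotation, and local variance steps can be sketched as follows. This is an illustrative simplification, not the authors' code: it uses the 8 immediate neighbors (P = 8, R = 1, no interpolation), the helper names are hypothetical, and `rotation_invariant` implements the plain minimum-over-rotations idea described in the text rather than the full riu2 mapping.

```python
def lbp_code(neighbors, center):
    """Basic LBP: threshold each neighbor against the center and pack the bits."""
    return sum((1 << i) for i, g in enumerate(neighbors) if g >= center)

def rotation_invariant(code, bits=8):
    """Smallest value reachable by circularly rotating the bit pattern."""
    mask = (1 << bits) - 1
    best = code
    for _ in range(bits - 1):
        code = ((code >> 1) | ((code & 1) << (bits - 1))) & mask
        best = min(best, code)
    return best

def local_variance(neighbors):
    """Variance of the neighbor intensities around their own mean."""
    mean = sum(neighbors) / len(neighbors)
    return sum((g - mean) ** 2 for g in neighbors) / len(neighbors)

# Offsets of the 8 immediate neighbors, listed in circular order.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_at(image, r, c):
    """Return (LBP code, rotation-invariant code, local variance) at an interior pixel."""
    neighbors = [image[r + dr][c + dc] for dr, dc in OFFSETS]
    code = lbp_code(neighbors, image[r][c])
    return code, rotation_invariant(code), local_variance(neighbors)

image = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code, ri_code, var = lbp_at(image, 1, 1)  # code == 26, ri_code == 13
```

Two neighborhoods such as `[7, 1, 1, 1, 1, 1, 1, 1]` and `[200, 1, 1, 1, 1, 1, 1, 1]` around a center value of 6 give the same LBP code but very different variances, which is the ambiguity the VAR term is meant to resolve.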
The LBP algorithm is stated to be an effective technique for detecting local structures; however, the $LBP^{riu2}$ feature vectors for patterns having equal variances may be the same although they have different configurations [47]. Guo et al. proposed a microscopic (MiC) descriptor that defines the microscopic configuration of an image by a linear configuration model as a solution to this problem [47]. In this model, the optimal weights $A_L$ of the neighboring pixels are calculated via the least squares estimation technique to reconstruct the central pixel. To preserve rotation invariance, a one-dimensional Fourier transform of the optimal weight vectors is computed and the $H_L$ values are obtained. The magnitude of $H_L$ is defined as the MiC feature of a pattern.
The local configuration pattern (LCP) is a technique that describes the local structures and microscopic configuration of a pattern together, where the LCP-based feature vector of an image is obtained by concatenating the microscopic configuration of each pattern in an image with their joint histogram as [24]
$$LCP = \big[\,[\,|H_0|;\, O_0\,];\ [\,|H_1|;\, O_1\,];\ \ldots;\ [\,|H_{q-1}|;\, O_{q-1}\,]\,\big],$$
where q is the number of patterns in an image.

Statistical Features. Some significant and descriptive statistical features of each LCP-based feature vector are calculated as the second group of features to increase the data representability of the feature vectors. Energy is one of the most important statistical features of any distribution, and hence, the energy values of LCP-based feature vectors are evaluated. The mean, maximum, minimum, and mean energy of each LCP-based feature vector are additionally computed as statistical features. In statistical theory, the variance, skewness, and kurtosis are defined as variation criteria. Owing to the large variations between healthy and cancerous regions on a mammogram image, these criteria are also calculated. Moreover, the standard deviation, energy variance, and area descriptor [54] of LCP-based feature vectors are additional variation-related features used in this study. Radiologists state that cancerous regions and malignant regions have more irregular distributions than healthy regions and benign regions, respectively. This irregularity corresponds to entropy in statistics. Therefore, the entropy of each LCP-based feature vector is calculated to measure this irregularity as a feature. The statistical features utilized in this study and their mathematical representations for the N × 1 dimensional feature vectors are listed in Table 1.
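As a rough illustration of this feature group, the sketch below computes several of the listed quantities for a one-dimensional vector. Since Table 1 is not reproduced here, the formulas are common textbook definitions and are assumptions rather than the paper's exact ones; the mean energy, energy variance, and area descriptor are omitted for that reason, and the function name is hypothetical.

```python
import math

def statistical_features(x):
    """Textbook statistics of a feature vector (definitions assumed, see lead-in)."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n  # population variance
    std = math.sqrt(variance)
    energy = sum(v * v for v in x)
    # Standardized third and fourth central moments (zero if the vector is constant).
    skewness = sum((v - mean) ** 3 for v in x) / (n * std ** 3) if std > 0 else 0.0
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * std ** 4) if std > 0 else 0.0
    # Shannon entropy of the vector after normalizing magnitudes to sum to one.
    total = sum(abs(v) for v in x)
    probs = [abs(v) / total for v in x if v != 0] if total > 0 else []
    entropy = -sum(p * math.log2(p) for p in probs)
    return {
        "energy": energy, "mean": mean, "max": max(x), "min": min(x),
        "variance": variance, "std": std, "skewness": skewness,
        "kurtosis": kurtosis, "entropy": entropy,
    }

features = statistical_features([1.0, 2.0, 3.0, 4.0])
```

For the vector `[1, 2, 3, 4]` this gives a mean of 2.5, an energy of 30, a variance of 1.25, and zero skewness because the values are symmetric about the mean.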

Frequency-Domain Features. The third group of features computed in this study is the frequency-domain features. They are determined by applying a two-level two-dimensional discrete wavelet transform (2D-DWT) with the Daubechies 1 (db1) wavelet function to the preprocessed mammogram images, which yields 16 sub-bands for each mammogram image. The energy value of each sub-band is computed since the brightness is one of the most significant issues for breast cancer diagnosis. The db1 function is a type of wavelet used in wavelet analysis. The mother function $\psi(t)$ of the db1 wavelet is described as [55]
$$\psi(t) = \begin{cases} 1, & 0 \le t < 1/2 \\ -1, & 1/2 \le t < 1 \\ 0, & \text{otherwise} \end{cases}$$
The preprocessed mammogram parts are decomposed into four sub-bands, LL (low-low), LH (low-high), HL (high-low), and HH (high-high), by a one-level 2D-DWT utilizing the db1 wavelet. Several parameter values were tried in the LCP transform, and ultimately, the LCP algorithm is applied to each sub-band using 8 neighbors in the circular neighborhood of radius 2. Therefore, an 81 × 1 dimensional LCP vector is constructed for each sub-band. The last values in those LCP vectors are appreciably high; therefore, they are removed to prevent them from dominating the other features. The remaining 80-dimensional feature vectors of the sub-bands LL-LH-HL-HH are then weighted with the respective coefficients 1/4-1-1-0, reported as the most efficient coefficients by [5]. Then, they are summed up to form an 80-dimensional feature vector for each mammogram part [48]. In order to increase the representative power of the feature vectors, 12 statistical features computed from the LCP-based feature vectors and 16 frequency-domain features evaluated from the sub-bands obtained by decomposing the preprocessed mammogram ROI parts using the two-level 2D-DWT are concatenated to the LCP-based feature vectors [48]. Consequently, 108-dimensional feature vectors are extracted from each ROI part.
The statistical features are extracted from the LCP-based feature vectors instead of extracting them directly from the mammogram texture to amplify the discriminative power of the LCP-based feature vectors. The frequency-domain features, which are the energy values of each sub-band in the spatial domain, are extracted since the brightness is one of the most significant issues for breast cancer diagnosis, and changes in the brightness in a mammogram image are clearly observed in the spatial frequency. Table 2 summarizes the feature vector construction process. In Table 2, the phrase "LCP: energy" refers to the energy value of an LCP vector whereas "LLLL: energy" is the energy of the LLLL (low-low-low-low) sub-band.
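The sub-band energies and the final feature length can be sketched with a hand-written Haar (db1) transform. This is a minimal illustration under assumptions: the image has even height and width at both levels, every first-level sub-band is decomposed again to reach 16 sub-bands, energy is taken as the sum of squared coefficients, and the LH/HL naming follows one of several conventions in use. The function names are hypothetical.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar (db1) transform of a 2D list with even sides."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, len(image), 2):
        rows = {name: [] for name in bands}
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows["LL"].append((a + b + d + e) / 2)  # local average
            rows["LH"].append((a - b + d - e) / 2)  # difference across columns
            rows["HL"].append((a + b - d - e) / 2)  # difference across rows
            rows["HH"].append((a - b - d + e) / 2)  # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands

def energy(band):
    """Sum of squared coefficients (assumed energy definition)."""
    return sum(v * v for row in band for v in row)

def two_level_energies(image):
    """Energies of the 16 sub-bands obtained by decomposing every sub-band twice."""
    energies = {}
    for name1, band1 in haar_dwt2(image).items():
        for name2, band2 in haar_dwt2(band1).items():
            energies[name1 + name2] = energy(band2)
    return energies

image = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
sub_band_energies = two_level_energies(image)

# 80 LCP-based + 12 statistical + 16 frequency-domain features.
feature_length = 80 + 12 + len(sub_band_energies)
```

Because the transform is orthonormal, the 16 energies add up to the energy of the input image, which gives a quick sanity check on the decomposition.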

Classifiers
2.4.1. Fisher's Linear Discriminant Analysis. Fisher's linear discriminant analysis (FLDA) tries to find a projection matrix that projects the training data onto a low-dimensional space in which the between-class variance is maximized and the within-class variance is minimized [48,56]. This is known as the Fisher maximization criterion and is defined as
$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}},$$
where $\mathbf{w}$, $S_B$, and $S_W$ refer to the projection vectors and the between-class and within-class scatter matrices, respectively. In the test stage of FLDA, any test vector is projected via the projection vectors $\mathbf{w}$, and its distances to the training vectors in the low-dimensional space are calculated [48]. The decision criterion for FLDA is given as
$$c^{*} = \arg\min_{c = 1, \ldots, S} \left\| \Omega_{test} - \Omega_c \right\|,$$
where c is the class index, S is the total number of classes, and $\Omega_c$ and $\Omega_{test}$ are the projected training vector of the cth class and the projected test vector, respectively [48].

2.4.2. Linear Discriminant Classifier. The linear discriminant classifier (LDC) tries to find the weight vectors $\mathbf{w}$ of a linear hyperplane $g(\mathbf{x})$ that separates the given classes [57]. The weight vectors of this hyperplane are defined by a linear combination of the training feature vectors $\mathbf{x}$ of each class. The linear hyperplane is characterized by the weight vectors and a threshold $w_0$ as
$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0.$$
The LDC assigns any test vector $\mathbf{x}_{test}$ to a class according to the sign of the projection function given in (14) for a two-class problem.
$$\mathbf{x}_{test} \in \begin{cases} \omega_1, & g(\mathbf{x}_{test}) > 0 \\ \omega_2, & \text{otherwise} \end{cases} \tag{14}$$
The terms $\omega_1$ and $\omega_2$ in (14) refer to the class labels.
2.4.3. Support Vector Machines. Support vector machines (SVMs), also known as maximum margin classifiers, determine the optimal hyperplane that maximizes the distance between the hyperplane and the support vectors [58]. Support vectors are the training vectors of each class that are nearest to the hyperplane [59]. Besides linearly separable data, an SVM can classify nonlinear data by transforming the data to a higher-dimensional space using an appropriate kernel function [49]. If the training set is $TS = \{(\mathbf{x}_1, L_1), (\mathbf{x}_2, L_2), \ldots, (\mathbf{x}_M, L_M)\}$ for a two-class problem, where $\mathbf{x}_i$ ($i = 1, 2, \ldots, M$) is the training data and $L_i \in \{-1, 1\}$ is the class label, the test vector $\mathbf{x}$ is classified according to the sign of the function given as
$$f(\mathbf{x}) = \sum_{i=1}^{M} \alpha_i L_i K(\mathbf{x}_i, \mathbf{x}) + b,$$
where $K(\cdot,\cdot)$ is the kernel function (the inner product $\mathbf{x}_i^T \mathbf{x}$ for a linear SVM), $\alpha_i$ ($i = 1, 2, \ldots, M$) are the nonzero quadratic coefficients, and $|b| / \|\mathbf{w}\|$ is the perpendicular distance between the hyperplane and the origin, whereas $\mathbf{w}$ is the normal vector of the separating hyperplane [48].

2.4.4. Logistic Linear Classifier. The logistic linear classifier (LLC) states that a linear hyperplane can be characterized by the relationship between the dependent and independent variables [60]. In LLC, this relationship is determined using a logistic regression analysis by computing the class-conditional probability density functions of the $\mathbf{x}$ vectors. The LLC model for a two-class problem is given by (16), where $p(\mathbf{x} \mid \omega_i)$, $\boldsymbol{\beta}$, and $\beta_0$ are the class-conditional probability density functions of $\mathbf{x}$, the weight vectors for the linear hyperplane, and a threshold value, respectively.
$$\log \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} = \boldsymbol{\beta}^T \mathbf{x} + \beta_0 \tag{16}$$
The LLC assumes that log-linear models can be formed between classes with equal prior probabilities and covariance matrices. This assumption is equivalent to
$$\log \frac{p(\omega_1 \mid \mathbf{x})}{p(\omega_2 \mid \mathbf{x})} = \boldsymbol{\beta}^T \mathbf{x} + \beta_0 + \log \frac{p(\omega_1)}{p(\omega_2)},$$
where $p(\omega_i \mid \mathbf{x})$ and $p(\omega_i)$ are the probability of class $\omega_i$ given $\mathbf{x}$ and the prior probability of class $\omega_i$, respectively. The decision criterion for LLC is given in
$$\mathbf{x} \in \begin{cases} \omega_1, & \log \dfrac{p(\omega_1 \mid \mathbf{x})}{p(\omega_2 \mid \mathbf{x})} > 0 \\ \omega_2, & \text{otherwise.} \end{cases}$$
2.4.5. Decision Tree. The principle of the decision tree classifier is to cluster the data into subgroups until all elements of each subgroup have the same class label [48,61]. In the training stage, classification rules are defined by clustering the data into the leaves, which correspond to class labels. In the test stage, those rules are applied to a test sample, and the leaf that the test sample reaches provides its class label.
2.4.6. Random Forest. The random forest classifier is an ensemble of decision tree classifiers developed to improve the classification accuracy [62]. Each tree classifier in this ensemble votes for the best class of any sample, and the resultant class label is then specified via a majority voting technique.
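2.4.7. Naïve Bayes. Bayesian classifiers compute the probability of each class given any test vector $\mathbf{x}$ and assign it to the class with the highest conditional probability [63]. The Bayesian decision criterion for a two-class problem is
$$\mathbf{x} \in \begin{cases} \omega_1, & P(\omega_1 \mid \mathbf{x}) > P(\omega_2 \mid \mathbf{x}) \\ \omega_2, & \text{otherwise.} \end{cases} \tag{21}$$
The terms $P(\omega_1 \mid \mathbf{x})$ and $P(\omega_2 \mid \mathbf{x})$ denote the posterior probabilities of classes $\omega_1$ and $\omega_2$ given $\mathbf{x}$, respectively, where $P(\omega_i \mid \mathbf{x})$ is computed as
$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}.$$
The terms $P(\omega_i)$, $p(\mathbf{x} \mid \omega_i)$, and $p(\mathbf{x})$ refer to the prior probability of class $\omega_i$, the probability of $\mathbf{x}$ given class $\omega_i$, and the probability density function of $\mathbf{x}$, respectively. One-dimensional and l-dimensional case computations of $p(\mathbf{x} \mid \omega_i)$ are given in (24) and (25), respectively.
$$p(x \mid \omega_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{24}$$
$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{l/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) \tag{25}$$
$\mu$, $\sigma^2$, and $\Sigma$ in these equations are the mean, variance, and covariance matrix of the feature vectors, respectively.
Naïve Bayes classifiers assume that all features (the components of a feature vector) are statistically independent and classify any test vector according to the Bayesian decision criterion given in (21) [63]. In this classification scheme, the probability density function for the l-dimensional case is computed as
$$p(\mathbf{x} \mid \omega_i) = \prod_{j=1}^{l} p(x_j \mid \omega_i).$$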
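A compact sketch of a Gaussian naïve Bayes classifier illustrates how the independence assumption turns the l-dimensional density into a product of one-dimensional Gaussians. The function names, the small variance floor, and the toy data are assumptions made for illustration; log-probabilities are used to avoid numerical underflow.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for every class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive (assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log-posterior (up to the constant p(x))."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            # Log of the one-dimensional Gaussian density, summed over features.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda y: log_posterior(*model[y]))

train_x = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
train_y = ["a", "a", "a", "b", "b", "b"]
nb_model = fit_gaussian_nb(train_x, train_y)
```

Calling `predict_gaussian_nb(nb_model, [1.1, 2.1])` returns the class whose feature-wise Gaussians best explain the sample, which is `"a"` for this toy data.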

2.4.8. k-Nearest Neighbors. The k-nearest neighbor (kNN) classifier assigns any test vector to the class to which most of its k nearest neighbors belong, considering the distances between the test and training vectors in the feature space [64]. Although classification performance is directly related to the parameter k, there is no definite rule for selecting k except that it should be positive and not a multiple of the total number of classes [48].
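The voting rule can be written in a few lines. The sketch below assumes Euclidean distance and a hypothetical function name; ties in the vote are resolved in favor of the class encountered first among the nearest neighbors, which is one possible convention rather than a rule stated in the text.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Assign x to the most frequent class among its k nearest training vectors."""
    # Sort training indices by Euclidean distance to the test vector.
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

knn_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
knn_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
```

With three classes, choosing k as a value that is not a multiple of three (for example k = 1, 2, 4, or 5) follows the guideline quoted above; k = 3 is used as the default here only for brevity on a two-class toy set.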

Evaluation Metrics. The metrics sensitivity (SNS), specificity (SPC), positive predictive value (PPV), negative predictive value (NPV), false-positive rate (FPR), false-negative rate (FNR), false discovery rate (FDR), false omission rate (FOR), and accuracy (ACC) are used for the evaluation of the performance of the CAD system in this study. The mathematical representations of these metrics are given in Table 3.
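For reference, the listed metrics can be computed from true/false positive and negative counts as sketched below. The formulas are the standard definitions and are assumed to match Table 3, which is not reproduced here; the helper names and the one-vs-rest reduction of a multiclass confusion matrix (rows as true classes, columns as predictions) are likewise assumptions.

```python
def one_vs_rest_counts(confusion, cls):
    """Reduce a square confusion matrix (rows: true, columns: predicted) to TP, FP, TN, FN."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(row[cls] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, fp, tn, fn

def binary_metrics(tp, fp, tn, fn):
    """Standard definitions of the nine evaluation metrics."""
    return {
        "SNS": tp / (tp + fn),
        "SPC": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "FDR": fp / (fp + tp),
        "FOR": fn / (fn + tn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }

# Made-up confusion matrix for illustration only (not results from the study).
toy_confusion = [[50, 3, 2],
                 [4, 40, 6],
                 [1, 5, 44]]
toy_metrics = binary_metrics(*one_vs_rest_counts(toy_confusion, 0))
```

Several metrics are complements of each other (FPR = 1 − SPC, FNR = 1 − SNS, FDR = 1 − PPV, FOR = 1 − NPV), which is a convenient consistency check.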

Results and Discussion
In this study, a CAD system for breast cancer diagnosis based on a multistage classification using a novel feature ensemble is proposed. The feature extraction stage is carried out on mammogram ROIs that are preprocessed by applying a histogram equalization followed by the NLM filtering. The proposed feature ensemble is formed by concatenating the LCP-based, statistical, and frequency-domain features. The classification process of these features is implemented in three different cases: one-stage study, two-stage study, and three-stage study. The mammogram ROIs are classified into three classes (normal, benign, and malignant) regardless of their breast tissue types in the one-stage study, while the two- and three-stage studies consider breast tissue information and make a health status classification as explained in the related subsections. Eight well-known classifiers (FLDA, LDC, linear SVM, LLC, decision tree, random forest, naïve Bayes, and kNN) are used in all of the classification cases. Additionally, the results of the classifiers that show the top three performances are combined via a majority voting technique in order to improve the recognition accuracy for both the two- and three-stage studies. The block diagram of the proposed system is given in Figure 1.

One-Stage Study.
In this case of the classification scheme, the feature vectors are directly classified into three classes (normal, benign, and malignant) regardless of the breast tissue types of the mammogram images. The flowchart for the one-stage study is shown in Figure 2. The average classification accuracies and standard deviations of the classifiers for the one-stage study obtained by the elevenfold cross-validation technique are shown in Figure 3. In this figure, "SVM ('p', 1)" is the SVM classifier using a linear kernel. The LLC classifier has the highest recognition accuracy (85.47%) among all classifiers. It assumes that logistic linear models can be formed between classes with equal prior probabilities. Hence, it is more applicable for the one-stage study than the other classifiers as the prior probabilities of each class in this case are equal. The total confusion matrix of the LLC classifier obtained by elevenfold cross-validation for the one-stage study is given in Table 4. It shows that benign and malignant mammograms are distinguishable from each other. The false recognitions are caused by the confusion of the benign and malignant mammograms with the normal mammograms.
The evaluation metrics of each classifier evaluated by elevenfold cross-validation for the one-stage study are given in Table 5.
The one-stage study is also carried out using three additional sets of feature vectors in order to demonstrate the discriminative power of the proposed 108-dimensional feature vector ensemble. These sets consist of 12-dimensional statistical feature vectors, 80-dimensional LCP-based feature vectors, and 92-dimensional feature vectors formed by concatenating the LCP-based and statistical features. The average classification accuracies of the classifiers for the one-stage study obtained by the elevenfold cross-validation technique using different feature vector sets are shown in Figure 4. It can be inferred from Figure 4 that classification accuracies are increased when 92-dimensional feature vectors are used rather than only statistical or only LCP-based features. Furthermore, 108-dimensional feature vectors provide higher recognition accuracies than the 92-dimensional feature vectors. These results support the effectiveness of the proposed feature ensemble.

Two-Stage Study.
The recognition accuracy for breast cancer diagnosis is expected to be enhanced by the two-stage study, which is composed of the breast tissue and health status classification. In the first stage of this study, the feature vectors are classified into breast tissue classes (fatty, fibroglandular, heterogeneously dense, and extremely dense). Then, the breast-tissue-type-defined feature vectors are classified into normal, benign, and malignant classes in the second stage. The flowchart for the two-stage study is shown in Figure 5.
The average classification accuracies and standard deviations of the classifiers obtained by the elevenfold cross-validation technique for the two-stage study are shown in Figure 6. A maximum accuracy rate of 87.51% is attained using the FLDA classifier among the eight well-known classifiers. For this case, the LLC classifier performs worse than the FLDA classifier as the prior probabilities of the classes are no longer equal.
As it can be explicitly inferred from Figure 6, the top three classifiers based on performance are the FLDA, LLC, and LDC. The results of these classifiers are combined via a majority voting technique to increase the classification accuracy to 88.79%.
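The combination step can be sketched as a per-sample majority vote over the label sequences produced by the three classifiers. The function name and the tie-breaking rule (falling back to the earliest-listed classifier when all three disagree) are assumptions, since the text does not state how ties are resolved; the labels below are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists; ties go to the earliest-listed classifier."""
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # Labels that reached the top count, in classifier order.
        tied = [label for label in votes if counts[label] == best]
        combined.append(tied[0])
    return combined

# Hypothetical outputs of three classifiers on four samples (N, B, M labels).
first = ["N", "B", "M", "N"]
second = ["N", "M", "M", "B"]
third = ["B", "M", "B", "M"]
combined = majority_vote([first, second, third])  # ["N", "M", "M", "N"]
```

Listing the best-performing classifier first means its decision is kept whenever the three classifiers disagree completely.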
The total confusion matrices of the (a) FLDA, (b) LLC, and (c) LDC classifiers obtained by elevenfold cross-validation for the two-stage study and the total confusion matrix of the classifier combination obtained by elevenfold cross-validation for the two-stage study are given in Tables 6 and 7, respectively. Similar results are obtained in the two-stage study as in the one-stage study. The confusion matrices in Tables 6 and 7 clearly show that the false negatives and false positives for both benign and malignant classes belong to the normal class. The terms N., B., and M. in Table 6 refer to the normal, benign, and malignant classes, respectively.
The evaluation metrics of each classifier and the classifier combination evaluated by elevenfold cross-validation for the two-stage study are given in Tables 8 and 9, respectively.

Three-Stage Study.
After the classification accuracies are enhanced by the two-stage study, the authors propose a three-stage study for further improvement. The three-stage study consists of both breast tissue and health status classification, where the health status classification is achieved through two consecutive stages. In the first stage of this study, the feature vectors are classified into breast tissue classes similar to those in the two-stage study. The breast-tissue-type-defined feature vectors are then categorized into normal and abnormal classes in the second stage. Finally, in the last stage, the feature vectors labeled as abnormal are categorized into benign and malignant classes. The flowchart for the three-stage study is illustrated in Figure 7.
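The control flow of the three stages can be sketched as a simple cascade. This is a schematic under assumptions: the function name is hypothetical, one second-stage and one third-stage classifier is assumed per breast tissue type, and the threshold rules below are toy stand-ins for trained classifiers.

```python
def three_stage_predict(x, tissue_clf, abnormality_clfs, malignancy_clfs):
    """Stage 1: tissue type; stage 2: normal vs abnormal; stage 3: benign vs malignant."""
    tissue = tissue_clf(x)
    if abnormality_clfs[tissue](x) == "normal":
        return tissue, "normal"
    return tissue, malignancy_clfs[tissue](x)

# Toy stand-in classifiers operating on a three-element vector (illustration only).
def toy_tissue(x):
    return "fatty" if x[0] < 0.5 else "extremely dense"

def toy_abnormality(x):
    return "abnormal" if x[1] > 0.5 else "normal"

def toy_malignancy(x):
    return "malignant" if x[2] > 0.5 else "benign"

tissue_types = ("fatty", "extremely dense")
abnormality_clfs = {t: toy_abnormality for t in tissue_types}
malignancy_clfs = {t: toy_malignancy for t in tissue_types}

result = three_stage_predict([0.8, 0.9, 0.9], toy_tissue, abnormality_clfs, malignancy_clfs)
```

The two-stage study corresponds to replacing the last two stages with a single three-class classifier per tissue type.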
The average classification accuracies and standard deviations of the eight classifiers obtained by the elevenfold cross-validation technique for the three-stage study are shown in Figure 8. The FLDA has the best classification performance with a maximum accuracy rate of 93.29% among all classifiers. In this case, as in the two-stage study, the prior probabilities of the classes are not equal, and the classification success of the LLC classifier is lower than that of the FLDA and LDC classifiers.
The total confusion matrices of the (a) FLDA, (b) LDC, and (c) LLC classifiers obtained by elevenfold cross-validation for the three-stage study, and the total confusion matrix of the classifier combination obtained by elevenfold cross-validation for the three-stage study are given in Tables 10 and 11, respectively. In the three-stage study, as seen in the tables, mammograms in the normal and benign classes are not perfectly separable from each other, while malignant mammograms are clearly distinguished from the normal and benign classes. The terms N., B., and M. in Table 10 stand for the normal, benign, and malignant classes, respectively.
A careful examination of Figure 8 shows that, as in the two-stage study, the FLDA, LDC, and LLC classifiers are the best three classifiers in terms of recognition accuracy. The results of these classifiers are combined via majority voting, which increases the classification performance to 93.52%.
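The combination of the three best classifiers by majority voting can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name and the example label lists are hypothetical, and the rule applied when all three classifiers disagree (fall back to the first classifier) is an assumption, since the text does not state one.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label lists into one label per sample.

    predictions[k][i] is the label that classifier k assigns to sample i.
    If no label receives more than one vote, the first classifier's
    label is used (an assumed rule, not taken from the paper).
    """
    combined = []
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        # With three voters, a count of 1 means all three disagree.
        combined.append(label if count > 1 else votes[0])
    return combined


# Hypothetical outputs of three classifiers for four samples (N., B., M.).
flda = ["N", "B", "M", "B"]
ldc = ["N", "N", "M", "M"]
llc = ["B", "N", "M", "N"]

voted = majority_vote([flda, ldc, llc])
```

Listing the most accurate classifier first makes the assumed fallback rule defer to it whenever the three outputs are all different.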
The evaluation metrics of each classifier and of the classifier combination, evaluated by elevenfold cross-validation for the three-stage study, are given in Tables 12 and 13, respectively.

3.2. Discussion.
The proposed feature ensemble is formed by concatenating the LCP-based, statistical, and frequency-domain features. The LCP algorithm has been used on its own in several image processing applications. The motivation for using the LCP algorithm for feature extraction is that it decomposes the information existing in breast mammogram images. Moreover, the LCP features include pixel-wise relationships. As it covers relatively few relationships among pixels in a breast mammogram image, the LCP is used as the fundamental feature extraction method to explore the underlying information in an image. However, the LCP features are not completely adequate to classify mammogram parts efficiently because they can be affected by various issues. Therefore, using the LCP alone does not yield the most representative features for a mammogram. For this reason, twelve statistical features are calculated from the LCP features. The positive impact of statistical features extracted directly from the image texture on classification success is already known [52]. In addition, the LCP feature vectors extracted from breast mammograms have been shown to be successfully discriminative features [5]. Hence, in this study, the statistical features are obtained from the LCP feature vectors rather than directly from the mammogram image pixel matrices. Moreover, 16 frequency-domain features are computed and appended to the other two types of features (LCP-based and statistical features). Since brightness is one of the most significant issues for breast cancer diagnosis and the variations of brightness in a mammogram image can be clearly observed in the spatial domain, it is assumed in this study that frequency-domain features are also representative of mammograms.
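The concatenation that forms the feature ensemble can be sketched as follows. The dimensions (80 LCP-based, 12 statistical, and 16 frequency-domain features) are those stated in this paper; the function name is hypothetical, and the zero-filled vectors are placeholders rather than real features, since the sketch only shows how the three parts are joined.

```python
from typing import List, Sequence

# Dimensions as stated in the paper; everything else here is illustrative.
LCP_DIM, STAT_DIM, FREQ_DIM = 80, 12, 16


def build_feature_ensemble(
    lcp: Sequence[float],
    stat: Sequence[float],
    freq: Sequence[float],
) -> List[float]:
    """Concatenate the three feature groups into one feature vector."""
    parts = (("LCP-based", lcp, LCP_DIM),
             ("statistical", stat, STAT_DIM),
             ("frequency-domain", freq, FREQ_DIM))
    for name, vec, dim in parts:
        # Guard against a group with an unexpected length.
        if len(vec) != dim:
            raise ValueError(f"{name} features: expected {dim}, got {len(vec)}")
    return [*lcp, *stat, *freq]


# Placeholder vectors (zeros), standing in for features of one ROI image.
ensemble = build_feature_ensemble([0.0] * LCP_DIM, [0.0] * STAT_DIM, [0.0] * FREQ_DIM)
```

Under these stated dimensions, each mammogram ROI is represented by a single 108-dimensional vector.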
Ultimately, this method constructs feature vectors that have more representative power and are more robust to numerous effects.
Additionally, a multistage classification scheme is proposed in this study. It consists of three cases: the one-stage study, the two-stage study, and the three-stage study. In the one-stage study, the feature vectors are classified according to their health status only, regardless of the breast tissue type of the mammograms. The standard deviation values for the one-stage study are high because some folds in the cross-validation process provide high recognition accuracies while other folds give much lower classification accuracies. This clearly implies that the accuracy results of the one-stage study depend directly on the mammogram parts used in the train/test separation of each fold. If a test set includes parts that are more similar to those in the corresponding train set, the accuracy rises sharply. Conversely, if the similarities between the test and train sets are weak, the classification fails. This shows that the one-stage study does not give trustworthy accuracy results. To avoid the high standard deviation problem and increase the classification accuracy rates, the two-stage study is implemented. In the two-stage study, breast tissue classification and health status classification are performed consecutively. In this way, the breast tissue types of mammograms are taken into consideration, so that a more reliable classification is achieved. The trustworthiness of recognition can be inferred by examining the standard deviation values for each classifier. These values are much lower than those obtained in the one-stage study. Therefore, the accuracy results of the two-stage study do not depend on the mammogram parts used in the train/test separation of each fold, and the cross-validation process gives more reliable accuracy rates.
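The reliability argument above rests on the spread of the per-fold accuracies. As a minimal sketch, the mean and standard deviation over the eleven folds could be computed as follows. The function name is hypothetical, the use of the sample standard deviation is an assumption, and the accuracy values are made-up numbers chosen only to contrast a fold-dependent case with a stable one; they are not results from this study.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_folds(fold_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of per-fold accuracies."""
    return mean(fold_accuracies), stdev(fold_accuracies)


# Hypothetical per-fold accuracies (%) for eleven folds; NOT results from the study.
unstable = [62.0, 95.0, 70.0, 91.0, 58.0, 88.0, 66.0, 93.0, 73.0, 90.0, 64.0]
stable = [92.0, 94.0, 93.0, 95.0, 92.0, 94.0, 93.0, 94.0, 93.0, 95.0, 92.0]

unstable_mean, unstable_std = summarize_folds(unstable)
stable_mean, stable_std = summarize_folds(stable)
```

A large standard deviation, as in the first list, indicates that the measured accuracy depends strongly on which samples fall into the test fold, which is the behavior attributed to the one-stage study.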
Finally, the three-stage study considers both breast tissue and health status classification as the two-stage study does, except that the health status classification is realized through two consecutive stages. In this way, the lowest standard deviation values are obtained, especially for the classifiers that give higher recognition accuracies. This outcome shows that the three-stage study not only performs the most reliable classification but also is independent of the mammogram parts used in the training and test sets of each cross-validation fold. Moreover, the highest accuracies are achieved in the three-stage case. Thus, if both success and reliability are considered at the same time in this classification problem, the three-stage case provides both simultaneously. In [5], the mammogram parts of the fatty breast tissue type in the IRMA database are classified using only LCP-based feature vectors, and a maximum recognition accuracy of 90.60% is attained. With the proposed feature ensemble and multistage classification, this accuracy is increased to 93.52% for all tissue types rather than for only one breast tissue type. This result shows that the new feature ensemble is more representative than an LCP-based feature vector by itself, and that the proposed multistage classification scheme is more successful and reliable than a single-stage classification for breast cancer diagnosis. The comparison of the proposed study with other studies in the literature is given in Table 14.

Conclusion
Breast cancer is the second major cause of female deaths resulting from cancer worldwide. Although there is no known way to prevent breast cancer, mortality can be reduced with early diagnosis. Therefore, computer-aided diagnosis (CAD) systems are very important, as they allow radiologists to reconsider mammogram images with increased sensitivity of detection and diagnosis. In this study, a multistage classification scheme using a novel and discriminative feature ensemble is proposed for implementation in a CAD system for breast cancer diagnosis. The proposed system is verified using the IRMA database. This database includes all twelve classes defined by BI-RADS, namely three different health status cases for each of four different breast tissue types. The proposed feature ensemble is formed by concatenating the 80-dimensional LCP-based features obtained from the one-level two-dimensional discrete wavelet transform of the preprocessed mammogram images, the 12-dimensional statistical features computed from the LCP-based features, and the 16-dimensional frequency-domain features calculated from the two-level two-dimensional discrete wavelet transform of the preprocessed mammogram images. The multistage classification scheme is presented in three cases: the one-stage study, the two-stage study, and the three-stage study. In the one-stage study, the feature vectors are classified directly according to their health status. In the two-stage study, the breast tissue type is determined in the first stage, and the health status classification is then executed for each breast tissue type. The three-stage study also considers both breast tissue and health status; however, in this case, the health status classification is performed in two consecutive stages, where the normal and abnormal mammograms are determined first, and the mammograms labeled as abnormal are then classified as benign or malignant.
The maximum recognition accuracy of the proposed system is obtained in the three-stage study. These results indicate that the three-stage study is very effective for a CAD system and can help radiologists make more accurate breast cancer diagnoses.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.