Molecular Subtypes Recognition of Breast Cancer in Dynamic Contrast-Enhanced Breast Magnetic Resonance Imaging Phenotypes from Radiomics Data

Background and Objective. Breast cancer is a major cause of mortality among women when it is not treated at an early stage. Recognizing molecular markers directly from DCE-MRI to distinguish the four molecular subtypes without invasive biopsy helps guide treatment planning for breast cancer, because it offers a fast route to treatment decisions at an early stage, when patients have the best opportunity to benefit. Methods. This study presents a radiomics approach to recognizing molecular subtypes from breast cancer image phenotypes. An improved region growth algorithm with a dynamic threshold and no user interaction is proposed for lesion segmentation; it yields the precise border of the lesion rather than an area that includes background. The lesions are extracted automatically on the basis of radiologists' annotations, which ensures that the correct lesion is segmented. Texture, morphological, dynamic kinetic, and statistical features are extracted from the lesion data of a large patient cohort and are used to validate the relationship between image phenotypes and molecular subtypes. A new multimodel-based recursive feature elimination algorithm is applied to the radiomics data produced by feature extraction. This method obtains a feature subset that performs stably across different classification models, and the gradient boosting decision tree model gives the best results on the molecular subtypes in terms of both classification performance and handling of class imbalance. Results. In the experiments, the multimodel-based recursive feature elimination algorithm selects 69 optimal features from the 143 original features, and the gradient boosting decision tree classifier achieves an accuracy of 0.87, a precision of 0.88, a recall of 0.87, and an F1-score of 0.87.
The dataset of 637 patients used in this paper is seriously imbalanced across the molecular subtypes, and the robust features generated by the multimodel-based recursive feature elimination algorithm allow the gradient boosting decision tree classifier to perform well despite this imbalance. The recognition precision for the four molecular subtypes luminal A, luminal B, HER-2, and basal-like is 0.91, 0.89, 0.83, and 0.87, respectively. Conclusions. The improved lesion segmentation method gives a more precise lesion edge; it saves time by extracting the lesion region of interest automatically without a per-case threshold setting, and it avoids the errors of manual segmentation and the bias of individual radiologists. The multimodel-based recursive feature elimination algorithm is able to find robust and optimal features that distinguish the four molecular subtypes from image phenotypes. Among the models used in this paper, the gradient boosting decision tree classifier plays the main role in recognition.


Introduction
Breast cancer is a major cause of mortality among women when it is not treated at an early stage. Early screening and diagnosis strongly influence the therapeutic effect and the prognosis. For noninvasive diagnosis, different imaging modalities can be used, such as mammography (molybdenum target X-ray), MRI, and ultrasound. Dynamic contrast-enhanced breast magnetic resonance imaging (DCE-MRI) is one of the best imaging techniques because it provides temporal information about the kinetics of the contrast agent in suspicious lesions together with acceptable spatial resolution. Recognizing molecular markers from DCE-MRI is helpful for guiding treatment plans for breast cancer.
The four molecular subtypes of breast cancer analyzed in this paper are luminal A, luminal B, human epidermal growth factor receptor-2 (HER-2) over-expressing, and basal-like. However, tumor heterogeneity in cancers has been observed at the histological and genetic levels, and increased levels of intratumor genetic heterogeneity have been reported to be associated with adverse clinical outcomes [1]. Breast tumor structure contains a high degree of heterogeneity. This heterogeneity has been correlated with the level of tumor response to neoadjuvant chemotherapy [2]. Over the past decade, the use and role of medical imaging technologies in clinical oncology have greatly expanded from primarily a diagnostic tool to a more central role in the context of individualized medicine [3]. Radiomics refers to the high-throughput extraction and analysis of large amounts of advanced quantitative imaging features from medical images obtained with computed tomography, positron emission tomography, or magnetic resonance imaging [4]. Radiomic studies can be used to understand relationships between imaging characteristics of tumors, such as heterogeneity, and their genetic characteristics, phenotype, or expected treatment outcome [5]. These data are combined with other patient data and are mined with sophisticated bioinformatics tools to develop models that may potentially improve diagnostic, prognostic, and predictive accuracy [6]. The radiomics methodology can be divided into five distinct steps: image acquisition, image segmentation and rendering, feature extraction and feature qualification, databases and data sharing, and eventual ad hoc informatics analysis [4]. In this paper, we investigate the role of integrating the contrast agent kinetic heterogeneity features derived from breast dynamic contrast-enhanced magnetic resonance imaging with clinical features from patient medical records for predicting molecular subtypes.
The computerized quantitative image analysis in this paper includes precise breast lesion segmentation, extraction of phenotypes and clinical symptoms, molecular subtype prediction modeling, and leave-one-case-out cross validation. A total of 637 patients from one institution, all confirmed by pathological examination, are used for discovery and external validation. The primary goal of this paper is to develop an automated DCE-MRI-based lesion recognition method that distinguishes the four molecular subtypes, which is helpful for the subsequent treatment plan decision.
This work operates on the original lesion data and thus goes a step further than the intratumoral and peritumoral segmentation reported in [7,8], in which a specialist marked the boundary contour of the lesion manually. Different specialists hold different personal biases about the location or boundary of a tumor. Moreover, in [9] image patches containing the lesions are used, so the prediction model is built on both lesion and background data. In this paper, an automated segmentation method is used to extract the precise boundary of the tumor. The major difference of the current work is the integration of higher-level visual features and dynamic features computed on the actual lesion area from a larger patient cohort, combined with multiple classifiers for feature validation. This differs from the methods of Banaie et al. [10] and Fan et al. [11], in which kinetic features such as the ktrans and kep parameters extracted from 26 patients, and texture features from 173 patients, are validated by logistic regression without feature selection. The imbalance problem in those datasets is ignored when a single classifier is used, although the incidence of the different molecular subtypes is known to differ greatly. In this work, we use radiomics features to distinguish all four molecular subtypes rather than a subset of classes, as in the work on luminal A and B in [9] or the work on luminal A versus other types in [11] by deep learning. These fused features for the four subtypes allow not only characterization of cancer morphology, but also depiction of heterogeneity between imaging phenotypes and molecular subtypes of breast cancer. The workflow of the presented method is depicted in Figure 1. An improved region growth segmentation algorithm is applied to the lesion images. Different types of radiomics features are extracted from the tumor data. Feature selection by a cascade validation method is conducted on the radiomics features.
A large patient cohort collected from one institution is used for model training and testing. The main contributions of this work are as follows: (i) An improved region growth algorithm with dynamic threshold setting is proposed for segmenting the precise boundary of the lesion; it saves time by extracting the lesion region of interest automatically without a per-case threshold setting, and it avoids the errors of manual segmentation and the bias of individual radiologists. (ii) Static visual features of texture, morphology, and statistics on the lesion, dynamic kinetic features, and clinical features are extracted to validate the relationship between image phenotypes and the molecular subtypes; to the best of our knowledge, this is carried out on the largest patient cohort among recent work. (iii) A recursive feature elimination method based on multiple models is used to select useful features for the prediction model, with attention to the imbalance problem of the dataset. The classification model based on DCE-MRI data achieves noninvasive recognition of molecular subtypes, which improves the diagnostic efficiency of breast cancer. The rest of this paper is organized as follows. Section 2 discusses previous related work. Section 3 describes the details of the method. The experimental results and discussion are presented in Section 4. Finally, concluding remarks are given in Section 5.

Related Work
The development of automated and reproducible analysis methodologies to extract more information from image-based features is a requirement [3]. Radiomics refers to the high-throughput extraction and analysis of large amounts of advanced quantitative imaging features from medical images, which leads to a very large potential subject pool [4]. Many visual features are extracted to quantify tumor image intensity, shape, and texture, which are associated with underlying gene-expression patterns [5,6,12,13]. Combining the medical characteristics and clinical recognition of lung tumors, Wang et al. presented a radiomic analysis of 150 features to build a prediction model for discriminating malignant from benign lung tumors [14]. It is also feasible to use a radiomics approach to decode normal liver features and predict treatment-associated liver injury [15] and to differentiate malignant nodules from benign ones [16].
DCE-MRI is one of the best imaging techniques for providing temporal information about the kinetics of the contrast agent, and in recent years it has been used to predict complete pathological response to neoadjuvant chemotherapy [7,8,17-19] and the risk of breast cancer recurrence [20-23]. Tumors exhibit genomic and phenotypic heterogeneity, which has prognostic significance and may influence response to therapy [1,24]. Burgeoning genetic, epigenetic, and phenomenological data support the existence of intratumor genetic heterogeneity in breast cancers [2,25,26].
Banaie et al. proposed a method to help physicians determine the likelihood of malignancy in breast cancer using DCE-MRI without biopsy [10]. Quantitative radiomics of breast cancer may enable precision medicine by differentiating the luminal A and luminal B molecular subtypes [9,27]. Three different deep learning approaches were used to classify tumors according to their molecular subtypes. Computer-extracted image phenotypes as well as dynamic features from tumor and background parenchymal enhancement were used to determine DCE-MRI characteristics that discriminate among the four molecular subtypes of breast cancer [11,28-31]. Deep learning on MRI datasets with convolutional neural networks may also play a role in discovering radiogenomic associations in breast cancer [32,33]. The dataset used in this paper contains DCE-MRI image data and the gold standard from pathology. First, a variety of radiomics features are extracted from lesion data that are accurately segmented by an improved region growth algorithm, and feature selection is carried out automatically by a recursive feature elimination optimization method rather than by manual selection. Second, the dataset contains a comprehensive range of molecular types, and the imbalance among the molecular subtypes is considered in the predictive model, in contrast to the small datasets and partial category recognition studied in existing research.

Methodology
The data collected from a hospital for this paper are all cases with malignant lesions confirmed by histopathology. Generally, the edge of a malignant lesion is not clear, and background enhancement in the image makes it difficult to extract the edge of the lesion area accurately. Without an accurate lesion area, however, it is difficult to obtain good features for image phenotypes. Annotating the full area of each lesion is time-consuming work, and the labeling results from different radiologists may be quite inconsistent. Therefore, in this paper, experienced radiologists only marked the approximate location of each lesion in the images. Then an improved region growth algorithm is used to extract the edges of the lesions automatically. Based on the extracted lesion regions, 142 image features are extracted, including texture features, morphological features, statistical features, and dynamic enhancement characteristics. Feature selection is performed using the multimodel-based recursive feature elimination (mmRFE) method. Unlike traditional RFE with a single model, the mmRFE method considers the ranking of each feature in each model. The models used in mmRFE in this paper are logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT). Different classifiers differ in how well they recognize molecular subtypes in patient cohort data that are imbalanced across the four subtypes. The mmRFE method finds features that are robust for all four subtypes and gives a better classification effect than feature selection with a single model.

Lesion Segmentation.
Breast lesions are relatively small, so radiomics features extracted from the entire image would be of little use. Therefore, the lesion areas are generally segmented first, and the features are then extracted from them.
There are generally three ways to extract lesions: automatic segmentation, manual segmentation, and interactive segmentation [34]. Automatic segmentation requires no human intervention and is carried out entirely by the algorithm; it is also the focus of current research. However, this method is often inaccurate for complex image objects. Manual segmentation usually requires the assistance of an experienced operator, which is time-consuming and inaccurate for irregular images. Interactive segmentation first finds the approximate location of the region of interest (ROI) and marks it with a rectangular box, which involves less human intervention and gives a good segmentation effect on complex images. This paper presents an interactive segmentation method for breast lesions. The breast lesions are marked by two radiologists with 10 and 15 years of experience, respectively.
The lesion inside the marked ROI is a connected area with similar gray levels. The enhancement pattern of breast lesions is mostly internal interval enhancement, for which the region growth (RG) algorithm gives a good segmentation effect. The region growth algorithm has two important influencing factors, namely, the selection of the seed point and the definition of the growth criterion. If the seed point is not selected properly, the segmentation result may differ greatly from the original target, or may even cover a wrong part of the image instead of the target. As the lesions are labeled by the radiologists, the centroid of the ROI is used as the seed point in this paper.
Once the seed point in the target area is obtained, the surrounding connected pixels that satisfy the growth criterion are added to the target area one by one, and growth ends when no connected pixels satisfying the criterion remain. The DCE-MRI images are grayscale images, so the growth criterion is simply a preset threshold (T) that the pixel value must stay below. Different growth thresholds lead to clearly different segmentation results, as shown in Figure 2 (T = 20, 30, 40, 50). The differences between the results segmented with different thresholds are obvious. Figure 2 lists two types of lesion ROIs. The first, in Figure 2(a), has a more regular shape, and the other, in Figure 2(b), is more irregular in shape and has more burrs. In this paper, the growth threshold is determined dynamically by the Otsu method rather than set manually [35]. The results generated by our method are shown in Figure 2 (ours). Although the ROI in Figure 2(b) is more irregular and burred, the experimental result shows that the improved algorithm still performs well. The improved region growth algorithm not only reduces human participation, but also saves time, which makes the ROI segmentation more automated. The later feature extraction task is performed on precise lesions rather than on lesions with background, as is common in existing work. The lesion segmentation results are evaluated by the Dice coefficient, which is a set similarity measure, as shown in formula (1): $S = 2|X \cap Y| / (|X| + |Y|)$. Here X represents the pixel set of the segmented lesion and Y represents the actual set of lesion pixels, where every pixel is represented by its coordinates.
The Dice coefficient represents the proportion of overlap between the two sets, that is, the share of pixels that are segmented correctly. S = 1 indicates that X and Y coincide fully and the segmentation accuracy is 100%. S = 0 indicates that the segmentation result is totally wrong.
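The Dice coefficient described above can be sketched as follows, assuming the standard definition S = 2|X ∩ Y| / (|X| + |Y|) and representing each pixel as a (row, column) coordinate pair as in the text. The function name and the toy coordinate sets are illustrative and do not come from the paper.

```python
def dice_coefficient(segmented, reference):
    """Dice similarity between two collections of (row, col) pixel coordinates."""
    segmented, reference = set(segmented), set(reference)
    total = len(segmented) + len(reference)
    if total == 0:
        return 1.0  # convention chosen for this sketch: two empty sets coincide
    return 2.0 * len(segmented & reference) / total


# Toy example: a 2 x 2 square versus the same square shifted by one column.
X = {(0, 0), (0, 1), (1, 0), (1, 1)}
Y = {(0, 1), (0, 2), (1, 1), (1, 2)}
dice_xy = dice_coefficient(X, Y)  # two shared pixels out of 4 + 4
```

With two of the four pixels shared, the toy example gives S = 2·2 / (4 + 4) = 0.5.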
To verify the accuracy of the lesion segmentation in this paper, the complete borders of the two lesions are drawn by hand by the radiologist, as shown in Figure 2 (source). The yellow curves are those drawn manually by the radiologist. For comparison, the traditional region growth algorithm with different thresholds and our method are both applied. It can be seen that the result for T = 20 clearly differs from the lesion, and the result for T = 50 is clearly oversegmented. Therefore, the Dice coefficients of three thresholds (T = 30, 35, 40) and of our algorithm are evaluated, and the results are shown in Table 1.
As the evaluation results in Table 1 show, the threshold of the traditional RG algorithm cannot be determined automatically. The right segmentation threshold must be found for each case, which is hard work for a large dataset. In contrast, the results are greatly improved by our method, which searches for the threshold dynamically without human interaction.
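A minimal sketch of region growth with an Otsu-derived threshold is given below. The text does not spell out how the Otsu value enters the growth criterion, so the sketch assumes that a 4-connected neighbor is added when its gray level is at or above the Otsu threshold computed on the ROI, and that the seed is the center of the ROI box. The function names `otsu_threshold` and `region_grow` and the toy ROI are illustrative only.

```python
from collections import deque

import numpy as np


def otsu_threshold(image, bins=256):
    """Gray level that maximizes the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(image, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_lo = np.cumsum(hist).astype(float)   # pixels at or below each bin
    weight_hi = weight_lo[-1] - weight_lo       # pixels above each bin
    cum_sum = np.cumsum(hist * centers)
    mean_lo = cum_sum / np.maximum(weight_lo, 1)
    mean_hi = (cum_sum[-1] - cum_sum) / np.maximum(weight_hi, 1)
    between = weight_lo * weight_hi * (mean_lo - mean_hi) ** 2
    return centers[np.argmax(between)]


def region_grow(roi, seed=None):
    """Grow a 4-connected region from the seed over pixels >= the Otsu threshold."""
    roi = np.asarray(roi, dtype=float)
    if seed is None:
        seed = (roi.shape[0] // 2, roi.shape[1] // 2)  # assumed: center of the ROI box
    threshold = otsu_threshold(roi)
    mask = np.zeros(roi.shape, dtype=bool)
    if roi[seed] < threshold:
        return mask  # seed is not on bright tissue, nothing to grow
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < roi.shape[0] and 0 <= nc < roi.shape[1]
            if inside and not mask[nr, nc] and roi[nr, nc] >= threshold:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask


# Toy ROI: dark background, a bright 3 x 3 "lesion", and one disconnected bright pixel.
roi = np.full((9, 9), 10.0)
roi[3:6, 3:6] = 200.0
roi[0, 0] = 200.0
lesion_mask = region_grow(roi)
```

Because growth only follows connected pixels, the isolated bright pixel in the corner is left out of the mask even though it passes the threshold.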

Feature Extraction.
Once the lesions are segmented from the DCE-MR images, radiomics features are extracted for molecular subtype recognition. These features are a quantitative expression of the image information, from which effective imaging features can be found. Effective features are important for the correct classification of breast cancer molecular subtypes. Breast cancer lesions are highly heterogeneous. This characteristic, as presented in DCE-MRI images, is quantified by textures in this paper. At the same time, the internal density of different areas in the lesion changes over time, and this characteristic is captured by kinetic parameters. The radiomics features designed in this paper include texture features, morphological features, statistical features, and kinetic features.

Texture Features.
Texture is a visual feature that reflects the arrangement properties of the surface organization of objects. Different tissues within the human body exhibit different textures in imaging examinations, and the same tissue exhibits different textures in a healthy area and in a lesion [36]. An image area has an invariant texture if a series of statistical or other characteristics of the image are fixed, slowly changing, or approximately constant [37,38].
According to the characteristics of the lesion, the texture features of breast cancer are extracted with the gray-level co-occurrence matrix (GLCM) and the local binary pattern (LBP), respectively.
(i) The GLCM is calculated from pairs of pixel gray levels i and j and represents the probability of the pair (i, j) appearing at a given spatial distance and direction; all calculated results can be represented in the form of a matrix. This paper takes the directions 0°, 45°, 90°, and 135°; that is, the GLCM is constructed in these four directions, and the statistical characteristics of energy, entropy, deficit matrix, contrast, and correlation are computed on the three time phases in each direction [39].
(ii) LBP is an operator that characterizes local textures and is also used for texture feature extraction. The feature is often used in conjunction with the histogram of oriented gradient (HOG) feature classifier to improve the detection effect on some datasets [40-42]. The LBP mask used in this paper is a 3 × 3 matrix. If the value of a neighbor pixel is greater than the center pixel value, the value at its location is set to 1; otherwise, it is set to 0. This process forms a binary sequence of length 8, and the value of this sequence read as a binary number is regarded as the LBP value. The computation for a pixel (x, y) is shown in formula (2): $\mathrm{LBP}(x, y) = \sum_{p=0}^{7} s(g_p - g_c)\, 2^p$, with $s(u) = 1$ if $u > 0$ and $s(u) = 0$ otherwise, where $g_c$ is the center pixel value and $g_p$ is the value of the p-th neighbor pixel.
(iii) The LBP matrix is computed by applying the formula to all pixels of the image, and the histogram is then extracted from the LBP matrix.
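The GLCM statistics and the LBP histogram described in this list can be sketched with NumPy as follows. Several points are assumptions rather than details from the paper: the co-occurrence offset is one pixel, the "deficit matrix" is taken to be the inverse difference moment, correlation uses the usual Haralick form, and the bit order of the eight LBP neighbors is arbitrary. All function names are illustrative.

```python
import numpy as np

# Pixel offsets (row, col) for the 0, 45, 90, and 135 degree directions at distance 1.
DIRECTIONS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, dr, dc, levels):
    """Normalized co-occurrence matrix of an integer image for the offset (dr, dc)."""
    counts = np.zeros((levels, levels), dtype=float)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def glcm_features(p):
    """Energy, entropy, contrast, inverse difference moment, and correlation."""
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * p).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * p).sum())
    nonzero = p[p > 0]
    covariance = ((i - mu_i) * (j - mu_j) * p).sum()
    return {
        "energy": float((p ** 2).sum()),
        "entropy": float(-(nonzero * np.log2(nonzero)).sum()),
        "contrast": float((((i - j) ** 2) * p).sum()),
        "inverse_difference_moment": float((p / (1.0 + (i - j) ** 2)).sum()),
        "correlation": float(covariance / (sd_i * sd_j)) if sd_i * sd_j > 0 else 0.0,
    }


def lbp_histogram(image):
    """256-bin histogram of 3 x 3 LBP codes over the interior pixels."""
    image = np.asarray(image)
    rows, cols = image.shape
    center = image[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = image[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        codes += (neighbour > center).astype(np.int64) << bit  # bit is 1 if neighbour > center
    return np.bincount(codes.ravel(), minlength=256)


# Small 4-level test image and its 0-degree GLCM features.
toy = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
p0 = glcm(toy, *DIRECTIONS[0], levels=4)
features0 = glcm_features(p0)

# A single dark pixel surrounded by brighter neighbors yields the code 255.
dark_center = np.ones((3, 3), dtype=int)
dark_center[1, 1] = 0
hist_dark = lbp_histogram(dark_center)
```

In practice the gray levels of a lesion would be quantized to a small number of levels before the GLCM is built, which keeps the matrix dense enough for the statistics to be stable.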

Morphological Features.
When part of a tissue becomes a malignant lesion, the change is usually accompanied by morphological changes. For example, benign lesions of the breast are mostly lumpy with smooth edges, while malignant lesions are morphologically more varied. Some malignant lesions are lumpy with irregular edges; others are diffuse with no obvious edge. A malignant tumor is surrounded by abundant blood vessels and is strongly aggressive [43]. The BI-RADS standard divides the morphology of breast lesions into three types: mass, nonmass, and point-like [44]. Masses are divided into round, elliptical, and irregular shapes. The distribution of nonmass lesions is more diffuse and multiregional. Point-like lesions are usually less than 5 mm in diameter and are not easily detected on enhanced images. The morphological features of breast DCE-MRI images in this paper mainly follow the morphological features used in the study of mammography (molybdenum target) images, which include the standardized radial length mean and standard deviation, compactness, roughness, smoothness, roundness, and area [45].
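Two of the morphological features listed above, the mean and standard deviation of the standardized radial length, can be sketched as follows together with the area. Definitions vary between studies, so this sketch assumes that the radial length is the distance from the lesion centroid to each boundary pixel, normalized by the largest such distance, and that boundary pixels are lesion pixels with at least one 4-connected neighbor outside the lesion. The function name and the test shapes are illustrative.

```python
import numpy as np


def radial_length_features(mask):
    """Area and normalized radial length statistics of a binary lesion mask."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask contains no lesion pixels")
    padded = np.pad(mask, 1, constant_values=False)
    # A lesion pixel is interior if all four 4-connected neighbors are lesion pixels.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask & ~interior
    rows, cols = np.nonzero(mask)
    centroid_r, centroid_c = rows.mean(), cols.mean()
    b_rows, b_cols = np.nonzero(boundary)
    radial = np.hypot(b_rows - centroid_r, b_cols - centroid_c)
    longest = radial.max()
    normalized = radial / longest if longest > 0 else np.zeros_like(radial)
    return {
        "area": int(mask.sum()),
        "nrl_mean": float(normalized.mean()),
        "nrl_std": float(normalized.std()),
    }


# A disk (regular mass) versus a thin bar (elongated shape).
yy, xx = np.mgrid[:31, :31]
disk = (yy - 15) ** 2 + (xx - 15) ** 2 <= 100
bar = np.zeros((9, 31), dtype=bool)
bar[3:6, 5:26] = True
disk_features = radial_length_features(disk)
bar_features = radial_length_features(bar)
```

A round mass has boundary pixels at nearly the same distance from the centroid, so its radial length standard deviation is small, while an elongated or irregular shape spreads the distances and raises it.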

Kinetics Features.
The dynamic enhancement characteristic reflects the metabolism of the contrast agent in the lesion area; it provides hemodynamic information about the lesion and shows the signal change of the lesion or of normal tissue across the enhancement phases (8 phases in this paper) [46,47]. The features are extracted with both the whole lesion and the single pixel as objects of study.
First, the radiomics features extracted from the whole lesion include the lesion enhancement rate and the absorption rate. The first phase in DCE-MRI is the normal status without the contrast agent. The other phases are acquired when the lesion is enhanced, so that its pixel gray levels are relatively high. The lesion enhancement rate is expressed by formula (3), where $S_i$ represents the grayscale mean of the pixels in the lesion area in the corresponding time phase. The enhancement rate reflects the degree of aggregation of the contrast agent in the lesion.
The absorption rate is expressed by formula (4), where $S_i$ again represents the grayscale mean of the pixels in the lesion area in the corresponding time phase. The absorption rate of the lesion reflects the blood perfusion condition in the lesion.
Second, the enhancement rate is defined on every pixel. In its expression, T and t represent time points (such as the three time phases $s_0$, $s_1$, $s_2$), the ROI matrix has size M × N, and $I_T(i, j)$ and $I_t(i, j)$ represent the pixel values at image coordinate (i, j) at times T and t, respectively. The standard deviation, mean, and maximum of these dynamic characteristics are extracted from the resulting data.
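The formulas for the enhancement rate are not reproduced in the text above, so the sketch below assumes the common relative-enhancement form: (S_i − S_0) / S_0 for the whole lesion and (I_T − I_t) / I_t for each pixel. This is an assumption and not necessarily the definition used in the paper; the absorption rate is omitted because its form cannot be inferred from the text. Function names and the toy phases are illustrative.

```python
import numpy as np


def lesion_enhancement_rate(phases, mask):
    """Assumed form: relative change of the lesion mean against the pre-contrast phase."""
    means = np.array([np.asarray(phase, dtype=float)[mask].mean() for phase in phases])
    return (means[1:] - means[0]) / means[0]


def pixel_enhancement_stats(later, earlier, mask, eps=1e-6):
    """Assumed form: per-pixel (I_T - I_t) / I_t, summarized over the lesion mask."""
    later = np.asarray(later, dtype=float)
    earlier = np.asarray(earlier, dtype=float)
    rate = (later - earlier) / np.maximum(earlier, eps)  # eps guards against zero pixels
    values = rate[mask]
    return {
        "std": float(values.std()),
        "mean": float(values.mean()),
        "max": float(values.max()),
    }


# Toy series: a pre-contrast phase and two enhanced phases of a 4 x 4 ROI.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
s0 = np.full((4, 4), 100.0)
s1 = s0.copy()
s1[mask] = 150.0
s2 = s0.copy()
s2[mask] = 130.0

lesion_rates = lesion_enhancement_rate([s0, s1, s2], mask)
pixel_stats = pixel_enhancement_stats(s1, s0, mask)
```

Under these assumed definitions, the toy lesion brightens by 50% in the first enhanced phase and by 30% in the second, and the per-pixel map is uniform, so its standard deviation is zero.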

Statistics Features.
The statistical characteristics of the image are computed from the grayscale values of the pixels in the lesion. In this paper, the statistical features of the three time phases are extracted, including the grayscale mean, standard deviation, information quantity (entropy), maximum value, peak degree (kurtosis), and deflection degree (skewness). The peak degree reflects how steep or flat the data distribution is. The deflection degree reflects the symmetry of the data distribution.
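The six statistical features can be sketched as follows. The sketch assumes that the peak degree is the non-excess kurtosis (fourth standardized moment), the deflection degree is the skewness (third standardized moment), and the information quantity is the Shannon entropy of a 256-bin gray-level histogram; the paper does not state these details. The function name and sample values are illustrative.

```python
import numpy as np


def gray_level_statistics(values, bins=256):
    """Mean, standard deviation, entropy, maximum, skewness, and kurtosis of lesion pixels."""
    values = np.asarray(values, dtype=float).ravel()
    mean, std = values.mean(), values.std()
    hist, _ = np.histogram(values, bins=bins)
    prob = hist[hist > 0] / values.size
    entropy = float(-(prob * np.log2(prob)).sum())
    if std > 0:
        z = (values - mean) / std
        skewness, kurtosis = float((z ** 3).mean()), float((z ** 4).mean())
    else:
        skewness, kurtosis = 0.0, 0.0  # flat region: moments are undefined, report zero
    return {
        "mean": float(mean),
        "std": float(std),
        "entropy": entropy,
        "max": float(values.max()),
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# In practice `values` would be image[mask] for one time phase of a segmented lesion.
stats = gray_level_statistics([1, 2, 3, 4, 5])
```

For the symmetric sample above the skewness is zero, which matches the description of the deflection degree as a measure of symmetry.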
Based on the three time phases of the breast cancer DCE-MRI images (three periods before and after the contrast agent is added), the paragraphs above introduce the extraction of four types of features: texture, dynamics, statistics, and morphology. Among them, the GLCM texture features include energy, contrast, correlation, entropy, and deficit matrix, denoted $F_1 \sim F_{15}$. The LBP texture features comprise three histograms, denoted $F_{16}^{0} \sim F_{16}^{255}$, $F_{17}^{0} \sim F_{17}^{255}$, and $F_{18}^{0} \sim F_{18}^{255}$. The dynamic characteristics include the absorption rate, enhancement rate, standard deviation, mean, and maximum, denoted $T_1 \sim T_{13}$. The statistical features include the grayscale mean, grayscale standard deviation, information entropy, maximum value, deviation, and peak, denoted $C_1 \sim C_{18}$. The morphological features include the standardized radial length mean and standard deviation, tightness, roughness, smoothness, roundness, and area, denoted $M_1 \sim M_7$. From the DCE-MRI sequential scans, a computerized scheme is applied to extract 142 imaging features after all invalid columns containing only 0 values are removed. Table 2 summarizes these DCE-MRI features.

Prediction Model Training.
The feature extraction process above generates a large amount of radiomics feature data, but not all of these features are useful for recognizing molecular phenotypes. There are many methods of feature selection, and there is no strictly uniform method for breast cancer DCE-MRI images. Feature selection in this paper is based on the recursive feature elimination algorithm. The main idea of recursive feature elimination (RFE) is to build the model repeatedly and, each time, sort all features according to their importance. The least important features are deleted until no more features can be deleted [48-50]. Recursive feature elimination is therefore a greedy algorithm. Usually, a model is selected first and trained with the sample data. The importance scores of all features are calculated with the trained model, and the features with the least importance are removed from the current feature set. Then the model is retrained on the remaining features, and the procedure repeats until no features can be deleted. After the iteration is completed, the optimal feature subset is chosen according to the evaluation criteria. Traditional recursive feature elimination relies on a single model for feature selection.
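The greedy loop of single-model RFE can be sketched as below. To keep the example self-contained it uses a stand-in importance measure, the absolute coefficients of a least-squares fit on standardized features, rather than the LR, SVM, RF, or GBDT models of the paper, and it stops at a fixed number of features instead of choosing the subset by an evaluation criterion. The names `linear_importance` and `recursive_feature_elimination` and the synthetic data are hypothetical.

```python
import numpy as np


def linear_importance(X, y):
    """Stand-in importance: |coefficients| of a least-squares fit on standardized features."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
    return np.abs(coef)


def recursive_feature_elimination(X, y, importance_fn, n_keep):
    """Repeatedly refit and drop the least important feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(X[:, remaining], y)  # refit on the current subset
        worst = int(np.argmin(scores))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Synthetic data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
kept, dropped = recursive_feature_elimination(X, y, linear_importance, n_keep=2)
```

Any model that exposes a per-feature importance or coefficient could be substituted for `linear_importance` without changing the loop.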
When features are selected by RFE, the optimal feature subset differs from one classification model to another, although the subsets partly overlap. This paper therefore proposes a multimodel-based recursive feature elimination (mmRFE) feature selection method. First, each model sorts all features by importance, giving one ranking per model, and the position index of every feature in every ranking is recorded. The indices of each feature are then summed across models, and the features are sorted again by this sum to obtain a new comprehensive ranking that fully takes into account the position of each feature in the different models. The features in the comprehensive ranking are used to train each model, and the classification results are stored in the result set. The lowest-ranked features in the comprehensive ranking are then removed, and the process repeats until no features can be deleted. In the end, each model has multiple sets of results. A feature subset is selected on the basis of the results of every model so that it performs well in all of them; for example, a subset may be chosen for which every classification model reaches an accuracy above 85%. The flow chart of the mmRFE method is shown in Figure 3.
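The rank-aggregation step of mmRFE can be sketched as follows. Two simple stand-in importance measures (absolute correlation and absolute least-squares coefficients) take the place of the four classifiers, one feature is removed per round, and the per-model training and evaluation on each subset is omitted; these are simplifying assumptions, and all names are hypothetical.

```python
import numpy as np


def correlation_importance(X, y):
    """Stand-in model 1: absolute Pearson correlation of each feature with y."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(Xc.T @ yc) / denom


def coefficient_importance(X, y):
    """Stand-in model 2: |coefficients| of a least-squares fit on standardized features."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
    return np.abs(coef)


def rank_positions(scores):
    """Position of each feature when sorted from most to least important (0 = best)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    positions = np.empty(len(order), dtype=int)
    positions[order] = np.arange(len(order))
    return positions


def mm_rfe(X, y, importance_fns, n_keep):
    """Drop the feature with the worst summed rank across all models, one per round."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        subset = X[:, remaining]
        summed = sum(rank_positions(fn(subset, y)) for fn in importance_fns)
        worst = int(np.argmax(summed))  # largest summed position = least important overall
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
kept, dropped = mm_rfe(X, y, [correlation_importance, coefficient_importance], n_keep=2)
```

Summing positions rather than raw scores keeps models with differently scaled importances on an equal footing, which is the point of recording the position index of each feature in each ranking.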
The classification models trained in this paper are logistic regression (LR), support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT). The performance of each classifier is evaluated and discussed in the next section. The experiments compare traditional single-model RFE with the mmRFE proposed in this paper.

Patient Population.
The breast cancer DCE-MRI data in this paper were collected from a cancer hospital in Liaoning and comprise 637 patient cases in total. All 637 cases are malignant cases of breast cancer in women. The age range is between 43 and 70 years, and the average age is 57.2 ± 13.3 years. These conditions were confirmed by histopathological examination after the patients had received a DCE-MRI examination read by a radiologist. Diagnoses include ductal carcinoma, invasive ductal carcinoma, invasive papillary carcinoma, mucous cancer, invasive lobular carcinoma, medullary carcinoma, solid papillary carcinoma, ductal carcinoma in situ, extensive ductal carcinoma, and extensive ductal carcinoma in situ. The pathological data of the 637 patients and the statistics of the molecular subtypes are shown in Table 3. It is easy to see that the dataset is imbalanced across the molecular subtypes.

DCE-MRI Acquisition.
The DCE-MRI data were generated by GE 1.5 T magnetic resonance imaging equipment (Hdx, GE Healthcare, Waukesha, WI, USA) with a dedicated 4-channel breast coil. The routine scanning sequences are the axial T1WI SPGR sequence, the sagittal T2WI fat inhibition sequence, and the axial DWI sequence. The layer thickness of these sequences is 3 mm, and the FOV is 36 × 36 cm. The DCE-MRI data use an axial 3D dynamic SRGR sequence (TR 6.1, TE 2.9, FOV 36 × 36 cm, matrix 512 × 512); scans with flip angles of 2 degrees and 15 degrees are used to obtain the T1 mapping, and a flip angle of 15 degrees is then used for dynamic enhancement scanning. After one phase had been acquired, Gd-DTPA (0.1 mmol/kg) was injected intravenously with a high-pressure syringe (Ulrich Medical) at a rate of 3 ml/s, the tube was flushed with 25 ml of saline, and scanning then continued for the 8 time phases.

Performance on Traditional RFE-Based Prediction Model.
This paper uses the four models LR, SVM, RF, and GBDT to select the optimal feature subset based on traditional RFE with a single model. Accuracy, precision, recall, and F1-score are used to evaluate classification performance. The experimental results for LR show 80 selected feature dimensions, including GLCM texture features with 9 dimensions (energy, contrast, correlation, and deficit matrix in the first time phase; correlation in the second time phase; energy, correlation, entropy, and deficit matrix in the third time phase), morphological features with 2 dimensions (standardized radial length standard deviation and roughness), statistical features with 5 dimensions (grayscale standard deviation and maximum grayscale in the first time phase, grayscale mean and maximum value in the second time phase, and grayscale standard deviation in the third time phase), dynamic enhancement features with 7 dimensions ($T_{1,0}$ standard deviation, mean, and maximum; $T_{2,0}$ mean; $T_{2,1}$ standard deviation, mean, and maximum), and other LBP features.
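The four evaluation measures can be computed per class from the predicted and true labels, as in the sketch below. The standard one-versus-rest definitions are assumed; how the paper averages the per-class values into a single figure is not stated, so the sketch only reports accuracy and the per-class values. The function name and the toy labels (A, B, H) are illustrative and are not results from the paper.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Overall accuracy plus one-versus-rest precision, recall, and F1 for each label."""
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, report


# Toy labels for three hypothetical classes.
y_true = ["A", "A", "A", "B", "B", "H"]
y_pred = ["A", "A", "B", "B", "B", "H"]
accuracy, report = per_class_metrics(y_true, y_pred, labels=["A", "B", "H"])
```

On an imbalanced dataset the per-class values are more informative than accuracy alone, because a classifier can reach high accuracy while performing poorly on a rare subtype.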
With SVM, RFE retains 77 features: 8 GLCM texture features (contrast, correlation, and deficit matrix in the first time phase; correlation in the second time phase; energy, contrast, entropy, and deficit matrix in the third time phase), 2 morphological features (standardized radial length mean and standard deviation), 8 statistical features (grayscale mean, grayscale standard deviation, and grayscale maximum; second time phase grayscale mean, bias, and peak; third time phase grayscale standard deviation and grayscale maximum), 5 dynamic enhancement features (T_{1,0} standard deviation and maximum; T_{2,0} mean; T_{2,1} standard deviation and maximum), and LBP features for the remainder.
With RF, RFE retains 55 features: 11 GLCM texture features (energy, contrast, correlation, entropy, and deficit matrix in the first time phase; energy, contrast, and correlation in the second time phase; energy, contrast, and correlation in the third time phase), 4 morphological features (standardized radial length mean, standardized radial length standard deviation, tightness, and roughness), 14 statistical features (first time phase grayscale mean, standard deviation, maximum, bias, and peak; second time phase grayscale standard deviation, maximum, bias, and peak; third time phase grayscale mean, standard deviation, maximum, bias, and peak), 9 dynamic enhancement features (standard deviation, mean, and maximum of each of T_{1,0}, T_{2,0}, and T_{2,1}), and LBP features for the remainder.
With GBDT, RFE retains 66 features: 13 GLCM texture features (energy, contrast, correlation, and deficit matrix in the first time phase; energy, contrast, correlation, and deficit matrix in the second time phase; energy, contrast, correlation, entropy, and deficit matrix in the third time phase), 4 morphological features (standardized radial length mean, standardized radial length standard deviation, tightness, and roughness), 14 statistical features (first time phase grayscale mean, standard deviation, bias, and peak; second time phase grayscale mean, standard deviation, maximum, bias, and peak; third time phase grayscale mean, standard deviation, maximum, bias, and peak), 8 dynamic enhancement features (T_{1,0} standard deviation, mean, and maximum; T_{2,0} standard deviation, mean, and maximum; T_{2,1} standard deviation and mean), and LBP features for the remainder. The feature subsets selected by the four models are shown in Table 4, which shows that the four classifiers select different subsets.
As shown in Table 5, GBDT performs best on every evaluation index, followed by SVM and then RF, while LR is slightly worse, with results below 0.8, and is less effective than the other three models. If molecular subtype classification relies on single-model RFE, GBDT is the most suitable model.
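The single-model RFE procedure described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn, which the paper does not name, and uses a synthetic four-class dataset as a stand-in for the 143 radiomics features. The subset size of 10, the elimination step of 5, and all hyperparameters are illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the radiomics matrix: 4 classes, 30 features.
X, y = make_classification(n_samples=200, n_features=30, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Each model must expose coef_ or feature_importances_ for RFE to rank features.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "RF": RandomForestClassifier(n_estimators=30, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

selected = {}
for name, model in models.items():
    # Drop the 5 lowest-ranked features per round until 10 remain.
    rfe = RFE(estimator=model, n_features_to_select=10, step=5).fit(X, y)
    selected[name] = set(np.flatnonzero(rfe.support_).tolist())

# Features that every model retained; the subsets generally differ by model.
common = set.intersection(*selected.values())
```

Comparing the entries of `selected` shows the effect reported for Table 4: each model ranks features by its own criterion, so the retained subsets need not coincide.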

Performance on mmRFE-Based Prediction Model.
In this experiment, each of the four classifiers is again used in RFE. The accuracy obtained with each model is shown in Table 6. Logistic regression has the lowest accuracy. Three feature subsets with accuracy above 0.8 are found across the logistic regression experiments. When these subsets are compared on the SVM, RF, and GBDT models, the first set gives the most robust results, so it is selected as the optimal feature subset in this experiment. The selected subset has 69 features: 12 GLCM texture features (energy, contrast, correlation, and deficit matrix in the first time phase; energy, correlation, and deficit matrix in the second time phase; energy, contrast, correlation, entropy, and deficit matrix in the third time phase), 4 morphological features (standardized radial length mean, standardized radial length standard deviation, tightness, and roughness), 13 statistical features (first time phase grayscale mean, standard deviation, and maximum; second time phase grayscale mean, standard deviation, maximum, bias, and peak; third time phase grayscale mean, standard deviation, maximum, bias, and peak), and 6 dynamic enhancement features (R10 mean and maximum; R20 mean; R21 standard deviation, mean, and maximum); the rest are LBP features. The detailed features are C 14, T 7, T 11, T 9, F 17 247, F 18 243, F 16
Based on mmRFE, feature screening is carried out using the optimal feature subsets selected for the current model by LR, SVM, RF, and GBDT, and the experimental results are reported in terms of accuracy, precision, recall, and F1-score. The performance of logistic regression on each molecular subtype is shown in Table 7, which shows that logistic regression classifies the luminal A and basal-like types better.
The classification results of SVM and its performance on each molecular subtype are shown in Table 8. The data show that SVM classifies the luminal A, HER-2 expression, and basal-like types better, while its ability to classify the luminal B type is weaker than for the other three. The classification results of RF and its performance on each molecular subtype are shown in Table 9.
The data show that RF classifies the luminal A, luminal B, and HER-2 expression types better, while its ability to classify the basal-like type is weaker than for the other three. The classification results of GBDT and its performance on each molecular subtype are shown in Table 10. The data show that GBDT classifies all four subtypes better than the three classifiers above. These experiments show that the four classification models differ in their ability to identify the molecular subtypes of breast cancer: LR, SVM, and RF cannot recognize all four molecular subtypes well, each being clearly weak on one or two subtypes. GBDT is therefore the most suitable classification model. The four classification models are then trained on the features selected by mmRFE, and the classification results of each model are shown in Table 11. All four classifiers perform stably, especially LR, which performed worst on the features selected by the traditional RFE algorithm. In other words, the features selected by the mmRFE algorithm are better suited to the molecular subtype recognition task. The GBDT model obtains the best performance and also handles the imbalance among molecular subtypes well. The results for the different features and classifier models are summarized in Table 12. The experimental results show that the ensemble model trained on features selected by multimodel RFE classifies better than each model trained on features selected by single-model RFE.
Thus, the results indicate that the multimodel feature selection method and the ensemble classifier are reasonable choices.
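The robustness criterion used above to choose among candidate feature subsets can be sketched in code. This section does not give the exact mmRFE scoring rule, so the sketch assumes that a "robust" subset is the one with the highest worst-case cross-validated accuracy across the four models. The function `pick_robust_subset`, the candidate subsets, and the synthetic data are all hypothetical, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pick_robust_subset(X, y, candidates, models, cv=3):
    """Score each candidate subset by its lowest mean CV accuracy over all
    models and return (name of best subset, dict of worst-case scores)."""
    worst_case = {}
    for name, cols in candidates.items():
        accs = [cross_val_score(m, X[:, cols], y, cv=cv, scoring="accuracy").mean()
                for m in models.values()]
        worst_case[name] = min(accs)
    best = max(worst_case, key=worst_case.get)
    return best, worst_case

# Synthetic stand-in data. With shuffle=False, columns 0-7 are informative,
# columns 8-9 are redundant combinations, and columns 10-29 are pure noise.
X, y = make_classification(n_samples=240, n_features=30, n_informative=8,
                           n_redundant=2, n_classes=4, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "RF": RandomForestClassifier(n_estimators=30, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Hypothetical candidate subsets, e.g. as produced by separate RFE runs.
candidates = {
    "informative": list(range(8)),
    "mixed": [0, 1, 2, 3, 20, 21, 22, 23],
    "noise_only": list(range(20, 28)),
}

best, worst_case = pick_robust_subset(X, y, candidates, models)
```

Taking the minimum over models rewards subsets on which no classifier fails badly, which matches the stated goal of stable performance across models. A mean or a variance penalty would be an equally plausible reading of the text.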

Conclusion
Breast cancer is a highly heterogeneous disease, and different molecular subtypes respond to treatment in clearly different ways. Therefore, recognizing molecular markers directly from DCE-MRI images to distinguish the four molecular subtypes without invasive biopsy helps guide treatment plans for breast cancer at an early stage. Working from the breast DCE-MRI imaging phenotype, which reveals the quantitative imaging characteristics underlying the diagnosis of breast cancer molecular subtypes, can improve the accuracy of breast cancer diagnosis and treatment and, by securing timely treatment, improve the patient's five-year survival rate. The current surgical biopsy is an invasive, local tissue sampling procedure, whereas determining the molecular subtypes directly from DCE-MRI imaging is noninvasive. This method can support a comprehensive evaluation of the heterogeneity of the lesions and predict the prognosis in advance.
This paper introduces an approach for molecular subtype recognition and focuses mainly on feature extraction and selection. To obtain a precise feature description, the paper proposes an improved region growth algorithm that extracts the precise edge of the lesion based on radiologists' annotations. Various types of features of the breast cancer phenotypes are then extracted, including texture, morphology, kinetic, and statistics features on the different time phases of DCE-MRI. Not all of these features are useful for the molecular subtype recognition task, so the paper concentrates on finding the best features. An mmRFE algorithm is proposed to select the feature subset, and the experimental results show that it outperforms the traditional RFE algorithm. Finally, the features filtered by the mmRFE algorithm are used to validate the performance of the different classifier models and the behavior of each model on the imbalanced molecular subtypes. GBDT obtains the best result on both classification and imbalance performance.
Future work will focus on extracting more features, such as clinical features, and on boosting classification models. Further work should also examine in depth the observation that strong models can find good features but are poorly suited to boosting, whereas weak models may be well suited to boosting but cannot find useful features. Validating the approach in the treatment process is another problem to be considered in future work.

Data Availability
The patient population data used to support the findings of this study have not been made freely available because the data are supplied by the Cancer Hospital of Liaoning under license. Requests for access to these data should be made to the corresponding author.

Conflicts of Interest
There are no conflicts of interest.