A Machine Learning Approach for the Association of ki-67 Scoring with Prognostic Factors

The ki-67 score is a proliferation marker for solid tumors that is associated with the prognosis of breast carcinoma and its response to neoadjuvant chemotherapy. In the present study, we aimed to investigate how prognostic factors cluster by ki-67 score using a machine learning approach and multiple correspondence analysis. A total of 223 patients with breast carcinoma were analyzed using the random forest method to classify prognostic factors according to ki-67 group (<14% and >14%). The relationship between subgroups of prognostic factors and ki-67 scores was also examined by multiple correspondence analysis. Molecular classification LA, 0-3 metastatic lymph nodes, age <50, absence of LVI, and T1 tumor size clustered with ki-67 <14%, whereas grade III, 10 or more metastatic lymph nodes, presence of LVI, molecular classification LB, age >50, and T3-T4 tumor size clustered with ki-67 >14%. The fact that low ki-67 scores correlate with early-stage disease and high scores with advanced disease suggests that the 14% threshold value is crucial for the ki-67 score.


Introduction
Machine learning investigates how computers can learn (or improve their performance) based on available data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on available data [1]. Random forest (RF) is a supervised machine learning technique: a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all the trees in the forest [2].
ki-67 is a nuclear protein expressed in the G1, S, G2, and M phases of the cell cycle in tumor cells, and the ki-67 score is a proliferation marker for solid tumors that is associated with the prognosis of breast carcinoma (BC) and its response to neoadjuvant chemotherapy [3]. A threshold value of 14% is a determinant for the identification of the molecular subtypes of BC (MSBC). Chemotherapy response and progression differ among MSBCs [4,5].
In luminal A (LA), ER and/or PR is positive, HER-2 is negative, and the proliferation index is low. In luminal B (LB), tumors are of high grade and may be PR+ or PR- and HER2+ or HER2-. If they are HER2-, they can be distinguished from LA by a ki-67 score of >14% [3]. In the HER-2 subtype, HER-2 gene expression is high, ER and PR are negative, and tumors are of high grade with a ki-67 score of >14% [6]. Triple negative breast carcinoma (TNBC) is the type lacking ER and PR expression as well as HER2 overexpression. Compared to other subtypes, TNBC tumors are usually larger [7,8] and are associated with 2.5-fold more metastasis within five years after diagnosis [8].
Lymphovascular invasion (LVI) is present in one-third of BCs. As a single indicator for adjuvant chemotherapy [7], LVI is associated with increased lymph node metastasis and the risk of progression to systemic disease [9,10]. It is an adverse prognostic factor for relapse and survival in node-negative patients [11].
Age is a prognostic factor in BC and varies by geographical region or demographics. In regions with young populations such as Asia, Africa, and Turkey, BCs are more frequent under the age of 40, and these tumors are found at more advanced stages than in Western societies [12]. The presence of axillary lymph node (LN) metastasis is one of the most important factors in estimating patient prognosis. The metastatic axillary lymph node ratio (mALNR) is known as an important factor in survival for BC [13]. In general, a high mALNR indicates poor prognosis [14,15]. Spread of cancer cells to regional LNs is the most important prognostic factor, and assessing the status of axillary lymph nodes (ALNs) is important for the prediction of long-term survival in BC [16,17]. In developed countries, histologically node-negative breast carcinoma (HNNBC) accounts for two-thirds of invasive BC [18]. Histologically node-negative BC patients usually have a good prognosis [18,19].
Histopathological grade is a special prognostic factor. Some recent studies have confirmed the importance of histopathological grading of BC as a predictive and effective factor in survival. Grade 2 and 3 BCs have a poorer prognosis [20,21]. Tumor size (TS) is an independent prognostic factor in the TNM staging system and correlates well with the incidence of nodal metastasis, relapse risk, and survival [22,23]. In the present study, we aimed to investigate how prognostic factors cluster by ki-67 score using a machine learning approach and multiple correspondence analysis (MCA). Data on the prognostic factors were recorded, including patient age, body mass index (BMI), TS (cm), ki-67 score (%), ER, PR, and c-erb-2 receptor status, molecular classification (MC) (LA, LB, Her-2, and TNBC), histopathological diagnosis, nuclear grade (Modified Bloom Richardson), mALN count (pN1, pN2, pN3), LVI, and the method of operation. The way ki-67 scores cluster with prognostic variables was examined.

Materials and Methods
Patients aged 18 to 70 years were included in the study; patients with distant metastasis and morbid obesity (BMI ≥ 40) were excluded. Most of the patients (86%) had invasive ductal carcinoma. It was thought that, given a sufficient number of patients with the TNBC molecular class, TNBC could cluster with the ki-67 classes.
In addition, the number of LNs, BMI, perivascular invasion (PVI), and histopathological type variables were excluded from the analysis because they reduced the total inertia (58%) and made the clustering of variables ambiguous. Furthermore, the surgical type variable was excluded from the MCA because its effect on ki-67 classification is negligible.
Univariate analyses, the RF machine learning classification algorithm, and MCA were used for data evaluation. For the 16 missing values among different prognostic variables in the data set, the "rfImpute" RF imputation algorithm was used. RF is a classification method based on voting. It comprises many decision trees [2]. The decision trees are independent of each other and are built from samples drawn from the data set using the bootstrap method.
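The bootstrap step described above can be illustrated with a minimal sketch. It assumes plain sampling with replacement over case indices; the function name `bootstrap_sample` and the fixed seed are hypothetical choices made for illustration and are not taken from the R package used in the study.

```python
import random


def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed, an assumption made only for reproducibility
n_cases = 223           # cohort size reported in this study
in_bag, out_of_bag = bootstrap_sample(n_cases, rng)
oob_fraction = len(out_of_bag) / n_cases

print(f"{len(set(in_bag))} distinct cases in the bag")
print(f"out-of-bag fraction: {oob_fraction:.2f}")
```

Each tree sees a different resample of the same cases, and the cases left out of a given resample (roughly 1/e, or about 37%, for large samples) are the ones commonly used to estimate that tree's error without a separate validation set.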
For an input vector $x$, the RF class prediction is obtained by majority vote over the $B$ trees: $\hat{C}_{\mathrm{rf}}^{B}(x) = \text{majority vote}\,\{\hat{C}_{b}(x)\}_{1}^{B}$, where $\hat{C}_{b}(x)$ is the class prediction of the $b$th RF tree. During the RF classification procedure, the relative significance of different variables is also evaluated [24]. This study took the decrease in the GINI index into consideration to evaluate the significance of each variable. The GINI index measures the impurity or inequality level of a sample assigned to a node [25].
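The majority vote over trees and the GINI impurity of a node can be sketched as follows. This is an illustrative sketch, not the implementation used in the study: the function names, the vote list, and the node contents are hypothetical, and the impurity is taken in its usual form, one minus the sum of squared class proportions.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees (ties go to the first seen)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Hypothetical values for illustration only, not data from this study.
votes = ["high", "low", "high", "high", "low"]  # class votes of 5 trees
parent = ["low"] * 4 + ["high"] * 4             # node with 4 cases per class
left, right = ["low"] * 4, ["high"] * 4         # a split that separates them

print(majority_vote(votes))
print(gini_impurity(parent))
print(gini_decrease(parent, left, right))
```

A variable that often produces large impurity decreases when it is used for splitting receives a high importance value, which is the idea behind ranking variables by mean decrease in GINI.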
A supervised machine learning approach was used to analyze the relationship between the ki-67 groups, used as the label, and the input variables (MC, LVI, age, number of mLN, nuclear grade, TS, number of LNs, BMI, PVI, surgical type, and histopathological type). Thus, in this study, the classifiability of prognostic variables by ki-67 group (<14% and >14%) was analyzed using the RF method. In the training set, 10-fold cross-validation was applied for parameter optimization of the machine learning algorithm. The test set was used to determine the accuracy of the learned model. To evaluate model performance, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were calculated.
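The two evaluation tools named above, 10-fold partitioning and the AUC, can be sketched as follows. The sketch is illustrative only and makes several assumptions. The folds are dealt from a shuffled index list without stratification. The AUC is computed as the probability that a positive case scores higher than a negative one, which equals the area under the ROC curve. The scores and labels are hypothetical, and the partition is applied to 223 indices only to show the fold sizes, whereas the study applied it to the training set.

```python
import random


def k_fold_indices(n_cases, k, rng):
    """Shuffle case indices and deal them into k folds of near-equal size."""
    indices = list(range(n_cases))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    A tie between a positive and a negative score counts as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


folds = k_fold_indices(223, 10, random.Random(0))  # seed is an assumption
print([len(fold) for fold in folds])

# Hypothetical predicted probabilities of the ki-67 >14% class and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

In each cross-validation round one fold is held out for validation and the remaining nine are used for fitting, so every case is used for validation exactly once.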
Correspondence analysis makes no distributional assumption other than that the frequencies in the cross table are positive numbers. It aims to demonstrate graphically the association between the rows and columns of cross tables and to derive simple factors from this representation [26]. In our study, we used MCA to reveal the association of ki-67 with prognostic factors.
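The quantity that correspondence analysis decomposes, the total inertia, can be illustrated for a single two-way cross table, where it equals the chi-square statistic divided by the number of cases. The sketch below is a simplified illustration under that definition. The function name and both tables are hypothetical and do not come from this study, and MCA itself applies the same idea to an indicator or Burt matrix built from several categorical variables, which is not shown here.

```python
def total_inertia(table):
    """Total inertia of a two-way contingency table: chi-square divided by n.

    Assumes every row and column total is positive, in line with the
    positive-frequency assumption of correspondence analysis.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square / n


# Hypothetical counts: rows are two categories of a prognostic factor,
# columns are the ki-67 <14% and >14% groups. Not data from this study.
associated = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

print(total_inertia(associated))
print(total_inertia(independent))
```

A table whose rows and columns are independent has zero inertia, so there is nothing to display; the stronger the association, the more inertia is available to be spread over the plotted dimensions.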

Results
A total of 223 patients with breast carcinoma were included in this study. A total of 74 cases (33.2%) had a ki-67 score of <14%, with a mean age of 52.5 ± 12.14 years. A total of 149 cases (66.8%) had a ki-67 score of >14%, with a mean age of 50.75 ± 11.95 years. Because our study was built on the association of ki-67 scoring with variables regarded as prognostic factors for BC, and because variable significance was also assessed by the applied RF algorithm, the results of the RF method were taken into account rather than the nonsignificant values in Table 1 (age, number of mLNs, histopathological type, and BMI: p = 0.742, p = 0.234, p = 0.403, and p = 0.386, respectively). Most cases had the histological type of invasive ductal carcinoma (86%), and the most frequent grade was Grade II (51.6%). By BMI group, there were no underweight patients, and most of the patients (56.9%) were in the obese group. The distribution of the nonmetastatic lymph node count (nmLN) varied by ki-67 group (p = 0.07).
Using mtry = 3 as the number of candidate variables at each split of the decision trees and ntree = 100 as the number of trees, the RF classification algorithm was applied to the data set of 223 cases. With these settings, the obtained accuracy was 0.91. To evaluate the performance of the obtained model, the ROC curve and AUC were calculated; the AUC was 0.95 (Figure 1).
The variables with high and low significance for the association of ki-67 with the prognostic variables for breast carcinoma are shown in Figure 2, which is based on the mean decrease in GINI. According to this figure, the variable contributing most to ki-67 classification is MC (25), followed by LVI (14.6), age (11), number of mLN (9.6), and nuclear grade (6). MCA was performed to determine the association of ki-67 proliferation with other variables. For this analysis, the MC, LVI, age, number of mLNs, nuclear grade, and TS variables were taken from Figure 2.

Luminal A, Luminal B, and Her-2.
Normally, the ki-67 >14% class should not contain LA; however, because imputation was performed for the parameters with missing data using the "rfImpute" command in the "randomForest" package in R, one LA case was present in this class. Although such cases were very few, this random procedure was not interfered with, so that the missing data would not reduce the reliability of the analysis. LB accounted for 37.8% of the cases with a ki-67 score of <14% and 75.2% of the cases with a score of >14%. Among patients with a ki-67 score of <14%, 6 (8.2%) had the Her-2 molecular type, compared with 27 (18.1%) among those with a ki-67 score of >14%. Molecular subtyping was the most important factor in terms of mean decrease in the GINI index and, consistent with the literature, LB clustered with a ki-67 score of >14% and LA with a ki-67 score of <14%. Her-2 did not cluster with either group.

Triple Negative Breast Cancer.
A ki-67 score of <14% was detected in 3 cases (4%) and a ki-67 score of >14% in 9 cases (6%) with TNBC. However, TNBC did not cluster with either ki-67 class (see Figure 3). The reason that TNBC did not cluster with any of the subgroups of prognostic factors in MCA is the insufficient number of TNBC cases in the data set.

Nuclear Grade.
The histopathological grade was determined using the modified Scarff-Bloom-Richardson grading system (Nottingham Combined Histological Grade) [32]. When the groups were assessed for nuclear grade, grade 2 was significantly more frequent in both groups (p = 0.001). The grade III group clustered with the ki-67 class of >14%, and no clustering was observed for Grades I and II with either ki-67 class (Figure 3). This was attributed to grade and ki-67 score both increasing secondarily to nuclear proliferation in the G1, S, G2, and M phases [3]. Consistent with previous studies, nuclear grade and ki-67 score were found to be positively correlated [33].

Age.
In the present study, patients aged 50 years or younger were more numerous in both ki-67 score groups (p = 0.742). The association of age groups with ki-67 score classes was evaluated, and ki-67 of >14% clustered with patients aged 50 years or older and ki-67 of <14% with patients younger than 50 years (Figure 3). In a previous study, age and ki-67 scores of >14% were negatively correlated [34]; in contrast, a positive correlation was found in our study. This difference is thought to be caused by differences in the populations from which the samples were drawn.

Tumor Size.
Although many studies found no correlation between TS and ki-67 score [35-37], in the present study T3/4 clustered with the ki-67 class of >14% and T1 with ki-67 of <14%. However, T2 did not cluster with either ki-67 class (Figure 3). For tumors at the same T stage, the risk of progression to advanced-stage disease increases with increasing size [31]. TS was considered to increase secondarily to the progression that develops with a high ki-67 score.
Successful results were obtained in a study [38] using the "k-Means clustering" classification method. However, because k-means clustering gives equal weight to each attribute during classification, it may cause problems when unrelated attributes are present. In our study, which also examines the association of the data, there are attributes with no association with ki-67 scoring. With the aforementioned parameters, the RF method was applied using R 3.3.3, and the accuracy was 91%. To validate the results, ROC analysis was conducted to evaluate performance, and the AUC was 0.95. The RF method was preferred because of advantages such as the possibility of evaluating the relative importance of the variables in classification, the ability to identify variable interactions, and the short computation time.
Because a graphical method is used to analyze the association between the categories of variables in MCA, it is considered to be more successful than clustering analysis. In a study examining the prognostic factors correlated with ki-67 [35], the association was examined using univariate analyses such as ANOVA and the chi-square test.
In our study, in which the majority of the data are categorical, MCA, a type of multivariate analysis that also takes the visual dimension of the association into account, was used. In the MCA, the first two dimensions explained 76.30% of the variance. Among the variables contributing to inertia, the association of grade, mLN, MC, LVI, age, and TS was examined.

Conclusion
Luminal B, nuclear grade III, age ≥50 years, LVI (+), number of mLNs ≥10, and tumor size T3/4 clustered with ki-67 > 14% in the analysis of the relation between the ki-67 threshold value and prognostic factors. Luminal A, age <50 years, LVI (-), number of mLNs 0-3, and tumor size T1 clustered with ki-67 < 14%. The fact that low ki-67 scores correlate with early-stage disease and high scores with advanced disease suggests that the 14% threshold value is crucial for the ki-67 score.

Data Availability
Access to the data is restricted because the institution from which the data were obtained does not allow sharing of data with third parties, in order to protect patient privacy.