Prognostic Models for Nonmetastatic Triple-Negative Breast Cancer Based on the Pretreatment Serum Tumor Markers with Machine Learning

Purpose. Triple-negative breast cancer (TNBC) is a heterogeneous and aggressive disease with a poorer prognosis than other subtypes. We aimed to investigate the prognostic efficacy of multiple tumor markers and to construct a prognostic model for stage I-III TNBC patients. Patients and Methods. We included stage I-III TNBC patients whose serum tumor marker levels were measured prior to treatment. The optimal cut-off value of each tumor marker was determined by X-tile. We then adopted two survival models (the lasso Cox model and the random survival forest model) to build the prognostic model, and calculated AUC values of the time-dependent receiver operating characteristic (ROC) curves. The Kaplan-Meier method was used to plot survival curves, and the log-rank test was used to test whether there was a significant difference between the predicted high-risk and low-risk groups. We used univariable and multivariable Cox analysis to identify independent prognostic factors and performed a further subgroup analysis for the lasso Cox model. Results. We included 258 stage I-III TNBC patients. CEA, CA125, and CA211 showed independent prognostic value for disease-free survival (DFS) when using the optimal cut-off values; their HRs and 95% CIs were 1.787 (1.056–3.226), 2.684 (1.200–3.931), and 2.513 (1.567–4.877), respectively. AUC values of the lasso Cox model and the random survival forest model were 0.740 and 0.663 for DFS at 60 months, respectively. Both the lasso Cox model and the random survival forest model demonstrated excellent prognostic value. According to tumor marker risk scores (TMRS) computed by the lasso Cox model, the high TMRS group had worse DFS (HR = 3.138, 95% CI: 1.711–5.033, p < 0.0001) and overall survival (OS) (HR = 3.983, 95% CI: 1.637–7.214, p = 0.0011) than the low TMRS group. Furthermore, subgroup analysis of N0-N1 patients in the lasso Cox model indicated that TMRS still had a significant prognostic effect on DFS (HR = 2.278, 95% CI: 1.189–4.346) and OS (HR = 2.982, 95% CI: 1.110–7.519).
Conclusions. Our study indicated that pretreatment levels of serum CEA, CA125, and CA211 had independent prognostic significance for TNBC patients. Both the lasso Cox model and the random survival forest model that we constructed based on tumor markers could strongly predict survival risk. Higher TMRS was associated with worse DFS and OS in both stage I-III and N0-N1 TNBC patients.


Introduction
Breast cancer is the most common malignancy among women worldwide, with the highest morbidity and mortality among female cancers. According to the global cancer statistics report released by the World Health Organization, there were an estimated 2.08 million newly diagnosed female breast cancer cases and more than 0.62 million deaths from the disease in 2018 [1]. Triple-negative breast cancer (TNBC) is characterized by the absence of estrogen receptor (ER) and progesterone receptor (PR) expression and of human epidermal growth factor receptor-2 (HER-2) amplification, and accounts for 10%-20% of all breast cancers [2][3][4]. Compared with non-TNBC patients, TNBC patients usually have more unfavorable histopathologic features, such as more rapid proliferation, larger tumor size, higher grade, and lymph node positivity [5,6]. TNBC patients cannot benefit from endocrine therapy or anti-HER-2 therapy because the targets are missing, which makes chemotherapy the current mainstay of systemic treatment.
Notorious for its heterogeneity, aggressiveness, and limited treatment options, TNBC is thought to have the poorest prognosis of all subtypes. Although TNBC patients are reported to be sensitive to chemotherapy, as demonstrated by higher pathologic complete response (pCR) rates than other subtypes after neoadjuvant chemotherapy [7,8], a considerable number of patients still cannot achieve pCR, and those with residual lesions have significantly worse survival than non-TNBC patients [7]. In addition, TNBC carries a higher risk of relapse and disease progression after surgery and chemotherapy [9,10]. Montagna E et al. further evaluated the outcome of breast cancer patients after locoregional recurrence (LRR) and found that patients with TNBC at LRR experienced a higher risk of subsequent relapse and death [11]. Recently, a retrospective analysis based on the SEER database also revealed that, compared with non-TNBC patients, TNBC patients had worse overall survival (OS) and breast cancer cause-specific survival (BCSS) in every stage and substage [12]. Survival of patients with distant metastasis is also shorter in TNBC than in other subtypes, which can be explained by the predilection of TNBC for brain and lung metastasis, whereas ER-positive breast cancers are more likely to relapse in bone or skin [4,13,14]. Therefore, it is important to identify efficient and easily measured prognostic markers to evaluate the risk of postoperative recurrence and survival.
Apart from the extensively documented clinicopathological risk factors such as lymph node status, tumor size, grade, and the level of Ki-67, there are still no prognostic biomarkers suitable for clinical use in TNBC [15,16]. The prognostic value of serum tumor markers has been investigated in breast cancer for several years, and carcinoembryonic antigen (CEA) and cancer antigen 15-3 (CA15-3) are the most widely used tumor markers in clinical practice [17][18][19][20][21]. However, the prognostic efficacy of preoperative levels of serum tumor markers such as CEA and CA15-3 in breast cancer remains controversial. Several previous studies suggested that elevated preoperative CEA and CA15-3 levels are associated with tumor burden and poor prognosis [17,22,23]. In contrast, other reports failed to support this conclusion, showing no prognostic significance of CEA or CA15-3 [21,24]. Although the European Group on Tumor Markers has recommended the use of CEA and CA15-3 for assessing prognosis and for early detection of disease progression in breast cancer since 2005 [25], the American Society of Clinical Oncology (ASCO) and National Comprehensive Cancer Network (NCCN) guidelines have not recommended the routine use of CEA and CA15-3 [26,27]. Additionally, most studies have been based on breast cancer overall; the association between these tumor markers and specific subtypes of breast cancer such as TNBC remains to be clarified.
In recent years, machine learning methods have been widely applied to disease prognosis and prediction [28][29][30]. These techniques are used to identify informative factors and to model the progression of cancer. Park et al. compared three classification models, namely, support vector machines (SVM), artificial neural networks (ANN), and semisupervised learning models (SSLM), for the prediction of breast cancer survivability based on 16 features, including tumor size, the number of nodes, and age [28]. However, SVM, ANN, and SSLM are designed for classification data and are not suitable for time-to-event data. The lasso Cox regression model and the random survival forest model are commonly used survival machine learning algorithms. For example, Zheng et al. developed a novel scoring system based on hypoxia and immune status using the lasso Cox regression model [30].
In this study, we investigated the prognostic efficacy of multiple tumor markers and constructed prognostic models for stage I-III TNBC patients with machine learning algorithms, based on the pretreatment levels of six tumor markers (CEA, CA19-9, CA125, CA242, CA211, and CA15-3), so as to help identify early-stage patients with a high risk of recurrence and mortality.

Study Population.
We conducted a retrospective analysis of stage I-III TNBC patients who were admitted to The Second Affiliated Hospital of Zhejiang University, School of Medicine, between January 2011 and December 2017 and whose serum tumor marker levels (CEA, CA19-9, CA125, CA242, CA211, and CA15-3) were measured prior to surgery or neoadjuvant chemotherapy. TNBC was defined as ER and PR negative (or <1% if the percentage was specified) and HER-2 status of 0 or 1+ by immunohistochemistry analysis, or 2+ with negative fluorescent in situ hybridization [31,32]. Patients with any missing receptor information or a missing pathology report were excluded from the analysis. Patients were also excluded if they met any of the following criteria: (1) carcinoma in situ; (2) male patients; (3) stage IV disease with distant metastasis at first diagnosis; (4) history of other malignant tumors. All data, including clinical and pathological information, treatment modality, serum tumor markers, and details of outcomes, were collected. TNM stage was based on the Eighth American Joint Committee on Cancer Criteria. Written informed consent was obtained from each breast cancer patient or the patient's guardian, and the study was approved by the Ethics Committee of The Second Affiliated Hospital of Zhejiang University, School of Medicine.

Follow-Up and Study Endpoints.
Patients were followed up every 3 months for the first 2 years, every 6 months during years 3-5, and annually thereafter, with the date of surgery considered the first day of follow-up. The primary study endpoints were disease-free survival (DFS) and overall survival (OS). DFS was defined as the time from the date of surgery to the date of locoregional recurrence, distant metastasis, a second primary cancer, or death before recurrence, or to the date of the last follow-up. OS was defined as the time from the date of surgery to death from any cause or to the date of the last follow-up.
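The endpoint definitions above can be turned into (time, event) pairs for survival analysis. The following is a minimal sketch, not the authors' code: the function names, the argument names, and the conversion of days to months with an average month length of 30.44 days are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional, Tuple

# Assumption: average month length used to convert days to months.
DAYS_PER_MONTH = 30.44


def months_between(start: date, end: date) -> float:
    """Elapsed time from start to end, expressed in months."""
    return (end - start).days / DAYS_PER_MONTH


def dfs_endpoint(
    surgery: date,
    last_follow_up: date,
    recurrence: Optional[date] = None,
    metastasis: Optional[date] = None,
    second_primary: Optional[date] = None,
    death: Optional[date] = None,
) -> Tuple[float, int]:
    """Return (months, event) for DFS; event is 1 for a DFS event, 0 if censored."""
    event_dates = [d for d in (recurrence, metastasis, second_primary, death) if d is not None]
    if event_dates:
        # The earliest of the qualifying events ends disease-free survival.
        return months_between(surgery, min(event_dates)), 1
    return months_between(surgery, last_follow_up), 0


def os_endpoint(surgery: date, last_follow_up: date, death: Optional[date] = None) -> Tuple[float, int]:
    """Return (months, event) for OS; event is 1 for death from any cause, 0 if censored."""
    if death is not None:
        return months_between(surgery, death), 1
    return months_between(surgery, last_follow_up), 0
```

A patient with a recurrence contributes a DFS event at the recurrence date but remains censored for OS at the last follow-up if still alive.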

Lasso Cox Model and Random Survival Forest Model.
The least absolute shrinkage and selection operator (lasso) Cox regression analysis was performed using the "glmnet" package [33]. Partial likelihood deviance was selected as the loss function, and the optimal value of the penalty parameter λ was determined through twenty-fold cross-validation [34]. Regression coefficients of each tumor marker were calculated with the optimal λ value, and the tumor marker risk score (TMRS) of each patient was then calculated from the levels of serum tumor markers and their associated regression coefficients.
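For reference, the lasso Cox fit and the resulting risk score can be written compactly. The block below gives the standard textbook form of the L1-penalized Cox partial likelihood, up to scaling conventions; it is not a formula taken from the paper. The symbols x_i (marker vector of patient i), δ_i (event indicator), R(t_i) (risk set at time t_i), n (number of patients), and p (number of markers) are notation introduced here.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta}
    \left\{ -\frac{2}{n}\,\ell(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\ell(\beta)
  = \sum_{i:\,\delta_i = 1}
    \left[ x_i^{\top}\beta - \log \sum_{k \in R(t_i)} \exp\!\left( x_k^{\top}\beta \right) \right],
\qquad
\mathrm{TMRS}_i = \sum_{j=1}^{p} \hat{\beta}_j \, x_{ij}.
```

Markers whose coefficient is shrunk exactly to zero drop out of the score, which is how the penalty performs variable selection.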
Random survival forest (RSF) is an extension of Breiman's random forest method designed for the analysis of right-censored time-to-event data [35]. We built an RSF predictive model using the "randomForestSRC" package [35]. The tuning parameters node size (the number of samples in a terminal node) and mtry (the number of randomly selected candidate variables at each parent node) were optimized by a grid search to minimize the out-of-bag (OOB) error. TMRS of the RSF model were calculated using the "predict" function of the "stats" package. With the median TMRS as the cut-off value, all TNBC patients were split into high TMRS and low TMRS groups in both models.
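The median split described above is simple to express in code. The following is a minimal sketch under stated assumptions: the risk scores are made-up values, the function name is hypothetical, and scores exactly equal to the median are assigned to the low group, a tie rule the text does not specify.

```python
from statistics import median
from typing import List


def split_by_median(scores: List[float]) -> List[str]:
    """Label each score 'high' if it is strictly above the median, else 'low'."""
    cut = median(scores)
    # Assumption: ties with the median go to the low group.
    return ["high" if s > cut else "low" for s in scores]


# Hypothetical TMRS values for six patients (illustration only).
risk_scores = [0.12, 0.80, 0.45, 1.30, 0.05, 0.66]
groups = split_by_median(risk_scores)
```

The same function applies to scores from either model, since only the ranking relative to the median matters.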

Statistical Analysis.
The levels of each tumor marker in different stages were compared using one-way analysis of variance (ANOVA) with Tukey's post hoc test or the nonparametric Kruskal-Wallis test, according to the distribution of the data and the test of homogeneity of variances. X-tile 3.6.1 software (Yale University, New Haven, CT, USA) was used to determine the optimal prognostic cut-off value of each tumor marker in TNBC patients [36]. The sensitivity and specificity of the survival prediction based on the TMRS were depicted by a time-dependent receiver operating characteristic (ROC) curve, with the area under the ROC curve (AUC) quantified using the "timeROC" package [37]. All packages were run in R (version 3.4.2). GraphPad Prism 6 was used to plot Kaplan-Meier survival curves, and group differences in survival time were tested using the log-rank test, with hazard ratios (HRs) and 95% confidence intervals (CIs) calculated. Differences between proportions were evaluated by the chi-square test or Fisher's exact test as appropriate. Univariable and multivariable Cox proportional hazards analyses were performed to identify independent prognostic factors for DFS. All tests were 2-sided and statistical significance was set at p < 0.05. All data were analyzed using SPSS 24.0 and GraphPad Prism 6 software.
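To make the Kaplan-Meier and log-rank steps concrete, the following is a minimal pure-Python sketch of both. It is an illustration of the standard estimators, not the GraphPad Prism implementation used in the study; the function names are hypothetical, and the p-value assumes a chi-square distribution with 1 degree of freedom for two groups.

```python
import math
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return [(event_time, survival_probability)] for right-censored data."""
    surv = 1.0
    curve = []
    for t in sorted({ti for ti, ei in zip(times, events) if ei == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        surv *= 1.0 - deaths / at_risk  # product-limit update at each event time
        curve.append((t, surv))
    return curve


def logrank_test(
    times_a: Sequence[float], events_a: Sequence[int],
    times_b: Sequence[float], events_b: Sequence[int],
) -> Tuple[float, float]:
    """Two-group log-rank test; returns (chi-square statistic, p-value with 1 df)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, ei, _ in pooled if ei == 1}):
        n = sum(1 for ti, _, _ in pooled if ti >= t)                 # at risk, both groups
        n_a = sum(1 for ti, _, g in pooled if ti >= t and g == 0)    # at risk, group A
        d = sum(1 for ti, ei, _ in pooled if ti == t and ei == 1)    # events, both groups
        d_a = sum(1 for ti, ei, g in pooled if ti == t and ei == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value
```

Calling `logrank_test` on the (time, event) pairs of the high and low TMRS groups gives the kind of group comparison reported in the results.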

Patient Characteristics and Follow-Up.
A total of 258 stage I-III TNBC patients met the criteria for inclusion in the study. The clinicopathological characteristics of the patients are shown in Table 1. The median age at diagnosis was 51.5 years (range 25-87 years), and most patients (68.2%) were between 40 and 60 years old at disease onset. Left- and right-sided disease were about equally common (50.8% and 48.8%, respectively). One patient was diagnosed with bilateral breast cancer, with left invasive ductal carcinoma and right carcinoma in situ, both negative for ER, PR, and HER-2. The pathological classification of 203 cases (78.7%) was nonspecific invasive cancer. A total of 110 (42.6%) patients were classified as histologic grade III, and the expression of Ki-67 was ≥30% (high expression) in 193 cases (74.8%). As for the TNM stage, there were 100 cases (38.8%) in stage I, 111 cases (43.0%) in stage II, and 36 (14.0%) in stage III. In addition, 178 (69.0%) patients underwent a total mastectomy, and 80 (31.0%) received breast-conserving surgery.

The Levels of Pretreatment Serum Tumor Markers.
Figure 2 shows the distribution of each tumor marker among patients of different stages. Among these early-stage TNBC patients, only a few had elevated serum tumor marker levels. For example, only 10 (3.9%), 17 (6.6%), and 10 (3.9%) patients showed elevated levels of CEA, CA19-9, and CA15-3, respectively. However, in the comparison across stages I-III, elevations of four markers (CEA, CA125, CA211, and CA15-3) tended to be found more often in more advanced stages (stage II or III). As shown in Figures 2(a) and 2(e), the serum levels of CEA and CA211 were significantly higher in stage III patients than in stage I and stage II patients. For CA15-3, both stage II and stage III TNBC patients showed higher levels than stage I patients (Figure 2(f)). However, there was no obvious correlation between the serum levels of CA19-9 or CA242 and TNM stage (Figures 2(b) and 2(d)).
We also compared the levels of tumor markers among patients without evidence of recurrence, with locoregional recurrence, and with distant metastasis (Figure S1). The results suggested that pretreatment levels of serum tumor markers did not differ significantly among patients with different DFS status.

The Optimal Cut-Off Values Determined by X-Tile and Their Prognostic Role.
Although stage II or III patients showed higher levels of tumor markers than stage I patients, only a few patients had elevated tumor marker levels; we therefore did not consider the clinical cut-off value appropriate as the prognostic cut-off for early-stage TNBC patients. Instead, we used X-tile to determine the optimal prognostic cut-off value of each tumor marker. As shown in Table 2, the optimal cut-off values of CEA, CA19-9, CA125, CA242, CA211, and CA15-3 were 2.15 ng/mL, 17.30 U/mL, 9.05 U/mL, 8.85 U/mL, 1.15 ng/mL, and 16.00 U/mL, respectively.
Thus, we aimed to evaluate patients' prognosis according to the levels of these six tumor markers.

Construction of the Prognosis Prediction Model for TNBC Patients by Lasso Cox Model and Random Survival Forest Model.
Each serum tumor marker was scored 1 if its level was higher than the optimal cut-off value and 0 otherwise. Based on the levels of these tumor markers, the lasso Cox model identified a risk signature that was significantly associated with DFS at the optimal λ value of 0.0234 (Figure 4(a)). The lasso algorithm is a shrinkage estimator that constructs a penalty function to obtain a relatively refined model [34]. In our study, the regression coefficient of CA242 shrank to zero, while the remaining tumor markers were included in the simplified lasso Cox model (Table 3). The TMRS of each patient was then calculated based on these regression coefficients and the levels of [...] (Figures 4(b) and 4(c)).
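The scoring scheme described above can be sketched as follows. The cut-off values are the ones reported in the text (Table 2); the coefficients are HYPOTHETICAL placeholders chosen for illustration, because the fitted values of Table 3 are not reproduced here. Only the zero coefficient for CA242 comes from the text. The function and dictionary names are likewise assumptions.

```python
from typing import Dict

# Optimal cut-off values reported in the text (Table 2); units as in the paper.
CUTOFFS: Dict[str, float] = {
    "CEA": 2.15, "CA19-9": 17.30, "CA125": 9.05,
    "CA242": 8.85, "CA211": 1.15, "CA15-3": 16.00,
}

# HYPOTHETICAL coefficients for illustration only; the fitted values are in
# Table 3 of the paper. Only CA242 = 0 is taken from the text.
COEFFICIENTS: Dict[str, float] = {
    "CEA": 0.3, "CA19-9": 0.2, "CA125": 0.5,
    "CA242": 0.0, "CA211": 0.6, "CA15-3": 0.1,
}


def binarize(levels: Dict[str, float]) -> Dict[str, int]:
    """Score each marker 1 if its level is higher than the cut-off, else 0."""
    return {marker: int(levels[marker] > CUTOFFS[marker]) for marker in CUTOFFS}


def tmrs(levels: Dict[str, float]) -> float:
    """Tumor marker risk score: coefficient-weighted sum of the binary scores."""
    scores = binarize(levels)
    return sum(COEFFICIENTS[marker] * scores[marker] for marker in CUTOFFS)


# Hypothetical patient: CEA, CA125, CA242, and CA15-3 exceed their cut-offs.
patient = {"CEA": 3.0, "CA19-9": 10.0, "CA125": 12.0,
           "CA242": 9.0, "CA211": 1.0, "CA15-3": 20.0}
patient_score = tmrs(patient)
```

With these placeholder coefficients the elevated CA242 contributes nothing to the score, mirroring the way the lasso removed it from the model.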
We further chose another machine learning method, the RSF model, to build the predictive model. As Figure 5(a) shows, the OOB error was lowest when mtry was 1 and node size was 65, indicating the best RSF model. In this model, the recurrence risk of each patient was also computed, and time-dependent ROC curves were then plotted. As shown in Figures 5(b) and 5(c), AUC values were 0.637 and 0.663 at 36 and 60 months for DFS, and 0.777 and 0.659 at 36 and 60 months for OS, respectively.

Prognostic Value of TMRS Groups in Two Survival Models and Subgroup Analysis.
The median TMRS was used as the threshold to divide all TNBC patients into high-risk and low-risk groups [...] (Figures 6(a) and 6(b)). The RSF model also showed great predictive value for nonmetastatic TNBC patients. The survival analysis indicated that patients in the high-risk group had significantly higher recurrence risk (HR = 2.454, 95% CI: 1.395-4.107, p = 0.0016) and mortality risk (HR = 2.857, 95% CI: 1.290-5.694, p = 0.0086) than those in the low-risk group (Figures 6(c) and 6(d)).
We further chose the lasso Cox model to evaluate model performance in the subgroup analysis, since it had a larger AUC value and better prognostic significance than the RSF model, with good interpretability for a survival model. Univariable analysis showed that T-stage (p = 0.093), N-stage (p < 0.001), and TMRS groups (p < 0.001) were potential prognostic factors for DFS (Table 4). Multivariable analysis including these factors demonstrated that, besides TMRS groups, the traditional clinicopathological factor N-stage also had independent prognostic value for DFS in TNBC patients (p < 0.001, Table 4). When stratified by lymph node status (N-stage), N2-stage (HR = 2.767, 95% CI: 1.218-6.288) and N3-stage (HR = 4.980, 95% CI: 2.081-11.917) patients showed poorer prognosis than N0-stage patients, while N1-stage patients showed no significant difference (HR = 0.658, 95% CI: 0.263-1.650) (Table 4).
Figure 2 caption: Each point represents a patient, and the cut-off value of each scatter plot is the clinical upper limit; values higher than the cut-off are shown in red and lower values in blue. The comparison between different stages was performed using one-way ANOVA and Tukey's post hoc test or the nonparametric Kruskal-Wallis test as appropriate. *p < 0.05, **p < 0.01, ***p < 0.001 indicate a significant difference. CEA: carcinoembryonic antigen; CA: cancer antigen; TNBC: triple-negative breast cancer; NS: not significant.
Hence, we selected N0-N1 patients as patients with low recurrence risk and plotted Kaplan-Meier survival curves according to TMRS groups. TMRS groups again showed excellent prognostic value: N0-N1 patients with higher TMRS showed significantly worse DFS (HR = 2.278, 95% CI: 1.189-4.346, p = 0.0135) and OS (HR = 2.982, 95% CI: 1.110-7.519, p = 0.0303) than those with lower TMRS (Figures 7(a) and 7(b)).

Discussion
The independent prognostic value of serum tumor markers, such as CEA and CA15-3, was revealed in several previous studies [17,20,22]. However, these studies contain little discussion of the molecular subtypes of breast cancer, and few studies have explored the prognostic value of multiple tumor markers. In our current study, we used X-tile to determine the best prognostic cut-off value of each tumor marker based on the idea of an "optimal cut-off value" [36] and confirmed the significant prognostic role of CEA, CA125, and CA211. We also combined the six tumor markers and constructed an excellent prognostic model for stage I-III TNBC patients, providing a method to assist in predicting prognosis. Among the six tumor markers included in our study, CEA and CA15-3 have been the most studied, and their elevated levels were closely related to poor prognosis in breast cancer patients [17,20,22,23,38]. Wu SG et al. found that elevated levels of CEA and CA15-3 had no significant effect on local recurrence-free survival but were significantly associated with decreased distant metastasis-free survival, DFS, and OS in a Chinese breast cancer cohort [23]. Their correlation analysis between molecular subtypes and tumor markers indicated that there was only 1 case (1.6%) in TNBC with elevated CEA, far fewer than in other subtypes, while the proportion with elevated CA15-3 (14.3%) was similar to that in other subtypes [23]. Although two additional studies confirmed the significant prognostic value of CEA and CA15-3 for DFS and OS in breast cancer patients overall, subgroup analyses by molecular subtype showed inconsistent results [20,38]. The study of Nam SE et al. suggested no correlation between the levels of CEA or CA15-3 and OS of TNBC patients, while another study indicated that in the basal-like subtype, which overlaps approximately 70-80% with TNBC, elevated CEA was associated with reduced breast cancer-specific survival (BCSS), but no association was observed for DFS [20,38]. Unlike our study, the studies mentioned above all used the clinical upper limit as the prognostic cut-off. The negative evidence in the TNBC subtype suggested that an optimal cut-off should perhaps be selected specifically for prognosis. Our results confirmed the prognostic value of CEA in early-stage TNBC patients when using the cut-off selected by X-tile. CA125, which is mostly used in ovarian cancer, was found to increase significantly in metastatic breast cancer patients [39,40]. In Li JX's study, no relationship was found between CA125 and breast cancer outcomes, including BCSS and DFS [38]. However, another study of young breast cancer patients indicated that a high level of CA125 was associated with worse DFS and OS when using 19.38 U/mL as the cut-off value [41], providing further evidence for selecting an optimal cut-off value. Although no study has explored the prognostic significance of CA125 in different molecular subtypes, the levels of CA125 in TNBC patients were shown to be higher than in non-TNBC patients, suggesting that elevation of CA125 can be used to predict a poor outcome of TNBC patients [42].
Table 4 note: Cox proportional hazards analysis was carried out for univariable and multivariable analyses to identify independent prognostic factors for DFS in stage I-III TNBC patients. Multivariable analysis was performed for factors with p < 0.10 in univariable analysis. **p < 0.01, ***p < 0.001 indicate a significant difference. DFS: disease-free survival; TNBC: triple-negative breast cancer; HR: hazard ratio; CI: confidence interval; TMRS: tumor marker risk score.
According to our results, CA125 showed significant prognostic value when using 9.05 U/mL as the cut-off. As for CA19-9, CA242, and CA211, few studies have explored their relationship with breast cancer. Some researchers have investigated the diagnostic value of CA19-9 in breast cancer [43], but its role in predicting prognosis remains unknown. CA242 and CA211 were discovered later than the other tumor markers. Owing to their low specificity, most studies have explored their application in the diagnosis or prognosis of pancreatic or gastrointestinal cancer [44,45]. This is the first report of the significant prognostic value of CA211 in breast cancer patients, suggesting that its role in breast cancer is worthy of further study.
In the present study, we found that both the lasso Cox model and the RSF model based on tumor markers could help stratify the recurrence risk and mortality risk of stage I-III TNBC patients. Numerous previous studies have adopted the Cox proportional hazards model with lasso penalization for survival data [30,46] because its ability to simplify variables gives it wide applicability. Our results also suggested that the lasso Cox model had a larger AUC value and better prognostic significance than the RSF model. Therefore, we developed a prognostic model involving the five tumor markers other than CA242, all of which are easily measured in clinical practice, to calculate TMRS based on machine learning algorithms for predicting the outcome of TNBC patients. The role of tumor markers was further validated in our study. When comparing the clinicopathological characteristics, we found that the high TMRS group had a more advanced stage, with more lymph node metastasis (Table S1), which raises the possibility of estimating the stage according to tumor markers. In addition to being associated with tumor burden, TMRS groups also showed excellent prognostic significance for TNBC patients.
The multivariable analysis also confirmed that TMRS was an independent prognostic factor. Thus, our study further supports the prognostic value of combining multiple tumor marker levels. Although the ASCO has not recommended therapeutic decisions based on serum tumor marker status [26], we still think that elevated serum tumor markers could be useful in identifying high-risk groups, a hypothesis that should be verified. This study has some limitations that should be considered as well. First, we did not verify the validity of the model using a validation set. Since public datasets that provide patients' serum tumor marker levels are inaccessible, we cannot perform further analysis based on an external dataset. Second, this is a single-center study with a limited number of patients, and all the patients included are from a Chinese cohort, so multicenter prospective studies should be performed to confirm the validity of this prognostic model. In addition, due to the generally good prognosis of breast cancer, the number of cases with recurrence or death was small. Given this limitation, longer-term follow-up will be needed to update the results. Moreover, we did not evaluate the prognosis of patients by comparing changes in their serum tumor marker concentrations before and after surgery, which is another strategy for using tumor markers. Finally, whether the prognostic model is suitable for metastatic patients and other molecular subtypes is worthy of further exploration.
In conclusion, our study indicated that pretreatment levels of serum CEA, CA125, and CA211 had great prognostic significance for TNBC patients when using the optimal cut-off value determined by X-tile. TMRS, which was calculated from tumor markers using the lasso Cox model, was an independent prognostic factor as well. A higher TMRS was associated with worse DFS and OS in both stage I-III and N0-N1 TNBC patients. Further studies are needed to confirm the validity of these findings and to provide more information on the use of tumor markers in therapeutic decision-making in clinical practice.
Clinical Trial of the Multi-kinase Inhibitor TT-00420 (2019ZX09301158).
The authors thank Dr. Jiaojiao Zhou of The Second Affiliated Hospital, Zhejiang University, School of Medicine, for polishing the English of the whole article.
The comparison of tumor markers' levels between different groups was performed using one-way ANOVA and Tukey's post hoc test or the nonparametric Kruskal-Wallis test as appropriate (CEA: carcinoembryonic antigen; CA: cancer antigen; TNBC: triple-negative breast cancer; NS: not significant).