Development and Validation of a Prognostic Nomogram for Lung Adenocarcinoma: A Population-Based Study

Purpose. To establish an effective and accurate prognostic nomogram for lung adenocarcinoma (LUAD). Patients and Methods. A total of 62,355 LUAD patients from 1975 to 2016 enrolled in the Surveillance, Epidemiology, and End Results (SEER) database were randomly divided into a training cohort (n = 31,179) and a validation cohort (n = 31,176) of nearly equal size. Univariate and multivariate Cox regression analyses screened the predictive effects of each variable on survival. The concordance index (C-index), calibration curves, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) were used to examine and validate the predictive accuracy of the nomogram. Kaplan–Meier curves were used to estimate overall survival (OS). Results. Ten prognostic factors associated with OS were identified: age, sex, race, marital status, American Joint Committee on Cancer (AJCC) TNM stage (T, N, and M), tumor size, grade, and primary site. A nomogram was established based on these results. The C-indexes of the nomogram model reached 0.777 (95% confidence interval (CI), 0.773 to 0.781) and 0.779 (95% CI, 0.775 to 0.783) in the training and validation cohorts, respectively. The calibration curves were well-fitted for both cohorts. The AUCs for the 3- and 5-year OS indicated good prognostic accuracy in the training cohort (AUC = 0.832 and 0.827, respectively) and the validation cohort (AUC = 0.835 and 0.828, respectively). The Kaplan–Meier curves showed significant differences in OS among the groups. Conclusion. The nomogram allows accurate and comprehensive prognostic prediction for patients with LUAD.


Introduction
Lung cancer is the most common malignancy and the leading cause of cancer-related death worldwide, accounting for 1.8 million deaths annually [1,2]. Lung adenocarcinoma (LUAD) is the most common type of lung cancer, accounting for nearly 50% of non-small-cell lung cancers (NSCLC) [3,4]. LUAD is characterized by a high degree of malignancy and poor prognosis [5]. In recent years, the incidence and mortality of LUAD have increased, despite the introduction of novel therapeutic approaches including immunotherapy and molecular targeted therapy [6,7]. Recently, several prognostic factors for the survival of LUAD patients have been reported [8][9][10]. These studies mainly focused on LUAD-related genes and biomarkers, but their results do not integrate critical clinical information. Therefore, it is necessary to identify high-risk prognostic factors for predicting the individualized survival of patients with LUAD, and a nomogram is considered a good tool for predicting outcomes.
Nomograms, which can provide evidence-based, individualized, and highly accurate risk estimation, have been widely used for different types of cancer [11][12][13]. A nomogram is built from the variables that regression analyses of potential prognostic factors identify as meaningful, which contributes to better risk stratification and clinical decision-making [14][15][16]. These features make the nomogram practical and easy to popularize.
The majority of previous nomograms in LUAD have focused on factors such as lesion size, lymph node metastasis, histological types, treatment factors, pathological stage, lymph node invasion, and age [17][18][19]. Nonetheless, there is a lack of comprehensive and accurate nomograms for predicting the survival of LUAD patients. The Surveillance, Epidemiology, and End Results (SEER) database is representative of the United States (US) population, with patient-level data abstracted from 18 geographically diverse populations including rural, urban, and regional populations [20]. It is widely used in cancer research [21,22]. The SEER database contains rich information on LUAD patients and thus offers an excellent opportunity for LUAD study. Therefore, we used a cohort from the SEER database to develop a comprehensive, accurate, and effective nomogram for predicting the survival of LUAD patients.
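As a minimal sketch of how regression results usually become a nomogram, the Python example below rescales Cox log-hazard coefficients to a 0-100 point axis: the predictor with the widest effect range spans 100 points, and every other level is scaled proportionally. The coefficients and the function names (`build_point_table`, `total_points`) are hypothetical and chosen only for illustration; they are not the fitted values or the code of this study.

```python
# Hypothetical log-hazard coefficients per predictor level, relative to a
# reference level fixed at 0. Illustrative values only, not study results.
coefs = {
    "age": {"<50": 0.0, "50-59": 0.1, "60-69": 0.2, "70-79": 0.4, ">=80": 0.8},
    "sex": {"female": 0.0, "male": 0.2},
    "M stage": {"M0": 0.0, "M1": 1.0},
}


def build_point_table(coefs):
    """Map each predictor level to points on a shared 0-100 scale."""
    # The predictor with the widest coefficient range defines 100 points.
    max_range = max(
        max(levels.values()) - min(levels.values()) for levels in coefs.values()
    )
    table = {}
    for name, levels in coefs.items():
        low = min(levels.values())
        table[name] = {
            level: round(100 * (beta - low) / max_range, 1)
            for level, beta in levels.items()
        }
    return table


def total_points(table, patient):
    """Sum the points of one patient's levels across all predictors."""
    return sum(table[name][level] for name, level in patient.items())


points = build_point_table(coefs)
patient = {"age": "70-79", "sex": "male", "M stage": "M1"}
print(total_points(points, patient))  # 40 + 20 + 100 points
```

In a full nomogram, the total points are then read against a second axis that converts them to a predicted survival probability at a fixed time, such as 3 or 5 years.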

Data Sources.
Patients with LUAD were identified from the SEER database, a publicly available database established in 1973. The database covers approximately 26% of the population of the US and includes 17 national population-based cancer registries [23]. Data from the SEER database include demographic information as well as primary tumor site, tumor morphology, stage at diagnosis, the first course of cancer treatment, and follow-up information on the vital status of the patients with cancer. To include as much data as possible in the analysis, we retrieved database records from 1975 to 2016. All the patient information can be obtained from the supplementary section (supplementary 1. Date-AD and 2. Date-Code).

Statistical Analysis.
All analyses were performed using R version 3.6.3 and RStudio (https://www.r-project.org/). Cox regression analysis was used for univariate and multivariate analyses.
A nomogram was created based on the risk factors identified from the multivariate analysis using the "rms," "foreign," and "survival" packages in RStudio. The performance of the model was measured using the concordance index (C-index), calibration curves, receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC). The larger the C-index, the more accurate the prognostic prediction [24]. We used the "predict" function in the "survival" R package to calculate the risk score of the samples in the training cohort. Based on the median risk score, the samples were then divided into a high-risk group and a low-risk group. Survival curves were constructed using the Kaplan-Meier method and compared using the log-rank test. During the internal validation of the nomogram, the C-index, calibration curves, and ROC curves were derived from the regression analysis using the same R packages described above. A P value < 0.05 was considered statistically significant.
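To illustrate what the C-index measures, the following Python sketch computes Harrell's concordance index by direct pair counting: among all pairs in which the patient with the shorter follow-up had an observed death, it reports the fraction in which that patient also had the higher predicted risk. This is a simplified illustration on invented data, not the R routine used in the study; the function name `harrell_c_index` is hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times       -- follow-up time per patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- model output, higher means worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before
            # the end of patient j's follow-up.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy data: five patients, follow-up in months.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.7, 0.8, 0.3, 0.1]
print(harrell_c_index(times, events, risk_scores))  # 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a larger C-index indicates better discrimination. The quadratic loop is kept for readability and would be slow on a cohort of tens of thousands of patients.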

Clinical Characteristics of Patients.
A total of 67,146 patients with LUAD were identified from the SEER database. Of these, 62,355 patients who satisfied the inclusion criteria were enrolled and randomly divided into the training (n = 31,179) and validation (n = 31,176) cohorts. The methods used for data collection and analysis are summarized in Figure 1. The clinical characteristics of patients are listed in Table 1.
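A random division of this kind can be sketched in a few lines of Python. The example below shuffles patient identifiers with a fixed seed and cuts the list in half; the seed, the integer identifiers, and the function name `split_cohort` are assumptions made for illustration. The text does not describe the randomization procedure actually used, and the cohort sizes reported above (31,179 and 31,176) differ slightly from an exact half split.

```python
import random


def split_cohort(patient_ids, seed=2016):
    """Shuffle identifiers reproducibly and split them into two halves."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed so the split can be repeated
    rng.shuffle(ids)
    half = (len(ids) + 1) // 2
    return ids[:half], ids[half:]


# Hypothetical identifiers 0..62354 stand in for the enrolled patients.
training, validation = split_cohort(range(62355))
print(len(training), len(validation))  # 31178 31177
```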

Independent Prognostic Factors in the Training Cohort.
The results of the univariate and multivariate analyses are presented in Table 2. The multivariate analyses demonstrated that age, sex, race, marital status, AJCC-N, AJCC-M, tumor size, grade, and the primary tumor site were independent risk factors for OS.

Prognostic Nomogram for OS.
The nomogram was constructed using all significant independent factors derived from the regression analysis above and the AJCC-T (an important aspect of AJCC TNM staging) for OS in the training cohort (Figure 2). The C-index of this model for OS prediction was 0.777 (95% confidence interval (CI), 0.773 to 0.781). Furthermore, the calibration plot for the probability of survival at 3 or 5 years presented good agreement between nomogram prediction and actual observation (

Survival and Prognostic Factors for OS.
The Kaplan-Meier analysis was conducted to identify vital prognostic factors that could be useful to predict the outcome. The effects of vital factors for predicting OS across all patients with LUAD are shown in Figure 7. Survival rates were significantly higher in the low-risk group than in the high-risk group (Figure 7(a)). As shown in Figure 7(b), patients aged 60-69 years had the best survival, whereas patients over 80 years had the worst survival. Survival for the other age groups fell between those of patients aged 60-69 years and patients over 80 years, in order of worsening survival, 50-59, <50, and 70-79 (P < 0.0001). Female patients presented better survival than male patients (P < 0.0001) (Figure 7(c)). Further analysis showed that other races were associated with better outcomes compared with white and black races (P < 0.0001) (Figure 7(d)). Married patients (including married and domestic partners) showed better survival than single patients (including single, widowed, divorced, and separated), as shown in Figure 7(e) (P < 0.0001). Better survival was observed in LUAD stage T1 than in other stages (Figure 7(f)) (P < 0.0001). Different from the T stage, patients in the N0 stage exhibited better survival than those in the N1, N2, or N3 stages (Figure 7(g)) (P < 0.0001). The survival probability in the M0 stage was higher than in the M1 stage (including M1a and M1b) (Figure 7(h)) (P < 0.0001). As illustrated in Figure 7(i), the smaller the tumor size, the better the survival generally. However, the not otherwise specified (NOS) group presented the worst survival probability (P < 0.0001). In the plot of the association between grade and survival, grade I had better survival than other grades (Figure 7(j)) (P < 0.0001). However, patients with grade IV tumors achieved better survival than grade III patients.
In Figure 7(k), patients with the primary site at the middle lobe showed better survival than those with primary sites at the lower lobe, upper lobe, overlapping lesion of the lung, lung NOS, or the mainstem bronchus (P < 0.0001).
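For readers unfamiliar with how such curves are produced, the Python sketch below implements the Kaplan-Meier product-limit estimator: at each distinct death time, the running survival probability is multiplied by the fraction of at-risk patients who survived that time, and censored patients simply leave the risk set. The data are invented and the function name `kaplan_meier` is hypothetical; the curves in Figure 7 were produced in R.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  -- follow-up time per patient
    events -- 1 if death was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= leaving
    return curve


# Invented toy data: six patients, follow-up in months.
times = [6, 12, 12, 20, 30, 45]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
print([(t, round(s, 4)) for t, s in curve])
# [(6, 0.8333), (12, 0.6667), (20, 0.4444), (45, 0.0)]
```

Running the estimator separately on each subgroup (for example, the N0, N1, N2, and N3 groups) gives one curve per group; the log-rank test then assesses whether the curves differ.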

Discussion
LUAD is a highly heterogeneous tumor in terms of pathology, biology, and clinical behavior, which poses significant challenges to therapy and prognostic prediction [25,26]. In the past, several different approaches have been attempted to predict the prognosis of LUAD patients, for example, using features from pathology images, molecular biomarkers based on bioinformatics analysis and laboratory data, and clinical staging [27][28][29]. Here, our research focused on the combination of pathology, bioinformatics, and clinical characteristics of LUAD patients and provided a more accurate assessment of the prognosis of LUAD patients. We considered a nomogram to be an appropriate choice.
Journal of Healthcare Engineering
Other factors, such as chronic obstructive pulmonary disease (COPD), cigarette smoking, and surgery, do not apply to all LUAD patients [33]. We therefore did not include those risk factors in the analysis. Third, to some extent, the large number of patients included in the analysis contributed to the accuracy of the model. The C-index, calibration curves, ROC curves, and AUC values were also used to validate the predictive accuracy of the nomogram in the validation cohort. The C-index was 0.779 (95% CI, 0.775 to 0.783) in the validation cohort, which was higher than in the training cohort. The calibration lines overlapped more closely with the standard lines than in the training cohort (Figures 4(a) and 4(b)). The AUC values under the ROC curves for the 3- and 5-year OS showed better concordance in the validation cohort than in the training cohort (Figures 5(a) and 5(b), AUC = 0.835 and 0.828, respectively). Hence, in the current study, the C-indexes, calibration plots, ROC curves, and AUC values showed good agreement between prediction and actual observation, supporting the repeatability and reliability of the constructed nomogram. Overall, the nomogram was able to predict the prognosis of LUAD patients effectively and precisely.
However, the primary limitation of our research is the lack of external validation. Some studies have examined the generalizability of their constructed nomograms via external validation [42]. Nevertheless, our model is based on a large, nationally representative dataset, which enhances its generalizability. Our use of random sampling and effective internal verification supported the accuracy of the prediction. At the same time, it is relatively difficult to collect enough LUAD patients for external validation in a short time. Larger multicenter studies will be needed in the future to provide the external validation that our study lacks.
Furthermore, we investigated the impact of prognostic factors on patient survival. The cutoff value of the risk score determined by the Cox regression analyses was used to stratify patients into two groups. Patients with a low-risk score (≤0.798) presented better survival, similar to previous studies on LUAD and NSCLC [42,43]. In addition, according to the Kaplan-Meier analysis, age, sex, race, marital status, TNM stage, tumor size, grade, and primary tumor site were associated with survival.
The age stratification in our study was more detailed than in other studies, which is more conducive to analyzing the effect of different ages on patient survival. In our study, we concluded that age was closely correlated with LUAD survival. Older age predicted lower OS, which is consistent with other studies [44]. Furthermore, we demonstrated that female patients showed better survival, which is consistent with previous studies [45]. However, different results have been reported on whether age and sex have an impact on the survival of patients with LUAD. In the study by Pitz et al., sex but not age was an independent factor affecting prognosis, and women with LUAD showed better survival than men [46]. In contrast, Jubelirer et al. concluded that age, not sex, was the significant prognostic factor for LUAD [47]. Surprisingly, Zhao et al. concluded that neither age nor sex was a significant prognostic factor for LUAD patients [48]. We consider that these disagreements arose from the analysis of different databases or differences in study objectives.
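The stratification step can be sketched as follows in Python. The example assumes that the risk score is the relative hazard exp(β·x) of a Cox model and that the cutoff is the median score, as stated in the Statistical Analysis section; the coefficients, covariates, and function names are hypothetical, and the exact scale returned by the R "predict" call in the study is not specified in the text.

```python
import math
from statistics import median


def risk_score(beta, x):
    """Relative hazard exp(beta . x) for one patient (assumed score scale)."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))


def stratify_by_median(scores):
    """Label each score 'low' (<= median) or 'high' (> median)."""
    cutoff = median(scores)
    groups = ["low" if s <= cutoff else "high" for s in scores]
    return cutoff, groups


# Hypothetical coefficients and binary covariates for four patients.
beta = [0.5, 1.0]
patients = [[0, 0], [1, 0], [0, 1], [1, 1]]
scores = [risk_score(beta, x) for x in patients]
cutoff, groups = stratify_by_median(scores)
print(groups)  # ['low', 'low', 'high', 'high']
```

Using the median guarantees two groups of nearly equal size, at the cost of a cutoff that depends on the cohort in which it was computed.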
The impact of race on the survival of patients with lung cancer has been reported previously. Several lines of evidence have suggested that races other than white and black had the best survival [23,49]. The same trend was observed in our study. However, it is controversial whether the white or black race is associated with lower survival. Data from several studies demonstrated that there was no significant difference in survival between white and black patients with lung cancer [23,50]. Nonetheless, our study and other studies have reported that black patients with lung cancer experienced worse survival than white patients [49,51]. Some evidence indicates that white patients may have a worse survival rate [52]. The large differences in smoking prevalence and hospital choices between black and white patients may partly explain these discrepancies [53,54].
It has been reported that being married is a protective factor for survival among patients with lung cancer [55,56]. The married patients in our study included married couples and those living with domestic partners, and singles were defined as single, widowed, divorced, and separated patients, which was a more detailed definition than in previous studies [56,57]. We highlighted the potentially significant social impact of marital status on the survival of LUAD patients. In the present study, a higher survival rate was shown in married patients. Similar results have been observed in other cancers, such as pancreatic and liver cancer [58,59].
The TNM staging system is widely accepted as a tool to predict the prognosis of patients with cancer and provide therapy guidelines to doctors. Here, we did not include patients with stage Tx or Nx when assessing the impact of factors on prognosis, to improve accuracy and avoid potential errors caused by overdetailed classifications. Although the Kaplan-Meier curves showed that a higher T stage shortened the survival of the patients, patients at the T0 stage did not present the best survival in our study. We speculated that this is because of the small number of patients at the T0 stage. Patients at the N0 stage had better survival than those at higher N stages, as shown in previous studies [60]. That means that lymph node metastasis exerted a negative impact on the survival of LUAD patients. The same tendency was observed for the M stage: distant metastasis of LUAD reduced the survival rate. The same conclusion can be drawn from other studies [61]. Thus, higher TNM stages indicate poorer survival of LUAD patients.
We further confirmed the effect of tumor size on prognosis based on the 8th edition of the AJCC. Compared with the 7th edition, the latest staging criteria place greater emphasis on the importance of tumor size for a patient's prognosis [62,63]. Hence, using the most recent criteria, one can effectively analyze the effect of tumor size on prognosis. In the present study, smaller tumor size was associated with a better prognosis, while the NOS group presented the worst. A previous study also established that the tumor size in lung cancer was negatively correlated with survival [23]. However, a majority of studies have not considered the NOS group [51]. We attributed this phenomenon to the complex workup and uncertain classification of the NOS group. Taken together, our specific classification of tumor size made it suitable for predicting the prognosis of LUAD patients.
To date, several studies have indicated that the differentiation of the tumor was associated with survival in patients with lung cancer [64,65]. The general rule emerging from these studies was that the poorer the differentiation of the tumor, the shorter the survival of the patients with lung cancer. Our results validated most of these findings. However, to our surprise, tumors with grade IV differentiation showed even better survival than grade III. We attributed this difference to the limited number of patients in grade IV included in our study (n = 288). More patients in grade IV should be included in future studies.
Previous studies have explored the relationship between the primary site of NSCLC and prognosis [66]. However, conclusions from clinical studies remain controversial. Wang et al. reported that patients with lung cancer in the lower lobe had worse survival than those with tumors in the upper lobes [67]. Li et al. demonstrated that patients with NSCLC located in the main bronchus experienced worse outcomes than those with tumors at other locations [68]. However, some studies have indicated that the primary site could not contribute to predicting the survival of NSCLC at stages I/II [69]. In the current study, we found that patients with lower and upper lobe tumors showed poorer survival than those with middle lobe tumors, and mainstem bronchus tumors showed the worst prognosis. Different from the grouping used in previous studies, we added the NOS group to provide additional guidance for LUAD patients. The survival time of the lung NOS group fell between those of the mainstem bronchus group and the overlapping lesion of the lung group. Overall, this evidence suggests that the tumor primary site has a significant impact on prognosis and should be considered in prognosis assessment.

Conclusion
In conclusion, we established and validated a novel nomogram for predicting the survival of LUAD patients. Younger age, female sex, race other than white and black, married status, lower risk score, lower TNM staging, smaller tumor size, and high differentiation grade of the tumor were associated with good survival. Using this model, clinicians may evaluate the survival of LUAD individuals more precisely. In the future, the underlying mechanisms leading to these results should be studied to improve our understanding of LUAD.

Data Availability
The data used to support the findings of this study are from publicly available datasets and are available at https://seer.cancer.gov/data/.

Ethical Approval
Institutional review board approval was waived for this study because the SEER database is a public anonymized database. The author Bin Xie obtained access to the SEER database (accession number: 16037-Nov2018).