Identifying the Association Rules between Clinicopathologic Factors and Higher Survival Performance in Operation-Centric Oral Cancer Patients Using the Apriori Algorithm

This study computationally determines the contribution of clinicopathologic factors correlated with 5-year survival in oral squamous cell carcinoma (OSCC) patients primarily treated by surgical operation (OP) followed by other treatments. From 2004 to 2010, the program enrolled 493 OSCC patients at the Kaohsiung Medical University Hospital. The clinicopathologic records were retrospectively reviewed and compared for survival analysis. The Apriori algorithm was applied to mine the association rules between these factors and improved survival. Univariate analysis of demographic data showed that grade/differentiation, clinical tumor size, pathology tumor size, and OP grouping were associated with survival longer than 36 months. Using the Apriori algorithm, multivariate correlation analysis identified the factors that jointly accompany good survival rates with higher lift values, such as grade/differentiation = 2, clinical stage group = early, primary site = tongue, and group = OP. Without the OP, the lift values are lower. In conclusion, this hospital-based analysis suggests that early OP and other treatments starting from OP are the key to improving the survival of OSCC patients, particularly those with early-stage, moderately differentiated tongue cancer, who show better survival (>36 months) across varied OP approaches.


Introduction
In Taiwan, betel nut chewing, cigarette smoking, and alcohol consumption have been found to be highly associated with oral cancer [1], with habitual betel nut chewers showing a particularly high prevalence [2][3][4]. Oral cancer is one of the 10 most prevalent cancers in Taiwan and is mostly classified as oral squamous cell carcinoma (OSCC) [5], which has high rates of morbidity and mortality [6] because diagnosis often takes place only in the later stages [7]. Although many tumor markers [8][9][10] and single nucleotide polymorphism (SNP) markers [11] have been reported as being associated with oral cancer, outcome-based studies focusing on oral cancer therapy are lacking.
The survival of OSCC patients following surgical therapy has been reported to be affected by tumor size, nodal metastasis, staging, and differentiation [12]. Some researchers have also examined the factors involved in outcomes of postoperative radiotherapy for OSCC patients [13]. However, how these multiple survival-affecting factors correlate with one another in predicting good survival after OSCC therapy has rarely been addressed and remains a challenge. Recently, several computational methodologies have been introduced to analyze the relationship between multiple factors and therapies for several non-OSCC diseases, including machine learning algorithms [14], data mining [15], decision tree-based learning [16], and rule-based multiscale simulations [17].
The Apriori algorithm is used here to explore the correlation between clinical factors and good survival outcomes (i.e., >36 months) in operation-centric (surgery-centric) treatments, including operation alone, operation/IA, and operation/IA, CT, IV, and RT, where IA, IV, CT, and RT stand for intra-arterial, intravenous, and oral chemotherapies and radiotherapy, respectively. The study aims to computationally evaluate the correlation between clinicopathological factors and survival outcomes in 493 OSCC patients treated by operation alone or by operation followed by other nonsurgical treatments.

Data Source.
The database used to construct our case and control groups was obtained from the chart registry of the cancer center of Kaohsiung Medical University Hospital from 2004 to 2010. Patients were excluded if they had distant metastases at presentation, did not complete the therapeutic protocol in Kaohsiung Medical University Hospital, or had incomplete records. A total of 493 patients fulfilled the requirements and were included for further analyses (the raw data set is available at http://bioinfo.kmu.edu.tw/OP high-OP low groups.xlsx). The patients were followed at Kaohsiung Medical University Hospital. The last follow-up was recorded from the last outpatient visit or the date of death. This use of patient data and the study design were reviewed and approved by the Institutional Review Board of Kaohsiung Medical University Hospital (KMUH-IRB-EXEMPT-20130029).

Introduction of the Apriori
The Apriori algorithm was proposed by Agrawal and Srikant in 1994 [18] and has been widely used for frequent itemset mining and association rule learning in databases. The Apriori algorithm aims to generate the desired rules from large itemsets. The general idea is that if an itemset $l$ is large, then any rule generated from $l$ will have the minimum required support because $l$ is large; that is, for a nonempty subset $a \subset l$, the rule is $a \Rightarrow (l - a)$. The Apriori algorithm can be divided into three steps. Algorithm 1 shows the pseudocode of the Apriori algorithm. The algorithm's first pass counts item occurrences to screen the large 1-itemsets (Section 2.2.1). The second pass generates the candidate itemsets $C_k$ from the large itemsets $L_{k-1}$, using the apriori-gen function (Section 2.2.2). Next, for each transaction $t$, the subset function (Section 2.2.3) determines which candidate $k$-itemsets in $C_k$ are contained in $t$; these form the set $C_t$. Finally, the count of each candidate $c \in C_t$ is incremented, and $c$ is stored in $L_k$ if c.count ≥ minimum support. The algorithm terminates when $L_k$ is empty; that is, no frequent set of $k$ or more items is present in the database $D$.

2.2.1. Screening the Large 1-Itemsets. Algorithm 2 shows the pseudocode of the first pass, which simply counts the occurrences of the items $I = \{i_1, i_2, \ldots, i_n\}$ to determine the large 1-itemsets among all items. The array Item-counts is used to count item occurrences, and the elements in Item-counts having at least the minimum support are included in the $L_1$ set.
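The first pass can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Algorithm 2: the function name `large_1_itemsets`, the toy records, and the use of an absolute count as the minimum support are all assumptions made for the example.

```python
from collections import Counter


def large_1_itemsets(transactions, min_support):
    """Count item occurrences and keep items meeting the minimum support.

    `min_support` is treated as an absolute count (an assumption of this sketch).
    Returns a dict mapping each large 1-itemset (a frozenset) to its count.
    """
    # Each item is counted at most once per transaction.
    item_counts = Counter(item for t in transactions for item in set(t))
    return {
        frozenset([item]): count
        for item, count in item_counts.items()
        if count >= min_support
    }


# Hypothetical toy records; these are NOT taken from the study data set.
transactions = [
    {"site=tongue", "stage=early", "group=OP"},
    {"site=tongue", "stage=early"},
    {"site=gum", "stage=late", "group=OP"},
    {"site=tongue", "group=OP"},
]

L1 = large_1_itemsets(transactions, min_support=2)
for itemset, count in sorted(L1.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

With a minimum support of 2, the items that occur only once in the toy records are screened out before any candidate generation takes place.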

Candidate Set Generations.
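The function apriori-gen ($L_{k-1}$) generates $C_k$ from $L_{k-1}$, and it returns a superset of the set of all large $k$-itemsets. Algorithm 3 shows the pseudocode of the function apriori-gen ($L_{k-1}$). In the join step, pairs of itemsets $p$ and $q$ from $L_{k-1}$ are compared on their first $k-2$ items; pairs in which items 1 to $k-2$ of $p$ are not equal to items 1 to $k-2$ of $q$ are skipped. Only if a matching pair is found are $p$ and $q$ merged into a candidate $k$-itemset $c$. Finally, the prune step checks whether the $(k-1)$-subsets of $c$ are included in $L_{k-1}$, and $c$ is kept in $C_k$ only if all of them are.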
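The join and prune steps can be illustrated with a short Python sketch. It follows the standard apriori-gen procedure of Agrawal and Srikant [18] and is not a transcription of Algorithm 3; the function name `apriori_gen` and the single-letter items are assumptions chosen for the example.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    `large_prev` is a set of frozensets that all have the same size k-1.
    """
    # Represent each itemset as a sorted tuple so that prefixes are comparable.
    prev = sorted(tuple(sorted(s)) for s in large_prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: p and q must agree on their first k-2 items.
            if p[:-1] != q[:-1]:
                break  # prev is sorted, so no later q shares p's prefix
            c = p + (q[-1],)
            # Prune step: every (k-1)-subset of c must itself be large.
            if all(frozenset(s) in large_prev
                   for s in combinations(c, len(c) - 1)):
                candidates.add(frozenset(c))
    return candidates


# Hypothetical large 2-itemsets over the items A, B, C, D.
L2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
C3 = apriori_gen(L2)
print(sorted(sorted(c) for c in C3))
```

In this toy input, joining {B, C} with {B, D} would give {B, C, D}, but that candidate is pruned because its subset {C, D} is not large, so only {A, B, C} survives.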

Candidate Set Counts Using Hash Tree.
After the candidate sets $C_k$ are generated, they are stored in a hash tree created by the function subset ($C_k$, $t$). Each leaf of the hash tree comprises pointers to itemsets in $C_k$ and their associated counters, and the leaves refer to distinct partitions of $C_k$. In the hash tree, the hash function is used both to insert the candidate itemsets and to search $C_k$ for the subsets of a transaction. The hash function is $\mathrm{hash}(i) = i \bmod n$, $n < N$, where $n$ is a constant and $N$ is the number of items. Function subset ($C_k$, $t$) is a recursive function which traverses the tree from the root node to the leaves, with each item in $t = \{t_1, \ldots, t_m\}$ chosen as a possible starting item of a candidate itemset. It is applied at every level of the tree. When the traversal for $t$ reaches a leaf of the tree, all candidate itemsets in that leaf are checked against $t$ and their counters are updated.
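The counting step can be sketched as follows. Note the substitution: instead of the hash tree described above, this sketch uses a plain Python dictionary (itself a hash table) keyed by candidate itemsets, which gives the same counts but not the same traversal or memory behavior. The function name `count_candidates` and the toy records are assumptions for illustration.

```python
from itertools import combinations


def count_candidates(transactions, candidates, min_support):
    """Count how many transactions contain each candidate k-itemset.

    A dictionary lookup stands in for the hash tree of the original algorithm.
    Returns the candidates whose count reaches `min_support` (absolute count).
    """
    if not candidates:
        return {}
    k = len(next(iter(candidates)))  # all candidates share the same size k
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerate the k-subsets of the transaction and look each one up.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


# Hypothetical toy records; these are NOT taken from the study data set.
transactions = [
    {"site=tongue", "stage=early", "group=OP"},
    {"site=tongue", "stage=early"},
    {"site=gum", "stage=late", "group=OP"},
    {"site=tongue", "group=OP"},
]
C2 = {
    frozenset(["site=tongue", "group=OP"]),
    frozenset(["site=tongue", "stage=early"]),
    frozenset(["stage=early", "group=OP"]),
}
L2 = count_candidates(transactions, C2, min_support=2)
print({tuple(sorted(c)): n for c, n in L2.items()})
```

Enumerating every k-subset of a transaction is affordable only for short records; the hash tree avoids this cost by descending only into branches that can still match a candidate.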

Statistics Analysis.
Statistical analysis was performed with JMP version 9. All statistical tests were done at a 0.05 significance level.

Demographic Data and Survival
3.1.1. Age and Survival. As shown in Table 1, all patients were categorized into 2 groups based on whether survival was greater or less than 36 months. No difference among the age groups was found. This is probably because anyone who was eligible for surgical resection would have comparable survival rates.
As shown in Table 1, the site distribution of the 493 oral cancer patients showed that the commonly affected sites included the cheek mucosa, gum, tongue, and retromolar trigone. Postsurgical organ function and cosmetic outcome may vary with surgical site, but no difference in survival could be found.
As shown in Table 1, laterality is recorded in the database of cancer registries and is a mixed expression of clinical/pathological tumor size and location. It does not play a significant role in the surgical group.
As shown in Table 1, comparison of the pathological characteristics between >5-year (n = 271) and <5-year survival (n = 222) revealed better treatment outcomes for low grade tumors (P = 0.0006), suggesting that well-differentiated tumors are less aggressive and thus are associated with better overall survival.
As shown in Table 1, regional lymph node examination might reflect the detail and quality of surgical resection. However, the number of examined lymph nodes was not found to have an effect on survival. This might be due to cross-interaction between clinical lymph node stages and overall survival.

Clinical Stages, Pathology Stages, Clinical/Pathology Tumor Sizes, and Survival. As shown in Table 1, neither clinical nor pathological stages were found to have an impact on 5-year survival. There might be some influencing factors between low and high tumor stages which cannot be simply explained by surgery. However, for clinical/pathological tumor size alone, significant differences between the >5-year and <5-year groups were found (P = 0.0004 and P = 0.0141, respectively). Smaller tumor size means less tumor burden and less infiltration of surrounding tissue, which may explain the improved overall outcomes.

Surgical Modalities and Survival. As shown in Table 1, treatment modalities (OP) were further differentiated into 3 groups based on different adjuvant therapies, that is, surgery alone, surgery plus intra-arterial chemotherapy, and surgery plus concomitant chemoradiotherapy. Significant differences between groups were found (P < 0.0001), and further analysis of surgical modalities based on the clinical/pathological stages could produce interesting insights.
This hospital-based study followed nearly 500 patients with oral squamous cell carcinoma after surgical treatment. Results showed that age of onset and laterality of tumor location did not influence the treatment outcome. The latter might be attributed to oral cancer being a less multifocal or multicentric disease than, for example, breast cancer and, hence, laterality of the primary tumor has less influence on survival. These findings are in line with previous findings [19,20].
Advanced tumor stage or failure of locoregional control negatively influences survival in patients with OSCC [21]. However, we did not observe a significant influence from either clinical or pathological tumor stages. Similar to our findings, Pandey et al. reported no difference in survival rates for the extent of tumor [22], and the observed difference from other reports might be due to the fact that all tumor stages were pooled in the analysis.
In the present study, multimodality treatment proved to be a prognostic factor. Benefit from systemic or adjuvant local therapies might correlate with disease biology as the grade of tumor differentiation was also an important influencing factor.

Data Mining Results Using Apriori Algorithm.
Table 2 shows the best rules for OP > 36 months. The head and body represent a class association rule $X \Rightarrow Y$, which means the head $Y$ of an association rule $X \Rightarrow Y$ (with rule body $X$) must be restricted to one attribute-value pair. The attribute of the attribute-value pair is thus the class attribute. The resulting rules can be evaluated according to four metrics: confidence, lift, leverage, and conviction. Lift (or improvement), for which a minimum value of 1.5 was set, is computed as the confidence of the rule divided by the support of the right-hand side (RHS). The lift represents a ratio of probabilities. Given a rule $X \Rightarrow Y$, it is the ratio of the probability that $X$ and $Y$ occur together to the product of the two individual probabilities for $X$ and $Y$; that is,

$$\text{lift} = \frac{\Pr(X, Y)}{\Pr(X) \cdot \Pr(Y)}. \quad (1)$$

If lift is 1, $X$ and $Y$ are independent. The higher the lift is above 1, the more likely it is that the existence of $X$ and $Y$ together in a transaction is due to a relationship between them and not just random occurrence. Unlike lift, leverage measures the difference between the probability of co-occurrence of $X$ and $Y$ and the product of the independent probabilities of $X$ and $Y$; that is,

$$\text{leverage} = \Pr(X, Y) - \Pr(X) \cdot \Pr(Y). \quad (2)$$

Leverage measures the proportion of additional cases covered by both $X$ and $Y$ above those expected if $X$ and $Y$ were independent of each other. Thus, for leverage, values above 0 are desirable, whereas values greater than 1 are desirable for lift. Finally, conviction is similar to lift, but it measures the effect of the right-hand side not being true and also inverts the ratio. Conviction is measured as

$$\text{conviction} = \frac{\Pr(X) \cdot \Pr(\text{not } Y)}{\Pr(X, \text{not } Y)}. \quad (3)$$

Table 2 shows that the body "grade/differentiation = 2 and clinical stage group = early" is associated with the head "primary site = tongue and group = OP." The rule shows 49 patients as being grade/differentiation = 2 and clinical stage group = early, while 27 of these 49 patients also fulfill "primary site = tongue and group = OP." The confidence is the proportion of patients fulfilling "primary site = tongue and group = OP" among those fulfilling "grade/differentiation = 2 and clinical stage group = early," that is, 27/49. The lift is 1.91, meaning that the existence of "grade/differentiation = 2 and clinical stage group = early" and "primary site = tongue and group = OP" together in a transaction is not just a random occurrence. The leverage value of 0.05 means that the proportion of cases covered by both "grade/differentiation = 2 and clinical stage group = early" and "primary site = tongue and group = OP" is greater than would be expected if the two were independent of each other. The conviction value of 1.52 reflects the effect of the right-hand side not being true: the body occurs without the head less often than would be expected under independence.

Notes to Table 2: *1 [...] Table 1 are regarded as early and stage 4 is regarded as late stage in Table 2. *2 The best rules with lift >1.5 are shown here.
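The four rule metrics can be computed directly from counts, as in the following minimal Python sketch. Only the body count (49) and the joint count (27) come from the text; the head count and the total number of records are not stated there, so `ASSUMED_N_HEAD` and `ASSUMED_N_TOTAL` are hypothetical placeholders, and the printed values are not meant to reproduce Table 2. The function name `rule_metrics` is likewise an assumption.

```python
def rule_metrics(n_body, n_head, n_both, n_total):
    """Confidence, lift, leverage, and conviction of a rule body => head.

    All arguments are absolute counts of records.
    """
    p_body = n_body / n_total
    p_head = n_head / n_total
    p_both = n_both / n_total

    confidence = p_both / p_body            # Pr(head | body)
    lift = p_both / (p_body * p_head)       # ratio to independence
    leverage = p_both - p_body * p_head     # difference from independence
    p_body_not_head = p_body - p_both       # Pr(body, not head)
    if p_body_not_head == 0:
        conviction = float("inf")           # the rule is never violated
    else:
        conviction = p_body * (1 - p_head) / p_body_not_head
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


N_BODY = 49            # from the text: patients matching the rule body
N_BOTH = 27            # from the text: patients matching body and head
ASSUMED_N_HEAD = 78    # ASSUMPTION: not reported in the text
ASSUMED_N_TOTAL = 271  # ASSUMPTION: not reported in the text

metrics = rule_metrics(N_BODY, ASSUMED_N_HEAD, N_BOTH, ASSUMED_N_TOTAL)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Confidence depends only on the two counts given in the text (27/49), whereas lift, leverage, and conviction also depend on how common the head is in the whole data set, which is why the two assumed counts are needed.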
From the top down in Table 2, the lift values gradually decrease but still show a high correlation between the body/head and survival of >36 months. When the lift value of the items listed in the "body" and "head" of Table 2 is high, there is less chance of misinterpreting the relationships between the items. In the top 8 results, the same items, namely, grade/differentiation = 2, clinical stage group = early, primary site = tongue, and group = OP, recur in different arrangements between the "body" and "head". These data suggest that early stage tongue cancer with moderate differentiation will have a better survival (>36 months) with varied surgical approaches, where the OP group comprises three kinds of treatments.
In the 9th and 10th results, however, only three items are included, without group = OP, and the lift values decrease to 1.74. These rules, which lack the factor "group = OP", are therefore less strongly correlated than the top 8 results. This implies that the OP plays an important role in the correlation with improved survival (>36 months). In clinical settings, this might be due to the good treatment outcome which often accompanies surgery.
Accordingly, the Apriori algorithm as applied here is a relatively simple form of rule-based computation for identifying potential rules involving various factors, such as grade/differentiation = 2, clinical stage group = early, primary site = tongue, and group = OP. The algorithm can reveal the combined effect of these factors on the outcome of OSCC therapy.

Conclusion
This hospital-based analysis reviewed 493 patients with OSCC to mine survival factors in operation-centric patients. The results identify the importance of grade/differentiation = 2, clinical stage group = early, primary site = tongue, and group = OP in predicting higher survival for OSCC patients.