Discovering Associations of Adverse Events with Pharmacotherapy in Patients with Non-Small Cell Lung Cancer Using Modified Apriori Algorithm

Aim To explore the associations between adverse events and pharmacotherapy in patients with non-small cell lung cancer. Methods 16,527 patients with non-small cell lung cancer admitted to the Cancer Hospital, Chinese Academy of Medical Sciences, between January 1, 2010, and December 31, 2016, were included in the study. Their medication and laboratory examinations data were extracted from the medical records. Common Terminology Criteria for Adverse Events Version 4.03 were utilized for adverse events reporting. A new association algorithm was developed based on Apriori algorithm and used to investigate the associations between drugs and adverse events. In addition, a statistical comparison was conducted to compare the modified Apriori algorithm with the conventional Apriori algorithm. Results Different types and levels of adverse events were identified from the abnormal laboratory findings. The three most common adverse events were hypocalcemia, elevated creatine phosphokinase, and hypertriglyceridemia. In addition, using the modified Apriori algorithm, 380 association rules were found between adverse events and chemotherapy. Moreover, the statistical comparison of the two methods demonstrated that the modified Apriori algorithm was more advantageous in analyzing the correlation between drugs and adverse events than the conventional Apriori algorithm. Conclusions The modified Apriori algorithm can be used to more efficiently associate pharmacotherapy with adverse events. Based on the modified Apriori algorithm, meaningful association rules between drugs and adverse events were found, demonstrating a promising way to reveal the risk factors of adverse events during cancer treatment.


Introduction
Lung cancer is the most common cause of cancer-related death in China. There were 705,000 (470,000 male and 225,000 female) new cases of lung cancer and 569,000 patients (387,000 male and 282,000 female) died of lung cancer in 2012 [1]. The five-year survival rate of lung cancer patients is only 15% [2]. Non-small cell lung cancer (NSCLC) accounts for more than 85% of all lung cancer cases [3]. Pharmacotherapy, especially chemotherapy, is the main strategy for cancer treatment because of its demonstrated efficacy in reducing tumor progression and improving overall survival in advanced cancer patients. However, these treatments frequently cause severe adverse reactions and induce unexpected outcomes, preventing them from being used as first-line therapies [4]. To achieve a better long-term prognosis, cancer patients are often treated with combined chemotherapy. However, simultaneous administration of multiple drugs may increase adverse drug reactions (ADR) [5]. Due to the high prevalence of NSCLC, ADR related to chemotherapy are becoming an increasingly important issue. Association algorithms, such as the Apriori algorithm, have been shown to be feasible and effective in detecting adverse drug events (ADE) [6]. The Apriori algorithm was first presented in 1994 [7] and has been widely used for frequent itemset mining and association rule learning [8].

BioMed Research International
Data mining techniques like the Apriori algorithm typically extract positive association rules from big data based on frequently occurring itemsets. Therefore, these algorithms may ignore many important but infrequent itemsets [9]. In addition, because these algorithms pay no attention to the concept and meaning of items, the results may include many meaningless and redundant rules [10].
In this study, we proposed a modified Apriori algorithm to overcome the deficiencies of conventional Apriori algorithm, especially for ADR detection, and studied the relationship of administered drugs with adverse events.

Data Source.
The study was approved by the Ethics Committee of Cancer Institute and Hospital, Chinese Academy of Medical Sciences. The database was obtained from the medical records of NSCLC patients who were admitted to Cancer Hospital, Chinese Academy of Medical Sciences, from January 1, 2010, to December 31, 2016. Patients were excluded if they did not complete the therapeutic protocol or had incomplete records. Patient information, including demography, prescriptions, medical test orders, and results of clinical laboratory tests, was extracted and normalized. The collected medical dataset contains the records of 17,048 patients. Every drug and clinical laboratory test was defined as an independent variable and coded for analysis. The collected data were organized using SQL Server 2012 database software.

Data Cleaning and Standardizing.
The obtained data were streamlined. First, data cleaning was implemented to remove duplicate records in the database. Second, consistency checking was performed to verify that the data met the requirements and to identify values that were beyond the normal range or logically unreasonable. Due to errors in inputting, coding, and extracting, the dataset contained some invalid and missing data, which were identified by consistency checking. The preferred remedy was manual retrieval. If manual retrieval was impossible, one of four other options could be selected: estimation, case deletion, variable deletion, or pairwise deletion. For a dataset with a small percentage of invalid or missing data, the affected cases were generally deleted. After data streamlining, 521 (3.06%) patients were excluded, and the remaining 16,527 patients, whose records met the data integrity requirements, were included in the study. For these patients, a total of 1,820,207 prescriptions were extracted from the electronic medical records, of which 1,201,594 were related to pharmacotherapy. Drugs were classified according to their active ingredients: drugs with different dosages, forms, or specifications but the same active ingredients were defined as the same type of drug. A total of 8,867,853 clinical test records were extracted from the electronic medical record database, involving 502 testing items. The abnormal results, above or below the normal laboratory testing range, were more meaningful and were stored separately in the database, including 888,805 above-normal and 683,225 below-normal tests.
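The two cleaning passes described above (duplicate removal, then consistency checking) can be illustrated with a minimal Python sketch. The record layout, the test codes, and the plausibility limits below are hypothetical and are not taken from the study; records that fail the check are only flagged, mirroring the preference for manual retrieval over automatic deletion.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabRecord:
    # Hypothetical record layout, for illustration only.
    patient_id: str
    test_code: str
    value: float
    taken_on: str  # ISO date string

# Hypothetical plausibility limits per test; real limits would come from the laboratory.
PLAUSIBLE_RANGE = {"CA": (0.5, 5.0), "GLU": (0.5, 60.0)}

def clean(records):
    """Drop exact duplicates, then split the rest into valid and flagged records."""
    seen = set()
    valid, flagged = [], []
    for rec in records:
        if rec in seen:  # exact duplicate: keep only the first copy
            continue
        seen.add(rec)
        low, high = PLAUSIBLE_RANGE.get(rec.test_code, (float("-inf"), float("inf")))
        if low <= rec.value <= high:
            valid.append(rec)
        else:  # implausible value: set aside for manual retrieval
            flagged.append(rec)
    return valid, flagged

records = [
    LabRecord("p1", "CA", 2.1, "2012-03-01"),
    LabRecord("p1", "CA", 2.1, "2012-03-01"),    # exact duplicate
    LabRecord("p2", "CA", 210.0, "2012-03-02"),  # outside the plausible range
]
valid, flagged = clean(records)
```

In this toy input, one record survives as valid and one is flagged for review.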

Demographic Data
2.3.1. Age and Gender. The enrolled 16,527 patients were categorized based on their age and gender. As shown in Figure 1, 9,941 (60.15%) were males and 6,586 (39.85%) were females, giving a male-to-female ratio of 1.51 : 1. The mean age at diagnosis was 61.67 years (range 13 to 94 years), and the median age was 62 years. For both males and females, the largest group of enrolled NSCLC patients was aged 60-64 years.

Geographical Distribution of Patients.
Patients' street addresses were concealed to protect their privacy, and their district addresses were extracted and converted to latitude and longitude data using the open platform of Baidu maps (http://lbsyun.baidu.com/). A heatmap with superimposed colors was plotted to describe the population distribution (Figure 2).

Algorithm Design.
A modified Apriori algorithm was developed by introducing the mechanism of the $\chi^2$ test for analyzing the associations between drugs and adverse events. The modified Apriori algorithm for association rule learning can be characterized using 2 steps. Let itemset $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items $i$, where the number of items in $I$ is $m$. Let transaction $T = \{t_1, t_2, \ldots, t_p\}$ be a set of itemsets, where every itemset that $T$ contains is a subset of $I$. Let $D$ be a set of transactions.
Step 1 contains the 4 algorithms of the conventional Apriori algorithm [8]. Algorithm 1 is the whole pseudocode of the Apriori algorithm for screening large itemsets without introducing the mechanism of $\chi^2$. Algorithm 2 counts the items in the first pass to screen the large 1-itemsets. Algorithm 3 joins the large itemsets $L_{k-1}$ with themselves to generate the candidate itemsets $C_k$. Algorithm 4 makes the process more efficient by deleting a candidate $c$ from $C_k$ if the subsets of $c$ do not belong to $L_{k-1}$. The occurrence of each $c$ belonging to $C_k$ is counted. When the result is greater than the minimum support (min sup), $c$ will be added to $L_k$. The algorithm will keep running until $L_k$ is empty.
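The level-wise search of Step 1 (first-pass counting, join, prune, support counting) can be sketched in Python as follows. This is an illustrative sketch of the conventional Apriori procedure, not the authors' MATLAB implementation; the function name and the toy transactions are hypothetical, and the support threshold is applied as "at least min_sup", which is an assumption about how the boundary case is handled.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Return {itemset: support count} for every itemset with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: n for s, n in counts.items() if n >= min_sup}
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unite pairs of large (k-1)-itemsets whose union has exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count the occurrences of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_sup}
        result.update(level)
        k += 1
    return result

# Toy transactions: each set mixes administered drugs and observed adverse events.
toy = [
    {"cisplatin", "pemetrexed", "anemia"},
    {"cisplatin", "pemetrexed"},
    {"cisplatin", "anemia"},
    {"gemcitabine", "anemia"},
]
result = frequent_itemsets(toy, min_sup=2)
```

With these toy transactions, three single items and two pairs reach the threshold; the only 3-item candidate is pruned because one of its 2-item subsets is not large.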
In order to compensate for the deficiency of the original algorithm, Step 2 was performed by introducing $\chi^2$ into the algorithm. $\chi^2$ indicates the degree of deviation between the observations and the theoretical values.
It is assumed that there are two categorical variables, $X$ and $Y$, whose values are $\{x_1, x_2\}$ and $\{y_1, y_2\}$, respectively. The sample frequency series are shown in Table 1. When the sample size $n$ is greater than 40 and the theoretical frequency of each group is not less than 5, the statistic conforms to the $\chi^2(1)$ distribution. In this way, the correlation of the two items can be identified according to the testing theory in statistics. The null hypothesis $H_0$ is that $X$ and $Y$ are independent. The alternative hypothesis $H_1$ is that "$X$ is related to $Y$." At the given significance level $\alpha = 0.05$, if the calculated result is greater than 3.841 (the critical value of $\chi^2(1)$ from the Chi-square critical table), the null hypothesis is rejected. Thus, $X$ and $Y$ are not independent from the statistical point of view at a confidence level of 0.95. If the calculated result is less than 3.841, the null hypothesis cannot be rejected. In other words, $X$ and $Y$ are treated as independent.
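For reference, the standard Pearson statistic for a 2 × 2 contingency table is written out below. Table 1 is not reproduced in this text, so the cell labels $a$, $b$, $c$, $d$ are an assumption, chosen to agree with the confidence expression $a/(a+b)$ used later in this section.

```latex
\[
\begin{array}{c|cc|c}
             & y_1 & y_2 & \text{total} \\ \hline
x_1          & a   & b   & a+b \\
x_2          & c   & d   & c+d \\ \hline
\text{total} & a+c & b+d & n = a+b+c+d
\end{array}
\qquad
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
\]
```

Under this labeling, $x_1$ would denote patients who received the drug (or drug combination) and $y_1$ those who experienced the adverse event, so that $a/(a+b)$ is the incidence of the event among treated patients.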
Algorithm 2: The pseudocode of finding frequent 1-itemsets.
Although the original algorithm can examine whether two items are related, it cannot distinguish positive from negative association rules. In order to exhibit negative association rules and to implement the modified Apriori algorithm more conveniently, a new screening variable, the minimum test value (min tev), was defined and calculated from the total sample number and the critical value determined by the significance level and the degrees of freedom. The variable comp was defined as the degree of positive or negative association, obtained by removing the squaring from the $\chi^2$ calculation while retaining the sign.
After Step 1 was applied to screen every possible combination of administered drugs, the confidence ($a/(a+b)$) and comp of each new drug combination and each test result were calculated in Step 2. Each new drug combination with confidence greater than the threshold (min cof) was retained. A comp greater than min tev indicates a positive association rule, while a comp less than negative min tev indicates a negative association rule. In addition, the intensity of an association rule can be measured by the absolute value of comp.
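One reading of the screening variables that is consistent with the description above is sketched here in Python. The exact formulas are not given in this text, so the sketch assumes that comp is the signed square root of $\chi^2/n$ for a 2 × 2 table with cells a, b, c, d, and that min tev is the square root of the critical value divided by n; under that assumption, |comp| > min tev is equivalent to $\chi^2$ > 3.841. The function names and the counts are hypothetical, and the min cof filter is left out because its interaction with negative rules is not specified here.

```python
import math

CHI2_CRITICAL = 3.841  # critical value quoted in the text (alpha = 0.05, 1 degree of freedom)

def comp(a, b, c, d):
    """Signed association score: sqrt(chi2 / n) with the sign of (ad - bc). Assumed form."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def min_tev(n, critical=CHI2_CRITICAL):
    """Threshold on |comp| equivalent to chi2 > critical for n samples. Assumed form."""
    return math.sqrt(critical / n)

def classify(a, b, c, d):
    """Label a drug/adverse-event table as a positive, negative, or non-significant rule."""
    n = a + b + c + d
    score = comp(a, b, c, d)
    threshold = min_tev(n)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "not significant"

# Hypothetical counts: a = drug and event, b = drug without event,
# c = event without drug, d = neither.
label_pos = classify(60, 40, 300, 600)   # event more frequent among treated patients
label_neg = classify(5, 95, 200, 700)    # event less frequent among treated patients
label_none = classify(10, 90, 90, 810)   # identical incidence in both groups
```

Because the sign is kept, a single pass over the candidate rules separates positive from negative associations, which the unsigned $\chi^2$ value alone cannot do.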
The desired algorithm and statistical analysis were implemented by using MATLAB software. All statistical tests were performed at significance level of 0.05.

Results and Discussion
3.1. Treatments. Chemotherapy, surgery, radiotherapy, and interventional therapy were used for treatment of NSCLC. Among them, surgery was the predominant procedure, used in 12,804 (77.47%) patients, followed by chemotherapy in 5,122 (30.99%) patients, radiotherapy in 1,777 (10.75%) patients, and interventional therapy in 56 (0.34%) patients. Multimodality therapy was performed for ...
Algorithm 3: The pseudocode of join.
A total of 592 drugs were given to patients, including chemotherapeutic agents, analgesics, biologics, antimicrobial agents, glucocorticoids, traditional Chinese medicine, and others. A total of 5,122 patients were treated with different regimens of chemotherapy using 33 types of drugs, depending on their conditions. Among them, 4,716 (92.07%) patients were treated with multiple drugs and 406 (7.93%) patients with a single agent. Among these drugs, platinum-based drugs such as cisplatin, carboplatin, nedaplatin, oxaliplatin, and lobaplatin played critical roles in chemotherapy and were found in medical orders for 4,767 patients, accounting for 93.07% of all patients treated with chemotherapy. The combination of platinum-based drugs and pemetrexed was the primary regimen, used for 2,582 (50.41%) patients, followed by the combination of platinum-based drugs and paclitaxel, used for 1,451 (28.33%) patients. Single-agent regimens were rarely used: pemetrexed disodium was administered alone in 102 (1.99%) patients and cisplatin in 97 (1.89%) patients. Table 2 shows the most commonly used chemotherapeutic regimens involving one platinum-based drug and another drug as well as their application in patients.
Targeted therapy of tumors has been widely accepted. Among all the patients studied, 782 patients received targeted treatments.
Among the 12 targeted drugs involved, Rh-endostatin and bevacizumab, both of which inhibit angiogenesis, were the most and second most frequently used, given to 242 and 233 patients, respectively. The third most used targeted drug was gefitinib, given to 130 patients. Table 3 lists the usage of targeted drugs.
... and ranking the second; 6,849 patients had hypertriglyceridemia, accounting for 41.44%; and 6,549 patients had hyperglycemia, accounting for 39.63%. In addition, anemia, hypoalbuminemia, increased GGT, and decreased white blood cell count were also important adverse events. Table 4 lists the number of patients with different types and levels of adverse events, and Figure 3 shows the adverse events that occurred in more than 1,000 patients.

Comparison of the Two Methods for Data Mining.
To obtain all possible association rules that make sense, given the low usage of certain oncology medications and the high incidence of adverse events, min sup was set at 165, which is 1% of the total number of patients, and min cof was set at 10%. In addition, to compare the modified and conventional Apriori algorithms, the association rules were mined under the same parameters using each of the two algorithms. For anticancer drugs, the conventional Apriori algorithm was implemented on the MATLAB platform.
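To make the roles of min sup and min cof concrete, the following Python sketch mines rules of the form "one or two drugs → adverse event" by direct counting. It is a brute-force illustration rather than the level-wise Apriori search used in the study; the function name, the data layout, and the toy patients are hypothetical, and both thresholds are applied as "at least", which is an assumption about the boundary case.

```python
from collections import Counter
from itertools import combinations

def mine_rules(patients, min_sup, min_cof, max_drugs=2):
    """patients: list of (set of drugs, set of adverse events).
    Returns sorted (drug combination, event, joint count, confidence) tuples."""
    antecedent_count = Counter()
    joint_count = Counter()
    for drugs, events in patients:
        for size in range(1, max_drugs + 1):
            for combo in combinations(sorted(drugs), size):
                antecedent_count[combo] += 1
                for event in events:
                    joint_count[(combo, event)] += 1
    rules = []
    for (combo, event), joint in joint_count.items():
        # Confidence: share of patients on the combination who had the event.
        confidence = joint / antecedent_count[combo]
        if joint >= min_sup and confidence >= min_cof:
            rules.append((combo, event, joint, confidence))
    return sorted(rules)

toy_patients = [
    ({"cisplatin", "gemcitabine"}, {"neutropenia"}),
    ({"cisplatin", "gemcitabine"}, {"neutropenia"}),
    ({"cisplatin"}, set()),
    ({"gemcitabine"}, {"anemia"}),
]
rules = mine_rules(toy_patients, min_sup=2, min_cof=0.5)
```

In the study's setting the same thresholds would read min_sup=165 and min_cof=0.10; the toy values are scaled down so that the four example patients produce a visible result.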
The running time of the conventional Apriori algorithm was 35.97 s, and a total of 558 association rules were obtained. Among them, 177 were association rules between a single drug and adverse events and 381 were association rules between two drugs and adverse events. The conventional Apriori algorithm cannot distinguish between positive and negative association rules. Among these 558 association rules, there were a large number of invalid rules, especially indistinct negative association rules. These invalid rules complicate the subsequent analysis of valuable association rules.
The running time of the modified Apriori algorithm was 34.49 s, slightly shorter than that of the conventional Apriori algorithm. A total of 380 association rules were obtained, far fewer than with the conventional Apriori algorithm. Among them, 119 were association rules between a single drug and adverse events and 261 were association rules between two drugs and adverse events; 370 were positive association rules and 10 were negative association rules. Table 5 compares the results obtained using the two Apriori algorithms statistically. As shown, the operation time of the modified Apriori algorithm is slightly shorter, and the modified Apriori algorithm can distinguish positive and negative association rules. The rank sum test shows a significant difference in the positive association rules obtained using the modified and conventional Apriori algorithms (P = 0.0498 for single drug association rules and P = 0.0004 for two drug association rules), but no significant difference in the negative association rules (P = 0.2329 for single drug association rules and P = 0.9188 for two drug association rules), possibly because the modified Apriori algorithm found so few negative association rules. It is believed that when more variables are added, the two will show significant differences. The statistical results show that, compared with the conventional Apriori algorithm, the modified Apriori algorithm has a slightly shorter calculation time, yields markedly fewer invalid association rules, and can distinguish negative association rules; thus, it is more advantageous in mining the correlation between administered drugs and adverse events.

Top Meaningful Association Rules.
In order to find out which drugs used for tumor treatment lead to adverse events more easily and which drugs can reduce the incidence of adverse events, the modified Apriori algorithm was used to analyze the correlation of all drugs with adverse events and identify more meaningful association rules. The positive association rules imply that the use of these drugs could lead to adverse events more easily. Attention should be paid to their clinical use and, if necessary, preventive and intervention measures should be taken. Table 6 lists the 16 top meaningful positive association rules, which cover common adverse events such as hematopoietic system suppression, abnormal liver function, and hyperglycemia.
Both anemia and decreased neutrophil counts are myelosuppressive side effects of chemotherapy. The incidence of anemia is high in patients undergoing chemotherapy [11]. Many patients have to undergo other treatments due to chemotherapy-induced anemia [12]. In this study, 38.88% of the enrolled patients exhibited different degrees of anemia. Data mining results show that cisplatin is related to anemia. Decreased neutrophil count is a common side effect of chemotherapy and is the primary reason for dose delay or reduction [13]. Cisplatin is clearly associated with decreased neutrophils and white blood cells, increased cholesterol and GGT, and hypertriglyceridemia, yet it remains an important chemotherapy drug for the treatment of NSCLC.
Hepatotoxicity can increase the levels of ALT and AST. It has been reported that liver toxicity is a common side effect of pemetrexed-containing regimens [14,15]. Our results also showed strong correlations between pemetrexed and increased ALT as well as AST, consistent with the liver toxicity of pemetrexed. Gemcitabine is associated with decreased platelet count.
In addition to chemotherapeutic drugs, the association rules show that other medications are associated with adverse events. For example, Folium Sennae is associated with increased blood bilirubin; zoledronic acid is associated with increased alkaline phosphatase; and pazufloxacin is associated with hypermagnesemia. Although the Apriori algorithm can reveal the association of drugs with adverse events, it cannot reveal causal links. These association rules still need to be confirmed by evidence-based medical or experimental studies.
Table 7 lists the top meaningful positive association rules between 5 combinations of medications and adverse events. Among all the patients in the study, 6,426 (38.88%) suffered from anemia. The incidence of anemia is 84.09% in patients using both gemcitabine and carboplatin.
The combination of gemcitabine and cisplatin increases the incidence of neutropenia to 73.27%, indicating that it is necessary to pay close attention to the neutrophil count in combination chemotherapy. Association analysis suggests that the combination of paclitaxel and gemcitabine and the combination of nedaplatin and pemetrexed greatly increase alanine aminotransferase (ALT) and aspartate aminotransferase (AST), suggesting that they may have additive effects that need to be further studied. Both single-agent and combination chemotherapy can promote hyperglycemia [16]. Hyperglycemia can not only induce related complications but also attenuate the antiproliferative effects of chemotherapy, as reported by a preclinical study [17]. Among all the patients in the study, 6,549 (39.63%) suffered from hyperglycemia, and 393 (2.38%) had level 4 hyperglycemia. The combination of paclitaxel and carboplatin is associated with the occurrence of hyperglycemia: almost half (49.72%) of the patients treated with paclitaxel and carboplatin had hyperglycemia.
The modified Apriori algorithm can help distinguish and filter out clinically significant negative association rules, which suggest that the drugs involved may reduce the occurrence of adverse events in cancer therapy. The discovery of these rules provides clues for further exploration of the mechanisms of drug actions and the development of adjuvant therapies for cancer chemotherapy. Meanwhile, if the adverse-event-reducing effects of these drugs are confirmed by higher-level evidence-based medical research, these drugs can be useful in clinical applications, directly benefiting cancer patients.
Among the negative association rules for chemotherapy-induced adverse events, administration of the antitumor drug disodium cantharidinate with vitamin B6 is negatively associated with decreased white blood cell, neutrophil, and platelet counts as well as increased AST, suggesting that it could be used against cancer without adversely affecting patients' hematopoietic system and liver function, and might even increase white blood cells and decrease AST. The published pharmacological action of disodium cantharidinate confirms that injection of disodium cantharidinate with vitamin B6 can increase white blood cells. Further association analysis shows that lentinan is negatively associated with decreased platelet count. The incidence of decreased platelet count is 10.52% in all patients, but only 5.91% in patients treated with lentinan. Lentinan is also negatively associated with decreased neutrophil count. The incidence of decreased neutrophil count is 26.41% in all patients, but only 16.52% in patients treated with lentinan.
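The incidences quoted in this section can be compared on a single scale by dividing the incidence among treated patients by the incidence among all patients, a ratio commonly called lift in association rule mining. The short Python sketch below applies it to percentages reported in the text; the lift measure itself is introduced here only for illustration and is not reported in the study.

```python
def lift(incidence_with_drug, incidence_overall):
    """Ratio of the incidence among treated patients to the incidence among all patients."""
    return incidence_with_drug / incidence_overall

# (incidence with drug %, overall incidence %), as quoted in the text.
reported = {
    ("gemcitabine + carboplatin", "anemia"): (84.09, 38.88),
    ("paclitaxel + carboplatin", "hyperglycemia"): (49.72, 39.63),
    ("lentinan", "decreased platelet count"): (5.91, 10.52),
    ("lentinan", "decreased neutrophil count"): (16.52, 26.41),
}
ratios = {rule: round(lift(*pair), 2) for rule, pair in reported.items()}
```

A ratio above 1 corresponds to a positive association and a ratio below 1 to a negative one; here the two chemotherapy combinations give about 2.16 and 1.25, while the two lentinan rules give about 0.56 and 0.63. Like the rules themselves, these ratios describe correlation only.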

Conclusions
Introducing the Chi-square test into the conventional Apriori algorithm produces a more efficient and accurate algorithm for analyzing the association rules between drugs and their related adverse events in the medical records of 16,527 patients. Using the modified Apriori algorithm, drugs associated with adverse events are identified among many association rules. These associations suggest that, in the treatment of NSCLC, more attention should be paid to these drugs and, if necessary, retrospective or prospective clinical studies should be conducted. In addition, the study shows that disodium cantharidinate in combination with vitamin B6, an antitumor treatment, is negatively associated with adverse effects on the hematopoietic system and liver function, suggesting that this combination may offer effective and low-toxicity treatment. Therefore, its anticancer mechanisms need to be further studied. Although many rules are obtained through the modified association algorithm, these rules only reveal correlation and mutual exclusion between drugs and adverse events. The intrinsic causality and mechanisms of action need to be further studied.

Conflicts of Interest
The authors report no conflicts of interest.