Distant Supervision with Transductive Learning for Adverse Drug Reaction Identification from Electronic Medical Records

Information extraction and knowledge discovery of adverse drug reactions (ADRs) from large-scale clinical texts are useful but demanding tasks. Two major difficulties are the lack of domain experts for labeling examples and the intractable processing of unstructured clinical texts. Although previous work has addressed these issues by applying semisupervised learning to the former and a word-based approach to the latter, such methods suffer from the complexity of acquiring initial labeled data and from ignoring the structured sequence of natural language. In this study, we propose automatic data labeling by distant supervision, where knowledge bases are exploited to assign an entity-level relation label to each drug-event pair in the texts, and we then use patterns to characterize the ADR relation. Multiple-instance learning with an expectation-maximization method is employed to estimate model parameters. The method applies transductive learning to iteratively reassign the probability of each unknown drug-event pair at training time. Through experiments with 50,998 discharge summaries, we evaluate our method by varying a large number of parameters, that is, pattern types, pattern-weighting models, and initial and iterative weightings of relations for unlabeled data. In these evaluations, our proposed method outperforms the word-based feature for NB-EM (iEM), MILR, and TSVM with F1-score improvements of 11.3%, 9.3%, and 6.5%, respectively.


Introduction
The data-driven approach for knowledge extraction from electronic medical records (EMRs) has gained much attention in recent years. An EMR repository contains a collection of tacit knowledge [1] (e.g., professionals' experiences and know-how) and explicit knowledge (e.g., diagnosis procedures and patient information) in the digital form of structured and unstructured data. Such a repository offers insight into significant healthcare problems: patient mortality prediction [2], patient risk identification [3,4], drug-disease relation extraction [5], and drug-drug interaction prediction [6,7]. One of the potential applications is automatic adverse drug reaction (ADR) identification from EMRs. An ADR is an unpleasant event (e.g., a symptom, disease, or finding) associated with a medication given at recommended dosages [8]. Even though ADRs can be identified by premarketing clinical trials, only some ADRs are reported this way, and postmarketing surveillance over a large population is necessary to monitor the remaining ones. To this end, there are two multidisciplinary tasks of ADR surveillance: ADR identification and ADR prediction. The former targets the retrieval of unrecognized ADRs that may exist in the data but are not explicitly described as knowledge, while the latter aims to construct a model for predicting unknown ADRs that have not been reported anywhere.
In earlier research, the statistical co-occurrence method was broadly employed to quantify the strength of the relationship in a drug-event pair. While the method is simple, its result may present no explicit clinical relevance for a derived drug-event pair [9] because it disregards the relational context that expresses the exact nature of a clinical event, such as a drug treating a symptom versus a drug causing a symptom. To fill this research gap, many researchers consider the surrounding context around drug and event entities within clinical texts and represent such data using either a pattern-based method [10][11][12][13][14][15] or a feature-based method [16][17][18]. A potential ADR is then identified by training a supervised or semisupervised learning [19] model. However, there are two main difficulties when dealing with unstructured texts using such learning models. The first is the scarce availability of human-annotated labeled instances to form a gold-standard example set; the second is the intractable processing of unstructured clinical texts. To address the insufficiency of labeled instances, several studies use heuristics or rules (distant supervision [20,21]); that is, they map a sentence containing an entity pair (e1, e2) from a knowledge base and tag the mentioned sentence with the relation label y to form a training set. For the second problem, the word-based approach [22][23][24] is the most commonly used method for text representation; however, it ignores grammatical and semantic dependencies among words. Therefore, pattern-based methods [10,11,14] have been promoted as either extensions of or substitutes for word-based text representation. Recently, the distant supervision paradigm was introduced to replace the hand-labeling process by obtaining instance labels from a knowledge base [20,21].
For example, suppose the knowledge bases contain the drug-event relations ("ramipril-allergy", "ADR") and ("aspirin-fever", "IND"), so-called entity-level relations. By distant supervision, we can automatically label an associated sentence mentioning such a drug-event pair, for example, ("His ramipril were discontinued due to allergy and added to list in our medical records," "ADR"), which is known as an instance-level relation. The multiple-instance learning (MIL) paradigm [25] is therefore introduced into the classifier-building process to handle these two-level relations.
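As a minimal sketch, the projection from entity-level seeds to instance-level labels can be illustrated in Python; the seed pairs, sentences, and plain substring matcher below are hypothetical stand-ins for the knowledge-base lookup and the NER/normalization steps:

```python
# Hypothetical sketch of distant supervision: entity-level labels from a
# knowledge base are projected onto sentence-level (instance) examples.
seeds = {
    ("ramipril", "allergy"): "ADR",
    ("aspirin", "fever"): "IND",
}

sentences = [
    "His ramipril were discontinued due to allergy and added to list.",
    "Aspirin was given for fever overnight.",
    "Metformin was continued at home dose.",
]

def distant_label(sentence, seeds):
    """Return (drug, event, label) if any seed pair co-occurs in the sentence."""
    text = sentence.lower()
    for (drug, event), label in seeds.items():
        if drug in text and event in text:
            return (drug, event, label)
    return None  # unmatched sentence stays unlabeled

labeled = [(s, distant_label(s, seeds)) for s in sentences]
```

A sentence matching no seed pair receives no label and is later treated as unlabeled data.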
This paper introduces an ADR identification framework aimed at classifying the entity-level relation of a drug-event pair. Our work differs from prior related works in the following aspects: (i) we propose a key phrasal pattern-based bootstrapping method for characterizing ADR and IND, (ii) we introduce alternative parameter learning of a generative model, and (iii) we enhance the proposed method by incorporating a transductive learning method.
The rest of this paper is organized as follows. A brief literature review and fundamental knowledge are given in Section 2. Then, Section 3 introduces problem formulation and our proposed framework. Section 4 presents the experimental results. Finally, the conclusion is discussed in Section 5.

Adverse Drug Reaction Identification from Unstructured Texts.
Recently, narrative notes in EMRs have been demonstrated to be a promising data source and are widely utilized for improving the detection of patients experiencing adverse reactions across drugs and indication areas [10][11][12][13][26]. There are at least three common subprocesses for dealing with unstructured texts in EMRs: (i) named entity recognition (NER) (particularly of drug and event entities) and normalization, (ii) relation generation (drug-event candidates), and (iii) relation classification (ADR identification).
As the next subprocess, drug-event candidates are generated using the windowing technique [27][28][29]. A drug-event pair tends to form a relation if the two entities are located in the same sentence, the same section, or, more practically, within the same window of size n. In general, this boundary detection (BD) task aims to detect the beginning and ending points within given texts between which a drug and an event tend to be semantically related. The challenges of the BD task [30][31][32] arise from the boundary of interest and the domain of the given texts. Many previous works define the potential boundary of a drug-event candidate as the same sentence, and sentence boundary detection (SBD) in clinical texts is recognized as a noise-prone challenge. One of the major issues is the ambiguous usage of the period or full stop ("."). Typically, the period has several possible functions, such as a sentence boundary marker, a floating-point marker (e.g., "0.08," "40.5 mg"), a marker for a numeric bullet of an enumerated list, or a separator within an abbreviation (e.g., "y.o.," "h.s."). Other punctuation marks such as the colon (":") increase the complexity of SBD as well. Additionally, grammatical dependency is a potential method for improving window-based relation generation because it considers the more specific semantic dependencies of the surrounding contexts.
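The windowing technique and the period-ambiguity problem can be sketched as follows; the regex-based splitter and the abbreviation list are illustrative assumptions, not the SBD tool used in this work:

```python
import re

# Illustrative sketch: split clinical text into sentences while protecting
# periods inside decimals and known abbreviations, then pair drug and event
# mentions that fall within a window of n tokens.
ABBREVS = {"y.o.", "h.s."}  # assumed abbreviation list

def split_sentences(text):
    # Protect decimal points (e.g., "40.5 mg") and abbreviations, split, restore.
    protected = re.sub(r"(\d)\.(\d)", r"\1<DOT>\2", text)
    for ab in ABBREVS:
        protected = protected.replace(ab, ab.replace(".", "<DOT>"))
    parts = [p.strip().replace("<DOT>", ".") for p in protected.split(".")]
    return [p for p in parts if p]

def window_candidates(tokens, drugs, events, n=5):
    # Collect (drug, event) pairs whose mentions are at most n tokens apart.
    pairs = []
    for i, t in enumerate(tokens):
        if t in drugs:
            for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
                if tokens[j] in events:
                    pairs.append((t, tokens[j]))
    return pairs
```
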
Lastly, the generated lists of drug-event candidates are identified as ADR or IND using supervised, semisupervised, or unsupervised learning methods. The notable works on ADR identification from unstructured texts are summarized in Table 1. Statistical association is one of the pioneering approaches to identifying ADRs; it considers the co-occurrence of a drug and an event within a specified window of size n to form association hypotheses, which are then assessed with the 2 × 2 contingency table.

Figure 1: An example of narrative notes from a discharge summary in an EMR system is shown in (a). The possible outcomes derived by NER, normalization, and relation generation of drugs and events from the given texts are displayed in (b). Both drugs and events are unified by UMLS CUI. For privacy concerns, confidential information is concealed by deidentification as [ * * … * * ].

On the other hand, pattern-based methods [14,15] have been shown to achieve more accurate clinical relation extraction because they rely on cues or trigger words that usually imply a semantic relation. Although a pattern-based method is more efficient than the window-based method, it requires a set of predefined patterns, or redundant pattern filtering by a human. In our previous work [13], a pattern-based method was proposed that utilizes labels weakly suggested by a set of simple rules (distant supervision), and pattern distributions were investigated for characterizing ADR relations. Differently from [10-12, 18, 37], a pattern-based method is used as the feature representation, and machine learning methods such as support vector machine (SVM), decision tree C4.5 (DT), random forest (RF), or naïve Bayes (NB) are well established as classifiers. Kang et al. [36] deploy a graph base and apply the shortest-path preference to ADR identification. With regard to the efficacy of word embedding [40] in NLP, Henriksson et al. [26] examine the distributional semantic model derived by the word-embedding method for NER, concept attribute labeling, and relation classification.
In their work, the high-dimensional semantic space of each word is used as a feature for model learning. The distributional semantic model is shown to improve classifier performance on all tasks. In another work, Nikfarjam et al. [17] apply word embedding in a similar manner; however, to generalize the semantic space, the authors employ a clustering method on the semantic vectors.

Distant Supervision and Multiple-Instance Learning.
The main objective of distant supervision is to alleviate the problem of hand-labeled training data, which are time-consuming, rare, and costly to obtain, by relying on a knowledge base, which is reliable, cheap, and ubiquitously available. Distant supervision was first introduced by Craven and Kumlien [20]; in their work, the term weakly labeled data was presented for biomedical relation extraction from MEDLINE. Later, Mintz et al. [21] proposed the interchangeable paradigm of distant supervision to extract relations from Freebase.
Their assumption is that "if the two entities participate in a relation, any sentence that contains those two entities might express that relation." Distant supervision has recently been applied to the relation extraction problem [41][42][43][44][45] by mapping relations between pairs of entities from knowledge bases (e.g., Freebase, YAGO) to sentences in a large-scale text corpus (e.g., New York Times). Similarly, in previous works on emotion classification from social media (i.e., tweets and microblog texts) [46][47][48], the authors make use of distant supervision to map emoticon lexicons or smilies from knowledge bases (i.e., Wikipedia, Weibo) to large-scale noisy texts. In the medical domain, distant supervision for ADR identification [33,49] is leveraged to automatically assign adverse reaction relations by mapping drug-event pairs from knowledge bases to health-related texts. The work of Yates et al. [49] utilizes SIDER as the knowledge base on English tweets and posted messages from a breast cancer forum, and Segura-Bedmar et al. [33] deploy the SpanishDrugEffectDB database on Spanish health-related texts. As mentioned in the previous section, applying distant supervision to a text corpus mostly involves the two-level relation concept of entity-level and instance-level relations. This mapping procedure may produce noisy labeled data [50,51], and the MIL paradigm [25] is widely used as a solution [41,42,52,53] for such wrongly labeled data. Fundamentally, MIL is aimed at handling the situation in which training labels are associated with sets of instances rather than individual instances [54]. The concept of MIL considers two levels of data, namely, bag-level and instance-level relations.
Let X be an instance space and Y be a set of labels, where Y = {−1, +1}, and let {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} be a training set, where x_i ∈ X is an instance and y_i ∈ Y is the known label of x_i; in ordinary supervised learning, the goal is to train a classifier function f : X → Y. In MIL, by contrast, the training set consists of bags and bag labels {(B_1, y_1), …, (B_n, y_n)}, where B_i = {x_i1, …, x_im} is a set of multiple instances, x_ij ∈ X, y_i ∈ Y is the label of bag B_i, and m can differ across bags; the goal of MIL is to learn f : 2^X → Y. For the ADR identification problem, the bag-level and instance-level relations in MIL are equivalent to the entity-level and instance-level drug-event relations derived by distant supervision, respectively.
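A minimal sketch of this two-level view, with illustrative (drug, pattern, event, label) tuples; each (drug, event) pair forms a bag whose instances are the individual sentence tuples mentioning that pair:

```python
from collections import defaultdict

# Illustrative instance-level tuples (drug, key phrasal pattern, event, label).
instances = [
    ("ramipril", "be-discontinue-due-to", "allergy", "ADR"),
    ("ramipril", "improved-despite", "allergy", "ADR"),
    ("ibuprofen", "be-give-for", "arthritis", "IND"),
]

bags = defaultdict(list)   # bag key = entity-level (drug, event) pair
bag_labels = {}
for drug, pattern, event, label in instances:
    bags[(drug, event)].append(pattern)      # instances grouped into one bag
    bag_labels[(drug, event)] = label        # entity-level label shared by the bag
```
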

Transductive Learning.
In semisupervised learning, the varieties of prediction method differ along three dimensions: (i) the predictive model, (ii) single versus collaborative models, and (iii) the handling of test instances. For the first dimension, recent works [55][56][57] have proposed various predictive models, such as generative models [22,58], low-density separation models [59], and graph-based models [60]. For the second, at least two alternatives, namely, self-training [61,62] and cotraining [63], can be applied to assign a label to an unlabeled instance using either a single predictive model or an ensemble of predictive models. The last dimension concerns how to handle test instances, where the two choices are (i) to treat the test instances separately from the unlabeled instances (inductive learning) or (ii) to treat them as unlabeled instances in the training step (transductive learning). Regardless of the choices made along these three dimensions, semisupervised learning requires a few labeled instances for constructing an initial model, which introduces complexity in acquiring such initial labeled data. The main idea of transductive learning is to take advantage of the information in unlabeled data at training time, while inductive learning ignores such information even when it is available [19].
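The transductive idea, folding unlabeled instances back into training, can be sketched with a toy 1-D nearest-centroid model standing in for the real classifier; the data and model below are purely illustrative:

```python
# Toy transductive self-training loop: unlabeled points receive soft labels from
# the current model and are included in the next refit.
def train(points):
    # points: list of (x, p_pos) with p_pos = soft probability of the positive class
    pos = sum(x * p for x, p in points) / max(sum(p for _, p in points), 1e-9)
    neg = sum(x * (1 - p) for x, p in points) / max(sum(1 - p for _, p in points), 1e-9)
    return pos, neg  # class centroids

def predict_proba(model, x):
    pos, neg = model
    return 1.0 if abs(x - pos) < abs(x - neg) else 0.0

def transductive_loop(labeled, unlabeled, iters=5):
    model = train(labeled)                       # initial model from labeled data only
    for _ in range(iters):
        soft = [(x, predict_proba(model, x)) for x in unlabeled]
        model = train(labeled + soft)            # refit including soft-labeled data
    return model
```

An inductive learner would stop after the first `train(labeled)` call; the loop is what makes the procedure transductive.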

Methods
This section presents the proposed ADR identification framework, which aims to overcome two shortcomings of existing research: (i) the lack of domain experts for labeling instances and (ii) the intractable processing of large-scale unstructured clinical texts. The proposed framework comprises three main tasks (Figure 2). First, a set of drug-event candidates is generated from EMR texts. Next, the silver-standard data and unseen data are prepared. Finally, we explore alternative parameter learning schemes for generative models to identify potential drug-event relations.
To solve the first issue, we assign a label to an unlabeled instance by exploiting facts in knowledge bases (i.e., SIDER and DrugBank) and consider two labels, ADR and IND, as classification outputs. While distant supervision can supply a label to an unlabeled instance simply by looking it up in the knowledge bases, the resulting labeled data set forms an MIL problem in which training labels are associated with sets of instances rather than individual instances. As for the latter issue, applying a phrase-based method and a dependency representation may improve model performance. In our work, the main idea is that a sentence describing a harmful (ADR) or beneficial (IND) clinical event can be simplified into three key elements, a drug, a key phrasal pattern, and an event, and that the dependency among these three elements is significant. The key phrasal pattern implies a semantic relation between a pair of drug and event entities. We employed a key phrasal pattern-based method for ADR identification in our previous work [13].
The method exhibits high precision; its drawback, however, is a low recall rate due to the limited number of key phrasal patterns and the use of simple models. In this work, we extend the key phrasal pattern-based method with more sophisticated models, which are expected to retain the high precision while improving retrieval performance. The EM, an iterative method, is incorporated with the Markov property assumption to draw the conditional probability distribution of the pattern-based feature (dEM). Finally, we leverage unlabeled data through transductive learning as semisupervised learning to enhance the performance of the proposed framework. For performance evaluation, we construct EM with the independence assumption through NB (iEM) as the baseline and also compare our proposed methods with multiple advanced methods: multiple-instance support vector machine (MISVM), multiple-instance naïve Bayes (MINB), multiple-instance logistic regression (MILR), and transductive support vector machine (TSVM).

Figure 2: Overview of the three main tasks of the proposed framework. (2) In the automatic data labeling (silver-standard data and unseen data labeling), distant supervision assigns a relation label (y) to each drug-event pair (d, e) obtained from the relation generation, together with its pattern p, if such a relation exists in the knowledge base. The silver-standard data set serves as the labeled data in the experiment. The two types of output data sets are a set of labeled data (D_L), composed of tuples (d, p, e, y) extracted from the corpus (EMR texts), where the labels (y) are defined for the drug-event pairs (d, e) in the knowledge base, and a set of unlabeled data (D_U), composed of tuples (d, p, e) extracted from the corpus, where no labels exist for the drug-event pairs (d, e) in the knowledge base. (3) In the relation classification, this work proposes three types of generative models with independent/dependent expectation-maximization (EM) models (iEM/dEM): (i) transductive learning with iEM (baseline), (ii) supervised learning with dEM, and (iii) transductive learning with dEM.

Multiple parameters, such as pattern types, pattern-weighting models, and the initial and iterative weightings of relation labels for unlabeled data, are investigated across the three alternative MIL models: iEM with a transductive learning setting (baseline), dEM with supervised learning, and dEM with transductive learning.

Problem Formulation.
We first present the formal definition of distant supervision and then formulate the problem using the MIL concept. Let K denote the knowledge bases regarding ADR and IND obtained from SIDER (http://sideeffects.embl.de) and DrugBank (https://www.drugbank.ca), T be a set of seeds, where T ⊆ K, and Y be a set of labels, where Y = {ADR, IND}. The data set of seeds T in the knowledge bases K, or the entity-level set, is defined over an entity space consisting of a drug entity d_i and an event entity e_i that are defined in K, where y_i ∈ Y is the label corresponding to the seed t_i and N is the total number of seeds. Therefore, the data set of seeds can be written as T = {(d_1, e_1, y_1), (d_2, e_2, y_2), …, (d_N, e_N, y_N)}. For instance, suppose K states that the drug ramipril is associated with the adverse event allergy and that the drug ibuprofen is used to treat the event arthritis as a symptom. We can then derive a data set of seeds to serve as the source of distant supervision: T = {(ramipril_d, allergy_e, ADR), (ibuprofen_d, arthritis_e, IND)}. These seeds are entity-level data used as knowledge in later processes.
Let C be a clinical-record corpus from MIMIC (https://mimic.physionet.org), which contains a set of discharge-summary sentences S. We transform each sentence into three key elements, that is, a drug entity d, a key phrasal pattern entity p, and an event entity e, while the semantics of the simplified text are retained. Given that x_j = (d_j, p_j, e_j) is a tuple obtained from an input sentence and x_j ∈ ℋ, where ℋ is the 3-dimensional entity space, the goal of automatically generating labeled examples by distant supervision is to obtain a mapping function f : ℋ → Y that relates a drug-event pair (d_j, e_j) to a relation label y_i, where (d_i, e_i, y_i) exists in T, d_j = d_i, and e_j = e_i. Finally, we can derive a set of labeled data D_L = {(d_1, p_1, e_1, y_1), (d_2, p_2, e_2, y_2), …, (d_n, p_n, e_n, y_n)}, namely, the instance-level data set, where n is the total number of mapped sentences.
For example, suppose the sentence "His ramipril were discontinued due to allergy and added to list in our medical records." exists in the corpus C. The transformed sentence x_1, simplified using a dependency tree, consists of the three key elements of a drug d_1 = ramipril_d, a key phrasal pattern p_1 = be-discontinue-due-to_p, and an event e_1 = allergy_e, where a key phrasal pattern can be taken from either the syntactically lemmatized lexicon or the surface lexicon (e.g., was-discontinued-due-to) and can be employed in either word or phrase form (discussed later in Section 3.3.1). From the mapping function f : ℋ → Y, we can project the sentence x_1 onto the seed (ramipril_d, allergy_e, ADR) in T and transfer the corresponding label ADR to the sentence x_1. We thereby derive a labeled data point by distant supervision: (ramipril_d, be-discontinue-due-to_p, allergy_e, ADR) ∈ D_L.
As another example, suppose the sentence "The allergy improved despite ongoing treatment with ramipril." also exists in the corpus C. The transformed sentence is x_2 = (ramipril_d, improved-despite_p, allergy_e). In the same manner, we can use the mapping function f : ℋ → Y to assign the corresponding label to the entity pair ramipril_d and allergy_e, deriving the labeled data point (ramipril_d, improved-despite_p, allergy_e, ADR) ∈ D_L. However, the sentence x_2 might not express a genuine clinical event of adverse reaction. This is known as a noisy label, and it needs to be handled by a particular technique such as MIL.
In the MIL concept, the bag-level and instance-level relations are equivalent to the entity-level and instance-level drug-event relations derived by distant supervision, respectively. Following the definitions in Section 2.2, where X is an instance space and Y is the set of labels with Y = {ADR, IND}, the labeled data set D_L can be rewritten in MIL form as D_L = {(B_1, y_1), …, (B_n, y_n)}, where B_i = {x_i1, …, x_im} is a set of multiple sentences, all sentences in a bag B_i correspond to the same drug d and event e, n is the number of bags, and m is the number of sentences in a bag, which can vary across bags. The unlabeled instances (D_U) are formed into bags in the same way but without labels, as D_U = {B_1, …, B_n}. Our goal is both to train an instance classifier function f : X → Y in the instance-space paradigm from D_L only (supervised learning) and to infer an accurate label for each instance in the D_U set during the training process (transductive learning). The bag label can eventually be derived by an aggregation function over the instance level, and model assessment is conducted on the performance at the entity level. Given the noisy labeling produced by distant supervision, the collective assumption and the standard assumption with logical-OR aggregation for bag-label judgment are rather improper. Our framework instead uses a relaxed version of the MIL standard assumption: both positive and negative bags may contain a mixture of positive and negative instances, but for a positive bag the probability of at least one positive instance should be maximal, and vice versa. Consequently, to learn the bag classifier f : 2^X → Y, the estimated bag label from an instance classifier can be computed using (1), where y_i is the label of bag i (the entity-level label), y_ij is the instance-level label, possibly different for each sentence instance j within the same bag i, and n is the total number of sentences in the bag.
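The relaxed aggregation can be sketched as follows, with illustrative instance-level probabilities: the bag takes the label whose maximum instance probability is largest, rather than a strict logical OR over instances.

```python
# Sketch of the relaxed bag-label aggregation: for each candidate label, take the
# maximum instance-level probability in the bag, then pick the label with the
# largest such maximum. The probabilities are illustrative.
def bag_label(instance_probs):
    """instance_probs: list of dicts {label: p}, one per sentence in the bag."""
    labels = instance_probs[0].keys()
    return max(labels, key=lambda y: max(p[y] for p in instance_probs))

probs = [{"ADR": 0.9, "IND": 0.1}, {"ADR": 0.3, "IND": 0.7}]
label = bag_label(probs)  # ADR wins: max 0.9 vs. max 0.7
```
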
Generally, the training data are not sufficient for parameter estimation. To learn the classifier function f : X → Y, we use the iterative EM technique in a transductive learning setting to estimate the posterior probability p(y|x) through the two parameters of the generative model, that is, the prior probability p(y) and the class-conditional density p(x|y).

Medical Named Entity Recognition and Relation Candidate Generation.
Figure 3 displays the information extraction from sentences in the MIMIC corpus, whose output is drug-key phrasal pattern-event tuples as candidates of the ADR or IND relation. This process involves NER, SBD, and parsing. Here, MetaMap [64] is used for NER, our in-house program for SBD (https://github.com/makoto404/MIMIC_SBD), and Stanford CoreNLP's OpenIE for parsing. After extracting relation-candidate tuples (entity_1, predicate, entity_2), we select only the tuples that include a drug name and an event name as entity_1 and entity_2, or vice versa. The output is in the form (a drug, a key phrasal pattern, an event).
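The tuple-selection step can be sketched as below; the drug and event sets stand in for MetaMap's NER output, and the triples for OpenIE-style extractions (all values are illustrative):

```python
# Illustrative filter over relation-candidate triples: keep only those whose two
# arguments are a recognized drug and a recognized event (in either order),
# normalizing to the (drug, key phrasal pattern, event) output form.
DRUGS = {"ramipril", "aspirin"}   # assumed NER output
EVENTS = {"allergy", "fever"}

def filter_triples(triples):
    kept = []
    for e1, predicate, e2 in triples:
        if e1 in DRUGS and e2 in EVENTS:
            kept.append((e1, predicate, e2))
        elif e1 in EVENTS and e2 in DRUGS:
            kept.append((e2, predicate, e1))  # normalize order to (drug, p, event)
    return kept

triples = [
    ("ramipril", "be-discontinue-due-to", "allergy"),
    ("fever", "improve-after", "aspirin"),
    ("patient", "be-admit-with", "allergy"),  # dropped: no drug argument
]
```
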
The automatic labeling process using distant supervision is illustrated in Figure 4. First, each pair of drug and event (d, e) from the set of seeds in the knowledge bases is used to extract drug-event pairs from the set of sentences; we then assign the seed label to all sentences that mention the (d, e) pair. However, to reduce ambiguity in the ground truth provided by knowledge base supervision, any pair (d, e) found to exhibit both the ADR and IND semantic relations is excluded. Given a set of sentences X, the training set D_L takes the bag form D_L = {(B_1, y_1), …, (B_n, y_n)} described above. In Block 1 of Figure 4, the first bag (Bag_1) consists of two sentences that correspond to the same entity-level drug d_1 and event e_1, and the second bag (Bag_2) contains only one sentence, relevant to drug d_2 and event e_4.
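The ambiguity filter can be sketched as follows, with hypothetical seed tuples; a pair appearing under both labels is dropped before bag construction:

```python
# Sketch of the ambiguity filter: a (drug, event) pair listed in the knowledge
# base under both ADR and IND is excluded from the seed set. Tuples are
# illustrative.
seeds = [
    ("ramipril", "allergy", "ADR"),
    ("aspirin", "fever", "IND"),
    ("drugX", "eventY", "ADR"),
    ("drugX", "eventY", "IND"),  # conflicting semantics -> pair excluded
]

labels_by_pair = {}
ambiguous = set()
for d, e, y in seeds:
    if (d, e) in labels_by_pair and labels_by_pair[(d, e)] != y:
        ambiguous.add((d, e))
    labels_by_pair[(d, e)] = y
clean_seeds = {k: v for k, v in labels_by_pair.items() if k not in ambiguous}
```
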
Finally, all sentences that can be assigned a label by distant supervision are referred to as the set of labeled data D_L, and the remaining unmatched data are used as the unlabeled data D_U.

Feature Extraction for Clinical Textual Data.
To recognize a relation between a drug and an event, our approach generates a set of relation candidates (drug-event pairs) from medical records in the form of (drug, pattern, event). Table 2 depicts examples of the multiple types of feature extraction and drug-event candidates. Our work considers three parameters for representing such relation candidates. The first parameter, called the relation boundary constraint, defines the extent of the surrounding context used for determining drug-event relations, while the second and third parameters, called the syntactic lemmatization and pattern granularity constraints, relate to the patterns used to detect drug-event relations, as follows.

(ii) Pattern granularity: in terms of pattern units, the two options are word form (W) and phrase form (P).

Figure 4: Block 1 expresses the data labeling using facts from external sources (KB seeds). D_L is the data set in which a pair of drug and event entities can be mapped to the set of KB seeds through distant supervision. Hence, all sentences corresponding to the same drug-event pair are assigned to the same bag and given the same label (labeled data D_L) according to the label of that drug-event pair in the set of seeds from the knowledge base; this D_L set is then used as the training data. Block 2 depicts our proposed MIL-dEM method. The label assignment for the unlabeled data set D_U (test set) can be obtained from a classifier in the previous process. Lastly, the unlabeled data are incorporated into and contribute to estimating the parameters of the generative model.

Pattern-Weighting Models

(i) Multivariate Bernoulli (binary) document model: a sentence is expressed as a binary vector, each element of which corresponds to a term (i.e., a word or phrase) denoted by w, with a value of either 1 or 0 for the presence or absence of that term, respectively:

x_B = (B(x, w_1), B(x, w_2), …, B(x, w_|W|)), (2)

where x_B denotes the sentence x in the form of a binary vector, B(x, w_i) = 1 when the ith term w_i occurs in the sentence x (and 0 otherwise), and w_i is a term in the universe W.
(ii) Multinomial (frequency) document model: a sentence is expressed by a vector of term frequencies (TF) as

x_TF = (TF(x, w_1), …, TF(x, w_|W|)), TF(x, w_i) = f_x(w_i) / |x|, (3)

where x_TF is the sentence x in the form of a TF vector, TF(x, w_i) expresses the frequency of the ith term w_i normalized by the sentence size |x|, and f_x(w_i) is the frequency with which the term w_i occurs in the sentence x. As another option, a document can also be expressed by a vector of term frequency-inverse document frequency (TFIDF) values as

x_TFIDF = (TF(x, w_1) ⋅ IDF(w_1), …, TF(x, w_|W|) ⋅ IDF(w_|W|)), (4)

where x_TFIDF is a sentence x ∈ X (the document universe) in the form of a TFIDF vector, and IDF(w_i) expresses the inverse document frequency, that is, the logarithm of the ratio of the total number of sentences in the universe X to the number of sentences that contain the ith term w_i.
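The three weighting schemes can be sketched on toy token lists; the corpus and vocabulary below are illustrative:

```python
import math
from collections import Counter

# Toy "sentences" as token lists: binary presence, length-normalized term
# frequency, and TF-IDF, following equations (2)-(4).
corpus = [["ramipril", "be-discontinue-due-to", "allergy"],
          ["aspirin", "be-give-for", "fever"],
          ["ramipril", "be-give-for", "fever"]]
vocab = sorted({w for s in corpus for w in s})

def binary_vec(sent):
    return [1 if w in sent else 0 for w in vocab]

def tf_vec(sent):
    counts = Counter(sent)
    return [counts[w] / len(sent) for w in vocab]

def idf(w):
    df = sum(1 for s in corpus if w in s)   # document frequency of term w
    return math.log(len(corpus) / df)

def tfidf_vec(sent):
    return [t * idf(w) for t, w in zip(tf_vec(sent), vocab)]
```
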

Probabilistic Classification Modeling.
This section describes two EM-based probabilistic classification models, one with the independence assumption (iEM) and the other with the dependency representation assumption (dEM).

EM Model with the Naïve Bayes Independence Assumption (iEM).
Let X = {x_1, x_2, …, x_|X|} be a set of sentences, x_i = (w_i1, w_i2, …, w_i|x_i|) be a sentence that includes |x_i| terms, and C = {c_1, c_2, …, c_|C|} be the set of possible classes. The probability that the sentence x_i has c_k as its class (y_i = c_k) can be formulated by Bayes' rule as

p(y_i = c_k | x_i) = p(c_k) p(x_i | c_k) / p(x_i). (5)

While in most situations it is possible to obtain the class prior p(c_k) simply from the training set, the generative probability of x_i given a class c_k usually suffers from insufficient training data. As done in several works, the independence assumption, usually called naïve Bayes (NB), can be applied to alleviate this sparseness problem, as expressed in

p(x_i | c_k) = ∏_{q=1}^{|x_i|} p(w_iq | c_k). (6)

Therefore, the NB text classifier can be rewritten in the form

p(y_i = c_k | x_i) ∝ p(c_k) ∏_{q=1}^{|x_i|} p(w_iq | c_k). (7)

Here, it is necessary to estimate two sets of parameters, denoted by θ, in the expectation-maximization (EM) algorithm. The first parameter set is the class-conditional probability of any term w_q ∈ W given the class c_k, while the other is the probability of the class c_k:

θ = {p^{t+1}(w_q | c_k), p^{t+1}(c_k)}. (8)

In the expectation step (E-step) of each iteration, the θ parameters of the previous step are applied to re-estimate the model probability. In our experiment, the convergence threshold is 10^−7 and the maximum number of iterations is set to 50.
For the maximization step (M-step), with a Laplace smoothing factor λ > 0, the (t + 1)th-iteration probabilities p^{t+1}(w_q | c_k) and p^{t+1}(c_k) are estimated from the tth-iteration posteriors. The smoothed maximum likelihood estimates for NB are computed from the empirical corpus using

p^{t+1}(w_q | c_k) = (λ + Σ_i f_{x_i}(w_q) p^t(c_k | x_i)) / (λ|W| + Σ_{w_z ∈ W} Σ_i f_{x_i}(w_z) p^t(c_k | x_i)), (9)

p^{t+1}(c_k) = (1 + Σ_i p^t(c_k | x_i)) / (|C| + |X|), (10)

where |W| is the total number of terms, w_z ∈ W is any term, and f_{x_i}(w) is the frequency of the term w in the sentence x_i.
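Under the stated assumptions, the E- and M-steps can be sketched as below on toy data; this is an illustrative re-implementation of the standard NB-EM updates with Laplace smoothing (`lam`), not the authors' code:

```python
import math
from collections import Counter

# Toy labeled/unlabeled sentences as token lists.
labeled = [(["ramipril", "be-discontinue-due-to", "allergy"], "ADR"),
           (["aspirin", "be-give-for", "fever"], "IND")]
unlabeled = [["ramipril", "be-hold-due-to", "allergy"]]
classes = ["ADR", "IND"]
vocab = sorted({w for s, _ in labeled for w in s} | {w for s in unlabeled for w in s})
lam = 1.0  # Laplace smoothing factor

def m_step(posteriors):
    # posteriors: list of (sentence, {class: weight}) over labeled + unlabeled data
    prior, cond = {}, {}
    for c in classes:
        total_w = sum(p[c] for _, p in posteriors)
        prior[c] = (1 + total_w) / (len(classes) + len(posteriors))
        counts = Counter()
        for s, p in posteriors:
            for w in s:
                counts[w] += p[c]          # soft (fractional) term counts
        denom = lam * len(vocab) + sum(counts.values())
        cond[c] = {w: (lam + counts[w]) / denom for w in vocab}
    return prior, cond

def e_step(prior, cond, sent):
    # Posterior over classes via log-space naive Bayes scoring.
    scores = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in sent)
              for c in classes}
    mx = max(scores.values())
    exp = {c: math.exp(v - mx) for c, v in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

# Initialize from labeled data only, then iterate E- and M-steps transductively.
post = [(s, {c: 1.0 if c == y else 0.0 for c in classes}) for s, y in labeled]
prior, cond = m_step(post)
for _ in range(5):
    soft = [(s, e_step(prior, cond, s)) for s in unlabeled]
    prior, cond = m_step(post + soft)
```
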
The following demonstrates an example of applying the above formulations with the key phrasal pattern-based feature. Given the L-P feature representation x_i = (C0033487, be-hold-due-to, C0020649), corresponding to a relation tuple (d_i, p_i, e_i) obtained from an input sentence where the pattern p_i is in phrase form, we can estimate p(y_i = c_k | x_i) as

p(y_i = c_k | x_i) ∝ p(c_k) p(C0033487 | c_k) p(be-hold-due-to | c_k) p(C0020649 | c_k). (11)

As another example, the L-W feature representation of the same sentence, x_i = {C0033487, be, hold, due, to, C0020649}, corresponds to a relation tuple (d_i, p_i, e_i) where the pattern p_i is in word form. We can compute the class probability of the given text p(y_i = c_k | x_i) as

p(y_i = c_k | x_i) ∝ p(c_k) p(C0033487 | c_k) p(be | c_k) p(hold | c_k) p(due | c_k) p(to | c_k) p(C0020649 | c_k). (12)

EM Model with Dependency Representation (dEM).
We introduce a dependency representation as an alternative model representation that is based on the same intuitions as the NB model but with less restrictive independence assumptions. This dependency representation is an efficient factorization of the joint probability distribution over a set of three random variables w_q, w_r, and w_s, whose domains are the drug, the key phrasal pattern, and the event, respectively. We extend the dependency representation with iterative EM learning in order to align the model assumption with natural language and to estimate unseen random variables from existing prior knowledge. This dependency representation is also known as a Bayesian network (BN), and the conditional probability of the variables given a class can be derived by the chain rule:

p(w_iq, w_ir, w_is | c_k) = p(w_iq | w_ir, w_is, c_k) p(w_ir | w_is, c_k) p(w_is | c_k).

Therefore, the BN text classifier can be rewritten in the form

p(y_i = c_k | x_i) ∝ p(c_k) p(w_iq | w_ir, w_is, c_k) p(w_ir | w_is, c_k) p(w_is | c_k).

In a BN, a random variable is represented by a node in a directed acyclic graph (DAG), and an edge between two nodes is drawn as an arrow implying a direct influence of one node on the other. Given a sentence x_i with three elements (w_iq, w_ir, w_is) in the form of a relation tuple (d_i, p_i, e_i), there are 3! = 6 factorizations that serve as alternative model skeletons of the dependency representation through the chain rule. We therefore propose linear interpolation in order to weigh and combine the probability estimates from all possible dependency representations:

p(x_i | c_k) = Σ_{j=1}^{6} γ_j p_j(x_i | c_k), such that Σ_{j=1}^{6} γ_j = 1.

In general, the linear interpolation of three random variables can be estimated from combinations of two random variables and individual random variables; similarly, two random variables can be approximated from individual random variables.
For instance, given two history terms w_iq and w_ir in a sentence x_i, the interpolated estimate combines the one- and two-variable estimators:

p(w_ir | w_iq, c_k) = β_1 p(w_ir | c_k) + β_2 p(w_ir | w_iq, c_k), (17)

such that Σ_{i=1}^{2} β_i = 1. Likewise, when three history terms (w_iq, w_ir, w_is) are given in a sentence x_i, the likelihood estimate in (18) interpolates the one-, two-, and three-variable estimators:

p(w_iq | w_ir, w_is, c_k) = α_1 p(w_iq | c_k) + α_2 p(w_iq | w_ir, c_k) + α_3 p(w_iq | w_is, c_k) + α_4 p(w_iq | w_ir, w_is, c_k), (18)
such that Σ_{i=1}^{4} α_i = 1. Finally, we compute p(w_iq | w_ir, c_k), p(w_iq | w_is, c_k), p(w_ir | w_is, c_k), p(w_is | w_iq, c_k), and p(w_is | w_ir, c_k) in the same manner as (17), and calculate p(w_iq | w_ir, w_is, c_k) and p(w_ir | w_iq, w_is, c_k) in the same way as (18).
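The interpolations in (17) and (18) reduce to plain weighted sums; a minimal sketch follows, where the coefficient values are illustrative defaults, not the tuned values from the paper:

```python
def interp2(p_uni, p_bi, beta=(0.5, 0.5)):
    """Eq. (17)-style interpolation: combine a one-variable estimate
    p(w_r|c) and a two-variable estimate p(w_r|w_q,c)."""
    assert abs(sum(beta) - 1.0) < 1e-9
    return beta[0] * p_uni + beta[1] * p_bi

def interp3(p_uni, p_bi1, p_bi2, p_tri, alpha=(0.25, 0.25, 0.25, 0.25)):
    """Eq. (18)-style interpolation of one-, two-, and three-variable
    estimators with coefficients summing to 1."""
    assert abs(sum(alpha) - 1.0) < 1e-9
    return sum(a * p for a, p in zip(alpha, (p_uni, p_bi1, p_bi2, p_tri)))
```

Because each set of coefficients sums to 1, the interpolated value remains a valid probability whenever its component estimates are.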
In the same manner as the NB model, it is necessary to estimate four sets of parameters θ, where w_q, w_r, w_s ∈ W:

θ = {p^{t+1}(w_q | c_k), p^{t+1}(w_q | w_r, c_k), p^{t+1}(w_q | w_r, w_s, c_k), p^{t+1}(c_k)}.

Iterative learning with the EM approach is applied to estimate the parameter θ. In the E-step of each iteration, the θ parameter is applied to re-estimate the model probability as shown in (20) and (21). This process repeats until convergence. The same settings as the iEM model, a convergence threshold of 10^-7 and a maximum of 50 iterations, are applied to the dEM model as well.
For the M-step, a Laplace smoothing factor λ > 0 is used, as in the NB model, to avoid the zero-count issue. With the BN dependency representation, however, there are four parameter estimates at the (t+1)th iteration, p^{t+1}(w_q | w_r, w_s, c_k), p^{t+1}(w_q | w_r, c_k), p^{t+1}(w_q | c_k), and p^{t+1}(c_k), which can be estimated from the tth-iteration probability as

p^{t+1}(w_q | c_k) = (λ + Σ_i n(w_q, x_i) p^t(y_i = c_k | x_i)) / (λ|W| + Σ_{w_z ∈ W} Σ_i n(w_z, x_i) p^t(y_i = c_k | x_i)), (22)

p^{t+1}(w_q | w_r, c_k) = (λ + Σ_i n(w_q, w_r, x_i) p^t(y_i = c_k | x_i)) / (λ|W| + Σ_{w_z ∈ W} Σ_i n(w_z, w_r, x_i) p^t(y_i = c_k | x_i)), (23)

p^{t+1}(w_q | w_r, w_s, c_k) = (λ + Σ_i n(w_q, w_r, w_s, x_i) p^t(y_i = c_k | x_i)) / (λ|W| + Σ_{w_z ∈ W} Σ_i n(w_z, w_r, w_s, x_i) p^t(y_i = c_k | x_i)), (24)

where |W| is the total number of terms, w_z ∈ W, and n(·, x_i) counts the (co-)occurrences of the given terms in x_i.
Then, we can derive p^{t+1}(w_r | c_k) and p^{t+1}(w_s | c_k) using calculations similar to (22). The dependency representations over two random variables, that is, p^{t+1}(w_q | w_s, c_k), p^{t+1}(w_r | w_q, c_k), p^{t+1}(w_r | w_s, c_k), p^{t+1}(w_s | w_q, c_k), and p^{t+1}(w_s | w_r, c_k), can be computed following an approach similar to (23). Similarly, the estimates of p^{t+1}(w_r | w_q, w_s, c_k) and p^{t+1}(w_s | w_q, w_r, c_k) can be obtained in the same way as (24). Finally, the interpolation coefficients γ, β, and α are employed to weigh the knowledge from the multiple dependency representations. Algorithm 1 gives pseudocode for the iEM model, and Algorithm 2 expresses our proposed dEM method.

The Incorporation of Unlabeled Data.
In an environment with insufficient labeled data, SSL is one solution that utilizes an inexpensive and ubiquitous source of data. Transductive learning [65], one type of SSL, begins by using a limited amount of labeled data (D_L) to build a rough model and then aggregates a large amount of unlabeled data (D_U) (the test set) to revise and improve the model iteratively. In the experiment, we investigated three alternative approaches for the initialization and iterative weighting of relation labels when incorporating unlabeled data.

(i) T_p^ML: This method is equivalent to general transductive learning, in which the labels of the test set D_U are derived by a classifier trained on D_L. The D_L augmented with the labeled D_U, called D_{L+U}, is then used for further iterations.
(ii) T_p^0.5: The class probability of each instance in D_U is assigned equally (0.5 per label) and used as the initial probability. In this approach, D_{L+U} can be formed earlier and integrated into the training process from the first iteration, so in subsequent iterations D_U is not strictly guided by the labeled data. The revision process then proceeds in the same manner as the previous method, combining the data set D_{L+U} for further iterations.
(iii) T_p^random: Similarly, the initial probability of D_U is assigned randomly rather than fixed at 0.5. The likelihood of each label can vary from 0 to 1, while the total probability of the ADR and IND labels equals 1.
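The three initial weightings above can be sketched as follows; the function and its interface are hypothetical illustrations, not the paper's implementation:

```python
import random

def init_unlabeled(scheme, clf_prob=None, rng=random.Random(0)):
    """Return the initial (p_ADR, p_IND) for one unlabeled drug-event pair.
    scheme 'ML':     use a classifier trained on D_L (clf_prob = its p_ADR);
    scheme '0.5':    assign equal probability to both labels;
    scheme 'random': draw p_ADR uniformly from [0, 1]."""
    if scheme == "ML":
        p_adr = clf_prob
    elif scheme == "0.5":
        p_adr = 0.5
    elif scheme == "random":
        p_adr = rng.random()
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return p_adr, 1.0 - p_adr  # the two label probabilities always sum to 1
```

Whatever the scheme, the resulting soft labels for D_U are merged with D_L to form D_{L+U} and are then revised at each EM iteration.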
In order to evaluate our proposed method, three types of text representation across three parameters of unlabeled data incorporation are investigated. Finally, our proposed method and its enhancements, the MIL-dEM-SL (supervised learning) and MIL-dEM-T (transductive learning) methods, are compared to TSVM and to three MIL models, MISVM, MINB, and MILR, implemented in WEKA [66].

Input:
  C = the number of labels
  T = the maximum number of iterations
  γ, β, α = interpolation coefficients with Σ_j γ_j = 1, Σ_j β_j = 1, Σ_j α_j = 1
Output: θ parameter
1: t ← 0
2: θ = {p^t(w_q | c_k), p^t(w_q | w_r, c_k), p^t(w_q | w_r, w_s, c_k), p^t(c_k)}; Σ_{k=1}^{C} p^t(c_k) = 1
3: repeat
4:   for i = 1 to n do
5:     E-step: estimate the model probability p^t(y_i = c_k | x_i) (21)
       M-step: update the class-conditional probabilities p^{t+1}(w_q | c_k), p^{t+1}(w_q | w_r, c_k), p^{t+1}(w_q | w_r, w_s, c_k) (22)-(24); update the class probability p^{t+1}(c_k)
6:   t ← t + 1
7: until convergence or t = T

Algorithm 2: Pseudocode for EM with the dependency representation (dEM).

Input:
Input:
  C = the number of labels
  T = the maximum number of iterations
Output: θ parameter
1: t ← 0
2: θ = {p^t(w_q | c_k), p^t(c_k)}; Σ_{k=1}^{C} p^t(c_k) = 1
3: repeat
4:   for i = 1 to n do
5:     E-step: estimate the model probability p^t(y_i = c_k | x_i) (9)
       M-step: update the class-conditional probability p^{t+1}(w_q | c_k) (10); update the class probability p^{t+1}(c_k) (11)
6:   t ← t + 1
7: until convergence or t = T

Algorithm 1: Pseudocode for EM with the NB independence assumption (iEM).
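The iEM procedure can be sketched as a short runnable Python program. The data layout (token lists with per-document soft labels) and the log-likelihood convergence check are our assumptions for illustration, not the paper's implementation; a fully transductive variant would additionally clamp the responsibilities of labeled documents:

```python
import math
from collections import defaultdict

def iem(docs, resp, classes, lam=1.0, tol=1e-7, max_iter=50):
    """EM with the naive-Bayes assumption. docs: list of token lists;
    resp: initial soft labels, one {class: prob} dict per document
    (unlabeled documents carry the chosen initial weighting)."""
    vocab = sorted({w for d in docs for w in d})
    resp = [dict(r) for r in resp]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # M-step: class prior and Laplace-smoothed class-conditional terms
        pc = {c: sum(r[c] for r in resp) / len(docs) for c in classes}
        cnt = {c: defaultdict(float) for c in classes}
        for d, r in zip(docs, resp):
            for w in d:
                for c in classes:
                    cnt[c][w] += r[c]
        pwc = {}
        for c in classes:
            tot = sum(cnt[c].values())
            pwc[c] = {w: (cnt[c][w] + lam) / (tot + lam * len(vocab))
                      for w in vocab}
        # E-step: re-estimate p(y_i = c_k | x_i) in log space
        ll = 0.0
        for i, d in enumerate(docs):
            logp = {c: math.log(pc[c]) + sum(math.log(pwc[c][w]) for w in d)
                    for c in classes}
            m = max(logp.values())
            z = sum(math.exp(v - m) for v in logp.values())
            resp[i] = {c: math.exp(logp[c] - m) / z for c in classes}
            ll += m + math.log(z)
        if abs(ll - prev_ll) < tol:  # converged on the log-likelihood
            break
        prev_ll = ll
    return pc, pwc, resp
```

Labeled documents start with hard responsibilities (1.0/0.0), while unlabeled ones start from the T_p^ML, T_p^0.5, or T_p^random initialization and are revised every iteration.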

Evaluation
We assess our proposed method using the various parameter settings shown in Table 3 and evaluate it with hold-out evaluation through k-fold cross-validation with k = 5. Three main measures, defined by (26), (27), and (28), that is, precision, recall, and F1, are used for model evaluation, where the positive class in our experiments is the ADR label. We use the MetaMap Java API for NER and the Stanford CoreNLP Java API for OpenIE, and we implement the EM-based methods in Python. For model comparison, we run the WEKA Java-based software and SVMlight (http://svmlight.joachims.org), which is implemented in C, on Mac OS with an Intel Core i5 processor running at 2.5 GHz and 8 GB of physical memory. The experiments are conducted on the MIMIC-III clinical database [67]. The data is freely available at PhysioNet (https://mimic.physionet.org) and was accessed on Apr 25, 2016 (version 1.3). The data source contains over 58,000 hospital admissions for 38,645 adults and 7,875 neonates, spanning up to 12 years from June 2001. In our work, the discharge summaries from two main hospital sections, the brief hospital course (BHC) and the history of present illness (HPI), are explored. For data preparation, we employ sentence boundary detection (SBD), stop word removal, tokenization, NER, and normalization. We consider two UMLS CUI semantic types, CHEM and DISO, for drug and event entities, respectively. As a result, nearly 1.6 million sentences are extracted and used as our corpus.
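The preparation steps named above can be sketched as a toy pipeline. This is a stand-in for illustration only: real NER and normalization use MetaMap against the full UMLS, whereas here a tiny hypothetical lookup table and stop word list take their place:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "was", "is"}  # tiny illustrative list
# Hypothetical CUI entries standing in for MetaMap's UMLS mapping:
CUI_LOOKUP = {"prednisone": "C0033487", "hyperglycemia": "C0020456"}

def preprocess(text):
    """Sketch of the pipeline: SBD, tokenization, stop word removal,
    and normalization of recognized entities to UMLS CUIs."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())    # naive SBD
    out = []
    for s in sentences:
        tokens = re.findall(r"[A-Za-z0-9']+", s.lower())    # tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]  # stop word removal
        tokens = [CUI_LOOKUP.get(t, t) for t in tokens]     # normalization
        out.append(tokens)
    return out

print(preprocess("Prednisone was held. Hyperglycemia improved."))
```

Each output sentence is a token list in which drug and event mentions are replaced by their CUIs, matching the input format expected by the relation-tuple extraction step.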

Results and Discussion.
We conduct four main experiments in order to evaluate the effectiveness of our proposed method: (i) the key phrasal pattern analysis, (ii) the evaluation on the effectiveness of the key phrasal patterns, (iii) the effectiveness of the pattern-based feature with MIL-iEM and MIL-dEM, and (iv) the evaluation on overall performance with advanced machine learning methods.

Key Phrasal Pattern Analysis.
We initially analyze the discovered key phrasal patterns to investigate how strongly they characterize the relation labels. Given a key phrasal pattern, we compute the pattern score (S) by inverting the conditional entropy (H) and adjusting the polarity to visualize the performance of the extracted key phrasal patterns.
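Assuming the score combines the inverted conditional entropy of the label given the pattern with a polarity sign taken from the majority label (the paper's exact formula may differ), a minimal sketch is:

```python
import math

def pattern_score(n_adr, n_ind):
    """Score S for one pattern from its ADR/IND co-occurrence counts:
    S = polarity * (1 - H(label | pattern)), with polarity +1 when the
    pattern leans ADR and -1 when it leans IND."""
    total = n_adr + n_ind
    h = 0.0
    for n in (n_adr, n_ind):
        p = n / total
        if p > 0:
            h -= p * math.log2(p)     # conditional entropy in bits
    polarity = 1.0 if n_adr >= n_ind else -1.0
    return polarity * (1.0 - h)
```

A pattern seen only with ADR scores +1, only with IND scores -1, and a 50/50 pattern scores 0, matching the middle line of Figure 5.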
From Figure 5, a pattern located far from the middle line (score 0) and close to the top left or top right corner exhibits high semantic discrimination ability for its relation label. For example, the key phrasal patterns "be-hold-in," "contribute-to," "be-think," and "improve-with" are strongly relevant to the ADR label, while "be-add-for," "be-initial-for," and "be-on" are rather associated with IND. In contrast, the key phrasal patterns "be" and "be-with" appear near the middle line, indicating the fuzziest patterns. Additionally, the figure clearly illustrates that the patterns relevant to ADR are more effective than the patterns relevant to IND: only a small number of ADR patterns are located near the origin, and most ADR patterns are widely spread, whereas the patterns relevant to IND are densely clustered near zero score and zero frequency. Table 4 presents examples of sentences relevant to the key phrasal patterns and pattern directions. Finally, the key phrasal patterns with a pattern score above the threshold are selected for further processing.

Evaluation on the Effectiveness of the Pattern-Based
Feature. To examine the effectiveness of the pattern-based feature, we compare multiple feature types across varying initial weightings of relation labels for unlabeled data incorporation throughout MIL-iEM. We divide the experiments into two parts based on the decision method in the EM algorithm. The former refers to soft decision making (MIL-iEM-S), in which the predicted result is directly yielded by the estimated class probability.

Figure 5: The x-axis exhibits the pattern score with polarity, where score > 0 represents the distribution of patterns relevant to ADR (blue circle marker), score < 0 represents the distribution of patterns relevant to IND (orange square marker), and score = 0 indicates no relevance between the pattern and either label. The y-axis is the frequency of patterns appearing in the clinical texts.
The latter is so-called hard decision making (MIL-iEM-H), in which the predicted outcome is obtained by applying a cutoff value to the probability and assigning a class label instead of a likelihood ratio. We first perform the experimental setting with the traditional independence assumption through the MIL-iEM model. Table 5 reports an assessment of five text transformations across three alternative document representations and three initial weightings of unlabeled data D_U, based on both soft and hard decision making. In the table, the pattern-based features appear in the top four rows of each experimental setting, that is, S-P, S-W, L-P, and L-W. From the experimental results, we found that the pattern-based features outperform the traditional bag-of-words (BOW). The highest F1 score, 0.841, is achieved by the MIL-iEM-SP-TF-S-T_p^0.5 model, which outperforms the baseline MIL-iEM-BOW-TF-S-T_p^0.5 by up to 4.4%. In addition, the B and TF document representations perform slightly better than TFIDF for all types of initial weighting. Similar results are found for the hard decision making approach: the pattern-based features outperform the BOW feature. The MIL-iEM-LW-TFIDF-H-T_p^0.5 model obtains the highest performance, an F1 score of 0.807, a 3.3% improvement over the MIL-iEM-BOW-TFIDF-H-T_p^0.5 baseline model. However, we notice that hard decision making yields poorer performance than the soft version.
The performance comparison across the number of features is exhibited in Figure 6. The dimensionality of the pattern-based features ranges from 737 to 1322, while the BOW feature has 1853 dimensions. From the graph, even though our proposed pattern-based features with MIL-iEM-T_p^0.5 and MIL-iEM-T_p^random provide F1 scores only slightly different from the BOW feature, their dimensionality is less than half of BOW's, especially for the S-W and L-W features. Therefore, our proposed pattern-based feature is more efficient than the BOW feature: it uses a small number of features yet yields similar model performance.
Accordingly, the experimental results support that simplifying a sentence into a relation tuple of a drug, a key phrasal pattern, and an event is an effective feature transformation for the relation classification task. Moreover, ignoring insignificant contexts reduces feature redundancy and avoids the computational cost frequently caused by the curse of dimensionality.

Evaluation on the Effectiveness of MIL-dEM-SL and
MIL-dEM-T. In this experiment, we compare our proposed method based on SL (MIL-dEM-SL) and on transductive learning (MIL-dEM-T) across varying parameters such as feature types, pattern-weighting models, and initial weighting methods for unlabeled data incorporation. Our proposed method is based on a dependency representation of texts, and the posterior estimation is based on interpolation under the Markov property. The experiment is set up with one supervised learning-based model and three transductive learning-based models with different initial weighting methods for D_U incorporation. Two types of pattern-based features, surface lexicon-based (S-P) and syntactically lemmatized lexicon-based (L-P), are used for examination. Parameter tuning is also performed for all approaches.
As shown in Table 6, among the transductive learning models, the performance of the S-P feature differs only slightly from the L-P feature for all models. The simple binary (B) weighting model presents a higher F1 score than TF and TFIDF. Moreover, the MIL-dEM-S-T_p^ML model exhibits higher performance than the fuzzier guidance of the MIL-dEM-S-T_p^0.5 and MIL-dEM-S-T_p^random models on all evaluation metrics.
On the other hand, the F1 score of MIL-dEM-SP-S-SL (the surface lexicon-based feature) is better than MIL-dEM-LP-S-SL (the syntactically lemmatized lexicon-based feature) by 1% and 0.8% for the TF- and TFIDF-weighting models, respectively.
Similarly, the F1 scores of the pattern-based feature S-P across the three pattern-weighting models, that is, B, TF, and TFIDF, are also only slightly different: 0.928 for MIL-dEM-SP-B-S-SL, 0.946 for MIL-dEM-SP-TF-S-SL, and 0.938 for MIL-dEM-SP-TFIDF-S-SL. Among models within the MIL-dEM-S-SL setting, the highest F1 score, 0.946, is achieved by the TF-weighting model.
One of the interesting results is that incorporating unlabeled data increases model performance. The highest effectiveness, an F1 score of 0.954, is achieved by the MIL-dEM-SP-B-S-T_p^ML model with the simple binary weighting model, showing 2.6%, 1.6%, and 0.8% improvement over MIL-dEM-SP-B-S-SL, MIL-dEM-SP-TFIDF-S-SL, and MIL-dEM-SP-TF-S-SL, respectively, the best performers of our proposed supervised learning.
According to the parameter optimization of our proposed method, model performance is strongly tied to the following dependency representations of random variables: (i) an event and the clinical outcome, and (ii) a pattern, a drug, and the clinical outcome. In contrast, the model shows less relevance between a drug and an event or between a pattern and an event.

Evaluation on Overall Performance with Advanced
Machine Learning Methods. The comparison of our proposed method with advanced machine learning methods is presented in Table 7. The best model of each set is used for assessment. The well-known MIL methods, that is, MISVM, MINB, and MILR, are executed using WEKA. In addition, we customize the original TSVM using the author's source code and incorporate the MIL assumption as discussed previously (see Section 2.2). We divide the discussion into three parts: the effectiveness of the supervised learning models, the effectiveness of the transductive learning models, and the overall performance.
Firstly, the experimental results among the baseline supervised learning methods, that is, MISVM-TFIDF, MINB-B, and MILR-B, show that the BOW feature works well for all MIL methods; conversely, the pattern-based feature S-P contributes a dramatic improvement when combined with our proposed method MIL-dEM-TF-S-SL. The TFIDF-weighting model yields high performance for MISVM with an F1 score of 0.901, while the binary weighting model (B) improves the performance of MINB and MILR with F1 scores of 0.880 and 0.861, respectively. However, our proposed MIL-dEM-TF-S-SL with the S-P feature outperforms all MIL methods, with an F1 score 4.5% better than the best advanced machine learning result, achieved by MISVM-TFIDF with the BOW feature. The precision of MIL-dEM-TF-S-SL with the S-P feature is slightly lower than MISVM-TFIDF with BOW, but the recall is significantly improved. Accordingly, our proposed method contributes to reducing the type II error, which is always a concern in the medical domain. Secondly, comparing the transductive learning methods, the BOW feature with TSVM-B achieves an F1 score of 0.889, while applying the pattern-based feature S-P degrades its performance by around 2%. Conversely, the pattern-based feature S-P with the MIL generative method enhances the effectiveness of the models. The accuracy of the MIL-iEM-TF-S-T_p^0.5 model increases by up to 6.3% when the pattern-based feature is deployed instead of the BOW feature.
Lastly, in the overall evaluation, the generative models with the dependency representation, that is, MIL-dEM-TF-S-SL and MIL-dEM-B-S-T_p^ML, outperform all other models. The highest performance is exhibited by our transductive learning MIL-dEM-B-S-T_p^ML method with 0.934 precision, 0.975 recall, 0.954 F1 score, and 0.949 accuracy. Moreover, substituting the word-independence assumption of the MIL-iEM-TF-S-T_p^0.5 model with the word-dependence assumption of the MIL-dEM-B-S-T_p^ML model dramatically improves the F1 score by 11.3% and the accuracy by 12.2%.
From assessments across multiple aspects, the experimental results support that our proposed method, MIL with the two generative models, has a comparative performance advantage for the relation classification task. The proposed pattern-based feature reduces the curse of dimensionality and preserves the dependency structure of the text. Incorporating a generative model with a proper model assumption and transductive learning can effectively estimate the distribution of patterns relevant to harmful or beneficial events of drug usage with high precision and recall. Our proposed method can also provide supporting evidence based on the relevant clinical sentence rather than only a predicted result, which is expected to further assist medical professionals in decision making during treatment or diagnosis.

Conclusion
This paper presents a framework of distant supervision with MIL and transductive learning for detecting adverse reactions hidden in clinical texts. Our work aims to address two main difficulties: (i) the limited availability of hand-labeled data and (ii) the intractable processing of large-scale unstructured clinical texts.
The first issue is addressed with the distant supervision paradigm through knowledge base incorporation. We can therefore automatically assign either an ADR or IND label to each drug-event pair and use the pairs as labeled examples. For the second issue, we propose the pattern-based feature to capture the semantics of a sentence and propose alternative parameter learning for a generative model using the dependency representation as the model assumption. However, the training data derived by distant supervision is formed at the instance level, while the predictive goal is at the entity level; therefore, the MIL paradigm is incorporated into the framework. The statistics collected from the tagged drug-event pairs are used to examine the semantic distributions relevant to ADR and IND. Exploiting the EM algorithm as the base model for our supervised and transductive learning helps estimate the probability of an unknown relation for a given drug-event pair and then classify this relation as either ADR or IND. From the experimental results over multiple assessments, we found three significant findings.
Firstly, the pattern-based feature improves the performance of generative models. The MIL-iEM-SP-TF-S-T_p^0.5 model achieves the highest performance among all MIL-iEM-based methods with 0.844 precision, 0.838 recall, and 0.841 F1 score, and the model provides an outstanding improvement over the traditional BOW method, the MIL-iEM-BOW-TF-S-T_p^0.5 model, of up to 4.4% F1 score.

Table 7: The comparison of overall performance among MIL-dEM-SL, MIL-dEM-T, advanced machine learning methods, and MIL-iEM-T using fivefold cross-validation.
Secondly, the traditional assumption of word independence is rather improper for natural clinical texts. We tackle this fundamental problem by integrating the Markov assumption on the dependency representation of texts in order to estimate the prior and likelihood probabilities in a generative model. Given the same set of pattern-based input features, the performance of the MIL-dEM model is dramatically improved over the MIL-iEM model. The MIL-dEM-SP-B-S-T_p^ML model improves over MIL-iEM-SP-B-S-T_p^0.5 by up to 8.9% precision, 13.9% recall, and 11.4% F1 score.
Lastly, incorporating the unlabeled data D_U with the labeled data D_L using the MIL-dEM-SP-B-S-T_p^ML model achieves the highest effectiveness with a 0.954 F1 score. In addition, our proposed MIL-dEM-SP-B-S-T_p^ML model also outperforms the advanced machine learning methods, with F1 score improvements of up to 5.3% over MISVM-BOW-TFIDF, 7.4% over MINB-BOW-B, 9.3% over MILR-BOW-B, 6.5% over TSVM-BOW-B, and 11.3% over MIL-iEM-SP-TF-S-T_p^0.5. However, our work has some limitations that point toward further improvements of the framework. The projection from distant supervision to the corpus is currently performed with the MetaMap tool and could be improved by advanced methods such as word embeddings to yield more high-quality entity-level relations for instance examples. The key phrasal pattern extraction in the current work is scoped to the sentence boundary, but a drug and an event may be associated across different sentences; this issue raises the challenge of coreference resolution. Even though the discovered key phrasal patterns play a significant role in relation classification, the number of patterns is rather limited and may encounter the out-of-vocabulary (OOV) problem when the framework is applied to large amounts of unseen data. Therefore, semantic representations are a promising way to increase the number of key phrasal patterns.

Conflicts of Interest
The authors declare that they have no competing interests.