Drug Disease Relation Extraction from Biomedical Literature Using NLP and Machine Learning

Extracting the relations between medical concepts is very valuable in the medical domain. Scientists need to extract relevant information and semantic relations between medical concepts, including protein and protein, gene and protein, drug and drug, and drug and disease. These relations can be extracted from biomedical literature available on various databases. This study examines the extraction of semantic relations that can occur between diseases and drugs. Findings will help specialists make good decisions when administering a medication to a patient and will allow them to continuously be up to date in their field. The objective of this work is to identify different features related to drugs and diseases from medical texts by applying Natural Language Processing (NLP) techniques and UMLS ontology. The Support Vector Machine classifier uses these features to extract valuable semantic relationships among text entities. The contributing factor of this research is the combination of the strength of a suggested NLP technique, which takes advantage of UMLS ontology and enables the extraction of correct and adequate features (frequency features, lexical features, morphological features, syntactic features, and semantic features), and Support Vector Machines with polynomial kernel function. These features are manipulated to pinpoint the relations between drug and disease. The proposed approach was evaluated using a standard corpus extracted from MEDLINE. The finding considerably improves the performance and outperforms similar works, especially the F-score for the most important relation “cure,” which is equal to 98.19%. The accuracy percentage is better than those in all the existing works for all the relations.


Introduction
Biomedical information is abundantly available in journal articles and research studies in various databases, such as MEDLINE, PubMed, and Medscape. Scientists need to automatically extract relevant information, for instance, semantic relations between medical entities, from these databases. For example, scientists need to know which drug cures a given disease or which diseases are the side effects of a given drug. These relations can help specialists update their knowledge and improve their expertise in their field. These relations can be discovered from a variety of texts in biomedical literature.
Various methods have been applied to extract relations from the biomedical literature [1][2][3][4][5]. The relationship extraction studies have focused on specific types of relations, including interactions between protein and gene, protein and protein [6], drug and disease, and drug and drug [7]. Therefore, the objective of this study is to contribute to a better understanding of drug-disease relation.
This paper aims to explore the extraction of drug-disease relation from biomedical texts. The paper proposes a semantic relation extraction approach between biomedical entities (drug and disease) which exploits the specific features of these entities, which can be discovered by using a suggested NLP technique and UMLS ontology. These extracted features will form the input to the Support Vector Machine (SVM) classifier for the classification of relations between these entities.

Extraction of Relations between Medical Concepts.
Many different biomedical text relation extraction strategies have been proposed to discover relationships, including protein and protein, gene and gene, gene and protein, gene and disease, gene and drug, and drug and drug. The works about protein-protein relation extraction are generally based on the identification of protein features (lexical features) rather than similarity methods [8][9][10] or classification methods [11], which are applied to discover the interaction between pairs of proteins.
For gene-gene relation extraction, the researchers focused on the use of ontologies, such as Gene Ontology [12] or statistical models [13,14].
To identify gene-protein relations, various works have proposed the use of machine learning and NLP techniques [15,16,17].
To discover gene-disease relations, classification models that support these relationships were built [18]. In other works, NLP tools and ontologies were exploited [19][20][21].
For gene-drug relation extraction, various works recommended text mining approaches supported by classification models [22].
To discover drug-side effect relation, dictionaries and ontologies were built from the Unified Medical Language System (UMLS) Metathesaurus [29].

Drug-Disease Relation Extraction.
Discovering the relationship between drugs and diseases plays a crucial role in medical domain development. The huge medical literature sources allowed the automatic identification of significant relations hidden in free text. Various computational methods have been proposed to discover the relations between drugs and diseases.
Rosario and Hearst [30] proposed a method that distinguishes seven relations between two semantic entities, "treatment" and "disease." Five graphical models and a neural network have been presented. Seven relations were detected, but only three relations, namely, cure, prevent, and side effect, were represented with accuracy levels of 92.6, 38.5, and 20, respectively. Abacha and Zweigenbaum [31] suggested a hybrid approach associating a pattern-based method and a statistical-based learning method (linear SVM) to extract two relations between a disease and a treatment. F-scores were given as the effectiveness measure, and they are 95 and 15.15 for cure and prevent, respectively.
Frunza et al. [32,33] have applied a machine learning technique to extract diseases and treatments from medical papers. Six classification algorithms were used, including probabilistic models, adaptive learning models, decision-based models, and linear classifiers like SVM. Three data representation techniques were adopted to extract treatment relations as follows: Bag-of-Word, NLP, and medical concepts. The effectiveness measures of the three detected relations, namely, cure, prevent, and side effect, are 93.6, 76.5, and 50, respectively. Suchitra and Sudah [34] used NLP and machine learning techniques to extract relations between drugs and treatments. Rule-based approaches, statistical models, and logic techniques were used for co-occurrence analysis. A Bloom filter was applied to remove unwanted data. Naive Bayes, SVM, inductive logic techniques, and statistical models were used. The obtained results had an overall F-score of 90.3 and an overall accuracy of 90 for the three extracted relations, namely, cure, prevent, and side effect.
Muzaffar et al. [35] used the Unified Medical Language System and ranking algorithms to rank verb phrases. The relations between drugs and treatments were classified using SVM and Naive Bayes techniques. Three relations were detected, namely, cure, prevent, and side effect. The F-scores were 98.05, 93.55, and 88.89 for cure, prevent, and side effect, respectively. The accuracies were 96.1, 97.4, and 96.4 for cure, prevent, and side effect relations, respectively.
Wang et al. [36] suggested a pattern-based relationship extraction method to extract two types of relations between drugs and diseases, namely, treatment (a drug treats/cures a disease) and inducement (the side effect of a drug). They created a drug and disease lexicon from the UMLS and used drug-disease pair seeds for the pattern-based method to extract the relations between drugs and diseases. The reported results showed an F-score of 90.49 for the cure relation and an F-score of 87.56 for the side effect relation.
Some researchers proposed a relation extraction between three concepts, namely, drug, disease, and protein [37] or drug, disease, and gene [38]. Other researchers have focused on a particular disease or a particular drug when looking for relations, for example, the extraction of treatments for psoriasis [39], the association between diabetes and the treatments for diabetes [40], and the effect of estrogen replacement therapy on Alzheimer's disease and Parkinson's disease [41]. Table 1 shows a comparison between the most important works in the field of relation extraction between drugs and diseases. The existing works based on drug-disease relationship did not take into account many important features about drugs and diseases. These features (frequency features, lexical features, morphological features, syntactic features, and semantic features) can be very useful for the detection of good and valuable relations.
To overcome this issue, we proposed a novel methodology that discovers the drug-disease association based on 2 Mobile Information Systems Natural Language Processing strategy with the help of the UMLS ontology and a machine learning technique, such as the SVM model, for automatic relations extraction from biomedical texts.

The Proposed Approach
The methodology adopted in this study was developed from studies and concerns related to relation extraction from medical literature, text mining, and machine learning. The proposed approach entailed three main components, namely, preprocessing, feature extraction, and relation extraction. The first component started with free-text sentences, performed a preprocessing task, and output a set of annotated words. The second component identified various features of the sentences, which later helped the relation extraction. Thirdly, the output of the previous component was fed into a machine learning component, thereby completing the identification of associations between drug and disease entities. The architecture of the proposed approach, named "DDRel," is shown in Figure 1. The steps outlined in Figure 1 are discussed in detail in the following subsections.

Preprocessing.
Preprocessing, the first step of the approach, was based on Natural Language Processing (NLP) techniques. It eliminated noisy data and output all words in medical texts related to the biomedical concepts (treatments and diseases). It included four major stages, namely, (i) sentence splitting, (ii) tokenization, (iii) part-of-speech tagging, and (iv) semantic annotation.

Sentence Splitting.
This step divided texts into smaller units, and an identifier was assigned to each unit. Texts were segmented into sentences using the punctuation markers ".", "?", and "!". In this step, the ANNIE English Sentence Splitter was used as a cascade of finite-state transducers to split the text into sentences, as shown in Figures 2(a) and 2(b).
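The splitting rule described above can be illustrated with a minimal sketch. It is not the ANNIE splitter (which is a GATE component built on finite-state transducers); it is a plain regular-expression stand-in that assumes a sentence ends at ".", "?", or "!" followed by whitespace. The function name `split_sentences` and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Split text after '.', '?' or '!' followed by whitespace.

    Returns (identifier, sentence) pairs with 1-based identifiers.
    This is a simplification: abbreviations such as "e.g." would be
    split incorrectly, which a real splitter handles with extra rules.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    non_empty = (p for p in parts if p)
    return [(i, s) for i, s in enumerate(non_empty, start=1)]


# Made-up sample text, for illustration only.
sample = "Aspirin relieves headache. Does it prevent stroke? Further trials are needed!"
for sentence_id, sentence in split_sentences(sample):
    print(sentence_id, sentence)
```

Each returned pair carries the identifier that later stages can attach to tokens and features.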

Tokenization.
After the sentence splitting, each sentence was segmented into tokens. Tokenization is the segmentation of sentences into a sequence of words using nonalphabetic characters, such as a line break, space, or punctuation characters. The result of tokenization was presented as an XML file that gathers tokens associated with the following: (i) the sentence identifier (id-sentence); (ii) the token identifier (id); (iii) the token length (length); (iv) the token orthography (orth); (v) the token kind (kind); and (vi) the token (string). The display of the XML file for the user is presented in Figure 3(a).
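As a rough illustration of the token record listed above, the following sketch tokenizes one sentence and emits an XML element per token. The attribute names follow the six fields in the text; the element names (`tokens`, `token`), the orthography categories, and the tokenization regex are assumptions for this sketch and are not taken from the actual system output in Figure 3(a).

```python
import re
import xml.etree.ElementTree as ET


def orth_of(tok):
    """Coarse orthography category for a word token (assumed categories)."""
    if tok.isupper():
        return "allCaps"
    if tok.islower():
        return "lowercase"
    if tok[0].isupper() and tok[1:].islower():
        return "upperInitial"
    return "mixedCaps"


def tokenize(sentence_id, sentence):
    """Return an XML tree with one <token> element per token."""
    root = ET.Element("tokens")
    # Runs of letters, runs of digits, or any single other non-space character.
    pattern = r"[A-Za-z]+|\d+|[^\sA-Za-z\d]"
    for tid, match in enumerate(re.finditer(pattern, sentence), start=1):
        tok = match.group()
        if tok.isalpha():
            kind = "word"
        elif tok.isdigit():
            kind = "number"
        else:
            kind = "punctuation"
        attrs = {
            "id-sentence": str(sentence_id),
            "id": str(tid),
            "length": str(len(tok)),
            "kind": kind,
        }
        if kind == "word":
            attrs["orth"] = orth_of(tok)
        element = ET.SubElement(root, "token", attrs)
        element.text = tok  # the token string itself
    return root


root = tokenize(1, "Interferons beta may induce regression.")
print(ET.tostring(root, encoding="unicode"))
```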

Part-of-Speech Tagging.
Part-of-Speech (POS) tagging is the method of labeling words in a text according to their grammatical function, definition, and context, such as noun (NN), verb (VB), adjective (JJ), conjunction (CC), and proper noun (NNP). The algorithm of the ANNIE POS Tagger has been implemented. The output is an XML file in which each word is associated with its grammatical function. The display of the XML file for the user is presented in Figure 3(b).

Semantic Annotation.
This step involved the extraction of named entities of drugs and diseases. It was difficult to extract drugs and diseases for many reasons. Each medical concept can be identified by several synonyms, different terms, and abbreviations. Moreover, simple dictionaries cannot be used for new drugs and diseases in our context. The Meta-Map system was configured to detect the concepts of the UMLS Metathesaurus hidden in the biomedical texts. The UMLS is a medical ontology that originated from the National Library of Medicine. The output of this step was the identification of concepts as Concept Id, Concept name, Preferred Name, and Semantic Type. The most important information extracted from this step was the Semantic Type, which was defined in UMLS. This significant knowledge will help determine the nature of concepts of drugs or diseases. Figure 4 shows the results of the semantic annotation.
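To make the role of the Semantic Type concrete, the sketch below maps UMLS semantic-type abbreviations to the two entity labels used later in this paper (DIS and TREAT). The abbreviations are standard UMLS Semantic Network codes, but which types the authors actually mapped to each label is not stated in the text, so the mapping table is an assumption. The concept records are hand-written stand-ins, not real Meta-Map output, and the field names are hypothetical.

```python
# Assumed mapping from UMLS semantic-type abbreviations to entity labels.
SEMTYPE_TO_LABEL = {
    "dsyn": "DIS",    # Disease or Syndrome
    "neop": "DIS",    # Neoplastic Process
    "phsu": "TREAT",  # Pharmacologic Substance
    "topp": "TREAT",  # Therapeutic or Preventive Procedure
}


def label_concepts(concepts):
    """Keep only concepts whose semantic type maps to DIS or TREAT."""
    labelled = []
    for concept in concepts:
        label = SEMTYPE_TO_LABEL.get(concept["semantic_type"])
        if label is not None:
            labelled.append({**concept, "label": label})
    return labelled


# Hand-written illustrative records (hypothetical, not annotator output).
concepts = [
    {"concept_name": "evidence", "semantic_type": "inpr"},
    {"concept_name": "interferons beta", "semantic_type": "phsu"},
    {"concept_name": "metastatic renal cell carcinoma", "semantic_type": "neop"},
]
for concept in label_concepts(concepts):
    print(concept["label"], concept["concept_name"])
```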

Feature Extraction.
The feature extraction was the second step of the proposed approach. It sets features as combinations of some characteristics and is inspired by Rosario and Hearst [30] in relation to the semantic type. The features for each word in a sentence were as follows: the semantic types, such as Word, Part of Speech (POS), and Phrase Constituent, belong to the same chunk as in the previous work; the MeSH mapping of the words; Domain Knowledge; and morphological features. In this work, unlike that of Rosario and Hearst [30], the features were built for each sentence instead of each token.
Moreover, new kinds of features were created, and these were assumed to be more suitable for extracting drug-disease relations.
In this work, new features were proposed to extract drug-disease relations, including the following: (i) frequency features, (ii) lexical features, (iii) morphological features, (iv) syntactic features, and (v) semantic features.

Frequency Features.
The frequency features represented the following: (i) order of words present in each named entity (NE); (ii) order of words present between every two NEs; (iii) sequence of "n" words preceding every NE; and (iv) sequence of "n" words after every NE.

Morphological Features.
In this step, morphological features were extracted and included the following: (i) lemmas order of the words among every two NEs; (ii) lemmas order of the "n" words preceding every NE; and (iii) lemmas order of the "n" words after every NE.

Syntactic Features.
These features concern the POS of each NE and include the following: (i) POS order of words among every two NEs; (ii) POS order of "n" words preceding each NE; (iii) POS order of "n" words after every NE; (iv) verb sequence among every two NEs; (v) first verb preceding every NE; and (vi) first verb after every NE.

Semantic Features.
The purpose of this step is to extract the combination of words in the sentence. The values of these semantic types are DIS (DISease) and TREAT (TREATment).
Example of Feature Extraction. Consider the following sentence: "Preliminary evidence suggests that interferons beta may also induce regression of metastatic renal cell carcinoma." The output of the feature extraction step from this sentence is provided in detail in Table 2.
The result of feature extraction is displayed for the user in Figure 5.
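The window-based features described in the preceding subsections (words inside each NE, words between two NEs, and the "n" words before and after each NE) can be sketched as follows for the example sentence. The entity spans and their DIS/TREAT labels are assumed here for illustration and are not copied from Table 2; the function name `window_features`, the feature keys, and the choice n = 2 are also hypothetical.

```python
def window_features(tokens, entities, n=2):
    """Collect word-window features around named entities.

    tokens:   list of token strings for one sentence
    entities: list of (start, end, label) token spans, end exclusive,
              sorted by start position
    n:        size of the window before and after each entity
    """
    feats = {}
    for i, (start, end, label) in enumerate(entities):
        feats[f"NE{i}_label"] = label
        feats[f"NE{i}_words"] = tokens[start:end]
        feats[f"NE{i}_before"] = tokens[max(0, start - n):start]
        feats[f"NE{i}_after"] = tokens[end:end + n]
    # Words lying between each pair of consecutive entities.
    for i in range(len(entities) - 1):
        left_end = entities[i][1]
        right_start = entities[i + 1][0]
        feats[f"between_{i}_{i + 1}"] = tokens[left_end:right_start]
    return feats


sentence = ("Preliminary evidence suggests that interferons beta may also "
            "induce regression of metastatic renal cell carcinoma")
tokens = sentence.split()
# Assumed spans: "interferons beta" as TREAT, the carcinoma phrase as DIS.
entities = [(4, 6, "TREAT"), (11, 15, "DIS")]
features = window_features(tokens, entities, n=2)
for name, value in features.items():
    print(name, value)
```

Applying the same function to a parallel list of lemmas or POS tags, instead of surface words, would give the morphological and syntactic variants of these windows.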

Relation Extraction.
The relationships between drug and disease were extracted using a machine learning classifier. The relation extraction process is based on a classification process, which proceeded according to relation classes, as follows: CURE, PREVENT, SIDE EFFECT, NO CURE, and OTHER RELATION. This classification helped extract relations between entities and was performed by exploiting the extracted features and outputs of the previous step. Traditional machine learning classification techniques performed poorly when the data to be classified were immense. Therefore, this approach used an SVM, which scaled up relatively well to high-dimensional data [3].
SVM is a well-known supervised learning algorithm. The input of this algorithm is a set of features detected from the previous step. These features are used by a machine learning method to find a hyperplane that separates the feature space into classes with a maximum margin. When maximizing the margin, the SVM algorithm attempts to achieve maximum separation between classes and thereby minimize misclassification errors.
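For reference, the maximum-margin idea described above is usually written as the following soft-margin optimization problem for two classes. This is the standard textbook formulation, not a formula taken from this paper; all symbols ($w$, $b$, $\xi_i$, $C$, $\phi$, $x_i$, $y_i$) are introduced here as assumptions, and the paper does not state how the binary formulation was extended to its five relation classes.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\qquad\text{subject to}\qquad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m
```

Here $x_i$ is the feature vector of sentence $i$, $y_i \in \{-1, +1\}$ is its class, and $\phi$ is the feature map induced by the kernel. The margin between the two classes is $2/\lVert w\rVert$, so minimizing $\lVert w\rVert^{2}$ maximizes the margin, while the slack variables $\xi_i$ weighted by $C$ penalize misclassified or margin-violating examples.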
In this paper, a supervised SVM classifier was used to classify the drug-disease relations from biomedical databases. The objective of SVM was to discriminate between classes of relations. SVM was used with a polynomial kernel function, because this type of SVM is very well suited to our context. The first step of relation extraction was to provide the classifier with a training set. The training set is composed of feature vectors. It was labeled data assigning a relation class to each sentence as follows: CURE, PREVENT, SIDE EFFECT, NO CURE, and OTHER RELATION. The training set is used by SVM to build a model that predicts the target relation class. The second step was the prediction. To predict the relation class for each sentence in the data file, SVM applies the model to the feature vectors, already created in the preprocessing step and semantic annotation. These vectors gather all the features related to each sentence in this data file (one vector for each sentence). Figure 6 shows the results of the relation extraction. By clicking on drug-disease relations extraction, the list of drugs and a list of relations are displayed. Alternatively, when choosing a drug and a type of relation (prevent, cure, etc.), the diseases that have such a relationship with this drug are displayed.
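A polynomial kernel of the kind mentioned above can be sketched in a few lines. The paper does not report the degree or other kernel parameters, so `degree`, `gamma`, and `coef0` below are assumed values chosen for illustration. The sketch also checks, for degree 2 and two-dimensional inputs, that the kernel value equals an inner product in an explicit higher-dimensional feature space, which is the property that lets an SVM separate classes that are not linearly separable in the original features.

```python
import math


def poly_kernel(x, z, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, z) = (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree


def phi_degree2(x):
    """Explicit feature map for 2-D input matching poly_kernel with
    degree=2, gamma=1, coef0=1."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]


# Arbitrary illustrative vectors (not real feature vectors from the corpus).
x = [1.0, 2.0]
z = [3.0, 0.5]
kernel_value = poly_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(z)))
print(kernel_value, explicit_value)
```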

Experiment Setup.
To validate the proposed approach, a system was implemented. Screenshots are presented in Figures 2-6. For the experiments, we used the standard corpus obtained from MEDLINE 2001.
This corpus was annotated with types of semantic relationships between a treatment (TREAT) and a disease (DIS). These relationships were CURE, PREVENT, SIDE EFFECT, and NO CURE. This corpus was validated using the MEDLINE 2001 Database of biomedical papers [30]. The corpus was used to guarantee the validity of the comparison of the results.

Results.
For the evaluation, performance measures were deduced from a confusion matrix (Table 3), whose cells fall into the following categories: False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN). Each row of the matrix recorded the instances in an actual class, and each column recorded the instances in a predicted class. The confusion matrix for the implemented system is for multiclass classification, as shown in Table 4.
For the class CURE, 785 TP instances exist, because they are CURE instances and are predicted as CURE. The number of FN instances is 25 = 10 + 5 + 10, because they belong to the class CURE, but they are not predicted as such.
The number of FP instances is 4 = 2 + 1 + 1, because they are predicted as CURE, but they are not. The number of TN instances is 82 = 57 + 25 + 0, because they are not predicted as CURE, and they are not.
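As a worked check of the counts above, the sketch below computes precision, recall, and F-score for the class CURE from the TP, FP, and FN values given in the text, using the standard definitions of these measures. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from per-class counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts for the class CURE as stated in the text.
p, r, f = precision_recall_f1(tp=785, fp=4, fn=25)
print(f"precision={p:.2%} recall={r:.2%} f1={f:.2%}")
# prints: precision=99.49% recall=96.91% f1=98.19%
```

The recall and F-score obtained this way agree with the 96.91% recall and 98.19% F-score reported for the CURE relation elsewhere in this paper.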
For recall and precision, only the results of Abacha and Zweigenbaum [31] and Wang et al. [36] were available. The recall and the precision of the class NO CURE for Abacha and Zweigenbaum [31] were not available (NA). Also, the recall and the precision of the classes PREVENT and NO CURE for Wang et al. [36] were not available, because this work was interested only in two relations, namely, CURE and SIDE EFFECT.
The recall in Table 5 shows that Abacha and Zweigenbaum [31] had a better recall (100%) compared with Wang et al. [36] (89.8%) and with the proposed approach (96.91%) for the extraction of the CURE relation. For the rest of the relations, the proposed approach performed better. Also, for the precision measures in Table 6, the proposed approach performed better than those in the works of Abacha and Zweigenbaum [31] and Wang et al. [36]. The F-score measure was not reported in the work of Rosario and Hearst [30]. Also, the F-score measure of the class PREVENT was not available for Wang et al. [36], because this work was interested only in two relations. Moreover, the F-score measure of the class SIDE EFFECT was not available for Abacha and Zweigenbaum [31]. The accuracy measure was not available for the works of Abacha and Zweigenbaum [31], Frunza et al. [33], and Wang et al. [36]. Nevertheless, the results demonstrated in Table 8 show that the proposed approach achieved a higher accuracy compared with all similar works for all the relations. The accuracy of NO CURE was not reported in any work except in the proposed approach. Table 9 represents the specificity measure of the implemented system. This measure is not available in the other works.
The results computed in this study were promising and showed that the combination of the used techniques outperforms the majority of the previous approaches using the same corpus. The possible reasons for this aspect are the appropriate mixture of the suggested NLP technique and UMLS ontology in the detection of relevant features (frequency features, lexical features, morphological features, syntactic features, and semantic features) for drug and disease and machine learning methods (SVM). The proposed approach seems to be suitable when dealing with semantic relations in natural language texts. The novel idea presented in the study is the integration of a novel NLP approach reinforced by the UMLS ontology and a machine learning method that performed better in a multidimensional context.

Conclusion and Future Work
We proposed a novel computational approach for relation extraction between drugs and diseases from biomedical literature. This study significantly contributed to the existing literature on relation extraction between drugs and diseases from the medical literature. The main contribution of this work is the identification of specific features (lexical, semantic, etc.) related to medical concepts (drug and disease).
This finding is a confirmation that, in the field of text mining, these features are relevant for the discovery of interesting relationships between concepts.
The experimental results showed an improvement in the performance compared with other similar works.
The upcoming research will focus first on further improvements of the proposed approach. More investigations on the features of medical concepts will be conducted. Then, the next direction will focus on updating the method to assist the professionals in finding relevant and authentic information in extracting semantic relations between other medical entities.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.