Relation Extraction Based on Fusion Dependency Parsing from Chinese EMRs

The Electronic Medical Record (EMR) contains a great deal of medical knowledge related to patients and has been widely used in the construction of medical knowledge graphs. Previous studies of relation extraction mainly focus on features based on the surface semantics of EMRs, such as contextual features, while the features of sentence structure in Chinese EMRs have been neglected. In this paper, a relation extraction method based on fusion dependency parsing is proposed. Specifically, this paper extends the basic features with medical record features and indicator features that are applicable to Chinese EMRs. Furthermore, dependency syntactic features are introduced to analyse the dependency structure of sentences. The F1 value of relation extraction based on extended features is 4.87% higher than that of relation extraction based on basic features. Compared with the former, the F1 value of relation extraction based on fusion dependency parsing is increased by 4.39%. The results of experiments performed on a Chinese EMR data set show that both the extended features and dependency parsing contribute to relation extraction.


Introduction
Electronic Medical Records (EMRs) contain a vast number of medical entities that provide rich medical knowledge. These entities are not isolated: there are interdependent relations between them, which reflect both medical knowledge and the way doctors judge and apply it. The relations between entities in EMRs represent the health of patients from different perspectives. Relation extraction plays a fundamental role in medical knowledge graph (MKG) construction and completion and supports many other tasks, such as question answering, semantic understanding of texts, and recommender systems.
Entity relations in EMRs mainly include the relations between treatment and disease, treatment and symptom, test and disease, test and symptom, and disease and symptom. At present, machine learning methods are widely used in the field of medical texts [1][2][3][4], including the task of relation extraction from English EMRs [5], and most of the feature selections rely on English medical dictionaries and data sets [6] as well as syntactic analysis [7]. However, research on relation extraction from Chinese EMRs is still scarce, and it is limited in two respects: it focuses on the relation between two specific entities, and it neglects the unique features of Chinese EMR texts and sentences.
To cope with the above shortcomings, we propose a fusion dependency analysis method for relation extraction from Chinese EMRs. The underlying idea is to extend the features according to the unique characteristics of Chinese EMRs, adding medical record features, indicator features, and extended context features. Considering that the entity relations in two sentences with similar structure and context are often the same and that the structural similarity of sentences in Chinese EMRs is high, the sentence structure information is fused on top of the feature extension. Among machine learning methods, some studies [8,9] have verified that SVM performs well for entity relation extraction; thus, this paper directly adopts SVM to train the model and predict.

Related Work
The concept of relation extraction was first put forward at the Message Understanding Conference (MUC), supported by the Defense Advanced Research Projects Agency (DARPA), at the end of the 1980s. After that, the Automatic Content Extraction Conference (ACE) promoted the development of relation extraction technologies. Recently, the development of knowledge graphs (KGs) has once again emphasized the importance of relation extraction.

Relation Extraction of English EMRs.
The relation extraction methods for EMRs have evolved from the early methods based on rules and dictionaries to the current classification methods based on machine learning, where entity relation refers to the relation between entity pairs appearing in a sentence. For the relation extraction of English EMRs, an SVM model [10] was utilized to identify the relationships among disease, symptom, test, and treatment. In that research, semantic lexical features, the order of entity pairs appearing in sentences, and syntactic features were added to the classifier to build an SR classifier, which can recognize 84% of the relations in the BIDMC corpus and achieves a microaveraged F-measure of 0.89. A model was described in a study [11] to identify the semantic relations among medical concepts, including problems, tests, and treatments, from medical texts and to analyse three types of relations: the relations between treatment and problem, test and problem, and problem and problem. To extract the above relations, a hybrid method was proposed based on machine learning, dictionaries, and rules [12]. For the I2B2 (Informatics for Integrating Biology and the Bedside) 2010 task (https://www.i2b2.org/NLP/Relations/), Rink [6] used GENIA15 to preprocess the medical record texts and then selected context similarity as a new feature in addition to the lexical and contextual features. The feature extraction used knowledge bases such as Wikipedia, WordNet, and General Inquirer [13]. This model also uses an SVM and achieves an F-measure of 0.74. The relations between concepts in UMLS were used as a substitute feature to solve the problem that some entities in EMRs do not have rich context features [14], and the experiments obtained an F-measure of 0.67.

Relation Extraction of Chinese Text.
At present, research on relation extraction in Chinese mainly focuses on the open domain, and the methods of relation extraction for Chinese EMRs are still at a preliminary stage. A pipeline of NLP techniques, namely word segmentation, POS tagging, and syntactic parsing, was employed to extract entity relations for the open domain [15]. This system was considered the first attempt to handle Chinese open relation extraction. In the medical field, the dependency graph was used to automatically learn the syntactic patterns of relation extraction, and the relation between disease and symptom was extracted by this model [16]. Also, a rule-based method was used to extract medical information from unstructured text data in EMRs [17]. A bootstrapping framework based on semisupervision was proposed in a study [18], which combined the TCM bibliographic literature database in China with MEDLINE (https://jgc128.github.io/mednli/) to discover knowledge of gene function, including extracting the relations between symptom and gene, symptom and disease, and disease and gene. According to the characteristics of the relations between entities in EMRs, a semisupervised learning method was used [19]: SVM was adopted as the classifier to predict the labeled samples combined with auxiliary classification information, and the classification was then repeated after adding the samples with low confidence to the training set. The results show that entity relations can be extracted effectively by classifying and calculating entity co-occurrence.

Methods
In this section, we first introduce the preprocessing method for Chinese EMR data. Second, we briefly describe the basic features for relation extraction from Chinese EMRs. Third, we extend the basic features according to the characteristics of Chinese EMR texts. Finally, by fusing sentence structure information, a method of relation extraction based on dependency parsing is proposed. The relation extraction process is shown in Figure 1.

Data Preprocessing.
The data set used in this paper for entity relation extraction comes from XML EMR texts that were initially preprocessed and from the entity and entity relation files tagged from the EMRs by a semiautomatic annotation method, which is described in Section 4.1. Among them, discharge summaries and progress notes [20] are selected as the Chinese EMR texts. The discharge summary includes the basic information of the patient at the time of admission, the diagnosis of the doctor, the tests and treatments received during hospitalization, the basic information and the doctor's advice at the time of discharge, and the final treatment results. The details of discharge summaries are shown in Figure 2. The progress note mainly records the clinical manifestations of the patient during hospitalization and the medical actions received, such as tests and treatments. The process of data preprocessing is roughly divided into three parts. First, the EMR texts are segmented into sentences by using "。", "；", and "\n" as sentence boundaries. Then, entity pairs are identified from the EMR texts. Finally, the sentence-segmented EMR texts are tagged with word segmentation and part-of-speech labels with the help of NLPIR (http://ictclas.nlpir.org/), a word segmentation tool.
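The sentence segmentation step can be sketched as follows. This is a minimal illustration that assumes the three boundary characters named above are simply dropped after splitting; the function name `split_sentences` and the sample record are hypothetical and are not taken from the data set.

```python
import re

# Sentence boundaries named in the text: "。", "；" and the newline character.
_BOUNDARY = re.compile(r"[。；\n]")

def split_sentences(text):
    """Split EMR text at the boundary characters and drop empty pieces."""
    return [piece.strip() for piece in _BOUNDARY.split(text) if piece.strip()]

# Hypothetical record, for illustration only.
record = "患者出现喘息，伴发热；予抗感染治疗。\n症状缓解。"
for sentence in split_sentences(record):
    print(sentence)
```

Note that the full-width comma "，" is not a boundary here, so clauses joined by commas stay in one sentence, which matches the long sentences discussed later in the paper.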

Relation Types.
Relation extraction is used to find the relations between entities in text, and relation extraction for EMRs mainly studies the relations between entities such as disease, symptom, test, and treatment recognized from the EMRs. These entity relations reflect the health information of patients and the medical treatment measures for patients, as well as the professional knowledge of doctors. The assessment task of I2B2 2010 was the first to systematically classify the entity relations of EMRs, including the relations between medical problem and medical problem, medical problem and test, and medical problem and treatment. According to the characteristics of Chinese EMR texts, this paper divides the medical problem in I2B2 2010 into two categories, disease and symptom, and then redefines the relations between medical entities as the relations between treatment and disease, treatment and symptom, test and disease, test and symptom, and disease and symptom. The specific definitions are shown in Table 1.

Figure 2: Details of a discharge summary (example: a patient admitted for a right cervical mass and discharged with a diagnosis of right neck sebaceous cyst).

The basic features are mainly divided into lexical feature, contextual feature, entity feature, and location feature.
(1) Lexical: this involves the two entities themselves, which play a certain role in the relation extraction between them, because even if two specific entities appear in different places, the relation between them may be the same. For instance, the relation between "感冒 (cold)" and "发烧 (fever)" in "患者因感冒而发烧 (patients have fever due to cold)" is usually "DCS (disease causes symptoms)", so this paper also takes the two entities themselves as a feature.
(2) Contextual: in Chinese EMR texts, the bag-of-words and part-of-speech tags in a certain range before and after two entities play a key role in the extraction of the relation between them. The entity relation is judged by the context information, which in this paper refers to the three words before and after each of the two entities, together with their part-of-speech tags.
(3) Entity: the entity feature refers to the type of entity, which is an extremely important feature because the entity relations in this paper are classified by the types of the two entities. Entities of the test and treatment types only have relations with entities of the disease and symptom types, and there is a relation between disease and symptom but none between test and treatment. This feature has important guiding significance for judging the boundary and the specific type of an entity relation.
(4) Relative position: the relative position of two entities, E1 and E2, has a certain indicative function for entity relation extraction in a sentence of Chinese EMR text. In most sentences of the Chinese EMR data set of this paper, disease entities and symptom entities appear in front of test entities and treatment entities, while disease entities generally appear in front of symptom entities. For example, the disease entity "胆结石 (gallstone)" is in front of the treatment entity "全胆囊切除术 (total cholecystectomy)" in the Chinese EMR text "1974年因胆结石于瑞金医院行全胆囊切除术 (in 1974, total cholecystectomy was performed in Ruijin hospital due to gallstones)," and the relation between the two entities is "TrAD (treatment applied to disease)". There are four categories of relative position in this paper: E1 is on the left of E2, E1 is on the right of E2, E1 is in E2, and E2 is in E1.
(5) Distance: the distance between two entities refers to the number of words between them. In general, the more words there are between two entities, the farther apart they are, and the less likely it is that there is a relation between them. The distance is measured as the number of words between the two entities after word segmentation, where punctuation marks count as words.
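The contextual, relative position, and distance features can be sketched on a tokenized sentence as follows. This is only an illustration under stated assumptions: entities are represented as hypothetical `(start, end)` token spans with an exclusive end, the segmentation of the example sentence is written by hand rather than produced by NLPIR, and partially overlapping spans (which the paper does not discuss) are classified by their start positions.

```python
def context_window(tokens, span, size=3):
    """Return the `size` tokens before and after an entity span (end exclusive)."""
    start, end = span
    return tokens[max(0, start - size):start], tokens[end:end + size]

def relative_position(e1, e2):
    """Return one of the four position categories for two token spans."""
    s1, t1 = e1
    s2, t2 = e2
    if s2 <= s1 and t1 <= t2:
        return "E1_IN_E2"
    if s1 <= s2 and t2 <= t1:
        return "E2_IN_E1"
    return "E1_LEFT_OF_E2" if s1 < s2 else "E1_RIGHT_OF_E2"

def token_distance(e1, e2):
    """Number of tokens (punctuation included) strictly between two spans."""
    first, second = sorted([e1, e2])
    return max(0, second[0] - first[1])

# Hand-made segmentation of the example sentence from the text (an assumption).
tokens = ["1974年", "因", "胆结石", "于", "瑞金医院", "行", "全胆囊切除术"]
disease, treatment = (2, 3), (6, 7)

print(context_window(tokens, disease))
print(relative_position(disease, treatment))
print(token_distance(disease, treatment))
```

With this segmentation the disease entity lies to the left of the treatment entity, and three tokens separate them.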

Extended Features.
In order to extract the entity relations of Chinese EMRs more accurately, this paper analyses the texts of Chinese EMRs and adds a set of extended features on top of the basic features. The extended features are mainly divided into medical record features, indicator features, and extended context features.
(1) Medical record: the chapter in which an entity is located has a certain effect on entity relation extraction from Chinese EMRs. For example, in the "出院情况 (discharge situation)" chapter of a discharge summary, the probability of a relation related to improvement is higher than that of a relation related to worsening. In addition, the modification information of an entity, which is a description of the entity, is also information unique to EMRs. To sum up, the medical record features refer to the chapters and modifications of entities.
(2) Indicator: this is the mapping of entity context words to the indicator word base for entity relations. According to the characteristics of Chinese EMRs, the judgment of an entity relation is related to the context words of the two entities, and there are some indicators that can directly classify the relation between two entities. If there are indicators such as "好转 (improved)," "有所缓解 (relieved)," "明显好转 (obviously improved)," and "控制稳定 (stable control)," the entity relation is generally "TrID" or "TrIS". If there are indicators such as "控制不佳 (poor control)," "效果一般 (general effect)," and "未见明显变化 (no obvious change)," the entity relation is generally "TrWD" or "TrWS." After analysis and statistics, the indicator word base of all entity relations is established, and the mapping of the context words of the two entities into the indicator word base is regarded as an extended feature.
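The indicator feature can be illustrated with a small lookup. The sketch below assumes a toy indicator word base that contains only the example indicators quoted above; the paper's full word base is not given, and the group names, the substring matching, and the binary encoding are assumptions made for illustration.

```python
# Toy indicator word base built only from the examples quoted in the text.
INDICATOR_WORD_BASE = {
    "TrID/TrIS": ["好转", "有所缓解", "明显好转", "控制稳定"],
    "TrWD/TrWS": ["控制不佳", "效果一般", "未见明显变化"],
}

def indicator_feature(context):
    """Return one 0/1 flag per indicator group found in the context text."""
    return [
        int(any(word in context for word in words))
        for words in INDICATOR_WORD_BASE.values()
    ]

print(indicator_feature("予抗感染，平喘治疗后明显好转"))
print(indicator_feature("血压控制不佳"))
print(indicator_feature("患者出现喘息"))
```

Each flag tells the classifier whether the context of an entity pair contains an indicator associated with that group of relation types.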

Dependency Parsing.
Most sentences in Chinese EMRs are long, and their content and form are relatively patterned; in particular, their structures are mostly similar. Therefore, it is worth adding the structure information of sentences to the task of entity relation extraction from Chinese EMRs. Dependency parsing reveals the syntactic structure of a sentence by analysing the dependencies among its components. In short, it recognizes grammatical components such as "subject predicate object" and "attributive adverbial complement" and analyses the relationships between them. Dependency grammar holds that the core verb is the dominator of a sentence [21] and that all other components depend on the core verb in one way or another. The Language Technology Platform (LTP) (http://ltp.ai/) is a complete Chinese language processing system developed by the Research Center for Social Computing and Information Retrieval of the Harbin Institute of Technology. It provides rich, efficient, and accurate natural language processing technologies, including Chinese word segmentation, part-of-speech tagging, dependency parsing, and semantic role tagging. LTP is used to analyse the dependencies of the sentence "the patient having symptoms of wheezing and fever was given antiinfection treatment and relieved after the treatment of antiasthmatic (患者出现喘息, 伴发热, 予抗感染, 平喘治疗后缓解)." The results are shown in Figure 3.
Dependency parsing is to analyse the structural information of a sentence, recognize the "subject predicate object" and "attributive adverbial complement," and analyse the relationships between the components. According to the dependency parsing of example sentences in Figure 3, the core predicate of the sentence is "出现 (has)," the dependency of entity "喘息(wheezing)" and "出现(has)" is VOB, and the dependency of the entity "抗感染(anti-infection)" is VOB as well. Table 2 shows the annotation relation obtained from dependency parsing by LTP.

Dependency Syntactic Features.
In this section, sentence structure and features are integrated to obtain dependency syntactic features that better mine the syntactic construction and semantic features, where the sentence structure is reflected in dependency parsing and in sentence similarity calculated with the edit distance algorithm. The specific dependency syntactic features are defined as follows:
(1) Sentence dependency relation of binary entities: this refers to the syntactic relation of each of the two entities in the syntactic structure of a sentence after dependency parsing. For instance, the dependency relation of the entity "喘息 (wheezing)" after parsing is VOB in the above example (Figure 3), and the dependency relation of the entity "发热 (fever)" is COO. Therefore, this paper takes the dependency relation of each entity in an entity pair as a feature.
(2) Dependency relation combination of entity pair: whereas the previous feature takes the dependency relation of each entity as a separate input, this feature refers to the ordered combination of the dependency relations of the entity pair. Because the combination is ordered, it shows the syntactic structure of the entity pair in a sentence more clearly than the independent dependency relation features. For example, the dependency relation combination of the entity pair <喘息 (wheezing), 抗感染 (anti-infection)> in the above example (Figure 3) is VOB-VOB, indicating that both entities act as objects. Different types of relations have different dependency relation combinations, so this feature can better reflect the differences in relation types between entities.
(3) The distance between a binary entity and the core predicate: after extensive study of and experiments on dependency parsing, it was found that the core predicate plays an important role in the extraction of entity boundaries and entity relations. In a sentence, the distance between an entity and the core predicate is clearly different from that between the entity and an ordinary predicate, so this paper takes the former as a feature. After the core predicate of a sentence is obtained by dependency parsing, the distance is calculated as the number of words between the entity and the core predicate.
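The three dependency syntactic features can be sketched as follows, given one dependency label per token. The parse below is written by hand for illustration and is not LTP output: only the labels of "喘息" (VOB), "发热" (COO), "抗感染" (VOB), and the core predicate "出现" come from the text, every other label is a placeholder "_", and the segmentation and the use of "HED" to mark the core predicate are assumptions.

```python
def dependency_features(labels, e1_index, e2_index, core_label="HED"):
    """Build the three dependency syntactic features for one entity pair.

    labels: one dependency label per token; core_label marks the core predicate.
    """
    core = labels.index(core_label)

    def words_between(index):
        # Number of tokens strictly between an entity and the core predicate.
        return max(0, abs(index - core) - 1)

    return {
        "e1_dep": labels[e1_index],
        "e2_dep": labels[e2_index],
        "dep_pair": labels[e1_index] + "-" + labels[e2_index],
        "e1_core_dist": words_between(e1_index),
        "e2_core_dist": words_between(e2_index),
    }

# Hand-written illustration of the first part of the example sentence.
tokens = ["患者", "出现", "喘息", "，", "伴", "发热", "，", "予", "抗感染"]
labels = ["_", "HED", "VOB", "_", "_", "COO", "_", "_", "VOB"]

features = dependency_features(labels, tokens.index("喘息"), tokens.index("抗感染"))
print(features)
```

The ordered `dep_pair` value corresponds to the combination feature in item (2), so swapping the two entities would in general produce a different value.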

SVM Model.
The objective of the support vector machine model [22] is to find a hyperplane in an N-dimensional space (N is the number of features) that distinctly classifies the data points. To separate the two classes of data points, there are many possible hyperplanes that could be chosen. In order to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes, we turn it into a convex quadratic programming problem.
Given a training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ is the $i$th feature vector and $y_i \in \{+1, -1\}$ is its class label, $i = 1, 2, \ldots, n$, the hyperplane is defined as follows:

$$w^T x + b = 0, \tag{1}$$

where $w = (w_1, w_2, \ldots, w_d)$ is the normal vector of the hyperplane and defines its direction and $b$ is the intercept that determines the distance between the hyperplane and the origin. Because the correctness of classification is judged by observing whether $w^T x + b$ and $y$ have the same sign, the margin function $\gamma'$ is defined as follows:

$$\gamma' = y\,(w^T x + b). \tag{2}$$

In order to unify the measurement, a constraint is added to the normal vector $w$ by normalizing with its norm:

$$\gamma = \frac{y\,(w^T x + b)}{\|w\|_2} = \frac{\gamma'}{\|w\|_2}. \tag{3}$$

The idea of the SVM is to maximize the margin, so that the distance from all points to the hyperplane is greater than or equal to a certain distance; all points are then classified on both sides of the support vectors, i.e.,

$$\max_{w,\,b} \; \frac{\gamma'}{\|w\|_2} \quad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge \gamma', \quad i = 1, 2, \ldots, n. \tag{4}$$

If the margin function $\gamma' = 1$, then equation (4) is reduced to

$$\max_{w,\,b} \; \frac{1}{\|w\|_2} \quad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n. \tag{5}$$

Considering that maximizing $1/\|w\|_2$ is equivalent to minimizing $\frac{1}{2}\|w\|_2^2$, the SVM model for solving the maximum-margin hyperplane problem can be expressed as the following constrained optimization problem:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i\,(w^T x_i + b) \ge 1, \quad i = 1, 2, \ldots, n. \tag{6}$$
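As a small numerical check of the quantities in this derivation, the sketch below evaluates the margin function y(w^T x + b) and its normalized form for a fixed hyperplane. The three 2-dimensional samples and the values of w and b are toy numbers chosen for illustration; they do not come from the paper, and no training is performed.

```python
import math

def functional_margin(w, b, x, y):
    """Margin function y * (w^T x + b) for one labeled sample."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    """Margin function divided by the Euclidean norm of w."""
    return functional_margin(w, b, x, y) / math.sqrt(sum(wi * wi for wi in w))

# Toy data: labels are +1 or -1; the hyperplane is x1 + x2 = 2 after scaling.
samples = [((2.0, 2.0), +1), ((0.0, 0.0), -1), ((3.0, 1.0), +1)]
w, b = (0.5, 0.5), -1.0

for x, y in samples:
    print(x, y, functional_margin(w, b, x, y), geometric_margin(w, b, x, y))

# Every sample satisfies the constraint y_i * (w^T x_i + b) >= 1.
print(all(functional_margin(w, b, x, y) >= 1 for x, y in samples))
```

Here every sample has margin function value 1, so each lies exactly on the margin, and the normalized distance to the hyperplane equals 1/||w||.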

Results
In this section, we carry out three comparative experiments based on basic features, extended features, and dependency syntactic features. The experimental results show that structural information is very important for entity relation extraction and plays an irreplaceable role in the task, especially for Chinese EMR texts.

Data Set.
We evaluate our approach to entity relation extraction on the medical data set from existing research [23]. This data set was semiautomatically annotated from the Chinese EMRs of a grade-three general hospital in Shanghai over a whole year, and the entity set was obtained through feature-enhanced entity recognition. The detailed information of the data set is shown in Table 3. We use 70% of the data set as training data and 30% for testing. Readers interested in this data set are referred to the study [23].

Baseline.
In this paper, the task of entity relation extraction is transformed into a multiclass classification problem. The machine learning tool LibSVM [24] automatically builds multiple binary classifiers according to the number of categories, which can be directly used for multiclass classification. Therefore, this paper uses the LibSVM tool to train and test the SVM model. LibSVM has certain requirements for the data format of the training and test sets, and the format of the input files is shown in Figure 4. Each row of data in Figure 4 represents a training vector: the 'label' identifies the class of the sample in this multiclass problem, the 'index' is the sequence number of a feature, and the 'value' is the value of that feature. In this paper, after feature extraction and feature vector construction, all data sets trained and tested by LibSVM are transformed into data files of this format for the experiments.
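A row in the LibSVM input format has the form `label index:value index:value ...`, with feature indices in ascending order. The sketch below writes one such row from a sparse feature mapping; the function name `to_libsvm_line`, the integer class label, and the feature values are hypothetical, and zero-valued features are omitted, which the sparse format allows.

```python
def to_libsvm_line(label, features):
    """Format one sample as a LibSVM row.

    features: dict mapping a 1-based feature index to a numeric value.
    """
    pairs = sorted((index, value) for index, value in features.items() if value != 0)
    return " ".join([str(label)] + [f"{index}:{value:g}" for index, value in pairs])

# Hypothetical sample: class 3 with three features, one of them zero.
print(to_libsvm_line(3, {5: 1, 2: 0.5, 9: 0}))
```

Categorical features such as entity type or relative position would first need to be mapped to numeric indices and values; how the paper performs that mapping is not stated.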
In order to compare the effects of extended features and sentence structure information on entity relation extraction from Chinese EMRs, three contrast experiments are set up in this paper. The first experiment is the baseline, which uses the basic features, including lexical feature, contextual feature, entity feature, and location feature. The second experiment adds the extended features to the basic features, and the last experiment further adds dependency parsing to form the dependency syntactic features. The results of the experiments are evaluated by three indicators [25]: Precision (P), Recall (R), and F1.
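The paper cites [25] for the three indicators without spelling them out. The sketch below uses the standard per-class definitions, P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R), which are assumed to match the cited ones; the gold and predicted labels are toy values, not experimental data.

```python
def precision_recall_f1(gold, predicted, relation):
    """Per-relation precision, recall, and F1 from two parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == relation and p == relation)
    fp = sum(1 for g, p in pairs if g != relation and p == relation)
    fn = sum(1 for g, p in pairs if g == relation and p != relation)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
gold = ["TrID", "TrID", "TeRS", "TrWD", "TeRS"]
predicted = ["TrID", "TeRS", "TeRS", "TrID", "TeRS"]

print(precision_recall_f1(gold, predicted, "TrID"))
print(precision_recall_f1(gold, predicted, "TeRS"))
```

A relation type that never appears in the predictions gets a precision of 0.0 here by convention, which avoids a division by zero for rare types.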

Results and Analysis.
The experimental results of relation extraction based on different features for the data set are shown in Table 4. As expected, the method of fusing dependency parsing outperforms the relation extraction methods based on basic features or extended features. For the baseline, the extraction of the entity relations TrCD, TrNAD, and TrNAS is poor. This is because these three types of relations appear rarely (fewer than 5 times) in the tagged corpus. The precision of TeRS is high, not only because this relation type appears more frequently in the training corpus but also because its characteristics are obvious: the sentence pattern is basically "胸片示 (chest X-ray shows): 双肺纹理增多 (bilateral lung markings are increased), 模糊 (blurred)". In addition, the extraction of SID and TeRD is better, which is also due to obvious surface features and more training data. However, the precision for TrID, TrIS, TrWD, and TrWS is low because of the long sentences in Chinese EMRs, in which the contextual features of the words immediately before and after the entities are not distinctive on their own.
After adding the extended features proposed in this paper, the precision and recall of all relation types except the three unextracted types TrCD, TrNAD, and TrNAS are improved, and the improvement is strongest for four types: TrID, TrWD, TrIS, and TrWS.
This is because the medical record features in the extended features (including chapter information and entity modification information) have some influence on locating entity relations. For example, in the chapter "出院情况 (discharge situation)," the incidence of entity relations related to improvement is higher than that of relations related to worsening. The indicator features in the extended features are more effective for the improving and worsening relation types because there are related indicators (好转 (improvement), 稳定 (stability), 一般 (general), 不佳 (poor), etc.) before and after the entities of these relations. The verb features before and after the entities in the extended features are also of great significance to entity relation extraction from Chinese EMRs. Because sentences in Chinese EMR texts are long, the words before and after many entity pairs are meaningless for entity relation extraction, while the verbs before and after entity pairs generally have a certain indicative meaning.
As shown in Table 4, the precision and recall of all entity relations are significantly improved after adding the dependency syntactic features. Dependency parsing mainly mines deeper structure information of sentences on top of the surface semantic features. The three dependency syntactic features added in this paper further improve the precision of TrID, TrWD, TrIS, and TrWS considerably and also improve TeRD and TeRS. This is because the sentence patterns of "test reveals symptoms" and "test confirms diseases" are very similar and uniform. Many sentences follow the patterns "a certain test: symptom description or disease description" or "test shows: symptom description or disease description," so this characteristic can be mined by dependency parsing.
The F1 values for relation extraction based on different features show the trend in the effects of entity relation extraction. With a limited training corpus, the performance for each entity relation is improved after fusing the extended features and dependency syntactic features. In particular, the method is more effective for the relation types (TrID, TrWD, etc.) that are relatively few in the corpus. However, our method is not very effective in extracting the three relation types TrCD, TrNAD, and TrNAS, because the number of these relation types in the corpus is too small. Future research can focus on how to generate relevant corpora or mine deeper features when the corpus is small.

Table 3: Entity counts in the data set.
Entity type    (unlabeled)    First disease process    Total
Disease        905            1519                     2424
Symptom        1407           2225                     3632
Test           599            986                      1585
Treatment      1045           1264                     2309
Total          3956           5994                     9950

Conclusions
This paper implements the extraction of entity relations from Chinese EMRs. The relation types extracted include the relations between treatment and disease, treatment and symptom, test and disease, test and symptom, and disease and symptom. A machine learning method is used to transform the task of relation extraction into the classification of entity pairs, with the SVM model used for training and testing. The similarity of sentences provides many hints about entity relations, i.e., the relation between two entities in sentences with similar structures and semantics is generally the same. First, this paper proposes four basic features of general text, such as lexical feature and location feature. Second, because many entities or words are juxtaposed in Chinese EMR texts, the simple context information is redundant and noisy, so the extended features, composed of chapter information and indicator features, are proposed. In addition, because the basic features and extended features are only superficial semantic features and ignore the information of sentence structure, the LTP tool is used to perform dependency parsing on Chinese EMR texts and to introduce the dependency syntactic features. An SVM model is adopted to train and test entity relation extraction.
Three comparative experiments are designed for the above three types of features. The results show that the extended features and dependency syntactic features proposed in this paper improve the precision and recall of entity relation extraction from Chinese EMRs to a certain extent. However, the training set and test set used in this paper are limited in scale. In the future, it is necessary to study deep learning methods for a large-scale corpus to extract entity relations more efficiently.

Supplementary Materials
Due to the privacy of our EMR data set, we selected some experimental data as samples. The details of the supplementary materials file are as follows:
(1) The discharge folder contains the discharge summary data, which includes the training data, the test data, the discharge summary relations, and the discharge summary entities. The files in the train and test folders are the discharge summaries of patients, which are used to train and test models and include the condition of hospitalization, admitting diagnosis, diagnosis and treatment process, discharge diagnosis, hospital discharge, and discharge orders. The files in the folders named dischargeEntity are the medical entities of the discharge summaries. Every line in these files corresponds to the information of one entity in a discharge summary and is tagged with "C = entity P = start: end T = entity type A = entity assertion," where C represents the concept of the entity, P gives the start and end positions of the entity in the EMR text, and T and A stand for the type and the modification of the entity, respectively. The files in the folders named dischargeRelation are the entity relations of the discharge summaries. Every line in these files corresponds to one relation between entities in a discharge summary and is tagged with E = {entity[start-end]entity type; ...;}‖R = ‖E = {entity[start-end]entity type;...;}, where the first E represents the first entity, including the entity concept, the start-end position, and the entity type. Similarly, the second E represents the second entity, and the R in the middle represents the relation type between the two entities.
(2) The progress folder contains the progress record data, which includes the training data, the test data, the progress record relations, and the progress record entities. The files in the train and test folders are the progress records of patients, which are used to train and test models and include the characteristics of the case, the preliminary diagnosis, and the plan of diagnosis. The files in the folders named progressEntity are the medical entities of the progress records. Every line in these files corresponds to the information of one entity in a progress record and is tagged with "C = entity P = start: end T = entity type A = entity assertion," where C represents the concept of the entity, P gives the start and end positions of the entity in the progress record, and T and A stand for the type and the modification of the entity, respectively. The files in the folders named progressRelation are the entity relations of the progress records. Every line in these files corresponds to one relation between entities in a progress record and is tagged with E = {entity[start-end]entity type;...;}‖R = ‖E = {entity[start-end]entity type;...;}, where the first E represents the first entity, including the entity concept, the start-end position, and the entity type. Similarly, the second E represents the second entity, and the R in the middle represents the relation type between the two entities. (Supplementary Materials)
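An entity line in the "C = entity P = start: end T = entity type A = entity assertion" layout could be read with a tolerant regular expression such as the one below. This is a sketch under stated assumptions: the exact spacing and field values in the real files are not shown in the paper, so the pattern accepts optional spaces around "=" and ":", and the sample line, its offsets, and its type and assertion values are hypothetical.

```python
import re

# Tolerant pattern for the C / P / T / A fields described in the text.
ENTITY_LINE = re.compile(
    r"C\s*=\s*(?P<concept>.+?)\s+"
    r"P\s*=\s*(?P<start>\d+)\s*:\s*(?P<end>\d+)\s+"
    r"T\s*=\s*(?P<type>.+?)\s+"
    r"A\s*=\s*(?P<assertion>.+)$"
)

def parse_entity_line(line):
    """Return the entity fields as a dict, or None if the line does not match."""
    match = ENTITY_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    return fields

# Hypothetical line; the offsets, type, and assertion values are made up.
print(parse_entity_line("C=右颈部肿物 P=12:17 T=symptom A=present"))
```

Lines that do not follow the layout return `None`, so a caller can log and skip them instead of failing on the first malformed record.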