DEERE: Document-Level Event Extraction as Relation Extraction

The descriptions of complex events usually span multiple sentences, so complete event information must be extracted from the whole document. To address the challenges of document-level event extraction, we propose a novel framework named Document-level Event Extraction as Relation Extraction (DEERE), which is suitable for document-level event extraction tasks without trigger-word labelling. Through a well-designed task transformation, DEERE remodels event extraction as single-stage relation extraction, which can mitigate error propagation. A long-text-capable encoder is adopted in the relation extraction model to effectively perceive the global context, and a fault-tolerant event integration algorithm is designed to improve prediction accuracy. Experimental results show that our approach advances the SOTA on the ChFinAnn dataset by an average F1-score of 3.7. The code and data are available at https://github.com/maomaotfntfn/DEERE.


Introduction
The aim of the Event Extraction (EE) task is to extract structured event information from unstructured text [1]. EE can be divided into Sentence-level Event Extraction (SEE) and Document-level Event Extraction (DEE). Previous research has focused on SEE, but the description of a complex event usually involves multiple sentences, so more complete event information must be extracted from the whole document. SEE no longer meets this need, and its methods are ill-suited to DEE tasks. The two main challenges of DEE are argument-scattering and multi-events. Argument-scattering means that the arguments of an event are scattered across multiple sentences. As shown in Figure 1, the arguments of Event-2 are scattered in S18, S21, and S22. Multi-events means that a document includes multiple events, whose arguments may overlap. Depending on the degree of overlap, the relationship between two events can be classified as (1) no arguments overlap, (2) arguments overlap between events of different types, and (3) arguments overlap between events of the same type (e.g., both Event-1 and Event-2 contain the argument Tacheng International). In addition, DEE tasks without trigger-word labelling can be regarded as another challenge.
To address the above challenges, the most recent SOTA method, DE-PPN [2], designed an end-to-end model in which a document-level encoder obtains the text representations and a multi-granularity decoder generates events in parallel. However, DE-PPN encodes each sentence separately and concatenates the sentence encodings into a document encoding after max pooling, which ignores the interaction between sentences and cannot fully capture the global context. Moreover, the extraction process of DE-PPN includes candidate argument identification, event prediction, and role filling; this kind of multi-stage structure is prone to error propagation.
In this paper, we propose an event extraction framework named DEERE (short for Document-level Event Extraction as Relation Extraction).
The key idea is to transform the complex DEE task into a relatively simple relation extraction task, which can deal with both the argument-scattering and multi-events challenges. DEERE adopts a single-stage entity-relation joint extraction model to mitigate error propagation, and a long-text-capable Transformer is used as the text encoder to effectively perceive the global context.
In summary, our contributions include: (1) We propose a novel framework (DEERE) based on task transformation, which is suitable for DEE tasks without trigger-word labelling. (2) We design two key algorithms in the framework: the role selection algorithm reduces the probability of event arguments overlapping, and the event division algorithm further handles the remaining cases of argument overlap. (3) Experimental results show that DEERE significantly outperforms the most recent SOTA method on a widely used DEE dataset, with an average F1-score improvement of 3.7.

Methodology
As shown in Figure 2, the architecture of DEERE includes three modules: task transformation, relation extraction, and event prediction. Labelled events are transformed into well-designed relational triples, which serve as the training data of a relation extraction model. The relation extraction model adopts a long-text-capable Transformer to encode the whole document. During event prediction, the relational triples extracted from the input text are reorganized into basic events.

Relational Triplet Creation.
The labelled events in the training set are transformed into two kinds of relations: (1) the role relation describes the role assignments between arguments and is designed to resolve the challenges of argument-scattering and multi-events without key role overlap; (2) the co-event relation describes whether two arguments belong to the same event and is designed to resolve the challenge of multi-events with key role overlap.
Suppose the event type E_k includes m roles, denoted as [r_1, r_2, ..., r_m]; an event instance of E_k correspondingly includes m arguments, denoted as [a_1, a_2, ..., a_m]. One of the roles of E_k is selected as the key role, and the argument that plays the key role in a specific event is called the key argument. The key argument of the event is combined with each nonkey argument to form a role relation triple (a_key, E_k.r_i, a_i), where a_key represents the key argument of a specific E_k event, and a_i represents a nonkey argument playing the i-th role in the same event (i ≠ key).
In a document, a key argument may be involved in several events of the same type. To distinguish these events, a subkey role is selected from the nonkey roles of the event type, and the argument that plays the subkey role in a specific event is called the subkey argument. The subkey argument of the event is combined with every other argument to form a co-event relation triple (a_skey, co_event, a_i), which represents that a_skey and a_i belong to the same event (i ≠ key and i ≠ skey). By definition, there are multiple types of role relation, whose number depends on the total number of roles across all the event types, but only one type of co-event relation, which ignores event types and roles. These relational triples are used as the training data and prediction targets of the relation extraction model.
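The transformation rules above can be sketched as follows. The event structure, role names, and helper function are our own hypothetical illustration, not the authors' implementation; in particular, the relation label format `E_k.r_i` is an assumption.

```python
# Hypothetical sketch: transform one labelled event into role-relation and
# co-event-relation triples, as described in the task transformation.
def event_to_triples(event_type, args, key_idx, subkey_idx):
    """args: ordered mapping from role name to argument (None = missing)."""
    roles = list(args.keys())            # [r_1, ..., r_m]
    key_arg = args[roles[key_idx]]       # argument playing the key role
    subkey_arg = args[roles[subkey_idx]] # argument playing the subkey role
    triples = []
    for i, role in enumerate(roles):
        arg = args[role]
        if arg is None:
            continue                     # null arguments yield no triples
        if i != key_idx:
            # role relation: (a_key, E_k.r_i, a_i)
            triples.append((key_arg, f"{event_type}.{role}", arg))
        if i != key_idx and i != subkey_idx:
            # co-event relation: (a_skey, co_event, a_i)
            triples.append((subkey_arg, "co_event", arg))
    return triples
```

For an event with m non-null arguments, this produces m − 1 role relation triples and m − 2 co-event relation triples, matching the counts discussed in the event division section.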

Key Role Selection.
Based on the above task conversion rules, in order to reduce the situation of a key argument being involved in multiple events of the same type, we want to select as the key role the role that best distinguishes different events. For a group of events of the same type, we define role discrimination as the average probability that each argument on a role can accurately identify the event to which it belongs.
Formally, suppose that a document d contains n events of E_k, denoted as [e_1, e_2, \ldots, e_n]^T, and the argument list of e_i is denoted as [a_i^1, a_i^2, \ldots, a_i^m]. The argument matrix for all events can be represented as

A = \begin{bmatrix} a_1^1 & a_1^2 & \cdots & a_1^m \\ a_2^1 & a_2^2 & \cdots & a_2^m \\ \vdots & \vdots & \ddots & \vdots \\ a_n^1 & a_n^2 & \cdots & a_n^m \end{bmatrix}    (1)

The discrimination of role r_j for the events of E_k in a single document d can be expressed as

D_d(r_j) = \sum_{i=1}^{n} p(e_i)\, q(a_i^j)    (2)

In equation (2), p(e_i) represents the probability of the occurrence of event e_i, and q(a_i^j) represents the probability that the argument a_i^j can accurately identify the event to which it belongs. We assume that each event has an equal probability, that is, p(e_i) = 1/n, so the formula can be simplified as

D_d(r_j) = \frac{1}{n} \sum_{i=1}^{n} q(a_i^j)    (3)

In a group of events of the same type, the higher the repetition rate of arguments on a role, the lower the role's discrimination for events. It is not difficult to prove that the summation in equation (3) is numerically equal to the count of non-repeated arguments on the role, where null arguments are not counted. We therefore obtain

D_d(r_j) = \frac{\mathrm{count\_distinct}(a_1^j, a_2^j, \ldots, a_n^j)}{n}    (4)

In equation (4), count_distinct represents the count of non-repeated arguments, which can easily be obtained by a set operation in practice. Suppose that T documents in the training set contain events of E_k; we calculate the discrimination of role r_j in each document d_t separately and take the average as the global role discrimination:

D(r_j) = \frac{1}{T} \sum_{t=1}^{T} D_{d_t}(r_j)    (5)

From the role list of E_k, we select the role with the highest discrimination as the key role and the one with the second-highest discrimination as the subkey role.
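Equations (4) and (5) reduce to a simple set computation. The following sketch, with hypothetical data shapes, shows the whole role-discrimination calculation for one role:

```python
# Sketch of role discrimination (equations (4)-(5)): the ratio of distinct
# non-null arguments on a role to the number of events, averaged over documents.
def role_discrimination(arg_columns):
    """arg_columns: one list per document, holding the arguments that fill
    role r_j in that document's events of type E_k (None = missing)."""
    per_doc = []
    for column in arg_columns:
        n = len(column)                                # events in this document
        distinct = len({a for a in column if a is not None})
        per_doc.append(distinct / n)                   # equation (4)
    return sum(per_doc) / len(per_doc)                 # equation (5)
```

For example, a document whose three events all share the same pledger yields discrimination 1/3 for the pledger role, so that role is a poor key role candidate.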

Relation Extraction Model.
The triples obtained from the task transformation are input to the relation extraction model together with the original text as training data. In principle, any relation extraction model can be applied here. However, since the results of relation extraction directly affect the results of EE, we adopt the recently proposed GPLinker [3], an entity-relation joint extraction model based on GlobalPointer [4]. Preliminary experiments show that the performance of GPLinker is slightly better than that of Casrel [5] and comparable to that of TPLinker [6]. In addition, GPLinker trains quickly, decodes efficiently, and in theory suffers no exposure bias. GlobalPointer is essentially a token-pair recognition model that can be used for both nested and non-nested NER; multilabel categorical cross-entropy is used as the loss function during training. GPLinker converts the extraction of relational triples (subject, predicate, object) into three kinds of token-pair recognition: entity head/tail pairs, subject/object head pairs, and subject/object tail pairs. Each kind of token pair is recognized by a specific GlobalPointer, and all GlobalPointer modules share the same text encoder.
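The three-way decomposition can be illustrated as follows; the data layout and function name are our own simplification for exposition, not GPLinker's actual interface.

```python
# Illustrative sketch: decompose one triple (subject, predicate, object),
# with known character spans, into the three token-pair targets that the
# three GlobalPointer heads are trained to recognize.
def triple_to_token_pairs(subj_span, pred_id, obj_span):
    """Each span is a (head_index, tail_index) pair over document tokens."""
    sh, st = subj_span
    oh, ot = obj_span
    return {
        "entity_head_tail": [(sh, st), (oh, ot)],  # entity span recognition
        "subj_obj_heads":   [(pred_id, sh, oh)],   # head alignment per relation
        "subj_obj_tails":   [(pred_id, st, ot)],   # tail alignment per relation
    }
```

At inference time, decoding runs in reverse: a triple is emitted only when both entity spans are recognized and their heads and tails are linked under the same predicate.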

Long Text Encoding.
In recent years, SOTA relation extraction models (Casrel, TPLinker, etc.) have mostly been based on BERT [7] or other pretrained language models. The original BERT uses absolute position encoding and can handle a maximum text length of 512 tokens, while the text length in document-level extraction tasks usually exceeds this range. Truncating or segmenting the long text inevitably impairs the model's perception of the full-text context, which is a common problem of previous DEE models. We therefore use RoFormer [8], a Transformer with relative position encoding, to encode every document as a whole. Of course, other pretrained language models that support long text encoding could also be used here.
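RoFormer's relative position encoding is the rotary position embedding (RoPE): each query/key vector at position m is rotated pairwise by angles m·θ_i, so attention scores depend only on relative offsets rather than absolute positions. A minimal NumPy sketch of the rotation, under the standard θ_i = 10000^(−2i/d) schedule:

```python
import numpy as np

def rotary_embed(x, position, base=10000.0):
    """Rotate vector x (even dimension d) by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # theta_i = base^(-2i/d)
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # treat dims as 2D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that the dot product of a rotated query at position m and a rotated key at position n is invariant under a common shift of both positions, which is why no fixed maximum length is baked into the encoding.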

Event Prediction.
As shown in the event prediction module of Figure 2, we first extract relational triples from the input text and then integrate these triples into basic events. Since the triples are predicted by the relation extraction model, there are inevitably some errors or omissions, which requires the event integration algorithm to have a certain fault tolerance. To this end, we first construct event-clusters from the key arguments and role relations. If an event-cluster contains multiple basic events, it is further divided according to the subkey arguments and co-event relations.

Event-Cluster Construction.
Based on the predicted role relation triples, a special structure called an event-cluster is constructed around each key argument. Every event-cluster includes a key argument and its related nonkey arguments, all of which belong to the same event type. An event-cluster is classified as a single event-cluster or a compound event-cluster according to the number of basic events it includes.
According to the construction rules, there is only one key argument in an event-cluster, but there may be several arguments on a nonkey role, which is then called a multivalued role. The higher the proportion of multivalued roles in an event-cluster, the more likely it is to contain multiple events. If there is more than one subkey argument and the proportion of multivalued roles exceeds a certain threshold, the cluster is judged to be a compound event-cluster; otherwise, it is judged to be a single event-cluster. The threshold can be adjusted as a hyperparameter.
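The compound-cluster test can be written down directly; the cluster representation below is our own sketch, and the default threshold of 0.2 is the value reported in the experimental setup.

```python
# Sketch of the compound event-cluster judgement: more than one subkey
# argument AND a multivalued-role proportion above the threshold.
def is_compound(cluster, subkey_role, threshold=0.2):
    """cluster: mapping from each nonkey role to the list of arguments
    predicted for it within one event-cluster."""
    n_subkey = len(cluster.get(subkey_role, []))
    multivalued = sum(1 for args in cluster.values() if len(args) > 1)
    proportion = multivalued / len(cluster)
    return n_subkey > 1 and proportion > threshold
```

Only clusters judged compound are passed to the event division step; all others are emitted directly as single basic events.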

Event Division.
A compound event-cluster needs to be further divided into basic events. In detail, we first create one basic event for each subkey argument and then assign arguments to the other roles of each event according to the predicted co-event relations. If there is more than one candidate argument for a role, the one with the highest relation strength is selected.
We have also considered the maximal clique search algorithm for event division, which requires that any two arguments of the same event be accurately judged to have a co-event relation. Assuming that an event has n arguments, even if only one-way relations are considered, n(n − 1)/2 co-event relations need to be extracted, and missing just one relation can lead to serious errors in event division, as shown in Figure 3. Although the maximal clique search algorithm is more complete in theory, its application conditions are too demanding. In contrast, our proposed algorithm needs only n − 2 co-event relations to divide out a basic event, which not only relaxes the application conditions but also provides stronger fault tolerance. This point is confirmed in the experiments.
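The division step described above can be sketched as follows. The data shapes (edge tuples carrying a relation strength) are our own assumption about how the model's scores would be surfaced, not the authors' exact interface.

```python
# Simplified sketch of fault-tolerant event division: one basic event per
# subkey argument; remaining arguments attach via predicted co-event
# relations, with the strongest relation winning on conflicts.
def divide_cluster(key_arg, subkey_args, coevent_edges):
    """coevent_edges: (subkey_arg, other_arg, role, strength) tuples
    predicted by the relation extraction model."""
    events = {s: {"key": key_arg, "subkey": s} for s in subkey_args}
    best = {}  # (subkey, role) -> strength, to resolve competing candidates
    for s, arg, role, strength in coevent_edges:
        if s not in events:
            continue                    # tolerate spurious edges
        if strength > best.get((s, role), float("-inf")):
            best[(s, role)] = strength
            events[s][role] = arg
    return list(events.values())
```

Note that an event with one missing co-event relation simply loses one role slot instead of breaking the whole division, which is the fault-tolerance advantage over maximal clique search.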

Dataset.
Our proposed method is evaluated on the ChFinAnn dataset, which is also used by DCFEE [9], Doc2EDAG [10], and DE-PPN [2]. ChFinAnn is a DEE dataset without trigger-word labelling, automatically annotated by distant supervision. The dataset contains 5 financial event types: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO), and Equity Pledge (EP). It includes 32,040 documents in total, about 30% of which contain multiple events.

Evaluation Metric.
We follow the evaluation metrics used in Doc2EDAG and DE-PPN. For each predicted event, the most similar ground truth is selected without replacement to calculate Precision, Recall, and F1-score. Micro-averaged role-level scores serve as the final metric for each event type. For global performance, the previous literature only reported the macro-averaged F1-score over all event types, which does not account for the imbalance of the sample distribution. Therefore, we additionally report the micro-averaged F1-score, which better reflects practical performance.
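The matching-without-replacement scoring can be sketched as below. This is our reading of the Doc2EDAG-style metric, not the official evaluation script; the similarity function and data shapes are assumptions.

```python
# Sketch: greedily match each predicted event to its most similar unmatched
# gold event, then micro-average precision/recall/F1 over role slots.
def score_events(predicted, gold):
    """predicted/gold: lists of dicts mapping role -> argument (None = empty)."""
    gold_pool = list(gold)
    tp = fp = fn = 0
    for pred in predicted:
        # most similar gold event = most shared non-null role-argument slots
        match = max(gold_pool, key=lambda g: sum(
            pred.get(r) == g.get(r) and g.get(r) is not None for r in g),
            default=None)
        if match is not None:
            gold_pool.remove(match)      # selection without replacement
        for role in set(pred) | set(match or {}):
            p, g = pred.get(role), (match or {}).get(role)
            if p is not None and p == g:
                tp += 1
            elif p is not None:
                fp += 1
            if g is not None and p != g:
                fn += 1
    for g in gold_pool:                  # unmatched gold events: all misses
        fn += sum(1 for r in g if g[r] is not None)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```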

Experimental Setup.
The head size of the GlobalPointer in the relation extraction model is set to 64. The text encoder adopts the char-based RoFormer (Chinese_roformer_char_L-12_H-768_A-12), whose parameter scale is comparable to BERT and whose vocabulary is reduced to 12,000.
Since most of the documents in the dataset are within 2000 tokens, the maximum encoding length (max_len) of RoFormer is set to 2000. The threshold of the multivalued-role proportion for multi-event judgement is set to 0.2. For each event type, we select the role with the highest discrimination as the key role and the second-highest as the subkey role; the selection results are shown in Table 1.
All experiments run on a workstation with an RTX 3090. The model is trained for 20 epochs with the exponential-moving-average Adam optimizer and a learning rate of 1e−5. The model performing best on the development set is saved, and its performance on the test set is reported as the final result.

Baseline.
Our framework DEERE is compared with the previous SOTA methods as follows: DCFEE [9] proposed a DEE method based on key-event detection and argument completion; DCFEE-O and DCFEE-M are its single-event and multi-event versions, respectively. Doc2EDAG [10] is an end-to-end model that transforms event table filling into path expanding over an entity-based directed acyclic graph; GreedyDec is a simplified version of Doc2EDAG that greedily fills only one event table entry. DE-PPN [2] is the most recent SOTA model, which aggregates the document-level context to predict events in parallel; DE-PPN-1 is the simplified version that only generates one event.

Overall Performance. As shown in Table 2, our framework DEERE achieves the best performance on all 5 event types compared with the baseline methods. Specifically, DEERE improves the F1-score over DE-PPN by 1.0, 5.4, 6.0, 2.1, and 3.5 on the event types EF, ER, EU, EO, and EP, respectively. Table 3 shows that our model obtains the highest score in both the single-event and multi-event parts for every event type. As to global performance, the macro F1-score of DEERE is 3.6 higher than that of DE-PPN (3.7 higher for the single-event part and 3.4 higher for the multi-event part). In addition, the micro F1-score of DEERE on the test set is 83.7 (90.8 for the single-event part and 75.9 for the multi-event part), which was not reported in the previous literature.

Ablation Experiments.
To verify the effect of the two key mechanisms in our framework, we conduct a series of ablation experiments. As shown in Table 4, RoleSelection+ means selecting the key roles and subkey roles according to their discrimination for events, whereas RoleSelection− means selecting them sequentially from the role list. EventDivision+ means performing the event division operation on event-clusters, whereas EventDivision− means treating every event-cluster as a single event.
Test results show that using either of the two mechanisms improves performance, but the improvements overlap significantly when both are used. The reason is that after role selection, the situation in which the same key argument is involved in multiple events is greatly reduced, so the effect of event division becomes relatively limited. Although using only role selection on the ChFinAnn dataset performs close to using both mechanisms simultaneously, this does not mean that event division is dispensable: if the key roles selected in other EE tasks are less ideal (i.e., their discrimination for events is not high enough), event division becomes more important.
In addition, we also test the event division method based on maximal clique search. Its micro F1-score is 81.3 with role selection and 76.8 without; compared with our event division method, it drops by 2.4 and 5.0, respectively.

Effect of Maximum Encoding Length.
We investigate the influence of the text encoder's max_len on performance. As shown in Figure 4, as max_len increases, the F1-score and Recall rise rapidly, while Precision stays basically flat (with a slight decrease). The model performance no longer improves significantly after max_len exceeds 2000, which is consistent with our default value. These results affirm the idea of global context awareness and the feasibility of whole-document encoding.

Effect of Relation Extraction Model.
Our framework does not rely on a specific relation extraction model or text encoder; both can be chosen freely. With other settings unchanged, we try different combinations of relation extraction models and text encoders.
The training results are shown in Figure 5, where Casrel [5] is a two-stage entity-relation joint extraction model, and NEZHA [11] is another pretrained language model using relative position encoding. The performance of EE basically tracks that of relation extraction, and the combination of GPLinker and RoFormer obtains the best score. The results show that the single-stage relation extraction model (GPLinker) is significantly better than the multi-stage model (Casrel), while the effect of changing the text encoder is much smaller.

Related Work
SEE uses only intra-sentence features, and traditional feature-engineering-based approaches [12, 13] cannot be adapted to tasks that rely on complex semantic relationships. Recent work on EE is based on deep learning to learn features automatically, mainly using pipeline and joint models. Pipeline methods [14, 15], whether using CNNs or RNNs, split the extraction process into two separate steps: extracting event trigger words and detecting arguments. This approach inevitably causes error transfer and makes it difficult to capture long-distance dependencies. To reduce error transfer, joint methods [16, 17] extract trigger words and arguments simultaneously. To address the problem of overlapping roles, pretrained language models [18, 19] are used to model intra-sentence and inter-sentence contextual information, improving overall task accuracy. Early classification models [20, 21] divide DEE into two subtasks, recognition of event descriptors and detection of arguments, using SVMs as classifiers. Neural-network-based classification models [22, 23] use word embeddings as the input to a decision tree and then obtain the structured information of the document by integrating the information.
To address argument-scattering, one solution is to transform the extraction task into a sequence annotation task that dynamically fuses sentence-level and document-level information [24]. Another solution is to use a key sentence as the event-centered sentence for argument completion: DCFEE [9] builds its extraction model on sequence annotation, main-event discovery, and an argument completion strategy, which solves argument-scattering to a certain extent. For multi-events, Doc2EDAG [10] uses an end-to-end approach to integrate arguments scattered across a document and transforms the document-level event table filling task into an entity-based path expansion task over directed acyclic graphs. The multilayer bidirectional network MLBiNet [25] fuses cross-sentence semantic and associated event information to enhance the discrimination of each event mention. In addition, there are methods [26, 27] that transform EE into other tasks such as reading comprehension and question answering.

Conclusion
We propose a novel framework (DEERE) that transforms the DEE task into a relation extraction task.
The new SOTA performance on the ChFinAnn dataset illustrates the soundness of the framework design, and ablation experiments verify the effectiveness of its key mechanisms. The single-stage relation extraction model mitigates error propagation, and the fault-tolerant event integration algorithm compensates for some errors of relation extraction. Increasing the maximum length of the text encoder helps improve document awareness but requires more computational resources.
The ChFinAnn dataset used in our experiments has a considerable scale, but labelled data in other application scenarios are often insufficient. In future work, we will focus on few-shot event extraction and data augmentation for it.
Data Availability

The code and data are available at https://github.com/maomaotfntfn/DEERE.

Conflicts of Interest
The authors declare that there are no conflicts of interest.