BERT_LF: A Similar Case Retrieval Method Based on Legal Facts

With the development of smart justice in China, the Supreme People's Court has continuously and comprehensively implemented a system of compulsory retrieval of similar cases. More and more judicial big data has been disclosed, case retrieval is applied ever more widely, and the accuracy of similar case search results urgently needs to be improved. Legal case retrieval is a special search task: for a given query case, it searches for similar cases. Unlike traditional text search, legal case retrieval has different characteristics and poses greater challenges, because the query case is longer and more complex than common keyword queries and short article queries. In addition, the definition of relevance between query cases and candidate cases differs from general relevance based on text or topic. To solve these problems, we propose a similar case retrieval method based on legal facts; our model combines the topic distribution and legal entity facts to make the document representation vector more suitable for legal scenarios. At the same time, paragraph aggregation based on BERT is used to encode contextual semantic information and to handle long text. The experimental results show that our method is superior to the existing methods.


Introduction
In many legal systems, similar case retrieval is of great significance to ensure legal fairness. With the development of smart justice in China and the increasing number of digitized legal documents, automatic retrieval of legal cases has attracted more and more attention in the research field of information retrieval (IR) [1][2][3]. In recent years, researchers have made many typical contributions to the retrieval of legal information [4][5][6][7][8].
The purpose of legal case retrieval is to identify cases that are similar to a given case. China has provided a series of guiding cases that can be referred to in the trial of similar cases. Guiding cases are composed of titles, keywords, key points of adjudication, relevant laws and regulations, basic case facts, adjudication results, adjudication reasons, and notes including the names of the effective adjudication and adjudicators. The problem of studying similar cases is essentially the study of text similarity. However, the legal case retrieval task differs greatly from traditional text retrieval in terms of the length of the case text, the definition of relevance, and the accessibility of the legal dataset. Based on the research of Shao et al. [9], existing text similarity methods face several challenges in solving this problem.
Challenge 1. Legal cases are often long texts, so models fail to capture all useful information when building vector representations of the text. The neural network models most commonly used for text, such as LSTM, have limited memory capacity and perform poorly on long text, which also explains the poor performance of general text similarity models in the legal field. Xiao et al. [10] proposed Lawformer, a model for processing long texts that combines local sliding window attention and global task-driven full attention.
Challenge 2. The similarity of legal cases differs from generic textual similarity and, to some extent, also goes beyond the general definition of subject matter relevance [2]. It requires exploring the similarity of the legal facts contained in the legal case text. Traditional text similarity methods can indeed learn semantic similarity, but the model does not understand legal domain knowledge, so it may not learn the deeper legal logical relationships beneath the surface semantics. As a result, text similarity methods alone fail to find highly similar legal cases. It is therefore crucial to identify the similarities of cases in terms of legal issues and legal processes, which requires a full understanding of the legal case text.
Challenge 3. Collecting large amounts of legal case data, as well as similar case datasets, is a challenge. On the one hand, in many legal systems, the bulk download of legal documents is restricted. On the other hand, the cost of obtaining accurate relevance judgments is high because they require legal expertise. The lack of data hampers the training of deep neural models.
What is more, the text structure of legal judgment documents is different from that of ordinary text. Generic text similarity models mainly consider the structural characteristics of the text, such as syntactic structure. Although a legal judgment document is unstructured text, it often has specific format requirements, so general text similarity methods cannot accurately represent the legal text. If the structural characteristics of the judgment document can be combined with general text similarity calculation, better results may be obtained.
To solve the problems above, we propose the BERT-LF model. For challenge 1, literature [9] and literature [11][12][13][14] explored the long text problem in applying BERT; their work included sentence-level score aggregation, paragraph-level score aggregation, and paragraph-level representation aggregation, so the problem has been largely addressed. This inspired us to infer the similarity of an entire legal case by aggregating paragraph-level semantic interactions. For challenge 2, we propose a legal case representation method based on legal facts combined with the topic distribution. The deep combination of legal facts, document topic, and semantic information makes the document representation vector more suitable for legal scenarios. Further, we use an attention mechanism to distinguish the importance of legal information between paragraphs. For challenge 3, we crawled judgment document data from "China Judgements Online" to train the topic model and adapt it to the legal scene. Our experiments were conducted on the legal case retrieval dataset [15], and the results demonstrate the effectiveness of the proposed method.

Related Work
In past research, a large number of text retrieval models have been proposed, especially for specific texts. Literature [16][17][18] solved the problems of complex feature dimensions and difficult retrieval of text data. Common approaches to early semantic representation include vector space models, topic models, and their variants such as the classic LDA [19]. However, the research in [20] showed that similarity under the same topic still needed to be improved. With the advent of word embedding, information retrieval has shifted to neural information retrieval. Researchers began using dense vector representations of words and documents based on deep learning models [21][22][23][24] as the input of machine learning algorithms. Traditional bag-of-words IR models include BM25 [25], TF-IDF [26], and LMIR [27]. Mandal et al. [28] compared four unsupervised text vector generation models, TF-IDF, word2vec, LDA, and doc2vec, for calculating legal text similarity and tested them on an Indian dataset containing 47 case pairs; the results showed that doc2vec worked best. Vo et al. [29] also indicated that text semantic representations based on word embedding are helpful in legal text retrieval. Meanwhile, the research in [30,31] showed the effectiveness of neural text embeddings in legal information retrieval.
In view of the text similarity problem for Chinese, some scholars made a series of improvements to the classical similarity method. Li et al. [32] proposed a text similarity calculation algorithm based on improved VSM, which took into account the influence of the same feature words between similar texts. Huang et al. [33] proposed a supervised-WMD algorithm, which added new document features and movement costs to WMD and solved the problem that the WMD cannot take useful classification information into account.
Further, in the field of Chinese legal case retrieval, Lv and Hou [34] improved the topic distribution model and designed a legal case recommendation algorithm. They argued that the words generated by the topic distribution model differ in how well they represent legal texts. Thus, they reduced the probability of words that appeared frequently but carried little weight in the legal text and increased the probability of words that appeared infrequently but were helpful for representing the legal text. In the similarity module of Xiang [35], keywords were extracted by natural semantics and TF-IDF, the keywords of the judgment document were formed from semantics and frequency, and the judgment document was then converted to vectors by the keyword table. Obviously, considering only whether the keywords are the same ignores the contextual information. Wang et al. [36] compared the TF-IDF model, the LDA model, and the improved LLDA model on the task of case similarity and found that TF-IDF performed worst and LLDA performed best. They also pointed out that obtaining good results requires manual intervention in the parameters of the topic model. These word-based similarity calculation methods ignore word order and context. Some subsequent studies have used word2vec vector representations to calculate case similarity. Deng [37] fused the word2vec, doc2vec, and TF-IDF algorithms and used them to calculate case similarity. Li [38] designed a document similarity method that combined bipartite graphs and syntactic information; the compressed document content was used to calculate text similarity, and good results were obtained. Liu et al. [39] proposed a similar case recommendation model based on neural networks, which first used legal facts to guide the generation of text representation vectors for each case and then used the generated vectors to calculate the similarity scores of any pair of cases; the set of cases with the highest similarity was used as the recommended similar cases. Although the above researchers have achieved certain results, most of the models are not designed for long legal documents.
Since BERT [40] made significant improvements in various NLP tasks and achieved state-of-the-art performance on 11 tasks, pretrained language models have attracted a great deal of attention in the field of information retrieval. Recently, several studies have examined the application of BERT in legal case retrieval, such as literature [9] and literature [11][12][13][14].

Materials and Methods
3.1. Task Description. The legal case retrieval task refers to finding cases similar to a given query case in a set of candidate cases [8]. Formally, given a query case $q$ and a set of candidate cases $D = \{d_1, d_2, \dots, d_n\}$, the task of legal case retrieval is to determine the supporting cases $D^* = \{d_i^* \mid d_i^* \in D \wedge \mathrm{noticed}(d_i^*, q)\}$, where $\mathrm{noticed}(d_i^*, q)$ indicates that $d_i^*$ is legally similar to the query case $q$. Both the queries and candidates are long texts containing descriptions of legal facts.

3.2. Architecture Overview. As shown in Figure 1, the framework of our model contains three modules. The first is the legal feature encoding module, which contains the BERT-based semantic encoding part, the LDA-based topical encoding part, and the legal-entity-based encoding part. The second is the encoding aggregation module, which aggregates the outputs of the first module before they enter the third. The third is the relational computation based on the attention mechanism.
3.3. Legal Feature Encoding. When judging whether two cases are similar, we are actually considering whether the legal facts and the logic of the events contained in the two cases are similar. Therefore, legal fact information is extracted through three coding modules; we capture the semantic context information of the case through the BERT-based module, cluster the topic information by the topical encoding module, and strengthen the role of legal facts information more accurately through the legal entity encoding module. And then, we aggregate all the above encodings to represent the paragraph-level information.

3.4. Semantic Encoding. Drawing on the analysis of the literature [9], [11][12][13][14], we use a paragraph aggregation architecture based on BERT for semantic encoding. First, we divide the long text into paragraphs that BERT can handle and then obtain the semantic encoding of the query and candidate paragraphs from a pretrained BERT model. On the one hand, this takes advantage of BERT's strong semantic learning ability; on the other hand, it solves the problem of long text encoding for legal cases.
Formally, a query document $q$ and a candidate document $d_k$ are represented as $q = (p_{q1}, p_{q2}, \dots, p_{qN})$ and $d_k = (p_{k1}, p_{k2}, \dots, p_{kM})$, where $N$ and $M$ denote the total number of paragraphs in $q$ and $d_k$, respectively. For each paragraph in $q$ and $d_k$, we construct a paragraph pair $(p_{qi}, p_{kj})$, where $1 \le i \le N$ and $1 \le j \le M$, which, along with the reserved tags (i.e., [CLS] and [SEP]), serves as the input of BERT. We use the pretrained Chinese BERT model called BERT-Base-Chinese (https://github.com/google-research/bert/blob/master/multilingual.md) to obtain a representation for each paragraph pair. Positional embeddings are added to capture word order, and these embeddings are fed into the transformer layers, where each layer generates a new contextual embedding representation by calculating the weighted sum of the token embeddings. The weights are calculated by the multihead attention mechanism. Words with a large attention weight are considered more relevant to the target word. Different attention matrices capture different types of word relationships, such as exact matching or synonym relationships. Finally, the final hidden layer output of the first token [CLS] is taken as the semantic aggregate of the query-candidate paragraph pair. As shown in Figure 1, we use the output embedding of the first token as the representation $C_{ij}$ of the entire paragraph pair $(p_{qi}, p_{kj})$. In this way, we obtain an interaction matrix $C$ over all query-candidate paragraph pairs, where the semantic representation of the pair $p_{qi}$ and $p_{kj}$ is $C_{ij} \in \mathbb{R}^{H_B}$. Next, the interaction matrix $C$ is further encoded with a GRU model, which yields the sequence of hidden states generated by the forward GRU, $h_{qk} = [h_{qk1}, h_{qk2}, \dots, h_{qkN}]$, $h_{qki} \in \mathbb{R}^{H_R}$.
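The paragraph-pair construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `split_into_paragraphs` and `build_pair_inputs` are hypothetical, splitting by a fixed character budget stands in for the paper's paragraph segmentation, and in practice a BERT tokenizer would insert the [CLS] and [SEP] tags rather than string formatting.

```python
from itertools import product
from typing import List


def split_into_paragraphs(text: str, max_chars: int) -> List[str]:
    """Cut a long case text into consecutive chunks of at most max_chars characters.

    Assumption: a fixed character budget replaces the paper's segmentation rule.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_pair_inputs(query_paragraphs: List[str],
                      candidate_paragraphs: List[str]) -> List[List[str]]:
    """Return an N x M grid of BERT-style inputs, one per pair (p_qi, p_kj)."""
    n, m = len(query_paragraphs), len(candidate_paragraphs)
    grid = [[""] * m for _ in range(n)]
    for i, j in product(range(n), range(m)):
        grid[i][j] = (
            f"[CLS] {query_paragraphs[i]} [SEP] {candidate_paragraphs[j]} [SEP]"
        )
    return grid


# Toy texts: the query yields N = 2 paragraphs, the candidate M = 4.
query = split_into_paragraphs("a" * 10, max_chars=5)
candidate = split_into_paragraphs("b" * 18, max_chars=5)
pairs = build_pair_inputs(query, candidate)
print(len(pairs), len(pairs[0]))
```

Each cell of the grid corresponds to one entry $C_{ij}$ of the interaction matrix once the pair has been encoded, so the cost of encoding grows with $N \times M$.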
3.5. Topical Encoding. In this section, we obtain the topic probability interaction matrix of the query paragraph and the candidate paragraph pairs according to the inverse process of generating documents in LDA model.
As is well known, LDA generates documents through document generation, topic generation, and word generation, which are divided into the following five steps: (1) Select a document $m$ based on the prior probability, where $\vec{\theta}_m$ is the topic distribution and $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution.
According to formula (2), the topic distributions of paragraphs $P_{qi}$ and $P_{kj}$ are $ZP_{qi} = [ZP_{qi\text{-}1}, ZP_{qi\text{-}2}, \dots, ZP_{qi\text{-}v}]$ and $ZP_{kj} = [ZP_{kj\text{-}1}, ZP_{kj\text{-}2}, \dots, ZP_{kj\text{-}v}]$, respectively. Then, we use similarity formula (3) to obtain the similarity interaction matrix between the query paragraphs and the candidate paragraphs with respect to the topic probability distribution, $T_{qk} = [T_{q1k}, T_{q2k}, \dots, T_{qNk}]$, where each element is denoted by $T_{qik}$ and $v$ is a hyperparameter representing the number of topics.
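To make the topic interaction step concrete, the sketch below compares paragraph-level topic distributions pairwise. The text does not preserve similarity formula (3), so cosine similarity is used here purely as an assumption; the function names and the toy distributions are hypothetical, and a trained LDA model would supply the real vectors.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_interaction(query_topics: List[List[float]],
                      candidate_topics: List[List[float]]) -> List[List[float]]:
    """N x M matrix of similarities between paragraph topic distributions.

    Assumption: cosine similarity stands in for the paper's formula (3).
    """
    return [[cosine(zq, zk) for zk in candidate_topics] for zq in query_topics]


# Toy distributions over 3 topics (the paper sets the number of topics v to 7).
ZP_q = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
ZP_k = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
T = topic_interaction(ZP_q, ZP_k)
print(T)
```

Paragraphs dominated by the same topic score close to 1, while paragraphs concentrated on different topics score much lower.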
3.6. Legal Entity Encoding. We mainly focus on criminal cases in China. Referring to the previous research [39,41,42], this study mainly focuses on several parts of legal case including criminal offence, criminal entity type and compensation behavior, criminal consequences, reconciliation, and criminal charge. These legal entities contain the legal facts that have a decisive influence on the judgment.
Although legal judgment documents are unstructured text, the composition and writing order of judgment documents often depend on certain writing norms. In this research, we use regular expressions to extract legal facts and synonymously expand the legal facts contained in each case. Different types of cases have different legal facts. Taking the crime of intentional injury as an example, the legal entities are as follows: Criminal entity type: government officials, minors, mentally ill, first offender, previous convictions, recidivists, etc.
First, we splice all entities of the query paragraphs and candidate paragraphs into two short texts, $T_{qi}$ and $T_{kj}$, in the order of criminal entity type, criminal offence, criminal consequences, compensation behavior, reconciliation, and criminal charge.
Then, we use the pretrained Chinese BERT model BERT-Base-Chinese to obtain representations of $T_{qi}$ and $T_{kj}$ separately, yielding the tensors $TS_{qi}$ and $TS_{kj}$.
Finally, we calculate the cosine similarity of $TS_{qi}$ and $TS_{kj}$ as the entity similarity value of the query-candidate paragraph pair.
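The extraction and splicing steps can be illustrated with a short sketch. The regular expressions used in the study are not given, so every pattern below is a hypothetical stand-in written for an English toy sentence; only the splicing order follows the text. The BERT encoding and cosine similarity steps are omitted here.

```python
import re

# Hypothetical patterns for illustration only; insertion order follows the
# splicing order in the text: entity type, offence, consequences,
# compensation behavior, reconciliation, charge.
ENTITY_PATTERNS = {
    "criminal entity type": r"\b(first offender|recidivist|minor(?! injury))\b",
    "criminal offence": r"\b(stabbed|beat|struck)\b",
    "criminal consequences": r"\b(minor injury|serious injury|death)\b",
    "compensation behavior": r"\b(compensated)\b",
    "reconciliation": r"\b(forgiveness|reconciled)\b",
    "criminal charge": r"\b(intentional injury)\b",
}


def splice_entities(paragraph: str) -> str:
    """Extract legal entities category by category and join them into one short text."""
    found = []
    for pattern in ENTITY_PATTERNS.values():
        found.extend(match.group(0) for match in re.finditer(pattern, paragraph))
    return " ".join(found)


text = (
    "The defendant, a first offender, stabbed the victim, causing serious injury; "
    "he later compensated the family and obtained forgiveness. "
    "He was convicted of intentional injury."
)
print(splice_entities(text))
```

Because the spliced text is short and ordered by category, two paragraphs with the same legal facts produce nearly identical inputs for the encoder regardless of how the surrounding narrative is worded.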

3.7. Encoding Aggregation and Similarity Calculation. In this section, the semantic encoding, topic distribution encoding, and legal entity encoding are aggregated, and the similarity of each query-candidate pair is calculated from them using the weight parameters $\beta_s$, $\beta_T$, and $\beta_L$. Then, for each paragraph of the query document, we use a max pooling operation to get the strongest matching paragraph in the candidate document, resulting in a sequence vector $E_{qk} = [E'_{qk1}, E'_{qk2}, \dots, E'_{qkN}]$, where $E'_{qki}$ is the aggregated result for the $i$-th query paragraph. For the output of the aggregated encoding, we add an attention mechanism to further encode the location information. The attention weights are calculated from a vector $u_{qk}$, which is in turn computed with the parameters $W_u \in \mathbb{R}^{H_R \times H_R}$ and $b_u \in \mathbb{R}^{H_R}$. We then use an attentive aggregation operation to get the document-level representation $d_{qk}$. Finally, we apply a softmax function to $d_{qk}$ to predict the relationship between the two legal documents.
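The flow of this stage (weighted combination, max pooling over candidate paragraphs, attention over query paragraphs) can be sketched on scalar scores. This is a deliberate simplification and an assumption: the model operates on vectors with learned parameters $W_u$ and $b_u$, whereas the sketch below uses one scalar per paragraph pair and a plain softmax, and the weights 0.5, 0.25, 0.25 are illustrative values, not the paper's.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def combine(sem: Matrix, topic: Matrix, entity: Matrix,
            beta_s: float, beta_t: float, beta_l: float) -> Matrix:
    """Element-wise weighted sum of three N x M score matrices."""
    return [
        [beta_s * s + beta_t * t + beta_l * e
         for s, t, e in zip(row_s, row_t, row_e)]
        for row_s, row_t, row_e in zip(sem, topic, entity)
    ]


def max_pool_rows(matrix: Matrix) -> List[float]:
    """For each query paragraph, keep the strongest match among candidates."""
    return [max(row) for row in matrix]


def attention_pool(values: List[float]) -> Tuple[float, List[float]]:
    """Softmax weights over the pooled values, then their weighted sum."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values)), weights


# Toy scores for N = 2 query paragraphs and M = 2 candidate paragraphs.
sem = [[0.9, 0.2], [0.4, 0.6]]
topic = [[0.8, 0.1], [0.3, 0.7]]
entity = [[1.0, 0.0], [0.5, 0.5]]

E = combine(sem, topic, entity, beta_s=0.5, beta_t=0.25, beta_l=0.25)
pooled = max_pool_rows(E)
score, weights = attention_pool(pooled)
print(pooled, weights, score)
```

Max pooling keeps only the best-matching candidate paragraph per query paragraph, and the attention step then lets the better-matched query paragraphs contribute more to the document-level score.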

4.1. Datasets and Evaluation Metrics. In this study, we use two legal text datasets. The first consists of legal judgment documents crawled from "China Judgements Online", on which our topical encoding module is trained. The second is the open-source LeCaRD dataset provided by Tsinghua University, on which our BERT-LF model is evaluated.
The crawled data contain about 3.6 million legal judgment documents covering more than 100 kinds of charges; the distribution of charges with more than 3,500 documents is shown in Table 1.
LeCaRD is a legal case retrieval dataset for China's legal system. It consists of 107 query cases and 10,700 candidate cases, which are selected from a corpus of more than 43,000 Chinese criminal judgments. Its relevance judgment standard is based on a series of key factors and combines subjective and objective evaluation. To ensure the diversity of cases, the dataset adopts a sampling strategy that covers both common query cases and controversial query cases.

4.2. Baseline Methods and Experimental Settings. We compared our model with the following baseline models.
4.2.1. Traditional Bag-of-Words Retrieval Models. We chose the traditional bag-of-words retrieval models BM25, TF-IDF, and LMIR, following the previous work [15]. All parameters of these three models are set to the default values of an existing package [43].
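For readers unfamiliar with the bag-of-words baselines, the sketch below scores documents with the standard BM25 formula. It is an independent illustration rather than the package used in the experiments [43]: the function name is hypothetical, the values `k1=1.5` and `b=0.75` are common choices assumed here (the package defaults are not stated in the text), and the smoothed IDF variant is one of several in use.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], docs_tokens: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the query with BM25.

    Assumption: k1 and b are illustrative, not the settings used in the paper.
    """
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avgdl = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["theft", "of", "property"],
    ["intentional", "injury", "with", "knife"],
    ["injury", "compensation", "paid"],
]
scores = bm25_scores(["intentional", "injury"], docs)
print(scores)
```

The second document matches both query terms and ranks first, the third matches one term, and the first matches none, which shows why purely lexical scoring struggles when similar legal facts are described in different words.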

4.2.2. Neural Network Model. We compared our model with BERT-PLI [9] because it is a BERT-based model that addresses the long-text problem of legal cases; most importantly, our model is an improvement on BERT-PLI. For the BERT-PLI baseline, we set $N = 2$, $M = 8$, and $H_B = 768$.

Wireless Communications and Mobile Computing
For the RNN, $H_R$ is set to 256 and only one hidden layer is used. During training, we use the Adam optimizer with an initial learning rate of 2e-5.
In our BERT-LF model, the parameter settings are as follows. For the legal feature encoding module, we set the total number of paragraphs of the query document to $N = 2$ and the total number of paragraphs of the candidate document to $M = 8$; $H_B = 768$, which is determined by the size of the BERT hidden vector. For the GRU, $H_R$ is set to 256 and only one hidden layer is used. We hold out 10% of the queries in the training set, together with all of their candidates, as the validation set. We train the model on the remaining training data for 40 epochs and select the best model according to the precision measure on the validation set. During training, we use the Adam optimizer with an initial learning rate of 2e-5. For the LDA model in the topical encoding module, we set the number of topics to $v = 7$.
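The settings above can be collected in one place, together with the query-level validation split. This is a sketch under assumptions: the dictionary keys, the function `split_queries`, the random seed, and the 100 placeholder query IDs are all hypothetical; only the numeric hyperparameters and the 10% hold-out ratio come from the text.

```python
import random
from typing import Iterable, List, Tuple

# Hyperparameters as stated in the text; the key names are illustrative.
CONFIG = {
    "N": 2,                 # paragraphs per query document
    "M": 8,                 # paragraphs per candidate document
    "H_B": 768,             # BERT hidden size
    "H_R": 256,             # GRU hidden size, one hidden layer
    "learning_rate": 2e-5,  # initial learning rate for Adam
    "epochs": 40,
    "num_topics": 7,        # v in the LDA model
}


def split_queries(query_ids: Iterable[int], val_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Hold out a fraction of queries (with all their candidates) for validation.

    Splitting by query keeps every candidate of a held-out query out of training.
    """
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# Placeholder IDs: the actual number of training queries is not given here.
train_ids, val_ids = split_queries(range(100))
print(len(train_ids), len(val_ids))
```

Splitting at the query level rather than the pair level avoids leaking candidates of a validation query into the training data.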

4.3. Overall Results. Comparison results between models are shown in Table 2. Among the three traditional bag-of-words retrieval models, LMIR performs best on the precision metrics, including P@5, P@10, and MAP, which is consistent with the conclusion of literature [15]. However, traditional retrieval models have difficulty handling long text retrieval: their input length is limited, so they cannot represent documents well. BERT-PLI outperforms the traditional bag-of-words retrieval models on all ranking metrics; structurally, it is able to consider the entire case document and has better semantic understanding than traditional models. BERT-LF is the best on all six indicators: it not only considers the complete case documents and improves semantic understanding but also adds the topic model and entity model to logically judge and analyze the legal elements between paragraphs.
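The precision metrics named above, P@k and MAP, can be computed as in the following sketch. It assumes binary relevance labels, which is a simplification introduced here (the dataset's actual relevance scheme is not described in this section), and the document IDs are made up for illustration.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of average_precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
p5 = precision_at_k(ranked, relevant, k=5)
ap = average_precision(ranked, relevant)
print(p5, ap)
```

In this toy ranking, two of the top five documents are relevant, and the relevant document that was never retrieved (`d9`) lowers the average precision because the sum is divided by the total number of relevant documents.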

4.4. Model Ablation. To further analyze the effect of each module, we conducted ablation experiments, removing the gain embedding and gain mask from BERT-LF, both or one at a time, and observed the impact on performance compared with the full model. The experimental results are shown in Table 3. The variants are named as follows: SEM-TP uses only the topical encoding module; SEM-EE uses only the legal entity encoding module; SEM-EC uses only the semantic encoding module; SEM-TE combines the topical encoding and legal entity encoding modules; SEM-T combines the topical encoding and semantic encoding modules; and SEM-E combines the legal entity encoding and semantic encoding modules. First, ablating the main components causes performance to decline, verifying the effectiveness of these components for BERT-LF. SEM-EE and SEM-TE show a very large drop, indicating that legal element entities or paragraph topics alone cannot represent the entire text and that the semantic module is very important for text encoding in this experiment. In addition, the performance of SEM-EC (LSTM) and SEM-EC (GRU) also drops significantly, which shows that adding the topic model and the legal element entity model improves the accuracy of paragraph logic judgment. Second, SEM-T (LSTM) and SEM-T (GRU) use only semantic encoding and topic encoding, and their accuracy is reduced. Third, the performance of SEM-E (LSTM) and SEM-E (GRU) drops slightly, which indicates that the encoding

Conclusions
In this study, we proposed BERT-LF, a model for similar case retrieval based on legal facts; our model combines the topic distribution and legal entity facts to make the document representation vector more suitable for legal scenarios. The study adopts a paragraph-level cutting and aggregation architecture: it divides the long legal text into short paragraphs according to the logical order of the case and then represents the query-candidate paragraph pairs through the BERT-based text encoding method. On the one hand, this uses the powerful semantic encoding ability of BERT; on the other hand, it solves the problem of long text encoding for legal cases. To accurately extract the legal elements in legal cases, this study extracts several legal entities that have a decisive impact on the case judgment, including charges, crime, types of criminal entity, criminal consequences, compensation behavior, and reconciliation. Through a convolutional neural network and an attention mechanism, the model not only encodes the position information of legal semantics in paragraphs but also logically judges and strengthens the legal elements between paragraphs. The experimental results demonstrate that our approach is effective in legal case retrieval and that combining topic distribution and legal entity facts can further improve models for this task.

Data Availability
The dataset used to support the topic encoding module of this study is available from the corresponding author upon request.