Legal Judgment Prediction Based on Multiclass Information Fusion

Legal judgment prediction (LJP), an effective and critical application in legal assistant systems, aims to determine the judgment results from the information established by fact determination. In real-world scenarios, when dealing with criminal cases, judges rely not only on the fact description but also on external information, such as the basic information of the defendant and the court view. However, most existing works take the fact description as the sole input for LJP and ignore the external information. We propose a Transformer-Hierarchical-Attention-Multi-Extra (THME) Network to make full use of the information based on fact determination. We conduct experiments on a real-world large-scale dataset of criminal cases in the civil law system. Experimental results show that our method outperforms state-of-the-art LJP methods on all judgment prediction tasks.


Introduction
Legal judgment prediction (LJP) aims to predict the judgment results according to the information based on fact determination, which consists of the fact description, the basic information of the defendant, and the court view. LJP techniques can provide inexpensive and useful legal judgment results to people who are unfamiliar with legal terminology, and they are also helpful for legal consulting. Moreover, they can serve as a handy reference for professionals (e.g., lawyers and judges) and improve their work efficiency.
LJP is regarded as a classic text classification problem and has been researched for many years [1]. For example, Liu et al. proposed to extract shallow textual features (e.g., Chinese characters, words, and phrases) for charge prediction [2]. Katz et al. predicted the US Supreme Court's decisions based on efficient features from case profiles [3]. Luo et al. combined the fact description with the corresponding law articles to predict the charges [4]. Although great progress has been made in the LJP, there still exist some problems, such as multiple subtasks, topological dependencies between subtasks, and cases of similar descriptions with different penalties. Zhong et al. pointed out that law articles prediction was one of the fundamental subtasks in some countries (e.g., China, France, and Germany) with the civil law system, and these subtasks had a strict order in the real world [5]. Further, Yang et al. proposed a neural model for the interaction between subtask results [6].
Despite these efforts in designing efficient features and employing advanced Natural Language Processing (NLP) techniques, LJP still confronts two major challenges.

The Lack of External Information.
Some existing works propose various mechanisms to extract information from the fact description, such as the Word Collection Attention mechanism. Other works propose frameworks to build the dependencies between subtasks, such as DAG Dependencies of Subtasks and MPBF. However, the judgment document in Figure 1 contains many information items other than the fact description that can be utilized. This external information includes the basic information of the defendant and the court view. Therefore, how to utilize the external information effectively is a major challenge.

Encoding Long Documents Is Difficult.
The fact description in a judgment document is often a long document that suffers from the long-term dependency problem. Many existing models that perform well in text processing, such as the Recurrent Neural Network (RNN) [7] and the Convolutional Neural Network (CNN) [8], are unable to deal with the long-term dependency problem. In addition, some keywords in the judgment document are very important for LJP, but they are very difficult to find.
In order to resolve the above challenges, in this paper, we propose the Transformer-HAN-Multi-Extra (THME) Network. It contains a structured data encoder to extract the semantics of the external information as well as a Transformer-Hierarchical Attention Network (TH) encoder to encode the fact description. Specifically, as shown in Figure 1, from the basic information of the defendant, we can get the defendant's gender, age, education level, and the content related to the defendant's criminal records by using regular expressions. Similarly, we can get some objective attributes of a case, such as amount, plot, and consequences, from the court view. Based on the statistical analysis of large samples, we can find the relationship between the data and the terms of penalty, as shown in Table 1, where the symbol "+" represents "related." For example, given the same conditions, the terms of penalty for males are longer than those for females in certain cases. We use the symbol "↑" to denote positive correlation. For example, the more serious the case's plot is, the longer the defendant's terms of penalty will be. We use the symbol "↓" to denote negative correlation. For example, the better the defendant's guilty attitude is, the shorter the defendant's terms of penalty will be. It is worth noting that the case's conclusion in the judgment document is significant for the terms of penalty, but it cannot be used directly as an input to predict them: doing so would be self-deception, like the cat that shuts its eyes while stealing. Therefore, we first use the external information to predict the case's conclusion and then use the predicted conclusion together with the external information to predict the terms of penalty. Meanwhile, according to the data attributes, we divide the data into continuous and discrete types. Then, we extract the required information via the continuous data encoder and the discrete data encoder.
In order to reduce the information loss in the process of converting sentences into fixed-length vectors, an attention mechanism is adopted. However, attention alone cannot solve the polysemy problem. We therefore choose the Transformer [9]. The Transformer has an attention structure; it has advantages over the RNN in solving the long-term dependency problem and handles polysemy better than plain attention. The Hierarchical Attention Network (HAN) can easily catch the keywords in a long document [10]. Thus, we combine the Transformer with the HAN to solve the long-term dependency problem. Experimental results show that Transformer-HAN performs better than Gate Recurrent Unit (GRU)-HAN.
The main contributions of this paper are summarized as follows:
(i) We propose a novel text processing structure, namely, Transformer-HAN, to improve the text encoding ability. This model can solve long-term dependency problems better than GRU-HAN. Apart from the necessary fully connected layers, the Transformer-HAN encoder relies only on the attention mechanism, and it runs much faster than encoder structures based on GRU and Long Short-Term Memory (LSTM).
(ii) We propose a structured data encoder. To introduce the external information as an auxiliary, we extract fact-related data from the defendant's basic information and the court view as supplementary information for the model. According to the different attributes of the data, we design both continuous and discrete data encoders. Experiments show that information based on fact determination can effectively improve the judgment prediction, especially the prediction of the terms of penalty.
(iii) Experimental results show that the THME Network can effectively improve the prediction accuracy on few-shot data. The macro-average indicators of the three tasks of law article prediction, charge prediction, and terms of penalty prediction are improved relative to other models, which indicates that the prediction accuracy on few-shot data has been greatly improved.
The rest of this paper is organized as follows. Section 2 briefly reviews the related work. In Section 3, we propose the overall THME framework and detailed methods. The experimental results and analyses are presented in Section 4. Finally, Section 5 contains the concluding remarks.
[Figure 1: Example judgment document. The surviving excerpt is from the court view: the court finds that the defendant secretly stole other people's property worth RMB 2275 for the purpose of illegal possession, which is a large amount; the defender argued that the defendant showed a good attitude of pleading guilty.]

Legal Judgment Prediction.
With the development of the Chinese legal digitalization process, LJP, one of the most critical tasks in LegalAI, has become more and more important. Thanks to the development of machine learning and text mining techniques, more researchers formalize this task under text classification frameworks. Most of these studies attempt to extract textual features [11][12][13] or introduce some external knowledge [4,14]. However, these methods can only utilize shallow features and manually designed factors, and their performance usually degrades when they are applied to other scenarios. Therefore, researchers take advantage of other technologies to improve the interpretability and generalization of the model. For example, Jiang et al. utilized deep reinforcement learning to derive short snippets of documents from the fact descriptions to predict charges [15], and Chen et al. proposed a Legal Graph Network (LGN) to achieve high-precision classification of crimes [16]. Because some types of cases are rare in real life, the few-shot problem is inevitable. While traditional machine learning methods can hardly solve this problem, neural networks have been found to give good results. For example, Chen et al. proposed a neural network model that embeds law articles and fact descriptions into the same embedding space in the same way [17]. Yang et al. proposed a repeated interactional mechanism to simulate the process of a judge's decision [18].

Multitask Learning.
Multitask models have many beneficial effects for deep learning tasks. Sulea et al. proposed multiple tasks, which include law articles predictions, charge predictions, and terms of penalty predictions, to test the application of machine learning in the judicial field [19].  [20], which can improve the work efficiency. The emergence of multitask learning has promoted the development of LJP; however, due to the lack of external information, the prediction of terms of penalty has remained unsatisfactory. In this work, we propose a framework to utilize the external information effectively. Different from most existing works, we extract the information from both the fact description and the external information and merge them together into a topological classifier to predict the three subtasks of LJP.

Method
In this section, we will describe the THME Network. We first give the essential definitions of the LJP task and the composition of THME Network in Sections 3.1 and 3.2, respectively. We describe a text encoder for fact descriptions in Section 3.3. We introduce the structured data encoder in Section 3.4. Finally, the classifier is proposed in Section 3.5.

Problem Formulation.
In most Chinese text processing tasks, character-granularity processing is superior to word-granularity processing [21], so for each judgment document, we treat each Chinese character as a token. The fact description is a token sequence $T = (t_1, t_2, t_3, \dots, t_N)$, where $N$ is the number of tokens. This reduces the complexity of the model and makes it easier to fit. Besides the input $T$, the basic information of the defendant and the court view are also deemed as external inputs of the structured data encoder. Given these inputs, we will predict the judgment results of applicable law articles, charges, and terms of penalty, which is a multitask classification problem.

Overview.
Our THME consists of three parts, i.e., the text encoder, the structured data encoder, and the classifier. The text encoder is composed of a text embedding layer, a text convolution layer, a main encoder layer, and an information extraction layer. Because structured data have different attributes, we divide them into discrete data and continuous data, for which we propose a discrete data encoder and a continuous data encoder, respectively. The classifier is implemented with a topological structure, which utilizes the topological dependencies between subtasks in LJP. The general framework of the THME is shown in Figure 2.
We employ a text encoder to extract the information from the fact description; the embedded fact description is fed into a CNN, so that advanced features are gradually extracted from the shallow textual features. $c_{ij}$ represents the $j$-th Chinese character in the $i$-th sentence.
The main encoder layer is the Transformer-HAN, which includes two layers: the first layer aggregates token-level features into sentence-level features, and the second layer aggregates sentence-level features into text-level features. Finally, through the information extraction layer, we generate four hidden-layer states: $T_1$, $T_2$, $T_3$, corresponding to the three subtasks of LJP, and $T_4$, corresponding to the case's conclusion, which is critical in predicting the terms of penalty. Next, we employ regular expressions to extract the discrete data and the continuous data from the external information. Then, we standardize the continuous data, embed the discrete data, and input them into the continuous data encoder and the discrete data encoder, respectively. The outputs of these two encoders are combined to generate the structured data vector $T_m'$. $T_m'$ and the hidden-layer state $T_4$ are concatenated and fed into a fully connected network to predict the case's conclusion $T_{dc}$. The case's conclusion vector $T_{dc}$ and the structured data vector $T_m'$ make up the output of the structured data encoder $T_m$. Finally, $T_m$ and the hidden-layer states of all subtasks in LJP, $T_1$, $T_2$, $T_3$, are concatenated and fed into the classifier with topological structure to predict the law articles, charges, and terms of penalty.

Text Encoder for Fact Description.
We employ a text encoder to generate the vector of fact description as the input of the classifier. We will briefly introduce this encoder which is composed of lookup layer, convolution layer, Transformer-HAN layer, and information extraction layer.

Lookup and Convolution.
Taking a token sequence T as input, the encoder computes a simple text representation through two layers, i.e., lookup layer and convolution layer.
(1) Lookup. We first convert each token $t_i$ in $T$ into a natural number $d_i \in \mathbb{N}$ by a preprocessed dictionary mapping. The token sequence $T$ is thus converted into an integer sequence $D = (d_1, d_2, d_3, \dots, d_N)$. Next, we initialize a word embedding sequence $E = (e_0, e_1, e_2, \dots, e_s)$, $e_i \in \mathbb{R}^k$, where $s$ is the size of the dictionary. Each $d_i$ is mapped to $x_i$ via the word embedding sequence $E$. Thus, we obtain the text embedding sequence $X = (x_1, x_2, x_3, \dots, x_N)$, $x_i \in \mathbb{R}^k$, where $k$ is the length of the word embedding.
(2) Convolution. For $X$, we perform a convolution operation with the convolution matrix $W \in \mathbb{R}^{m \times (l \times k)}$, given by
$$c_i = W \cdot x_{i:i+l-1} + b_c,$$
where $x_{i:i+l-1}$ is the concatenation of the word embeddings in the $i$-th window, $b_c \in \mathbb{R}^m$ is the bias vector, $m$ is the number of filters, and $l$ is the size of a sliding window. We apply the convolution over each window $i$ and finally obtain $C = (c_1, c_2, c_3, \dots, c_N)$. The Chinese character vector after convolution has n-gram features; that is to say, it carries context features and is no longer isolated.
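The lookup and convolution steps above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the tiny vocabulary, the random initialization, the use of index 0 for unknown tokens, and the zero-padding of the last windows (so that $N$ outputs are produced) are all assumptions made for the example.

```python
import random

def lookup(tokens, vocab, embeddings):
    """Map each token to its dictionary index d_i, then to its embedding x_i."""
    ids = [vocab.get(tok, 0) for tok in tokens]  # index 0 = unknown token (assumption)
    return [embeddings[i] for i in ids]

def convolve(X, W, b, l):
    """Compute c_i = W . x_{i:i+l-1} + b for every window i.

    The tail is zero-padded (assumption) so that len(output) == len(X).
    """
    k = len(X[0])
    padded = X + [[0.0] * k for _ in range(l - 1)]
    out = []
    for i in range(len(X)):
        # Concatenate the l embeddings of window i into one vector of length l*k.
        window = [v for row in padded[i:i + l] for v in row]
        out.append([sum(w * x for w, x in zip(w_row, window)) + b_j
                    for w_row, b_j in zip(W, b)])
    return out

random.seed(0)
k, m, l = 4, 3, 2  # embedding size, number of filters, window size (toy values)
vocab = {"<unk>": 0, "被": 1, "告": 2, "人": 3}  # hypothetical dictionary
E = [[random.uniform(-1, 1) for _ in range(k)] for _ in vocab]
W = [[random.uniform(-1, 1) for _ in range(l * k)] for _ in range(m)]
b = [0.0] * m

X = lookup(["被", "告", "人"], vocab, E)
C = convolve(X, W, b, l)  # one m-dimensional vector per character
```

Each output vector mixes a character with its right-hand neighbours, which is the n-gram context the text refers to.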

Transformer-HAN Encoder and Information Extraction.
(1) Transformer-HAN encoder. The Transformer is currently the most widely used information extractor, mainly due to its attention mechanism, which achieves truly bidirectional encoding. However, a multilayer Transformer encoder has a very large number of parameters. In order to take full advantage of the Transformer while constraining the number of parameters, we design the Transformer-HAN as our main encoder.
The Transformer-HAN encoder is divided into two layers. The first layer uses a Transformer for Chinese character-granularity coding, then uses the attention mechanism to extract the most important information in each word embedding, and combines them into sentence vectors. The second layer uses a Transformer for sentence-granularity coding, then uses the attention mechanism to extract the most important information in the sentence vectors, and combines them into a chapter-granularity vector. Therefore, the fact description is divided into $m$ sentences $C = (c_1, c_2, \dots, c_m)$, and the $i$-th sentence consists of $n$ Chinese characters, $c_i = (c_{i1}, c_{i2}, \dots, c_{in})$.
Since the Transformer encoder is less sensitive to the position of Chinese characters, we need to add the position embedding to the word embedding before input. For each Chinese character in the $j$-th sentence $c_j$, we calculate its position vector following [9] as
$$P(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad P(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$
where $pos$ is the position of this Chinese character in the sentence, $i$ indexes the values in its word embedding, and $d_{model}$ is the dimension of its word embedding. The position vectors of all Chinese characters in the sentence $c_j$ form the sequence $P_j$. Then, we merge the position sequence $P_j$ with $c_j$ to obtain the sentence sequence with position information $C_{pj}$, given by
$$C_{pj} = P_j \oplus c_j,$$
where $\oplus$ is an element-wise addition operation.
The Transformer encoder is composed of Multihead Attention (MHA), the Add & Norm Layer, and Feed Forward (FF). Multihead Attention is built from Self-Attention, for which the inputs $Q$, $K$, and $V$ are the same. Multihead Attention converts $Q$, $K$, and $V$ into $Q'$, $K'$, and $V'$ through linear transformation by using a parameter matrix. Next, we apply the Self-Attention mechanism to extract the semantic information. This process is repeated $h$ times. The results are concatenated together, and then a linear transformation is performed. The calculation process is given as follows:
$$\mathrm{Attention}(Q', K', V') = \mathrm{softmax}\left(\frac{Q' K'^{\top}}{\sqrt{d_k}}\right) V',$$
$$head_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$
$$\mathrm{MHA}(Q, K, V) = \mathrm{concat}(head_1, \dots, head_h)\, W^0,$$
where $\mathrm{concat}()$ is the vector concatenation operation, $d_k$ is the size of a head, and $W^0, W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{k \times (k/h)}$ are the parameter matrices.
The Add & Norm Layer contains the Add layer and the Norm layer. First, we merge the input of Multihead Attention $C_{pj}$ with the output of MHA and obtain the fact semantic vector $M_j$ as
$$M_j = C_{pj} \oplus \mathrm{MHA}(C_{pj}, C_{pj}, C_{pj}).$$
There are two reasons for this. First, it makes up for lost information. Second, it is equivalent to introducing a highway in the network: during backpropagation, part of the gradient can be propagated directly to the original input without going through the complex network, which prevents gradient explosion or vanishing gradients. Then, we employ Layer Normalization [22] to normalize $M_j$ and obtain $M_j' = (m_{j1}', m_{j2}', m_{j3}', \dots, m_{jn}')$, $m_{ji}' \in \mathbb{R}^k$. We then obtain the sentence sequence $M_j = (m_{j1}, m_{j2}, m_{j3}, \dots, m_{jn})$ by passing $M_j'$ through the Feed Forward layer, where $W_1, W_2 \in \mathbb{R}^{k \times k}$ are its parameter matrices and $b_1, b_2$ are its bias vectors.
Then, we use the attention vector to extract the main information. In order to get the sentence vector $s_j$, we initialize an attention vector $u_w \in \mathbb{R}^n$ and obtain $s_j$ as the attention-weighted combination of the vectors in $M_j$. Similarly, we get the sentence sequence $S = (s_1, s_2, \dots, s_m)$.
The sentence encoder is basically the same as the Chinese character encoder. The difference is that the token vector is replaced with a sentence vector produced by the Chinese character encoder. Since we still use the Transformer to encode the sentence sequence, we first calculate the sentence position vector $P_s$ and merge it with the sentence sequence $S$ by
$$C_{ps} = P_s \oplus S.$$
As the input of the Transformer, $C_{ps}$ passes through the Transformer's MHA, Add & Norm Layer, and Feed Forward to obtain a new sentence sequence $S' = (s_1', s_2', \dots, s_m')$, which has higher-level characteristics and more comprehensive and useful information.
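As a rough illustration of the position embedding and the attention step described above, the following sketch computes sinusoidal position vectors, adds them element-wise to a sentence, and applies a single-head scaled dot-product self-attention. It is a simplification under stated assumptions: the learned projections $W_i^Q$, $W_i^K$, $W_i^V$, multiple heads, layer normalization, and the feed-forward layer are all omitted, and the toy sentence is made up for the example.

```python
import math

def positional_encoding(n, d_model):
    """Sinusoidal position vectors: sin on even indices, cos on odd indices."""
    pe = [[0.0] * d_model for _ in range(n)]
    for pos in range(n):
        for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(C, P):
    """Element-wise addition of a sentence and its position sequence."""
    return [[c + p for c, p in zip(c_row, p_row)] for c_row, p_row in zip(C, P)]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head scaled dot-product attention with Q = K = V = X.

    No learned projections are applied (simplifying assumption).
    """
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) for key in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)])
    return out

sentence = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]  # toy c_j
C_p = add_position(sentence, positional_encoding(len(sentence), 4))
attended = self_attention(C_p)
# Residual ("Add") step: combine the attention input with the attention output.
M = add_position(C_p, attended)
```

Because every output row is a convex combination of all input rows, each character can draw on any other character in the sentence in a single step, which is why this structure copes with long-range dependencies better than a recurrent encoder.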
(2) Information extraction. Finally, for our three subtasks of LJP and the case's conclusion, we need four different attention vectors to extract four different kinds of information from the same information sequence. We first initialize four attention vectors $u_{s1}, u_{s2}, u_{s3}, u_{s4} \in \mathbb{R}^m$. For each $j$, we weight the sentence sequence $S'$ with $u_{sj}$ and apply a fully connected layer to obtain the vector $T_j \in \mathbb{R}^n$, where $W_{T_j}$ is the fully connected matrix and $b_{T_j}$ is the bias vector.

Structured Data Encoder.
The deep learning model is like a judge. We train the model by continually feeding it data, just as a judge builds professional competence by being shown different cases. However, most previous works only let the model "see" the fact description. In practice, a judge would not sentence the defendant based on the fact description alone. In the process of judgment prediction, we sometimes need explicit data to convict and sentence the defendant. For example, information such as the defendant's guilty attitude, whether the defendant is a recidivist, and the amount of money involved directly affects the final judgment. Based on the above facts, we use regular expressions to extract discrete data and continuous data from the external information, as shown in Tables 2 and 3. In order to integrate these data well into THME, we design both the discrete data encoder and the continuous data encoder, as shown in Figure 3.
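The regular-expression extraction mentioned above might look like the following sketch. The patterns, field names, and sample sentence are hypothetical and written only for illustration; the fields actually extracted are those listed in Tables 2 and 3, and real judgment documents would need more robust patterns.

```python
import re

# Hypothetical patterns for a few fields (not the authors' actual rules).
PATTERNS = {
    "gender": re.compile(r"，(男|女)，"),
    "birth_year": re.compile(r"(\d{4})年\d{1,2}月\d{1,2}日出生"),
    "education": re.compile(r"(文盲|小学|初中|高中|大专|大学)文化"),
    "amount": re.compile(r"人民币(\d+(?:\.\d+)?)元"),
}

def extract(text):
    """Return one value per field, or None when the pattern does not match."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    # Continuous fields are converted to numbers; discrete fields stay as labels.
    if found["birth_year"] is not None:
        found["birth_year"] = int(found["birth_year"])
    if found["amount"] is not None:
        found["amount"] = float(found["amount"])
    return found

# Made-up sample in the style of a defendant description and court view.
SAMPLE = "被告人张某，男，1985年3月2日出生，初中文化。盗窃他人财物价值人民币2275元。"
record = extract(SAMPLE)
```

Here `gender` and `education` would go to the discrete data encoder, while `birth_year` and `amount` would be standardized and passed to the continuous data encoder.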

Continuous Data Encoder.
We normalize each category of continuous data as
$$c_i' = \frac{c_i - \mu_c}{\sigma_c},$$
where $\mu_c$ is the mean of the continuous data and $\sigma_c$ is the standard deviation. We can obtain the continuous data sequence $C' = (c_1', c_2', c_3', \dots, c_g')$, where $g$ is the number of types of continuous data. Then, we employ a fully connected network to fuse different types of continuous data and obtain the continuous data vector $T_c$ as
$$T_c = W_c C' + b_c,$$
where $W_c$ is the fully connected matrix, $b_c$ is the bias vector, and $T_c \in \mathbb{R}^p$.
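A minimal sketch of the continuous data encoder follows, assuming z-score standardization (division by the standard deviation) and a single linear layer without activation. The sample ages and amounts, the toy weights, and the guard for constant columns are assumptions for the example.

```python
import math

def standardize(column):
    """Z-score one category of continuous data over the whole sample."""
    mu = sum(column) / len(column)
    var = sum((v - mu) ** 2 for v in column) / len(column)
    sigma = math.sqrt(var) or 1.0  # avoid division by zero for constant columns
    return [(v - mu) / sigma for v in column]

def linear(W, x, b):
    """Fully connected layer: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

# Two hypothetical continuous categories for three cases: age and amount (RMB).
ages = standardize([23.0, 35.0, 47.0])
amounts = standardize([800.0, 2275.0, 15000.0])

# The continuous data sequence C' of the first case has g = 2 entries.
c_prime = [ages[0], amounts[0]]

# Toy fully connected layer mapping g = 2 inputs to p = 3 outputs.
W_c = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
b_c = [0.0, 0.0, 0.0]
T_c = linear(W_c, c_prime, b_c)
```

Standardizing first keeps quantities on very different scales, such as an age in years and an amount in RMB, from dominating the fused vector.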

Discrete Data Encoder.
Since there are few discrete data categories, we use the word embedding method to create a discrete data vector space for each category of discrete data. We convert each category of discrete data into its word embedding $d_i' \in \mathbb{R}^w$. Similarly, we obtain the discrete data vector $T_{di}$ as
$$T_{di} = W_{di} d_i' + b_{di},$$
where $W_{di}$ is the fully connected matrix, $b_{di}$ is the bias vector, and $T_{di} \in \mathbb{R}^p$. The discrete data sequence is then represented as
$$T_d = (T_{d1}, T_{d2}, \dots, T_{dq}),$$
where $q$ is the number of categories of discrete data.
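The discrete data encoder can be sketched in the same spirit: one small embedding table and one linear layer per category. The category names and values below are hypothetical, the weights are random rather than trained, and no activation is applied; this is a sketch of the idea, not the paper's implementation.

```python
import random

random.seed(0)
w, p = 4, 3  # embedding size and output size (toy values)

# Hypothetical discrete categories and their possible values.
categories = {"gender": ["male", "female"], "recidivist": ["no", "yes"]}

# One embedding table per category: value -> vector in R^w.
tables = {name: {v: [random.uniform(-1, 1) for _ in range(w)] for v in values}
          for name, values in categories.items()}

# One fully connected layer (W_di, b_di) per category, mapping R^w to R^p.
layers = {name: ([[random.uniform(-1, 1) for _ in range(w)] for _ in range(p)],
                 [0.0] * p)
          for name in categories}

def encode_discrete(record):
    """Return the discrete data sequence, one vector T_di per category."""
    sequence = []
    for name in categories:
        d = tables[name][record[name]]  # embedding lookup
        W, b = layers[name]
        sequence.append([sum(w_v * d_v for w_v, d_v in zip(row, d)) + b_j
                         for row, b_j in zip(W, b)])
    return sequence

T_d = encode_discrete({"gender": "male", "recidivist": "yes"})
```

Giving each category its own vector space means that, for instance, the value "yes" for recidivism is never confused with a value from another category.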

Case's Conclusion Prediction.
The specific content of the case's conclusion is presented in Table 4.
In order to predict the case's conclusion, we first combine the discrete data sequence $T_d$ and the continuous data vector $T_c$ into $T_m'$. The case's conclusion is very helpful for LJP, especially for the prediction of terms of penalty. For the prediction of the case's conclusion, the input $U$ is the concatenation of the case's conclusion corresponding vector $T_4$ and $T_m'$:
$$U = \mathrm{concat}(T_4, T_m').$$
Similarly, we obtain the vector of the case's conclusion $T_{dc}$ as
$$T_{dc} = W_{dc} U + b_{dc},$$
where $W_{dc}$ is the fully connected matrix, $b_{dc}$ is the bias vector, and $T_{dc} \in \mathbb{R}^{dc}$. Finally, we obtain the output of the structured data encoder as
$$T_m = \mathrm{concat}(T_{dc}, T_m').$$

Classifier.
When a judge decides a case, he/she often first searches for the legal basis related to this case according to, for example, the fact description. Then, according to the relevant laws, the conviction is made. Finally, integrating all the evidence and facts, the judge passes the sentence. Therefore, there are topological dependencies among multitask results [5]. We evaluate the performance on three LJP subtasks, including law articles (denoted as $t_1$), charges (denoted as $t_2$), and terms of penalty (denoted as $t_3$). Note that we implement the classifier with the dependency in Figure 2; i.e.,
$$H_1 = \varnothing, \qquad H_2 = \{t_1\}, \qquad H_3 = \{t_1, t_2\},$$
where $H_i$ represents the input of $t_i$ and $\varnothing$ is the empty set. This means that the charge prediction depends on law articles, and the terms of penalty prediction depends on both law articles and charges. Such explicit dependencies conform to the judicial logic of human judges, which will be verified in later sections. In order to combine the fact description and the structured data, we concatenate the structured data vector $T_m$ and the $i$-th subtask's corresponding vector $T_i$ to obtain the vector $T_i^m$ as
$$T_i^m = \mathrm{concat}(T_m, T_i).$$
Considering the topological dependencies between subtasks, we predict the law article first, then the charge, and finally the terms of penalty. We obtain the law article's vector $T_l$ from $T_1^m$. The processes of charge prediction and terms of penalty prediction are similar to the law article prediction.
Different from the law article prediction, the input of the charge prediction is the concatenation of $T_2^m$ and $T_l$, while the input of the terms of penalty prediction is the concatenation of $T_3^m$, $T_l$, and $T_{ch}$. Finally, we obtain $T_l \in \mathbb{R}^x$, $T_{ch} \in \mathbb{R}^y$, and $T_p \in \mathbb{R}^z$, where $x$, $y$, $z$ are the numbers of label categories for subtasks 1, 2, 3, respectively. In order to learn the parameters of the THME model, we use the Adam algorithm [23]. We adopt the cross-entropy loss in the training process as follows:
$$\mathcal{L}_l^{(i)} = -\sum_{j=1}^{x} y_j^{(i)} \log \hat{y}_j^{(i)}, \tag{20}$$
where $\hat{y}$ is the prediction result, $y$ is the real result, $l$ denotes the law articles prediction, and $i$ is the $i$-th sample. Equation (20) represents the loss function of one sample in the prediction of the law articles. When there are multiple samples, we add all the losses together to form the total loss of the law articles. We have three subtasks, so the sum of the losses of the three subtasks constitutes the final loss of the model. We train our model in an end-to-end fashion and utilize dropout [24] to prevent overfitting.
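The topological dependency and the summed loss can be illustrated with a small sketch. Everything numeric here is a toy assumption: the vector sizes, the random untrained weights, and the choice to pass raw output vectors (rather than probabilities) from one subtask to the next. Only the wiring, law articles first, then charges, then terms of penalty, follows the description above.

```python
import math
import random

def linear(W, x, b):
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]

def softmax(z):
    mx = max(z)
    exps = [math.exp(v - mx) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one sample against a one-hot label index."""
    return -math.log(softmax(logits)[label])

def make_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def classify(T1, T2, T3, Tm, layers):
    """Predict in topological order; list '+' is vector concatenation."""
    W, b = layers["law"]
    T_l = linear(W, Tm + T1, b)                # depends on nothing else
    W, b = layers["charge"]
    T_ch = linear(W, Tm + T2 + T_l, b)         # depends on law articles
    W, b = layers["penalty"]
    T_p = linear(W, Tm + T3 + T_l + T_ch, b)   # depends on law articles and charges
    return T_l, T_ch, T_p

rng = random.Random(0)
d, dm = 4, 3        # sizes of T_i and T_m (toy values)
x, y, z = 5, 4, 6   # numbers of law article, charge, and penalty classes (toy values)
layers = {
    "law": make_layer(x, dm + d, rng),
    "charge": make_layer(y, dm + d + x, rng),
    "penalty": make_layer(z, dm + d + x + y, rng),
}
T1, T2, T3 = ([rng.uniform(-1, 1) for _ in range(d)] for _ in range(3))
Tm = [rng.uniform(-1, 1) for _ in range(dm)]

T_l, T_ch, T_p = classify(T1, T2, T3, Tm, layers)
# Final loss of one sample: sum of the three subtask losses (labels are made up).
total_loss = cross_entropy(T_l, 2) + cross_entropy(T_ch, 0) + cross_entropy(T_p, 5)
```

Because the penalty layer sees the law article and charge outputs, an error in an earlier subtask can propagate downstream, which is the usual trade-off of such a topological design.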

Experiments
In this section, we verify the effectiveness of our proposed model. We first introduce the datasets and the data processing. Then, we provide the necessary parameters of our model. Finally, we conduct experiments to verify the advantage of our model and the importance of external information.

Dataset Construction.
Since there are no publicly available LJP datasets in previous works, we collect and construct an LJP dataset, CJO. CJO consists of criminal cases published by the Chinese government on China Judgment Online. The data used in this experiment all come from the judgment documents published by the Supreme People's Court of China. Before the formal data processing, we first clean the data. Our experiment targets criminal offenses, so all other types of judgment documents are screened out. Then, we filter out the multi-criminal judgment documents. The structure of the multi-criminal judgment documents is complicated, and we will study it in our future work. The terms of penalty for a single-criminal judgment document are up to 25 years, so we screen out the judgment documents with terms of penalty of more than 25 years (except death penalty and life imprisonment). In total, we screened 5480000 judgment documents and obtained 750000 usable data pieces, which we used for the experiments.
Our model's inputs include the token sequence $T$, the discrete data, and the continuous data. However, we find that our processing approach is not suitable for the terms of penalty of previous convictions. It cannot solve the problem of uneven distribution. Therefore, we discretize the terms of penalty. The specific method is shown in Table 5.
For the majority of the data in the CJO dataset, the terms of penalty are no longer than 12 months. Meanwhile, the amount of data decreases as the terms of penalty increase. Especially for terms of penalty longer than 3 years, the amount of data drops significantly. In order to solve the problem of uneven distribution, we use small intervals where data is dense and large intervals where data is sparse, so as to keep the amount of data in each interval stable.
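The idea of narrow intervals for short terms and wide intervals for long terms can be sketched as follows. The interval boundaries below are hypothetical placeholders chosen for illustration; the boundaries actually used are those given in Table 5.

```python
from bisect import bisect_left

# Hypothetical upper bounds in months: dense bins below one year, sparse bins above.
EDGES_MONTHS = [6, 9, 12, 24, 36, 60, 120, 300]

# Separate class for life imprisonment or death penalty (assumption).
LIFE_OR_DEATH = len(EDGES_MONTHS)

def discretize(months):
    """Return the index of the first interval whose upper bound is >= months."""
    if months > EDGES_MONTHS[-1]:
        raise ValueError("fixed terms above 25 years are screened out of the dataset")
    return bisect_left(EDGES_MONTHS, months)

labels = [discretize(t) for t in (3, 7, 12, 30, 300)]
```

With these placeholder edges, a 3-month gap changes the class for short sentences, while long sentences share a class across several years, which keeps the class sizes closer to each other.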

Baselines.
To evaluate the performance of our proposed THME framework, we employ the following text classification models and judgment prediction methods as baselines: The main idea is that repeated iterations between subtasks can reduce error accumulation, thereby improving the effectiveness of the tasks.

Experimental Settings.
We set the word embedding size $k$ as 256. For the discrete data encoder, the dimension of the discrete data embedding $w$ is 32. The dimension of the output vector of the discrete data encoder $q$ is 64, the dimension of the output vector of the continuous data encoder $p$ is 64, and the dimension of the case's conclusion vector $dc$ is 256.
We use the TensorFlow framework to build the neural networks. In the training part, we set the learning rate of the Adam optimizer as 0.0001 and the dropout probability as 0.5. The padding length of the text $N$ is 320 tokens, the length of each sentence $n$ is 16 tokens, and each text is divided into 20 sentences. We set the batch size as 256 for all models. We train each model for 256 epochs, and if overfitting occurs, we terminate the training early.
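To make the input shape concrete, the following sketch pads or truncates a token-id sequence to 320 tokens and cuts it into 20 sentences of 16 tokens. Fixed-size chunking and the padding id 0 are assumptions made here; the text does not say how sentence boundaries are chosen.

```python
def to_sentences(ids, n_sent=20, sent_len=16, pad_id=0):
    """Pad or truncate ids to n_sent * sent_len, then split into rows."""
    total = n_sent * sent_len
    ids = list(ids[:total]) + [pad_id] * max(0, total - len(ids))
    return [ids[i:i + sent_len] for i in range(0, total, sent_len)]

# A made-up document of 40 token ids becomes a 20 x 16 grid, mostly padding.
grid = to_sentences(list(range(1, 41)))
```

The character-level Transformer then runs over each row of 16 tokens, and the sentence-level Transformer runs over the 20 resulting sentence vectors.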

Results and Analysis.
All the models are run 3 times. We evaluate the performance on three LJP subtasks, including law articles, charges, and terms of penalty, and report the average values as the final results for clear illustration. Experimental results on the test set of CJO are shown in Table 6. THME achieves the best performance on all metrics, which supports the effectiveness and robustness of our proposed framework. Compared with TOPJUDGE and MPBFN-WCA, THME takes advantage of the information of the fact determination and

Ablation Study.
To further illustrate the significance of the modules in our framework, we designed the following models for comparison with THME. To examine the role of continuous data and discrete data based on the fact description in multitasking, we design THM and compare its effect with THME. (iv) GRU-HAN-Multiextra (GHME): in order to verify the role of the Transformer in the model, we design the GHME model and compare its effect with THME.
As shown in Table 7, compared with THS, THM improves the performance by 1.52%, 4.8%, and 4.53% for law article prediction, charge prediction, and terms of penalty prediction in our dataset, respectively. Thus, the multitask model is beneficial for improving the performance of each task. THSE performs better than THS, especially in terms of penalty prediction, where THSE enhances the performance by 2.51%. Thus, the structured data based on the fact description plays an important role: even the single-task model with structured data is significantly better than the multitask model without it. Hence, the structured data plays a more important role than the multitask structure.
Through comparing GHME and THS, we can see that THS performs better, which indicates that the Transformer performs better than the traditional GRU model in handling long documents and that the effect of Transformer-HAN on LJP is greater than that of the multitask topological structure and the external information. This also indicates that the proposed Transformer-HAN is a state-of-the-art model for dealing with long-term dependency problems.

Information Source Study.
To further show the significance of the external information and explore the impacts of the information source, we evaluate the performance of THME under various information sources. We remove all the external information (fact), court view (-court view), defendant's information (-defendant's information), and case's conclusion (-case's conclusion), respectively. Results are summarized in Table 8.
The results show that the performance of THME gets worse on all tasks after removing any source of information. More specifically, when we remove all the external information, a tremendous decrease is observed for terms of penalty prediction. This demonstrates that the external information is beneficial for terms of penalty prediction. When we remove the defendant's information, the performance is better than when removing the court view. This also demonstrates that the court view is more significant than the defendant's information and that it plays a decisive role in LJP. The case's conclusion comes from the court view. When we remove the case's conclusion, the performance of THME is worse than when removing the defendant's information and is similar to that when removing the court view. This demonstrates that the case's conclusion plays a very important role in LJP.
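The ablation settings above can be sketched as a simple masking step applied to each sample before it is fed to the model. This is only an illustrative sketch: the field names (`court_view`, `defendant_info`, `case_conclusion`), the `ablate` helper, and the choice to blank a source with an empty string are assumptions, not the authors' implementation. The three external sources are also treated as independent fields here, which is a simplification given that the case's conclusion comes from the court view.

```python
# Hypothetical field names for the three external information sources.
EXTERNAL_FIELDS = ("court_view", "defendant_info", "case_conclusion")

# Map each setting name from the study to the fields it removes.
SETTINGS = {
    "full": (),
    "fact": EXTERNAL_FIELDS,  # fact description only
    "-court view": ("court_view",),
    "-defendant's information": ("defendant_info",),
    "-case's conclusion": ("case_conclusion",),
}

def ablate(sample, setting):
    """Return a copy of `sample` with the sources removed by `setting` blanked out."""
    out = dict(sample)  # shallow copy so the original sample is untouched
    for field in SETTINGS[setting]:
        out[field] = ""
    return out

# Placeholder sample; the strings stand in for real document text.
sample = {
    "fact": "<fact description>",
    "court_view": "<court view>",
    "defendant_info": "<defendant information>",
    "case_conclusion": "<case conclusion>",
}

fact_only = ablate(sample, "fact")
no_court_view = ablate(sample, "-court view")
```

Running the same trained pipeline on each variant and comparing the metrics gives one row per setting, as in Table 8.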

Error Analysis and Solution.
Prediction errors made by our proposed model can be traced to the following causes.

Data Imbalance.
Data imbalance is a natural phenomenon, because the number of cases with long terms of penalty is significantly smaller than the number with short terms of penalty. Although we have discretized the terms of penalty to reduce the impact of data imbalance, the problem remains: for the subtasks of law articles and charges, our model achieves more than 90% accuracy but only about 75% macro-F1. This issue is much more severe on the subtask of the terms of penalty, for which our model yields a poor performance of only 40% macro-F1. The poor performance is mainly due to the imbalance of category labels; e.g., there are only a few training instances where the term is "life imprisonment or death penalty." Most judgment prediction approaches perform poorly (especially for recall) on these labels, as shown in Figure 4.
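The gap between accuracy and macro-F1 under label imbalance can be illustrated with a small sketch. The labels below are made-up toy data, not drawn from our dataset, and the helper names `accuracy` and `macro_f1` are introduced only for this illustration. A classifier that always predicts the frequent class scores high accuracy, while the rare class contributes an F1 of zero and pulls the macro average down.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for c in sorted(set(y_true)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy, made-up labels: 95 short-term cases and 5 rare long-term cases.
y_true = ["short"] * 95 + ["long"] * 5
# A degenerate classifier that always predicts the frequent class.
y_pred = ["short"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}, macro-F1={mf1:.4f}")
```

Because macro-F1 weights every class equally, rare labels such as the longest terms of penalty affect it as much as frequent ones, which is why it exposes the imbalance that accuracy hides.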

Terms of Penalty Problem.
It can be seen from the results that, although our model surpasses other models in terms of penalty prediction, its performance on this task is still very poor. The accuracy is only 56.89%, and the macro-averaged metrics are even below 50%. Such performance is far from meeting actual needs. Actual cases often involve multiple crimes and are much more complicated than the cases we analyze, but complex cases often contain more information, which also provides us with ideas for solving the problem of terms of penalty prediction. In a case involving multiple crimes, we can split the case into multiple subcases and then comprehensively consider the categories of subcases, the number of subcases, and the severity of subcases to provide more information for terms of penalty prediction. The specific implementation method remains to be explored.
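As a rough illustration of the subcase idea, the sketch below summarises a multi-crime case by the categories, number, and severity of its subcases. Everything in it is hypothetical: the `SubCase` type, the numeric `severity` score, the charge labels, and the chosen aggregates are assumptions made for the example, since the actual implementation is left open.

```python
from dataclasses import dataclass

@dataclass
class SubCase:
    charge: str      # hypothetical charge category of the subcase
    severity: float  # hypothetical severity score in [0, 1]

def aggregate_subcases(subcases):
    """Summarise a multi-crime case by subcase categories, count, and severity."""
    if not subcases:
        raise ValueError("a case must contain at least one subcase")
    severities = [s.severity for s in subcases]
    return {
        "charges": sorted({s.charge for s in subcases}),  # distinct categories
        "num_subcases": len(subcases),
        "max_severity": max(severities),
        "mean_severity": sum(severities) / len(severities),
    }

# Made-up example case with three subcases.
case = [SubCase("theft", 0.2), SubCase("robbery", 0.6), SubCase("theft", 0.4)]
features = aggregate_subcases(case)
```

Such a feature dictionary could then be encoded alongside the other structured data, although whether these particular aggregates help terms of penalty prediction would need to be tested.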

Conclusion
In this paper, we have studied multitask LJP with multiple sources of external information and topological dependencies between subtasks, and we have addressed the problems of insufficient information and insufficient encoding in LJP. Based on the topological structure between multiple tasks, we extract the information from the fact description via the Transformer-HAN encoder, extract the external information from the judgment document via the structured data encoder, and then integrate them into the classifier to reduce the misjudgment of penalty prediction. Experimental results show that our model achieves significant improvements over baselines for all judgment prediction tasks.
In the future, we will seek to explore the following directions: (1) It is interesting to explore multitask legal prediction with multiple labels and multiple defendants. In recent years, the rise of knowledge graphs and graph neural networks (GNN) has made this possible [25][26][27][28]. (2) We will explore how to incorporate various factors into LJP, such as the defendant's subjective viciousness, the defendant's criminal means, and the defendant's identity, which are not considered in this work. (3) When a judge decides a case, similar cases are crucial to the judgment result. Therefore, we can also recommend similar judgment documents to judges [29][30][31]. (4) With growing research on transfer learning, pretrained language models such as GPT and BERT have emerged and continue to improve the ability to extract information from text. The use of transfer learning in processing the fact descriptions may improve the effectiveness of models [32][33][34].

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Kongfan Zhu and Rundong Guo contributed equally to the paper.