An Approach Based on Multilevel Convolution for Sentence-Level Element Extraction of Legal Text

In the judicial field, as the volume of legal text data grows, the extraction of legal text elements plays an increasingly important role. In this paper, we propose a sentence-level model for legal text element extraction based on the structure of multilabel text classification. Our proposed model contains an encoder and an improved decoder. The encoder applies multilevel convolutional neural networks (CNN) and Long Short-Term Memory (LSTM) as feature extraction networks to extract local neighborhood and context information from legal text, and the decoder applies LSTM with multiattention and a fully connected layer with an improved initialization method to decode and generate label sequences. To the best of our knowledge, this is one of the first attempts to apply a multilabel classification algorithm to element extraction of legal text. To verify the effectiveness of our model, we conduct experiments not only on three real legal text datasets but also on a general multilabel text classification dataset. The experimental results demonstrate that our proposed model outperforms baseline models on the legal text datasets and is competitive with baseline models on the general multilabel text classification dataset, which indicates that our proposed model is useful for multilabel classification of both ordinary texts and legal texts, in which the number of characters per word is variable and the texts are short.


Introduction
With the development of the economy, civil legal disputes are becoming more frequent, so legal practitioners have to deal with more and more legal texts; however, the number of legal practitioners has not grown with the number of documents. To alleviate the mismatch between the large number of cases and the small number of legal practitioners in China's judicial field in recent years and to improve the work efficiency of legal practitioners, it is necessary to use automated extraction technology to extract sentence elements from legal texts, helping legal practitioners quickly understand the important information in the texts. The development of natural language processing (NLP) and the availability of legal texts provide a foundation for meeting this demand. At present, there is relatively little research on the element extraction of legal texts. In this paper, the extraction of legal text elements is defined as the assignment of labels with specific legal attributes to each sentence in the legal text according to the semantic information it represents. For example, in divorce cases, labels can be "婚后有子女 (children after marriage)," "有夫妻共同债务 (joint debt of husband and wife)," etc. In labor cases, labels can be "签订劳动合同 (signed labor contract)," "支付经济补偿金 (pay economic compensation)," etc. In loan cases, labels can be "免除保证人保证责任 (exempt the guarantor from guarantee responsibility)," "保证人不承担保证责任 (guarantor does not assume responsibility for guarantee)," etc. Statistical analysis of the labels of the legal text datasets shows that labels of legal texts cooccur; that is, a sample may belong to 0 to N categories at the same time. Therefore, the element extraction of legal text can be regarded as a multilabel classification (MLC) problem rather than a multiclass classification (MCC) problem.
The early approach to multilabel text classification transforms the task into a number of binary classification problems [1] and determines whether the sample belongs to each class by setting a threshold value. This approach ignores the correlation between labels, which limits its performance. After that, [2] proposes the Classifier Chain (CC) method, which applies several binary classifiers to construct a transfer chain for the multilabel text classification task and thereby models the correlation between labels. The performance of this approach is affected by the random arrangement of label categories in the chain and by the possibility that an earlier classifier in the chain propagates false predictions along the chain to the next classifier. Decision tree, SVM, and KNN methods are also applied to MLC tasks in [3], [4], and [5], respectively; these methods can only capture first- or second-order label correlations and are computationally intractable when high-order correlations are required. The multilabel classification of legal text needs to recognize all labels rather than the top-K labels, so [3][4][5] cannot be applied to legal text. With the development of deep neural network technology, many neural network-based methods have been proposed for MLC tasks. [6] proposes a fully connected neural network method with a pairwise ranking loss function, but this method ignores the label correlations. [7] is the first to recast the MLC task as sequence-to-sequence (Seq2Seq) label prediction for the given text, which models the correlation between labels. Since then, methods based on the Seq2Seq structure have been widely proposed [8][9][10][11]. Research shows that local information is effective for text classification [12][13][14][15][16], but these Seq2Seq-based methods, which rely on simple recurrent neural networks (RNN) or simple CNNs, have very limited ability to capture local information of text, which reduces the effectiveness of the model.
The difference between legal text and general text lies in the uncertainty of the number of characters in the words of the text: the number of characters in a word of legal text is not fixed. Therefore, a CNN based on a single convolution kernel can only extract the surrounding features of characters within a fixed window size, and its ability to extract features is limited. [17] introduces a model that applies multilayer dilated convolution to extract semantic-unit information. However, the method based on multilayer dilated convolution loses some important semantic information because the convolution is discontinuous, which impairs the acquisition of semantic information.
To address these problems, we propose a multilabel legal element extraction model based on the Seq2Seq structure, which is composed of an encoder and a decoder. The encoder adopts the multilevel convolutional neural network (MCNN), which applies convolution kernels with different window sizes to extract more features and thereby alleviates the problem that the number of characters in each word of legal text is not fixed, and applies LSTM to capture long-term context dependencies in the text. The decoder is composed of an LSTM, a multiattention module, and a fully connected layer. The LSTM in the decoder models the association between the current state and previous labels. At each time step t, the output of the decoder draws, through the attention mechanism, not only on the feature information encoded by the encoder LSTM and the label distribution information encoded by the decoder LSTM but also on the local semantic information encoded by the MCNN. Finally, the decoder applies the fully connected layer, whose parameters are initialized by an improved method based on the label cooccurrence counts in the training dataset, to generate the output.
To investigate the performance of our proposed model, experiments are conducted on a generic multilabel text classification dataset (RCV1-v2) and three legal text datasets. The experimental results demonstrate that our proposed model outperforms baseline models on the legal text datasets and is competitive with baseline models on the RCV1-v2 dataset in the evaluation metrics.
Our contributions in this paper are summarized as follows:
(1) To the best of our knowledge, this is the first study that applies a multilabel classification algorithm to element extraction of legal text
(2) We propose a Seq2Seq model containing a multilevel convolution network, which addresses the variable number of characters per word in legal text by applying convolution kernels with different window sizes to extract long-distance features, and an improved decoder structure based on multiattention and a fully connected layer whose parameters are initialized by an improved method
(3) We conduct experiments on three real-world datasets of Chinese legal text and a general multilabel text classification dataset. The results demonstrate that our model is competitive with baseline models on the Chinese legal text and the general multilabel text classification task
The rest of the paper is organized as follows. In Section 2, we briefly review the related work on multilabel text classification. In Section 3, we introduce the Bi-directional Long Short-Term Memory (BiLSTM) and CNN networks. In Section 4, we describe the architecture of our model. In Section 5, the experimental results and analyses are presented. In Section 6, the conclusion is presented.

Related Work
Existing multilabel text classification models can be divided into three categories: problem transformation methods, which transform the problem data so that existing algorithms can be used; algorithm adaptation methods, which extend a specific algorithm so that it can process multilabel data; and deep learning methods.
Some problem transformation methods consider the correlation between labels, while others do not. The simplest nonassociative algorithms do not model the correlation between labels. Instead, each label of a multilabel text is treated as an independent label, and a common classification algorithm is applied to each label. Binary Relevance (BR) [1] trains a separate classifier for each label, so the correlation between labels is ignored. To model label dependencies, Label Powerset (LP) [18] treats each label combination found in the dataset as a single class and trains a multiclass classifier on these classes. Classifier Chains (CC) [2] converts the MLC task into a chain of binary classification problems, taking higher-order label dependencies into account. Instead of transforming the problem into different subproblems, algorithm adaptation methods are applied to multilabel classification directly. ML-DT [3] applies a decision tree with a multilabel entropy criterion for multilabel classification. Rank-SVM [4] adapts the support vector machine (SVM) with a ranking-based learning scheme for multilabel classification. ML-KNN [5] applies k-nearest neighbors and the maximum a posteriori principle to determine the label set of each sample. [19] ranks the labels by using pairwise comparison. [20] applies CBM to simplify multilabel classification tasks by converting them into standard binary and multiclass problems.
Wireless Communications and Mobile Computing
In recent years, with the wide application of deep learning in natural language processing (NLP), multilabel text classification methods based on deep learning have been proposed continuously. [6] proposes the BP-MLL model, which applies a fully connected neural network and a pairwise ranking loss function to perform classification. [21] shows that better training can be obtained by replacing the ranking loss function with the cross entropy loss function. [12] proposes a model based on a neural network initialization method, in which some neurons are used as specialized neurons to model label correlations. [8] proposes a model that applies a convolutional neural network and a recurrent neural network together to capture local and global semantic information and to model correlations between labels. [7] proposes to generate labels sequentially. SGM [9] and SU4MLC [17] both use the Seq2Seq model structure: the former applies an improved decoder with global embeddings, and the latter adds semantic units obtained from dilated convolution with attention to enhance the representation of information. [22] proposes a multilabel reasoning model based on an iterative reasoning mechanism, which uses a binary classifier for each label and predicts all labels at the same time so that the result does not depend on label order.

Preliminaries
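3.1. BiLSTM. The LSTM network is a special form of RNN that handles long-term dependencies well by storing past information in its memory cell, and it alleviates the vanishing and exploding gradient problems of RNNs. The LSTM network consists of three gate structures (input gate, output gate, and forget gate) and a memory cell. Through the gate structures, LSTM can add information to and delete information from the memory cell. Each node of the LSTM network is calculated as follows:
\[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{1} \]
\[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{2} \]
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{3} \]
\[ \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \tag{4} \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{5} \]
\[ h_t = o_t \odot \tanh(c_t) \tag{6} \]
where $x_t$ represents the input embedding at time $t$; $W_i$, $W_f$, $W_c$, and $W_o$ are weight parameter matrices that the network needs to learn; $i_t$, $f_t$, $o_t$, and $c_t$ represent the input gate output, forget gate output, output gate output, and cell output, respectively; $\tilde{c}_t$ is the candidate cell state; $\sigma$ represents the sigmoid activation function; $\odot$ denotes elementwise multiplication; $h_t$ represents the hidden output at time $t$; and $b_i$, $b_c$, $b_f$, and $b_o$ represent the bias values of the respective gate structures.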
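For BiLSTM, the forward LSTM unit and the backward LSTM unit are each calculated by equations (1)-(6). The forward output $\overrightarrow{h_t}$ and the backward output $\overleftarrow{h_t}$ are then joined together to form the final output $h_t$ at time $t$. The calculation process is as follows:
\[ \overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}), \qquad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t+1}}), \qquad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \]
Finally, the overall output $h$ of the BiLSTM network is $h = (h_1, h_2, \cdots, h_n)$.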
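The gate computations and the bidirectional concatenation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`affine`, `lstm_step`, `run_lstm`, `run_bilstm`), the parameter layout (a dict mapping each gate to a weight matrix and bias acting on the concatenation of the previous hidden state and the input), and the zero initial states are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params maps 'i', 'f', 'o', 'c' to (W, b) pairs (assumed layout)."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]      # input gate
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]      # forget gate
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]      # output gate
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def run_lstm(xs, params, hidden):
    """Run the LSTM over a sequence, starting from zero states (an assumption)."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs

def run_bilstm(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward hidden states at each position."""
    forward = run_lstm(xs, fwd_params, hidden)
    backward = run_lstm(xs[::-1], bwd_params, hidden)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Each output vector of `run_bilstm` has twice the hidden size, because the forward and backward states are concatenated rather than summed.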

CNN.
The difference between CNN and ordinary neural networks is that CNN contains a feature extractor composed of a convolutional layer and a pooling layer. In the convolutional layer, each neuron connects only with some adjacent neurons. In text classification, the convolutional and pooling layers are used to capture local features of the text.
The core of CNN is the convolutional layer, which contains a set of convolution kernels. The convolution is computed between a convolution kernel and local windows of the input embeddings. The calculation process is as follows:
\[ c_i = \delta(W \cdot e_{i:i+l-1} + b) \]
where $W \in \mathbb{R}^{l \times j}$ represents the convolution kernel parameter, $l$ is the height of the convolution kernel $W$ (the window length), and $j$ is its width. $b$ represents the bias parameter, $\delta$ is a nonlinear function, $e_{i:i+l-1}$ represents the input embeddings from position $i$ to position $i + l - 1$, and $c_i$ is the output of the convolution. The convolution window moves with the specified step size to capture local neighborhood features.
The function of the pooling layer is to downsample and compress the convolution results to prevent overfitting. Pooling methods are divided into maximum pooling and average pooling. Maximum pooling takes the maximum of the feature points in the neighborhood, which retains texture information well. Average pooling takes the average of the feature points in the neighborhood, which preserves background features well.
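As a small worked example of the convolution and pooling operations just described, the sketch below slides one kernel over a toy sequence of embeddings and then pools the resulting feature map. The function names, the stride of 1, the absence of padding, and the choice of ReLU as the nonlinear function are assumptions made for illustration only.

```python
def conv1d(embeddings, kernel, bias=0.0, activation=lambda z: max(0.0, z)):
    """Slide an (l x d) kernel over d-dimensional embeddings (stride 1, no padding)."""
    l = len(kernel)
    features = []
    for i in range(len(embeddings) - l + 1):
        window = embeddings[i:i + l]
        # Elementwise product of the kernel and the window, summed to a scalar.
        z = sum(w * e
                for k_row, e_row in zip(kernel, window)
                for w, e in zip(k_row, e_row))
        features.append(activation(z + bias))
    return features

def max_pool(features):
    """Keep the strongest response in the feature map."""
    return max(features)

def avg_pool(features):
    """Average the responses in the feature map."""
    return sum(features) / len(features)

embeddings = [[1, 0], [0, 1], [1, 1], [2, 0]]  # four positions, d = 2
kernel = [[1, 0], [0, 1]]                      # window length l = 2
features = conv1d(embeddings, kernel)
print(features, max_pool(features), avg_pool(features))
```

A sequence of n positions and a window of length l yields n - l + 1 features; pooling then reduces these to a single value and discards the positions at which they occurred.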

Method
In this section, we introduce our proposed method in detail. We first give the overall structure of our proposed method in Section 4.1. Then, we introduce the structure of the encoder in Section 4.2. The decoder, with multiattention and a fully connected layer with special initialization, is introduced in Section 4.3.

4.1. Overall. First, we define some symbols to describe the multilabel classification (MLC) task clearly. Given an input sequence $X = (x_1, x_2, \cdots, x_n)$, the multilabel classification task aims to predict the label set $Y \subseteq L$ corresponding to $X$, where $L = (l_1, l_2, \cdots, l_m)$ is the label space, the number of labels in $Y$ ranges from 1 to $m$ across samples, $n$ is the length of sequence $X$, $x_i$ is the $i$-th word in the sequence, and $m$ is the total number of labels. The MLC task can be defined as finding the labels that maximize the conditional probability $p(Y \mid X)$.
The structure of our proposed model is shown in Figure 1 and includes the encoder and the decoder. In the encoder, a multilevel CNN, which consists of multiple CNNs stacked from lower to higher layers, is applied to capture the local representation of the text, and the CNNs in higher layers can capture longer-distance information. The LSTM is applied to extract the context representation of the text. To output the label result at each time step, the decoder should not only refer to the previous label state but also process, through mixed attention, the local representation output by the multilevel CNN and the context representation from the encoder LSTM to generate the current label state. Finally, a fully connected layer followed by a softmax layer converts the output of the decoder into the final probability distribution.

Encoder.
In this section, we introduce the encoder module in detail, which is applied to capture local semantic information by CNN and context representation by LSTM.
CNN has been widely applied in text classification tasks [17,23,24,25] because of its strong ability to extract local semantics. LSTM has also been widely applied in various sequence-to-sequence tasks recently [26] because its gate units capture context features between words well. In the encoder, we apply the multilevel CNN to capture the local semantic features and the LSTM to capture the context features among words in legal texts. We run the two networks in parallel to extract semantic and contextual information of the words.
In the word embedding layer, we first convert $x_i$ to an embedding vector $e_i$ through an embedding lookup table $E \in \mathbb{R}^{k \times |\mathcal{V}|}$, a randomly initialized embedding matrix that is continuously optimized during training, where $|\mathcal{V}|$ is the size of the vocabulary and $k$ is the dimension of the embedding vector.

Multilevel CNN.
In CNN, a convolution kernel and a maxpooling layer are usually applied to extract the most important local semantic features. However, because the positional information between words is ignored during maxpooling, part of the semantic information of the extracted features is lost [25]. Based on this analysis, we apply multilayer CNN networks without a pooling layer to generate local semantic representation units.
In the CNN feature extraction network, a convolution filter $W \in \mathbb{R}^{p \times d}$ with window size $p$ is applied at each layer to the words in the sentence $(x_1, x_2, \cdots, x_n)$, moving from left to right to extract local features of words, where $d$ is the dimension of the input embeddings and $p$ is the size of the convolution kernel:
\[ c_i = f(W \cdot x_{i:i+p-1} + b) \]
where $b \in \mathbb{R}$ indicates a bias term and $f$ refers to a nonlinear activation function. Finally, we obtain a feature map of the words in the sentence.
The MCNN applies multiple such filters with different convolution kernel window sizes and stacks the layers to obtain a wide range of local information.
The output of $C_{i-1}$ is the input to $C_i$, where $C_i$ is the $i$-th convolution network. Specifically, each convolution network applies a nonlinear activation function $f$ to the convolution of its kernel with a window of its input, where $g^l_{[i:i+p_l]}$ denotes the $i$-th to $(i + p_l)$-th columns in the $l$-th convolution network and $p_l$ represents the size of the $l$-th convolution kernel. We regard the word embedding layer as the first layer $g^1$. Similarly, the feature in the $l$-th layer $g^l$ represents the output of the $l$-th convolution network, which captures local information. The structure of the multilevel CNN is shown in Figure 2. The final result of MCNN combines the outputs of the convolution networks, where $L$ is the number of kernels.
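To make concrete why stacking layers lets higher layers see longer spans, the following sketch computes the receptive field after each layer of a stack of convolutions. It assumes stride 1 and no dilation, which the text does not state explicitly, and `receptive_fields` is a hypothetical helper name.

```python
def receptive_fields(kernel_sizes):
    """Number of input positions seen by one output position after each stacked layer.

    Assumes stride-1, undilated convolutions, so each layer with kernel
    size k widens the receptive field by k - 1 positions.
    """
    fields = []
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
        fields.append(rf)
    return fields

print(receptive_fields([2, 3, 5]))  # kernel sizes used on the legal text datasets
print(receptive_fields([1, 3, 5]))  # kernel sizes used on RCV1-v2
```

With kernel sizes 2, 3, and 5, the three layers cover 2, 4, and 8 consecutive positions, so the stack exposes short and long n-gram spans at the same time, whereas a single kernel of size 5 covers only 5 positions.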

LSTM.
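We apply the BiLSTM [27,28] to encode the input sequence:
\[ h_i, s_i = \mathrm{BiLSTM}(x_i, h_{i-1}, s_{i-1}) \]
where $x_i$ represents the input embedding of step $i$, $h_{i-1}$ represents the hidden state of step $i - 1$, and $s_{i-1}$ represents the cell state of step $i - 1$.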

Decoder.
In this subsection, we introduce the decoder module in detail. To improve the multilabel classification results, our model generates the label sequence in three steps during decoding. First, an LSTM is applied to generate the corresponding label state sequence from the relationship between the generated labels. Second, to attend to the local semantic information generated by the multilevel CNN and the context information generated by the LSTM in the encoder, we use multiattention to capture feature information inside the sentences and form the final textual representation. Finally, we apply a fully connected layer with an improved initialization method as the final output layer.

LSTM.
At time-step t, the hidden state s d t in LSTM of decoder is computed as follows: where W c and W s are weight parameters, h i is the hidden state in LSTM of encoder at step i, y t is the output of the method at timestep t, m is the number of words, and concat represents the concatenation operation of the vectors.

4.3.2. Multiattention. Our proposed model learns the semantic features of the text according to the multiattention method in [17]. The structure of multiattention is shown in Figure 3. The output $o_t$ of the decoder considers not only the context features from the LSTM in the encoder but also the local semantic features from the multilevel CNN. In our model, the decoder first applies the attention mechanism to the local information from the multilevel CNN and the decoded sequence information of the labels, calculates the semantic features that can represent the sentence, and generates a new representation. Next, the attention mechanism is applied to the newly generated representation, the decoded information of the labels, and the text context information captured by the LSTM in the encoder. Firstly, $s^d_t$ that the LSTM in decoder output and the semantic representations where $W_e$, $W_c$, $W_t$, $U_t$, and $U_t'$ are weight parameters, $O_t$ is an $m$-dimensional vector, and $m$ is the number of labels. Each dimension value represents the probability that the sentence belongs to the corresponding class. The higher the score, the more likely the sentence is to belong to the class.
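The two-stage attention described above (first over the multilevel CNN units, then over the encoder LSTM states) can be illustrated with a deliberately simplified sketch. It is not the paper's formulation: it uses plain dot-product scoring instead of the learned weight matrices, fuses the decoder state and the local context by addition, and all function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Dot-product attention: weights over the keys and their weighted sum."""
    weights = softmax([sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return context, weights

def two_stage_attention(decoder_state, cnn_units, lstm_states):
    """Stage 1 attends to local CNN units; stage 2 attends to encoder LSTM states."""
    local_ctx, _ = attend(decoder_state, cnn_units)
    # Hypothetical fusion of the decoder state and the local context (assumption).
    query = [s + c for s, c in zip(decoder_state, local_ctx)]
    global_ctx, _ = attend(query, lstm_states)
    return local_ctx, global_ctx
```

The point of the two stages is that the query used against the encoder LSTM states already carries local n-gram evidence, so the final context vector mixes local and sequential information.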

Fully Connection Layer with Frequency Initialization (FCFI).
We propose an initialization method based on normalization to better model the cooccurrence between labels. The label score output $o_t'$ through the FCFI layer is calculated as follows: where $O_t$ is the output of the LSTM in the decoder and $W_F \in \mathbb{R}^{L \times L}$ is a weight parameter that is initialized with a symmetric matrix, where $L$ is the number of labels. The element in the $i$-th column and $j$-th row of $W_F$ represents the cooccurrence between label $i$ and label $j$. The initialization value of matrix $W_F$ is calculated as follows: where $\alpha + \beta = 1$, $\mathrm{count}_{i,j}$ is the number of cooccurrences of labels $i$ and $j$ in the training dataset, and $A_i$ is the number of samples that contain label $i$. The initialization value on the diagonal is set to 1. The higher the value of $W_{F(i,j)}$, the more likely label $i$ and label $j$ are to appear together.

Softmax Layer. The softmax layer [9] is applied to predict the label $y_t$ at timestep $t$, where $\kappa_t \in \mathbb{R}^L$ is a mask vector that is applied to prevent the current label from duplicating the previous label and is calculated as follows [25]:
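The role of the mask vector can be illustrated with a short sketch of a masked softmax. The convention assumed here, an additive mask of negative infinity for labels that have already been emitted and zero otherwise, is a common choice in sequence-generation decoders and is not confirmed by the text; `masked_softmax` is a hypothetical name.

```python
import math

def masked_softmax(scores, previous_labels):
    """Softmax over label scores with already-predicted labels masked out.

    Assumes at least one label remains unmasked.
    """
    mask = [float("-inf") if j in previous_labels else 0.0
            for j in range(len(scores))]
    shifted = [s + k for s, k in zip(scores, mask)]
    m = max(shifted)
    exps = [math.exp(s - m) for s in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], previous_labels={2})
print(probs)
```

In the example, label 2 has the highest raw score but receives probability zero because it was already predicted, so the decoder is pushed toward label 1 instead of repeating itself.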

Experiments
In the following, we introduce the datasets, which comprise one general multilabel text classification dataset, used to verify the effectiveness of our model against models that work well on general datasets, and three legal text datasets of civil cases. We then describe the preprocessing of the legal datasets, the experimental parameter setup, and the baselines we compare with.

Datasets
(i) RCV1-v2 (http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.html). RCV1-v2 [29] is provided for research purposes and consists of more than 800,000 manually categorized newswire stories made available by Reuters Ltd. Each story can be assigned multiple topics, and there are 103 topics in total
(ii) Legal Text Datasets. The legal text dataset is provided by CAIL2019 (https://github.com/china-ai-
We preprocess the legal datasets by removing the sentences without any labels and sorting the labels by the number of times they appear in the dataset, from highest to lowest. Each subset of the legal text dataset has 20 label categories. Statistical analysis of the preprocessed data shows that labels are often correlated. In this paper, each legal text dataset is divided into a training set, a development set, and a test set, and the model performance is evaluated on the test sets. The three legal text datasets are evaluated separately. Statistics of the preprocessed datasets are shown in Table 1.
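The preprocessing described above, dropping unlabeled sentences and ordering labels from most to least frequent, can be sketched as follows. The data layout (a list of sentence and label-list pairs), the label identifiers in the example, and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def preprocess(samples):
    """Drop sentences without labels and order each label list by corpus frequency.

    samples: list of (sentence, labels) pairs. Ties between equally frequent
    labels are broken alphabetically (an assumption, not stated in the text).
    """
    labeled = [(sentence, labels) for sentence, labels in samples if labels]
    freq = Counter(label for _, labels in labeled for label in labels)
    ordered = [
        (sentence, sorted(labels, key=lambda label: (-freq[label], label)))
        for sentence, labels in labeled
    ]
    return ordered, freq

samples = [
    ("sentence a", ["DV2", "DV1"]),  # hypothetical label identifiers
    ("sentence b", []),
    ("sentence c", ["DV1"]),
]
ordered, freq = preprocess(samples)
print(ordered)
```

Placing frequent labels first gives the sequence decoder a consistent target order, which matters because a Seq2Seq model is trained to emit labels one at a time.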

Baselines.
To verify the effectiveness of our proposed model, the results are compared with the following models.
(i) Binary Relevance (BR) [1] transforms the MLC task into multiple single-label classification problems, ignoring the correlations among labels
(ii) Classifier Chain (CC) [2] converts the MLC task into a chain of binary classification problems in which the first classifier is trained only on the input data and each subsequent classifier is trained on the input space together with the outputs of all previous classifiers in the chain, thereby taking high-order label correlations into consideration
(iii) Label Powerset (LP) [18] transforms the multilabel classification problem into a multiclass problem in which the classifier is trained on all unique label combinations found in the training dataset
(iv) CNN [12] applies multiple kernels to extract features of the text, which are then fed into a fully connected layer with a sigmoid function to obtain the probability distribution of the labels
(v) CNN-RNN [8] applies CNN and RNN to capture local and global semantic information and models the correlation among labels
(vi) SGM [9] proposes a sequence generation model based on attention with a new decoder structure to solve the problem of multilabel classification and models the correlation among labels
(vii) SU4MLC [17] generates local text representations with multilevel dilated convolution and attention mechanisms to generate the maximum probability sequence of labels
(viii) ML-Reasoner [22] proposes a new iterative method that focuses on label information and applies a binary classifier to predict all labels at the same time
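The difference between the Binary Relevance and Label Powerset transformations listed above lies in how the training targets are built, which the following sketch shows. Only the target construction is covered, not the classifiers themselves, and the function names and toy labels are hypothetical.

```python
def binary_relevance_targets(label_sets, label_space):
    """BR: one independent 0/1 target vector per label."""
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in label_space
    }

def label_powerset_targets(label_sets):
    """LP: each distinct label combination becomes one class id."""
    classes = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        # Assign the next free class id the first time a combination is seen.
        targets.append(classes.setdefault(key, len(classes)))
    return targets, classes

label_sets = [{"a", "b"}, {"a"}, {"b", "a"}]
br = binary_relevance_targets(label_sets, ["a", "b"])
lp_targets, lp_classes = label_powerset_targets(label_sets)
print(br, lp_targets)
```

BR yields as many binary problems as there are labels and cannot express that two labels tend to appear together, while LP captures combinations exactly but can only predict combinations seen in training.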

Experiment Setup. We conduct our experiments with PyTorch on an Nvidia Titan-V GPU. The batch size is set to 16 on the RCV1-v2 dataset and 32 on the legal text datasets, and the word embedding size is set to 512 on all datasets with random initialization. The hidden sizes of the LSTMs in the encoder and decoder are 512, and both the encoder LSTM and the decoder LSTM have 2 layers. The kernel sizes of the convolutions in the encoder are [1,3,5] on the RCV1-v2 dataset and [2,3,5] on the legal text datasets. The initialization weights are (α = 0.95, β = 0.05) on the RCV1-v2 dataset and (α = 0.75, β = 0.25) on the legal text datasets. The number of convolution filters is 512. To avoid overfitting, we employ dropout [30]. The initial learning rate is 0.0003, and it is decreased as training proceeds over the epochs.
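For readers who want to reproduce the setup, the reported hyperparameters can be collected in a small configuration structure such as the one below. The key names are hypothetical; the values are taken from the setup described above, and settings the text does not report (such as the dropout rate and the learning rate schedule) are deliberately left out.

```python
# Settings shared by all datasets, as reported in the experiment setup.
COMMON = {
    "embedding_dim": 512,
    "lstm_hidden_size": 512,
    "encoder_lstm_layers": 2,
    "decoder_lstm_layers": 2,
    "conv_filters": 512,
    "initial_learning_rate": 3e-4,
}

# Dataset-specific settings; key names are illustrative only.
CONFIGS = {
    "rcv1_v2": {**COMMON, "batch_size": 16, "kernel_sizes": [1, 3, 5],
                "alpha": 0.95, "beta": 0.05},
    "legal": {**COMMON, "batch_size": 32, "kernel_sizes": [2, 3, 5],
              "alpha": 0.75, "beta": 0.25},
}

for name, cfg in CONFIGS.items():
    print(name, cfg["batch_size"], cfg["kernel_sizes"], cfg["alpha"], cfg["beta"])
```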

Evaluation Metrics.
Following the previous work [8,9], we use hamming loss [31] and micro-F1 score [32] as our main evaluation metrics. For reference, microprecision and microrecall are also reported.
(i) HammingLoss. HammingLoss evaluates the fraction of misclassified instance-label pairs, where a relevant label is missed or an irrelevant label is predicted. It is calculated as follows:
\[ \mathrm{HammingLoss} = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathrm{XOR}(Y_{i,j}, P_{i,j}) \]
where $N$ is the number of samples, $L$ is the number of labels, $Y_{i,j}$ is the true value of the $j$-th label of the $i$-th sample, $P_{i,j}$ is the predicted value of the $j$-th label of the $i$-th sample, and $\mathrm{XOR}(0, 1) = \mathrm{XOR}(1, 0) = 1$.
(ii) Micro-F1. Micro-F1 can be interpreted as the harmonic mean of the precision and recall. It is calculated globally by counting the total true positives $tp_j$, false negatives $fn_j$, and false positives $fp_j$ over all labels, as follows:
\[ \text{Micro-F1} = \frac{\sum_{j=1}^{L} 2\, tp_j}{\sum_{j=1}^{L} (2\, tp_j + fp_j + fn_j)} \]
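Both metrics are easy to check by hand on a tiny example, as in the sketch below. It assumes that the true labels and predictions are given as 0/1 matrices with one row per sample and one column per label; the function names are hypothetical.

```python
def hamming_loss(Y, P):
    """Fraction of instance-label pairs on which truth and prediction disagree."""
    n, num_labels = len(Y), len(Y[0])
    wrong = sum(y ^ p for y_row, p_row in zip(Y, P) for y, p in zip(y_row, p_row))
    return wrong / (n * num_labels)

def micro_prf(Y, P):
    """Micro-averaged precision, recall, and F1 from globally pooled counts."""
    pairs = [(y, p) for y_row, p_row in zip(Y, P) for y, p in zip(y_row, p_row)]
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, f1

Y = [[1, 0, 1], [0, 1, 0]]  # true labels: 2 samples, 3 labels
P = [[1, 1, 0], [0, 1, 0]]  # predictions
print(hamming_loss(Y, P), micro_prf(Y, P))
```

In this example two of the six instance-label pairs are wrong, so the hamming loss is 1/3; with two true positives, one false positive, and one false negative, micro-precision, micro-recall, and micro-F1 are all 2/3.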

Experiment Results
(i) RCV1-v2 Dataset. The experimental results on the RCV1-v2 dataset are shown in Table 2. Compared with the baselines, our model ranks third in hamming loss with 0.00817 and second in micro-F1 with 87.55% among the methods in Table 2. Compared with the best baseline based on traditional machine learning (LP), our approach achieves an improvement of 2.04% in micro-F1 score and a reduction of 5.0% in hamming loss on the RCV1-v2 test dataset. Compared with the best baseline model based on deep learning, our approach is competitive on the RCV1-v2 test dataset.
(ii) Divorce Dataset. The experimental results of our model compared with the baselines on the divorce dataset are shown in Table 3. Compared with the baselines, our model ranks first in hamming loss with 0.02126 and in micro-F1 with 87.90% among the methods in Table 3. Compared with the best baseline model, our approach achieves a reduction of 3.26% in hamming loss and an improvement of 0.65% in micro-F1 score on the divorce test dataset.
(iii) Loan Dataset. The experimental results on the loan dataset are shown in Table 4. Compared with the baselines, our model ranks first in micro-F1 with 86.03% and second in hamming loss with 0.02772 among the methods in Table 4. Compared with the best baseline model, our approach achieves an improvement of 1.06% in micro-F1 score on the loan test dataset.
(iv) Labor Dataset. The experimental results on the labor dataset are shown in Table 5. Compared with the baselines, our model ranks first in hamming loss with 0.01689 and in micro-F1 with 86.99% among the methods in Table 5. Compared with the best baseline model, our approach achieves a reduction of 1.56% in hamming loss and an improvement of 0.05% in micro-F1 score on the labor test dataset.
Our proposed model obtains more important local semantic information with the multilevel CNN than single convolution or dilated convolution can, and it applies the multiattention method to integrate the local feature information obtained by the multilevel CNN, the context information captured by the LSTM, and the label sequence information in the decoding process. As a result, our model outperforms the baseline models on the legal text datasets and is competitive on the general text dataset.

Ablation Test.
To evaluate the effects of the modules in our proposed model, we perform ablation tests. We analyze the effect of the number of CNN layers and of the initialization method of the fully connected layer on the three test sets.
5.7.1. The Impact of CNN Layer Number. To verify the influence of the number of convolution layers on the model, we conduct comparative tests on each of the legal text datasets. During encoding, we extract the local feature information around words by using either multilevel convolution without a maxpooling layer or a single convolution layer without a maxpooling layer as the local feature representation of words. In the single convolutional network, we use convolution kernels of sizes 2, 3, and 5, respectively, to capture the local semantic representation, while in the multilevel convolution network we use the convolution kernels together (kernel sizes = 2, 3, 5) to extract long-distance local semantic representations of words. The experimental comparison results are shown in Table 6.
The results in Table 6 show that, on the legal text datasets, our model is still better than the single CNN models, which indicates that multiple kernels with different sizes can capture abundant n-gram information with rich semantics and generate better local semantic features of words than a single kernel for legal texts, where the number of characters that make up a word is uncertain and the text length is relatively short.

The Impact of the Fully Connected Layer with Frequency Initialization (FCFI).
In the decoding process, we employ an initialized fully connected layer to capture the correlation between any two labels. To evaluate the effectiveness of our initialization approach, we construct two comparison models: one without the fully connected layer and one initialized using the method of [25]. The results of these two models and of our model are shown in Table 7, where the symbols IFC, w/o, and no-order denote the model initialized using [25], the model without the initialized fully connected layer, and our model, respectively. Table 7 shows that, compared with the model without the initialized layer, the model with our initialization method reduces the hamming loss by 9.53%, 8.27%, and 43.88% and improves the micro-F1 by 1.71%, 2.70%, and 3.41% on the divorce, loan, and labor datasets with label order, respectively. Compared with the model initialized using [25], the model with our initialization method reduces the hamming loss by 4.1%, 2.43%, and 13.4% and improves the micro-F1 by 0.81%, 1.44%, and 0.90% on the same three datasets, respectively. Our proposed model thus outperforms both the IFC and w/o models, indicating that introducing normalization into the initialization improves the performance of the model.
Table 7 also shows that both the method of [25] and our method outperform the model whose fully connected layer parameters are not initialized, indicating that initializing these parameters helps to improve the classification performance of the model. According to our analysis, our method outperforms the method of [25] because, when the method of [25] initializes the parameters, the calculation of F(i, j) considers only the i-th row and ignores the effect of the j-th row. Our method considers the correlation between the i-th row and the j-th row according to Formula (17).
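The difference between a row-only and a two-sided normalization of label co-occurrence can be sketched as follows. This is a hedged illustration, not the formula of [25] and not Formula (17), neither of which is reproduced in this section: the function names, the choice of dividing by the frequency of label i, and the square-root normalization are all assumptions made for the example.

```python
from math import sqrt


def cooccurrence_counts(label_sets, num_labels):
    """Count how often labels i and j occur in the same sample.

    The diagonal entry counts[i][i] is the frequency of label i.
    """
    counts = [[0] * num_labels for _ in range(num_labels)]
    for labels in label_sets:
        for i in labels:
            for j in labels:
                counts[i][j] += 1
    return counts


def row_only_init(counts):
    """Assumed row-only scheme: normalize by the frequency of label i only."""
    n = len(counts)
    return [[counts[i][j] / counts[i][i] if counts[i][i] else 0.0
             for j in range(n)] for i in range(n)]


def two_sided_init(counts):
    """Assumed two-sided scheme: normalize by the frequencies of both i and j."""
    n = len(counts)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = sqrt(counts[i][i] * counts[j][j])
            weights[i][j] = counts[i][j] / denom if denom else 0.0
    return weights


# Toy training labels: four samples over three labels (hypothetical data).
samples = [{0, 1}, {0, 1}, {0, 2}, {1}]
counts = cooccurrence_counts(samples, num_labels=3)
print(row_only_init(counts)[0][2], row_only_init(counts)[2][0])
print(two_sided_init(counts)[0][2], two_sided_init(counts)[2][0])
```

On this toy data the row-only weights for the pair (0, 2) differ by direction, because label 2 is rare and label 0 is frequent, whereas the two-sided weights are symmetric. Such a matrix could then be used as the initial weight of the label-correlation layer.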

The Impact of the Input Embedding of Each Level in MCNN.
This paper uses a multilayer convolution structure when applying CNN to extract local neighborhood information from text. In our proposed MCNN, each CNN layer takes the output of the layer below it as its input, and the input of the bottom layer is the embedding vector of the text. To verify the effectiveness of this approach, we constructed a multilayer CNN with multiple different convolution kernels in which the input of the CNN at every layer is the embedding vector of the text. We conducted experiments with both models on the three legal datasets, and the results are shown in Table 8. The symbol "TSI" denotes the model in which every CNN layer receives the same input, namely, the text embedding vector. Table 8 shows that, compared with the TSI model, the model with our method reduces the hamming loss by 2.97%, 1.91%, and 1.80% and improves the micro-F1 by 1.54%, 1.14%, and 1.07% on the divorce, loan, and labor datasets, respectively.
The results show that using the output of the lower-layer CNN as the input of the upper-layer CNN improves the performance of the model.
The results of the comparison on how the final output of the MCNN is formed are shown in Table 9. The symbol "w/o" denotes the model that uses only the top-level CNN output as the final output of the MCNN. Table 9 shows that, compared with the w/o model, the model with our method reduces the hamming loss by 5.8%, 4.7%, and 10% and improves the micro-F1 by 1.82%, 2.03%, and 2.98% on the divorce, loan, and labor datasets, respectively.
These results show that concatenating the outputs of the different CNN layers to form the final output of the MCNN improves the performance of the model.
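The two MCNN ablations above differ only in how the layers are wired and in how the final output is assembled, which the following minimal sketch makes concrete. It rests on stated assumptions: a zero-padded window mean stands in for a learned convolution, each layer produces a single feature per position, and the kernel sizes (2, 3, 4) and all function names are hypothetical.

```python
def window_mean(seq, k):
    """Stand-in for a learned 1-D convolution: zero-padded mean over k positions."""
    left = (k - 1) // 2
    padded = [0.0] * left + list(seq) + [0.0] * (k - 1 - left)
    return [sum(padded[t:t + k]) / k for t in range(len(seq))]


def stacked_layers(x, kernel_sizes):
    """Each layer reads the output of the layer below; the bottom layer reads x."""
    outputs, h = [], x
    for k in kernel_sizes:
        h = window_mean(h, k)
        outputs.append(h)
    return outputs


def same_input_layers(x, kernel_sizes):
    """TSI-style wiring: every layer reads the original input x."""
    return [window_mean(x, k) for k in kernel_sizes]


def concat_outputs(layer_outputs):
    """Final output as the per-position concatenation of all layer outputs."""
    return [list(features) for features in zip(*layer_outputs)]


def top_only(layer_outputs):
    """'w/o'-style final output: keep only the top layer."""
    return [[v] for v in layer_outputs[-1]]


def support(seq):
    """Number of positions influenced by a single non-zero input."""
    return sum(1 for v in seq if v > 0)


x = [0.0] * 5 + [1.0] + [0.0] * 5      # one active position in an 11-step input
kernel_sizes = (2, 3, 4)
stacked = stacked_layers(x, kernel_sizes)
tsi = same_input_layers(x, kernel_sizes)
print(support(stacked[-1]), support(tsi[-1]))
print(len(concat_outputs(stacked)[0]), len(top_only(stacked)[0]))
```

With stacked wiring, the top layer responds over 7 positions, against 4 positions when every layer reads the embedding directly, so higher layers see a wider context. Concatenation keeps one feature per layer at each position (3 here) instead of the single top-level feature.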

Conclusion
In this paper, we propose a model based on a multilevel convolutional network for sentence-level element extraction of legal text. Our proposed model combines the local semantic information obtained by the multilevel CNN with the context information obtained by LSTM and applies a multiattention network to generate a higher-level semantic representation of sentences. The initialization method we propose improves the classification performance of the model by normalizing the co-occurrence relationship between labels. Experimental results on a general text dataset and three legal domain datasets demonstrate that our model achieves the expected results on the evaluation metrics compared with the baseline models. The comparison with a single-layer CNN on the legal text datasets shows that our proposed multilevel CNN is better at extracting the semantic features of legal text, because convolution kernels with different window sizes alleviate the problem that the number of characters in each word of legal text is not fixed. In addition, the comparison with other initialization methods shows that our proposed initialization method contributes to improving the multilabel classification of legal text.

Data Availability
The processed legal text data used to support the findings of this study are currently under embargo while the research findings are being commercialized. Requests for data 6-12 months after the publication of this article will be considered by the corresponding author.