A Hybrid Neural Network BERT-Cap Based on Pre-Trained Language Model and Capsule Network for User Intent Classification

School of Computer Science, South China Normal University, Guangzhou 510000, China; Guangzhou Key Laboratory of Big Data and Intelligent Education, Guangzhou 510000, China; School of Science and Technology, The Open University of Hong Kong, Kowloon, Hong Kong SAR 999077, China; Institute for Advanced Study of Educational Development in Guangdong-Hong Kong-Macao Greater Bay Area, South China Normal University, Guangzhou 510000, China


Introduction
In question-answering systems and task-driven dialogue systems, the classification of user intent is an essential task for understanding the target of user questions or discourses. Spoken dialogue systems enable users to communicate in natural language and to obtain information more conveniently [1]. However, it is difficult for a computer to understand human natural language in dialogue. To solve this problem, spoken language understanding has become a popular research topic in recent years. Spoken language understanding usually involves two sub-tasks, namely, user intent classification and semantic slot filling [2]. In question-answering systems and task-driven dialogue systems, users express their purposes using short sentences. User intent classification identifies and analyzes users' intents from these short sentences and predicts the intent labels of dialogue sentences to understand what the users truly want [3]. For example, in spoken dialogue systems, the sentence "I need a forecast for Jetmore Massachusetts in 1 hour and 1 second from now" expresses the purpose of getting weather information, and the pre-defined intent label of this sentence is "acquireweather." Similarly, the question "How do I turn on line numbers by default in TextWrangler on the Mac?" corresponds to the pre-defined intent label "seek guidance" in question-answering systems.
In natural language processing, word encoding has evolved from one-hot encoding to word2vec. The emergence of word2vec has greatly promoted the development of user intent classification. Recently, it has become relatively common to analyze user intent with neural network methods [4][5][6]. The short user sentences given as input are mapped into a high-dimensional semantic space through word2vec, which converts words into computable and structured vectors. In the semantic space, words with similar meanings are close to each other under a suitable distance measure [7]. However, one problem of word2vec encoding is that it cannot handle polysemy. Bidirectional Encoder Representations from Transformers (BERT) [8] obtains a context-based language representation model by pre-training on a large corpus; thus, leveraging the pre-trained BERT may improve sentence representation compared with encoding sentences using word2vec.
With the development of deep learning algorithms in natural language processing, deep neural networks such as Convolutional Neural Networks (CNN) [9] and Recurrent Neural Networks (RNN) [10] are frequently applied to text classification tasks. After sentence encoding, a further network is usually applied to extract higher-level features. CNN treats sentences as spatial sequences and extracts deep features through convolution kernels of different sizes [4]. RNN treats sentences as time series and propagates sentence information through recurrent hidden states [6]. Capsule networks [11] are also used to extract key information for text classification. Encoded sentences are used as the low-level capsule input, and the high-level capsule output is obtained through dynamic routing [12]. Deep neural network methods are frequently used to extract features from short user sentences and classify users' hidden intentions. To promote the development of natural language understanding, previous works have constructed many publicly available datasets. However, certain existing datasets suffer from an uneven distribution of samples across categories. Focal loss is an improved loss function based on the softmax function that improves classification accuracy on datasets with uneven category distributions. It was initially used in image detection tasks and has a positive effect on the imbalance of category distribution [13].
To further study user intent classification, this paper proposes a model, BERT-Cap, combined with focal loss to address the uneven distribution of data. This model uses a stacked transformer encoder to encode sentences and utilizes the pre-trained BERT as the initial parameters of the encoder. The weight parameters are continuously adjusted during training to obtain a context-dependent sentence representation. The capsule network is used in this model to extract key information from the sentences. The sentence representation obtained by the encoder is converted into vectors as the input of the low-level capsule.
Through the iterative process of a dynamic routing algorithm, the key features of the sentences are transferred to the high-level capsule as the output. The focal loss focuses on the samples that are difficult to classify. Four publicly available datasets are used to evaluate the performance of the model. The results on these datasets show the effectiveness of our model in user intent classification. The main contributions include the following: (1) A new hybrid model BERT-Cap based on the pre-trained BERT language model is proposed. (2) The capsule network captures deep features of the sentence representation obtained by the encoder and iteratively transfers important information from the lower-level capsule to the higher-level capsule through the dynamic routing mechanism. (3) Extensive experiments on four publicly available datasets demonstrate the effectiveness of our model. The rest of the paper is organized as follows: Section 2 introduces related work on user intent classification. Section 3 introduces the design of the model in detail. Section 4 presents the experimental details and experimental results, and Section 5 draws conclusions of this work.

Related Work
User intent classification is mainly used in question-answering systems and dialogue systems to identify users' potential purposes. Most of the current research focuses on short-sentence text classification. Text classification, an important task of natural language processing, has been approached with a large number of methods. In the early days, traditional machine learning methods used manually extracted features for text classification [14,15]. However, short sentences carry few semantic features, and these features are difficult to extract manually [16]. Furthermore, manual feature extraction is very expensive and requires a lot of resources.
Deep neural networks [17][18][19] have shown the ability to automatically extract text features and are widely used in various text classification tasks. Deep neural network models include CNN [4], which extracts n-gram features in sentence sequences for text classification; RNN [6], which extracts sensitive patterns and rules in the sentence sequence, models non-Markovian dependence, and captures useful information of the sentence sequence for text classification; attention-based long short-term memory networks (LSTM) [20], which focus on the key words of the sentence sequence and reduce the effect of irrelevant words; and others.
Recently, the pre-trained language model has become a popular method in natural language processing: fine-tuning its parameters during the training of downstream tasks yields better results than training deep neural network models from scratch. Based on the pre-trained BERT, He et al. [21] proposed a method that combines BERT with CNN for intent determination. With the development of transfer learning, some work focuses on discovering new intents never seen before. Xia et al. [22] studied zero-shot intent classification with capsule neural networks and used category similarity to classify new intents. A model [23] was proposed to classify new unknown intents with the Local Outlier Factor algorithm. In addition, some research has focused on the classification of user intent with few-shot learning. Casanueva et al. [24] proposed an intent classifier for few-shot setups based on pre-trained dual sentence encoders. Lin et al. [25] tried to improve the performance of user intent classification through alternating supervised and unsupervised training based on few-shot learning.

Complexity
Two English datasets, ATIS [26] and SNIPS [27], are widely used in the user intent classification task; both contain pre-defined user intent categories and semantic slot values. Joint models have been proposed on these two English datasets to improve the performance of user intent classification and semantic slot filling. A joint model [28] was proposed with recursive neural networks. Liu and Lane [29] proposed a joint model with an attention-based recurrent neural network. A joint model based on BERT [30] further improved the performance of user intent classification. Compared with English, other languages rarely have datasets with semantic slot values and generally only contain intent category labels. Khalil et al. [31] explored intent classification based on the multilingual transfer ability between English and French. Xie et al. [32] used multiple semantic features to study Chinese user intent classification on the ECDT [33] dataset. An attention-based BiGRU-CNN [16] model was proposed for Chinese question classification on the Fudan University Chinese question dataset.
However, previous research on user intent classification is mostly based on distributed word embeddings that lack contextual information. A distributed word embedding maps each word to the same vector by looking it up in a pre-trained embedding table and therefore cannot handle polysemous words in different contexts. A pre-trained language model can be used as an encoder to obtain context-dependent sentence representations, which has promoted the development of natural language processing. To explore the effectiveness of the pre-trained model in the classification of user intent, we propose a hybrid model and evaluate it on both Chinese and English datasets, whereas previous research mostly focused on Chinese or English only. The model applies a stacked transformer encoder to obtain a context-dependent sentence encoding, and a publicly available pre-trained language model is used as the initial parameters of the encoder. Our model uses the dynamic routing mechanism of the capsule network to capture the deep features of the sentence. In practice, there are low-data, few-shot scenarios where only a handful of annotated examples of a certain intent are available. We design experiments to explore how the performance of our model changes when some categories have few samples in the datasets, simulating few-shot learning scenarios. For the datasets with an uneven distribution of categories, we focus on samples that are difficult to classify and improve the accuracy of user intent classification with focal loss.

The BERT-Cap Model
A BERT-Cap hybrid model with focal loss, based on pre-trained BERT and a capsule network, is newly proposed for user intent classification. The BERT-Cap model consists of four modules: input embedding, sequence encoding, feature extraction, and intent classification. The architecture of our model is shown in Figure 1. Given a sentence as input, the input embedding module represents the sentence as a sequence of embeddings that retain token information, position information, and segment information. The sequence encoding module loads the pre-trained language model obtained by transfer learning and uses the transformer encoder to perform sentence encoding. The sequence encoding module obtains the context-dependent sentence representation through the multi-head self-attention mechanism. In the feature extraction module, the capsule network extracts rich features of the sentence representations from the sequence encoding module, and the higher-level capsule outputs key information for the subsequent module. The intent classification module maps the higher-level semantic capsule to the label space by a fully connected operation and uses the focal loss based on a softmax function to improve the performance of the model.

Input Embedding.
The input embedding of our model consists of three parts: token embedding, positional embedding, and segment embedding. Our model splits the original sentence sequence into token sequences by WordPiece [34]. At the beginning of a token sequence, the special token [CLS] is used to store the semantic information of the entire input sequence. At the end of the sequence, the special token [SEP] is used to indicate the end of the sentence sequence. In the token sequence, the i-th token is denoted as $t_i \in \mathbb{R}^H$, where $H$ is the dimension of the hidden layer. To use the sequential information of the sequence, a positional embedding is added to encode position information. The positional embedding is denoted as $P_i \in \mathbb{R}^H$. The segment embedding $S_i \in \mathbb{R}^H$ is the same for every token in the sequence since the input of the model is a single sentence. Token embedding, positional embedding, and segment embedding have the same dimension in the high-dimensional space, and the input embedding $E_i \in \mathbb{R}^H$ is the sum of the three embeddings.
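As a minimal sketch of the summation described above, the following NumPy example builds the input embedding for one short sentence. The lookup tables are random, and the toy sizes and token ids are hypothetical values chosen only for illustration; in the actual model these tables would come from the pre-trained checkpoint.

```python
import numpy as np

def input_embedding(token_ids, token_table, position_table, segment_table, segment_id=0):
    """Sum token, positional, and segment embeddings for a single sentence."""
    token_ids = np.asarray(token_ids)
    m = token_ids.shape[0]
    tok = token_table[token_ids]       # (m, H): one row per token
    pos = position_table[:m]           # (m, H): one row per position
    seg = segment_table[segment_id]    # (H,): same vector for every token
    return tok + pos + seg             # broadcasting adds seg to every row

rng = np.random.default_rng(0)
H, vocab_size, max_len = 8, 30, 16     # toy sizes (assumption, not the paper's settings)
token_table = rng.normal(size=(vocab_size, H))
position_table = rng.normal(size=(max_len, H))
segment_table = rng.normal(size=(2, H))

# Hypothetical ids standing in for [CLS], three word pieces, and [SEP].
token_ids = [1, 7, 12, 5, 2]
E = input_embedding(token_ids, token_table, position_table, segment_table)
print(E.shape)
```

Because the input is a single sentence, the same segment row is added at every position, which is why a plain broadcast is enough here.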

Sequence Encoding.
The pre-trained language model transfers knowledge learned from large unlabeled corpora to downstream tasks through transfer learning and has accelerated the development of natural language processing. The BERT model proposed by Google obtains context-dependent sentence representations through two pre-training tasks, namely, masked language modeling and next sentence prediction. Google has released two public pre-trained models, namely, BERT-base and BERT-large, trained on abundant text corpora. To promote the development of Chinese natural language processing, the Joint Laboratory of HIT and iFLYTEK (HFL) trained Chinese language models with a whole word masking strategy on a massive Chinese corpus and released BERT-WWM-Chinese [35] and RoBERTa-WWM-Chinese, the latter based on RoBERTa [36], an improved model of BERT. Our model employs BERT's stacked transformer encoder structure to obtain context-dependent sequence encoding and uses the public pre-trained model as the initial parameters of the encoder. The sequence embedding $E = (E_1, E_2, \ldots, E_{m-1}, E_m)$, which retains token information, position information, and segment information, is the input of the lowest-level transformer encoder. In the encoder, the sequence embedding is first mapped to three matrices, the query matrix $Q \in \mathbb{R}^{m \times H}$, the key matrix $K \in \mathbb{R}^{m \times H}$, and the value matrix $V \in \mathbb{R}^{m \times H}$, through linear transformation. The linear transformation is as follows:

$$Q = E W^Q, \quad K = E W^K, \quad V = E W^V \tag{1}$$

$W^Q$, $W^K$, and $W^V \in \mathbb{R}^{H \times H}$ are three different parameter matrices. Then, the query matrix $Q$, key matrix $K$, and value matrix $V$ are the input of the scaled dot-product attention function that computes the self-attention value. When calculating self-attention, the multi-head attention mechanism is used: $Q$, $K$, and $V$ are linearly mapped into $h$ groups of $H/h$-dimensional vectors.
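The self-attention calculation is as follows:

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{H/h}}\right) V_i \tag{2}$$

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \quad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) \tag{3}$$

The multi-head mechanism executes the self-attention function on the $h$ groups of $H/h$-dimensional $Q_i$, $K_i$, and $V_i$ in parallel. Each group produces an output vector, and these output vectors are concatenated. A linear transformation, written here as $W^O \in \mathbb{R}^{H \times H}$, is then used to restore a vector of dimension $H$.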
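The multi-head self-attention computation described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the weights are random, the sizes are toy values, and the output projection name `W_o` is a label introduced here for the final linear transformation that restores dimension H.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(E, W_q, W_k, W_v, W_o, h):
    """Scaled dot-product self-attention with h heads over an (m, H) input."""
    m, H = E.shape
    assert H % h == 0, "H must be divisible by the number of heads"
    d = H // h                                   # per-head dimension H/h
    Q, K, V = E @ W_q, E @ W_k, E @ W_v          # each (m, H)

    def split(M):
        # (m, H) -> (h, m, d): one slice of width d per head
        return M.reshape(m, h, d).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d)   # (h, m, m)
    weights = softmax(scores, axis=-1)                 # each row sums to 1
    heads = weights @ Vh                               # (h, m, d)
    concat = heads.transpose(1, 0, 2).reshape(m, H)    # splice heads back together
    return concat @ W_o, weights

rng = np.random.default_rng(0)
m, H, h = 5, 8, 2                                # toy sizes (assumption)
E = rng.normal(size=(m, H))
W_q, W_k, W_v, W_o = (rng.normal(size=(H, H)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(E, W_q, W_k, W_v, W_o, h)
print(out.shape, weights.shape)
```

Each row of `weights` is a distribution over the m tokens, so every output position is a convex combination of the value vectors within its head.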
The result vectors of the multi-head self-attention operation are added to the self-attention input $X$ and layer-normalized; the result is the input of the feed-forward neural network, which contains two linear mapping functions and a nonlinear ReLU activation function. The layer normalization operation and the feed-forward operation are given in equations (4) and (5), respectively.

$$X' = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X)) \tag{4}$$

$$F = \mathrm{ReLU}(X' W_1 + b_1)\, W_2 + b_2 \tag{5}$$

Here $W_1$, $W_2$, $b_1$, and $b_2$ denote the weights and biases of the two linear mappings. Then, the output $F$ is added to the input of the feed-forward neural network and layer-normalized, and the result is the input of the next encoder. The number of transformer encoders in our model is $L$. The stacked transformer encoder structure captures more syntactic and semantic information of the sentence sequence during sequence encoding.
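The residual, layer normalization, and feed-forward steps described above can be sketched as follows. This is a simplified sketch: the layer normalization omits the learned gain and bias, the parameter names `W1`, `b1`, `W2`, and `b2` are labels assumed here for the two linear mappings, and the self-attention output is replaced by a random stand-in so that the block is self-contained.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_sublayers(X, attn_out, W1, b1, W2, b2):
    """Add-and-normalize around attention, then around the feed-forward network."""
    X1 = layer_norm(X + attn_out)                  # residual + layer normalization
    F = np.maximum(0.0, X1 @ W1 + b1) @ W2 + b2    # linear, ReLU, linear
    return layer_norm(X1 + F)                      # input of the next encoder

rng = np.random.default_rng(0)
m, H, H_ff = 5, 8, 32                              # toy sizes (assumption)
X = rng.normal(size=(m, H))
attn_out = rng.normal(size=(m, H))                 # stand-in for the self-attention output
W1, b1 = rng.normal(size=(H, H_ff)) * 0.1, np.zeros(H_ff)
W2, b2 = rng.normal(size=(H_ff, H)) * 0.1, np.zeros(H)
out = encoder_sublayers(X, attn_out, W1, b1, W2, b2)
print(out.shape)
```

Stacking L such layers, each fed with the previous layer's output, gives the multi-layer encoder structure described in the text.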

Feature Extraction.
The sequence encoder output $T = (\mathrm{CLS}, T_1, T_2, \ldots, T_{m-1}, T_m)$, which contains syntactic and semantic information of the sentence sequence, is used as the input of the feature extractor. The feature extractor consists of capsule networks with a dynamic routing mechanism. The main characteristic of the capsule structure is vector in and vector out, while an ordinary neuron is vector in and scalar out. The vector output from a capsule expresses richer features than the scalar output from a neuron. The input sequence encoding $T$ is first converted into the lower-level capsules $U = [u_1, u_2, \ldots, u_m]$ through linear transformation. Each lower-level capsule $u_i$ consists of $n$ vectors, and each vector has $k$ dimensions. The lower-level capsules $u_i$ are multiplied by the weights $c_i$ and summed to obtain a higher-level capsule $s$. The squash activation function compresses this higher-level capsule $s$ and determines what information is retained from each input vector of the lower-level capsules. The calculation of the squash activation function is as follows:

$$a = \mathrm{squash}(s) = \frac{\lVert s \rVert^{2}}{1 + \lVert s \rVert^{2}} \cdot \frac{s}{\lVert s \rVert} \tag{6}$$

The result of the squash activation function is multiplied with the lower-level capsule input to update the weights $c_i$ for the next routing iteration. The pseudocode of the dynamic routing procedure is given in Algorithm 1. The output of the dynamic routing mechanism is a higher-level capsule, which retains the important features of the sentence sequence during the iteration process and uses the weights to continuously adjust the acquired features. Finally, a vector output is used to represent the sentence sequence as the input of the intent classifier.
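To make the routing procedure more concrete, here is a minimal NumPy sketch of squash and dynamic routing with a single higher-level capsule. Several details are assumptions made for illustration: the routing logits start at zero, the weights are obtained by a softmax across the m lower-level capsules, the number of iterations defaults to 3, and the logit update overwrites rather than accumulates, following the update as printed in Algorithm 1 (the commonly used variant of dynamic routing accumulates the agreement instead).

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each k-dimensional vector to length below 1 while keeping its direction."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(U, r=3):
    """Route m lower-level capsules of shape (n, k) into one higher-level capsule."""
    m, n, k = U.shape
    b = np.zeros((m, n))                          # routing logits (assumed zero start)
    for _ in range(r):
        b_shift = b - b.max(axis=0, keepdims=True)
        c = np.exp(b_shift) / np.exp(b_shift).sum(axis=0, keepdims=True)  # softmax over capsules
        s = (c[:, :, None] * U).sum(axis=0)       # weighted sum of lower-level capsules, (n, k)
        a = squash(s)                             # higher-level capsule, (n, k)
        b = (U * a[None, :, :]).sum(axis=-1)      # agreement between a and each u_i, (m, n)
    return a, c

rng = np.random.default_rng(0)
m, n, k = 6, 4, 5                                 # toy sizes (assumption)
U = rng.normal(size=(m, n, k))
a, c = dynamic_routing(U, r=3)
print(a.shape, c.shape)
```

Lower-level capsules that agree with the current output receive larger weights in the next iteration, which is how the routing emphasizes the key parts of the sentence.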

Intent Classification.
The input of the intent classifier is denoted as $O \in \mathbb{R}^{n \times k}$, which contains the important features of the sentence sequence, and the intent representation $I \in \mathbb{R}^N$ is calculated by a dense operation, where $N$ is the number of pre-defined intent category labels. The calculation of the dense operation is as follows:

$$I = W \cdot O + b \tag{7}$$

$W$ is the weight matrix and $b$ is the bias. This dense operation maps the sentence sequence from the high-dimensional feature space to the low-dimensional category label space. Then, we use the softmax nonlinear activation function to convert the category label scores obtained by the dense operation into a probability distribution. The category corresponding to the maximum value in the probability distribution is selected as the predicted intent label. The calculation of intent label prediction is as follows:

$$\mathrm{label} = \arg\max(\mathrm{Softmax}(I)) \tag{8}$$
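The dense mapping followed by softmax and arg max can be sketched as below. The capsule output is flattened before the dense operation, which is an assumption about how the n×k input is fed to the fully connected layer; the weights and sizes are random toy values.

```python
import numpy as np

def classify_intent(O, W, b):
    """Map the capsule output to N label scores and return (label, probabilities)."""
    logits = W @ O.reshape(-1) + b            # dense operation on the flattened capsule
    logits = logits - logits.max()            # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
n, k, N = 4, 6, 7                             # toy sizes (assumption)
O = rng.normal(size=(n, k))
W = rng.normal(size=(N, n * k))
b = np.zeros(N)
label, probs = classify_intent(O, W, b)
print(label, probs.sum())
```

Since softmax is monotonic, the predicted label equals the arg max of the raw scores; the probabilities are only needed for the loss during training.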

Focal Loss.
Focal loss is a loss function designed to address category imbalance. It is an improvement on the standard softmax cross-entropy loss: it assigns smaller losses to easy-to-classify samples and pays more attention to difficult-to-classify samples by assigning them larger losses. The formula of focal loss is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{9}$$

$\alpha$ is the weight coefficient corresponding to each category and $\gamma$ is a hyperparameter. A category with more samples has a smaller weight coefficient $\alpha$. The probability value $p_t$ is the output of the softmax function for the true category. In our model, focal loss is used to replace the cross-entropy loss function. When a sample is easily classified, our model reduces its proportion in the overall loss. For samples that are difficult to classify, a larger loss value is calculated, so the model focuses on these samples in the subsequent training and gradient update process.

The ECDT dataset contains 31 intent categories and 3,736 samples from human-machine dialogue systems, while the FDQuestion dataset contains 9 intent categories and 15,408 samples from the music entertainment field of Baidu Q&A. Table 1 shows the statistics of these datasets.
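Returning to the focal loss described in this section, the sketch below computes it per sample from softmax probabilities. The default value of the focusing hyperparameter (2.0) and the uniform category weights in the usage lines are assumptions for illustration, not the settings reported in the paper.

```python
import numpy as np

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Per-sample focal loss given softmax probabilities and true labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true category
    p_t = np.clip(p_t, 1e-12, 1.0)                # avoid log(0)
    alpha_t = np.asarray(alpha, dtype=float)[labels]
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# Two samples of category 0: one easy (p_t = 0.9) and one hard (p_t = 0.1).
probs = [[0.9, 0.1], [0.1, 0.9]]
labels = [0, 0]
alpha = [1.0, 1.0]                                # uniform weights (assumption)
fl = focal_loss(probs, labels, alpha, gamma=2.0)
ce = focal_loss(probs, labels, alpha, gamma=0.0)  # gamma = 0 recovers cross-entropy
print(fl, ce)
```

With the focusing hyperparameter set to 0 and unit weights the expression reduces to ordinary cross-entropy, which makes it easy to check that the modulating factor is what down-weights the easy sample.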

Evaluation Metrics.
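We choose four metrics that are widely used in classification tasks, namely Precision (P), Recall (R), F1 score (F1), and Accuracy (Acc), to evaluate the classification performance of our model. The higher the scores, the better the classification performance.
The calculations are given in equations (10)-(13).

$$P = \frac{TP}{TP + FP} \tag{10}$$

$$R = \frac{TP}{TP + FN} \tag{11}$$

$$F1 = \frac{2 \times P \times R}{P + R} \tag{12}$$

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

For a given category, TP is the number of samples of that category that are predicted correctly, FP is the number of samples of other categories that are incorrectly predicted as that category, FN is the number of samples of that category that are incorrectly predicted as other categories, and TN is the number of samples of other categories that are correctly predicted as not belonging to that category.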
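As a small worked example of these metrics, the sketch below counts TP, FP, FN, and TN for one category in a one-vs-rest fashion and derives the four scores. The label lists are made-up toy data, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def per_category_metrics(y_true, y_pred, category):
    """One-vs-rest precision, recall, F1, and accuracy for a single category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == category and p == category for t, p in pairs)
    fp = sum(t != category and p == category for t, p in pairs)
    fn = sum(t == category and p != category for t, p in pairs)
    tn = sum(t != category and p != category for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"P": precision, "R": recall, "F1": f1, "Acc": accuracy}

# Toy labels for three categories (made-up data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
m0 = per_category_metrics(y_true, y_pred, 0)
m1 = per_category_metrics(y_true, y_pred, 1)
print(m0)
print(m1)
```

For category 1 the counts are TP = 2, FP = 1, FN = 0, and TN = 3, so precision is 2/3, recall is 1, and F1 is 0.8.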

Baseline Methods.
These baseline methods (http://ir.hit.edu.cn/SMP2017-ECDT-RANK), taken from the best models in the shared task of the SMP conference, are compared with our method to evaluate the performance of our model on the ECDT dataset. These baselines include the following methods:
(1) CNN + domain template: a two-level system with a domain template and a convolutional neural network to perform multi-domain classification.
(2) Lib-SVM + n-gram: this method designed a multi-feature user intent classification system based on the Lib-SVM classifier, with n-gram features used for feature selection.
[32]: the method used a traditional logistic regression with four feature expansions.

Parameter Settings.
In our experiments, we applied the pre-trained BERT-base-uncased as the initial parameters of the sequence encoder for the SNIPS dataset and the pre-trained BERT-base-cased as the initial parameters of the sequence encoder for the StackOverFlow dataset. We used the pre-trained RoBERTa-Chinese-WWM as the initial parameters of the sequence encoder for the two Chinese datasets, namely, the ECDT dataset and the FDQuestion dataset. The parameter settings of our model are shown in Table 2.

The Results.
To evaluate our model BERT-Cap for user intent classification, we first compared our model with base models in an ablation experiment. Table 3 shows the accuracy of the different methods for user intent classification on the English datasets, while Table 4 shows the comparison on the Chinese datasets. The results show that BERT-Cap, which uses BERT as the sequence encoder and capsule networks as the feature extractor, outperformed both BERT alone and BERT as the encoder with CNN as the feature extractor on all four datasets. This illustrates that the capsule network can extract deep features from the context-dependent sentence representation produced by the BERT sequence encoder for user intent classification. We then replaced the softmax cross-entropy loss with focal loss on the basis of BERT-Cap. The accuracy of the model with focal loss improved over the model with cross-entropy loss on 2 out of 4 datasets. Focal loss assigns different losses to different training samples, making the model focus on the difficult-to-classify samples in the subsequent training and gradient update process.
To study the impact of focal loss on the performance of user intent classification, we designed another experiment that applies it to several models commonly used in classification tasks, including CNN, RNN, RCNN, RNN + Attention, and Transformer, on the two English datasets. The results of our experiments are shown in Table 5.
The performance of the models with focal loss improved compared with the models without focal loss on SNIPS. However, the results on StackOverFlow show some reductions. These results suggest that focal loss performs better on unbalanced datasets overall.
There were differences in the distribution of categories across the datasets. To evaluate the classification accuracy of our proposed model on different categories, we calculated the recall of each category obtained by the optimized model on the four datasets. Figure 2 shows the recall of each category achieved by BERT-Cap with focal loss. In Figure 2(a), the recalls of the two categories "SearchScreeningEvent" and "SearchCreativeWork" were 0.935 and 0.897, respectively, which were lower than those of other categories on SNIPS. The reason might be that the two categories were relatively similar, which made them difficult for the model to distinguish. For example, the sentence "where can I find paranormal activity 3 playing near me 1 hour from now" was difficult to distinguish from the sentence "where can I see the movie across the line: the exodus of Charlie wright." In Figure 2(c), the recall of the category "是非类" (Judgment) was 0.676, lower than the others on FDQuestion. We found that a large proportion of the samples with the category label "是非类" (Judgment) were incorrectly classified as "评价类" (Evaluation). By looking at the original dataset, we found that the classification boundaries of the two categories were unclear and many samples were difficult to distinguish even for human annotators.
We designed the third experiment to analyze the relationship between the size of the training data and the performance of the model and to further analyze the stability of our model on the four datasets. We selected 0.5%, 1%, 5%, 10%, 25%, 50%, 75%, and 100% of the training data as the training subset, respectively. As shown in Figure 3, the accuracy of our model improved significantly with the increase of training data when the percentage was less than 25%. When the percentage exceeded 25%, the accuracy of the model changed relatively smoothly. Figure 3 illustrates that our model had a stable performance in the task of user intent classification.

ALGORITHM 1: The dynamic routing algorithm.
Input: all lower-level capsules $U = [u_1, u_2, \ldots, u_m]$
Output: the result of the squash activation function $a$
(1) procedure Routing($u_i$, $r$)
(2)   for all lower-level capsules $u_i$: $b_i \leftarrow 0$
(3)   for $r$ iterations do:
(4)     for all lower-level capsules $u_i$: $c_i \leftarrow \mathrm{softmax}(b_i)$
(5)     the higher-level capsule: $s = \sum_i c_i u_i$
(6)     the higher-level capsule: $a = \mathrm{squash}(s)$
(7)     for all lower-level capsules $u_i$: $b_i = a u_i$
(8)   return $a$

In real scenarios, there are low-data problems in which certain intent categories have only a few annotated examples available. We designed the fourth experiment to simulate low-data scenarios and explored the performance of our model in these scenarios. We selected 50% of the categories as the low-data categories; 0.5%, 1%, 5%, 10%, 25%, 50%, 75%, and 100% of the original training data of these categories were added to the training dataset, while the training data of the other categories remained unchanged. Figure 4 shows how the recall of each category changes on SNIPS and FDQuestion.
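The construction of the low-data training sets in the fourth experiment can be sketched as follows. The function name, the fixed random seed, the rule of keeping at least one sample per low-data category, and the toy sentences are all assumptions introduced for illustration; the text only specifies which fractions were used.

```python
import random
from collections import defaultdict

def subsample_low_data(samples, low_data_labels, fraction, seed=0):
    """Keep `fraction` of each low-data category and all samples of the other categories."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    kept = []
    for label, items in by_label.items():
        if label in low_data_labels:
            n_keep = max(1, round(len(items) * fraction))  # at least one sample (assumption)
            items = rng.sample(items, n_keep)
        kept.extend(items)
    return kept

# Toy training set with two categories of 100 made-up sentences each.
samples = [(f"creative work query {i}", "SearchCreativeWork") for i in range(100)]
samples += [(f"screening event query {i}", "SearchScreeningEvent") for i in range(100)]

subset = subsample_low_data(samples, {"SearchCreativeWork"}, fraction=0.05)
n_low = sum(label == "SearchCreativeWork" for _, label in subset)
n_other = sum(label == "SearchScreeningEvent" for _, label in subset)
print(n_low, n_other)
```

Running the same function for each fraction from 0.5% to 100% yields the series of training sets over which the per-category recall can be tracked.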

As shown in Figure 4, the recall of the selected categories improved markedly as the percentage increased on the two datasets. When the percentage was less than 5%, the recall of the selected categories improved dramatically. When the selected categories had little data, the model tended to assign their samples to the categories with abundant samples. Therefore, the problem of intent categories with only few-shot samples needs to be further studied in the future. We observed that the recall of the not-selected categories dropped as the samples of the selected categories increased on FDQuestion in Figure 4(b). The main reason was that the classification boundaries between many categories on FDQuestion were not obvious. With the addition of samples from the selected categories, it became difficult for the model to classify these samples accurately. Finally, compared with FDQuestion, the recall values of the high-data categories were relatively stable on SNIPS in Figure 4(a).

The F1 scores of the seven different methods are shown in Table 6. Our proposed model achieved an F1 score of 0.967, a 2.2% improvement compared with the baseline methods. Our model obtains a context-dependent sentence representation by using the pre-trained language model, and the capsule networks capture key features during the process of dynamic routing. The result demonstrates the effectiveness of our proposed model in improving the performance of user intent classification and in handling the uneven distribution of categories.

Conclusions
This paper proposed a hybrid model BERT-Cap that uses a pre-trained BERT to encode the sentence sequence, applies capsule networks with a dynamic routing mechanism to capture higher-level features, and combines a focal loss to improve the performance of user intent classification. Experimental results have demonstrated the performance improvement of our model compared with other baselines. In the future, we will try to introduce knowledge graphs to enhance sentence representation for improving the performance of user intent classification.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.