Emotional Dialogue Generation Based on Conditional Variational Autoencoder and Dual Emotion Framework



Introduction
With the development of privacy protection and incentive technology in the Internet of Things and mobile social networks driven by artificial intelligence, intelligent dialogue systems have entered our daily lives [1][2][3][4]. The enormous demands of privacy protection for dialogue systems have promoted the accuracy of speech recognition and semantic understanding, greatly improving the experience of human-machine dialogue. At the same time, people increasingly expect intelligent dialogue systems to produce more human-like dialogues. As an important part of human intelligence, emotional intelligence is defined as the ability to perceive, integrate, understand, and regulate emotions [5]. Thus, machines will be able to communicate at the human level only when they have the ability to perceive and express emotions.
Currently, deep neural networks have been successfully applied in various applications [6][7][8][9]. In dialogue generation tasks, the sequence-to-sequence (Seq2Seq) model [10] is commonly used. It conducts dialogue mainly on the basis of the language ability learned from large corpora, relying on its strong computational and abstraction abilities to automatically summarize and extract valuable knowledge and features from massive data. In an open-domain dialogue system, there are multiple reasonable replies to a given user query. This phenomenon is called "one-to-many" diversity. However, a dialogue system based on the Seq2Seq model and the maximum likelihood estimation (MLE) objective tends to respond with generic, high-probability utterances such as "I don't know" and "Yes." To generate more informative and meaningful responses, much work has been carried out on open-domain dialogue [11][12][13]. These methods focus on the consistency of the conversation content rather than on emotion. Building on this progress, Zhou et al. [14] first integrated emotional factors into large-scale dialogue generation using emotion tag embeddings, an internal memory network, and an external memory network. Subsequently, Asghar et al. [15] used emotion word embeddings and emotion-based objective functions to improve performance. Zhou et al. [16] proposed using an emoticon-rich Twitter corpus as a data set for emotional dialogue generation. However, the above work considers only the target emotion, and not the emotion of the input sentence, when asking the machine to generate an emotional response; this leads to emotional drift, that is, a response whose emotion is incoherent and inconsistent with the emotion of the input sentence.
The generation of emotional dialogue needs to consider two main factors: one is the content of the generated response, and the other is the emotion of the generated response. In addition to avoiding the generation of a large number of general replies and increasing the diversity of replies, it is necessary to consider the connection between the output emotion and the emotion of the user's input sentence, as well as the controllability of the output emotion. For example, if the user is sad, we can generate comforting words to make the user feel better.
The contributions of our work are summarized as follows: (1) We propose a dual-emotional framework for emotional dialogue generation, which comprehensively considers the impact of both the emotion of the input sentence and the target emotion on the emotional response, so that the response is consistent with the user's emotion and remains controllable. (2) We combine the conditional variational autoencoder [17] with the dual emotion framework to train an emotional generation system, and experiments show that our model performs strongly. (3) A multiclass emotion classifier based on the BERT [18] model is employed to obtain emotion labels, which improves the accuracy of emotion recognition and emotion expression.
The rest of the paper is organized as follows. In "Related Work," we outline the related work on emotional conversational agents. Then, we describe the proposed model in "Proposed Model." "Experiment" provides the experimental results. Finally, we summarize this article and propose directions for the future work in "Conclusion."

Related Work
With the popularity of social media, massive quantities of dialogue data can be accumulated and saved, allowing researchers to solve the problems of dialogue systems in a purely data-driven manner. Vinyals et al. [19] were the first to apply the Seq2Seq model from machine translation to dialogue generation, using an encoder to encode input sentences and generating a reply through a decoder. Bahdanau et al. [20] proposed an attention mechanism and applied it to the field of machine translation to improve translation accuracy. Shang et al. [21] first built a corpus based on Sina Weibo and used a Seq2Seq model with an attention mechanism to implement a single-round dialogue generation system.
Depending on the dialogue object and the dialogue scene, some work introduces latent variables, samples the distribution of latent variables, and then decodes the distribution to generate responses based on latent variables. Cao et al. [22] proposed a single-round dialogue generation model based on latent variables, including random variables z of the variational autoencoder in the decoder. Serban et al. [23] introduced the method of latent variables into a hierarchical dialogue model. The latent variables can be either topics or emotions. Zhao et al. [13] constructed a dialogue model based on a conditional variational autoencoder model using multiple semantic intentions as conditions.
Emotion perception is an indispensable part of a successful and intelligent dialogue system. Zhou et al. [14] proposed the emotional chat machine (ECM), which first focuses on how to generate a response with a specific emotion. ECM uses emotion embedding, internal memory network, and external memory network, but it considers neither the influence of the input sentence content on the decoder output nor the influence of the input sentence emotion on the emotional response. We believe that to learn higher-level dialogue skills and logic from a real corpus, a more elaborate mechanism is needed to capture the relationship between the utterance and emotional response. Therefore, we focus on the extraction and expression of the content and emotions of input sentences and produce more human-like emotional responses.
In terms of emotional dialogue research, [16] is most similar to our work, but they mainly focus on emotions in the Twitter corpus to train emotional chat robots, and their work did not further consider the emotional characteristics of the input sentences. Sun et al. [24] proposed a model that takes a sequence containing an emotion category of the input sentence and an emotion category of the output response as input. Xu et al. [25] proposed a dual-attention mechanism that pays attention to the content and emotion of input statements. Song et al. [26] proposed an emotion dialogue system that can express the desired emotion explicitly or implicitly. Li et al. [27] used generative adversarial networks to generate emotional responses. Su et al. [28] proposed a stylistic dialogue generation system, which is achieved by adopting an information-guided reinforcement learning strategy.

Proposed Model
3.1. Task Definition and Model Overview. Our task is defined as follows: given a post $x = (x_1, x_2, \cdots, x_m)$, an input emotion label $E_x$, and a target emotion label $E_y$, the goal is to generate a response $y = (y_1, y_2, \cdots, y_n)$. The input emotion label $E_x$ is obtained through the multiemotion classifier, $x_i$ is the $i$-th token of the input sentence, and $y_i$ is the $i$-th token of the output sentence. The response should not only be consistent with the post in both content and emotion but also correspond to the target emotion.

Wireless Communications and Mobile Computing
An overview of CVAE-DE is given in Figure 1. $E_y$ is the emotion label of the response, $E_x$ is the emotion label of the post, vector $v_y$ represents the text features of the response, vector $v_x$ represents the text features and emotion features of the post, vector $e_y$ represents the emotion features of the response, and vector $c$ is obtained by concatenating $v_x$ and $e_y$. In the training process, $E_x$ and $E_y$ are obtained from the BERT emotion classifier, the post $x$ and $E_x$ are encoded by the post encoder to obtain $v_x$, $E_y$ is mapped to the vector $e_y$ through a fully connected network, and $v_x$ is concatenated with $e_y$ to obtain the vector $c$. Then, $c$ and $v_y$ are fed to the prior/recognition network, and the hidden variable $z$, which is sampled from the recognition network, is fed to the decoder. In the inference process, the response does not exist, so $E_y$ is given directly by the user and $z$ is sampled from the prior probability distribution $p(z \mid c)$. We use an attention mechanism between the encoder and the decoder. Finally, based on the attention memory as well as $c$ and $z$, the decoder generates an emotional response that matches the post in content, is coherent with the post emotion, and corresponds to the target emotion.
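The difference between the training and inference paths described above can be illustrated with a small sketch. It is not the authors' implementation: the names `build_condition`, `sample_latent`, and `GaussianNet` are hypothetical, the networks are passed in as plain callables, and both the prior and the recognition distribution are assumed to be diagonal Gaussians, which the text does not state explicitly.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Vector = List[float]
# Hypothetical stand-in for a prior/recognition network: any callable that
# maps an input vector to the (mean, log-variance) of a diagonal Gaussian.
GaussianNet = Callable[[Vector], Tuple[Vector, Vector]]


def build_condition(v_x: Vector, e_y: Vector) -> Vector:
    """Concatenate target-emotion features with post features: c = [e_y; v_x]."""
    return e_y + v_x


def sample_gaussian(mean: Vector, log_var: Vector, rng: random.Random) -> Vector:
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def sample_latent(c: Vector,
                  prior_net: GaussianNet,
                  recognition_net: GaussianNet,
                  v_y: Optional[Vector] = None,
                  rng: Optional[random.Random] = None) -> Vector:
    """Training (v_y given): sample z from the recognition network on [c; v_y].
    Inference (v_y is None): sample z from the prior network on c alone."""
    if rng is None:
        rng = random.Random(0)
    if v_y is not None:
        mean, log_var = recognition_net(c + v_y)
    else:
        mean, log_var = prior_net(c)
    return sample_gaussian(mean, log_var, rng)
```

At inference time only `c` is needed, which is why the target emotion can be chosen freely by the user: it enters the model solely through `e_y` inside `c`.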

Multiemotion Classifier Based on the BERT Model.
Most existing models use word2vec or GloVe to obtain pretrained word vectors. However, the word vectors trained by these models are a type of static encoding: the same word has the same representation in every context, which does not solve the problem of polysemy, in which a word has different meanings in different contexts. In response to this problem, this paper trains a multiemotion classifier based on the BERT model [18]. BERT is a language representation model that can not only capture the rich grammatical and semantic features of the corpus text but also address the problem of traditional language feature representations ignoring word polysemy, ultimately improving the accuracy of emotion classification. The structure of the BERT model is shown in Figure 2.
The most important part of the BERT model is the bidirectional Transformer encoder [29] structure, which uses the encoder of the Transformer model as the feature extractor. The encoder is composed of a self-attention mechanism and a feed-forward neural network; it abandons the RNN's recurrent network structure [30] and uses a purely attention-based mechanism to model a segment of text. The attention mechanism in the encoder is called self-attention, and its core idea is to calculate the relationship between each word in a sentence and the other words in order to adjust the importance of each word and obtain a context-dependent word vector. The encoder structure is shown in Figure 3.
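The core idea of self-attention can be shown with a minimal sketch. As a simplifying assumption, the queries, keys, and values below are the input vectors themselves and there is a single head; the actual Transformer encoder applies learned projections and multiple heads. The function names are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    output = []
    for query in x:
        # Relationship of this word to every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in x]
        weights = softmax(scores)
        # Context-dependent vector: weighted sum of all word vectors.
        output.append([sum(w * value[j] for w, value in zip(weights, x))
                       for j in range(d)])
    return output
```

Each output row is a convex combination of the input rows, so a word's new vector depends on which other words appear in the same sentence.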
In the experiments in this article, the pretrained Chinese model "BERT-Base, Chinese" released by Google is used to train our classifier; it uses a 12-layer Transformer with a hidden size of 768, 12 attention heads, and about 110M parameters in total. First, we load the pretrained model, and then, we fine-tune it on the emotion classification data set. The resulting model is employed in the CVAE-DE model as our multiemotion classifier.

Sequence to Sequence Model Based on the Attention Mechanism.
The basis of our model is a Seq2Seq model based on the attention mechanism [21]. The encoder and decoder of the model are implemented by GRU [31]. The role of the encoder is to map the post $x = (x_1, x_2, \cdots, x_m)$ to the hidden feature state $h = (h_1, h_2, \cdots, h_m)$. For moment $t$, $h_t$ is defined as follows:

$$r_t = \delta(W_r [h_{t-1}; x_t])$$
$$z_t = \delta(W_z [h_{t-1}; x_t])$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}; x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where the initial hidden state $h_0$ is the zero vector, $r_t$ represents the reset gate, $z_t$ represents the update gate, $\tilde{h}_t$ is the candidate hidden state, $\delta$ is the sigmoid activation function, and $W_r$, $W_z$, and $W_h$ are the weight matrices to be learned. In this way, the GRU can not only selectively forget the previous hidden state but also selectively remember the candidate hidden state and retain the long short-term information that is strongly dependent on the current moment. The above equations can be written as

$$h_t = \mathrm{GRU}(h_{t-1}, x_t)$$

The current state of the decoder can be updated according to the state $s_{t-1}$ at the previous time, the output $y_{t-1}$ of the decoder at the previous time, and the context vector $\mathit{vc}_t$ at the current time:

$$s_t = \mathrm{GRU}(s_{t-1}, [y_{t-1}; \mathit{vc}_t])$$

The probability distribution of the words output by the decoder is

$$p(y_t \mid y_1, \cdots, y_{t-1}, x) = g(y_{t-1}, s_t, \mathit{vc}_t)$$

where $g$ is the maxout activation function and the context vector $\mathit{vc}_t$ is the result of using the attention mechanism to weight the encoder state sequence $h$. Typically, we use Bahdanau attention [20], which is defined as

$$\mathit{vc}_t = \sum_{j=1}^{m} \alpha_{tj} h_j, \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{m} \exp(e_{tk})}, \qquad e_{tj} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_j)$$

where $v_a$, $W_a$, and $U_a$ are the attention parameters that need to be learned. The attention mechanism is in fact a weighted sum of the hidden states of the encoder, which can dynamically capture the dependence of the decoder on the input utterance. The objective function of the Seq2Seq model based on the attention mechanism can be expressed as

$$\mathcal{L}_{\mathrm{Seq2Seq}} = \log p(y \mid x) = \sum_{t=1}^{n} \log p(y_t \mid y_1, \cdots, y_{t-1}, x)$$

The VAE introduces a recognition model $q_\phi(z \mid y)$ in the inference network to replace the intractable true posterior distribution $p_\theta(z \mid y)$. To make $q_\phi(z \mid y)$ approximately equal to $p_\theta(z \mid y)$, the VAE uses the KL divergence to measure the similarity between the two distributions and minimizes the KL divergence.
In this case, the objective function of the model can be expressed as

$$\mathcal{L}(\theta, \phi; y) = -\mathrm{KL}(q_\phi(z \mid y) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z \mid y)}[\log p_\theta(y \mid z)] \tag{12}$$

where $\phi$ is the parameter of the inference network, $\theta$ is the parameter of the generative network, $\mathrm{KL}(q_\phi(z \mid y) \,\|\, p_\theta(z))$ is the KL divergence between the posterior distribution $q_\phi(z \mid y)$ of the model encoder and the prior distribution $p_\theta(z)$ of $z$, and $\mathbb{E}_{q_\phi(z \mid y)}[\log p_\theta(y \mid z)]$ represents the reconstruction term of the data samples under the decoder $p_\theta(y \mid z)$. The learning goal of the decoder is to restore the real data as faithfully as possible, and the goal of the variational autoencoder is to maximize its objective function, which requires, in particular, minimizing the KL divergence in the first term on the right side of Equation (12), that is, making $q_\phi(z \mid y)$ of the hidden variable $z$ approximate $p_\theta(z)$.

The traditional VAE is an unsupervised model. Although it can generate output data similar to the input, it cannot be steered to generate specific types of data. For this purpose, Makhzani et al. [17] proposed a conditional variational autoencoder (CVAE) model. Based on the Seq2Seq model, we introduce the latent variable $z$ of the CVAE model. For a given input utterance, multiple appropriate responses may exist, and each response corresponds to a latent variable configuration that does not appear in the input utterance. CVAE is trained by maximizing the variational lower bound of the conditional likelihood of $y$ given $c$.
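The KL term in such lower bounds has a closed form when both distributions are diagonal Gaussians. The sketch below assumes that parameterization (means and log-variances), which is common for VAE-style models but is an assumption here rather than something the text specifies; the function names are illustrative.

```python
import math
from typing import Sequence


def gaussian_kl(mean_q: Sequence[float], log_var_q: Sequence[float],
                mean_p: Sequence[float], log_var_p: Sequence[float]) -> float:
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mean_q, log_var_q, mean_p, log_var_p):
        kl += 0.5 * (lp - lq
                     + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                     - 1.0)
    return kl


def lower_bound(recon_log_likelihood: float, kl: float) -> float:
    """Variational lower bound: reconstruction log-likelihood minus the KL term."""
    return recon_log_likelihood - kl
```

For example, `gaussian_kl([1.0], [0.0], [0.0], [0.0])` compares a unit-variance Gaussian centred at 1 with a standard normal, and the lower bound grows as the KL term shrinks or the reconstruction log-likelihood rises.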
In our model, the decoder is used to approximate $p_D(y \mid z, c)$, the prior network is used to approximate $p_P(z \mid c)$, and the recognition network is used to approximate the real posterior $p_R(z \mid y, c)$. $\theta_D$, $\theta_P$, and $\theta_R$ are the parameters of their networks. The objective function is given by

$$\mathcal{L}(\theta_D, \theta_P, \theta_R; y, c) = -\mathrm{KL}(p_R(z \mid y, c) \,\|\, p_P(z \mid c)) + \mathbb{E}_{p_R(z \mid y, c)}[\log p_D(y \mid z, c)]$$

In addition, as described by Bowman et al. [33], in text generation it is difficult to encode useful information in the hidden variables when an RNN decoder is combined directly with a variational autoencoder. Because the RNN-based decoder is a general function approximator with a strong ability to model sequence information, it can learn to decode without using the information in the hidden variables. The hidden variables then lose their function, and the VAE degenerates into a simple Seq2Seq model. Therefore, training a Seq2Seq dialogue generation model based on CVAE needs to balance the reconstruction loss and the KL loss. In our experiments, we use KL annealing, early stopping, and a bag-of-words loss for this purpose. The bag-of-words loss is added to the previous training objective, which is rewritten as

$$\mathcal{L}'(\theta_D, \theta_P, \theta_R; y, c) = \mathcal{L}(\theta_D, \theta_P, \theta_R; y, c) + \mathbb{E}_{p_R(z \mid y, c)}[\log p(y_{bow} \mid z, c)]$$

where $y_{bow}$ denotes the response $y$ treated as a bag of words.

3.5. Dual Emotion Framework. To make the emotional responses more coherent, we add the emotion label of the post to the input of the post encoder. The input becomes $[E_x; x]$, enabling our model to mine the emotional information of the post and make the emotional response compatible with the post emotion. The estimated probability of the model can be rewritten as

$$p(y \mid [E_x; x]) = \prod_{t=1}^{n} p(y_t \mid y_1, \cdots, y_{t-1}, [E_x; x])$$

To make our emotional responses more human-like, we concatenate the target emotion vector $e_y$ with the vector $c$ to control the emotion of the reply produced by the decoder. The vector $c$ becomes $[e_y; v_x]$. Thus, we can choose different emotions with which to reply to users, and even affect the user's emotion. For example, when the user is unhappy, we can try to cheer the user up by outputting a response with a happy emotion.
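Returning to the balance between the reconstruction loss and the KL loss discussed earlier in this section, KL annealing can be sketched as a weight on the KL term that grows from 0 to 1 during training. The linear schedule, the `warmup_steps` parameter, and the function names are assumptions for illustration; the text does not specify the schedule that was actually used.

```python
def kl_weight(step: int, warmup_steps: int) -> float:
    """Linear KL annealing: weight rises from 0 to 1 over `warmup_steps` updates."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def training_loss(recon_nll: float, bow_nll: float, kl: float,
                  step: int, warmup_steps: int) -> float:
    """Loss to minimize: reconstruction NLL + bag-of-words NLL + annealed KL.

    This is the negative of the bag-of-words objective, with the KL term
    scaled by the annealing weight.
    """
    return recon_nll + bow_nll + kl_weight(step, warmup_steps) * kl
```

Early in training the KL term is nearly ignored, which gives the decoder a reason to use $z$ before the pressure to match the prior takes full effect.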

Summary of the CVAE-DE Model.
In this section, we introduce the mathematical derivation and structural framework of the model. The goal of our model is to generate dialogue responses that are rich in content, diverse in form, and rich in emotion. To improve our model's ability to understand emotion and improve the accuracy of emotion recognition, we use the BERT model as the emotion classifier. At the same time, to prevent the Seq2Seq model from generating a large number of general responses, we introduce the hidden variables of the conditional variational autoencoder to enable our model to generate rich and diverse responses. Finally, to make the emotion contained in the responses more natural and appropriate, we design a dual emotion framework that considers not only the controllability of the output emotion but also the continuity of the emotion with the input sentence.

Data Preparation and Implementation Details.
We use different data sets to train the multiemotion classifier and dialogue generation model. The multiemotion classifier is trained with the Weibo corpus data with emotion labels, which are derived from the Chinese Weibo emotion recognition task in NLPCC 2013 and the Chinese Weibo text emotion analysis task in NLPCC 2014. After sorting and filtering, the data set has a total of 40133 sentences, each of which contains an emotion label, which are divided into six categories: Null, Like, Sad, Disgust, Anger, and Happiness. The dialogue generation model is trained with the data set that is derived from the emotion dialogue generation task in NLPCC 2017. The data set contains 1119207 pieces of training data, each including an original sentence and a response sentence.
In the training of the multiemotion classifier, we divide the data set into a training set, a validation set, and a test set, with a ratio of 36133 : 2000 : 2000. We train the classifier on the basis of the pretrained Chinese model "BERT-Base, Chinese" released by Google.
In the training of the dialogue generation model, the ratio of the training set, validation set, and test set is 1099239 : 9984 : 9984. Our vocabulary size is set to 40000, the dimensions of the word embedding vector and the emotion label embedding vector are both set to 128, the encoder and decoder use RNN layers with 128 hidden units, and the latent variable size is set to 268. We randomly initialize all of the parameters of the model and set the batch size to 128.
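For reference, the settings above can be collected in a single configuration object. The values are taken from the text; the class name `CvaeDeConfig` and the field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CvaeDeConfig:
    """Hyperparameters reported for the dialogue generation model."""
    vocab_size: int = 40000
    word_embedding_dim: int = 128
    emotion_embedding_dim: int = 128
    hidden_units: int = 128       # encoder and decoder RNN layers
    latent_dim: int = 268         # size of the latent variable z
    batch_size: int = 128
    emotions: Tuple[str, ...] = (
        "Null", "Like", "Sad", "Disgust", "Anger", "Happiness",
    )


# Dialogue data split (training : validation : test) as reported.
DIALOGUE_SPLIT: Dict[str, int] = {
    "train": 1099239,
    "validation": 9984,
    "test": 9984,
}

# Classifier data split (training : validation : test) as reported.
CLASSIFIER_SPLIT: Dict[str, int] = {
    "train": 36133,
    "validation": 2000,
    "test": 2000,
}
```

The split sizes sum to the data set totals given earlier (1119207 dialogue pairs and 40133 labeled sentences), which is a quick consistency check on the reported numbers.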

Baselines.
In the experiments, we compare CVAE-DE with the following baselines:
Seq2Seq: A standard Seq2Seq model with an attention mechanism that is widely used as a baseline in the conversation generation task [21].
ECM: A Seq2Seq model that uses emotion category embeddings and internal and external memory mechanisms to generate emotional responses [14].
CVAE: A conditional variational autoencoder model that takes the target emotion label as input to formulate the latent variable [16].
CVAE-MTDA: A conditional variational autoencoder model with a dual-attention mechanism used to ensure that specific emotional responses are coherent with the content and the emotion of the input [25].
EDGAN: A model based on generative adversarial networks with multiple generators for generating responses with specific emotions and a multiclass discriminator [27].

Evaluation Indicators.
In this paper, we introduce the evaluation metrics for the following two aspects.

Multiemotion Classifier.
Emotion classification accuracy is used as the evaluation index of the emotion classifier. For comparison, we train a variety of emotion classifiers, including RNN [30], LSTM [34], and Bi-LSTM [35].

Dialogue Generation Model.
The evaluation indicators of the dialogue generation model are mainly divided into automatic evaluation and manual evaluation. Since there is no single correct answer in open-domain dialogue generation, the bilingual evaluation understudy (BLEU) algorithm [36] is not suitable for evaluating the dialogue generation model [37]. Therefore, the responses generated by our model are automatically evaluated according to the perplexity [38], the accuracy of emotion expression, and the Distinct-1 and Distinct-2 methods [11].
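Perplexity: Defined by Eq. (17)

$$\mathrm{PPL}(S) = \exp\left(-\frac{1}{L} \sum_{i=1}^{L} \log p(w_i)\right) \tag{17}$$

where $S$ is the generated sentence, $L$ is the length of the sentence, and $p(w_i)$ is the probability of the $i$-th word. A lower PPL score corresponds to a better model, more natural response, and smoother sentence.

Distinct-1 and Distinct-2: Used to judge whether the model will generate a large number of universal and repetitive responses, which can reflect the diversity of responses. The definition is given in Equation (18)

$$\mathrm{Distinct}\text{-}n = \frac{\mathrm{Count}(\text{unique ngram})}{\mathrm{Count}(\text{word})} \tag{18}$$

where $\mathrm{Count}(\text{unique ngram})$ is the number of unigrams/bigrams that are not repeated in the responses and $\mathrm{Count}(\text{word})$ is the total number of unigrams/bigrams in the responses. A larger value of Distinct-1 and Distinct-2 indicates a higher diversity of the generated responses.

Table 1: Emotion classification accuracy.

| Model | Accuracy |
| --- | --- |
| RNN [30] | 56.2% |
| LSTM [34] | 59.7% |
| Bi-LSTM [35] | 62.1% |
| BERT [18] | 65.1% |

Distinct-1 and Distinct-2 of the generated responses.

| Model | Distinct-1 | Distinct-2 |
| --- | --- | --- |
| ECM [14] | 0.0062 | 0.0396 |
| CVAE [16] | 0.0256 | 0.2635 |
| CVAE-MTDA [25] | 0.0287 | 0.2712 |
| EDGAN [27] | 0.0273 | 0.2658 |
| CVAE-DE | 0.0308 | 0.2836 |
| Target responses | 0.0952 | 0.5897 |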
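The two automatic metrics of Eq. (17) and Equation (18) can be computed as in the following sketch. It assumes that per-word probabilities are already available from the model and that responses are pre-tokenized; "unique" is interpreted in the usual Distinct-n sense as the number of distinct n-grams, which is an assumption about the intended reading. The function names are illustrative.

```python
import math
from typing import List, Sequence


def perplexity(word_probs: Sequence[float]) -> float:
    """PPL of one sentence from the model probabilities of its L words."""
    if not word_probs:
        raise ValueError("sentence must contain at least one word")
    length = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / length)


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Ratio of distinct n-grams to all n-grams over a set of tokenized responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

For instance, a model that assigns probability 0.25 to every word of a sentence has a perplexity of 4, and a set of responses that keeps repeating "i do" has a low Distinct-2 because the same bigram is counted many times in the denominator.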

Manual Evaluation: To better understand the quality of the generated response in terms of content and emotion, we invite 4 volunteers to evaluate the results of our generation models. The reviewer scores of the generated response are based on content and emotion. The content scores are mainly based on whether the response is appropriate and natural or whether it may be generated by people; it is a widely accepted measurement standard by researchers and was proposed by Shang et al. [21]. The emotion scores are mainly based on whether the emotion of response meets the given target emotion. The content scores are divided into 0 point, 1 point, and 2 points. The emotion scores are divided into 0 point and 1 point.

Experimental Results and Analysis.
(1) Classification Accuracy of the Multiemotion Classifier: As shown in Table 1, the classifier based on the BERT model has the highest accuracy, reaching 65.1%. The higher the accuracy of the emotion classifier is, the more accurate the emotion label is, and the higher the accuracy of the emotion expression. Therefore, we will use a classifier based on the BERT model to generate the emotion labels.
(2) Perplexity and Accuracy of Emotion Expression: As shown in Table 2, CVAE-DE obtains better scores than all of the other models in both perplexity and emotion expression accuracy. The best score in emotion expression accuracy indicates that the dual emotion framework can generate a response that is closer to the emotional response in the real human conversation corpus than the other models. The emotional responses of the CVAE-DE model are not only controlled by the target emotion but also affected by the emotion of the input sentence. This mirrors real-life communication, in which the responding party is not only controlled by their own emotion but also affected by the emotion expressed by the other party. The emotion accuracy of Seq2Seq is quite low because it generates the same response regardless of the emotion type.
(3) Distinct-1 and Distinct-2: It is observed from the Distinct-1 and Distinct-2 results that CVAE-DE (0.0308 and 0.2836) scores higher than ECM, CVAE, CVAE-MTDA, and EDGAN, but it remains well below the target responses (0.0952 and 0.5897).
Although our model has achieved a relatively satisfactory performance compared to that of other models, there are still some limitations. Our model is mainly limited to some coarse-grained emotional labels, including like, sadness, and anger. Such coarse-grained classification labels make it difficult to capture the nuances of human emotion. Therefore, a direction for our future work is to build a corpus with fine-grained emotional labels and train our model on it so that it can better capture the nuances of human emotions.

Conclusion
In this paper, we propose an emotional dialogue generation model, CVAE-DE, to produce high-quality responses with multiple emotion types. An emotion classifier based on the BERT model is used to classify a variety of emotions, which to a certain extent alleviates the low emotion classification accuracy of previous methods. To enable the model to produce richer and more diverse responses, we introduce a conditional variational autoencoder on the basis of the Seq2Seq model with the attention mechanism. At the same time, to enable the model to generate coherent and controllable emotional responses, we propose a dual-emotional framework. The experimental results show that the model proposed in this paper can produce high-quality responses with specific emotions.
In future work, we will use more complex generation models to further improve the quality of generated responses and use a corpus with fine-grained emotion classification labels to enrich the emotion of responses. At the same time, we will also explore the application of the method in this article to multiple rounds of dialogue, using contextual information to infer the user's emotional information, rather than the emotional information specified by the user. This will be a challenging task because it depends on the topic, contextual information, and the user's emotions.

Data Availability
We use different data sets to train the multiemotion classifier and dialogue generation model. The multiemotion classifier is trained with the Weibo corpus data with emotion labels, which are derived from the Chinese Weibo emotion recognition task in NLPCC 2013 and the Chinese Weibo text emotion analysis task in NLPCC 2014. After sorting and filtering, the data set has a total of 40133 sentences, each of which contains an emotion label, which are divided into six categories: Null, Like, Sad, Disgust, Anger, and Happiness. The dialogue generation model is trained with the data set that is derived from the emotion dialogue generation task in NLPCC 2017. The data set contains 1119207 pieces of training data, each including an original sentence and a response sentence. All researchers can access the data at the following site: https://www.biendata.xyz/ccf_tcci2018/datasets.