A Bichannel Transformer with Context Encoding for Document-Driven Conversation Generation in Social Media

With the development of social media on the internet, dialogue systems are becoming increasingly intelligent to meet users' needs for communication, emotion, and social interaction. Previous studies usually use sequence-to-sequence learning with recurrent neural networks for response generation. However, recurrent learning models suffer heavily from the problem of long-distance dependencies in sequences. Moreover, some models neglect crucial information in the dialogue contexts, which leads to uninformative and inflexible responses. To address these issues, we present a bichannel transformer with context encoding (BCTCE) for document-driven conversation. This conversational generator consists of a context encoder, an utterance encoder, and a decoder with an attention mechanism. The encoders aim to learn the distributed representation of input texts. The multihop attention mechanism is used in BCTCE to capture the interaction between documents and dialogues. We evaluate the proposed BCTCE by both automatic evaluation and human judgment. The experimental results on the CMU_DoG dataset indicate that the proposed model yields significant improvements over the state-of-the-art baselines on most of the evaluation metrics, and that the responses generated by BCTCE are more informative and more relevant to the dialogues than those of the baselines.


Introduction
Dialogue systems such as Siri, Cortana, and Duer have been widely used as virtual assistants and social chatbots to facilitate interactions between humans and intelligent devices. For example, people can conveniently make airline reservations with the help of an intelligent agent in social media. Conversational response generation, a challenging task in natural language processing, plays a critical role in conversational systems.
Conversational generation aims to produce grammatical, coherent, and plausible responses in accordance with the input from users. Previous studies on dialogue generation mainly focus on either one-round conversation [1] or multiturn conversation [2]. One-round conversation tasks commonly determine responses on the basis of a single current query, while a multiturn conversation that consists of context-message-response triples commonly builds context-sensitive generators according to the dialogue history [2,3]. Multiturn conversation tasks tend to generate a variety of correlative responses in either goal-driven customer services [4][5][6] or chitchat without predefined goals [7].
In previous studies, the sequence-to-sequence (seq2seq) framework [8] with the attention mechanism has commonly been used to generate conversational responses and has achieved remarkable success in various domains [9][10][11][12][13][14]. The seq2seq models map one sequential structure to another without explicitly defining structural features by building an end-to-end neural network [2,15]. Most seq2seq models use a recurrent neural network [16,17] as the encoder and decoder to capture the sequential dependency. However, hierarchical recurrent neural networks are time-consuming to train and have difficulty capturing long-distance semantic dependencies in text.
Lacking background knowledge, previous conversational generators tend to produce safe and generic responses such as "That's all right" and "Yes." Such uninformative responses rarely match the relevant content in the given document or satisfy the demands of users. To address this challenge, some knowledge-based methods have been proposed in recent works for conversational response generation. In these works, external knowledge is leveraged to facilitate conversation understanding and generation [18,19], which includes structured data such as knowledge graphs [20,21], unstructured textual knowledge [22], and visual knowledge [23]. With the development of the internet and big data, unstructured knowledge is more accessible than structured knowledge, which is constructed manually and depends heavily on the experience of experts. Therefore, some recent works take conversation-related documents and texts as the background knowledge to enrich conversations with useful information and generate more informative and interesting responses [24].
Our work is inspired by the recent success of the transformer framework [25], which is entirely based on attention mechanisms in end-to-end natural language processing tasks and eliminates complex recurrent and convolution network architectures [26]. We propose a transformer-based model for multiturn document-driven conversation.
The proposed model encodes the conversational context and the current utterance separately. It also incorporates the multihop attention mechanism into the encoder and decoder to capture the correlative content for response generation, which draws global dependencies between documents, utterances, and responses. We conduct experiments on the conversation generation task and report several metrics, including BLEU [27], METEOR [28], NW [24], and perplexity [19].
The experimental results indicate that the proposed model significantly outperforms the state-of-the-art methods on most of the evaluation metrics. We also conduct ablation experiments to show the effects of the input elements fed into the encoder. The human judgment on various ablation models shows that the responses generated by the BCTCE model are more relevant to the context (document and dialogue history), more informative, and more fluent than those of several of its variants. The contributions of this work are as follows:
(1) We propose a novel BCTCE based on the transformer framework to build an encoder-decoder generator for document-driven conversation. The experimental results show that our model achieves new state-of-the-art performance.
(2) The BCTCE learns the distributed representation of the conversational context by encoding the document and dialogue utterances in parallel and integrating them with the interattention mechanism.
(3) The BCTCE leverages layer-wise multihop attention mechanisms to gradually enhance the interaction between inputs, where the dialogue utterances and the document, which provides supplementary knowledge, are used to generate context-aware and dialogue-consistent responses.
(4) The BCTCE can reduce the time of training and inference compared to the recurrent network-based response generators.
We review the related work in Section 2 and present the details of the proposed model in Section 3. Section 4 describes the experiments, including datasets and evaluation criteria, and analyzes the results. Finally, Section 5 concludes this work with a brief summary and discusses future work.

Related Work
Previous models for conversation are generally divided into rule-based, retrieval-based, and generation-based models. The rule-based and retrieval-based models depend on handcrafted rules or existing knowledge bases to match the correct answer, while the generation-based models require less manual effort by leveraging data-driven training on a noisy but large-scale corpus.
Recently, deep neural networks have been widely used for both response retrieval [29,30] and response generation [31]. Some retrieval-based works determine the correct responses by the semantic similarity between the representations of a query and its candidate answers learned by neural networks. Sequence-to-sequence (seq2seq) frameworks [8], which have achieved success in many domains such as machine translation [9,32], have been commonly used for response generation [2,15,32]. In particular, seq2seq-based models play an important role in studies on multiturn conversation [4], which commonly build encoder-decoder networks for response generation. They map one sequential structure to another without explicitly defining features, where recurrent neural networks (RNNs) such as long short-term memory (LSTM) and gated recurrent units (GRUs) [33] are commonly employed as the kernel unit. Vinyals and Le explored the LSTM network to produce sequential responses end-to-end for multiturn conversation [4]. Shang et al. combined global and local context information on the basis of the original RNN for one-round conversation. Sordoni et al. encoded the semantic information of the context and message by a multilayered nonlinear forward network and took an RNN as the decoder to generate responses [3]. Chen et al. utilized a memory network to preserve more historical information in a multiturn dialogue [34]. RNNs are commonly used to sequentially encode each word in the input context and produce the response word-by-word during decoding. However, they are limited by the long time required for sequential training and by exploding or vanishing gradients. In addition, such models may suffer from information loss because they can hardly capture long-term semantic dependencies between utterances.
Attention mechanisms have become an integral part of sequence models in response generation, modeling the textual dependencies in the input or output sequences without regard to the position information [35,36]. In previous works on neural response generation, the attention mechanism was incorporated into the encoder-decoder framework to preserve the key semantic information in sentences [1,37]. Vaswani [37]. However, previous works merely produce general, rigid, and stylized responses without the natural variation in the language [39]. To address this issue, some studies have proposed context-aware conversational generators to produce more diverse and meaningful responses. Li et al. improved the LSTM-based generator by simply taking maximum mutual information as the objective function [15]. Xing et al. proposed a topic-aware neural generator that leverages topic information to simulate prior knowledge of humans by a joint attention mechanism and a biased generation probability [40]. More works focus on employing extra knowledge to guide the generation and hence tend to generate meaningful and context-related responses. Liu et al. presented a neural knowledge diffusion model to introduce knowledge into dialogue generation [41]. Young et al. incorporated commonsense knowledge about the concepts covered in utterances into end-to-end conversational models [30]. Madotto et al. used a multihop attention mechanism over memories with pointer networks to effectively incorporate knowledge base information in generative dialogue systems [42]. Moon et al. combined a knowledge graph with conversational utterances to infer the correct entity as the output response [43]. Lian et al. focused on the selection of knowledge for conversational response generation [21]. Both Li et al. [44] and Li et al. [26] proposed document-grounded dialogue generation models to form informative and interesting multiturn responses.

Model Architecture
This work proposes a novel transformer-based model that jointly encodes a given document and dialogue for response generation, as shown in Figure 1. It follows the encoder-decoder framework using only stacked layers, each of which consists of a multihead attention mechanism and a position-wise fully connected feedforward network. Encoder-decoder neural networks with attention functions have been widely leveraged for sequential language generation [45].
In a multiturn conversational generation task, a dialogue is commonly considered as a sequence of $k$ utterances $u_1, u_2, \ldots, u_k$, which contains the dialogue history $u_1, u_2, \ldots, u_{k-1}$ and the current utterance $u_k$, where $u_i = (w^{u_i}_1, w^{u_i}_2, \ldots, w^{u_i}_{|u_i|})$ denotes the $i$-th utterance in the multiturn dialogue and $w^{u_i}_j$ denotes the $j$-th word in the $i$-th utterance. In this work, we denote the dialogue as a token-level sequence $U = (w^u_1, w^u_2, \ldots, w^u_{l_u})$, where $l_u$ denotes the length of the dialogue sequence. The given document for response generation is denoted as $D = (w^d_1, w^d_2, \ldots, w^d_{l_d})$, where $l_d$ is the length of the document sequence. The dialogue utterances and document are fed into the encoders to learn their distributed representation, which is illustrated in Section 3.2. The output of the decoder in our model is a generated response $R = (w^r_1, w^r_2, \ldots, w^r_T)$, where $T$ is the length of the response.

Attention Mechanism and Multihead Attention.
We take advantage of the attention mechanism to capture the interactions between the document and the dialogue, which allows the model to attend to the information useful for response generation. We assume that there are $n_1$ queries and $n_2$ key-value pairs. Then, we use an attention function to obtain a weighted sum of the values for each query, where the query, key, and value are all vectors of dimension $d_k$. The weight $\alpha$ assigned to each value is computed by a scaled dot-product function of the query with the corresponding key, as shown in the following equation:

$$\alpha = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right), \qquad \mathrm{Attention}(Q, K, V) = \alpha V,$$

where $Q$ is the matrix packing a set of $d_k$-dimensional query vectors, $Q \in \mathbb{R}^{n_1 \times d_k}$. $K$ and $V$ are also matrices, $K \in \mathbb{R}^{n_2 \times d_k}$ and $V \in \mathbb{R}^{n_2 \times d_k}$. The weight $\alpha \in \mathbb{R}^{n_1 \times n_2}$.

Moreover, our model implements multihead attention [25] in all the attention computations to jointly collect information from different representation subspaces at different positions. We define the multihead attention function with $M$ heads, which projects the queries, keys, and values $M$ times with different learned linear projections. The result of the multihead attention function MultiHead is a vector that concatenates all the output vectors across the $M$ heads, shown as follows:

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_M],$$

where $\mathrm{head}_m$ denotes the $m$-th weighted vector calculated as the following formula:

$$\mathrm{head}_m = \mathrm{Attention}(Q W^Q_m, K W^K_m, V W^V_m),$$

where $Q$, $K$, and $V$ indicate the vectors of the input query, key, and value with the same dimension $d_{model}$, respectively. $W^Q_m, W^K_m, W^V_m \in \mathbb{R}^{d_{model} \times d_{att}}$ are trainable parameter matrices for the $m$-th head, and $d_{att} = d_{model}/M$.
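The scaled dot-product and multihead attention functions described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper names (`softmax`, `matmul`, `attention`, `multi_head`) are hypothetical, matrices are lists of rows, and the heads are simply concatenated, following the textual description in this section.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention.

    Returns the weights alpha (n1 x n2) and the weighted sum alpha @ v.
    """
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    alpha = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return alpha, matmul(alpha, v)

def multi_head(q, k, v, projections):
    """Concatenate the outputs of M heads.

    projections is a list of (W_q, W_k, W_v) triples, one per head,
    each of shape d_model x d_att (assumed layout).
    """
    heads = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))[1]
        for w_q, w_k, w_v in projections
    ]
    # Concatenate the head outputs row by row along the feature dimension.
    return [sum((head[i] for head in heads), []) for i in range(len(q))]
```

With $d_{model} = 2$ and $M = 2$, each head works on a single feature, and the concatenated output again has two features per position.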

Encoder.
The encoder of the proposed model consists of an utterance encoder and a context encoder. The former aims to learn the representation of the current utterance, and the latter aims to learn the representation of the conversational context (document and dialogue utterances). The encoder is inspired by the reading behavior of human beings. Generally, a basic process of reading comprehension is to first read through the given document and dialogue to understand the theme and capture the key information from the context. Then, we focus on the current utterance to generate the answer. In the context encoder, the given document and dialogue utterances are encoded in parallel by intra-attention interaction and interattention interaction. This parallel learning process aims to better represent the context and fuse the information of the document and utterances.
We first map the symbolic representations of input sequences to distributed representations.
The tokens of the current utterance, the document, and the dialogue utterances are fed one by one into the encoder. Moreover, it has been widely accepted that position information is critical to indicate the order of the sequential input. However, the self-attention mechanism itself cannot distinguish between different positions. So, we introduce an additional position embedding to encode position information of the input into the word vectors, shown as the "positional encodings" module in Figure 1. The sum of the original word embedding and the position embedding is defined as the distributed representation $e(w)$ of word $w$:

$$e(w) = \mathrm{embed}(w) + \mathrm{PE}(w),$$

where $\mathrm{embed}(\cdot)$ denotes an embedding lookup function; the position embedding $\mathrm{PE}(\cdot)$ is defined as in [25]:

$$\mathrm{PE}(w)_d = \begin{cases} \sin\left(pos / 10000^{d/d_{model}}\right), & d \text{ even}, \\ \cos\left(pos / 10000^{(d-1)/d_{model}}\right), & d \text{ odd}, \end{cases}$$

where $pos$ is the position of $w$ in the dialogue sequence or document sequence, $d$ denotes the $d$-th dimension of the representation, and $d_{model}$ is the dimension of the input embedding.

The utterance encoder is the same as that of the original transformer [25] with $N$ stacks, shown in the left of Figure 1. It outputs the embedding of the current utterance and has two sublayers in each stack. The first sublayer is constructed by the multihead self-attention mechanism, and the second is a position-wise fully connected feedforward network.

The context encoder is a variant of the transformer encoder with $N$ stacks, shown in the center of Figure 1. Differing from the original transformer, which only deals with a single channel in the encoder, it builds two channels to separately encode the dialogue and the document in the first step. The context encoder is composed of a stack of $N$ identical layers, and each stack contains three sublayers: (i) Intra-attention layer: this layer is employed to encode two individual input sequences using the multihead where $h^U_{n-1}$ and $h^D_{n-1}$ are the outputs of the previous stack. where $\in \mathbb{R}^{d_{model}}$ are trainable parameters. $d_{inner}$ is the size of the hidden layer in the feedforward network.
In addition, each sublayer has an "Add & Norm" operation, as defined in the original transformer framework. The outputs of the last stack are the matrices $h^U_N$ and $h^D_N$, which are concatenated as the output of the context encoder.
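The embedding step of the encoder (word embedding plus sinusoidal position embedding) can be sketched as follows. This is an illustrative sketch under stated assumptions: positions are counted from 0, the embedding table is a plain list of vectors, and the function names are hypothetical rather than taken from the authors' code.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal position embedding of [25] for a single position."""
    pe = []
    for d in range(d_model):
        # An odd dimension shares the frequency of the even dimension below it.
        angle = pos / (10000 ** ((d - d % 2) / d_model))
        pe.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
    return pe

def embed_sequence(token_ids, embedding_table):
    """Compute e(w) = embed(w) + PE(pos) for each token of a sequence."""
    d_model = len(embedding_table[0])
    return [
        [e + p for e, p in zip(embedding_table[tok], positional_encoding(pos, d_model))]
        for pos, tok in enumerate(token_ids)
    ]
```

The same function would be applied separately to the current utterance, the dialogue sequence, and the document sequence, since each sequence has its own positions.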

Decoder.
The decoder also has $N$ stacks and contains three attention layers followed by a feedforward layer per stack, as shown in the right of Figure 1. At time step $t$, the previous $t-1$ tokens and the output of the encoder are fed into the decoder to predict the $t$-th token in the response, which is represented by the output of the $N$-th stack, $h^R_t \in \mathbb{R}^{d_{model}}$.
(i) Masked self-attention layer: this layer is similar to the intra-attention layer in the encoder. The difference is that we mask the subsequent positions of each token to ensure that each predicted token depends only on the preceding tokens.
(ii) Context-attention layer: to enrich the context information in generated responses, the output of the context encoder is fed into the decoder and integrated with the previous response tokens by a multihead attention function, where the query of the function is the output of the masked self-attention layer, and the key and the value are the output of the context encoder.
(iii) Utterance-attention layer: generally, the generated response must be relevant to the current utterance. Thus, we also use a multihead attention function to introduce the key information of the current utterance for generating dialogue-related responses in this layer. The query of the function is the output of the context-attention layer, and the key and the value are the output of the utterance encoder.
(iv) Feedforward layer: this layer is the same as the feedforward layer in the encoder.
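The masking used in the masked self-attention layer can be illustrated with a small sketch. It assumes the common approach of setting disallowed attention scores to negative infinity before the softmax; the source does not state how the mask is applied, and the function names are hypothetical.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def apply_mask(scores, mask):
    """Replace disallowed scores with -inf so that softmax gives them zero weight."""
    return [
        [s if allowed else float("-inf") for s, allowed in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]
```

For a response prefix of length 3, the first position can only attend to itself, while the third position can attend to all three tokens.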
We select the token from an external vocabulary $V_o$; the probability of each candidate token being chosen is Then, we define the probability of generating the $t$-th token as $P(w_t)$: At each time step, we select the token that has the highest probability as the generated token: During the training process, the loss for time step $t$ is defined as the negative log-likelihood of the target word $w^*_t$: The final loss is
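The selection rule and the training loss described in this paragraph can be sketched as follows. The sketch assumes that the per-step probabilities over the vocabulary are already available as lists, and that the final loss is the mean of the per-step losses; the aggregation is an assumption, and the function names are hypothetical.

```python
import math

def greedy_token(probs):
    """Return the vocabulary index with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def nll_loss(step_probs, target_ids):
    """Per-step negative log-likelihood of the target tokens and their mean.

    step_probs[t] is the distribution over the vocabulary at time step t,
    target_ids[t] is the index of the target word at that step.
    Averaging over the steps is an assumption made for this sketch.
    """
    losses = [-math.log(p[t]) for p, t in zip(step_probs, target_ids)]
    return losses, sum(losses) / len(losses)
```

Greedy selection is used at inference time only; during training the loss is computed against the target word at each step.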

Copying Mechanism.
In this work, we aim to generate more imaginative and context-aware responses. However, some tokens in the ground truth may not be included in the vocabulary (OOV, out of vocabulary). As such, we propose a variant of our transformer model that incorporates the copying mechanism [46,47] into the decoder to generate tokens that appear in the document and dialogue in addition to the external vocabulary. The tokens in generated responses may be chosen from the input or from the external vocabulary according to a computed probability.
At each time step $t$, according to the multihead attention weights resulting from the context-attention layer, we determine the probability that the generated tokens are derived from the input document and dialogue utterances as the average of the attention weights over all heads:

$$\alpha_t = \frac{1}{M} \sum_{m=1}^{M} \alpha^m_t,$$

where $\alpha_t \in \mathbb{R}^{l_u + l_d}$ and $\alpha^m_t$ indicates the attention weight of the $m$-th head.
According to the copying mechanism, the probability of tokens being chosen from the vocabulary is $p^g_t \in [0, 1]$, while the probability of tokens being chosen from the input sequences is $1 - p^g_t$: where $W$ is a trainable parameter. The probability of the $t$-th generated token $w_t$ is calculated according to the source it is derived from as follows:
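A copying step of this kind can be sketched as a mixture of two distributions. The sketch below assumes a pointer-generator-style blend in which the vocabulary distribution is weighted by the generation probability and the averaged attention weights are weighted by its complement; the exact formula of the paper is not reproduced here, and all names are hypothetical.

```python
def average_heads(head_weights):
    """Average the attention weights over the M heads, position by position."""
    m = len(head_weights)
    return [sum(col) / m for col in zip(*head_weights)]

def mix_distributions(p_gen, vocab_probs, copy_weights, source_tokens):
    """Blend generation and copy probabilities into one distribution.

    p_gen: probability of choosing a token from the external vocabulary.
    vocab_probs: dict mapping each vocabulary token to its probability.
    copy_weights: averaged attention weights, one per source position.
    source_tokens: tokens of the concatenated dialogue and document.
    """
    final = {tok: p_gen * p for tok, p in vocab_probs.items()}
    for tok, w in zip(source_tokens, copy_weights):
        # A token that occurs several times in the source accumulates its weights.
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * w
    return final
```

A source token that is missing from the external vocabulary (for example "post") still receives probability mass through the copy term, which is how such a mechanism can ease the OOV problem.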

Experiment Settings.
In our experiments, the number of stacks in both the encoder and the decoder is set to 4. The number of attention heads is set to 8. The dimension of the input embedding $d_{model}$ is set to 512, and the hidden size of the feedforward network $d_{inner}$ is set to 2,048. In the process of encoding, we take the previous four utterances and the given document as the input. We use the Adam algorithm [48] with a learning rate of 0.0001 for optimization. The batch size is set to 64, and the dropout rate is set to 0.1. In addition, we train the model for 50 epochs.
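For reference, the settings above can be collected in a small configuration object. The class and field names are hypothetical and only mirror the values reported in this section; the derived per-head dimension follows $d_{att} = d_{model}/M$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BCTCEConfig:
    """Hyperparameters reported in the experiment settings (names are illustrative)."""
    num_stacks: int = 4          # stacks in both encoder and decoder
    num_heads: int = 8           # attention heads M
    d_model: int = 512           # input embedding dimension
    d_inner: int = 2048          # hidden size of the feedforward network
    history_utterances: int = 4  # previous utterances fed to the encoder
    learning_rate: float = 1e-4  # Adam learning rate
    batch_size: int = 64
    dropout: float = 0.1
    epochs: int = 50

    @property
    def d_att(self) -> int:
        """Per-head dimension d_att = d_model / M."""
        return self.d_model // self.num_heads
```

With these values each attention head operates on 64 dimensions.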

Dataset.
We evaluate the proposed model on the CMU_DoG dataset (CMU document-grounded conversations) for document-driven conversations [24]. This dataset consists of a set of documents and a spectrum of dialogues corresponding to each document. The documents may contain movie names, ratings, introductions, and some scenes, and they present conversation-related information that may help generate context-aware responses in a multiturn conversation task. The dataset has a total of 4,112 conversations with an average of 21.43 turns. The dialogue utterances are derived from two different scenarios, both of which involve two participants. In the first scenario, only one participant has access to the given document, while both participants have access to the same given document in the second scenario. The number of conversations is 2,128 for the first scenario and 1,984 for the second.
This high-quality dataset explicitly presents the correspondence between each section of a document and the conversation turns. The average length of documents is approximately 200.
There are 72,922 utterances for training, 3,626 utterances for validation, and 11,577 utterances for testing.

Quantitative Evaluation.
To measure the performance of the proposed model and the baselines, we take BLEU [27], METEOR [28], NW [24], and perplexity (PPL) [19] as evaluation criteria to perform automatic evaluation.
(1) BLEU: BLEU is known to correlate reasonably well with human evaluation on the task of conversational response generation. It measures the n-gram overlap between generated responses and the ground truth, which is defined as BLEU-n. We calculate various BLEU scores between the golden responses and the generated responses. Moreover, we calculate the unigram overlap between the given document and the generated responses to further compare models in terms of the correlation between the responses and the document. For this comparison, we use only the BLEU-1 score (called Doc_BLEU) and ignore the brevity penalty factor in the BLEU computation.
(2) METEOR: we also compare our proposed model with state-of-the-art baselines in terms of the METEOR metric under the full mode (this mode contains exact matching between words and phrase matching between stems, synonyms, and paraphrases). METEOR, which focuses on the recall rate, is more consistent with human judgment than BLEU.
(3) NW: we use a set operation (NW) to evaluate the relevance between documents and the conversations generated by the models. Let the set of tokens in the generated response be $N$, the set of tokens in the document be $M$, the set of tokens in the previous three utterances be $H$, and the set of stop words be $S$. We calculate NW as $|((N \cap M) \setminus H) \setminus S|$. A higher NW score indicates that more tokens that appear in the document are used to expand the information in responses.
(4) Perplexity: in addition to the previous three criteria, we use perplexity to automatically evaluate the fluency of the response. Lower perplexity indicates better performance of the models and higher quality of the generated sentences.
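The two document-relevance measures above can be sketched on tokenized inputs. The sketch assumes that Doc_BLEU is the clipped unigram precision of the response against the document without a brevity penalty, and that NW is computed on token sets as described; tokenization and the stop-word list are left to the caller, and the function names are hypothetical.

```python
from collections import Counter

def nw_score(response, document, history, stop_words):
    """Count response tokens shared with the document, excluding history and stop words."""
    n, m = set(response), set(document)
    return len(((n & m) - set(history)) - set(stop_words))

def doc_bleu(response, document):
    """Clipped unigram precision of the response against the document, no brevity penalty."""
    if not response:
        return 0.0
    resp_counts, doc_counts = Counter(response), Counter(document)
    matched = sum(min(c, doc_counts[tok]) for tok, c in resp_counts.items())
    return matched / len(response)
```

For example, a response that reuses two document tokens which are neither stop words nor present in the recent dialogue history obtains an NW score of 2.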

Human Judgment.
Manual evaluations are essential for dialogue generation. So, we augment the automatic evaluation with human judgment of fluency, dialogue coherence, and lexical diversity. All three evaluation metrics are scored 0/1/2. We randomly sample multiple conversations containing 822 utterances from the test set. We use a crowdsourcing service that asks annotators to score these utterances given their previous utterances and related documents. The final score of each utterance is the average of the scores rated by three annotators.
(1) Fluency: whether the response is natural and fluent. Score 0 means that the response is not fluent and is incomprehensible; 1 means that the response is partially fluent but still comprehensible; and 2 means that the response is sufficiently fluent.
(2) Dialogue coherence: whether the response is logically coherent with the dialogue. Score 0 means that the response is irrelevant to the previous utterances; 1 means that the response matches the topic of the previous utterances; and 2 means that the response is exactly coherent with the previous utterances.
(3) Lexical diversity: whether the response is vivid and diverse. Score 0 means a safe response that is applicable to almost all conversations, e.g., "i think so" and "i agree with you"; 1 means that the response suits only a limited set of conversations but is plain and uninformative; and 2 means that the response is evidently vivid, diverse, and informative.

Results and
Table 2 shows the comparison of document relevance and response quality in terms of the Doc_BLEU score, NW, the average length of responses (avg_len), and the PPL score. Our model outperforms the baselines by 4.5%-11.9% in terms of BLEU-1, while it has a lower NW score than D3G and incremental transformer (our impl). These results indicate that our model can more effectively use the shared information between the document and the dialogue to produce responses than D3G and incremental transformer (our impl). Moreover, the average length of the responses generated by our model is higher than that of the baselines, which shows that our model may generate more informative responses. In addition, our model achieves a PPL score competitive with the others.
As the results of human judgment in Table 3 show, our transformer-based model outperforms all the baselines in terms of dialogue coherence and diversity. However, the performance of our model is slightly worse than that of incremental transformer (our impl) [44] on fluency.

Training Time.
Figure 2 presents the training time for one epoch of our BCTCE model and some baselines. (In the official source code of the "Incremental Transformer," each training step is followed by an evaluation of the generated responses. Therefore, it is hard to measure the actual training time for one epoch of the "Incremental Transformer" model.) For a fair comparison, all models use the same batch size, max length of the document, max length of the dialogue sequence, and max length of the response.
As shown in Figure 2, the training time of our BCTCE model is much less than that of D3G, while it is higher than that of the other models. The reasons are as follows: (1) The SEQ model is a simple sequence-to-sequence RNN model with an attention mechanism. It only uses the dialogue as the input of the encoder and discards the document, thus requiring considerably less time for model training than our model.

Ablation Study.
To validate the effectiveness of each module of the BCTCE, we conduct ablation experiments on the CMU_DoG dataset:
(1) +copy: we introduce the copy mechanism into the BCTCE model to generate the response from the document and dialogue utterances in addition to the external vocabulary
(2) -document: we replace the context encoder with the original transformer and take only dialogue utterances as its input
(3) -history: we remove the dialogue history from the inputs, keeping only the current dialogue utterance and the document
(4) -context encoder: we discard the context encoder and the context-attention layer in the decoder
(5) -utterance encoder: we discard the utterance encoder and the utterance-attention layer in the decoder
(6) -bi_channel: we replace the context encoder with the original transformer and take the concatenation of the dialogue utterances and document as its input

The automatic evaluation results are shown in Tables 4 and 5, and the human evaluation results are shown in Table 6. As shown in Table 4, the ablation models, which remove some modules of BCTCE, perform worse than the basic BCTCE model on the similarity between generated responses and ground truth (BLEU-n scores and METEOR score). The results of "-bi_channel" indicate that the interaction between the document and the dialogue in bichannel encoding is effective for generating responses. The results of "-context encoder" and "-utterance encoder" show that the context encoder and the utterance encoder are beneficial for response generation. The results of "-document" and "-history" indicate that the multiturn dialogue and the document knowledge are important, as they contain vital information for generating reasonable responses.
Table 5 shows that removing the document or introducing the copy mechanism reduces the Doc_BLEU and NW scores. The results indicate that the BCTCE may pay more attention to the dialogue utterances after removing the document information, and that the copy mechanism has less influence on the generated response than expected since the BCTCE has sufficient capability to learn the document knowledge for response generation. The Doc_BLEU and NW scores increase when removing the dialogue history or utterance encoder from the basic BCTCE model, as the lack of sufficient dialogue information makes the model focus more on the document. The ablation model "-bi_channel" increases the Doc_BLEU score and reduces the NW score, which indicates that its generated responses pay a little more attention to the shared information between the document and the dialogue. It is worth noting that the ablation model "-context encoder" significantly outperforms the BCTCE model on PPL and on the fluency shown in Table 6. A possible reason is that it tends to generate safe and unremarkable responses (e.g., I don't know). The decrease of its avg_len and of the diversity shown in Table 6 also supports this explanation. Moreover, the copy mechanism effectively reduces the PPL from 17.80 to 16.33 and increases the diversity from 0.95 to 1.01, which indicates that it can improve the response quality of the basic BCTCE model.

Results marked with § are trained and evaluated with the source code from [26], results marked with ¶ are trained and evaluated with our implemented code, results marked with † are from [44], and results marked with ‡ are trained and evaluated with the code published by Li et al. [44].

Case Study.
In this section, we present three conversation cases and show the responses generated by our BCTCE model and several ablation models (as shown in Table 7). The first case shows that our BCTCE model produces a response that contains an "OOV," while the BCTCE with a copy mechanism extracts the tokens "post" and "times" from the document. It indicates that the copy mechanism eases the "OOV" problem, although it reduces the performance of the basic BCTCE model on many evaluation metrics.
In the second case, the BCTCE without the document or context encoder produces a completely incorrect answer, while the basic BCTCE and other ablation models can produce at least one correct answer, "jim carrey," contained in the ground truth. These results show that the document is necessary for generating answers, and that the BCTCE can take full advantage of the valid information in the document. The third case indicates the importance of the dialogue history. For the conversation whose current utterance is an uninformative interjection "wow," the BCTCE model produces a response "all around, it has a lot of great stories," referring to the "story" in the previous utterances, while the BCTCE without the dialogue history outputs a dialogue-independent response "yeah he did a great job" that contains an ambiguous personal pronoun. The BCTCE without the context encoder produces an uninformative "OOV." This case shows that the BCTCE without the dialogue history or context encoder generates dialogue-inconsistent responses, while the responses generated by the other models, which use historical dialogue information, are more reasonable and consistent with the current utterance.

Table 7: Three conversation cases and the responses generated by the BCTCE model and its ablation models.

Case 1
Document: . . .the white house retaliates, and in short order the post and times appear together before the supreme court to plead their first amendment argument for the right to publish the material. meanwhile, newspapers across the country pick up the story in solidarity with the post and times. the court rules 6-3 in the newspapers' favor, vindicating graham's decision. nixon demands that the post should be barred from the white house. one year later, a security guard discovers a break-in in progress at the watergate complex . . .
Dialogue: . . . the supreme court ruled 6-3 in favor of the news papers, which vindicated graham's decision. Now i really want to see it. it sounds really juicy, but it will be a little difficult because i feel like the situation will anger me.
Ground truth: but they won. nixon god mad and barred the post from the white house
BCTCE: OOV
-document: i think the post won the oscar for the OOV
-history: i like the OOV of the white house, but they OOV the post to OOV the house.
-context encoder: i think that is a good way to end the movie.
-utterance encoder: it sounds like a good movie to Watch.
-bi_channel: and then the supreme court 6-3 in the supreme court, the court rules.
+copy: i wonder if the post joined the times or not.

Case 2
Document: . . .bruce nolan (jim carrey) is a television field reporter for eyewitness news at wkbw-tv in buffalo, new york, but desires to be the news anchorman . he is in a relationship with his girlfriend grace connelly (jennifer aniston), but also has a mild crush on his co-worker, susan ortega (catherine bell) , who barely seems to notice him. bruce, however, suffers from constant bad luck and reaches breaking point when he is passed over for promotion by his rival, evan baxter (steve carell) . . .
Dialogue: . . . it did not get a good rating on rotten tomatoes. i liked it though that happens a lot with rotten tomato who is in the movie?
Ground truth: jim carrey and jennifer aniston
BCTCE: it is a comedy film starring jim carrey and jennifer aniston as the main character
-document: it's got 92% on rotten tomatoes
-history: jim carrey plays bruce nolan.
-context encoder: the movie is about a man named michael oher who is a stockbroker.
-utterance encoder: jim carrey plays the lead
-bi_channel: jim carrey is a funny actor.
+copy: jim carrey

Case 3
Document: . . .the duo tries to make it to the moving truck, but sid's dog, scud, sees them, and gives chase. buzz gets left behind while saving woody from scud, and woody tries rescuing him with andy's rc car, but the other toys, thinking woody eliminated rc as well, attack and toss him off the truck. having evaded scud, buzz and rc retrieve woody, and continue after the truck. upon seeing woody and buzz together on rc, the other toys realize their mistake, and . . .
Dialogue: . . . so tell me the good things you heard about it heard that it had a cute story line/plot and that there are three movies total, toy story 1, 2, and 3 wow
Ground truth: that's about it
BCTCE: all around, it has a lot of great stories
-document: i'm
-history: yeah he did a great job.
-context encoder: OOV
-utterance encoder: do you know what kind of movie it is?
-bi_channel: have a great day
+copy: i like the first pixar movies

Conclusions
In this paper, we propose the BCTCE model based on the transformer framework for document-driven conversational response generation. This model consists of a context encoder, an utterance encoder, and a decoder, aiming to generate context-aware responses. The context encoder learns the distributed representations of the input document and dialogue utterances in parallel and then fuses them with the multihead self-attention function. The utterance encoder represents the current utterance as a distributed embedding. The decoder focuses on the document knowledge and dialogue coherence to predict the next response. The competing models are evaluated by comparing the generated responses with the ground truth. Empirical results show that the BCTCE outperforms state-of-the-art baselines in terms of various BLEU scores, METEOR, and NW. The ablation study indicates the effectiveness of the modules in the BCTCE. The manual evaluation and case study show that our model can capture the useful information contained in the document and dialogue, which helps to generate diverse and reasonable responses that are more relevant to the context. In future work, we will try to build various encoders and concatenate their outputs to integrate the input sequences and generate a reasonable context representation.