Mutual-Attention Net: A Deep Attentional Neural Network for Keyphrase Generation

Neural keyphrase generation (NKG) is a recently proposed approach to automatically extract keyphrases from a document. Unlike traditional keyphrase extraction, NKG can generate keyphrases that do not appear in the document. However, as a supervised method, NKG is hindered by noise in the source document, and existing NKG models do not consider denoising it. In this work, we introduce a new denoising architecture, the mutual-attention network (MA-net). Motivated by the structure of documents in popular datasets, multihead attention is applied to mine the relevance between the title and abstract, which aids denoising. To further improve the generation of high-quality keyphrases, we use multihead attention to compute the content vector instead of Bahdanau attention. Finally, we employ a hybrid network that augments the proposed architecture to handle the out-of-vocabulary (OOV) problem: it can not only generate words from the decoder but also copy words from the source document. Evaluation on five benchmark datasets shows that our model significantly outperforms the current state-of-the-art models.


Introduction
A keyphrase is an ordered list of words that captures the main points discussed in a natural language document [1]. Keyphrases are a significant way for people to quickly grasp the key points of a document and have been widely used in many text mining tasks, such as information retrieval, natural language processing, document summarization, and text classification [2]. Owing to their public accessibility, researchers usually adopt datasets of scientific and technical publications as test beds for keyphrase extraction algorithms. Similarly, we also use datasets of scientific publications to conduct keyphrase extraction [3].
Generally, existing keyphrase extraction approaches contain two components: keyphrase candidate search and keyphrase selection. Keyphrase candidate search extracts a set of candidate keyphrases from a document. Researchers have tried to use N-grams or noun phrases and to compute the tightness of the internal connections in various ways to determine whether a candidate is a phrase with independent semantics [4]. After a candidate set is extracted, these approaches select proper keyphrases by ranking the importance of the candidates, either through supervised methods [5, 6] or unsupervised methods. Unsupervised methods rank keyphrases by statistical features of the candidates, such as TF-IDF [7], or by graph-based algorithms such as TextRank [8] and HITS [9, 10]. In supervised algorithms, a classifier is trained on documents annotated with keyphrases to determine whether a candidate phrase is a keyphrase or not.
However, the abovementioned keyphrase extraction approaches have two main drawbacks. First, they are unable to extract keyphrases that do not match any contiguous subsequence of the source document (called absent keyphrases; ones that fully match a part of the text are present keyphrases). Second, they cannot capture the semantic meaning of the document. Recently, RNN-based sequence-to-sequence frameworks [11] have achieved great success in sequence generation and provide an end-to-end solution for producing absent keyphrases from the source document. To overcome the abovementioned drawbacks, Meng et al. (2017) first introduced CopyRNN [12], an RNN-based sequence-to-sequence framework, into this task, incorporating the copying mechanism proposed by Gu et al. [13]. The copy mechanism is capable of solving the out-of-vocabulary (OOV) problem and allows the model to locate the important parts of the document. Different from traditional keyphrase extraction, CopyRNN can generate absent keyphrases. Therefore, we call this approach neural keyphrase generation (NKG).
A scientific publication generally consists of a title, an abstract, and the main body. The experimental results of supervised methods [12] indicate that using the abstract instead of the full text achieves better performance due to the noise in the full text. The personal writing styles of authors, different vocabulary ranges, and different fields hinder denoising. Therefore, a major challenge for keyphrase extraction is how NKG, as a supervised method, can obtain high-quality keyphrases from a noisy source document. Reference [12] uses the title and abstract as the source document, discarding the main body, which denoises to some extent. However, the input sequence of the neural network in [12] is the concatenation of the title and abstract. The title usually represents the topics of the document; it is the least noisy and shortest sequence in the document. Although the abstract is shorter and less noisy than the main body, it is still much longer and noisier than the title. The approach of Meng [12] is therefore equivalent to concatenating a low-noise short sequence with a high-noise long sequence, obtaining a high-noise long sequence.
We hypothesize that the semantics of the keyphrases and the semantics of the title are highly correlated; the relations between the abstract and the keyphrases are also important. In fact, according to our statistics on five benchmark datasets, nearly 60% of the words in the title (after removing stop words such as a, the, and with) also appear in keyphrases, confirming our hypothesis. Motivated by this hypothesis and these statistics, we propose a novel architecture to encode the representations of the title and abstract. It takes into account the correlation between the title and keyphrases, computing the relevance between the title and abstract to denoise. As background: different from traditional statistical machine translation, neural machine translation aims to build a single neural network whose translation performance is maximized through joint training; recently proposed neural machine translation models usually follow the encoder-decoder paradigm, which encodes the source sentence into a fixed-length vector from which the decoder generates the translation. In addition, to further enrich the informativeness of the current hidden state for next-word prediction, we introduce multihead attention instead of Bahdanau attention [14]. Our model consists of three parts:

(1) This is the first work to model the title and abstract of a document separately and to consider the relationship between them
(2) Multihead attention builds a title-aware abstract representation and an abstract-aware title representation, and self-attention builds the representation of the document
(3) A hybrid between an attention-based RNN decoder and a pointer network generates the tokens

The key contributions of this paper are as follows. First, this is the first work to model the title and abstract separately and consider the relationship between them. Second, we employ multihead attention [15] instead of Bahdanau attention [14] to calculate the content vector and compute the copy distribution based on it. Third, we apply a pointer network that enables the model to copy words from the source document via pointing [16], which improves accuracy and the handling of OOV words. Lastly, we apply our model to the recently introduced KP20k dataset [12] and four other popular datasets, outperforming the current state-of-the-art neural keyphrase generation model.
The remainder of the paper is organized as follows. Section 2 introduces related work. Section 3 proposes the mutual-attention net. Section 4 reports our experimental results. Section 5 gives the analysis, and we conclude the paper in Section 6.

Related Work
2.1. Encoder-Decoder Model. The RNN-based encoder-decoder framework has achieved state-of-the-art performance in translation tasks. The RNN encoder-decoder can also serve as a component of a traditional phrase-based SMT (PSMT) system: building on traditional statistical machine translation, a joint model (RNN + PSMT) is created by integrating the RNN encoder-decoder in a way that is compatible with PSMT. The joint model is not only effective in applications such as Uyghur-Chinese and Chinese-English machine translation but can also capture regularities of language, and BLEU, an important evaluation metric in machine translation, is significantly improved [11]. However, models without an attention mechanism only use the last encoder state to initialize the decoder, in which case the last encoder state serves as the context vector. With attention, at each decoding time step, an attention distribution is generated and the weighted sum of the encoder states is calculated as the context vector [14]. The weights of this sum are the attention scores, which let the decoder dynamically focus on different parts of the input sequence while generating the output sequence. Subsequently, this framework achieved remarkable performance in tasks such as abstractive summarization [13, 16, 17], image captioning [18, 19], and other sequence generation tasks. In abstractive summarization, the key information is often low-frequency vocabulary in the corpus, or even outside the vocabulary, and thus cannot be recalled. Therefore, pointer networks have been introduced into the encoder-decoder framework [13, 16], and different copy mechanisms have been proposed to solve the OOV problem.
Meng [12] first introduced the RNN-based encoder-decoder framework to keyphrase extraction and applied the model proposed by Gu et al. [13], aiming to overcome the limitation that traditional approaches cannot generate absent keyphrases; the resulting model is called CopyRNN. CopyRNN outperforms popular existing keyphrase extraction algorithms.

Neural Keyphrase Generation.
Our copy mechanism, which originates in [16], is close to that of CopyRNN [26], but there are some small differences: we recycle the attention distribution to serve as the copy distribution, whereas CopyRNN uses two separate distributions. Our model can copy any word from the source document, whereas the pointer components of CopyRNN activate only for OOV words [2, 13].

Multihead Attention. Multihead attention, proposed by Vaswani et al. [15], has been successfully applied to many tasks, including semantic role labeling [27] and relation extraction [28]. In this paper, we adopt multihead attention to compute the title-match abstract representation and the abstract-match title representation.

Model Analysis
We first describe the generic sequence-to-sequence attention-based model in Section 3.1 and then introduce our model in Section 3.2.

RNN Encoder-Decoder with Attention Mechanism.
We start by briefly describing the underlying framework proposed by Bahdanau et al. [14]. In an RNN encoder-decoder model with an attention mechanism, the encoder RNN reads an input sequence X = (x_1, x_2, ..., x_{T_X}) into a sequence of hidden state vectors h = (h_1, h_2, ..., h_{T_X}). On each step t, another RNN, called the decoder, receives the word embedding of the previous word y_{t-1} (while training, this is the previous word of the reference keyphrase; at test time, it is the previous word emitted by the decoder) and has decoder state s_t.
The attention distribution a_t is calculated as in [14]:

e_t^i = v^T tanh(W_h h_i + W_s s_t + b_attn),   a_t = softmax(e_t),

where v, W_h, W_s, and b_attn are the learnable parameters. Then, the attention distribution is used to compute a weighted sum of the encoder hidden states, known as the content vector c_t:

c_t = Σ_i a_t^i h_i.

Next, c_t is concatenated with the decoder state s_t and fed into linear layers to produce the predictive vocabulary distribution P_vocab:

P_vocab = softmax(V_2(V_1[s_t; c_t] + b_1) + b_2),

where V_1, V_2, b_1, and b_2 are the learnable parameters. P_vocab is a probability distribution over all words in the vocabulary and provides the final distribution from which to predict the word y_t at step t:

P(y_t) = P_vocab(y_t).

Denote all the parameters to be learned in the sequence-to-sequence attentional model as θ. The training objective is to minimize the negative log-likelihood of the reference keyphrase words:

L(θ) = −Σ_t log P(y_t* | y_1*, ..., y_{t−1}*, X; θ).
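As a concrete illustration of the attention step and vocabulary projection above, here is a minimal PyTorch sketch; the module names and tensor shapes are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    """Computes the attention distribution a_t and content vector c_t."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_s = nn.Linear(hidden_size, hidden_size)  # its bias plays the role of b_attn
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, h, s_t):
        # h: (batch, T_x, hidden) encoder states; s_t: (batch, hidden) decoder state
        e_t = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)
        a_t = F.softmax(e_t, dim=-1)                      # attention distribution
        c_t = torch.bmm(a_t.unsqueeze(1), h).squeeze(1)   # content vector
        return a_t, c_t

class VocabProjection(nn.Module):
    """P_vocab = softmax(V2(V1[s_t; c_t] + b1) + b2)."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.V1 = nn.Linear(2 * hidden_size, hidden_size)
        self.V2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, s_t, c_t):
        return F.softmax(self.V2(self.V1(torch.cat([s_t, c_t], dim=-1))), dim=-1)
```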

Proposed Approach.
In this subsection, we first describe the task of keyphrase generation and then present our model in the following order: (1) the encoder of the title and abstract, (2) our mutual-attention architecture, and (3) the hybrid decoder. Given a document X and its keyphrase set M = {m_1, m_2, ..., m_i}, (X, M) is a training pair, and we split (X, M) into i pairs (X, m_1), (X, m_2), ..., (X, m_i); then, the model is ready to learn the mapping from source to target. Finally, we use a bidirectional long short-term memory (Bi-LSTM) network [29] to obtain the new representations of the title and abstract.

Title and Abstract Encoder.
The title encoder and the abstract encoder produce the hidden state sequences h^H = (h_t^H), t = 1, ..., T_H, and h^S = (h_t^S), t = 1, ..., T_S, for the title and abstract, respectively. h_{T_H}^H and h_{T_S}^S are the last hidden states produced by the title encoder and abstract encoder, respectively. They are fed into fully connected layers to calculate the initial hidden state s_0 that starts the decoder:

s_0 = V_3 tanh(W_H h_{T_H}^H + W_S h_{T_S}^S),

where V_3, W_H, and W_S are the learnable parameters.
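A sketch of this dual encoder in PyTorch follows; the shared embedding table and the exact form of the s_0 fusion are our assumptions, with sizes taken from the implementation details in Section 4.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Separate Bi-LSTM encoders for title and abstract, fused into s_0."""
    def __init__(self, vocab_size, emb_size=200, hidden_size=258):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)  # assumed shared between fields
        self.title_rnn = nn.LSTM(emb_size, hidden_size, bidirectional=True, batch_first=True)
        self.abstract_rnn = nn.LSTM(emb_size, hidden_size, bidirectional=True, batch_first=True)
        self.W_H = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.W_S = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.V_3 = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, title_ids, abstract_ids):
        h_H, _ = self.title_rnn(self.embed(title_ids))        # (batch, T_H, 2*hidden)
        h_S, _ = self.abstract_rnn(self.embed(abstract_ids))  # (batch, T_S, 2*hidden)
        # fuse the two last hidden states into the decoder's initial state s_0
        s_0 = self.V_3(torch.tanh(self.W_H(h_H[:, -1]) + self.W_S(h_S[:, -1])))
        return h_H, h_S, s_0
```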

Mutual Attention.
To determine the relevance between the title and abstract, we adopt the multihead attention formulation, as shown in Figure 2, to compute the title-match abstract representation, the abstract-match title representation, and the representation u^X of the document. We call this title-abstract mutual attention. Multihead attention is defined as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,   head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where W_i^Q, W_i^K, and W_i^V are the learnable parameters and Attention refers to scaled dot-product attention; please see reference [15] for more details.
Figure 3 gives an overview of obtaining the document representation. In addition to the multihead attention layers, we use a residual block which contains a fully connected feed-forward network with a residual connection [30] and layer normalization [31]:

FFN(x) = W_2 max(0, W_1 x + b_1) + b_2,

where W_1, W_2, b_1, and b_2 are the learnable parameters.
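A minimal PyTorch sketch of one such attention-plus-residual block, using the library's built-in multihead attention; the head count and feed-forward width are our assumptions (the paper does not state them).

```python
import torch.nn as nn

class MutualAttentionBlock(nn.Module):
    """Multihead attention followed by the residual FFN sub-layer with layer norm."""
    def __init__(self, d_model=516, n_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # FFN(x) = W2 max(0, W1 x + b1) + b2
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, queries, keys_values):
        attn_out, _ = self.mha(queries, keys_values, keys_values)
        x = self.norm1(queries + attn_out)   # residual connection + layer normalization
        return self.norm2(x + self.ffn(x))
```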
(1) Computing the Title-Match Abstract Representation. We take h^H as the queries and h^S as the keys and values to compute the title-match abstract representation u^H:

u^H = MultiHead(h^H, h^S, h^S),

so that each word of the title receives a weight distribution over the abstract, and u^H is a correspondingly weighted representation of the abstract.
(2) Computing the Abstract-Match Title Representation. Computing u^S differs slightly from computing u^H. At each time step of a recurrent neural network, old information changes with the current input; for longer sentences, the information stored at time step t − k (k ≪ T) undergoes a gradual transformation after time step t, and during backpropagation the signal must flow through many time steps to update the network parameters and minimize the loss. On top of the residual block, we therefore use a gate unit to control the information flow for denoising. The abstract-match title representation u^S is computed as

u^S = g ⊗ MultiHead(h^S, h^H, h^H),   g = σ(W_G MultiHead(h^S, h^H, h^H)),

where W_G is the learnable parameter, ⊗ is the element-wise product, and σ is the sigmoid function.
Then, u^H and u^S are concatenated to form the document representation u^X.
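Reusing the MutualAttentionBlock sketched above, the mutual-attention layer could be wired as follows; the placement of the sigmoid gate reflects our reading of the equations and is an assumption.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Builds u_H, the gated u_S, and their concatenation u_X."""
    def __init__(self, d_model=516, n_heads=4):
        super().__init__()
        self.title_to_abstract = MutualAttentionBlock(d_model, n_heads)
        self.abstract_to_title = MutualAttentionBlock(d_model, n_heads)
        self.W_G = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h_H, h_S):
        # queries = title, keys/values = abstract: title-match abstract representation
        u_H = self.title_to_abstract(h_H, h_S)
        # queries = abstract, keys/values = title, with a sigmoid gate controlling
        # how much of this information flows through (denoising)
        m = self.abstract_to_title(h_S, h_H)
        u_S = torch.sigmoid(self.W_G(m)) * m   # element-wise product
        # document representation: concatenate along the time axis
        return torch.cat([u_H, u_S], dim=1)
```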
3.2.4. Hybrid Decoder. Our decoder is a hybrid between an attention-based RNN decoder and a pointer network [16, 26]. Different from [16], we use the output of the encoder u^X and the current hidden state s_t instead of s_{t−1} to determine which word to copy (calculating the attention distribution a_t and the content vector c_t at step t).
Given a fixed vocabulary V = {v_1, v_2, ..., v_N}, an RNN decoder can only generate words from V; OOV words are marked as "UNK," so a plain RNN decoder is unable to recall any keyphrase that contains "UNK." We therefore introduce a copy mechanism based on pointer components, the pointer-generator network [16], into the sequence-to-sequence attention-based model; it enables the RNN to predict OOV words by copying them from the source document. For time step t, we calculate the generation probability p_gen ∈ [0, 1] from the context vector c_t, the decoder state s_t, and the decoder input x_t:

p_gen = σ(w_c^T c_t + w_s^T s_t + w_x^T x_t + b_ptr),

where w_c, w_s, w_x, and b_ptr are the learnable parameters and σ is the sigmoid function. Then, p_gen is used as a switch to choose between generating a word y ∈ V from the vocabulary by sampling from P_vocab or copying a word y ∈ X from the source document by sampling from the attention distribution a_t, denoted P_copy. The final probability of a word y is

P(y) = p_gen P_vocab(y) + p_copy P_copy(y),   if y ∈ V and y ∈ X,
P(y) = p_copy P_copy(y),                      if y ∉ V and y ∈ X,
P(y) = p_gen P_vocab(y),                      if y ∈ V and y ∉ X,

where p_copy = 1 − p_gen.
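A sketch of this copy/generate mixture, following the pointer-generator of See et al. [16]; the extended-vocabulary bookkeeping for OOV source words is simplified here, and the tensor shapes are our assumptions.

```python
import torch
import torch.nn as nn

class PointerGenerator(nn.Module):
    """Mixes P_vocab with the copy distribution using p_gen as a soft switch."""
    def __init__(self, hidden_size, emb_size):
        super().__init__()
        self.w_c = nn.Linear(hidden_size, 1, bias=False)
        self.w_s = nn.Linear(hidden_size, 1, bias=False)
        self.w_x = nn.Linear(emb_size, 1)   # its bias plays the role of b_ptr

    def forward(self, c_t, s_t, x_t, p_vocab, a_t, src_ids):
        # p_gen in [0, 1] from context vector, decoder state, and decoder input
        p_gen = torch.sigmoid(self.w_c(c_t) + self.w_s(s_t) + self.w_x(x_t))
        p_final = p_gen * p_vocab                       # generation part
        # add each source position's copy mass (1 - p_gen) * a_t to its word id
        return p_final.scatter_add(-1, src_ids, (1 - p_gen) * a_t)
```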

Experiments and Results
This section begins with the experimental setup. Then, we report our results.

Experiment Setup.
In this subsection, we first describe our benchmark datasets, followed by the baselines and evaluation metrics. Finally, we introduce the implementation details of our model.

Baseline Models. We compare our model against four unsupervised baselines and two supervised baselines, listed below.

(1) TF-IDF: an unsupervised algorithm that ranks candidates by their TF-IDF scores and outputs the top-ranked N-grams as keyphrases.
(2) TextRank: a graph-based unsupervised keyword extraction algorithm that uses PageRank [33] to calculate the importance of words and ranks candidate keyphrases by their PageRank scores.
(3) SingleRank: essentially a TextRank approach with some differences.
(4) ExpandRank: a TextRank extension that exploits neighborhood knowledge for keyphrase extraction.
(5) KEA: a supervised approach that takes TF-IDF, first occurrence, length, and node degree as features and uses a naive Bayes classifier to decide whether a candidate phrase is a keyphrase. It can be used either for free indexing or for indexing with a controlled vocabulary.
(6) Maui: an improvement of KEA that adds new features and extends the vocabulary with Wikipedia.

Evaluation Metric.
To evaluate the performance of keyphrase extraction approaches, we employ the F-measure (F_1) to measure the models' performance on predicting present keyphrases and recall to measure their performance on predicting absent keyphrases:

F_1 = 2PR / (P + R),

where P and R refer to precision and recall.

Implementation Details. In preprocessing [34], lowercasing and replacing all digits with the symbol "<digit>" are applied (see the sketch below). We constructed the vocabulary from the 50K most common words, and out-of-vocabulary words were replaced with the special token "<unk>." Each word was initialized with pretrained GloVe [35] embeddings in a 200-dimensional vector space; the hidden size is 258 for both the encoder LSTM and the decoder LSTM. We use the loss on the validation set for early stopping. Training is performed through stochastic gradient descent with the Adam optimizer [36]; the initial learning rate is 10^−1, and the gradient clipping threshold is 2.
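A minimal sketch of this preprocessing, assuming simple whitespace tokenization (the paper's tokenizer is not specified here).

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase and map digit runs to the <digit> symbol, as described above."""
    tokens = text.lower().split()
    return [re.sub(r"\d+", "<digit>", tok) for tok in tokens]

# e.g. preprocess("GloVe has 200 dimensions and a 50K vocabulary")
# -> ['glove', 'has', '<digit>', 'dimensions', 'and', 'a', '<digit>k', 'vocabulary']
```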
We train on a single Tesla K40 GPU with a batch size of 32. At test time, keyphrases are produced with a batch size of 1, beam search with a beam size of 200, and a maximum depth of 6.

Results
We report our experimental results in this subsection. We evaluate our model on two tasks: (1) predicting present keyphrases and (2) predicting absent keyphrases.

4.2.1. Predicting the Present Keyphrases. We first evaluate the performance of our model on predicting present keyphrases, since traditional extraction models can only extract keyphrases that appear in the source document. The results, including comparisons with our baselines, are shown in Table 2; the best scores are highlighted in bold. We can see that the unsupervised models (TF-IDF, TextRank, SingleRank, and ExpandRank) are more robust than the traditional supervised models (Maui and KEA), and the deep neural networks are as robust as the unsupervised models on all datasets. The results demonstrate that our model improves over RNN and CopyRNN on all benchmark datasets.
Figure 4(a) is an example of the results of RNN, CopyRNN, and MA-net on predicting present keyphrases. Since neither RNN nor CopyRNN models the relationship between the title and the abstract, the phrase "information retrieval," which is not in the ground truth, receives the highest rank from both, and both RNN and CopyRNN generate the phrase "machine learning," which MA-net does not. Although "information retrieval" and "machine learning" are related to the topic of the document, they are too general to be selected as keyphrases; our model predicts finer-grained keyphrases and ranks them higher. Our model MA-net uses multihead attention to model long-term dependencies between titles and abstracts. It highlights the title representation and the abstract representation associated with the title, reducing the noise. Therefore, the phrase "relevance ranking," which is contained in the title, is ranked higher by MA-net than by RNN and CopyRNN, while phrases less relevant to the title, such as "information retrieval," are ranked lower.

Predicting Absent Keyphrases.
As stated before, one advantage of neural keyphrase generation (NKG) is that it can predict absent keyphrases based on an "understanding" of the semantic information.
Only RNN and CopyRNN can handle this task. Therefore, following the previous study [12], we compare the performance of RNN, CopyRNN, and our model in terms of recall of the top 10 and top 50 results. For training, we use both present and absent keyphrases in the training datasets; for evaluation, we use the absent keyphrases in the testing datasets. The results are presented in Table 3. As observed from Table 3, our model outperforms the baselines on all datasets. From Figure 4(b), we observe a result similar to that for present keyphrases: the phrase "content based ranking" enters the top 10, whereas CopyRNN ranks it only 34th, and "video segmentation" enters the top 50, whereas CopyRNN ranks it 64th. This also benefits from modeling the long-term dependencies between titles and abstracts.

Impact of 3 Components.
To further study the impact of the three components we add to the sequence-to-sequence attention-based model, we conduct a set of experiments in this subsection comparing the performance of the following models on the tasks described in Section 4:

(1) RNN: the sequence-to-sequence attention-based model proposed in [11]
(2) CopyRNN: the model proposed in [12], which augments RNN with the copy mechanism proposed in [13]
(3) PG: the pointer-generator network proposed in [16]

The results of predicting the present keyphrases and the absent keyphrases are shown in Tables 4 and 5, respectively. Due to space limitations, we report only the average F_1@5/F_1@10 on predicting present keyphrases and R@10/R@50 on predicting absent keyphrases.

Comparison of RNN, CopyRNN, and PG.
Here, we compare the two sequence-to-sequence attention-based models that both have a copy mechanism. As we can see, PG greatly outperforms RNN.

(Figure 4 example document. Title: "Towards content-based relevance ranking for video search." The abstract proposes a relevance-ranking approach for web video search that, instead of relying only on file names, URLs, and surrounding text, segments videos into shots and uses detailed content information such as semantic descriptions and speech of each shot, together with video metadata, to improve retrieval and ranking; a learning-based ranking system is compared with an IR-model-based method.)
The results of the two feature methods on predicting present keyphrases are shown in Table 6: method II outperforms method I, achieving higher F_1 scores. We offer one possible explanation for this observation. In method I, the features are merged with the word embeddings in the first layer of our model, and the rest of the network must be trained from scratch. In method II, the merging of the features with the text representation (u^X) directly participates in the attention computation, making it easier to learn which words are important and thereby improving the accuracy of generation.
We do not report results for predicting absent keyphrases since neither method has an effect there. We believe that, compared with predicting present keyphrases, predicting absent keyphrases happens at a higher semantic level and requires a better understanding of the content, whereas hand-crafted features are not semantic features. Relatedly, the advantage of our model on predicting absent keyphrases is not as pronounced as on predicting present keyphrases. We observe from Tables 2 and 3 that our model is 241.2%, 133.3%, 105.3%, 87.9%, and 96.7% higher than RNN on F_1@5 and 456.3%, 207.9%, 166.2%, 87.9%, and 47.1% higher on F_1@10 for present keyphrases, but only 64.5%, 24.2%, 12.0%, 12.2%, and 78.3% higher on recall@10 and 77.1%, 32.1%, 33.7%, 40.0%, and 57.6% higher on recall@50 for absent keyphrases. This raises the question: how generative is our model? The value of the generation probability p_gen gives a measure of the generativeness of the model. During testing, the model is heavily inclined to copy: the mean value of p_gen is only 0.13. This may explain why the performance on predicting present keyphrases is much better than that on predicting absent keyphrases.

Conclusion
In this paper, we present a deep attentional neural network called MA-net for keyphrase generation. We introduce multihead attention to obtain representations of the title and document and then use a pointer network to locate the words to copy. Our model achieves state-of-the-art results on the KP20k dataset and four other popular datasets. For future work, we will design new network structures to improve the performance of predicting absent keyphrases and consider the correlation among keyphrases.

Figure 1 shows the structure of our encoder. In all experimental datasets, each document X contains a title and an abstract; the words are converted to their word-level embeddings e_t^X.

Figure 1: Encoder of the title and abstract.

Table 1: Statistics of the datasets. Total: total dataset size. Train: size of the training set. Validation: size of the validation set. Test: size of the test set.

Figure 3: Hybrid decoder with copy mechanism and multihead attention.

Table 2: Comparison results of predicting present keyphrases (F_1 score at top 5 and top 10).

Figure 4: An example of results by RNN, CopyRNN, and MA-net. The ground truth is highlighted in bold.

Table 3: Comparison results of predicting absent keyphrases (recall at top 10 and top 50).
5.2. Effect of Hand-Crafted Features. We report the effect of hand-crafted features (POS tags and named entities as well as TF and IDF) in this subsection. For discrete features such as POS tags, we use one-hot representations; for continuous features such as TF and IDF, we convert them into categorical values by discretizing them into a fixed number of bins and use one-hot representations to indicate the bin each value falls into (illustrated in the sketch below). We try two ways to use the features:
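A sketch of the binning step just described; the number of bins is our assumption.

```python
import numpy as np

def one_hot_bin(values, n_bins=10):
    """Discretize continuous feature values into bins and one-hot encode them."""
    edges = np.linspace(min(values), max(values), n_bins + 1)
    ids = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    return np.eye(n_bins)[ids]   # shape: (len(values), n_bins)

# e.g. one_hot_bin([0.02, 0.41, 0.97], n_bins=4) marks which bin of the observed
# range each TF or IDF value falls into
```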

Table 4: Impact of the 3 components on predicting the present keyphrases.

Table 5: Impact of the 3 components on predicting the absent keyphrases.

5.3. How Generative Is Our Model? Our model can not only generate words from the fixed vocabulary but also copy words from the source document. Therefore, it can be viewed as a balance between extraction and generation.

Table 6: The effect of the two feature-rich methods on predicting the present keyphrases.