Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges

ive sentence summarization with attentive recurrent neural networks (RAS) Quasi-recurrent neural network (QRNN) + CNN Generating news headlines with recurrent neural networks Abstractive text summarization using attentive sequence-to-sequence RNNsive text summarization using attentive sequence-to-sequence RNNs Selective encoding for abstractive sentence summarization (SEASS) Faithful to the original: fact aware neural abstractive summarization (FTSumg) Improving transformer with sequential context representations for abstractive text summarization (RCT) Multisentence summary Get to the point: summarization with pointer-generator networks Reinforcement learning (RL) Generative adversarial network for abstractive text summarization Exploring semantic phrases (ATSDL) + CNN Bidirectional attentional encoder-decoder model and bidirectional beam search Key information guide network Improving abstraction in text summarization Dual encoding for abstractive text summarization (DEATS) Bidirectional decoder (BiSum) Text summarization with pretrained encoders A text abstraction summary model based on BERT word embedding and reinforcement learning Transformer-based model for single-document neural summarization Text summarization method based on double attention pointer network (DAPT) Figure 4: Taxonomy of several approaches that use a recurrent neural network and attention mechanism in abstractive text summarisation based on the summary type. Mathematical Problems in Engineering 7


Introduction
Currently, there are vast quantities of textual data available, including online documents, articles, news, and reviews that contain long strings of text that need to be summarised [1]. e importance of text summarisation is due to several reasons, including the retrieval of significant information from a long text within a short period, easy and rapid loading of the most important information, and resolution of the problems associated with the criteria needed for summary evaluation [2]. Due to the evolution and growth of automatic text summarisation methods, which have provided significant results in many languages, these methods need to be reviewed and summarised. erefore, in this review, we surveyed the most recent methods and focused on the techniques, datasets, evaluation measures, and challenges of each approach, in addition to the manner in which each method addressed challenges.
Applications such as search engines and news websites use text summarisation [1]. In search engines, previews are produced as snippets, and news websites generate headlines to describe the news to facilitate knowledge retrieval [3,4]. Text summarisation can be divided into several categories based on function, genre, summary context, type of summarizer, and number of documents [5]; one specific text summarisation classification approach divides the summarisation process into extractive and abstractive categories [6].
Extractive summarisation extracts or copies some parts from the original text based on scores computed using either statistical features or linguistic features, while abstractive summarisation rephrases the original text to generate new phrases that may not be in the original text, which is considered a difficult task for a computer. As abstractive text summarisation requires an understanding of the document to generate the summary, advanced machine learning techniques and extensive natural language processing (NLP) are required. us, abstractive summarisation is harder than extractive summarisation since abstractive summarisation requires real-word knowledge and semantic class analysis [7]. However, abstractive summarisation is also better than extractive summarisation since the summary is an approximate representation of a human-generated summary, which makes it more meaningful [8]. For both types, acceptable summarisation should have the following: sentences that maintain the order of the main ideas and concepts presented in the original text, minimal to no repetition, sentences that are consistent and coherent, and the ability to remember the meaning of the text, even for long sentences [7]. In addition, the generated summary must be compact while conveying important information about the original text [2,9].
Abstractive text summarisation approaches include structured and semantic-based approaches. Structured approaches encode the crucial features of documents using several types of schemas, including tree, ontology, lead and body phrases, and template and rule-based schemas, while semantic-based approaches are more concerned with the semantics of the text and thus rely on the information representation of the document to summarise the text. Semantic-based approaches include the multimodal semantic method, information item method, and semantic graph-based method [10][11][12][13][14][15][16][17].
Deep learning techniques were employed in abstractive text summarisation for the first time in 2015 [18], and the proposed model was based on the encoder-decoder architecture. For these applications, deep learning techniques have provided excellent results and have been extensively employed in recent years.
Raphal et al. surveyed several abstractive text summarisation processes in general [19]. eir study differentiated between different model architectures, such as reinforcement learning (RL), supervised learning, and attention mechanism. In addition, comparisons in terms of word embedding, data processing, training, and validation had been performed. However, there are no comparisons of the quality of several models that generated summaries.
Furthermore, both extractive and abstractive summarisation models were summarised in [20,21]. In [20], the classification of summarisation tasks was based on three factors: input factors, purpose factors, and output factors. Dong and Mahajani et al. surveyed only five abstractive summarisation models each. On the other hand, Mahajani et al. focused on the datasets and training techniques in addition to the architecture of several abstractive summarisation models [21]. However, the quality of the generated summary of the different techniques and the evaluation measures were not discussed.
Shi et al. presented a comprehensive survey of several abstractive text summarisation models, which are based on sequence-to-sequence encoder-decoder architecture for convolutional and RNN seq2seq models. e focus was the structure of the network, training strategy, and the algorithms employed to generate the summary [22]. Although several papers have analysed abstractive summarisation models, few papers have performed a comprehensive study [23]. Moreover, most of the previous surveys covered the techniques until 2018, even though surveys were published in 2019 and 2020, such as [20,21]. In this review, we addressed most of the recent deep learning-based RNN abstractive text summarisation models. Furthermore, this survey is the first to address recent techniques applied in abstractive summarisation, such as Transformer.
is paper provides an overview of the approaches, datasets, evaluation measures, and challenges of deep learningbased abstractive text summarisation, and each topic is discussed and analysed. We classified the approaches based on the output type into: single-sentence summary and multisentence summary approaches. Also, within each classification, we compared between the approaches in terms of architecture, dataset, dataset preprocessing, evaluation, and results. e remainder of this paper is organised as follows: Section 2 introduces a background of several deep learning models and techniques, such as the recurrent neural network (RNN), bidirectional RNN, attention mechanisms, long short-term memory (LSTM), gated recurrent unit (GRU), and sequenceto-sequence models. Section 3 describes the most recent singlesentence summarisation approaches, while the multisentence summarisation approaches are covered in Section 4. Section 5 and Section 6 investigate datasets and evaluation measures, respectively. Section 7 discusses the challenges of the summarisation process and solutions to these challenges. Conclusions and discussion are provided in Section 8.

Background
Deep learning analyses complex problems to facilitate the decision-making process. Deep learning attempts to imitate what the human brain can achieve by extracting features at different levels of abstraction. Typically, higher-level layers have fewer details than lower-level layers [24]. e output layer will produce an output by nonlinearly transforming the input from the input layer. e hierarchical structure of deep learning can support learning. e level of abstraction of a certain layer will determine the level of abstraction of the next layer since the output of one layer will be the input of the next layer. In addition, the number of layers determines the deepness, which affects the level of learning [25].
Deep learning is applied in several NLP tasks since it facilitates the learning of multilevel hierarchal representations of data using several data processing layers of nonlinear units [24,[26][27][28]. Various deep learning models have been employed for abstractive summarisation, including RNNs, convolutional neural networks (CNNs), and sequence-tosequence models. We will cover deep learning models in more detail in this section.

RNN Encoder-Decoder
Summarization. RNN encoderdecoder architecture is based on the sequence-to-sequence model. e sequence-to-sequence model maps the input sequence in the neural network to a similar sequence that consists of characters, words, or phrases.
is model is utilised in several NLP applications, such as machine translation and text summarisation. In text summarisation, the input sequence is the document that needs to be summarised, and the output is the summary [29,30], as shown in Figure 1. An RNN is a deep learning model that is applied to process data in sequential order such that the input of a certain state depends on the output of the previous state [31,32]. For example, in a sentence, the meaning of a word is closely related to the meaning of the previous words. An RNN consists of a set of hidden states that are learned by the neural network. An RNN may consist of several layers of hidden states, where states and layers learn different features. e last state of each layer represents the whole inputs of the layer since it accumulates the values of all previous states [5]. For example, the first layer and its state can be employed for part-of-speech tagging, while the second layer learns to create phrases. In text summarisation, the input for the RNN is the embedding of words, phrases, or sentences, and the output is the word embedding of the summary [5].
In the RNN encoder-decoder model, at the encoder side, at certain hidden states, the vector representation of the current input word and the output of the hidden states of all previous words are combined and fed to the next hidden state. As shown in Figure 1, the vector representation of the word W3 and the output of the hidden states he1 and he2 are combined and fed as input to the hidden states he3. After feeding all the words of the input string, the output generated from the last hidden state of the encoder are fed to the decoder as a vector referred to as the context vector [29]. In addition to the context vector, which is fed to the first hidden state of the decoder, the start-of-sequence symbol 〈SOS〉 is fed to generate the first word of the summary from the headline (assume W5, as shown in Figure 1). In this case, W5 is fed as the input to the next decoder hidden state. Each generated word is passed as an input to the next decoder hidden state to generate the next word of the summary. e last generated word is the end-of-sequence symbol 〈EOS〉. Before generating the summary, each output from the decoder will take the form of a distributed representation before it is sent to the softmax layer and attention mechanism to generate the next summary [29].

Bidirectional RNN. Bidirectional RNN consists of forward
RNNs and backward RNNs. Forward RNNs generate a sequence of hidden states after reading the input sequence from left to right. On the other hand, the backward RNNs generate a sequence of hidden states after reading the input sequence from right to left. e representation of the input sequence is the concatenation of the forward and backward RNNs [33]. erefore, the representation of each word depends on the representation of the preceding (past) and following (future) words. In this case, the context will contain the words to the left and the words to the right of the current word [34].
Using bidirectional RNN enhances the performance. For example, if we have the following input text "Sara ate a delicious pizza at dinner tonight," in this case, assume that we want to predict the representation of the word "dinner," using bidirectional RNN and the forward LSTMs represent "Sara ate a delicious pizza at" while the backward LSTM represents "tonight." Considering the word "tonight" when representing the word "dinner" provides better results.
On the other hand, using the bidirectional RNN at the decoder size minimizes the probability of the wrong prediction. e reason for this is that the unidirectional RNN only considers the previous prediction and reason only about the past. erefore, if there is an error in previous prediction, the error will accumulate in all subsequent predictions, and this problem can be addressed using the bidirectional RNN [35].

Gated Recurrent Neural Networks (LSTM and GRU).
Gated RNNs are employed to solve the problem of vanishing gradients, which occurs when training a long sequence using an RNN. is problem can be solved by allowing the gradients to backpropagate along a linear path using gates, where each gate has a weight and a bias. Gates can control and modify the amount of information that flows between hidden states. During training, the weights and biases of the gates are updated. e most popular gated RNNs are LSTM [36] and GRU [37], which are two variants of an RNN.

Long Short-Term Memory (LSTM).
e repeating unit of the LSTM architecture consists of input/read, memory/ update, forget, and output gates [5,7], but the chaining structure is the same as that of an RNN. e four gates share information with each other; thus, information can flow in loops for a long period of time. e four gates of each LSTM unit, which are shown in Figures 2 and 3, are discussed here.
(1) Input Gate. In the first timestep, the input is a vector that is initialised randomly, while in subsequent steps, the input of the current step is the output (content of the memory cell) of the previous step. In all cases, the input is subject to element-wise multiplication with the output of the forget gate. e multiplication result is added to the current memory gate output.
(2) Forget Gate. A forget gate is a neural network with one layer and a sigmoid activation function. e value of the sigmoid function will determine if the information of the previous state should be forgotten or remembered. If the sigmoid value is 1, then the previous state will be remembered, but if the sigmoid value is 0, then the previous state will be forgotten. In language modelling, for example, the forget gate remembers the gender of the subject to produce the proper pronouns until it finds a new subject. ere are four inputs for the forget gate: the output of the previous block, the input vector, the remembered information from the previous block, and the bias.
(3) Memory Gate. e memory gate controls the effect of the remembered information on the new information. e memory gate consists of two neural networks. e first network has the same structure as the forget gate but a different bias, and the second neural network has a tanh activation function and is utilised to generate the new Mathematical Problems in Engineering Context vector Encoder Decoder Figure 1: Sequence-to-sequence; the last hidden state of the encoder is fed as input to the decoder with the symbol EOS [51]. information. e new information is formed by adding the old information to the result of the element-wise multiplication of the output of the two memory gate neural networks.
(4) Output Gate. e output gates control the amount of new information that is forwarded to the next LSTM unit. e output gate is a neural network with a sigmoid activation function that considers the input vector, the previous hidden state, the new information, and the bias as input. e output of the sigmoid function is multiplied by the tanh of the new information to produce the output of the current block.

Gated Recurrent Unit (GRU).
A GRU is a simplified LSTM with two gates, a reset gate and an update gate, and there is no explicit memory. e previous hidden state information is forgotten when all the reset gate elements approach zero; then, only the input vector affects the candidate hidden state. In this case, the update gate acts as a forget gate. LSTM and GRU are commonly employed for abstractive summarisation since LSTM has a memory unit that provides extra control; however, the computation time of the GRU is reduced [38]. In addition, while it is easier to tune the parameters with LSTM, the GRU takes less time to train [30].

Attention
Mechanism. e attention mechanism was employed for neural machine translation [33] before being utilised for NLP tasks such as text summarisation [18]. A basic encoder-decoder architecture may fail when given long sentences since the size of encoding is fixed for the input string; thus, it cannot consider all the elements of a long input. To remember the input that has a significant impact on the summary, the attention mechanism was introduced [29]. e attention mechanism is employed at each output word to calculate the weight between the output word and every input word; the weights add to one. e advantage of using weights is to show which input word must receive attention with respect to the output word. e weighted average of the last hidden layers of the decoder in the current step is calculated after passing each input word and fed to the softmax layer along the last hidden layers [39].

Beam Search.
Beam search and greedy search are very similar; however, while greedy search considers only the best hypothesis, beam search considers b hypotheses, where b represents the beam width or beam size [5]. In text summarisation tasks, the decoder utilises the final encoder representation to generate the summary from the target vocabulary. In each step, the output of the decoder is a probability distribution over the target word. us, to obtain the output word from the learned probability, several methods can be applied, including (1) greedy sampling, which selects the distribution mode, (2) 1-best or beam search, which selects the best output, and (3) n-best or beam search, which select several outputs. When n-best beam search is employed, the top b most relevant target words are selected from the distribution and fed to the next decoder state. e decoder keeps only the top k1 of k words from the different inputs and discards the rest.

Distributed Representation (Word Embedding).
A word embedding is a word distributional vector representation that represents the syntax and semantic features of words [40]. Words must be converted to vectors to handle various NLP challenges such that the semantic similarity between words can be calculated using cosine similarity, Euclidean distance, etc. [41][42][43]. In NLP tasks, the word embeddings of the words are fed as inputs to neural network models. In the recurrent neural network encoder-decoder architecture, which is employed to generate the summaries, the input of the model is the word embedding of the text, and the output is the word embedding of the summary.
In NLP, there are several word embedding models, such as Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT), which are the most recently employed word embedding models [41,[44][45][46][47]]. e Word2Vec model consists of two approaches, skip-gram and continuous bag-of-words (CBOW), which both depend on the context window [41]. On the other hand, GloVe represents the Mathematical Problems in Engineering global vector, which is based on statistics of the global corpus instead of the context window [44]. FastText extends the skipgram of the Word2Vec model by using the subword internal information to address the out-of-vocabulary (OOV) terms [46]. In FastText, the subword components are composed to build the vector representation of the words, which facilitates representation of the word morphology and lexical similarity. e BERT word embedding model is based on a multilayer bidirectional transformer encoder [47,48]. Instead of using sequential recurrence, the transformer neural network utilises parallel attention layers. BERTcreates a single large transformer by combining the representations of the words and sentences. Furthermore, BERT is pretrained with an unsupervised objective over a large amount of text.

Transformers.
e contextual representations of language are learned from large corpora. One of the new language representations, which extend word embedding models, is referred to as BERT mentioned in the previous section [48]. In BERT, two tokens are inserted to the text. e first token (CLS) is employed to aggregate the whole text sequence information. e second token is (SEP); this token is inserted at the end of each sentence to represent it. e resultant text consists of tokens, where each token is assigned three types of embeddings: token, segmentation, and position embeddings. Token embedding is applied to indicate the meaning of a token. Segmentation embedding identifies the sentences, and position embedding determines the position of the token. e sum of the three embeddings is fed to the bidirectional transformer as a single vector. Pretrained word embedding vectors are more precise and rich with semantic features. BERT has the advantage of fine-tuning (based on the objectives of certain tasks) and feature-based methods. Moreover, transformers compute the presentation of the input and output by using self-attention, where the self-attention enables the learning of the relevance between the "word-pair" [47].

Single-Sentence Summary
Recently, the RNN has been employed for abstractive text summarisation and has provided significant results. erefore, we focus on abstractive text summarisation based on deep learning techniques, especially the RNN [49]. We discussed the approaches that have applied deep learning for abstractive text summarisation since 2015. RNN with an attention mechanism was mostly utilised for abstractive text summarisation. We classified the research according to summary type (i.e., single-sentence or multisentence summary), as shown in Figure 4. We also compared the approaches in terms of encoder-decoder architecture, word embedding, dataset and dataset preprocessing, and evaluations and results.

Abstractive Summarization Architecture
3.1.1. Feedforward Architecture. Neural networks were first employed for abstractive text summarisation by Rush et al. in 2015, where a local attention-based model was utilised to generate summary words by conditioning it to input sentences [18]. ree types of encoders were applied: the bag-ofwords encoder, the convolution encoder, and the attentionbased encoder. e bag-of-words model of the embedded input was used to distinguish between stop words and content words; however, this model had a limited ability to represent continuous phrases. us, a model that utilised the deep convolutional encoder was employed to allow the words to interact locally without the need for context. e convolutional encoder model can alternate between temporal convolution and max-pooling layers using the standard time-delay neural network (TDNN) architecture; however, it is limited to a single output representation. e limitation of the convolutional encoder model was overcome by the attention-based encoder. e attention-based encoder was utilised to exploit the learned soft alignment to weight the input based on the context to construct a representation of the output. Furthermore, the beam-search decoder was applied to limit the number of hypotheses in the summary.

RNN Encoder-Decoder Architecture
(1) LSTM-RNN. An abstractive sentence summarisation model that employed a conditional recurrent neural network (RNN) to generate the summary from the input is referred to as a recurrent attentive summariser (RAS) [39]. A RAS is an extension of the work in [18]. In [18], the model employed a feedforward neural network, while the RAS employed an RNN-LSTM. e encoder and decoder in both models were trained using sentence-summary pair datasets, but the decoder of the RAS improved the performance since it considered the position information of the input words. Furthermore, previous words and input sentences were employed to produce the next word in the summary during the training phase.
Lopyrev [29] proposed a simplified attention mechanism that was utilised in an encoder-decoder RNN to generate headlines for news articles. e news article was fed into the encoder one word at a time and then passed through the embedding layer to generate the word representation. e experiments were conducted using simple and complex attention mechanisms. In the simple attention mechanism, the last layer after processing the input in the encoding was divided into two parts: one part for calculating the attention weight vector, and one part for calculating the context vector, as shown in Figure 5(a). However, in the complex attention mechanism, the last layer was employed to calculate the attention weight vector and context vector without fragmentation, as shown in Figure 5(b). In both figures, the solid lines indicate the part of the hidden state of the last layer that is employed to compute the context vector, while the dashed lines indicate the part of the hidden state of the last layer that is applied to compute the attention weight vector. e same difference was seen on the decoder side: in the simple attention mechanism, the last layer was divided into two parts (one part was passed to the softmax layer, and the other part was applied to calculate the attention weight), while in the complex attention mechanism, no such division was made. A beam search at the decoder side was performed during testing to extend the sequence of the probability.  e encoder-decoder RNN and sequence-to-sequence models were utilised in [55], which mapped the inputs to the target sequences; the same approach was also employed in [38,51]. ree different methods for global attention were proposed for calculating the scoring functions, including dot product scoring, the bilinear form, and the scalar value calculated from the projection of the hidden states of the RNN encoder [38]. e model applied LSTM cells instead of GRU cells (both LSTM and GRU are commonly employed for abstractive summarisation tasks since LSTM has a memory unit that provides control but the computation time of GRU is lower). ree models were employed: the first model applied unidirectional LSTM in both the encoder and the decoder; the second model was implemented using bidirectional LSTM in the encoder and unidirectional LSTM in the decoder; and the third model utilised a bidirectional LSTM encoder and an LSTM decoder with global attention. e first hidden state of the decoder is the concatenation of all backward and forward hidden states of the encoder. e use of attention in an encoder-decoder neural network generates a context vector at each timestep. For the local attention mechanism, the context vector is conditioned on a subset of the encoder's hidden states, while for the global attention mechanism, the vector is conditioned on all the encoder's hidden states. After generating the first decoder output, the next decoder input is the word embedding of the output of the previous decoder step. e affine transformation is used to convert the output of the decoder LSTM to a dense vector prediction due to the long training time needed before the number of hidden states is the same as the number of words in the vocabulary.
Khandelwal [51] employed a sequence-to-sequence model that consists of an LSTM encoder and LSTM decoder for abstractive summarisation of small datasets. e decoder generated the output summary after reading the hidden representations generated by the encoder and passing them to the softmax layer. e sequence-to-sequence model does not memorize information, so generalization of the model is not possible. us, the proposed model utilised imitation learning to determine whether to choose the golden token (i.e., reference summary token) or the previously generated output at each step.
(2) GRU-RNN. A combination of the elements of the RNN and convolutional neural network (CNN) was employed in an encoder-decoder model that is referred to as a quasi-recurrent neural network (QRNN) [50]. In the QRNN, the GRU was utilised in addition to the attention mechanism. e QRNN was applied to address the limitation of parallelisation, which aimed to obtain the dependencies of the words in previous steps via convolution and "fo-pooling," which were performed in parallel, as shown in Figure 6. e convolution in the QRNN can be either mass convolution (considering previous timesteps only) or centre convolution (considering future timesteps).
e encoder-decoder model employed two neural networks: the first network applied the centre convolution of QRNN and consisted of multiple hidden layers that were fed by the vector representation of the words, and the second network comprised neural attention and considered as input the encoder hidden layers to generate one word of a headline. e decoder accepted the previously generated headline word and produced the next word of the headline; this process continued until the headline was completed.
SEASS is an extension of the sequence-to-sequence recurrent neural network that was proposed in [52]. e selective encoding for the abstractive sentence summarisation (SEASS) approach includes a selective encoding model that consists of an encoder for sentences, a selective gate network, and a decoder with an attention mechanism, as shown in Figure 7. e encoder uses a bidirectional GRU, while the decoder uses a unidirectional GRU with an attention mechanism. e encoder reads the input words and their representations. e meaning of the sentences is applied by the selective gate to choose the word representations for generating the word representations of the sentence. To produce an excellent summary and accelerate the decoding process, a beam search was selected as the decoder.
On the other hand, dual attention was applied in [53]. e proposed dual attention approach consists of three modules: two bidirectional GRU encoders and one dual attention decoder. e decoder has a gate network for context selection, as shown in Figure 8, and employs copying and coverage mechanisms. e outputs of the encoders are two context vectors: one context vector for sentences and one context vector for the relation, where the relation may be   Figure 6: Comparison of the CNN, LSTM, and QRNN models [50].

Sentence encoder
Rel4 Rel5 Attention Relation encoder  (3) Others. e long-sequence poor semantic representation of abstractive text summarisation approaches, which are based on an RNN encoder-decoder framework, was addressed using the RC-Transformer (RCT) [54]. An RCT is an RNN-based abstractive text summarisation model that is composed of two encoders (RC encoder and transformer encoder) and one decoder. e transformer shows an advantage in parallel computing in addition to retrieving the global context semantic relationships. On the other hand, sequential context representation was achieved by a second encoder of the RCT-Transformer. Word ordering is very crucial for abstractive text summarisation, which cannot be obtained by positioning encoding. erefore, an RCT utilised two encoders to address the problem of a shortage of sequential information at the word level. A beam search was utilised at the decoder. Furthermore, Cai et al. compared the speed of the RCT model and that of the RNN-based model and concluded that the RCT is 1.4x and 1.2x faster.

Word
Embedding. In the QRNN model, GloVe word embedding, which was pretrained using the Wikipedia and Gigaword datasets, was performed to represent the text and summary [50]. In the first model, the proposed model by Jobson et al., the word embedding, randomly initialised and updated during training, while GloVe word embedding was employed to represent the words in the second and third models [38]. In a study by Cai et al., Transformer was utilised [54].

Dataset and Dataset
Preprocessing. In the model that was proposed by Rush et al., datasets were preprocessed via PTB tokenization by using "#" to replace all digits, conversion of all letters to lowercase letters, and the use of "UNK" to replace words that occurred fewer than 5 times [18]. e model was trained with any input-output pairs due to the shortage of constraints for generating the output. e training process was carried out on the Gigaword datasets, while the summarisation evaluation was conducted on DUC2003 and DUC2004 [18]. Furthermore, the proposed model by Chopra et al. was trained using the Gigaword corpus with sentence separation and tokenisation [39]. To form sentence-summary pairs, each headline of the article was paired with the first sentence of the article. e same preprocessing steps of the data in [18] were performed in [39]. Moreover, the Chopra et al. model was evaluated using the DUC2004 dataset, which consists of 500 pairs.
Gigaword datasets were also employed by the QRNN model [50]. Furthermore, articles that started with sentences that contained more than 50 words or headlines with more than 25 words were removed. Moreover, the words in the articles and their headlines were converted to lowercase words, and the data points were split into short, medium, and long sentences, based on the lengths of the sentences, to avoid extra padding.
Lopyrev and Jobson et al. trained the model using Gigaword after processing the data. In the Lopyrev model, the most crucial preprocessing steps for both the text and the headline were tokenisation and character conversion to lowercase [29]. In addition, only the characters of the first paragraph were retained, and the length of the headline was fixed between 25 and 50 words. Moreover, the no-headline articles were disregarded, and the 〈unk〉 symbol was used to replace rare words.
Khandelwal employed the Association for Computational Linguistics (ACL) Anthology Reference Corpus, which consists of 16,845 examples for training and 500 examples for testing, and they were considered small datasets, in experiments [51]. e abstract included the first three sentences, and the unigram that overlaps between the title and the abstract was also calculated.
ere were 25 tokens in the summary, and there were a maximum of 250 tokens in the input text.
e English Gigaword dataset, DUC2004 corpus, and MSR-ATC were selected to train and test the SEASS model [52]. Moreover, the experiments of the Cao et al. model were conducted using the Gigaword dataset [53]. e same preprocessing steps of the data in [18] were performed in [52,53]. Moreover, RCT also employed the Gigaword and DUC2004 datasets in experiments [54].
e pointer-generator [55] includes singlesentence and multisentence summaries. Additional details are presented in the following sections.

LSTM RN.
A novel abstractive summarisation method was proposed in [56]; it generated a multisentence summary and addressed sentence repetition and inaccurate information. See et al. proposed a model that consists of a singlelayer bidirectional LSTM encoder, a single-layer unidirectional LSTM decoder, and the sequence-to-sequence attention model proposed by [55]. e See et al. model generates a long text summary instead of headlines, which consists of one or two sentences. Moreover, the attention mechanism was employed, and the attention distribution facilitated the production of the next word in the summary by telling the decoder where to search in the source words, as shown in Figure 9.
is mechanism constructed the weighted sum of the hidden state of the encoder that facilitated the generation of the context vector, where the context vector is the fixed size representation of the input. e probability (Pvocab) produced by the decoder was employed to generate the final prediction using the context vector and the decoder's last step. Furthermore, the value of Pvocab was equal to zero for OOV words. RL was employed for abstractive text summarisation in [57]. e proposed method in [57], which combined RL with supervised word prediction, was composed of a bidirectional LSTM-RNN encoder and a single LSTM decoder.
Two models-generative and discriminative models-were trained simultaneously to generate abstractive summary text using the adversarial process [58]. e maximum likelihood estimation (MLE) objective function employed in previous sequence-sequence models suffers from two problems: the difference between the training loss and the evaluation metric, and the unavailability of a golden token at testing time, which causes errors to accumulate during testing. To address the previous problems, the proposed approach exploited the adversarial framework. In the first step of the adversarial framework, reinforcement learning was employed to optimize the generator, which generates the summary from the original text. In the second step, the discriminator, which acts as a binary classifier, classified the summary as either a ground-truth summary or a machine-generated summary. e bidirectional LSTM encoder and attention mechanism were employed, as shown in [56].
Abstract text summarisation using the LSTM-CNN model based on exploring semantic phrases (ATSDL) was proposed in [30]. ATSDL is composed of two phases: the first phase extracts the phrases from the sentences, while the second phase learns the collocation of the extracted phrases using the LSTM model. To generate sentences that are general and natural, the input and output of the ATSDL model were phrases instead of words, and the phrases were divided into three main types, i.e., subject, relation, and object phrases, where the relation phrase represents the relation between the input phrase and the output phrase. e phrase was represented using a CNN layer. ere are two main reasons for choosing the CNN: first, the CNN was efficient for sentence-level applications, and second, training was efficient since long-term dependency was unnecessary. Furthermore, to obtain several vectors for a phrase, multiple kernels with different widths that represent the dimensionality of the features were utilised. Within each kernel, the maximum feature was selected for each row in the kernel via maximum pooling.
e resulting values were added to obtain the final value for each word in a phrase. Bidirectional LSTM was employed instead of a GRU on the encoder side since parameters are easy to tune with LSTM. Moreover, the decoder was divided into two modes: a generate mode and a copy mode. e generate mode generated the next phrase in the summary based on previously generated phrases and the hidden layers of the input on the encoder side, while the copy mode copied the phrase after the current input phrase if the current generated phrase was not suitable for the previously generated phrases in the summary. Figure 10 provides additional details.
Bidirectional encoder and decoder LSTM-RNNs were employed to generate abstractive multisentence summaries [35]. e proposed approach considered past and future context on the decoder side when making a prediction as it employed a bidirectional RNN. Using a bidirectional RNN on the decoder side addressed the problem of summary imbalance. An unbalanced summary could occur due to noise in a previous prediction, which will reduce the quality of all subsequent summaries. e bidirectional decoder consists of two LSTMs: the forward decoder and the backward decoder. e forward decoder decodes the information from left to right, while the backward decoder decodes the information from right to left. e last hidden state of the forward decoder is fed as the initial input to the backward decoder, and vice versa. Moreover, the researcher proposed a bidirectional beam-search method that generates summaries from the proposed bidirectional model. Bidirectional beam search combined information from the past and future to produce a better summary. erefore, the output summary was balanced by considering both past and future information and by using a bidirectional attention mechanism. In addition, the input sequence was read in reverse order based on the conclusion that LSTM learns better when reading the source in reverse order while remembering the order of the target [66,67]. A softmax layer was employed on the decoder side to obtain the probability of each target word in the summary over the vocabulary distribution by taking the output of the decoder as input for the softmax layer. e decoder output depends on the internal representation of the encoder, i.e., the context vector, the current hidden state of the decoder, and the summary words previously generated by the decoder hidden states. e objective of training is to maximise the probability of the alignment between the sentence and the summary from both directions. During training, the input of the forward decoder is the previous reference summary token. However, during testing, the input of the forward decoder is the token generated in the previous step. e same situation is true for the backward decoder, where the input during training is the future token from the summary. Nevertheless, the bidirectional decoder has difficulty during testing since the complete summary must be known in advance; thus, the full backward decoder was generated and fed to the forward decoder using a unidirectional backward beam search.
A combination of abstractive and extractive methods was employed in the guiding generation model proposed by [59]. e extractive method generates keywords that are encoded by a key information guide network (KIGN) to represent key information. Furthermore, to predict the final summary of the long-term value, the proposed method applied a prediction guide mechanism [68]. A prediction
Decoder hidden states Vocabulary distribution Figure 9: Baseline sequence-to-sequence model with attention mechanism [56]. guide mechanism is a feedforward single-layer neural network that predicts the key information of the final summary during testing. e encoder-decoder architecture baseline of the proposed model is similar to that proposed by Nallapati et al. [55], where both the bidirectional LSTM encoder and the unidirectional LSTM decoder were employed. Both models applied the attention mechanism and softmax layer. Moreover, the process of generating the summary was improved by proposing KIGN, which considers as input the keywords extracted using the TextRank algorithm. In KIGN, key information is represented by concatenating the last forward hidden state and first backward hidden state. KIGN employs the attention mechanism and pointer mechanism.
In general, the attention mechanism hardly identifies the keywords; thus, to identify keywords, the output of KIGN will be fed to the attention mechanism. As a result, the attention mechanism will be highly affected by the keywords. However, to enable the pointer network to identify the keywords, which are the output of KIGN, the encoder context vector and hidden state of the decoder will be fed to the pointer network, and the output will be employed to calculate the soft switch. e soft switch determines whether to copy the target from the original text or generate it from the vocabulary of the target, as shown in Figure 11. e level of abstraction in the generated summary of the abstractive summarisation models was enhanced via the two techniques proposed in [60]: decoder decomposition and the use of a novel metric for optimising the overlap between the n-gram summary and the ground-truth summary. e decoder was decomposed into a contextual network and pretrained language model, as shown in Figure 12. e contextual network applies the source document to extract the relevant parts, and the pretrained language model is generated via prior knowledge. is decomposition method facilitates the addition of an external pretrained language model that is related to several domains. Furthermore, a novel metric was employed to generate an abstractive summary by including words that are not in the source document. Bidirectional LSTM was utilised in the encoder, and the decoder applied 3-layer unidirectional weightdropped LSTM. In addition, the decoder utilised a temporal attention mechanism, which applied the intra-attention mechanism to consider previous hidden states. Furthermore, a pointer network was introduced to alternate between copying the output from the source document and selecting it from the vocabulary. As a result, the objective function combined between force learning and maximum likelihood.
A bidirectional decoder with a sequence-to-sequence architecture, which is referred to as BiSum, was employed to minimise error accumulation during testing [62]. Errors accumulate during testing as the input of the decoder is the previously generated summary word, and if one of the generated word summaries is incorrect, then the error will propagate through all subsequent summary words. In the bidirectional decoder, there are two decoders: a forward decoder and a backward decoder. e forward decoder generates the summary from left to right, while the backward decoder generates the summary from right to left. e forward decoder considers a reference from the backward decoder. However, there is only a single-layer encoder. e encoder and decoder employ an LSTM unit, but while the encoder utilises bidirectional LSTM, the decoders use unidirectional LSTM, as shown in Figure 13. To understand the summary generated by the backward decoder, the attention mechanism is applied in both the backward decoder and the encoder. Moreover, to address the problem of out-of-vocabulary words, an attention mechanism is employed in both decoders. A double attention pointer network, which is referred to as (DAPT), was applied to generate an abstractive text summarisation model [49]. e encoder utilised bidirectional LSTM, while the decoder utilised unidirectional LSTM. e encoder key features were extracted using a selfattention mechanism. At the decoder, the beam search was employed. Moreover, more coherent and accurate summaries were generated. e repetition problem was addressed using an improved coverage mechanism with a truncation parameter. e model was optimised by generating a training model that is based on RL and scheduled sampling.

GRU-RNN.
Dual encoding using a sequence-to-sequence RNN was proposed as the DEATS method [61]. e dual encoder consists of two levels of encoders, i.e., primary and secondary encoders, in addition to one decoder, and all of them employ a GRU. e primary encoder considers coarse encoding, while the secondary encoder considers fine encoding. e primary encoder and decoder are the same as the standard encoder-decoder model with an attention mechanism, and the secondary encoder generates a new context vector that is based on previous output and input. Moreover, an additional context vector provides meaningful information for the output. us, the repetition problem of the generated summary that was encountered in previous approaches is addressed. e semantic vector is generated on both levels of encoding: in the primary encoder, the semantic vector is generated for each input, while in the secondary encoder, the semantic vector is recalculated after the importance of each input word is calculated. e fixed-length output is partially generated at each stage in the decoder since it decodes in stages. Figure 14 elaborates the DEATS process. e primary encoder produces a hidden state h p j for each input j and content representation c p . Next, the decoder decodes a fixedlength output, which is referred to as the decoder content representation c d . e weight αj can be calculated using the hidden states h p j and the content representations c p and c d . In this stage, the secondary encoder generates new hidden states or semantic context vectors h s m , which are fed to the decoder. Moreover, DEATS uses several advanced techniques, including a pointer-generator, copy mechanism, and coverage mechanism.
Wang et al. proposed a hybrid extractive-abstractive text summarisation model, which is based on combining the reinforcement learning with BERT word embedding [63]. In this hybrid model, a BERTfeature-based strategy was used to generate contextualised token embedding.
is model consists of two submodels: abstractive agents and extractive agents, which are bridged using RL. Important sentences are extracted using the extraction model and rewritten using the abstraction model. A pointer-generator network was utilised to copy some parts of the original text, where the sentencelevel and word-level attentions are combined. In addition, a beam search was performed at the decoder. In abstractive and extractive models, the encoder consists of a bidirectional GRU, while the decoder consists of a unidirectional GRU. e training process consists of pretraining and full training phases.
Egonmwan et al. proposed to use sequence-to-sequence and transformer models to generate abstractive summaries [64]. e proposed summarisation model consists of two modules: an extractive model and an abstractive model. e encoder transformer has the same architecture shown in [48]; however, instead of receiving the document representation as input, it receives sentence-level representation. e architecture of the abstractive model consists of a single- layer unidirectional GRU at the encoder and single-layer unidirectional GRU at the decoder. e input of the encoder is the output of the transformer. A beam search was performed during inference at the decoder, while greedydecoding was employed during training and validation.

Others.
BERT is employed to represent the sentences of the document to express its semantic [65]. Liu et al.
proposed abstractive and extractive summarisation models that are based on encoder-decoder architecture. e encoder used a BERT pretrained document-level encoder, while the decoder utilised a transformer that is randomly initialised and trained from scratch. In the abstractive model, the optimisers of the encoder and decoder are separated. Moreover, two stages of fine-tuning are utilised at the encoder: one stage in extractive summarisation and one stage in abstractive summarisation. At the decoder side, a beam search was performed; however, the coverage and copy mechanisms were not employed since these two mechanisms need additional tuning of the hyperparameters. e repetition problem was addressed by producing different summaries by using trigram-blocking. e OOV words rarely appear in the generated summary.

Word
Embedding. e word embedding of the input for the See et al. model was learned from scratch instead of using a pretrained word embedding model [56]. On the other hand, both the input and output tokens applied the same embedding matrix Wemb, which was generated using the GloVe word embedding model in the Paulus et al. model [57]. Another word embedding matrix referred to as Wout was applied in the token generation layer. Additionally, a sharing weighting matrix was employed by both the shared embedding matrix Wemb and the Wout matrix. e sharing weighting matrixes improved the process of generating tokens since they considered the embedding syntax and semantic information.  e discriminator input sequence of the Liu et al. model was encoded using a maximum pooling CNN, where the result was passed to the softmax layer [58]. On the other hand, the word embedding that was applied in the Al-Sabahi et al. model was learned from scratch using the CNN/Daily Mail datasets with 128 dimensions [35]. Egonmwan et al. [64] used pretrained GloVe word embedding. BERT word embedding was utilised in the models proposed by Wang et al. [63] and Liu et al. [65].

Dataset and Dataset
Preprocessing. Experiments were conducted with the See et al. [56], Al-Sabahi et al. [35], and Li et al. [59] models using CNN/Daily Mail datasets, which consist of 781 tokens paired with 56 tokens on average; 287,226 pairs, 13,368 pairs, and 11,490 pairs were utilised for training, validation, and testing, respectively [56]. In the model proposed by Paulus et al., the document was preprocessed using the same method applied in [55]. e proposed model was evaluated using two datasets: the CNN/ Daily News dataset and the New York Times dataset. e CNN/Daily Mail dataset was utilised by Liu et al. for training their model [58].
e ATSDL model consisted of three stages: text preprocessing, phrase extractions, and summary generation [30]. During text preprocessing, the CoreNLP tool was employed to segment the words, reduce the morphology, and resolve the coreference. e second stage of the ATSDL model was phrase extraction, which included the acquisition, refinement and combination of phrases. In addition, multiorder semantic parsing (MOSP), which was proposed to create multilayer binary semantics, was applied for phrase extraction. e first step of MOSP was to perform Stanford NLP parsing, which is a specialised tool that retrieved the lexical and syntactic features from the preprocessed sentences. Next, dependency parsing was performed to create a binary tree by determining the root of the tree, which represents the relational phrase. If the child node has children, then the child is considered a new root with children; this process continues recursively until there are no children for the root. In this case, the tree structure is completed. Accordingly, the compound phrases can be explored via dependency parsing. However, one of the important stages of phrase extraction is refinement, during which redundant and incorrect phrases are refined before training by applying simple rules. First, the phrase triples at the topmost level are exploited since they carry the most semantic information. Second, triple phrases with subject and object phrases and no nouns are deleted since the noun contains a considerable amount of conceptual information. Triple phrases without a verb in a relational phrase are deleted. Moreover, phrase extraction includes phrase combination, during which phrases with the same meaning are combined to minimise redundancy and the time required to train the LSTM-RNN. To achieve the goal of the previous task and determine whether two phrases can be combined, a set of artificial rules are applied. e experiments were conducted using the CNN and Daily Mail datasets, which consisted of 92,000 text sources and 219,000 text sources, respectively. e Kryściński et al. [60] model was trained using a CNN/ Daily Mail dataset, which was preprocessed using the method from 56 [55,56]. e experiments of DEATS were conducted using the CNN/Daily Mail dataset and DUC2004 corpus [61]. e experiments of the BiSum model were performed using the CNN/Daily Mail dataset [62] [58]. In addition, a manual qualitative evaluation was performed to evaluate the quality and readability of the summary. Two participants evaluated the summaries of 50 test examples that were selected randomly from the datasets. Each summary was given a score from 1 to 5, where 1 indicates a low level of readability and 5 indicates a high level of readability.
ROUGE1 and ROUGE2 were used to evaluate the ATSDL model [30]. e value of ROUGE1 was 34.9, and the value of ROUGE2 was 17.8. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were applied as evaluation metrics of the Al-Sabahi et al. and Li et al. models, and the values of 42.6, 18.8, and 38.5, respectively, were obtained for the Al-Sabahi et al. model [35], while the values of 38.95, 17.12, and 35.68, respectively, were obtained for the Li et al. model [58]. e evaluation of the Kryściński et al. model was conducted using quantitative and qualitative evaluations [60]. e quantitative evaluations included ROUGE1, ROUGE2, and ROUGE-L, and the values of 40.19, 17.38, and 37.52, respectively, were obtained. Additionally, a novel score related to the n-gram was employed to measure the level of abstraction in the summary. e qualitative evaluation involved the manual evaluation of the proposed model. Five participants evaluated 100 full-text summaries in terms of relevance and readability by giving each document a value from 1 to 10. Furthermore, for comparison purposes, fulltext summaries from two previous studies [56,58]  ree values were chosen for evaluating the answer: a score of 1 indicates the correct answer; a score of 0.5 indicates a partially correct answer; and a score of 0 indicates a wrong answer. ROUGE1, ROUGE2, and ROUGE-L for the DAPT model over the CNN/Daily Mail datasets were 40.72, 18.28, and 37.35, respectively.
Finally, the pointer-generator approach was applied on the single-sentence and multisentence summaries. Attention encoder-decoder RNNs were employed to model the abstractive text summaries [55]. Both the encoder and decoder have the same number of hidden states. Additionally, the proposed model consists of a softmax layer for generating the words based on the vocabulary of the target. e encoder and decoder differ in terms of their components. e encoder consists of two bidirectional GRU-RNNs-a GRU-RNN for the word level and a GRU-RNN for the sentence level-while the decoder uses a unidirectional GRU-RNN, as shown in Figure 15. Furthermore, the decoder uses batching, where the vocabulary at the decoder for each minibatch is restricted to the words in the batch of the source document. Instead of considering every vocabulary, only certain vocabularies were added based on the frequency of the vocabulary in the target dictionary to decrease the size of the decoder softmax layer. Several linguistic features were considered in addition to the word embedding of the input words to identify the key entities of the document. Linguistic and statistical features included TF-IDF statistics and the part-of-speech and named-entity tags of the words. Specifically, the part-of-speech tags were stored in matrixes for each tag type that was similar to word embedding, while the TF-IDF feature was discretised in bins with a fixed number, where one-hot representation was employed to represent the value of the bins. e onehot matrix consisted of the number of bin entries, where only one entry was set to one to indicate the value of the TF-IDF of a certain word. is process permitted the TF-IDF to be addressed in the same way as any other tag by concatenating all the embeddings into one long vector, as shown in Figure 16. e experiments were conducted using the annotated Gigaword corpus with 3.8 M training examples, the DUC corpus, and the CNN/Daily Mail corpus. e preprocessing methods included tokenisation and part-of-speech and name-entity generation. Additionally, the Word2Vec model with 200 dimensions was applied for word embedding and trained using the Gigaword corpus. Additionally, the hidden states had 400 dimensions in both the encoder and the decoder. Furthermore, datasets with multisentence summaries were utilised in the experiments. e values of ROUGE1, ROUGE2, and ROUGE-L were higher than those of previous work on abstractive summarisation, with values of 35.46, 13.3, and 32.65, respectively.
Finally, for both single-sentence summary and multisentence summary models, the components of the encoder and decoder of each approach are displayed in Table 1. Furthermore, dataset preprocessing and word embedding of several approaches are appeared in Table 2 while training, optimization, mechanism, and search at the decoder are presented in Table 3.

Datasets for Text Summarization
Various datasets were selected for abstractive text summarisation, including DUC2003, DUC2004 [69], Gigaword [70], and CNN/Daily Mail [71]. e DUC datasets are produced for the Document Understanding Conference; although their quality is high, they are small datasets that are typically employed to evaluate summarisation models. e DUC2003 and DUC2004 datasets consist of 500 articles. e Gigaword dataset from the Stanford University Linguistics Department was the most common dataset for model training in 2015 and 2016. Gigaword consists of approximately 10 million documents from seven news sources, including the New York Times, Associated Press, and Washington Post. Gigaword is one of the largest and most diverse summarisation datasets even though it contains headlines instead of summaries; thus, it is considered to contain single-sentence summaries.
Recent studies utilised the CNN/Daily Mail datasets for training and evaluation. e CNN/Daily Mail datasets consist of bullet points that describe the articles, where multisentence summaries are created by concatenating the bullet points of the article [5]. CNN/Daily Mail datasets that are applied in abstractive summarisation were presented by Nallapati et al. [55]. ese datasets were created by modifying the CNN/Daily Mail datasets that were generated by Hermann et al. [
PTB tokenization by using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times Encodes the position information of the input words [55] Nallapati et al.
Part-of-speech and name-entity tags generating and tokenization (i) Encodes the position information of the input words (ii) e input text was represented using the Word2Vec model with 200 dimensions that was trained using Gigaword corpus (iii) Continuous features such as TF-IDF were represented using bins and one-hot representation for bins (iv) Lookup embedding for part-of-speech tagging and name-entity tagging [52] Zhou et al.
PTB tokenization by using "#" to replace all digits, converting all letters to lower case, and "UNK" to replace words that occurred fewer than 5 times Word embedding with size equal to 300 [53] Cao et al.
Normalization and tokenization, using the "#" to replace digits, convert the words to lower case, and "UNK" to replace the least frequent words.  Table 4 lists the datasets that are used to train and validate the summarisation methods in the research papers listed in this work.

Evaluation Measures
e package ROUGE is employed to evaluate the text summarisation techniques by comparing the generated summary with a manually generated summary [73]. e package consists of several measures to evaluate the performance of text summarisation techniques, such as ROUGE-N (ROUGE1 and ROUGE2) and ROUGE-L, which were employed in several studies [38]. ROUGE-N is n-gram recall such that ROUGE1 and ROUGE2 are related to unigrams and bigrams, respectively, while ROUGE-L is related to the longest common substring. Since the manual  (5) evaluation of automatic text summarisation is a time-consuming process and requires extensive effort, ROUGE is employed as a standard for evaluating text summarisation. ROUGE-N is calculated using the following equation: where S is the reference summary, n is the n-gram length, and Count match (gram n ) is the maximum number of matching n-gram words between the reference summary and the generated summary. Count (gram n ) is the total number of n-gram words in the reference summary [73]. ROUGE-L is the longest common subsequence (LCS), which represents the maximum length of the common matching words between the reference summary and the generated summary. LCS calculation does not necessarily require the match words to be consecutive; however, the order of occurrence is important. In addition, no predefined number of match words is required. LCS considers only the main in-sequence, which is one of its disadvantages since the final score will not include other matches. For example, assume that the reference summary R and the automatic summary A are as follows:  In this case, ROUGE-L will consider either "Ahmed ate" or "the apple" but not both, similar to LCS. Tables 5 and 6 present the values of ROUGE1, ROUGE2, and ROUGE-L for the text summarisation methods in the various studies reviewed in this research. In addition, Perplexity was employed in [18,39,51], and BLEU was utilised in [29]. e models were evaluated using various datasets. e other models applied ROUGE1, ROUGE2, and ROUGE-L for evaluation. It can be seen that the highest values of ROUGE1, ROUGE2, and ROUGE-L for text summarisation with the pretrained encoder model were 43.85, 20.34, and 39.9, respectively [65]. Even though ROUGE was employed to evaluate abstractive summarisation, it is better to obtain new methods to evaluate the quality of summarisation. e new evaluation metrics must consider novel words and semantics since the generated summary contains words that do not exist in the original text. However, ROUGE was very suitable for extractive text summarisation.
Based on our taxonomy, we divided the results of ROUGE1, ROUGE2, and ROUGE-L into two groups. e first group considered single-sentence summary approaches, while the second group considered multisentence summary approaches. Figure 18 compares several deep learning techniques in terms of ROUGE1, ROUGE2, and ROUGE-L for the Gigaword datasets, consisting of single-sentence summary documents.
e highest values for ROUGE1, ROUGE2, and ROUGE-L were achieved by the RCT model [54]. e values for ROUGE1, ROUGE2, and ROUGE-L were 37.27, 18.19, and 34.62, respectively. Furthermore, Figure 19 compares the ROUGE1, ROUGE2, and ROUGE-L values for abstractive text summarisation methods for the CNN/Daily Mail datasets, which consist of multisentence summary documents. e highest values of ROUGE1, ROUGE2, and ROUGE-L were achieved for text summarisation with a pretrained encoder model. e values for ROUGE1, ROUGE2, and ROUGE-L were 43.85, 20.34, and 39.9, respectively [65]. It can be clearly seen that the best model in the single-sentence summary and multisentence summary is the models that employed BERT word embedding and were based on transformers. e ROUGE values for the CNN/Daily Mail datasets are larger than those for the Gigaword dataset, as Gigaword is utilised for single-sentence summaries as it contains headlines that are treated as summaries, while the CNN/Daily Mail datasets are multisentence summaries. us, the summaries in the CNN/Daily Mail datasets are longer than the summaries in Gigaword.
Liu et al. selected two human elevators to evaluate the readability of the generated summary of 50 test examples of 5 models [58]. e value of 5 indicates that the generated summary is highly readable, while the value of 1 indicates that the generated summary has a low level of readability. It can be clearly seen from the results that the Liu et al. model was better than the other four models in terms of ROUGE1, ROUGE2, and human evaluation, even though the model is not optimal with respect to the ROUGE-L value. In addition to quantitative measures, qualitative evaluation measures are important. Kryściński et al. also performed qualitative evaluation to evaluate the quality of the generated summary [60]. Five human evaluators evaluated the relevance and readability of 100 randomly selected test examples, where two values are utilised: 1 and 10. e value of 1 indicates that the generated summary is less readable and less relevance while the value of 10 indicates that the generated summary is readable and very relevance.
e results showed that, in terms of readability, the model proposed by Kryściński  Liu et al. evaluated the quality of the generated summary in terms of succinctness, informativeness, and fluency in addition to measuring the level of retaining key information, which was achieved by human evaluation [65]. In addition, qualitative evaluation evaluated the output in terms of grammatical mistakes. ree values were selected for evaluating 20 test examples: 1 indicates a correct answer, 0.5 indicates a partially correct answer, and 0 indicates an incorrect answer. We can conclude that quantitative evaluations, which include ROUGE1, ROUGE2, and ROUGE-L, are not enough for evaluating the generated summary of abstractive text summarisation, especially when measuring readability, relevance, and fluency. erefore, qualitative measures, which can be achieved by manual evaluation, are very important. However, qualitative measures without quantitative measures are not enough due to the small number of testing examples and evaluators.

Challenges and Solutions
Text summarisation approaches have faced various challenges; although some have been solved, others still need to be addressed. In this section, these challenges and their possible solutions are discussed.

Unavailability of the Golden Token during Testing.
Due to the availability of golden tokens (i.e., reference summary tokens) during training, previous tokens in the headline can be input into the decoder at the next step. However, during testing, the golden tokens are not available; thus, the input for the next step in the decoder will be limited to the previously generated output word. To solve this issue, which becomes more challenging when addressing small datasets, different solutions have been proposed. For example, in reference [51], the data-as-demonstrator (DaD) model [74] is utilised. In DaD, at each step, based on a coin flip, either a gold token is utilised during training or the previous step is employed during both testing and training. In this manner, at least the training step receives the same input as testing. In all cases, the first input of the decoder is the 〈EOS〉 token, and the same calculations are applied to compute the loss. In [29], teacher forcing is employed to address this challenge: during training, instead of feeding the expected word from the headline, 10% of the time, the generated word of the previous step is fed back [75,76].
Moreover, the mass convolution of the QRNN is applied in [50] since the dependency of words generated in the future is difficult to determine.

Out-of-Vocabulary (OOV) Words.
One of the challenges that may occur during testing is that the central words of the test document may be rare or unseen during training; these words are referred to as OOV words. In 61 [55,61], a switching decoder/pointer was employed to address OOV words by using pointers to point to their original positions in the source document. e switch on the decoder side is used to alternate between generating a word and using a pointer, as shown in Figure 20 [55]. When the switch is turned off, the decoder will use the pointer to point to the word in the source to copy it to the memory. When the switch is turned on, the decoder will generate a word from the target vocabularies. Conversely, researchers in [56] addressed OOV words via probability generation Pgen, where the value is calculated from the context vector and decoder state, as shown in Figure 21. To generate the output word, Pgen switches between copying the output words from the input sequence and generating them from the vocabulary. Furthermore, the pointer-generator technique is applied to point to input words to copy them. e combination between the words in the input and the vocabulary is referred to the extended vocabulary. In addition, in [57], to generate the tokens on the decoder side, the decoder utilised the switch function at each timestep to switch between generating the token using the softmax layer and using the pointer mechanism to point to the input sequence position for   unseen tokens to copy them. Moreover, in [30], rare words were addressed by using the location of the phrase, and the resulting summary was more natural. Moreover, in 35 [35,58,60], the OOV problem was addressed by using the pointer-generator technique employed in [56], which alternates between generating a new word and coping the word from the original input text.

Summary Sentence Repetition and Inaccurate Information
Summary. e repetition of phrases and generation of incoherent phrases in the generated output summary are two challenges that must be considered. Both challenges are due to the summarisation of long documents and the production of long summaries using the attention-based encoder-decoder RNN [57]. In [35,56], repetition was addressed by using the coverage model to create the coverage vector by aggregating the attention over all previous timesteps. In [57], repetition was addressed by using the key attention mechanism, where for each input token, the encoder intratemporal attention records the weights of the previous attention. Furthermore, the intratemporal attention uses the hidden states of the decoder at a certain timestep, the previously generated words, and the specific part of the encoded input sequence, as shown in Figure 22, to prevent repetition and attend to the same sequence of the input at a different step of the decoder. However, the intra-attention encoder mechanism cannot address all the repetition challenges, especially when a long sequence is generated.
us, the intradecoder attention mechanism was proposed to allow the decoder to consider more previously generated words. Moreover, the proposed intradecoder attention mechanism is applicable to any type of the RNN decoder. Repetition was also addressed by using an objective function that combines the cross-entropy loss maximum likelihood and gradient reinforcement learning to minimise the exposure bias. In addition, the probability of trigram p (yt) was proposed to address repetition in the generated summary, where yt is the trigram sequence. In this case, the value of p (yt) is 0 during a beam search in the decoder when the same trigram sequence was already generated in the output summary. Furthermore, in [60], the heuristic proposed by [57] was employed to reduce repetition in the summary. Moreover, in [61], the proposed approach addressed repetition by exploiting the encoding features generated using a secondary encoder to remember the previously generated decoder output, and the coverage mechanism is utilised.

Fake Facts.
Abstractive summarisation may generate summaries with fake facts, and 30% of summaries generated from abstractive text summarisation suffer from this problem [53]. With fake facts, there may be a mismatch between the subject and the object of the predicates. us, to address this problem, dependency parsing and open information extraction (e.g., open information extraction (OpenIE)) are performed to extract facts. erefore, the sequence-to-sequence framework with dual attention was proposed, where the generated summary was conditioned by the input text and description of the extracted facts. OpenIE facilitates entity extraction from a relation, and Stanford CoreNLP was employed to provide the proposed approach with OpenIE and the dependency parser. Moreover, the decoder utilised copying and coverage mechanisms.

Other Challenges.
e main issue of the abstractive text summarisation dataset is the quality of the reference summary (Golden summary). In the CNN/Daily Mail dataset, the reference summary is the highlight of the news. Every highlight represents a sentence in the summary; therefore,

24
Mathematical Problems in Engineering the number of sentences in the summary is equal to the number of highlights. Sometimes, the highlights do not address all crucial points in the summary. erefore, a highquality dataset needs high effort to become available. Moreover, in some languages, such as Arabic, the multisentence dataset for abstractive summarisation is not available. Single-sentence abstractive Arabic text summarisation is available but is not free. Another issue of abstractive summarisation is the use of ROUGE for evaluation. ROUGE provides reasonable results in the case of extractive summarisation. However, in abstractive summarisation, ROUGE is not enough as ROUGE depends on exact matching between words. For example, the words book and books are considered different using any one of the ROUGE metrics. erefore, a new evaluation measure must be proposed to consider the context of the words (words that have the same meaning must be considered the same even if they have a different surface form). In this case, we propose to use METEOR which was used recently in evaluating machine translation and automatic summarisation models [77]. Moreover, METEOR considers stemming, morphological variants, and synonyms. In addition, in flexible order language, it is better to use ROUGE without caring about the order of the words. e quality of the generated summary can be improved using linguistic features. For example, we proposed the use  Figure 21: Pointer-generator model [56].
of dependency parsing at the encoder in a separate layer at the top of the first hidden state layer. We proposed the use of the word embedding, which was built by considering the dependency parsing or part-of-speech tagging. At the decoder side, the beam-search quality can be improved by considering the part-of-speech tagging of the words and its surrounding words. Based on the new trends and evaluation results, we think that the most promising feature among all the features is the use of the BERT pretrained model. e quality of the models that are based on the transformer is high and will yield promising results.

Conclusion and Discussion
In recent years, due to the vast quantity of data available on the Internet, the importance of the text summarisation process has increased. Text summarisation can be divided into extractive and abstractive methods. An extractive text summarisation method generates a summary that consists of words and phrases from the original text based on linguistics and statistical features, while an abstractive text summarisation method rephrases the original text to generate a summary that consists of novel phrases. is paper reviewed recent approaches that applied deep learning for abstractive text summarisation, datasets, and measures for evaluation of these approaches. Moreover, the challenges encountered when employing various approaches and their solutions were discussed and analysed. e overview of the reviewed approaches yielded several conclusions. e RNN and attention mechanism were the most commonly employed deep learning techniques. Some approaches applied LSTM to solve the gradient vanishing problem that was encountered when using an RNN, while other approaches applied a GRU. Additionally, the sequence-to-sequence model was utilised for abstractive summarisation. Several datasets were employed, including Gigaword, CNN/Daily Mail, and the New York Times. Gigaword was selected for single-sentence summarisation, and CNN/Daily Mail was employed for multisentence summarisation. Furthermore, ROUGE1, ROUGE2, and ROUGE-L were utilised to evaluate the quality of the summaries. e experiments showed that the highest values of ROUGE1, ROUGE2, and ROUGE-L were obtained in text summarisation with a pretrained encoder mode, with values of 43.85, 20.34, and 39.9, respectively. e best results were achieved by the models that apply Transformer. e most common challenges faced during the summarisation process were the unavailability of a golden token at testing time, the presence of OOV words, summary sentence repetition, sentence inaccuracy, and the presence of fake facts. In addition, there are several issues that must be considered in abstractive summarisation, including the dataset, evaluation measures, and quality of the generated summary.

Data Availability
No data were used to support this study.  Figure 22: A new word is added to the output sequence by combining the current hidden state "H" of the decoder and the two context vectors, marked as "C" [57].