Deep learning is a crucial technology in intelligent question answering research. Extensive studies on question answering have been conducted by adopting deep learning methods. The challenge is that the task not only requires an effective semantic understanding model to generate a textual representation but also needs to consider the semantic interaction between questions and answers. In this paper, we propose a stacked Bidirectional Long Short-Term Memory (BiLSTM) neural network based on the coattention mechanism to extract the interaction between questions and answers, combining cosine similarity and Euclidean distance to score the question and answer sentences. Experiments are conducted on the publicly available Text REtrieval Conference (TREC) 8-13 dataset and the Wiki-QA dataset. Experimental results confirm that the proposed model is effective; in particular, it achieves a mean average precision (MAP) of 0.7613 and a mean reciprocal rank (MRR) of 0.8401 on the TREC dataset.
Deep learning forms more abstract high-level feature representations by combining low-level features, discovering distributed feature representations of data, and it provides an effective method for NLP research. In recent years, intelligent question answering in the NLP field has emerged as a prominent research hotspot in both academia and industry and has been widely adopted by many influential question answering systems. Answer selection plays a vital role in the question answering task: it encodes the QA pair and inputs it into the model to extract the key information and obtain the corresponding representation [
In the past few years, most question answering studies [
With the significant innovation of deep learning, deep neural networks are able to effectively map the meanings of single words in a sentence to a continuous representation of the entire sentence, and the resulting sentence representation is more complete. Because deep learning reduces the need for manual feature engineering and adapts readily to new tasks, it has become an important research method for various NLP tasks in the last several years, and a large number of researchers take advantage of its end-to-end models for sentence semantic analysis to implement question answering tasks. Feng et al. [
In this paper, we construct a deep learning architecture for question answering in which questions and answers are limited to a single sentence. At the core of our architecture are two distributed sentence models working in parallel, both based on a stacked BiLSTM neural network. We map questions and answers to corresponding distributed vectors and finally calculate the semantic similarity between them. BiLSTM neural networks have been widely used in recent years to deal with NLP issues [
The main contributions of this paper are summarized as follows: (1) a stacked BiLSTM neural network is employed to obtain the vector representation of the input sentence, which can effectively capture sentence semantics; (2) our model combines the coattention mechanism and the attention mechanism to encode sentences and capture the interaction and mutual influence between the QA pair; (3) cosine similarity and Euclidean distance are reconciled to calculate the degree of matching between two vectors, which takes both the distance and the angle between vectors into consideration.
The rest of this paper is organized as follows. Section
Research in question answering has been greatly boosted by the Text REtrieval Conference series since 1999. Recently, a number of related works [
Previously, traditional research approaches concentrated on syntactic matching between the questions and answers. Punyakanok et al. [
In the recent work of question answering, the mainstream is based on deep learning methods. Yih et al. [
The attention mechanism is extremely well suited for inferring the mapping relationship between different modalities of data. It can help a framework such as an encoder-decoder to properly capture the interrelationships among multiple content modalities and thus produce more effective representations [
In many previous works such as Liu [
In this section, we describe the proposed question answering model based on deep learning, which is optimized based on the architecture of Tan et al. [
Framework of the proposed neural network model.
In Figure
The LSTM network architecture was originally developed by Hochreiter and Schmidhuber [
Architecture of Long Short-Term Memory cell.
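For concreteness, the gate computations inside the LSTM cell shown above can be sketched in a few lines of NumPy. This is a generic illustration of the standard LSTM update with a forget gate, not the paper's implementation, and the packed parameter layout is our assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell with a forget gate.

    W: (4h, d), U: (4h, h), b: (4h,) pack the input (i), forget (f),
    output (o), and candidate (g) parameters; this packing is an
    illustrative convention, not the paper's code.
    """
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # cell state update
    h_t = sigmoid(o) * np.tanh(c_t)                      # hidden state output
    return h_t, c_t
```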
To overcome the shortcoming that a single LSTM cell can only capture the previous context but cannot utilize the future context, Schuster and Paliwal [
Some previous works have shown that by stacking multiple BiLSTM layers in neural networks, classification or regression performance can be further improved [
Architecture of the stacked BiLSTM networks.
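As a minimal sketch of such a stacked bidirectional encoder (layer count and sizes below are illustrative assumptions, not the paper's reported hyperparameters), the structure could be expressed with Keras as follows; each BiLSTM layer returns the full hidden-state sequence so that the next layer, and later the coattention step, can operate over all time steps.

```python
import tensorflow as tf
from tensorflow.keras import layers

def stacked_bilstm_encoder(vocab_size=20000, embed_dim=300,
                           hidden_units=150, num_layers=2):
    """Illustrative stacked BiLSTM sentence encoder (sizes assumed)."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(tokens)
    for _ in range(num_layers):
        # return_sequences=True keeps one hidden state per word, which the
        # coattention layer needs.
        x = layers.Bidirectional(layers.LSTM(hidden_units,
                                             return_sequences=True))(x)
    return tf.keras.Model(tokens, x)
```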
Defining
Here, we implement a coattention mechanism to encode the question according to the answer sequence, as shown in Figure
An illustration of the coattention mechanism.
We first perform matrix multiplication to calculate the affinity matrix
The softmax function is applied to normalize vector elements and is effective for multiclass classification and probability distribution problems. Hence, column- and row-wise softmax functions are utilized to generate attention weights for the hidden states of the question and the answer separately, as in the following equation:
In order to obtain the attention vector of the question in light of each word of the answer, we concatenate the attention weights and the affinity matrix to compute new context vectors
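To make the computation above concrete, the following NumPy sketch builds the affinity matrix, applies the row- and column-wise softmax, and forms the weighted context vectors. It follows the standard coattention formulation; the concatenation of attention weights with the affinity matrix described above is omitted for brevity, and all variable names are ours.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coattention(Q, A):
    """Q: (n_q, d) question hidden states; A: (n_a, d) answer hidden states."""
    L = Q @ A.T                  # affinity matrix, shape (n_q, n_a)
    W_q = softmax(L, axis=1)     # row-wise: weights over answer words
    W_a = softmax(L, axis=0)     # column-wise: weights over question words
    C_q = W_q @ A                # question context in light of the answer
    C_a = W_a.T @ Q              # answer context in light of the question
    return C_q, C_a
```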
To reduce the information loss of stacked BiLSTM, a soft attention flow layer can be used for linking and integrating information from the question and answer words [
Here,
In this work, we adopt a method that reconciles cosine similarity and Euclidean distance to evaluate the degree of matching between questions and answers. Cosine similarity reflects the angle between two vectors, while Euclidean distance reflects the distance between two points in Euclidean space. We want the distance between the question and answer semantic vectors to be small enough and the angle between them to be small enough, so as to maximize the similarity between question and answer sentence vectors. The schematic diagram of cosine similarity and Euclidean distance is shown in Figure
Schematic diagram of cosine similarity and Euclidean distance.
A vector representation of the question and answer is obtained from the hidden layer of the model. The cosine similarity and Euclidean distance calculation details are as below.
The cosine similarity is normalized to the [0, 1] interval and can be obtained as follows:
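As a hedged sketch of this scoring step (the exact weighting between the two terms is given by the paper's equations; the convex combination and the distance-to-score mapping below are our assumptions), the combined similarity could look like:

```python
import numpy as np

def combined_similarity(q, a, alpha=0.5):
    """Score reconciling cosine similarity and Euclidean distance.

    alpha and the 1/(1 + distance) mapping are illustrative assumptions,
    not the paper's published formula.
    """
    cos = (q @ a) / (np.linalg.norm(q) * np.linalg.norm(a))
    cos01 = 0.5 * (cos + 1.0)                      # normalize cosine to [0, 1]
    dist01 = 1.0 / (1.0 + np.linalg.norm(q - a))   # map distance to (0, 1]
    return alpha * cos01 + (1.0 - alpha) * dist01
```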
During training, the positive and negative samples can be input simultaneously by using the hinge loss function. We define the hinge loss function as the training objective as below:
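In code form, the standard pairwise hinge objective used for answer selection can be sketched as below; the margin value is an assumed hyperparameter, and the scores are those produced by the matching function described above.

```python
import tensorflow as tf

def hinge_loss(score_pos, score_neg, margin=0.1):
    """Pairwise hinge loss: push the positive answer's score above the
    negative answer's score by at least the margin (margin assumed)."""
    return tf.reduce_mean(tf.maximum(0.0, margin - score_pos + score_neg))
```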
In the process of training, we utilize the backpropagation algorithm to calculate the gradient
In this section, we will introduce the details of the experimental implementation, including the TREC-QA (8-13) and Wiki-QA datasets, the model evaluation metrics, and the selection of training parameters, and then we will carefully analyze the experimental results on the different datasets to demonstrate that our proposed model has good accuracy and robustness.
In this part, we mainly introduce two public datasets, TREC-QA (8-13) dataset and Wiki-QA dataset, and we also introduce the source, data characteristics, and the number of Q&A pairs of these two datasets in detail.
The experiment is operated on the Text REtrieval Conference 8-13 QA datasets (
Details of TREC-QA (8-13) dataset.
Set | Source | Questions | Positive answers | Negative answers | Length
---|---|---|---|---|---
Train-All | TREC 8-12 | 1229 | 6403 | 47014 |
Dev | TREC 13 | 84 | 222 | 926 |
Test | TREC 13 | 100 | 284 | 1233 |
Total | TREC 8-13 | 1411 | 6909 | 49173 |
Wiki-QA (
Details of Wiki-QA dataset.
Set | Questions | Positive answers | Negative answers | Length
---|---|---|---|---
Train-All | 873 | 1040 | 19320 | 16.27
Dev | 126 | 140 | 2593 | 15.91
Test | 243 | 100 | 5872 | 16.11
Total | 1242 | 293 | 27785 | 16.17
In this paper, all experiments were performed in Python and MATLAB (with their optimization toolboxes) on a computer with an Intel Core 2 Duo 2.93 GHz processor running the Windows 7 operating system.
Following the previous works of Wang et al. [
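For reference, the two ranking metrics can be computed as in the following minimal sketch, which applies the standard definitions of MAP and MRR to per-question candidate lists sorted by model score; the helper name and input format are ours.

```python
def map_mrr(ranked_labels):
    """ranked_labels: one list of 0/1 relevance labels per question,
    sorted by the model's score (best first). Returns (MAP, MRR)."""
    aps, rrs = [], []
    for labels in ranked_labels:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        if not precisions:
            continue  # questions with no correct answer are skipped
        aps.append(sum(precisions) / len(precisions))
        rrs.append(1.0 / (labels.index(1) + 1))  # rank of first correct answer
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
```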
In this paper, different experimental settings are used to test and evaluate our proposed method, which is then compared with other state-of-the-art methods on the same datasets. The neural network model is implemented with the TensorFlow library. During training, we continuously observe the performance on the test set and select the parameters with the highest MAP and MRR scores for final evaluation. Our implementation is as follows:
In order to verify the validity and accuracy of the model fusing the stacked BiLSTM network and the coattention mechanism for intelligent question answering, we tested and verified it on the TREC-QA (8-13) and Wiki-QA datasets, respectively, and analyzed and summarized the experimental results.
We conducted a comparative experiment on single-layer BiLSTM, stacked BiLSTM, and stacked BiLSTM with coattention on the TREC-QA (8-13) dataset. Figure
Comparison of sentence semantic analysis with or without coattention.
Variation in evaluation metrics with the epochs: (a) MAP and (b) MRR.
Experimental results of different baselines and our proposed model on Train-All data.
Idx | Model | MAP | MRR
---|---|---|---
1 | Probabilistic quasi-synchronous grammar [ | 0.6029 | 0.6852
2 | Tree edit models [ | 0.6091 | 0.6917
3 | Linear-chain CRF [ | 0.6307 | 0.7477
4 | LCLR [ | 0.7092 | 0.7700
5 | Bigram + count [ | 0.7113 | 0.7846
6 | Three-layer BiLSTM + BM25 [ | 0.7134 | 0.7913
7 | Convolutional deep neural networks [ | 0.7459 | 0.8078
8 | BiLSTM/CNN with attention [ | 0.7111 | 0.8322
9 | Attentive LSTM [ | 0.7530 | 0.8300
10 | BiLSTM encoder-decoder with step attention [ | 0.7261 | 0.8018
11 | BiLSTM | 0.6982 | 0.7764
12 | Stacked BiLSTM | 0.7127 | 0.7893
13 | BiLSTM with coattention | 0.7325 | 0.7962
14 | Stacked BiLSTM with coattention | 0.7451 | 0.8114
15 | Stacked BiLSTM with coattention (cosine + Euclidean) | 0.7613 | 0.8401
Different from the traditional work of Yih et al. [
We found that our experimental results with the coattention mechanism were significantly better than most of the above results [
The experimental metrics of the stacked BiLSTM are better than those of the single-layer BiLSTM, as seen by comparing line 11 with line 12 and line 13 with line 14, respectively. Furthermore, Wang and Nyberg [
The best MAP (0.7613) and MRR (0.8401) are obtained by incorporating the coattention mechanism into a stacked BiLSTM neural network and combining cosine similarity and Euclidean distance to calculate the matching degree between two vectors. Our experimental result outperforms the state-of-the-art baselines of Tan et al. [
Firstly, we conducted comparative experiments during model training: we randomly selected question and answer sentences from the test set of TREC-QA (8-13), trained the model with and without the coattention mechanism, and obtained the corresponding semantic vector representations from the different models. This verified that the presence or absence of the coattention mechanism affects the semantic representation of a sentence. The comparison results are shown in Figure
In Figure
Secondly, we verified the epoch sensitivity of the above models under different numbers of training iterations. Figure
We performed an epoch-number sensitivity analysis on our proposed model, varying the number of epochs from 5 to 35. Figure
We presented an optimized deep model by using stacked BiLSTM, coattention mechanism, attention mechanism, and a combined similarity metric, and our experimental results are shown in line 11 to line 15 of Table
We conducted further comparison experiments on the Wiki-QA dataset; validating the model on Wiki-QA makes the proposed approach more convincing. The parameter initialization and presets of the model on the Wiki-QA dataset are basically consistent with the settings for the TREC dataset, except that the batch size is 30. Because Wiki-QA is likewise an information retrieval and candidate answer ranking task, MAP and MRR are selected as the evaluation metrics, in accordance with the official evaluation data.
We also validated the various models of the design under different epochs on the Wiki-QA dataset, as shown in Figure
Variation in evaluation metrics with the epochs: (a) MAP and (b) MRR.
The experimental results of each model under the Wiki-QA dataset are shown in Table
Experimental results of different baselines and our model on the Wiki-QA dataset.
Idx | Model | MAP | MRR
---|---|---|---
1 | LSTM with attention [ | 0.6639 | 0.6828
2 | CNN-Cnt [ | 0.6520 | 0.6086
3 | wGRU-sGRU-Gl2 [ | 0.7537 | 0.7658
4 | wGRU-sGRU-Gl2-Cnt [ | |
5 | Stacked BiLSTM | 0.7248 | 0.7333
6 | SBiLSTM-coA (cosine + Euclidean) | |
In the field of intelligent question answering, these results confirm that the model performs well in capturing and representing the sentence semantics of questions and answers.
In this paper, we proposed a stacked BiLSTM neural network based on the coattention mechanism for question answering. The stacked BiLSTM is used for sentence semantic understanding and modeling; the coattention mechanism and the attention mechanism are utilized to obtain codependent representations of questions and answers; and the combination of cosine similarity and Euclidean distance is used to calculate the similarity between the question and the answer. As reported in Section
This work involved data from the Text REtrieval Conference (TREC) 8-13 datasets and the Wiki-QA dataset. We used the 53417 Q&A pairs in TREC 8-12 to train the model, while using the 1148 and 1517 Q&A pairs in TREC 13 for development and testing, respectively. All researchers can access the data at the following site:
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the National Key R&D Program of China (2017YFE0123000), the Innovation Project of Graduate Research in Chongqing (no. CYS19273), and the Key R&D Program of Common Key Technology Innovation for Key Industries in Chongqing (no. CSTC2015zdcy-ztzx60001).