DPAEG: A Dependency Parse-Based Adversarial Examples Generation Method for Intelligent Q&A Robots

Recently, natural language processing (NLP)-based intelligent question and answer (Q&A) robots have become ubiquitous. However, the robustness and security of current Q&A robots are still unsatisfactory; e.g., a slight typo in the user's question may prevent the Q&A robot from returning the correct answer. In this paper, we propose a fast and automatic test dataset generation method for the robustness and security evaluation of current Q&A robots, which can work in black-box scenarios and thus can be applied to a variety of different Q&A robots. Specifically, we propose a dependency parse-based adversarial examples generation (DPAEG) method for Q&A robots. DPAEG first uses the proposed dependency parse-based keywords extraction algorithm to extract keywords from a question. Then, the proposed algorithm generates adversarial words according to the extracted keywords, which include typos and words that are spelled similarly to the keywords. Finally, these adversarial words are used to generate a large number of adversarial questions. The generated adversarial questions, which are similar to the original questions, do not affect human understanding, but the Q&A robots cannot answer them correctly. Moreover, the proposed method works in a black-box scenario, which means it does not need knowledge of the target Q&A robots. Experimental results show that the generated adversarial examples have a high success rate on two state-of-the-art Q&A robots, DrQA and Google Assistant. In addition, the generated adversarial examples affect not only the correct answer (top-1) returned by DrQA but also the top-k candidate answers returned by DrQA. The adversarial examples make the top-k candidate answers contain fewer correct answers and make the correct answers rank lower in the top-k candidate answers. The human evaluation results show that participants with different genders, ages, and mother tongues can understand the meaning of most of the generated adversarial examples, which means that the generated adversarial examples do not affect human understanding.


Introduction
In recent years, artificial intelligence (AI) has developed rapidly in both its techniques and applications. A typical application of AI is the natural language processing (NLP)-based intelligent question and answer (Q&A) robot [1], which is used not only in general applications but also in professional business or government applications. Recently, many companies have developed their own Q&A robots and put them on the market, such as Google Assistant [2], Cortana [3], Siri [4], Alexa [5], and Watson [6]. Unlike search engines (e.g., Google and Baidu) that provide a ranked list of relevant web documents to the user, the task of an intelligent Q&A robot is to give the user a precise and concise answer in several interactions with the user [7]. In general, a Q&A robot has the following two features: (1) users can query the Q&A robot in natural language and (2) the answer returned by the Q&A robot is directly the answer that the user needs, instead of a ranked list of relevant documents.
Many current Q&A robots use NLP models to understand users' questions and return answers [8]. However, NLP models still have some shortcomings. For example, studies show that NLP models are not robust enough [9]: a small typo in the user's input may cause the NLP models to fail to process the question. Besides, the NLP models used in Q&A robots may not truly understand the semantics of the user's question [10], which causes the Q&A robots to give irrelevant answers. Moreover, NLP models are also vulnerable to adversarial example attacks [11]. These shortcomings of the NLP models affect the robustness and security of current Q&A robots and lead to a poor user experience.
To date, there have been some research studies on the robustness and security of machine learning models, such as [12][13][14][15][16], but little research has been done on the robustness and security issues of these ubiquitous Q&A robots. Motivated by these issues, in this paper, we propose a fast and automatic test dataset generation method to evaluate the robustness of Q&A robots by crafting adversarial questions. Although the proposed method makes only minor modifications to the original questions, these carefully constructed adversarial questions can easily make state-of-the-art Q&A robots answer incorrectly. Moreover, the generated adversarial questions are quite similar to the original questions and thus do not affect human understanding of them. There are some adversarial example generation methods for text classifiers in the literature, such as [17][18][19][20][21][22]. However, these methods are not suitable for Q&A robots, for the following reasons. (1) The application scenarios of text classifiers and Q&A robots are different. Text classifiers are applied to spam filtering, sentiment analysis, fake news detection, and so on. Q&A robots are applied to intelligent customer service, smart home service, professional question answering, information query, and so on. (2) The techniques adopted by text classifiers and Q&A robots are also different. Text classifiers use a single NLP model to perform classification tasks. However, since the tasks performed by Q&A robots are more complex, Q&A robots use multiple different NLP models in the entire process of understanding questions and searching for answers. (3) The adversarial example generation methods for text classifiers are based on a single target model, and some of them require specific knowledge of the target model, such as [17,18,21]. However, for Q&A robots, attackers cannot obtain such specific knowledge in most cases. Therefore, it is more difficult to generate adversarial examples for Q&A robots than for text classifiers. In this paper, the proposed method determines the important words of the questions and modifies them slightly, which does not require the design information of the target Q&A robot and thus applies to a wide range of Q&A robots.
The proposed method first exploits a dependency parser to extract keywords from the original question. Then, the adversarial words of the keywords are generated. The adversarial words contain three types of words: typos of keywords, words that are spelled similarly to the keywords, and typos of these spell-similar words. The spell-similar words are obtained by searching an English dictionary for words that satisfy the proposed three constraints. Typos of keywords and typos of spell-similar words are common misspellings, which are obtained by querying the typos corpus and discarding those typos that have a large edit distance from the keywords. Finally, the keywords in the original question are replaced by adversarial words to generate a large number of adversarial questions. In the experiment, the adversarial examples are generated from the WebQuestionsSP, CuratedTREC, and WikiMovies Q&A datasets. Two state-of-the-art Q&A robots, DrQA and Google Assistant, are used to evaluate the success rate of the proposed method.
The experimental results on DrQA and Google Assistant show that the generated adversarial examples cause the Q&A robots to answer incorrectly with a high success rate. The experimental results in terms of recall@k ($R_n@k$) [23], mean reciprocal rank (MRR) [24], and mean average precision (MAP) [24] further show that the generated adversarial examples also affect the top-k candidate answers returned by DrQA. The adversarial examples result in fewer correct answers in the top-k candidate answers and make the correct answers rank lower in the top-k candidate answers. Besides, we invite participants with different genders, ages, and mother tongues to evaluate the quality of the generated adversarial examples. The human evaluation results show that different participants can understand the meaning of most of the adversarial examples generated by the proposed method.
The main contributions of this paper are as follows:
(i) Many previous text adversarial example generation methods require knowledge of the target model to determine the important parts of a text sequence, which are then modified to generate adversarial examples. In contrast, our proposed keywords extraction algorithm can determine the important parts of a question without knowledge of the design information of the Q&A robots. Therefore, the proposed method can work in black-box situations. Moreover, to the best of the authors' knowledge, this is the first adversarial examples generation method for intelligent Q&A robots and also the first automatic test dataset generation method for the robustness and security evaluation of Q&A robots.
(ii) The proposed algorithm first extracts keywords from a given question and then generates adversarial words that are similar to the extracted keywords. These words are used to replace the corresponding words in the original question to generate a large number of adversarial examples. Since the differences between the generated adversarial questions and the original question are inconspicuous, humans hardly notice these adversarial words when reading the question. Human evaluation by participants with different genders, ages, and mother tongues shows that they have no trouble understanding the generated adversarial questions.

Related Work
Generally, there are three kinds of working mechanisms used by current Q&A robots: using the knowledge base (KB), using information retrieval (IR), and using both KB and IR.
The KB-based Q&A robots transform a question into a standard structured query through semantic parsing and then get the answer from the KB [25]. The key step for this type of Q&A robot is to transform the user's natural language questions into standard structured query languages [25]. Currently, many Q&A robots use machine learning techniques to understand the semantics of the questions, such as [25][26][27]. In [25], Yih et al. used an entity linking system and a deep convolutional neural network model for question answering. Yin et al. [26] proposed an end-to-end neural network model to generate answers. IR-based Q&A robots, such as [28][29][30], retrieve unstructured text documents and extract relevant answers from these documents.
DrQA, which is developed by Facebook [31], is a Q&A model that answers questions by retrieving and reading unstructured knowledge. DrQA uses Wikipedia as its unique knowledge source and uses a recurrent neural network (RNN) model to extract answers from relevant articles [28]. Some Q&A robots, such as YodaQA [32], QuASE [33], and Watson [34], combine KB and IR techniques to answer questions. Baudiš [32] proposed a Q&A framework named YodaQA. YodaQA searches unstructured and structured knowledge and then uses a classifier to determine the best matching answer. Sun et al. [33] proposed the QuASE system for open-domain question answering, which searches for answers directly from the web and uses a knowledge base to further improve the accuracy of answering questions.
Although different Q&A robots have different working mechanisms, many current Q&A robots use NLP models when processing users' questions and searching for the correct answers [35]. Unfortunately, NLP models are vulnerable to adversarial examples, which are inputs carefully designed by an attacker to cause the model to produce erroneous outputs [36]. Recently, some adversarial example generation methods have been proposed for NLP tasks, including text classification, machine translation, and reading comprehension. For instance, in [17][18][19], the authors search for the part of a text sequence that is most important to the text classifier and then make slight modifications to this part to generate adversarial examples. These modifications include insertion, substitution, removal, etc. When targeting machine translation models, Ebrahimi et al. and Belinkov and Bisk [37,38] used noisy texts to generate adversarial examples, which make the machine translation results change greatly. When targeting reading comprehension systems, Jia and Liang [10] added irrelevant sentences to the input to fool the reading comprehension system. To the best of the authors' knowledge, there is no research on adversarial examples generation for intelligent Q&A robots. However, since many NLP models are applied to Q&A robots, Q&A robots also face the threat of adversarial examples in practice. For example, when interacting with a Q&A robot, the user often misspells words in a question, which can cause the Q&A robot to return a wrong or irrelevant answer.
In this paper, the proposed adversarial examples generation method for Q&A robots slightly modifies an important part of a question. Compared with other methods, the difference between the generated adversarial examples and the original question is less conspicuous, and the proposed method can work in black-box scenarios. This minor modification hardly changes the semantics of the original question. Even if the semantics of a single word changes, people can still infer the semantics from the context of the question. Human evaluation experiments show that humans can understand the original meaning of the generated adversarial examples. In addition, the proposed method exploits a dependency parser to determine the important parts of a question without knowledge of the design information of the Q&A robot. Therefore, it can be applied to various Q&A robots under black-box scenarios.

The Proposed DPAEG Method
3.1. Overall Procedure. In this section, we elaborate on the proposed dependency parse-based adversarial examples generation (DPAEG) method. DPAEG replaces an important part of the original question with typos or words that are spelled similarly. The framework of the proposed adversarial examples generation method is shown in Figure 1. There are four stages in the proposed method to craft adversarial examples. First, the proposed method preprocesses the questions from Q&A datasets by removing the original questions that the target Q&A robot cannot answer correctly. That is, only the original questions in the Q&A dataset that the Q&A robot can answer correctly are retained for adversarial examples generation. Second, the proposed dependency parse-based keywords extraction algorithm is used to extract keywords from the original questions. Third, the proposed adversarial words generation algorithm is used to slightly modify the keywords of a question, which includes three types of modifications: typos of keywords, spell-similar words, and typos of these spell-similar words. Specifically, words that are spelled similarly to the keywords are determined by searching a dictionary according to the proposed constraints. The typos of keywords and the typos of these spell-similar words are determined from the typos corpus according to the edit distance settings. Finally, the keywords in the original question are replaced by adversarial words to generate a large number of adversarial questions. The detailed process of each stage is described in the following sections.

Security and Communication Networks
3.2. Preprocess: Questions Filtering. For any given question, the proposed method is able to generate a large number of adversarial questions. In the experiment, three standard Q&A datasets (WebQuestionsSP [39], CuratedTREC [40], and WikiMovies [41]) are used to provide original questions. Questions from any other source can be used as well. Since the target Q&A robot cannot correctly answer all the original questions in these three datasets, it is meaningless to generate adversarial examples from those original questions that the Q&A robot cannot answer. Therefore, a preprocessing operation is applied to the Q&A datasets, and the original questions that the target Q&A robot cannot answer correctly are removed. The remaining questions that the target Q&A robot can answer correctly are used to generate adversarial examples.
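The filtering step can be illustrated with a minimal sketch. The function `filter_answerable`, the callable `ask`, and the toy robot below are hypothetical names introduced only for illustration; a real setup would wrap the target Q&A robot behind `ask` and would likely need a more tolerant answer-matching rule than the exact, case-insensitive comparison assumed here.

```python
from __future__ import annotations

from typing import Callable, Iterable


def filter_answerable(
    qa_pairs: Iterable[tuple[str, set[str]]],
    ask: Callable[[str], str],
) -> list[tuple[str, set[str]]]:
    """Keep only the questions that the target robot already answers correctly."""
    kept = []
    for question, gold_answers in qa_pairs:
        predicted = ask(question).strip().lower()
        # Assumed matching rule: exact, case-insensitive match with a gold answer.
        if predicted in {answer.lower() for answer in gold_answers}:
            kept.append((question, gold_answers))
    return kept


def toy_robot(question: str) -> str:
    """Toy stand-in for a black-box Q&A robot (hypothetical, illustration only)."""
    canned = {"Who played the voice of Aladdin": "Scott Weinger"}
    return canned.get(question, "I don't know")


dataset = [
    ("Who played the voice of Aladdin", {"Scott Weinger"}),
    ("Who directed Aladdin", {"Ron Clements", "John Musker"}),
]
retained = filter_answerable(dataset, toy_robot)
print(retained)
```

Only the first question is retained in this toy run, because the stand-in robot has no answer for the second one; adversarial examples would then be generated from the retained questions only.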

3.3. Keywords Extraction Based on Dependency Parse. The proposed method extracts keywords according to the importance of words in a question. Generally, if modifying or removing a word in a question causes a significant change in the answer given by a Q&A robot, it indicates that this word is important for the Q&A robot to correctly understand and answer the question. However, since the Q&A robot is a black box for attackers, it is difficult to determine the important part of a question according to the Q&A robot except through repeated interactions with the Q&A robot. To solve this problem, the proposed keywords extraction algorithm identifies the important parts of a question according to the dependency relations of the question and thus can work in a black-box scenario without interactions with Q&A robots. Note that the extracted keywords are determined by the dependencies between the words in the current sentence. If the same word has different dependencies in different sentences, the importance of the word in those sentences will differ.
The dependency relation is a way of describing the grammatical structure of a sentence, which represents the grammatical relations between the words in a sentence [42]. Generally, a dependency parser converts a sentence into a dependency tree. The root of the tree is called the head of the sentence, which does not modify any word [42]. An example of a dependency parse for the sentence "Who played the voice of Aladdin" is shown in Figure 2. The root of the dependency tree is "played." Each arrow represents the dependency relation between two parts. For instance, the dependency relation between "Who" and "played" is the nsubj relation, which means that "Who" is the nominal subject (nsubj) of "played." Similarly, "the voice" is the direct object (dobj) of "played," "of" is the prepositional modifier (prep) of "the voice," and "Aladdin" is the prepositional object (pobj) of "of." The dependency relations of a sentence can be divided into aux (auxiliary), arg (argument), and mod (modifier) [43]. These relations can be further divided into 48 different grammatical relations. In order to extract the important parts of an input question, the proposed keywords extraction method only focuses on words that satisfy the following rule: the dependency relation between the word and the head of the sentence is in the arg relation set ($R_{arg}$). The dependency relations contained in the arg relation set are shown in Figure 3 [43].
The proposed keywords extraction algorithm is shown in Algorithm 1. Firstly, a dependency parser is used to extract a dependency tree from the question $q_{ori}$. The dependency parser used in this method is the one provided by spaCy (https://spacy.io), which is a natural language processing tool. spaCy uses a transition-based parser to extract dependencies [44], and the process of extracting the dependency relations of a question is summarized as follows. Initially, the parser has an empty stack and a buffer, where the original question is in the buffer [44]. Then, the parser uses the shift and reduce operations to control the state of the stack and the buffer [44]. The shift operation moves a word from the buffer to the top of the stack, while the reduce operation pops the top two words from the stack and determines the dependency relation between these two words [44]. The shift and reduce operations are repeated until the stack and buffer are empty. As a result, the dependency relations of the question are obtained, which are represented as a dependency tree [44]. All nodes of the dependency tree are denoted by $T = \{(w_i, h_i, r_i) \mid w_i, h_i \in q_{ori}\}$, where $w_i$ is the word on the ith node of the tree, $h_i$ is the word on the parent node of the ith node, and $r_i$ is the dependency relation between $w_i$ and $h_i$.
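To make the shift and reduce operations described above concrete, the following toy sketch replays a hand-written transition sequence for the example sentence and collects the resulting $(w_i, h_i, r_i)$ triples. It is not spaCy's implementation: in a real transition-based parser a trained model chooses each transition, whereas here the sequence is supplied by hand. The function name `parse_with_transitions`, the split of the reduce operation into `left_arc` and `right_arc` (as in the arc-standard system), and the use of `None` as the head of the root are assumptions made for illustration.

```python
def parse_with_transitions(words, transitions):
    """Replay a given transition sequence and return (word, head, relation) triples."""
    stack, buffer, triples = [], list(words), []
    for action, relation in transitions:
        if action == "shift":
            # Move the next word from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif action == "left_arc":
            # The top of the stack is the head of the word just below it.
            head = stack[-1]
            dependent = stack.pop(-2)
            triples.append((dependent, head, relation))
        elif action == "right_arc":
            # The word just below the top of the stack is the head of the top.
            dependent = stack.pop()
            head = stack[-1]
            triples.append((dependent, head, relation))
        else:
            raise ValueError(f"unknown action: {action}")
    if buffer or len(stack) != 1:
        raise ValueError("transition sequence did not leave a single root")
    triples.append((stack.pop(), None, "ROOT"))
    return triples


words = ["Who", "played", "the", "voice", "of", "Aladdin"]
# Hand-written transitions that reproduce the parse described for Figure 2.
transitions = [
    ("shift", None), ("shift", None), ("left_arc", "nsubj"),
    ("shift", None), ("shift", None), ("left_arc", "det"),
    ("shift", None), ("shift", None), ("right_arc", "pobj"),
    ("right_arc", "prep"), ("right_arc", "dobj"),
]
triples = parse_with_transitions(words, transitions)
for triple in triples:
    print(triple)
```

The last triple marks "played" as the root, and the remaining triples match the nsubj, dobj, prep, and pobj relations of the example sentence.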
Then, if the root of the dependency tree is a content word, the root is added to the keyword set K. For each child node of the root, if the child node satisfies the following two conditions, the word on the child node is also added to the keyword set K. The two conditions are (1) the dependency relation between the child node and the root is in the arg relation set $R_{arg}$ and (2) the word on the child node is a content word. Finally, if the question $q_{ori}$ contains a clause, the root of the dependency tree is first replaced by the head of the clause. Then, the keywords are extracted from the clause in the same way. After extracting the keywords, the important parts of the question $q_{ori}$ are determined. The extracted keywords are denoted as $K = \{k_1, k_2, k_3, \ldots, k_p\}$, where p is the number of keywords.
Since the function words in a question have little effect on the answer returned by the Q&A robot, function words are not used as keywords in the proposed method. Rather than using all content words as keywords, the proposed algorithm uses only the content words that have a greater influence on whether the Q&A robot returns the correct answer. In Section 4.4, we compare the performance of the proposed keywords extraction method with a content words extraction method that selects all content words from the question as the keywords.

Input: original question q_ori
Output: keywords set K
Initialize a keywords set K, a stack S, and a word P
{(w_i, h_i, r_i) | w_i, h_i ∈ q_ori} ← dependency parser(q_ori)
Push the head of the question q_ori into the stack S
while S is not empty do
    Pop the top of the stack S to the word P
    if P is a content word then
        Add P to keyword set K
    end if
    for w_j ∈ child nodes of P do
        if r_j ∈ R_arg and w_j is a content word then
            Add w_j to keyword set K
        end if
        if w_j is modified by a clause then
            Push the head of the clause into the stack S
        end if
    end for
end while
return keywords set K
ALGORITHM 1: Keywords extraction algorithm.
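Algorithm 1 can be sketched in a few lines over a hand-built dependency tree. Several details below are assumptions rather than the authors' definitions: the membership of `R_ARG` (taken from the argument relations of the Stanford typed dependencies hierarchy, since Figure 3 is not reproduced here), the POS-based notion of a content word in `CONTENT_POS`, and the relations in `CLAUSE_RELS` used to detect that a word is modified by a clause. The `Node` class and the hand-assigned tags are likewise hypothetical; in practice they would come from a dependency parser.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Assumed contents of the arg relation set R_arg (Figure 3 is not shown here).
R_ARG = {"agent", "acomp", "ccomp", "xcomp", "dobj", "iobj", "pobj",
         "nsubj", "nsubjpass", "csubj", "csubjpass"}
# Assumed definition of a content word, by coarse POS tag.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
# Assumed relations that mark a child as the head of a modifying clause.
CLAUSE_RELS = {"relcl", "acl", "advcl"}


@dataclass
class Node:
    word: str
    pos: str
    rel: str = "ROOT"
    children: list[Node] = field(default_factory=list)


def extract_keywords(root: Node) -> list[str]:
    """Stack-based keyword extraction following the steps of Algorithm 1."""
    keywords: list[str] = []
    stack = [root]
    while stack:
        head = stack.pop()
        if head.pos in CONTENT_POS and head.word not in keywords:
            keywords.append(head.word)
        for child in head.children:
            if (child.rel in R_ARG and child.pos in CONTENT_POS
                    and child.word not in keywords):
                keywords.append(child.word)
            # If the child is modified by a clause, continue from the clause head.
            for grandchild in child.children:
                if grandchild.rel in CLAUSE_RELS:
                    stack.append(grandchild)
    return keywords


# Hand-built tree for "Who played the voice of Aladdin" (tags assigned by hand).
question = Node("played", "VERB", "ROOT", [
    Node("Who", "PRON", "nsubj"),
    Node("voice", "NOUN", "dobj", [
        Node("the", "DET", "det"),
        Node("of", "ADP", "prep", [Node("Aladdin", "PROPN", "pobj")]),
    ]),
])
print(extract_keywords(question))  # ['played', 'voice']
```

Under this literal reading of Algorithm 1 and the assumed tag sets, "Aladdin" is not selected because it is not a direct child of the root, and "Who" is skipped because a pronoun is not treated as a content word here; a different content-word definition or relation set would change the result.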

3.4. Adversarial Words Generation Based on Extracted Keywords. To mislead a Q&A robot, the input questions are slightly modified to generate adversarial examples. The difference between the original question and the modified question should be as small as possible so that humans have no trouble understanding the modified questions. To this end, the proposed method generates adversarial words that are similar to the extracted keywords. These adversarial words are used to modify the corresponding keywords in the original question. The proposed adversarial word generation method is shown in Algorithm 2, which generates three types of adversarial words: typos of keywords, words that are spelled similarly to the keyword, and typos of these spell-similar words.
We describe Algorithm 2 in detail as follows. The algorithm first determines typos of the keyword k from the typos corpus. If the edit distance between the keyword k and a typo of the keyword k is less than or equal to 2, the typo is added to the adversarial words set $W_{adv}$. The adopted typos corpus is publicly available in [45] and contains the Birkbeck typos corpus [46], the Holbrook typos corpus [47], the Aspell typos corpus [48], and the Wikipedia typos corpus [49].
Words that are spelled similarly to the keyword are determined by searching a dictionary according to the proposed constraints. The dictionary contains common English words [50], which are divided into 26 subdictionaries according to their initial letters. Firstly, the subdictionary $SD_i$ is determined according to the initial letter of the keyword, so that all words in $SD_i$ have the same initial letter as the keyword. Then, if a word w in the subdictionary $SD_i$ satisfies the proposed constraints, the word w is added to the corresponding similar word set KS. The proposed constraints are as follows: (i) The edit distance between the word w and the keyword k is less than or equal to a predefined edit distance d. (ii) The part of speech (POS) of the word w is the same as the POS of the keyword k. (iii) The first letter of the word w is the same as the first letter of the keyword k. Similarly, the last letter of the word w is the same as the last letter of the keyword k.
The first constraint identifies words that are spelled similarly to the keyword. The purpose of the second constraint is to increase the success rate of adversarial attacks. The effect of the second constraint on the success rate of the generated adversarial examples is demonstrated in Section 4.3.1. The reasons behind the third constraint are as follows. On the one hand, inspired by [38], keeping the first and last letters of the word unchanged makes it easier for humans to recognize the original form of the modified word. On the other hand, sufficient similar words can be found in a single subdictionary. Hence, it is unnecessary to spend more time searching for more similar words in other subdictionaries. This constraint lets the algorithm search only one of the 26 subdictionaries, which effectively reduces the number of searches and thus improves the search efficiency.
The Damerau-Levenshtein distance [51,52] is used to evaluate the edit distance between two words. For the keyword k and a word w in the dictionary, the Damerau-Levenshtein distance between them, dis(k, w), is the minimum number of character operations required to convert the keyword k to the word w. Character operations include inserting, deleting, or replacing a single character and transposing two adjacent characters [53]. In order to search for appropriate similar words, we set different predefined edit distances according to the POS of the keyword. The value of d is determined by the following rule:

$$d = \begin{cases} 1, & \text{if the POS of } k \text{ is a verb}, \\ \max\{1, \lfloor (\text{length}(k) - 2)/2 \rfloor\}, & \text{otherwise}, \end{cases}$$

where length(k) is the length of the keyword k and the max function ensures that the distance d is not less than 1. That is, if the POS of the keyword k is a verb, the distance d is set to 1. Otherwise, the distance d is set to $\max\{1, \lfloor (\text{length}(k) - 2)/2 \rfloor\}$. The reason for setting different predefined edit distances d for the verb and the other words in a sentence is as follows. A verb is an important part of a sentence. If the difference between the verb of the modified sentence and that of the original sentence is too large, it may affect human understanding of the modified sentence. Hence, these predefined distance settings ensure that the edit distance between the verb of the adversarial example and the verb of the original question is small so that people have no difficulty in understanding the generated adversarial examples.
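The distance, the rule for d, and the three constraints on spell-similar words can be combined in a short sketch. The distance below is the restricted Damerau-Levenshtein variant (optimal string alignment), which is one common way to support adjacent transpositions; whether the authors use this variant or the unrestricted one is not stated, so this is an assumption. The function names, the POS tag strings, the toy subdictionary with hand-assigned POS tags, and the exclusion of the keyword itself from the result are also hypothetical choices made for illustration.

```python
def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def predefined_distance(keyword: str, pos: str) -> int:
    """Rule for d: 1 for verbs, max{1, floor((length(k) - 2) / 2)} otherwise."""
    if pos == "VERB":
        return 1
    return max(1, (len(keyword) - 2) // 2)


def spell_similar(keyword: str, pos: str, subdictionary: dict) -> list:
    """Return the words of the subdictionary that satisfy constraints (i)-(iii)."""
    d = predefined_distance(keyword, pos)
    return [
        w for w, w_pos in subdictionary.items()
        if w != keyword                                   # skip the keyword itself
        and w[0] == keyword[0] and w[-1] == keyword[-1]   # constraint (iii)
        and w_pos == pos                                  # constraint (ii)
        and dl_distance(keyword, w) <= d                  # constraint (i)
    ]


# Toy subdictionary for the initial letter "v" (words and tags are illustrative).
sd_v = {"voice": "NOUN", "vice": "NOUN", "voile": "NOUN",
        "vote": "NOUN", "venue": "NOUN", "voiced": "ADJ"}
similar = spell_similar("voice", "NOUN", sd_v)
print(similar)  # ['vice', 'voile']
```

For the noun "voice" the rule gives d = 1, so "vote" (distance 2) and "venue" (distance 3) are rejected by constraint (i), while "voiced" fails the POS and last-letter constraints.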
After searching for words that are similar to the keyword, the adversarial words are generated based on these similar words. For each word $ks_i$ in the similar word set KS, if the edit distance between the keyword k and $ks_i$ is less than or equal to 2, the word $ks_i$ is added directly to the adversarial word set $W_{adv}$. Otherwise, the algorithm searches for typos of the word $ks_i$. If there are typos of $ks_i$ in the typos corpus and the edit distances between these typos and the keyword are less than or equal to 2, the typos of $ks_i$ are added to the adversarial words set $W_{adv}$. Finally, for each keyword in the question, a corresponding adversarial word set is obtained.

3.5. Adversarial Questions Generation. For each keyword, the corresponding adversarial words are generated. These adversarial words are used to replace the corresponding keywords in the original question to generate multiple adversarial questions. However, if too many keywords are replaced in the original question, humans cannot infer the semantics from the context of the question and may have trouble understanding the generated adversarial examples. Hence, in order to prevent too many keywords from being modified in the original question, the following criterion is applied to select the appropriate adversarial questions:

$$\text{Add } q_{adv} \text{ to } Q_{adv}, \quad \text{if } \operatorname{dis}(q_{ori}, q_{adv}) < \epsilon,$$

where $q_{adv}$ is the generated adversarial question, $Q_{adv}$ is the adversarial questions set, $\operatorname{dis}(q_{ori}, q_{adv})$ is the edit distance between the original question $q_{ori}$ and the generated question $q_{adv}$, and $\epsilon$ is a predefined threshold that represents the maximum edit distance between the original question $q_{ori}$ and the generated question $q_{adv}$. The distance $\operatorname{dis}(q_{ori}, q_{adv})$ limits not only the number of modified words in the entire question but also the degree of modification of a single word. If $\operatorname{dis}(q_{ori}, q_{adv})$ is smaller than the maximum edit distance $\epsilon$, the adversarial question is added to the adversarial questions set $Q_{adv}$. Otherwise, the adversarial question is discarded. Finally, for each original question, a corresponding adversarial question set is generated.

The time complexity of the proposed adversarial examples generation algorithm is analyzed as follows. The algorithm consists of three parts: keywords extraction, adversarial words generation, and adversarial questions generation. Assume that there are n words in the input question. The time complexity of keywords extraction and adversarial questions generation is $\Theta(n)$. For the adversarial words generation algorithm, the main time overhead is the search for similar words. Assume that there are m words in a subdictionary. For a given keyword, it only needs to perform m comparisons to determine the spell-similar words. If the time cost of each comparison is $c_1$ and the time cost of determining the typos of the word is $c_2$, the runtime of the adversarial words generation algorithm for a keyword is approximately

$$c_1 m + c_2,$$

which does not depend on n, so the adversarial words of a keyword are generated in constant time. Since a question with n words contains at most n keywords, this means that the time complexity of the adversarial words generation algorithm is also $\Theta(n)$. Therefore, the time complexity of the proposed adversarial example generation algorithm is $\Theta(n)$. This indicates that the proposed method scales well and can generate adversarial examples efficiently for large datasets.
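The selection criterion for adversarial questions can be sketched as follows. The sketch enumerates every combination of keyword replacements and keeps the candidates whose edit distance to the original question is below the threshold; this exhaustive enumeration is only one possible reading of the generation step, it grows quickly with the number of keywords, and it is not meant to reflect the complexity analysis above. The helper `osa_distance` (a restricted Damerau-Levenshtein distance), the function `generate_adversarial_questions`, whitespace tokenization, and the toy adversarial word sets are all assumptions made for illustration.

```python
from itertools import product


def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertion, deletion, replacement, adjacent transposition."""
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def generate_adversarial_questions(question, adversarial_words, epsilon):
    """Replace keywords by adversarial words; keep candidates with dis < epsilon."""
    tokens = question.split()
    # Each token either stays unchanged or is replaced by one of its adversarial words.
    options = [[t] + sorted(adversarial_words.get(t, [])) for t in tokens]
    accepted = []
    for combination in product(*options):
        candidate = " ".join(combination)
        if candidate != question and osa_distance(question, candidate) < epsilon:
            accepted.append(candidate)
    return accepted


q_ori = "Who played the voice of Aladdin"
# Toy adversarial word sets W_adv for two keywords (illustrative values only).
w_adv = {"played": ["prayed", "palyed"], "voice": ["vice"]}
q_adv_set = generate_adversarial_questions(q_ori, w_adv, epsilon=2)
for q in q_adv_set:
    print(q)
```

With the threshold set to 2, only questions that differ from the original by a single character operation are kept, so candidates that modify both keywords at once are discarded; a larger threshold would admit them.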

Experimental Evaluation
Since this is the first work on robustness and security issues of Q&A robots (no comparison works are available), we use two top Q&A robots and human evaluations to evaluate the proposed method. First, the experimental setup is presented in Section 4.1. In Section 4.2, we use multiple metrics to evaluate the impact of the generated adversarial examples on Q&A robots. Besides, we invite participants to subjectively evaluate the quality of the generated adversarial examples. In Section 4.3, the effects of different parameter settings on the performance of the proposed method are evaluated, which include the POS constraints of similar words and the maximum edit distance. In Section 4.4, the performance of the proposed method is further evaluated from two aspects: the proposed keywords extraction algorithm and the proposed keywords modification algorithm.

Experimental Setup
Input: keyword k
Output: adversarial word set W_adv
// Query typos of keyword k
if there are typos of keyword k in the typos corpus and dis(k, typos of k) ≤ 2 then
    Add typos of keyword k to W_adv
end if
// Search for spell-similar words
Set the value of d according to the POS of k
Determine the subdictionary SD_i according to the initial letter of the keyword k
for w ∈ SD_i do
    if the word w satisfies the three constraints then
        Add similar word w to similar word set KS
    end if
end for
// Query typos of spell-similar words
for ks_i ∈ KS do
    if dis(k, ks_i) ≤ 2 then
        Add ks_i to W_adv
    else
        if there are typos of ks_i in the typos corpus and dis(k, typos of ks_i) ≤ 2 then
            Add typos of ks_i to W_adv
        end if
    end if
end for
return adversarial words set W_adv
ALGORITHM 2: Adversarial words generation algorithm.

4.1.1. Datasets. In the experiment, three standard Q&A datasets, WebQuestionsSP [39], CuratedTREC [40], and WikiMovies [41], are used to generate adversarial questions.
The information on the three datasets is as follows:
(i) WebQuestionsSP: this dataset, created by Yih et al. [39], contains semantic parses for the questions from the WebQuestions dataset. There are 4737 questions in the WebQuestionsSP dataset.
(ii) CuratedTREC: this dataset was collected by Baudiš and Šedivý [40] based on the Text REtrieval Conference (TREC) [54] corpus and consists of 2180 questions extracted from the TREC1999, TREC2000, TREC2001, and TREC2002 datasets.
(iii) WikiMovies: this dataset, constructed by Miller et al. [41], consists of question-answer pairs in the field of movies. The WikiMovies dataset contains a training set, a development set, and a test set, which contain 96k, 10k, and 10k examples, respectively [41]. In the experiment, we use the test set to generate the adversarial examples.
e adversarial examples generated from these standard Q&A datasets can form a new adversarial questions dataset.Unlike these standard Q&A datasets which are used to evaluate the ability of Q&A robot to answer questions, the generated adversarial questions dataset is used to evaluate the robustness of Q&A robots when facing typos and misspellings and evaluate Q&A robots' understanding of sentence semantics.In other words, if the Q&A robot cannot answer the question in the standard Q&A datasets, it means that the Q&A robot does not have the answer to the question.Unlike this, if the Q&A robot cannot answer the adversarial question in the generated adversarial questions dataset, it indicates that the Q&A robot has the answer to the original question, but it cannot process the perturbation in the adversarial question.
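The control flow of Algorithm 2 can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes that dis(·, ·) is the Levenshtein edit distance, that the typos corpus is a mapping from a word to its known typos, and that the subdictionaries are keyed by the first letter of a word. The three constraints and the POS-dependent value d are defined in Section 3.4, so here they are folded into a caller-supplied predicate `satisfies_constraints`, which is a hypothetical name, as are all other identifiers below.

```python
from typing import Callable, Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an assumed stand-in for dis(., .)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def generate_adversarial_words(
    keyword: str,
    typos_corpus: Dict[str, List[str]],
    subdictionaries: Dict[str, List[str]],
    satisfies_constraints: Callable[[str, str], bool],
    max_dis: int = 2,
) -> Set[str]:
    """Sketch of Algorithm 2; every name here is illustrative."""
    w_adv: Set[str] = set()

    # Query typos of keyword k.
    for typo in typos_corpus.get(keyword, []):
        if edit_distance(keyword, typo) <= max_dis:
            w_adv.add(typo)

    # Search for spell-similar words in the subdictionary selected by the
    # keyword's initial. Skipping the keyword itself is an added assumption.
    sub_dict = subdictionaries.get(keyword[:1].lower(), [])
    similar = [w for w in sub_dict
               if w != keyword and satisfies_constraints(keyword, w)]

    # Query typos of spell-similar words.
    for ks in similar:
        if edit_distance(keyword, ks) <= max_dis:
            w_adv.add(ks)
        else:
            for typo in typos_corpus.get(ks, []):
                if edit_distance(keyword, typo) <= max_dis:
                    w_adv.add(typo)
    return w_adv


# Toy data, invented for illustration only.
toy_typos = {"capital": ["captial"], "captain": ["capitan", "kaptain"]}
toy_dicts = {"c": ["capital", "capitol", "captain", "cat"]}
same_length = lambda k, w: len(k) == len(w)  # placeholder for the 3 constraints
print(sorted(generate_adversarial_words("capital", toy_typos, toy_dicts, same_length)))
```

In this toy run, a spell-similar word that is already within the distance bound is used directly, while a more distant one contributes only those of its typos that fall within the bound, which mirrors the if/else branch at the end of Algorithm 2.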

Target Q&A Robots.
To illustrate the feasibility of the proposed method, we calculate the success rate of the generated adversarial examples on two top Q&A robots, DrQA [28] and Google Assistant [2]. The information on the two target Q&A robots is as follows:
(i) DrQA is an open-domain question answering system based on Wikipedia, which consists of two components [28]: the document retriever module and the document reader module. The document retriever module searches the Wikipedia database for articles related to the question, and then the document reader module uses an RNN model to extract answers from the relevant articles. DrQA has good performance on multiple Q&A datasets. Therefore, DrQA is a good baseline to evaluate the performance of the proposed adversarial examples generation method.
(ii) Google Assistant [2] is an intelligent personal assistant designed by Google, which provides a question and answer service. Users can ask Google Assistant questions by voice or text. If Google Assistant can correctly answer the user's question, it directly returns the corresponding answer. Otherwise, it returns web search results related to the question [2]. In the experiment, we send questions to it in plain text and record the answers returned by Google Assistant. If the answer returned by Google Assistant is a web search result, we consider that Google Assistant cannot answer this question correctly.

Evaluation Metric.
Success rate [37] is used as the metric to evaluate the adversarial examples generated by the proposed algorithm. The success rate is the ratio of the questions that the Q&A robot answers incorrectly to all generated adversarial questions [37]. The higher the success rate of the generated adversarial examples is, the more effective the attack on the target Q&A robot is. Besides, we use three other metrics, recall@k (R_n@k) [23], mean reciprocal rank (MRR) [24], and mean average precision (MAP) [24], to evaluate the impact of adversarial examples on the top-k candidate answers returned by the Q&A robot. R_n@k [23] reflects whether the correct answer exists in the top-k candidate answers returned by the Q&A robot, where n is the number of relevant documents retrieved by the Q&A robot. As in [23, 55, 56], we use R_10@2 and R_10@5 as the evaluation metrics. MRR [24] reflects the position of the first correct answer in the top-k candidate answers returned by the Q&A robot. MAP [24] reflects the ranking of the correct answers in the top-k candidate answers returned by the Q&A robot. Note that Google Assistant returns only one answer or some webpages. On the one hand, these top-k related metrics cannot be calculated from a single returned answer. On the other hand, since the returned webpages are not specific answers, we also cannot calculate these metrics from the returned webpages. Hence, in this paper, these three top-k related metrics are used only to evaluate DrQA.
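The four metrics above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: each question is represented by a ranked list of booleans marking which of the returned candidates are correct, R_n@k is taken as a hit-or-miss indicator over the first k candidates (following the "whether the correct answer exists" description), and all function names and sample data are hypothetical.

```python
from typing import List, Sequence


def success_rate(answered_correctly: Sequence[bool]) -> float:
    """Fraction of adversarial questions that the robot answers incorrectly."""
    if not answered_correctly:
        return 0.0
    return sum(not ok for ok in answered_correctly) / len(answered_correctly)


def recall_at_k(relevance: Sequence[bool], k: int) -> float:
    """1.0 if any of the first k ranked candidates is correct, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def reciprocal_rank(relevance: Sequence[bool]) -> float:
    """1 / rank of the first correct candidate, or 0.0 if there is none."""
    for rank, is_correct in enumerate(relevance, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision@rank taken at every rank that holds a correct candidate."""
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(relevance, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_over_questions(scores: List[float]) -> float:
    """Average a per-question score over the whole question set."""
    return sum(scores) / len(scores) if scores else 0.0


# Invented rankings for three questions (True marks a correct candidate).
rankings = [
    [False, True, False],
    [False, False, False],
    [True, False, True],
]
r_at_2 = mean_over_questions([recall_at_k(r, 2) for r in rankings])
mrr = mean_over_questions([reciprocal_rank(r) for r in rankings])
map_score = mean_over_questions([average_precision(r) for r in rankings])
print(r_at_2, mrr, map_score)
```

With this representation, a question that DrQA answers correctly at rank 1 contributes 1 to both R_n@k and MRR, which is consistent with the original questions scoring 1 on those two metrics.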
To demonstrate that the adversarial questions generated by the proposed method do not affect human understanding, we invite a number of participants to evaluate whether they understand the meaning of the generated adversarial examples. We define a metric named comprehension rate. The comprehension rate of a participant is calculated as N_und/N_all, where N_und is the number of adversarial examples that the participant understands correctly and N_all is the number of all the evaluated adversarial examples.
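As a minimal sketch of how the comprehension rate N_und/N_all could be computed and then summarized per participant group (minimum, maximum, and average), consider the following Python fragment. The record layout, the function names, and the sample judgments are assumptions made for illustration and are not taken from the experiment.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def comprehension_rate(understood: Sequence[bool]) -> float:
    """N_und / N_all for one participant."""
    if not understood:
        raise ValueError("at least one evaluated example is required")
    return sum(understood) / len(understood)


def summarize_by_factor(
    participants: List[dict], factor: str
) -> Dict[str, Tuple[float, float, float]]:
    """Group participants by one factor and return (min, max, mean) per group."""
    groups: Dict[str, List[float]] = {}
    for person in participants:
        rate = comprehension_rate(person["understood"])
        groups.setdefault(person[factor], []).append(rate)
    return {name: (min(r), max(r), mean(r)) for name, r in groups.items()}


# Hypothetical judgments: one boolean per evaluated adversarial question.
sample_participants = [
    {"gender": "female", "understood": [True, True, False, True]},
    {"gender": "male", "understood": [True, True, True, True]},
    {"gender": "male", "understood": [True, False, True, False]},
]
print(summarize_by_factor(sample_participants, "gender"))
```

The same grouping function can be reused for the other subjective factors by adding the corresponding keys (for example, age group or mother tongue) to each record.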

Experimental Results on Q&A Robots.
Table 1 shows three samples generated by the proposed method. The underlined letters represent the difference between the generated adversarial example and the original question. In the first example, the keyword in the original question is replaced by a typo of the keyword. In the second example, the keyword is replaced by a word that is spelled similarly to the keyword. In the third example, the keyword is replaced by a typo of a spell-similar word. Only one or two characters in the original question are modified, but the answers given by the Q&A robots are very different from the answer to the original question. Table 2 presents the success rate of the adversarial examples generated from the three datasets on the two target Q&A robots. The maximum edit distance ϵ is set to 4 in this experiment. The generated adversarial examples have a high success rate on DrQA. In other words, for most adversarial examples, DrQA cannot return the correct answers. Compared with DrQA, Google Assistant is more robust to the generated adversarial questions. However, Google Assistant still cannot correctly answer about half of the adversarial questions. Therefore, the generated adversarial examples can mislead the target Q&A robot's understanding of questions, resulting in a low accuracy of answering questions.
Besides, we use the metrics R_10@2 and R_10@5 [23], MRR [24], and MAP [24] to evaluate the impact of adversarial questions on the top-k candidate answers returned by DrQA. Table 3 shows the performance of DrQA when answering original questions and adversarial questions in terms of R_10@2, R_10@5, MRR, and MAP. Note that since we only use questions that DrQA can answer correctly to generate adversarial questions (as discussed in Section 3.2), the R_n@k and MRR scores of DrQA on the original questions are 1. The scores of these metrics are very low on the adversarial questions, which indicates that the adversarial questions affect not only the correct answer (top-1) returned by DrQA but also the top-k candidate answers returned by DrQA. Specifically, the R_10@2 and R_10@5 scores indicate that the top-k answers returned by DrQA contain fewer correct answers for adversarial questions than for original questions. The MRR score indicates that the adversarial examples make the first correct answer rank lower in the top-k candidate answers returned by DrQA. The MAP score of DrQA on adversarial questions is much lower than its MAP score on original questions, which indicates that the generated adversarial questions make all correct answers rank significantly lower in the top-k candidate answers returned by DrQA.
In the practical use of Q&A robots, different users may use different expressions to describe the same meaning of a question. Therefore, we also evaluate the success rate of generating adversarial examples for questions with the same meaning but different expressions. We select 50 questions that the Q&A robot can answer correctly from the WebQuestionsSP dataset. Then, we rephrase these questions by restructuring them and replacing words in the question with synonyms.
The meaning of the restated question is consistent with the original question. The Q&A robot cannot answer all of the restated questions correctly, and it is meaningless to generate adversarial examples from questions that the Q&A robot cannot answer. Therefore, we discard the restated questions that the Q&A robot cannot answer correctly, together with the corresponding original questions. Finally, for DrQA, we generate 257 and 263 adversarial examples from 41 original questions and the 41 corresponding restated questions, respectively. For Google Assistant, we generate 289 and 277 adversarial examples from 44 original questions and the 44 corresponding restated questions, respectively. Table 4 shows the success rate of the adversarial questions generated from the original questions and from the restated questions on the two target Q&A robots. The results show that the success rate of the adversarial examples generated using restated questions is similar to that of the adversarial examples generated using original questions. Therefore, the restated questions can also be used to generate effective adversarial examples. Two examples of adversarial questions are shown in Table 5, which are generated using the original questions and using the restated questions, respectively. The underlined letters represent the difference between the adversarial question and the original question or the restated question. It is shown that both the adversarial questions generated from the original questions
To avoid human subjective factors affecting the human evaluation results, we invite 10 different participants to evaluate the quality of the generated adversarial examples and determine whether the subjective factors (i.e., gender, age, and mother tongue) of the participants affect the human evaluation results. Specifically, the background of these 10 participants is as follows: (1) there are 5 male participants and 5 female participants; (2) 7 participants are 18 to 35 years old, and 3 participants are 36 to 50 years old; and (3) there are 8 participants whose mother tongue is Chinese and 2 participants whose mother tongue is English. Compared with automatic evaluation on the Q&A robots by programs and scripts, human evaluation is a time-consuming process for the participants and is therefore not suitable for evaluation with a large number of questions. Hence, in this experiment, we randomly select 50 adversarial questions generated from the WebQuestionsSP dataset to perform human evaluation and calculate the comprehension rate of each participant.
Table 6 shows the minimum, maximum, and average comprehension rate of participants under different subjective factors. Participants can understand the meaning of most of the generated adversarial examples. In addition, the comprehension rates of the different types of participants are similar. In other words, participants with different backgrounds show no difference in understanding the generated adversarial questions. Therefore, the participants' gender, age, and mother tongue hardly affect their understanding of the generated adversarial questions, and humans can understand the meaning of the generated adversarial examples correctly.

Different POS Constraints of Similar Words.
In the process of generating adversarial words (Section 3.4), the proposed method uses three constraints to search for words that are spelled similarly to the keyword. In order to verify that the second constraint can improve the success rate of the proposed adversarial examples, we compare the success rate of adversarial examples under three different settings of the POS of the similar word.
Figure 5 presents the success rate of adversarial examples generated under different maximum edit distance settings. The larger the maximum edit distance is, the higher the success rate of the adversarial examples is. The reason is that when the maximum edit distance is set to be large, the difference between the adversarial question and the original question will be large. Therefore, the probability of the Q&A robot answering the question correctly will be small, and thus the success rate of the adversarial examples will be high. However, a large maximum edit distance may make it difficult for humans to understand the generated adversarial questions.
We also invite 10 participants (as mentioned in Section 4.2) to evaluate the effect of the maximum edit distance on the comprehension rate of humans. The adversarial questions are generated under the maximum edit distance ϵ = 3, 4, 5, respectively. For ϵ = 4, we have evaluated the comprehension rate of humans on the generated adversarial questions in Section 4.2. For ϵ = 3 and ϵ = 5, we randomly select 20 adversarial questions generated from the WebQuestionsSP dataset for each setting (since human evaluation is a time-consuming process for the participants, it is not suitable to evaluate using a large number of questions). Figure 6 shows the minimum, maximum, and average comprehension rate of humans on the generated adversarial questions under the maximum edit distance ϵ = 3, 4, 5, respectively. The larger the maximum edit distance is, the lower the comprehension rate of humans is. Therefore, the maximum edit distance is set to be 3 to 5 to ensure that the generated adversarial examples have a good success rate, while at the same time humans have no difficulty in understanding the generated adversarial examples.

Keywords Extraction and Keywords Modification Evaluation.
In the proposed method, keywords extraction and keywords modification are two important steps of generating adversarial examples. Therefore, we also evaluate the performance of the proposed method from these two aspects.

Keywords Extraction Evaluation.
To evaluate the performance of the proposed keywords extraction method, we implement two other keywords extraction methods for comparison. The random keywords extraction method is used as one baseline, which randomly selects one or more words from the question as the keywords. The content words extraction method is used as another baseline, which selects all content words from the question as the keywords. In the evaluation experiment, first, the random keywords extraction method, the content words extraction method, and the proposed keywords extraction method are used to extract keywords, respectively. Then, the extracted keywords are removed from the question to generate adversarial examples. Note that in this experiment, we remove the keywords directly instead of replacing them, in order to evaluate the importance of the keywords found by these three methods. These adversarial examples are generated using the WebQuestionsSP dataset. Lastly, the generated adversarial examples are applied to the target Q&A robots, and the success rates are calculated.
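The two baselines and the keyword-removal step described above can be sketched in Python as follows. This is a simplified illustration and not the code used in the experiment: content words are approximated by filtering against a small hand-written function-word list (a POS tagger would be the more faithful choice), tokenization is plain whitespace splitting, and every name is hypothetical.

```python
import random
from typing import List, Optional

# Small illustrative function-word list (an assumption, far from complete).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "is", "are", "was", "were", "do", "does", "did", "and", "or",
    "who", "what", "when", "where", "which", "how",
}


def tokenize(question: str) -> List[str]:
    """Whitespace tokenization after dropping a trailing question mark."""
    return question.strip().rstrip("?").split()


def random_keywords(tokens: List[str], n: int = 1,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Baseline 1: pick n words of the question uniformly at random."""
    if rng is None:
        rng = random.Random()
    return rng.sample(tokens, min(n, len(tokens)))


def content_keywords(tokens: List[str]) -> List[str]:
    """Baseline 2: keep every word that is not in the function-word list."""
    return [t for t in tokens if t.lower() not in FUNCTION_WORDS]


def remove_keywords(tokens: List[str], keywords: List[str]) -> str:
    """Build the adversarial question by deleting the selected keywords."""
    drop = set(keywords)
    return " ".join(t for t in tokens if t not in drop)


tokens = tokenize("Who is the president of France?")
print(remove_keywords(tokens, content_keywords(tokens)))
print(remove_keywords(tokens, random_keywords(tokens, 1, random.Random(0))))
```

Removing rather than replacing the keywords isolates the effect of keyword selection: if deleting the selected words changes the robot's answer, those words were important to the question.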
Table 7 presents the success rates of the adversarial examples generated by different keywords extraction methods. Compared with the random keywords extraction method and the content words extraction method, the proposed keywords extraction method has a higher success rate on DrQA and Google Assistant. This indicates that the proposed keywords extraction method can effectively extract the keywords that are important in the original question. If the keywords in a question change, DrQA and Google Assistant will not be able to answer the question. Therefore, the proposed keywords extraction method can effectively improve the success rate of the generated adversarial questions. The content words extraction method also has a higher success rate than the random keywords extraction method, which indicates that content words are more important than function words.

Keywords Modification Evaluation.
When evaluating the performance of the proposed keywords modification method, the random keywords modification method and the noisy texts method [38] are used as baselines. First, the proposed keywords extraction algorithm is used to extract keywords from the question. Then, the random keywords modification method, the noisy texts method, and the proposed keywords modification method are used to modify the keywords to generate three different types of adversarial examples, respectively. The random keywords modification method randomly replaces the characters in the keywords. The noisy texts method generates adversarial examples by modifying a word in the following five ways [38]: replacing a single letter, swapping the position of two letters, randomizing the order of letters in a word except the first and last letters, randomizing the order of all letters, and replacing letters with adjacent letters on the keyboard. Similarly, these adversarial examples are generated from the WebQuestionsSP dataset. The generated adversarial examples are applied to the target Q&A robots.
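To make the five word-level perturbations of the noisy texts baseline concrete, the following Python sketch implements one plausible reading of each. It is not the implementation from [38]: the choice of positions, the restriction to one keyboard substitution per word, and the approximate QWERTY neighbor map are all assumptions, and the function names are hypothetical.

```python
import random
import string

# Approximate QWERTY neighbors (an illustrative assumption, not exhaustive).
KEYBOARD_NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg", "y": "tuh",
    "u": "yij", "i": "uok", "o": "ipl", "p": "ol", "a": "qsz", "s": "adw",
    "d": "sfe", "f": "dgr", "g": "fht", "h": "gjy", "j": "hku", "k": "jli",
    "l": "kop", "z": "xa", "x": "zcs", "c": "xvd", "v": "cbf", "b": "vng",
    "n": "bmh", "m": "nj",
}


def replace_letter(word: str, rng: random.Random) -> str:
    """Replace one randomly chosen letter with a different lowercase letter."""
    if not word:
        return word
    i = rng.randrange(len(word))
    choices = [c for c in string.ascii_lowercase if c != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def swap_letters(word: str, rng: random.Random) -> str:
    """Swap the letters at two randomly chosen distinct positions."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    letters = list(word)
    letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)


def shuffle_middle(word: str, rng: random.Random) -> str:
    """Randomize the letter order, keeping the first and last letters fixed."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def shuffle_all(word: str, rng: random.Random) -> str:
    """Randomize the order of all letters."""
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


def keyboard_typo(word: str, rng: random.Random) -> str:
    """Replace one letter with a neighboring key from the map above."""
    positions = [i for i, c in enumerate(word) if c.lower() in KEYBOARD_NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(KEYBOARD_NEIGHBORS[word[i].lower()]) + word[i + 1:]


demo_rng = random.Random(0)
for perturb in (replace_letter, swap_letters, shuffle_middle, shuffle_all, keyboard_typo):
    print(perturb.__name__, perturb("president", demo_rng))
```

Perturbations that shuffle many letters at once tend to produce a larger edit distance from the original word than single-letter changes, which is relevant when comparing the edit distances of different modification methods.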
Table 8 presents the success rate of the adversarial examples generated by different keywords modification methods. The adversarial examples generated by the proposed method have a higher success rate than the adversarial examples generated by the random keywords modification method. The success rate of the adversarial examples generated by the noisy texts method [38] is close to the success rate of the adversarial examples generated by the proposed method. Note that when targeting Google Assistant, the success rate of the adversarial examples generated by the noisy texts method [38] is slightly higher than that of the proposed method. The reason is that the average edit distance (d = 5.2) between the adversarial examples generated by the noisy texts method and the original question is larger than the average edit distance (d = 3.7) between the adversarial examples generated by the proposed method and the original question. However, a larger edit distance makes it more difficult for humans to understand the meaning of the adversarial examples. Besides, we also compare the impact of different keywords modification methods on the comprehension rate of humans. For the proposed keywords modification method, we have evaluated the comprehension rate of humans on the generated adversarial questions as discussed in Section 4.2. For the random keywords modification method and the noisy texts method [38], we randomly select 20 adversarial questions generated from the WebQuestionsSP dataset for each method and evaluate the comprehension rate of 10 participants on the generated adversarial questions.
Figure 7 shows the minimum, maximum, and average comprehension rate of participants under different keywords modification methods. The comprehension rate of humans under the proposed keywords modification method is higher than the comprehension rate of humans under the other keywords modification methods. In other words, compared with the random keywords modification method and the noisy texts method [38], it is easier for humans to understand the meaning of the adversarial questions generated by the proposed keywords modification method. Overall, the adversarial examples generated by the proposed method have a high success rate on DrQA and Google Assistant, and humans can easily understand the meaning of the generated adversarial examples.

Conclusion
In this paper, we propose a novel adversarial examples generation method for Q&A robots, which can be used as a fast and automatic test dataset generation method for the robustness and security evaluation of intelligent Q&A robots in black-box scenarios. The proposed method generates adversarial questions by slightly modifying the important part of a question, which is close to what happens in the practical use of Q&A robots, e.g., typos, spelling mistakes, and similar words. These generated adversarial questions can successfully make the Q&A robot answer incorrectly, while the difference between the generated adversarial questions and the original question is so small that it does not affect human understanding of the question. In the experiment, two state-of-the-art Q&A robots, DrQA and Google Assistant (which are considered to be two top Q&A robots currently), are used to evaluate the success rate of the proposed method. Experimental results show that the generated adversarial examples have high success rates on DrQA and Google Assistant. The metrics R_n@k, MRR, and MAP on DrQA further indicate that the generated adversarial examples cause DrQA to return fewer correct answers in the top-k candidate answers and cause the correct answers to rank lower in the top-k candidate answers returned by DrQA. In addition, the human evaluation results demonstrate that even if the participants' gender, age, and mother tongue are different, they have no difficulty in understanding the generated adversarial examples. This is the first adversarial examples generation method for intelligent Q&A robots and also the first automatic test dataset generation method for the robustness and security evaluation of Q&A robots. This paper can hopefully help evaluate and enhance the robustness of intelligent Q&A robots.

Figure 1: Framework of the proposed adversarial questions generation method.

Security and Communication Networks

The three settings of the POS constraint are as follows: (1) the POS of the similar word is the same as the POS of the keyword; (2) the POS of the similar word is different from the POS of the keyword; and (3) there is no constraint on the POS of the similar word. We generate adversarial examples under these three different settings. Then, these generated adversarial examples are applied to DrQA to calculate the success rate. The comparison results of the three different settings are shown in Figure 4. The success rate of the adversarial examples generated in setting 1 is higher than that of the adversarial examples generated in setting 2 and setting 3. Therefore, when searching for the words that are spelled similarly to the keyword, keeping the POS of the similar word the same as the POS of the keyword can effectively improve the success rate of the adversarial examples.

4.3.2. Different Maximum Edit Distance ϵ. In the proposed method, different maximum edit distance settings affect not only the success rate of the generated adversarial examples on Q&A robots but also human understanding of the generated adversarial examples. In this section, under different maximum edit distance settings, we evaluate the success rate of the adversarial examples on the Q&A robots and the comprehension rate of humans on the adversarial examples. The adversarial examples are generated from the three datasets under different maximum edit distances, and DrQA is used to evaluate the success rate of these adversarial examples.

Figure 4: Success rate of adversarial examples generated under three different POS constraints on the DrQA.

Figure 6: Minimum, maximum, and average comprehension rate of humans on the generated adversarial questions from the WebQuestionsSP dataset under the maximum edit distance ϵ = 3, 4, 5, respectively.

Table 1: Examples of generated adversarial questions and the returned answers of Q&A robots.

In this section, we use the metric comprehension rate to evaluate different humans' understanding of the generated adversarial examples. Besides, the effect of different maximum edit distance settings on the comprehension rate of humans is presented in Section 4.3. The effect of different keywords modification methods on the comprehension rate of humans is presented in Section 4.4.

Table 2: Success rate of the adversarial examples generated from the three datasets on the two target Q&A robots (ϵ = 4).

Table 3: Performance of DrQA answering original questions and adversarial questions in terms of R_10@2, R_10@5, MRR, and MAP.

Table 4: Success rates of the adversarial questions generated from the original questions and from the restated questions on the two target Q&A robots.

Table 5: Examples of adversarial questions generated from the original questions and from the restated questions.

Table 6: Minimum, maximum, and average comprehension rate of participants under three different subjective factors (gender, age, and mother tongue).