RLAS-BIABC: A Reinforcement Learning-Based Answer Selection Using the BERT Model Boosted by an Improved ABC Algorithm

Answer selection (AS) is a critical subtask of the open-domain question answering (QA) problem. This paper proposes a method for AS called RLAS-BIABC, which is built on an attention mechanism-based long short-term memory (LSTM) network and bidirectional encoder representations from transformers (BERT) word embeddings, and is enriched by an improved artificial bee colony (ABC) algorithm for pretraining and a reinforcement learning-based algorithm for training with the backpropagation (BP) algorithm. BERT can be incorporated into downstream tasks and fine-tuned as a unified task-specific architecture, and the pretrained BERT model captures a range of linguistic features. Existing algorithms typically train the AS model as a two-class classifier on positive and negative pairs. A positive pair contains a question and a genuine answer, while a negative pair contains a question and a fake answer. The output should be one for positive pairs and zero for negative pairs. Negative pairs typically outnumber positive ones, leading to an imbalanced classification problem that drastically reduces system performance. To deal with this, we define classification as a sequential decision-making process in which the agent takes a sample at each step and classifies it. For each classification operation, the agent receives a reward, and the reward for the majority class is smaller than the reward for the minority class. Ultimately, the agent finds the optimal value for the policy weights. We initialize the policy weights with the improved ABC algorithm, which helps prevent problems such as getting stuck in a local optimum. Although ABC serves well in most tasks, it has a weakness: it disregards the fitness of related pairs of individuals when discovering a neighboring food source position. Therefore, this paper also proposes a mutual learning technique that modifies the produced candidate food source using the higher fitness between two individuals, selected by a mutual learning factor. We tested our model on three datasets, LegalQA, TrecQA, and WikiQA, and the results show that RLAS-BIABC can be recognized as a state-of-the-art method.


Introduction
Today, the number of questions posted in online communities such as Stack Overflow and GitHub is growing daily. QA is one of the vital branches of natural language processing (NLP) and aims to answer questions automatically. QA can be approached in two ways. Some methods focus on generating answers and usually employ generative networks such as the generative adversarial network (GAN) to create answers [1]. Nonetheless, they cannot guarantee correct meaning and grammar.
Another category of methods uses AS, one of the essential subtasks of QA, which is also applied in other fields such as machine comprehension [2]. Over the last few years, the problem has gained an increasing amount of attention [3,4]. Given a question $q$ and a set of candidate answers $A = \{a_1, a_2, a_3, \ldots, a_N\}$, the goal is to find $a_i \in A$ as the best answer to question $q$. Questions and answers can have various lengths, and more than one candidate may be a correct answer to a question.
The literature offers numerous methods for AS based on traditional and deep learning approaches [5]. The traditional approaches rely on search engines [6], information retrieval [7,8], handcrafted rules [9], or machine learning models [10,11]. Information retrieval-based models work on keywords without using any semantic information, which makes it challenging to obtain the correct answers [12]. Handcrafted rule-based techniques cannot cover all patterns, so their performance is limited [13,14]. In machine learning-based methods, features are built manually, so their quality depends heavily on feature extraction [15,16]. Some criteria and classifiers, including edit distance and the support vector machine, consider the matching relations between question-answer pairs [11]. Traditional methods typically suffer from two major weaknesses. First, they mostly do not use semantic information in keywords, features, or rules, so they do not consider all aspects of the relationship between QA pairs. Second, feature extraction and handmade rules are not flexible, leading to poor generalization capability.
After the appearance of deep learning, it came to dominate many problems in many domains [17][18][19][20][21][22][23], including AS. Deep learning-based methods for AS usually employ a convolutional neural network (CNN) [24] or an LSTM to capture semantic features at various levels. The main task is to estimate the semantic similarity of a question-answer pair, which can be regarded as a text similarity calculation or a classification task. A CNN is employed to model the hierarchical structures of sentences and evaluate their degree of matching [25], while an LSTM generates the embeddings of questions and answers and preserves sequential dependency information. Deep models, however, achieve only limited improvement because they face some difficulties. They build the embedding representation of the question-answer pair with a single neural network design. This results in attention to one-sided features while the other complex semantic features of question-answer pairs are ignored. Later, models that try to comprehend language were developed [26]. These models learn the syntactic and semantic rules of a language through different objectives, including next-word prediction, next-sentence prediction, and masked-word prediction [27]. They model a language and can generate new text that follows correct syntactic and semantic rules. The BERT model [27] is one of the latest language models and is reported to outperform the language models developed before it. It builds on the transformer architecture [28], which is now widely employed in NLP tasks [29].
The success of deep models mainly relies on the architecture, the training algorithm, and the features employed in training. All these make the design of deep networks a complex optimization problem [30]. In many methods, the topology and transfer functions are fixed, and the space of possible networks is spanned by all potential values of the weights and biases [31]. In [32,33] and [34], ant colony optimization [35], tabu search [36], simulated annealing [37], and the genetic algorithm [38] were used to train neural networks with a fixed topology. The optimization process of neural network learning searches for the weight configuration with the lowest output error. Nevertheless, finding the optimal weights for deep models depends largely on weight initialization, which has a more significant impact on neural network performance than the network architecture and training examples [39]. AS methods, including deep ones, use gradient-based algorithms such as BP and Levenberg-Marquardt (LM) [40] for weight optimization. BP algorithms rely on first-order derivatives, whereas LM algorithms use approximate second-order derivative information [41]. The main problem of BP and LM is their sensitivity to the initial weights, which can leave them stuck in a local optimum [42]. To deal with this problem, global search approaches that can escape local minima, such as population-based metaheuristic (PBMH) algorithms [43][44][45], are employed to pretrain the weights. Among PBMH algorithms, ABC is one of the most powerful for optimization problems and has two advantages over traditional algorithms: it does not need gradients, and it is less prone to getting caught in local optima [46]. The algorithm is based on the intelligent behavior of bees and contains two general concepts: food sources and artificial bees. Artificial bees look for food sources with high nectar. The position of a food source represents a solution to the optimization problem, and the amount of nectar corresponds to the quality of that solution. Although the food source position is a critical factor in whether a bee selects a food source, some necessary information is still missing when bees produce a neighboring food source.
Another main problem in AS is class imbalance: the positive class, which pairs a question with its corresponding answer, has far fewer members than the negative class, which pairs a question with a non-corresponding answer, and this reduces the performance of existing methods. Methods proposed for the imbalance problem are generally divided into two groups: data-level methods and algorithm-level methods. In data-level methods, the training data is manipulated to balance the class distribution by oversampling the minority class, undersampling the majority class, or both. SMOTE [47] is an oversampling method that generates new examples by linear interpolation between adjacent minority samples. NearMiss [48] is an undersampling method that deals with imbalance by removing samples from the larger class. This algorithm eliminates data of the larger class when two data points belonging to different classes are close in terms of distribution. Oversampling algorithms can increase the risk of overfitting, and undersampling algorithms lose valuable information from the majority class. In algorithm-level methods, the importance of the minority class is raised with techniques such as cost-sensitive learning, ensemble learning, and decision threshold adjustment. In cost-sensitive learning, different costs are assigned in the loss function to the misclassification of each class, with a higher cost for the minority class. Ensemble learning-based solutions train multiple sub-classifiers and adopt voting to get better results. Threshold adjustment techniques train the classifier on the imbalanced dataset and change the decision threshold at test time. Deep learning-based methods have also been suggested for classifying imbalanced data. The paper [49] introduced a loss function for deep models that captures classification errors from the majority and minority classes equally. Another study [50] learns the discriminative features of imbalanced data while maintaining inter-cluster and inter-class margins. The authors of [51] presented a method based on the bootstrapping algorithm that balances the training data of a convolutional network per mini-batch. An algorithm for optimizing network weights and class-sensitive costs is proposed in [52]. In [53], the authors extracted hard samples in the minority class and improved their algorithm by batch-wise optimization with the Class Rectification Loss function [54].
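The linear interpolation behind SMOTE-style oversampling can be sketched as follows. This is a minimal illustration, not the reference implementation of [47]: the function name `smote_like`, the neighbor count `k`, and the brute-force neighbor search are assumptions made for brevity.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    Illustrative sketch only: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbors.
    Assumes at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                       # pick a minority sample
        nb = minority[rng.choice(neighbors[i])]   # pick one of its neighbors
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (nb - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies between two existing minority samples, the method adds no information from outside the minority region, which is one reason oversampling can encourage overfitting.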
In the last few years, deep reinforcement learning has been successfully used in computer games, robot control, recommendation systems [55][56][57], and other areas. For classification problems, deep reinforcement learning has helped remove noisy data and learn better features, which significantly improved classification performance. Nonetheless, little research has been done on applying deep reinforcement learning to imbalanced classification. Deep reinforcement learning is well suited to imbalanced classification because its learning mechanism and a specific reward function make it easy to pay more attention to the minority class by giving higher rewards or penalties.
This paper presents an attention mechanism-based LSTM model for AS, called RLAS-BIABC, built on the BERT word embedding, reinforcement learning, and an improved ABC algorithm. The main body of the RLAS-BIABC model consists of two attention mechanism-based bidirectional LSTM (BLSTM) networks and a feedforward network that calculates the similarity of the question-answer pair. The model aims to learn from both positive and negative pairs. A positive pair consists of a question and its real answer, while a negative pair combines the question with one of the other answers. We use BERT as the word embedding to learn the semantic similarity between sentences without pre-engineered features. Moreover, we introduce an improved ABC algorithm for RLAS-BIABC, whose task is to find the initial weights of all LSTMs, the attention mechanisms, and the feedforward network from which the BP algorithm starts. To this end, we modify the ABC algorithm by applying mutual learning between two selected position parameters to choose the candidate food source with higher fitness. In addition, in the BP step, our method employs reinforcement learning to handle imbalanced classification. We define the AS problem as a guessing game cast as a sequential decision-making process. At each step, the agent takes an environmental state represented by a training instance and then performs a two-class classification operation under the guidance of a policy. If the classifier performs the operation well, it receives a positive reward; otherwise, it receives a negative reward. The minority class is rewarded more than the majority class. The agent's goal is to collect as much cumulative reward as possible during the sequential decision-making process, that is, to classify the samples as accurately as possible. We assess the RLAS-BIABC model on three standard datasets, TrecQA, LegalQA, and WikiQA, and show RLAS-BIABC to be superior to other methods that use random weight initialization.
The main contributions of the article are as follows: (1) We use the BERT word embedding, one of the most recently developed language models. (2) Instead of initializing the model weights randomly, we define an encoding strategy and compute initial values using an improved ABC algorithm. (3) We treat the AS problem as a sequential decision-making process and propose a deep reinforcement learning framework for imbalanced classification. (4) We study the performance of the proposed model through experiments and compare it with other methods that use random weight initialization and face the imbalanced classification problem.
The rest of this article is organized as follows: Section 2 presents a short review of related work. Section 3 introduces the ABC algorithm. Section 4 describes the framework of the proposed model. Section 5 presents the evaluation metrics, datasets, and results. Section 6 provides a conclusion and future work.

Related Work
Many approaches to the QA problem have been proposed so far. This section provides an overview of the methods based on machine learning and deep learning. The first approaches were based on feature engineering. In these methods, the relationship between question and answer is measured by the overlap of common words, for which bag-of-words and bag-of-grams [58] are commonly applied. These methods are limited because they ignore the semantic and linguistic features of sentences. Subsequent studies used language resources such as WordNet [59] to address the semantic problem but failed to remove the linguistic limitations. Some researchers considered the syntactic and semantic structure of sentences [60]. Some authors employed the dependency tree and the tree edit distance algorithm [15,61]. The research in [62] confirmed that tools such as WordNet and NER [63] could play an influential role in selecting semantic features. The article [64] provided an effective solution for automated feature selection. These methods were among the first attempts to eliminate feature engineering.
Later, with the advent of deep learning, many methods used deep models as an automatic feature engineering tool. Recently, deep learning has covered a wide range of NLP applications [18]. Recurrent neural networks (RNNs) and CNNs serve as the two strong arms of deep learning in feature extraction [20,21]. Deep learning methods treat question-answer pairs in one of two ways. In the first category, the question and the answer are two distinct elements, and deep networks compute their representation vectors separately. Typically, various criteria are adopted to measure the similarity between them. The authors of [65] offered a compare-aggregate system that applies many metrics for similarity measurement.
The study [66] used the ELMo language model [26] to tackle the question answering task. The results reveal the superiority of language models.
Computational Intelligence and Neuroscience
In the second category, the question and the answer are assumed to be a single sentence. In [67], a CNN-based approach is presented to score question-answer pairs in a pointwise manner. Another technique in [68] applies the BLSTM network to question answering. First, the embeddings of the question and answer words are learned and fed into a BLSTM network; then the embedding of each sentence is estimated as the average of its word embeddings. Lastly, the concatenated question-answer representation is fed to a feedforward network. The Siamese network [69] is an important branch of deep learning that has been applied in many fields, especially QA. The network provides two separate representation vectors for the question and the answer. The first deep learning work on the AS task is presented in [70]. In this study, the most relevant answer to the question is extracted using a CNN and logistic regression. The research in [71] implemented the idea presented in [70]. The authors built different models by varying the hidden layers, convolution operations, and activation functions to improve the results. Another work [72] mixes various models to produce representation vectors for every sentence. In [73], the authors convert each pointwise model into a pairwise model. Their idea was that pairwise models could further enhance model performance.
The pairwise model was also applied to the model in [72]. The study [74] proposes a preprocessing operation in which named entities are replaced with a unique token, which facilitates the selection of candidate answers. The effectiveness of this technique was confirmed by applying it to the model presented in [73]. Meanwhile, the authors of [75] argued that not all named entities can be replaced with one token, so they used a token for each named entity. It was later found that the attention mechanism could produce more valuable models. Unlike the Siamese-based technique, the attention mechanism uses context-sensitive interactions [76] between question and answer. The attention mechanism was first proposed for machine translation but was later employed in other applications such as question answering [77,78]. The approach in [79] combined the attention mechanism and RNNs for the answer selection task, based on the attention mechanism proposed in [80]. In [81], the authors employed a method based on inter-weighted alignment networks to determine the similarity of a question-answer pair. The article [82] suggested a scheme based on a bidirectional alignment mechanism and stacked RNNs. In early works, the attention mechanism was applied only to RNNs, but [83] later pointed out that combining a CNN with the attention mechanism could be more efficient.

Long Short-Term Memory (LSTM).
In a nutshell, RNNs [84] are designed to model sequential inputs. In these networks, a data sequence is mapped to a series of hidden states, and the output is then generated from them using the equations in (1), where $W_h$ and $U_h$ are weight matrices and $b$ denotes the bias. $\theta$ and $\tau$ represent activation functions such as ReLU and Tanh. $x_t \in \mathbb{R}^d$ is the input with dimension $d$, and $h_t \in \mathbb{R}^h$ is the hidden layer with size $h$ at time $t$.
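For reference, a common way to write the recurrence with the symbols defined above is sketched below. The output weight matrix $W_y$ and the bias subscripts $b_h$ and $b_y$ are names assumed here for illustration; the text itself names only $W_h$, $U_h$, and $b$.

```latex
\begin{aligned}
h_t &= \theta\left(W_h x_t + U_h h_{t-1} + b_h\right), \\
y_t &= \tau\left(W_y h_t + b_y\right).
\end{aligned}
```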
RNNs have proven successful in many areas of NLP, such as text generation [85] and text summarization [86]. However, it later became clear that as the input length grows, these networks suffer from problems such as exploding and vanishing gradients [87]. The LSTM network proposed by Hochreiter and Schmidhuber [88] mitigates these problems because its memory units can effectively handle long dependencies. In particular, an LSTM consists of several control gates and one memory unit. Let $x_t$, $h_t$, and $c_t$ represent the input, hidden state, and memory cell at time $t$, respectively. Given a sequence of inputs $(x_1, x_2, \ldots, x_T)$, the LSTM computes a sequence of hidden states $(h_1, h_2, \ldots, h_T)$ and memory cells $(c_1, c_2, \ldots, c_T)$ as output. Formally, the process is defined as in [89], where $W$ and $b$ are network parameters, $i$, $f$, and $o$ denote the input gate, forget gate, and output gate, respectively, and $\sigma$ stands for the sigmoid function.
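The widely used form of the LSTM update, written with the gate names above, is sketched below. The candidate cell $\tilde{c}_t$, the per-gate subscripts on $W$ and $b$, and the concatenation $[h_{t-1}; x_t]$ are notational assumptions; the exact variant used in [89] may differ in detail.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}; x_t] + b_i\right), \\
f_t &= \sigma\left(W_f [h_{t-1}; x_t] + b_f\right), \\
o_t &= \sigma\left(W_o [h_{t-1}; x_t] + b_o\right), \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}; x_t] + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```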
Although many problems can be solved with LSTM networks [18,19,90], experiments show that BLSTM can be more effective than LSTM. A BLSTM network [91] is an extended LSTM network that processes the input from start to end and vice versa. This process generates two hidden vectors, $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, for a specific input at time $t$. Thus, the concatenated vectors, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, form the final hidden vector. In a plain LSTM network, the information extracted by every unit is treated as equally important for the final decision, which reduces system performance. To illustrate the point, consider the sentence "Despite being from Uttar Pradesh, as she was brought up in Bengal, she is convenient in Bengali." In this sentence, words like "Bengali" and "Bengal" should be given more attention, which is not the case in an original LSTM network. The attention mechanism addresses this problem. In an attention mechanism, each hidden state contributes to the final vector with a coefficient in the interval [0, 1] that reflects its importance. Formally, for an input of length $T$, the final vector is the sum of the hidden vectors $h_t$ weighted by their coefficients $\alpha_t$, that is, $h = \sum_{t=1}^{T} \alpha_t h_t$.
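The weighted pooling described above can be sketched in a few lines. How the scores behind $\alpha_t$ are produced is an assumption here: the hypothetical scoring vector `w` and the softmax normalization stand in for the model's own attention parameters; only the weighted sum follows directly from the text.

```python
import numpy as np

def attention_pool(H, w):
    """Return (alpha, pooled) for hidden states H of shape (T, h).

    w is a hypothetical scoring vector of shape (h,). The softmax keeps
    every coefficient in [0, 1] and makes the coefficients sum to one.
    """
    scores = H @ w
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention coefficients
    pooled = alpha @ H                              # sum_t alpha_t * h_t
    return alpha, pooled
```

With all scores equal, the pooling reduces to a plain average of the hidden states, which is the equal-importance behavior the attention mechanism is meant to improve on.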

Artificial Bee Colony (ABC) Algorithm.
The ABC algorithm is a technique inspired by the intelligent behavior of bees in nature. Two general concepts form the main body of the ABC algorithm: food sources and artificial bees. Artificial bees look for food sources with high nectar. The position of a food source represents a solution to the optimization problem, and the amount of nectar corresponds to the quality of that solution. ABC involves three groups of bees: employed, onlooker, and scout. Employed bees search for food sources with higher nectar in the vicinity of the food sources around them and share their information with onlooker bees in the dance area. The numbers of employed and onlooker bees are the same, each equal to half of the colony. Each employed bee is assigned to one food source, so the number of employed bees equals the number of food sources. Like employed bees, onlooker bees search for the best food sources in their neighborhood. Employed bees whose food sources do not improve after a few steps are converted to scouts, and a new search begins. The optimization process of ABC is summarized as follows:
Initialization Stage. Each food source $x_i$, which is a bee location in the search space, is initialized at random within per-dimension bounds. Here $i$ indexes the solution and takes an integer value in the interval $[1, BN]$, where $BN$ is the total number of solutions. Each solution consists of $D$ elements, where $D$ is the number of weights to be optimized, and $x^j_{\min}$ and $x^j_{\max}$ are the lowest and highest values of the $j$-th element.
Employed Bee Stage. After initialization, the employed bees identify new sources in the neighborhood of the existing food sources and evaluate their quality. If a new source is better, the bee erases the previous source from memory and replaces it with the new one. Otherwise, the earlier source remains unchanged.
Formally, this step is described by (6), where $k$ is an integer in the interval $[1, BN]$, different from $i$, that indexes a neighboring solution.
Onlooker Bee Stage. Onlooker bees choose a solution with the probability given by (7), where $\mathrm{fit}(x_i)$ is the fitness value of the $i$-th solution. According to (7), the higher $\mathrm{fit}(x_i)$ is, the more likely the onlooker bee is to accept this solution. If a solution is selected, the onlooker bee goes to it, and a new solution is generated according to (6).
Scout Bee Stage. In the last step, scout bees are employed to escape local optima. More specifically, the bee of any solution that fails to improve after some cycles becomes a scout bee, and its food source is abandoned. A new food source then replaces the old one according to (6).
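In the standard ABC algorithm, the initialization, the neighbor search, and the selection probability usually take the forms below. They are given as the conventional textbook forms that match the symbols used in this section ($x^j_{\min}$, $x^j_{\max}$, $x^j_k$, $\mathrm{fit}$); the range $[-1, 1]$ for $\phi^j_i$ is the conventional choice and is an assumption here, and the equations are left unnumbered on purpose.

```latex
\begin{aligned}
x^j_i &= x^j_{\min} + \mathrm{rand}(0, 1)\,\bigl(x^j_{\max} - x^j_{\min}\bigr), \\
v^j_i &= x^j_i + \phi^j_i\,\bigl(x^j_i - x^j_k\bigr), \qquad \phi^j_i \in [-1, 1],\; k \neq i, \\
p_i &= \frac{\mathrm{fit}(x_i)}{\sum_{n=1}^{BN} \mathrm{fit}(x_n)}.
\end{aligned}
```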
The four stages mentioned above are repeated until the termination criterion is met. The complete ABC algorithm is given in Algorithm 1.
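The stages can be put together in a compact sketch. This is an illustrative implementation of the standard ABC loop on a generic cost function, not the configuration used in the paper: the function name `abc_minimize`, the parameters `bn`, `limit`, and `max_iter`, the cost-to-fitness mapping, and the random re-initialization of abandoned sources are all assumptions.

```python
import numpy as np

def abc_minimize(f, lower, upper, bn=20, limit=20, max_iter=200, seed=0):
    """Minimal ABC sketch: minimize cost f over the box [lower, upper]."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size

    def fitness(cost):
        # Common ABC convention (assumed): map a cost to a positive fitness.
        return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)

    # Initialization stage: random food sources inside the bounds.
    X = lower + rng.random((bn, dim)) * (upper - lower)
    cost = np.array([f(x) for x in X])
    trials = np.zeros(bn, dtype=int)
    best = int(cost.argmin())
    best_x, best_cost = X[best].copy(), float(cost[best])

    def try_neighbor(i):
        # Perturb one random dimension relative to a random neighbor k != i.
        k = rng.choice([n for n in range(bn) if n != i])
        j = rng.integers(dim)
        v = X[i].copy()
        v[j] = X[i, j] + rng.uniform(-1.0, 1.0) * (X[i, j] - X[k, j])
        v = np.clip(v, lower, upper)
        c = f(v)
        if c < cost[i]:                  # greedy selection
            X[i], cost[i], trials[i] = v, c, 0
        else:
            trials[i] += 1

    for _ in range(max_iter):
        for i in range(bn):              # employed bee stage
            try_neighbor(i)
        fit = np.array([fitness(c) for c in cost])
        p = fit / fit.sum()              # selection probabilities
        for i in range(bn):              # onlooker bee stage
            if rng.random() < p[i]:
                try_neighbor(i)
        for i in range(bn):              # scout bee stage
            if trials[i] > limit:
                X[i] = lower + rng.random(dim) * (upper - lower)
                cost[i], trials[i] = f(X[i]), 0
        i = int(cost.argmin())           # remember the best solution so far
        if cost[i] < best_cost:
            best_x, best_cost = X[i].copy(), float(cost[i])
    return best_x, best_cost
```

As a usage example, `abc_minimize(lambda x: float(np.sum(x ** 2)), [-5, -5], [5, 5])` searches for the minimum of a two-dimensional sphere function without using any gradient information.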

The Framework of RLAS-BIABC
The proposed algorithm addresses two critical aspects of classification. In the first step, we form a vector that includes all the learnable weights of our model and optimize it using the ABC algorithm. Then, we apply the BP algorithm for the rest of the training. Another problem that most classifiers suffer from, including ours, is imbalanced data. To take this aspect into account, we employ ideas from reinforcement learning. We present these two ideas in two separate sections. The general architecture of the proposed model is shown in Figure 1. Consider a question $Q = (q_1, q_2, \ldots, q_n)$ containing a sequence of $n$ words and an answer $A = (a_1, a_2, \ldots, a_m)$ containing $m$ words. Let $a_i, q_j \in \mathbb{R}^D$ denote the $D$-dimensional vector representations of the words. Two LSTMs are provided, one for the question and one for the answer. Positive and negative pairs are used to train the model. In a positive pair $(Q, A)$, $A$ is the correct answer to question $Q$, and the output of the model should approach one. In a negative pair $(Q, A')$, where $A'$ is a fake answer to the question, the output of the network should approach zero. The LSTMs compute embeddings for the question and the answer, which are then weighted by the attention mechanism, where $W_u$, $W_v$, $b_u$, and $b_v$ represent the parameters of the attention mechanism. After the attention mechanism produces the final representations $q$ and $a$ of the question and answer, we concatenate $q$, $a$, and $|q - a|$ into one vector, as shown in Figure 1, and feed it into a feedforward network. It has been experimentally confirmed that the difference between two representation vectors contributes to a successful decision [92].
    Initialize the population X
    Itr = 0
    while the termination criterion is not met do
        //Employed Bee Phase
        for i = 1 to BN do
            Produce new solution x_new using (6)
            Calculate the fitness f_new for x_new
            Replace x_i with x_new if x_new is better
        end for
        Calculate the probability p for every solution in X using (7)
        //Onlooker Bee Phase
        for i = 1 to BN do
            if rand(0, 1) < p_i then
                Produce new solution x_new using (6)
                Calculate the fitness f_new for x_new
                Replace x_i with x_new if x_new is better
            end if
        end for
        //Scout Bee Phase
        If an abandoned solution is found, replace it with the solution produced by (6)
        Put the best solution ever in x_best
        Itr = Itr + 1
    end while
    return x_best
ALGORITHM 1: Pseudocode of the ABC algorithm.
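The concatenation of $q$, $a$, and $|q - a|$ that feeds the feedforward network in this section can be sketched as follows. The layer sizes, the tanh hidden activation, the sigmoid output, and the parameter names `W1`, `b1`, `w2`, and `b2` are assumptions chosen for illustration, not the paper's configuration.

```python
import numpy as np

def pair_features(q, a):
    """Concatenate q, a, and the element-wise absolute difference |q - a|."""
    return np.concatenate([q, a, np.abs(q - a)])

def score(q, a, W1, b1, w2, b2):
    """Score a question-answer pair with a small two-layer network.

    Hypothetical layout: one tanh hidden layer followed by a sigmoid output,
    so the score lies in (0, 1) and can be read as the relevance of the pair.
    """
    hidden = np.tanh(W1 @ pair_features(q, a) + b1)
    logit = w2 @ hidden + b2
    return 1.0 / (1.0 + np.exp(-logit))
```

The difference term makes the distance between the two representations directly visible to the network instead of leaving it to be inferred from $q$ and $a$ separately.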

BERT-Based Word Embedding.
Word embedding maps words to semantic vectors for use in deep learning algorithms. It is a reliable way to extract meaningful representations of words based on their context. Much research has been conducted to find the most meaningful word representations with neural network models such as Skip-gram [93], GloVe [94], and FastText [95]. Over the last few years, the pretrained language model (PLM), a black box with prior knowledge of natural language that is fine-tuned for NLP tasks, has been widely applied. PLMs generally use unlabeled data to learn model parameters [96]. This paper uses the BERT model [27], one of the latest PLM techniques. BERT is a bidirectional language model trained on large corpora such as Wikipedia to generate contextual representations. It is commonly fine-tuned with a dense neural network layer for different classification tasks. Fine-tuning combines the contextual, problem-specific meaning with the pretrained generic meaning and trains the model for a classification task. Figure 2 shows the architecture of a BERT model. BERT uses a bidirectional transformer whose representations are jointly conditioned on both the left and right context in all layers [97]. This differentiates it from models such as Word2Vec and GloVe, which assign each word a single context-independent embedding and therefore ignore contextual differences.

Pretraining Stage.
Weight initialization is an essential point in designing a neural network; neglecting it can mislead the model. The proposed structure has two LSTM networks, two attention mechanisms, and one feedforward neural network, each with its own weights that must be trained. This paper uses an improved ABC algorithm to pretrain the weights.

Mutual Learning-Based ABC.
In the standard ABC algorithm, artificial bees randomly select a food source position and change it to create a new position. If the fitness value of the new solution is better, it replaces the current solution. Otherwise, no change is applied. In other words, in a $D$-dimensional optimization problem, one dimension is randomly selected, its value is changed, and the better outcome is kept in each iteration. Based on (6), the newly generated solution $v^j_i$ depends on only two parameters, $x^j_i$ and $x^j_k$, which makes the food source $v^j_i$ uncontrollable: it is sometimes larger and sometimes smaller than the current food source. In the ABC algorithm, a food source with a higher fitness value is required. To favor food sources with a higher fitness value, we use the fitness information acquired by mutual learning between the current and neighboring food sources.
In the resulting update rule, $\mathrm{Fit}_i$ and $\mathrm{Fit}_k$ denote the fitness values of the current food source and the neighboring food source, respectively, and $\phi^j_i$ is a uniform random number in the interval $[0, F]$, where $F$ is a nonnegative constant called the mutual learning factor. The value $v^j_i$ thus depends on both the positions and the fitness values of the two food sources. By comparing the current and neighboring food sources, new solutions are drawn toward the better source. That is, if the current food source has higher fitness, the candidate solution moves toward it; otherwise, it tends to move toward the neighboring source. This learning strategy produces better candidate solutions. The parameter $F$ plays an essential role in balancing the perturbation between related food positions. $F$ must be nonnegative to ensure that the candidate moves toward the better solution. As $F$ increases from zero to a particular value, the perturbation on the corresponding position decreases, meaning that the fitness value of the new food source is close to the higher fitness. An overly large value of $F$ weakens both exploitation and exploration.
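One way to read the mutual learning rule described above is sketched below. The exact update is an assumption reconstructed from the prose: the candidate starts from the less fit of the two sources in one randomly chosen dimension and moves toward the fitter one by a random fraction $\phi \in [0, F]$. The function name and signature are hypothetical.

```python
import numpy as np

def mutual_learning_candidate(x_i, x_k, fit_i, fit_k, F=1.0, rng=None):
    """Produce a candidate food source from x_i and its neighbor x_k.

    Assumed rule (higher fitness is better):
      fit_i >= fit_k: v[j] = x_k[j] + phi * (x_i[j] - x_k[j])
      fit_i <  fit_k: v[j] = x_i[j] + phi * (x_k[j] - x_i[j])
    with phi drawn uniformly from [0, F] and j a random dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    j = int(rng.integers(x_i.size))
    phi = rng.uniform(0.0, F)
    v = x_i.copy()
    if fit_i >= fit_k:
        v[j] = x_k[j] + phi * (x_i[j] - x_k[j])   # move toward the current source
    else:
        v[j] = x_i[j] + phi * (x_k[j] - x_i[j])   # move toward the neighbor
    return v, j
```

Under this reading, values of `phi` near one place the candidate close to the fitter source, while values well above one overshoot it, which is consistent with the remark that a large $F$ hurts the search.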

Encoding Strategy.
Encoding means that the weights are arranged in a vector, which serves as a bee's position in ABC. Choosing the right layout is challenging; we arrived at our encoding strategy after several experiments. Figure 3 shows an example of the encoding for two LSTMs, two attention mechanisms, and a two-layer feedforward network. Note that all weight matrices are stored row by row.
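The encoding idea can be sketched as a pair of helper functions that flatten a list of weight arrays into one position vector and restore them afterwards. The names `encode` and `decode` and the fixed ordering of the arrays are assumptions; the row-by-row layout follows the note above.

```python
import numpy as np

def encode(weights):
    """Flatten a list of weight arrays into one vector, row by row."""
    return np.concatenate([w.ravel(order="C") for w in weights])

def decode(vector, shapes):
    """Rebuild the weight arrays from a position vector and their shapes."""
    arrays, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(vector[start:start + size].reshape(shape))
        start += size
    return arrays
```

With such helpers, the optimizer only ever sees a flat vector of length $D$, and the model weights are restored from the best position before BP training starts.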

Fitness Function.
The fitness function measures the quality of a solution. This paper employs a fitness function defined over the training set, where $T$ is the total number of training samples and $y_i$ and $\hat{y}_i$ are the target and predicted labels for the $i$-th sample, respectively.
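As an illustration of how such a fitness value can be computed from targets and predictions, the sketch below uses the inverse of one plus the summed squared error. This particular form is an assumption chosen because it is common in metaheuristic weight training; it is not claimed to be the function used in the paper.

```python
import numpy as np

def fitness(y_true, y_pred):
    """Hypothetical fitness: 1 / (1 + sum of squared errors over T samples).

    Higher is better; a perfect prediction gives a fitness of exactly 1.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    return 1.0 / (1.0 + sse)
```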

Classification.
Reinforcement learning (RL) [98] is a subfield of machine learning that solves a problem by making successive decisions [99,100]. Recently, reinforcement learning has achieved excellent results in classification because it can learn valuable features or select high-quality samples from noisy data. In [101], the classification problem was defined as a sequential decision-making process that used several agents to learn the optimal policy. However, complex simulations between agents and environments increased the time complexity. Another work in [102] proposed a solution for learning relations from noisy text data. The proposed model is divided into two parts: an instance selector and a relational classifier. The instance selector is designed to extract high-quality sentences from noisy data with the help of an agent, while the relational classifier learns from the selected clean data and gives delayed reward feedback to the instance selector. The result is a better classifier and a higher-quality dataset. The authors of [103][104][105][106] used deep reinforcement learning to learn useful features of the training data, which generally improved the features available to the classifier. The work in [107] used reinforcement learning to classify time series data, for which a reward function and a Markov model were designed. So far, little research has been done on the classification of imbalanced data using reinforcement learning, especially in natural language processing. In [108], an ensemble pruning method was developed that picks the best sub-classifiers using reinforcement learning. This method was effective only for small datasets because selecting among many sub-classifiers is practically infeasible.
is section describes how to apply reinforcement learning to prevent imbalanced classification. Overall, the agent receives a sample at each step and classifies it. After that, the environment gives immediate and next rewards to the agent. A positive reward is assigned to the agent by the environment when it categorizes the sample correctly. Otherwise, it receives a negative reward. Finally, the agent learns the optimal behavior by maximizing the aggregate rewards and then can classify the samples as accurately as possible.
Let D � (x 1 , l 1 ), (x 2 , l 2 ), . . . , (x T , l T ) be training data, where x i � (q i , a i ) is thei-th sample so that q i and a i are the i-th question and answer that enter the model, respectively. l i ∈ 0, 1 { } shows the target of the i-th example. We consider the following conditions for an agent.

Policy $\pi_\theta$.
The policy $\pi_\theta$ is a mapping function $\pi: S \to A$, where $\pi_\theta(s_t)$ denotes the action $a_t$ performed by the agent in state $s_t$. In our work, the proposed classifier with weights $\theta$ serves as the policy $\pi_\theta$.

State $s_t$.
Each sample of the training dataset is treated as a state. The agent takes the first sample $x_1$ as the initial state $s_1$ at the start of training. The state $s_t$ at each time step $t$ corresponds to $x_t$ in the training dataset. The order of the samples differs in each iteration.

Action $a_t$.
The action performed by the agent is to predict the class label, so the agent's performance is measured against the labels of the training dataset. The proposed model is a binary classifier, $a_t \in \{0, 1\}$, where one and zero denote the minority and majority classes, respectively: a relevant question-answer pair is labeled one, and an irrelevant pair is labeled zero.

Reward $r_t$.
The agent receives a positive reward if the sample is classified correctly and a negative reward otherwise. Because minority class instances are scarce and therefore more critical, the algorithm should assign a larger reward magnitude to the minority class. The reward function is defined as follows:

$$r(s_t, a_t, l_t) = \begin{cases} +1, & a_t = l_t \text{ and } s_t \in D_P \\ -1, & a_t \neq l_t \text{ and } s_t \in D_P \\ \lambda, & a_t = l_t \text{ and } s_t \in D_N \\ -\lambda, & a_t \neq l_t \text{ and } s_t \in D_N \end{cases}$$

where $\lambda \in [0, 1]$, and $D_P$ and $D_N$ are the sets of minority and majority class samples, respectively. $l_t$ is the label of the sample $x_t$. The reward magnitude can be viewed as the cost of predicting the label. According to this definition, when $\lambda < 1$, the cost attached to the minority class is higher. If the class distribution is balanced, $\lambda = 1$ and the prediction cost is the same for all classes. We examine different values of $\lambda$ in our experiments.
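The reward rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `reward`, the default `lam=0.5`, and the convention that label 1 (a relevant pair) marks the minority class are assumptions made for the example.

```python
def reward(action: int, label: int, lam: float = 0.5, minority_label: int = 1) -> float:
    """Reward for one classification step.

    Minority-class samples earn +1 / -1; majority-class samples earn
    +lam / -lam, with lam in [0, 1]. The minority_label default is an
    assumption for this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # The reward magnitude depends only on which class the sample belongs to.
    magnitude = 1.0 if label == minority_label else lam
    # The sign depends on whether the predicted label matches the target.
    return magnitude if action == label else -magnitude
```

With `lam=1.0` both classes carry the same cost, which matches the balanced case described in the text.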

Terminal E.
An episode is a transition trajectory from the initial state to the terminal state, $\{(s_1, a_1, l_1), (s_2, a_2, l_2), \ldots, (s_t, a_t, l_t)\}$. An episode finishes when all instances in the training data have been classified or when the agent misclassifies an instance from the minority class.

Transition Probability P.
The model transition probability, i.e., $p(s_{t+1} \mid s_t, a_t)$, is deterministic. The agent moves from state $s_t$ to state $s_{t+1}$ according to the order of instances in the dataset.
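The state, action, reward, terminal, and transition definitions can be combined into a single episode rollout. The sketch below is illustrative only: `run_episode`, the `policy` callable, and the default `lam` are hypothetical names and values, and label 1 is assumed to mark the minority class. It shuffles the samples, walks through them in order (the deterministic transition), and stops early when a minority-class sample is misclassified.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Step = Tuple[object, int, float, bool]  # (state, action, reward, done)


def run_episode(
    samples: Sequence[object],
    labels: Sequence[int],
    policy: Callable[[object], int],
    lam: float = 0.5,
    minority_label: int = 1,
    rng: Optional[random.Random] = None,
) -> List[Step]:
    """Roll out one episode over a freshly shuffled copy of the data."""
    rng = rng or random.Random()
    order = list(range(len(samples)))
    rng.shuffle(order)  # the sample order differs in every episode

    trajectory: List[Step] = []
    for idx in order:  # deterministic transition: next state = next sample
        state, label = samples[idx], labels[idx]
        action = policy(state)
        correct = action == label
        magnitude = 1.0 if label == minority_label else lam
        r = magnitude if correct else -magnitude
        # The episode terminates early on a minority-class mistake.
        done = (not correct) and label == minority_label
        trajectory.append((state, action, r, done))
        if done:
            break
    return trajectory
```

A policy that always predicts the majority class is cut off at the first minority sample it meets, so it never collects the remaining majority rewards in that episode.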
In the proposed model, the policy $\pi_\theta$ takes the input data and calculates its label probability:

$$\pi_\theta(a \mid s) = P(a_t = a \mid s_t = s). \tag{13}$$

The agent aims to classify each input sample as accurately as possible. The agent performs best when it maximizes its cumulative reward:

$$g_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}. \tag{14}$$

Equation (14) is called the return function: the total reward accumulated from time $t$ until the episode ends, discounted by the factor $\gamma \in (0, 1]$.
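As a small worked illustration of the discounted return, the sketch below accumulates a finite list of rewards from the last step backwards. The function name `discounted_return` and the sample rewards are assumptions for the example; the discount factor is the one the text places in (0, 1].

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Return r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite episode."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    g = 0.0
    # Work backwards so each step reuses the return of the step after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with rewards `[1.0, 0.5, -1.0]` and `gamma=0.5`, the last two rewards cancel after discounting (0.5 + 0.5 × (−1.0) = 0), leaving a return of 1.0.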
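The action value $Q$ in RL expresses the expected return for action $a$ in state $s$, which can be defined as follows:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ g_t \mid s_t = s, a_t = a \right]. \tag{15}$$

Equation (15) can be expanded according to the Bellman equation [109]:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right]. \tag{16}$$

Maximizing $Q$ over policies yields the optimal action value $Q^*$ and thereby maximizes the cumulative reward. The optimal policy $\pi^*$ obtained under $Q^*$, which is the policy that performs best for our model, is as follows:

$$\pi^*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} Q^*(s, a') \\ 0, & \text{otherwise.} \end{cases} \tag{17}$$

By combining (16) and (17), $Q^*$ is computed as follows:

$$Q^*(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right]. \tag{18}$$

For low-dimensional problems, the values of $Q$ are stored in a table, and the optimal value is read from the recorded entries. However, a table is no longer feasible when the state space is continuous. To solve this problem, a deep Q-learning algorithm is adopted, which models $Q$ with a deep neural network. To that end, the tuple $(s, a, r, s')$ obtained from (18) is stored in replay memory $M$. The agent randomly samples a mini-batch $B$ of transitions from $M$ and performs gradient descent on the deep Q-network according to the following loss function:

$$L(\theta) = \sum_{(s, a, r, s') \in B} \left( y - Q(s, a; \theta) \right)^2, \tag{19}$$

where $y$ is the target estimate of $Q$, which is formulated as follows:

$$y = \begin{cases} r, & \text{end} = \text{True} \\ r + \gamma \max_{a'} Q(s', a'; \theta), & \text{end} = \text{False} \end{cases} \tag{20}$$

where $s'$ is the state that follows $s$, and $a'$ is the action executed in state $s'$.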
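The target and loss used in deep Q-learning can be illustrated without a neural network. In the hedged sketch below, `q_fn` is a hypothetical stand-in for the Q-network that returns one value per action; `batch_loss` sums the squared difference between the target and the current estimate over a mini-batch of `(state, action, reward, next_state, done)` transitions. The names and the default `gamma` are assumptions for the example.

```python
from typing import Callable, Iterable, List, Tuple

Transition = Tuple[object, int, float, object, bool]


def batch_loss(
    batch: Iterable[Transition],
    q_fn: Callable[[object], List[float]],
    gamma: float = 0.9,
) -> float:
    """Sum of squared temporal-difference errors over a mini-batch."""
    total = 0.0
    for state, action, reward, next_state, done in batch:
        if done:
            # Terminal transition: the target is the immediate reward only.
            y = reward
        else:
            # Otherwise bootstrap from the best action value in the next state.
            y = reward + gamma * max(q_fn(next_state))
        total += (y - q_fn(state)[action]) ** 2
    return total
```

In a PyTorch implementation the same quantity would be computed on tensors and differentiated with respect to the network weights; the pure-Python form here only makes the arithmetic explicit.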

Overall Algorithm.
We design the simulation environment according to the definitions above. The network architecture of the policy largely depends on the complexity and number of training examples. In this context, the input of the network depends on the structure of the training samples, and the number of outputs equals the number of classes. The general training procedure of the model is presented in Algorithm 2. First, the weights of the policy $\pi$ are initialized using the improved ABC algorithm, and then the agent continues the training process until the optimal policy is reached. Actions are chosen according to the ε-greedy policy, and the selected action is evaluated by Algorithm 3. The algorithm is repeated for $E$ episodes, where $E$ is set to 15,000 in this paper. At each step, the policy network weights are stored.
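Action selection under an ε-greedy rule can be sketched as follows. This is a generic illustration under stated assumptions: the function name `epsilon_greedy`, its arguments, and the way ε is supplied are not taken from the paper, which does not report its exploration schedule in this section.

```python
import random
from typing import Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `epsilon=0.0` the rule reduces to a purely greedy policy; larger values trade exploitation for exploration.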

Datasets.
A dataset with many negative pairs is well suited to evaluating the proposed system. We run our experiments on three datasets, LegalQA, TrecQA, and WikiQA, which are widely used by researchers. All three datasets have more negative than positive pairs. The statistical information of all datasets is shown in Table 1:
(i) TrecQA [110] is derived from TREC track data. Yao et al. [10] made a complete version of the positive and negative pair set. Two training datasets, TRAIN and TRAIN-ALL, are available in this dataset. The correctness of the answers in TRAIN-ALL is checked automatically by matching pairs with regular expressions. All answers in the TRAIN, DEV, and TEST data were judged manually. We employ the TRAIN-ALL data to train our model.
(ii) LegalQA [111] is a Chinese dataset of legal consultation questions collected from a Chinese association. Users' online questions have been answered by licensed lawyers. LegalQA includes four fields: question subject, question body, answer, and label. The positive pair is provided directly online as ground truth.
(iii) WikiQA [112] is an open-domain QA dataset in which each question is linked to a Wikipedia page that is assumed to be the topic of the question. To eliminate answer sentence bias, all sentences in the summary section of the page are considered candidate answers.

Evaluation Metrics.
According to previous research, MAP and MRR are the most common criteria for evaluating answer-selection tasks [77]. MAP measures how well the model ranks all correct answers to a question, whereas MRR considers only how highly the first correct answer is ranked:
(i) MAP (mean average precision) calculates the mean of the average precision over the ranking results as follows:

$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{n_i} \sum_{j=1}^{n_i} \mathrm{Precision}(R_{ij}),$$

where $Q$ denotes the set of questions, $n_i$ is the number of correct answers to the $i$-th question, and $R_{ij}$ is the set of ranked results for question $i$ from the top result down to the $j$-th correct answer.
(ii) MRR (mean reciprocal rank) evaluates the model according to the position of the first correct answer, computed as follows:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i},$$

where $r_i$ indicates the position of the first correct answer for the $i$-th question.
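Both metrics are straightforward to compute once each question's candidate answers are sorted by model score. The sketch below assumes every question is represented as a list of 0/1 relevance labels in ranked order; the function names and the choice to score a question with no correct answer as 0 are assumptions for the example.

```python
from typing import Callable, Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Mean of precision@k taken at the rank of each correct answer."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked_labels: Sequence[int]) -> float:
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_over_questions(
    metric: Callable[[Sequence[int]], float],
    rankings: Sequence[Sequence[int]],
) -> float:
    """Average a per-question metric over all questions (gives MAP or MRR)."""
    return sum(metric(r) for r in rankings) / len(rankings)
```

For two questions ranked as `[1, 0, 1]` and `[0, 1, 0]`, the average precisions are 5/6 and 1/2, so MAP is 2/3, while the reciprocal ranks are 1 and 1/2, so MRR is 0.75.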

Baseline Methods.
We compare our RLAS-BIABC model with several state-of-the-art methods for answer selection. The details of these methods are as follows:
KABLSTM [113] is a knowledge-aware method based on attentive BLSTM networks. This method uses knowledge graphs (KG) to learn the representation of questions and answers.
EATS [75] adopted an RNN to measure the similarity between the question and answer of a pair. First, it replaces each named entity with a specific word. The system then calculates sentence representation vectors with an attention mechanism. Finally, these vectors are fed into a feedforward network, and the similarity is calculated by the sigmoid function in the last layer.
AM-BLSTM [114] used two separate LSTM networks for the question and the answer. The resulting embeddings were combined and fed into a multilayer perceptron (MLP) for classification. Moreover, traditional techniques, such as per-class penalties, were employed to counter imbalanced classification.
BERT-Base [115] introduced a method for selecting answers that combines a search engine with a transformer model. The article adopts simple models such as Jaccard similarity and compare-aggregate to rank the answers to a question.
DRCN [116] offered an architecture based on a densely connected recurrent and co-attentive network in which hidden features are maintained up to the top layer. Connection operations are performed using the attention mechanism to preserve information better. In addition, an autoencoder is adopted to reduce the volume of information.
P-CNN [117] introduced a new approach using a positional CNN for text matching that considers positional information at the word, phrase, and sentence levels.
DARCNN [118] combined BLSTM, self-attention, cross-attention, and CNN to find the global and local features of the question and candidate answer, leading to better semantic modeling. Finally, it uses an MLP to assign a score to a question-answer pair.
DASL [119] proposed a model with a Bayesian neural network (BNN) to effectively optimize the loss in the ranking learning process. The article also studies how to combine active learning and self-paced learning for model training.
IKAAS [120] applied an interactive knowledge-enhanced attention network for AS that extracts rich features of question and answer knowledge at several levels. Additionally, an attention and self-attention network is used to learn the semantic features of sentences.

Details of Implementation.
In this work, the implementation uses Python and PyTorch, and the project code was developed in Jupyter. We also use NLTK, a Python library that provides classes and methods for a wide range of natural language processing operations. We use a two-layer BLSTM. Moreover, because the vectors from the two networks are concatenated, we apply batch normalization before the data enter the feedforward neural network. Table 2 lists the values of the other parameters.

ALGORITHM 2: Pseudocode for training RLAS-BIABC.
Input: $D = \{(x_1, l_1), (x_2, l_2), \ldots, (x_T, l_T)\}$: a training dataset of size $T$
(1) Initialize the weights of policy $\pi$ using Algorithm 1
(2) Initialize environment $\mathcal{E}$
(3) Initialize replay memory $M$
(4) for episode $e = 1$ to $E$ do
(5)     Shuffle the dataset $D$
(6)     $s_1 = x_1$
(7)     for $t = 1$ to $T$ do
(8)         $a_t = \pi_\theta(s_t)$ // select an action based on ε-greedy
(9)         $[r_t, \text{end}_t] = \text{Reward}(x_t, a_t, l_t)$
(10)        $s_{t+1} = x_{t+1}$
(11)        Save $(s_t, a_t, r_t, s_{t+1}, \text{end}_t)$ to $M$
(12)        Sample randomly a mini-batch of transitions $(s_k, a_k, r_k, s_{k+1})$ from $M$
(13)        $y_k = r_k$ if $\text{end}_k = \text{True}$; $y_k = r_k + \gamma \max_{a'} Q(s_{k+1}, a'; \theta)$ if $\text{end}_k = \text{False}$
(14)        Accumulate gradients w.r.t. $\theta$: $d\theta = d\theta + \partial (y_k - Q(s_k, a_k; \theta))^2 / \partial \theta$
(15)        if $\text{end}_t = \text{True}$ then
(16)            break
(17)        end if
(18)    end for
(19) end for
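Steps (3), (11), and (12) of Algorithm 2 rely on a replay memory. The following is a minimal sketch of such a buffer, not the authors' implementation: the class name `ReplayMemory`, the fixed `capacity` with oldest-first eviction, and the optional `seed` are assumptions, since this section does not state the memory size or eviction rule.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

Transition = Tuple[object, int, float, object, bool]


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # A bounded deque silently drops the oldest transition when full.
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action: int, reward: float, next_state, done: bool) -> None:
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Sample without replacement; shrink the batch if the buffer is small.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)
```

Sampling uniformly from stored transitions, rather than training on consecutive steps, reduces the correlation between the examples in each mini-batch.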

Experimental Results.
Because metaheuristic algorithms are stochastic, we repeated all the experiments 10 times. Quantitative results for the three datasets are given in Table 3. In addition to comparing the proposed method with the state-of-the-art algorithms, we evaluate the contribution of the ABC and RL components with three variants: AS + random weight, AS-BIABC, and RLAS. AS + random weight uses only random values for the initial weights. AS-BIABC and RLAS use only the improved ABC and only RL, respectively. On the LegalQA dataset, RLAS-BIABC outperforms the other models, including IKAAS, on both MAP and MRR, reducing the error by more than 40% and 24% on these two criteria, respectively. Comparing RLAS-BIABC with AS-BIABC and RLAS shows that it decreases the error rate by about 51%, indicating the importance of both the initialization and the RL approach. On the TrecQA dataset, our algorithm obtained the highest MAP and MRR, followed by the EATS algorithm. The error reduction on this dataset is approximately 30.13% and 21.00% for the MAP and MRR criteria, respectively. On the WikiQA dataset, RLAS-BIABC decreases the classification error by more than 32% and 42% compared to IKAAS and DRCN, respectively.
Next, we show that the improved ABC is a more effective trainer than the alternatives. To do this, we fix all other parts of our algorithm for a fair comparison, including the LSTM networks, the attention mechanisms, and the reinforcement learning, and change only the trainer. We compare our proposed trainer with six conventional algorithms, including GDM [121], GDA [122], GDMA [123], OSS [124], and BR [125], and eight metaheuristic algorithms, including GWO [126], BAT [127], DA [128], SSA [129], COA [130], HMS [131], WOA [132], and ABC [133]. In all metaheuristic methods, the population size and the number of function evaluations are 100 and 3,000, respectively. The remaining parameters of the algorithms are shown in Table 4. The results of the metaheuristic and conventional algorithms are collected in Table 5. RLAS-AM-BR and RLAS-BABC performed best on all datasets among the conventional and metaheuristic algorithms, respectively. As expected, the metaheuristic algorithms perform better than the conventional ones. The improved ABC outperforms all of them: compared to the best competing algorithm, i.e., the standard version of ABC, it reduces the error by approximately 16%.
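The error-reduction percentages are not defined explicitly in this section. One common reading, assumed here rather than confirmed by the paper, treats the error as one minus the metric and reports the relative drop against a baseline. The function name and the scores in the usage note are hypothetical.

```python
def error_reduction(baseline_score: float, improved_score: float) -> float:
    """Relative reduction in error, where error = 1 - score (score in [0, 1])."""
    baseline_error = 1.0 - baseline_score
    improved_error = 1.0 - improved_score
    if baseline_error <= 0.0:
        raise ValueError("baseline already has zero error")
    # Fraction of the baseline error removed by the improved model.
    return (baseline_error - improved_error) / baseline_error
```

Under this reading, moving a hypothetical MAP from 0.80 to 0.88 shrinks the error from 0.20 to 0.12, a 40% reduction.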

5.5.1. The Effect of the Reward Value of Majority Class.
The environment guides the agent toward the goal through the reward function. This article uses two different rewards for the minority and majority classes: the minority class reward is set to +1/−1, while the majority class reward is set to +λ/−λ. The best value of λ for both the TrecQA and WikiQA datasets is 0.5. Generally, as the dataset size increases, the number of negative pairs increases, so the best λ tends to decrease. For λ = 0, the majority class is ignored entirely, and for λ = 1, both classes are equally important.

Exploration on Loss Function.
Traditional techniques, including modifying the loss function and data augmentation, can also deal with data imbalance. However, their effectiveness largely depends on the problem at hand. Among them, the loss function plays the more prominent role because it can give the minority class more weight.
To examine the effect of such loss functions on our model, we selected five functions: Weighted Cross-Entropy (WCE) [134], Balanced Cross-Entropy (BCE) [135], Focal Loss (FL) [136], Dice Loss (DL) [137], and Tversky Loss (TL) [138]. The WCE and BCE loss functions assign weights to the positive and negative samples. The FL function is suited to applications with imbalanced data. It downweights the contribution of easy examples and allows the model to focus on learning hard samples [139]. The evaluation results of these loss functions for the three datasets are shown in Table 6. The results show that all the functions have about the same MRR and MAP on the three datasets. As expected, the FL function performs better than the others, being about 51.16% better than the algorithm with the usual loss function, i.e., the RLAS-BABC model.
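To make the difference between two of these losses concrete, the sketch below gives commonly used binary forms of weighted cross-entropy and focal loss for a single prediction. The formulas used in the paper's experiments are not given in this section, so the function names, the `pos_weight` parameter, and the defaults `gamma=2.0` and `alpha=0.25` (the focusing and balancing parameters, unrelated to the RL discount factor) are assumptions drawn from common practice.

```python
import math

_EPS = 1e-12  # keeps log() away from zero


def weighted_cross_entropy(p: float, y: int, pos_weight: float = 1.0) -> float:
    """Binary cross-entropy with an extra weight on the positive class."""
    p = min(max(p, _EPS), 1.0 - _EPS)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss: cross-entropy scaled down for well-classified examples."""
    p = min(max(p, _EPS), 1.0 - _EPS)
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class balancing weight
    # (1 - p_t) ** gamma shrinks the loss as the prediction gets more confident.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` removes the focusing term, so focal loss reduces to a class-weighted cross-entropy; larger `gamma` values shift more of the total loss onto hard examples.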

Case Study.
In this section, we qualitatively evaluate the effectiveness of reinforcement learning in our model. For this purpose, we randomly select a sample from the TrecQA dataset. Given the question "When were the Nobel Prize awards first given?", the top answers are shown in Table 7. The left column presents the model results without reinforcement learning, and the right column shows the model results with reinforcement learning. The results show that the model without reinforcement learning is more inclined to assign higher scores to negative answers, whereas the model with reinforcement learning assigns the highest scores to the correct answers to the question.

Exploration on Word Embedding.
Word embedding is one of the main components of deep learning models because the input is interpreted as a vector, and an unsuitable embedding may mislead the model. This study uses the BERT model, one of the latest embedding models, as the word embedding. To evaluate other word embeddings in our model, we employ five alternatives: One-Hot encoding [140], CBOW [141], Skip-gram [93], GloVe [94], and FastText [95]. One-Hot encoding converts categorical variables into a form that can be supplied to deep learning algorithms. It creates a new binary feature for each class and assigns a value of 1 to the feature of each sample that corresponds to its original class. CBOW and Skip-gram are models that use neural networks to map a word to its embedding vector. GloVe is an unsupervised learning algorithm trained on the aggregated global word-word co-occurrence statistics of a corpus. FastText is an extension of the Skip-gram model. Instead of learning one vector per word, this method represents each word as a bag of character n-grams. The results of this experiment are shown in Table 8. As expected, One-Hot encoding has the worst performance among all word embeddings: even on the TrecQA dataset, where this word embedding performs best, the improvement rates for the MAP and MRR criteria are about 64.70% and 72.91%, respectively. CBOW and Skip-gram perform almost identically on all three datasets due to their similar architecture, and both are superior to the GloVe word embedding. FastText is the best of these alternative embeddings but still performs worse than the BERT model. The BERT model decreases errors by more than 11%, 10%, and 19% compared to the FastText model for the WikiQA, TrecQA, and LegalQA datasets, respectively.
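Two of the representations discussed above are simple enough to sketch directly. The code below shows One-Hot encoding over a small vocabulary and the character n-gram decomposition that FastText-style models build on. The function names, the toy vocabulary, and the use of `<` and `>` as word-boundary markers are assumptions for the example; the paper's actual preprocessing is not described in this section.

```python
from typing import List, Sequence


def one_hot_encode(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[List[int]]:
    """Map each token to a binary vector with a single 1 at its vocabulary index."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vec = [0] * len(vocabulary)
        if token in index:  # out-of-vocabulary tokens stay all-zero
            vec[index[token]] = 1
        vectors.append(vec)
    return vectors


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Split a word into overlapping character n-grams with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

One-Hot vectors grow with the vocabulary and carry no notion of similarity between words, which is consistent with this embedding ranking last in the comparison. Character n-grams, by contrast, let related word forms share sub-word units.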

The Effect of the Parameter F on the Model.
To examine the effect of the parameter F, defined in (10), on the performance of the proposed method, F is set to 0.5, 1, 1.5, 2, 2.5, 3.5, 4, 4.5, and 5. The results obtained with these settings for the three datasets are shown in Figure 5. As can be seen, for the LegalQA dataset, the algorithm performs progressively better as F rises from 0 to 2. However, when F increases from 2 to 5, the performance decreases. This means that a value of F that is too small or too large weakens the algorithm's performance. For the TrecQA and WikiQA datasets, the algorithm performs best with F equal to 1.5 and 2.

Conclusion and Future Works
This paper presented an approach called RLAS-BIABC for AS, built on an attention mechanism-based LSTM and the BERT word embedding, combined with an improved ABC algorithm for pretraining and reinforcement learning for training the BP algorithm. The RLAS-BIABC model aims to separate the positive and negative classes, where a positive pair consists of a question and a real answer and a negative pair consists of a question and a fake answer. Because the dataset contains many negative pairs, the task becomes an imbalanced classification problem. To overcome this problem, we formulate our model as a sequential decision-making process. In this formulation, the environment assigns a reward to each classification action at each step, with a larger reward magnitude for the minority class. An episode continues until the agent misclassifies a minority class sample or all training samples have been classified. Weight initialization is another essential aspect of deep models, since a poor initialization can leave training stuck in a local optimum. To address this concern, we initialized the policy weights with the improved ABC algorithm. The paper proposed a mutual learning technique that modifies the produced candidate food source using the higher fitness between two individuals chosen by a mutual learning factor. We designed experiments to examine the factors influencing the model. The analyses demonstrate the power of reinforcement learning, BERT, and the improved ABC algorithm for selecting answers.
In future work, while improving the proposed model, we will examine the effectiveness of the proposed classifier on other NLP applications. Another task would be to provide a model for generating the answer to a question. As a solution, we will focus on GANs, which today have many applications in almost every field, including NLP tasks.

Data Availability
The data used to support the findings of this study are included within the article. Information about the datasets is given in the article (see Section 5.1).

Conflicts of Interest
The authors declare that they have no conflicts of interest.