Performance Analysis of Hybrid Deep Learning Models with Attention Mechanism Positioning and Focal Loss for Text Classification

Over the past few decades, text classification has been widely utilized in many real-time applications. Leveraging text classification methods to develop new applications in the fields of text mining and Natural Language Processing (NLP) is very important. In order to accurately classify documents in many applications, deeper insight into deep learning methods is required, as there is exponential growth in the number of complex documents. The success of any deep learning algorithm depends on its capacity to understand the nonlinear relationships of the complex models within data. Thus, a huge challenge for researchers lies in the development of suitable techniques, architectures, and models for text classification. In this paper, hybrid deep learning models, with an emphasis on the positioning of the attention mechanism, are considered and analyzed for text classification. The first hybrid model proposed is called the convolutional Bidirectional Long Short-Term Memory (Bi-LSTM) with attention mechanism and output (CBAO) model, and the second hybrid model is called the convolutional attention mechanism with Bi-LSTM and output (CABO) model. In the first hybrid model, the attention mechanism is placed after the Bi-LSTM, and then the output Softmax layer is constructed. In the second hybrid model, the attention mechanism is placed after the convolutional layer and followed by the Bi-LSTM and the output Softmax layer. The proposed hybrid models are tested on three datasets; the results show that when the proposed CBAO model is implemented on the IMDB dataset, a high classification accuracy of 92.72% is obtained, and when the proposed CABO model is implemented on the same dataset, a classification accuracy of 90.51% is obtained.


Introduction
A representative topic in NLP analysis is text classification. For managing the tremendous number of text documents in the fields of web mining, information retrieval, and NLP, text classification plays a vital role [1]. The knowledge gained from text is employed in an efficient manner; thus, the assignment of one or more predefined topics to a natural text document is done easily in text classification. Versatile machine learning techniques have been utilized for the purpose of text classification, such as Bayesian techniques [2], support vector machines [3], K-nearest neighbors (KNN) [4], neural networks [5], and hidden Markov models (HMMs) [6]. In recent years, a dramatic improvement has been shown by deep learning techniques for text classification purposes. A good performance is achieved by Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) when implemented for medical image recognition, face recognition, voice recognition, etc. [7]. In order to synthesize images and voice, generative models like Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are utilized successfully and implemented in the entertainment industry [8]. Similarly, deep learning has been successfully implemented in NLP fields too. For analyzing and representing human languages in an automatic manner, NLP is of great utility. CNNs and RNNs have attracted huge attention for NLP tasks such as text classification, sentiment analysis, and text summarization [9]. In recent years, the CNN has attracted huge attention from researchers as it shows good performance in text classification. When CNNs are exploited for text classification, the embedding of text in a multidimensional vector space is the most important task. In text classification, word embedding is constantly employed, where word2vec based on the skip-gram model is the most commonly used technique [10].
An improved RNN architecture is the LSTM, where a gating mechanism consisting of an input gate, a forget gate, and an output gate is utilized [11]. These gates determine whether the data in the previous state is retained or forgotten. The problem of long-term data conservation, along with the vanishing gradient issue, is solved efficiently by the gating mechanism. The strong ability of the LSTM to extract text information in a versatile manner plays a significant part in text classification. In recent years, the utility of the LSTM has been explored to a great extent, and researchers are always modifying or revamping LSTMs so that their accuracy can be further improved.
Some of the recent works involving deep learning for efficient text classification are stated as follows. The simplest deep learning models utilized for text representation are feedforward neural networks [12]. The Deep Average Network (DAN) is an example of this model, and due to its simplicity, it easily outperforms other models [13]. Text is viewed as a sequence of words in RNN-based models. Under the category of RNN-based models, a tree-LSTM model was developed by Tai et al. [14], and an extended chain-structured LSTM was developed by Zhu et al. [15]. RNNs are quite versatile at capturing the local structure of a word sequence, but they face a lot of difficulties when remembering long-range dependencies. Patterns across space are recognized by CNNs when dealing with text classification. A simple CNN-based model for text classification was proposed by Kim [16], a character-level CNN was proposed by Zhang et al. [17], and text encoding by CNN was proposed by Prusa and Khoshgoftaar [18]. In order to overcome the pooling problem caused by CNNs, capsule networks are utilized for text classification. The representation of sentences as vectors is adopted by capsules in capsule neural networks [19]. Other works involving capsule neural networks for classification are discussed in [20][21][22]. Models with an attention mechanism for text classification were also proposed by many researchers. A hierarchical attention network for text classification was proposed by Yang et al. [23]. The hierarchical attention model was extended to cross-lingual sentiment classification by Zhou et al. [24]. A directional self-attention network for RNN/CNN was presented by Shen et al. [25], and an LSTM model with inner attention was presented by Liu et al. [26]. To extract sentence embeddings in a predictable manner, self-attention was used by Lin et al. [27]. A neural semantic encoder (NSE), which is a memory-augmented neural network, was proposed by Munkhdalai and Yu [28].
To process the input questions and to generate their respective answers, a dynamic memory network was proposed by Kumar et al. [29].
Graph neural network models for text classification were also utilized by Peng et al.; they convert the text to a graph-of-words format [30]. A graph convolutional network (GCN) and its respective variants were utilized in [31]. A graph CNN which explored the document-word relations and the co-occurrence of words was built in [32]. For text matching applications, Siamese Neural Networks (SNNs) were widely used [33]. Transformers were also utilized for text classification, as they allow for more parallelization than other deep learning models, so that very large models can be trained quite efficiently [34].
Hybrid models were also developed; they combine the usage of LSTM and CNN so that the local and global features of the words and sentences can be captured well [35]. To combine the unique merits of LSTM, CRF, and CNN and to address their individual weaknesses, a Bi-LSTM-CRF-CNN hybrid model was also proposed in [36]. A convolutional LSTM (C-LSTM) was proposed by Zhou et al., where a sequence of higher-level phrases is extracted by a CNN and then passed on to an LSTM network so that the sentence representation is obtained [37]. A CNN-RNN model was proposed by Tang et al., where a CNN is utilized to learn the representation of sentences and a gated RNN is used to learn a document representation [38]. For document modeling, a hierarchical dependency-sensitive CNN (DSCNN) model was proposed by Zhang et al. [39]. A hybrid deep learning architecture called HDLTex, which utilizes a Multilayer Perceptron (MLP), an RNN, and a CNN, was utilized by Kowsari et al. so that the document hierarchy can be well understood for easy text classification [40]. Based on word2vec's Continuous Bag of Words (CBOW) design, a Bi-LSTM neural network architecture was designed by Melamud et al., where a general context embedding function was learned very efficiently for variable-length sentence contexts [41]. A well-known text classification model based on word2vec and LSTM was proposed by Xiao et al. so that text can be classified efficiently in the security field, where a pretrained word2vec model was used to overcome the high dimensionality [42]. A hybrid model utilizing CNN and LSTM with the effective use of normalization techniques, dropout techniques, and rectified linear units was proposed by Rehman et al., and the results show a high accuracy and precision rate [43]. A hybrid approach was proposed by She and Zhang, where the local features were extracted with the help of a CNN and the weakness of the LSTM was tackled well by this model [44].
An efficient classification model termed CNN-COIF-LSTM was proposed in [45], where the experiments were conducted with different variants. A local CNN-LSTM model was proposed by Wang et al., where the regional information in sentences is considered along with the long-distance dependencies between sentences [46]. By combining different word embeddings with different learning approaches like LSTM, Bi-LSTM, CNN, and GRU, a general hybrid model was proposed by Salur and Aydin [47]. A self-interaction technique with an attention perspective intermingled with label embedding was utilized by Dong et al. for efficient text classification, where attention schemes have been given good importance [48]. Boosting of the model is done by the attention mechanism so that the learnable parameters are reduced and the accuracy is improved further. The total quantity of the input features shrinks as the model utilizes a CNN to extract the features from various positions in a sentence. The contextual information is extracted from the features by the LSTM. Over the inputs, bias alignment is utilized by the attention mechanism, and therefore weights are assigned to the input components that are highly correlated with the classification. Thus, during the training phase, the number of parameters to be learned is mitigated. For variable-length sequences, the weight distribution is also enhanced by the attention mechanism. To make sure that the model has a good accuracy rate, weights pretrained using word2vec are utilized. Later, the performance metrics are analyzed thoroughly.
In this work, two hybrid models, distinguished by the positioning of the attention mechanism, are developed for efficient text classification. The organization of the work is as follows: Section 2 discusses the implementation of the CBAO model, and Section 3 discusses the implementation of the CABO model, followed by the results and discussion in Section 4 and the conclusion in Section 5.

Deep Learning Framework
Model 1: CBAO Model

Word2vec Process.
The natural language is transformed into distributed vector representations by the popular sequence embedding technique called word2vec. The capturing of contextual word-to-word relationships is done in a multidimensional space, and therefore word2vec is usually utilized as a significant step for prediction in information and semantic retrieval assignments. There are two distinct components in the word2vec process, namely, CBOW and skip-gram. When the context words are given, the target word is inferred by the CBOW component. When an input word is specified, the context words are inferred by the skip-gram component.
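The two components differ only in the direction of prediction, which is visible in how their training pairs are formed. The sketch below is a simplified, hypothetical pair-generation step (not the full word2vec training procedure, which also involves negative sampling and gradient updates):

```python
# Contrast of the two word2vec training objectives: skip-gram predicts each
# context word from the target word, while CBOW predicts the target word
# from its surrounding context. Pair generation only; training is omitted.

def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word predicts each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_list, target) pairs: the surrounding words predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

sentence = "text classification with word embeddings".split()
print(skipgram_pairs(sentence, window=1)[:2])
```

The same sliding window yields opposite input/output roles in the two components, which is the distinction the paragraph above describes.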

CNN Modeling.
In text classification tasks, 2D CNNs have recently been used in a similar way to image classification tasks and have performed better than sequence-based techniques. A convolution layer and a subsampling layer are utilized to generate a group of convolutions and poolings; thus, the feature map is constructed by the CNN. In our work, a 1D cross-correlation operation is utilized by the convolution layer of a 1D CNN by means of a sliding convolution window. As the input text is traversed from left to right, the sliding convolution window uses kernels of variable size. A 1D global maximum pooling layer is used as the max-over-time pooling layer; thus, the number of features required to encode the text is reduced. The 1D CNN structure utilized for text classification is presented in Figure 1.

Bi-LSTM Process.
The Bi-LSTM neural network comprises LSTM units that operate in both directions so that the past and future context information can be incorporated very easily. Without the need to retain any duplicate context information, the long-term dependencies can be read well by the Bi-LSTM. For sequential modeling problems, the Bi-LSTM has excellent demonstrated performance, and therefore it can be applied to text classification very efficiently. In order to capture the properties of the two contexts, the two parallel layers of the Bi-LSTM network propagate in two directions with forward and reverse passes. Figure 2 illustrates the sequential feature capturing process using the Bi-LSTM architecture.

Attention Mechanism.
There are two important issues in RNN-based seq2seq models. There are chances of information loss along with the vanishing gradient problem, as the compression of all the information into one fixed-size vector cannot be done without loss. When the length of the input sequence increases, the accuracy deteriorates to a great extent. Therefore, to enhance the prediction accuracy, the attention mechanism has been proposed. At every time step, the prediction of the output word is done by the decoder, and the full input sentence from the encoder is referenced once again. All the input words need not be referenced equally: more attention is given to the input word that is highly related to the word to be predicted. Initial weights are assigned to each input by the attention mechanism. During the training phase, these weights are updated based on the correlation between the input and the ultimate prediction.
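A minimal sketch of this weighting idea follows. A dot-product score is assumed here, which is one common choice; the specific scoring function is not fixed at this point of the discussion:

```python
import numpy as np

def attention_weights(decoder_state, encoder_states):
    """Score each encoder state against the decoder state (dot product),
    then normalize with softmax so highly correlated inputs get more weight."""
    scores = encoder_states @ decoder_state          # one score per time step
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax, sums to 1
    return weights

def context_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states: a fixed-size summary of the input."""
    w = attention_weights(decoder_state, encoder_states)
    return w @ encoder_states

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 encoder time steps, hidden size 8
s = rng.normal(size=8)        # current decoder state
print(context_vector(s, H).shape)
```

Note that the context vector has the hidden size rather than the input length, which is how attention avoids compressing an arbitrarily long input into one fixed vector chosen in advance.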

CBAO Model.
Considerable research attention has been devoted to LSTMs, and many studies based on them are quite popular. To manage the drawback of vanishing gradients, the gating schemes utilized by LSTMs help them track long-term dependencies when long sequences are taken as input. However, existing LSTMs largely fail to extract both the local content information and the context information of future tokens. The various relationships between specific parts of a document cannot be recognized properly, and therefore the accuracy of the LSTM is further damaged. To handle these drawbacks efficiently, a 1D CNN is employed here. The distinct merits of the LSTM and the CNN are leveraged well in this study, and therefore a hybrid model is proposed where a CNN is utilized for text feature extraction and a Bi-LSTM component with an attention mechanism is utilized for sentiment classification. The complex associations between adjacent words are captured so that the classification accuracy is increased, thereby achieving the objective of this study. The proposed CBAO model architecture is indicated in Figure 3.

Sequence Embedding Layer.
The preprocessing of the text was done before the data was fed to the model. Preprocessing encompasses tasks like eliminating redundant, meaningless, and duplicate words, along with the conversion of other forms of words to their approximations. A distinct and quite meaningful sequence of words is provided by the preprocessed dataset, where a unique identification is provided for each word. The distributed representation of every preprocessed input token is learned thoroughly by the embedding layer. The latent associations between words that are more likely to emerge in similar contexts are reflected by the token representation. Vectors pretrained by word2vec's skip-gram model were utilized.
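The mapping from unique word identifiers to embedding rows can be sketched as follows. The vocabulary-building helper and the toy sentence are hypothetical, and a random matrix stands in for the pretrained skip-gram vectors:

```python
import numpy as np

def build_vocab(tokens):
    """Assign each distinct token a unique integer id (0 reserved for padding)."""
    vocab = {"<pad>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = "the movie was good the acting was good".split()
vocab = build_vocab(tokens)
ids = np.array([vocab[t] for t in tokens])   # unique id per word type

# Embedding matrix: one row per vocabulary entry. In the paper this would
# hold the pretrained word2vec skip-gram vectors; random values stand in here.
emb_dim = 4
E = np.random.default_rng(1).normal(size=(len(vocab), emb_dim))
embedded = E[ids]                            # (sequence_length, emb_dim)
```

Repeated words map to the same id and hence the same vector, which is what lets the embedding layer reflect the latent associations described above.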

1D-CNN Model.
Extraction of features from the input text is the main role of the convolution process. From the original text, the low-level semantic features are extracted by the convolutional layers, and the number of dimensions is reduced so that the text is processed as sequential data. Several 1D convolutional kernels are utilized in this study to perform the convolution over the input vectors. By concatenating the embedding vectors of the component words, a vector of the sequential text is obtained and defined as

Y_{1:T} = y_1 ⊕ y_2 ⊕ ... ⊕ y_T,

where T indicates the number of tokens in the text and ⊕ denotes concatenation.
Convolution kernels with different sizes are applied to Y_{1:T} so that the fundamental unigram, bigram, and trigram features are captured using a 1D CNN. When a window of s words extending from t to t + s − 1 is considered as the input of the t-th convolution, the features generated for that particular window by the convolution process are represented as

c_t = f(W_s · y_{t:t+s−1} + b_s),

where y_{t:t+s−1} represents the embedding vectors of the words in the window, W_s is the learnable weight matrix, b_s is the bias, and f is a nonlinear activation function. Each filter is applied to the different regions of the text, and therefore the feature map of a filter with convolution size s is represented as

c^s = [c_1, c_2, ..., c_{T−s+1}].

One of the primary advantages of utilizing convolution kernels with varied widths is that the hidden correlations between several adjacent words are well captured. Max-over-time pooling reduces the number of trainable parameters during feature learning and is a salient characteristic of employing a CNN for textual feature extraction. Many convolution channels act upon the input, and every channel comprises values at various time steps. During max-over-time pooling, the largest value over all time steps is taken from the output of each convolution channel. For each convolution kernel of size s, max-over-time pooling over its feature map is represented as

p^s = max_t c^s_t.

The concatenation of p^s for every filter size s = 1, 2, 3 is done to obtain the final feature map by extracting the unigram, bigram, and trigram hidden features:

p = [p^1; p^2; p^3].

By utilizing the CNN before the LSTM, the total number of dimensions in the input features is reduced before being passed to the classifier or prediction model.
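The convolution and max-over-time pooling steps above can be sketched in NumPy as follows. This is a simplified illustration assuming ReLU as the nonlinearity f; the filter weights are random placeholders rather than trained values:

```python
import numpy as np

def conv1d_feature_map(Y, W, b):
    """One filter of width s over the T x d embedding matrix Y:
    c_t = ReLU(W . Y[t:t+s] + b) at each window position."""
    T, d = Y.shape
    s = W.shape[0]
    return np.array([np.maximum(0.0, np.sum(W * Y[t:t + s]) + b)
                     for t in range(T - s + 1)])   # feature map, length T-s+1

def max_over_time(c):
    """Max-over-time pooling: keep only the strongest response per filter."""
    return c.max()

rng = np.random.default_rng(2)
Y = rng.normal(size=(10, 6))       # 10 tokens, embedding size 6
p = [max_over_time(conv1d_feature_map(Y, rng.normal(size=(s, 6)), 0.1))
     for s in (1, 2, 3)]           # unigram, bigram, trigram features
final = np.array(p)                # concatenated final feature vector
```

In practice each width would use many filters (channels), so the pooled output per width is a vector rather than a scalar; the single-filter case above keeps the mapping to the equations direct.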

Bi-LSTM Attention Layer.
In this model, the primary classification constituent is constructed on an attention-based Bi-LSTM. Even though the input features are shrunk by the CNN for prediction purposes, the correlation between each word and the final classification is not the same. The advantages of the CNN and the Bi-LSTM are leveraged in this study. In order to encode the long-distance word dependencies in an effective manner, the Bi-LSTM is used. The features generated by the CNN are fed to the Bi-LSTM, which provides easy and full-fledged access to the contextual information, for both the preceding and subsequent tokens. The information procured by the Bi-LSTM can be considered as two different textual renditions.
A representation of the sequence is produced by the Bi-LSTM from the features reaped from the CNN. The attention layer receives this final feature representation, and thus the features that are highly correlated with the classification are selected for the final decision. The prediction accuracy is considerably increased by the attention mechanism, and the number of learned weights required for the prediction is reduced. The attention mechanism utilized in this proposed model is Bahdanau attention with its respective attention scores.

CABO Model.
The proposed CABO model comprises four layers: a word embedding layer, a CNN layer with attention mechanism, a Bi-LSTM layer, and an output Softmax layer. Initially, when a text is given as input, this text is transformed into a discourse vector by the word embedding layer using a dictionary index. The extraction of features from the text is done with the help of the CNN layers with attention mechanism. The features are learned deeply by means of the Bi-LSTM layer. Finally, with the help of the Softmax output layer, the classification result is given as the output by the model with a good loss function. The detailed explanation is given in the following subsections. Figure 4 shows the architecture of the proposed CABO model.

Word Embedding.
Mapping words to real vectors is the primary idea of word embedding. In order to gain a good structural expression of the data, the preprocessing of the text should be done before word embedding. To remove the low-frequency words, the preprocessing of the words is done. For the sequence w_1, w_2, ..., w_l, the length of the input text is l. In order to get the word vector matrix V = [v_1, v_2, ..., v_l], word2vec was used, where V ∈ R^(l×s_w) and s_w indicates the word vector size. The word vectors have to be fine-tuned very carefully during the training phase so that the feature extraction performance is improved. This is done in order to avoid confusion, as words can have various meanings in specific contexts.

CNN with Attention Mechanism.
For the purpose of feature extraction, the most widely used connection-based model is the CNN. Multiple connection-based filters with the same window size are present in the convolution layer, and they progress towards the output of the final layer. In order to learn the local features of the word vectors, 2 ResNet blocks are utilized, as shown in Figure 5. In one block, 3 convolutional layers are present, followed by batch normalization and ReLU activation. The building block is defined as

z = ReLU(F(y) + y),

where y and z represent the input and output vectors of one ResNet block, respectively, and F(·) denotes the residual mapping learned by the stacked convolutional layers. After the ReLU activation, the final output of the ResNet layer is expressed as h_1, h_2, ..., h_l. In order to explore the important components of the high-level semantics, the attention mechanism is added after the ResNet layer. For all the states h_1, h_2, ..., h_l,

e_t = Σ_{i=1}^{l} q^i_t h_i,  with  q^i_t = softmax(s_t^T tanh(W h_i)),

where e_t indicates the encoded state denoted by the weighted sum of h_1, h_2, ..., h_l at time t, and q^i_t is the weight of h_i. W ∈ R^(s×s) and s_t ∈ R^s are utilized to transform h_i into a scalar score. The model multiplies the outputs of the attention mechanism and the ResNet and sends the result to the next layer.
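The attention computation described above can be sketched as a minimal NumPy illustration. The tanh scoring follows the description of W and s_t; the dimensions below are placeholder values:

```python
import numpy as np

def attention_encode(H, W, s_t):
    """H: (l, s) hidden states h_1..h_l; W: (s, s); s_t: (s,).
    Each h_i is scored by s_t^T tanh(W h_i), the scores are softmax-normalized
    into weights q_i, and the encoded state e_t is the weighted sum of h_i."""
    scores = np.tanh(H @ W.T) @ s_t          # one scalar score per state
    scores = scores - scores.max()           # numerical stability
    q = np.exp(scores) / np.exp(scores).sum()
    e_t = q @ H                              # weighted sum, shape (s,)
    return e_t, q

rng = np.random.default_rng(3)
H = rng.normal(size=(7, 4))                  # l = 7 states, hidden size s = 4
e, q = attention_encode(H, rng.normal(size=(4, 4)), rng.normal(size=4))
```

The weights q form a probability distribution over the l states, so the encoded state e_t stays inside the span of the hidden states while emphasizing the semantically important ones.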

Bi-LSTM Module.
In order to overcome the gradient vanishing issue of the RNN, the LSTM has been proposed. Bi-LSTM duplicates the first recurrent layer in the network so that the two layers exist side by side. The bidirectional network can be demonstrated with the sample sentence "I am very tired and therefore I want to AAA for the whole day." From the understanding of the word "tired," the prediction for AAA may be "sleep," "eat," "dance," "sing," "play," etc. Based on the following words "the whole day," the irrelevant options that are unsuitable in this context can be identified and removed. A Bi-LSTM learns the information from both directions of the context, while a Uni-LSTM learns the information from only one direction. The framework of the Bi-LSTM is shown in Figure 6. The input is expressed as matrix Y, and the output is expressed as matrix O. The output of the forward-layer memory cell is expressed by the sequence h_1, h_2, ..., h_l, and the output of the backward-layer memory cell is expressed as h'_1, h'_2, ..., h'_l. The full operation of the Bi-LSTM is expressed as

h_t = f(U_1 y_t + W_1 h_{t−1}),
h'_t = f(U_2 y_t + W_2 h'_{t+1}),
o_t = g(V_1 h_t + V_2 h'_t),

where the weight matrices of the network are expressed by U_i, V_i, and W_i, respectively, and the nonlinear activation functions are expressed as f(·) and g(·). Based on the forward-layer state h_t and the backward-layer state h'_t, the output o_t is computed at each time step t.
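A minimal sketch of the bidirectional recurrence follows. The LSTM gating is omitted for brevity, so each cell is reduced to a plain tanh recurrence, and all weights are random placeholders:

```python
import numpy as np

def bidirectional_pass(Y, U1, W1, U2, W2, V1, V2):
    """Simplified bidirectional recurrence (LSTM gates omitted):
    forward:  h_t  = tanh(U1 y_t + W1 h_{t-1})
    backward: h'_t = tanh(U2 y_t + W2 h'_{t+1})
    output:   o_t  = V1 h_t + V2 h'_t"""
    T = Y.shape[0]
    k = U1.shape[0]
    h_f = np.zeros((T, k))
    h_b = np.zeros((T, k))
    prev = np.zeros(k)
    for t in range(T):                      # forward pass, left to right
        prev = np.tanh(U1 @ Y[t] + W1 @ prev)
        h_f[t] = prev
    nxt = np.zeros(k)
    for t in reversed(range(T)):            # backward pass, right to left
        nxt = np.tanh(U2 @ Y[t] + W2 @ nxt)
        h_b[t] = nxt
    return h_f @ V1.T + h_b @ V2.T          # combine both directions

rng = np.random.default_rng(4)
T, d, k = 6, 5, 3                           # 6 steps, input dim 5, hidden dim 3
O = bidirectional_pass(rng.normal(size=(T, d)),
                       rng.normal(size=(k, d)), rng.normal(size=(k, k)),
                       rng.normal(size=(k, d)), rng.normal(size=(k, k)),
                       rng.normal(size=(k, k)), rng.normal(size=(k, k)))
```

Each output row o_t mixes a state that has seen tokens 1..t with a state that has seen tokens t..T, which is what lets the "whole day" context rule out unsuitable completions in the example above.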

Output Layer with Good Focal Loss for Both the CBAO and CABO Models.
Softmax classification is used in the output layer of both models. The output size is the number of text classes, and the conditional probability of each class is expressed by (9), where Softmax nonlinear activation is utilized to achieve probability normalization:

p_i = exp(z_i) / Σ_{j=1}^{N} exp(z_j),  (9)

where p_i denotes the probability that the output features z_1, z_2, ..., z_N reflect class i, and i, j ∈ {1, 2, ..., N}, where N is the number of classes.
In order to solve the class imbalance problem, the focal loss function is utilized instead of the cross entropy. By adding a Gaussian weight, the focal loss is improved so that the small classes can be emphasized. The Gaussian weight is greater if there is a smaller number of samples in a particular class, and therefore the attention paid by the model is greater. The improved loss function [49] builds on the focal loss

FL(p_i) = −α_i (1 − p_i)^γ log(p_i),

where the weighting factor is expressed by α_i and the count of every class is expressed by C_i. To the cross-entropy loss, the modulating factor (1 − p_i)^γ is added by the focal loss with parameter γ ≥ 0. The focal loss is equivalent to cross entropy when γ = 0. α is decreased when γ increases. In this work, γ = 2.5 and α = 0.5, with α_i ∈ [0, 1], chosen after several trial-and-error experiments. To control the weight of the loss of every class, the parameters β and σ of the Gaussian weight are utilized effectively.
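The standard focal loss underlying the improved loss above can be sketched as follows. The Gaussian class weighting of [49] is omitted here, since its exact form is not given; the settings γ = 2.5 and α = 0.5 from this work are used as defaults:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over logits."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def focal_loss(probs, targets, alpha=0.5, gamma=2.5):
    """Focal loss -alpha * (1 - p)^gamma * log(p) on the probability of the
    true class. gamma = 0 recovers alpha-weighted cross entropy; larger
    gamma down-weights well-classified (easy) examples."""
    p_t = probs[np.arange(len(targets)), targets]
    return np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t))

logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 1.5, 0.3]])
y = np.array([0, 1])
loss = focal_loss(softmax(logits), y)
```

The modulating factor shrinks the contribution of confident, correct predictions, so minority-class (typically harder) examples dominate the gradient, which is the class-imbalance effect described above.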

Results and Discussion
In this section, the datasets analyzed, the parameter settings of the proposed deep learning techniques, the performance evaluation metrics, and a comparison with previous works are discussed in detail.

Datasets Analyzed.
To evaluate the performance of the proposed model, the experiments are conducted on three large-scale public datasets as follows.

Yelp 2016.
This dataset is obtained from the Yelp Dataset Challenge (https://www.yelp.com/dataset/challenge). A five-level rating is present in this dataset, which ranges from 1 to 5, and therefore the documents can be classified into five classes.

Amazon Review.
It is obtained from Amazon products data (http://jmcauley.ucsd.edu/data/amazon/). From May 1996 to July 2014, the product reviews and the metadata from Amazon are obtained in this dataset. To the product reviews, a five-level rating is provided ranging from 1 to 5. All the other necessary and in-depth details of the dataset can be obtained from the mentioned web link.

IMDB.
The IMDB dataset (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) has been developed for the purpose of binary sentiment classification of movie reviews. In this dataset, the numbers of positive and negative reviews are equal, with 25,000 reviews in the training set and 25,000 reviews in the test set.

Parameter Settings of the Proposed Deep Learning Techniques.
As far as the IMDB dataset is concerned, the proposed two models have the following specifications when performing the experiment. The embedding of the word vector utilized in this investigation was done using the skip-gram technique in word2vec. The embedding size was set to 500. The dropout rate was set to 0.4, and the batch size was set to 128.
The Adam optimizer was utilized along with an L2 regularizer. As far as the Yelp 2016 and Amazon review datasets are concerned, the proposed two models have the following specifications. Once the embedding of the word vector using the skip-gram technique in word2vec is done, the word embedding matrix is updated based on the stochastic gradient descent process. The embedding dimension is set to 500, and in all the layers, the scale of the gradients is almost similar. The loss function utilized is the focal loss function, and the batch size is set to 50. When the norm exceeds a specific threshold, gradient scaling is utilized to perform gradient clipping. The models are trained with a learning rate of 0.001 by the stochastic gradient approach. The number of batches is set accordingly so that overfitting does not occur. The kernel size of the CNN is set to 5, the dropout rate is set to 0.5, and the number of filters is 512 in our experiment. The activation function is ReLU, and the weights are updated using the Adam optimizer. For the different parameters, an adaptive and independent learning rate is designed by Adam by means of computing the first-order moment and second-order moment of the gradient. To protect against overfitting, random dropping of units is done by the dropout layer, thereby making the model more effective and robust. A reasonable dropout rate along with the input dimensionality is necessary for modeling, as these factors can affect the time efficiency to a large extent.

Performance Evaluation Metrics.
The following performance metrics are used for evaluation in our work. The classifier performance is computed by accuracy, expressed as

Accuracy = (TP + TN) / (TP + TN + FP + FN).

The sensitivity is expressed as

Sensitivity = TP / (TP + FN).

The specificity is expressed as

Specificity = TN / (TN + FP).

The precision is expressed as

Precision = TP / (TP + FP).

The geometric mean (g-mean) is indicated as

g-mean = sqrt(Sensitivity × Specificity).

The Matthews Correlation Coefficient (MCC), which is a Chi-square-type statistic, is expressed as

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).

Figure 7 demonstrates the six major performance parameters for the CBAO model. It is found from Figure 7 that the CBAO model attained a higher accuracy, 92.72%, on the IMDB dataset than on the other two datasets. The model reached an accuracy of 63.78% for the Yelp 2016 dataset and 66.02% for the Amazon review dataset. For the Yelp 2016 dataset, a low value of F-score is due to a lesser specificity of 57.14%, and a low value of F-score is achieved in the Amazon review dataset for the same reason. The IMDB dataset achieved a high g-mean value of 92.68% when compared to the g-mean values of 63.43% and 65.80% in the Yelp 2016 and Amazon review datasets, respectively. Figure 8 indicates the five minor performance parameters, namely, error rate, false positive rate (FPR), false negative rate (FNR), balanced coefficient rate (BCR), and J-coefficient, for the CBAO model. It is found from Figure 8 that the IMDB dataset demonstrates a low error rate of 7.28%, against high error rates of 36.22% and 33.98% in the Yelp 2016 dataset and Amazon review dataset, respectively. In the case of the J-coefficient parameter, the CBAO model is well distinguished by a higher value of 85.71% on the IMDB dataset. For the Yelp 2016 dataset and Amazon review dataset, the J-coefficient maintained values of 48.27% and 46.15%, respectively.
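These metrics can be computed directly from the confusion-matrix counts, as sketched below; the counts used are illustrative placeholders, not values from the experiments:

```python
import numpy as np

def binary_metrics(tp, tn, fp, fn):
    """Confusion-matrix based metrics used in the evaluation above."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                   # recall / true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    precision   = tp / (tp + fp)
    g_mean      = np.sqrt(sensitivity * specificity)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision,
                g_mean=g_mean, mcc=mcc)

m = binary_metrics(tp=45, tn=40, fp=5, fn=10)
```

Unlike accuracy, the g-mean and MCC penalize classifiers that favor the majority class, which is why they accompany accuracy in the figures discussed above.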
This low value of the J-coefficient is due to the high error rate in the classification of the CBAO model for the Yelp 2016 and Amazon review datasets. Figure 9 exhibits the six major performance parameters for the CABO model. It is found from Figure 9 that the CABO model attained a high accuracy of 90.51% on the IMDB dataset compared to the other two datasets. The model reached an accuracy of 68.15% for the Yelp 2016 dataset and 60.75% for the Amazon review dataset. In the case of the Yelp 2016 dataset, a low value of F-score is due to a lesser sensitivity of 60.12%, and a low value of F-score is achieved in the Amazon review dataset for the same reason. The IMDB dataset achieved a high g-mean value of 90.38% when compared to the g-mean values of 67.67% and 60.45% in the Yelp 2016 and Amazon review datasets, respectively. Figure 10 demonstrates the five minor performance parameters, namely, error rate, FPR, FNR, BCR, and J-coefficient, for the CABO model. It is found from Figure 10 that the IMDB dataset demonstrates a low error rate of 9.49% and high error rates of 31.85% and 39.25% in the Yelp 2016 and Amazon review datasets, respectively. When compared with previous works on the IMDB dataset, [...] training [54], and a large BERT model [55] produced an accuracy of 95.79%. As far as the Amazon datasets are concerned, a character-level CNN [17] produced 59.46%, a deep pyramid CNN [56] produced 65.82%, CCapsNet [57] produced 60.95%, the large BERT model [55] produced 62.20%, and the base BERT model [55] produced 61.60%. As far as the Yelp datasets are concerned, the character-level CNN [17] produced 62.05%, the deep pyramid CNN [56] produced 69.40%, CCapsNet [57] produced 65.85%, a fine-tuned BERT model [58] produced 62.92%, and the base BERT model [55] produced 70.58%.
However, when our results with the proposed two models are considered and analyzed, the results obtained are quite good: a high classification accuracy of 92.72% was obtained with the proposed CBAO deep learning model for the IMDB dataset, a good classification accuracy of 66.02% was obtained with the proposed CBAO deep learning model for the Amazon review dataset, and a good classification accuracy of 68.15% was obtained with the proposed CABO model for the Yelp 2016 dataset. Consequently, when comparing our work with previously obtained results, our results seem promising and effective, though many of the previously published results have surpassed ours by obtaining slightly higher classification accuracy. In the fields of machine learning and pattern recognition, it is not always most important to obtain a better classification accuracy; the methodology implemented, along with the interpretation of the results, is quite important, and this has been the focus of our work. This research mainly focused on the analysis of the attention mechanism positioning methodology and, from the results obtained, the method seems suitable for carrying out future analysis with a similar kind of strategy, not only for text classification but for other domains too.

Conclusion and Future Work
The technique of categorizing text into a group of predefined categories is called text classification. The analysis of text can be done automatically by text classification using NLP by means of assigning a set of predefined categories depending on its context. Thus, NLP, which is classified into rule-based systems, machine-learning-based systems, and hybrid systems, is utilized for various applications such as topic detection, sentiment analysis, and natural language inference. In this work, a fair survey of text classification models is initially introduced, and then two kinds of hybrid models are designed by giving special emphasis to the positioning of the attention mechanism. The results show that when the proposed CBAO deep learning model is tested on the IMDB dataset, Amazon review dataset, and Yelp 2016 dataset, classification accuracies of 92.72%, 66.02%, and 63.78% are obtained, respectively. For the proposed CBAO deep learning model, error rates of 7.28%, 33.98%, and 36.22% are obtained for the IMDB dataset, Amazon review dataset, and Yelp 2016 dataset, respectively. Similarly, when the proposed CABO deep learning model is tested on the IMDB dataset, Amazon review dataset, and Yelp 2016 dataset, classification accuracies of 90.51%, 60.75%, and 68.15% are obtained, respectively. For the proposed CABO deep learning model, error rates of 9.49%, 39.25%, and 31.85% are obtained for the IMDB dataset, Amazon review dataset, and Yelp 2016 dataset, respectively. In the future, we aim to work with a plethora of other deep learning models for efficient text classification.
Data Availability
The data will be made available to researchers upon request to the corresponding author.