Chinese Microblog Sentiment Detection Based on CNN-BiGRU and Multihead Attention Mechanism

With the rapid development of the Internet, Weibo has become one of the most commonly used social tools. People can express their opinions on Weibo freely, anytime and anywhere; as a result, the number of comments on Weibo has become extremely large. To gauge the attitudes of users towards a certain event, Weibo managers often need to evaluate the position of a certain microblog in an appropriate way, and analyzing the sentiment of microblog comments is an effective method for doing so. In traditional position detection tasks, researchers mainly mine the semantic features of text through feature engineering and sentiment dictionaries, which takes a large amount of manpower in feature selection and design. As deep learning matures, its use for sentiment detection has become increasingly popular. This paper proposes CNN-BiGRU-MAttention (CBMA), a method that combines convolutional neural networks (CNN), bidirectional GRU (BiGRU), and a multihead attention mechanism, for Chinese microblog sentiment detection. Firstly, CNN is applied to extract local features of the text vectors. Afterward, BiGRU networks are applied to extract the global features of the text, which addresses both the inability of a single CNN to obtain global semantic information and the vanishing-gradient problem of the traditional recurrent neural network (RNN). Finally, experiments with a variety of algorithms show that the CBMA algorithm is more accurate for Chinese microblog sentiment detection.


Introduction
Microblog refers to a broadcast-style social media and network platform for sharing, disseminating, and acquiring information based on user relationships, in which short real-time messages are shared through a follow mechanism. Users can instantly share and communicate information in multimedia forms such as text, pictures, and videos. The most famous microblog application in the world is Twitter [1] in the United States. In China, Sina Weibo, Tencent Weibo, and NetEase Weibo are microblog applications with very large numbers of users. With the popularization of the Internet, an increasing number of people have begun to use Weibo, and the number of posts and comments is increasing as well, which makes it increasingly difficult for Weibo managers to evaluate the sentiment of a certain microblog in the traditional way. Evaluating the sentiment trend of a microblog position means judging the sentiment trend through the analysis of microblog comments.
When we comment on a microblog, we can enter text or insert the emoji officially provided by Weibo into the text. Chinese microblog comments are shown in Figure 1(a).
Three microblog comments were selected in the figure, showing both the statements as commented by users and the statements as received by the Weibo back end. User comments may contain obvious emoji, but this does not affect data storage, because all emoji are converted into text when the comments are stored, as shown in Figure 1(b). Emoji are a way for users to express their sentiment, and we hope to make use of this important information when training on the data.
Deep learning and related technologies are an important research field. With these techniques, precise models can be trained from a very large number of features [2] and then applied to determine the classification result of a piece of data. These deep learning methods include convolutional neural networks, recurrent neural networks, and attention mechanisms. CNN [3] and RNN [4] are common deep learning methods in the field of natural language processing. CNN can learn and represent the features of data samples well through "end-to-end" learning, whereas the recurrent neural network is mostly used to process sequence data in accordance with actual application requirements. In order to memorize longer data sequences, recurrent neural networks have gradually evolved into long short-term memory (LSTM) networks [5] and bidirectional long short-term memory (BiLSTM) networks [6]. LSTM is a recurrent neural network specially designed to solve the long-term dependency problem of the general RNN. Compared with the unidirectional LSTM model, BiLSTM can make effective use of a large amount of contextual information. As an optimized structure of BiLSTM, BiGRU [7] retains the original effect while making the structure simpler. CNN can encode the character information of each word into its character representation, which effectively extracts the character features of the dataset.
A CBMA algorithm with feature templates was proposed in this paper on the basis of the previous description.
This algorithm first extracts features with CNN and then extracts global features of the text with BiGRU, so as to solve the problems that a single convolutional neural network fails to obtain global semantic information and that the gradient of the traditional recurrent neural network vanishes. Finally, the multihead attention mechanism is applied to improve the accuracy. The CBMA algorithm combines a deep learning algorithm with a manually selected feature template. It achieves better performance than existing models while also simplifying the structure.

Related Work
Traditional machine learning methods have been applied to classification problems. For example, Li et al. [8] used the XGBoost classification model to analyze the relationship between user characteristics and rumor refuting behavior across the five main rumor categories of economics, sociology, catastrophe, political science, and military science. At the same time, some researchers have improved traditional methods to achieve better classification. For example, Han et al. [9] proposed a Fisher kernel function and an FK-SVM method based on probabilistic latent semantic analysis. The Fisher kernel function improves the kernel function of the support vector machine. Compared with HIST-SVM and PLSA-SVM, the FK-SVM method achieves higher accuracy.
With the rapid development of natural language processing, numerous deep learning methods have begun to be applied to classification problems. Kim [10] utilized CNN to perform sentence-level classification tasks on pretrained word vectors. A series of experiments with convolutional neural networks based on word2vec showed that convolutional neural networks can be applied well to sentence classification tasks. Dai et al. [11] proposed a black-box backdoor attack against text classification systems based on LSTM. Sentiment analysis experiments were used to evaluate the backdoor attack, and the results showed that a small number of poisoned samples can achieve a high attack success rate. Ye et al. [12] proposed a Web service classification method based on the WiDE and BI-LSTM model. The wide learning model was applied to perform the breadth prediction of the Web service category, capturing the interaction between feature vectors of Web service description documents.
In recent years, some researchers have begun to use deep learning methods for sentiment analysis. Li et al. [13] proposed a BiLSTM sentiment classification method based on the self-attention mechanism and multichannel features. The model consists of two parts: the self-attention mechanism and multichannel features. Fu et al. [14] utilized an attention-based CNN-LSTM network to learn general sentence representations in embedded systems and introduced an attention mechanism. Experimental results showed that the CNN encoder is small in size, suitable for small embedded systems, and performs excellently. Sun et al. [15] proposed a sentiment analysis method for product comments that combines semantic feature mining with dictionary-based technology and has advantages over traditional machine learning methods. Yu et al. [16] proposed a word vector refinement model that does not require an annotated corpus and can be applied to any pretrained word vector. Experiments showed that this method improves the word embedding and sentiment embedding for traditional binary, ternary, and fine-grained sentiment classification. In addition, the performance of various deep neural network models has also been improved. Xu et al. [17] proposed a Chinese sentiment analysis method based on an extended dictionary. The extended sentiment dictionary includes the basic sentiment dictionary, some field sentiment words, and polysemous field sentiment words. A naive Bayesian field classifier was applied to classify the field of the text in which a polysemous sentiment word is located, so as to determine the sentiment polarity of the word. Chanlekha et al. [18] developed a semiautomatic sentiment dictionary construction tool for sentiment analysis in Thai. This method utilized sentiment co-occurrence and contextual consistency features to propagate sentiment polarity to unresolved sentiment feature pairs.
Naseem and Musial [19] proposed DICET, a Transformer-based sentiment analysis method. This method improves the quality of tweets by encoding their representations with the Transformer and applying deep intelligent contextual embedding, while taking the emotional, polysemous, syntactic, and semantic knowledge of words into consideration. Sailunaz and Alhajj [20] detected and analyzed the sentiment people express in their Twitter posts and used it to generate recommendations.
From the above research, it can be seen that popular machine learning methods such as logistic regression, SVM, and XGBoost achieved certain results in the early stage of sentiment analysis research. However, because of their weak feature extraction and nonlinear fitting capabilities, they are difficult to adapt to current sentiment analysis problems in the big data environment. Meanwhile, most sentiment analysis studies based on deep learning fail to take context into consideration when using CNN for sentiment classification, while the LSTM model can only consider the preceding context and converges slowly.
The BiGRU model, with its bidirectional sequential structure, can solve this problem. However, applying the BiGRU model directly may cause excessive computational overhead because of the excessively high input dimension. Therefore, in this paper, CNN is used to reduce the dimension of the word vector matrix formed from the original data, and its output is then fed into the BiGRU model for sentiment analysis. Finally, a multihead attention mechanism is introduced to further improve the operating efficiency and prediction accuracy of the model.

Materials and Methods
The CBMA algorithm is applied to Chinese microblog sentiment detection in this paper. The model is composed of CNN, BiGRU, and an attention mechanism. To describe the combined model in more detail, word embedding, CNN, BiGRU, the multihead attention mechanism, and the structure of the CBMA model are introduced in turn.

Word Embedding.
Word embedding is a general term for language model and representation learning techniques in natural language processing (NLP). It refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension [21], so that each word or phrase is mapped to a vector over the real numbers. Word embedding methods include artificial neural networks, dimension reduction of the word co-occurrence matrix, probability models, and explicit representation of the word in context. Used as the underlying input representation, word embeddings have greatly improved the performance of parsers and text sentiment analysis in NLP [22].
A Chinese microblog comment is usually a continuous Chinese sentence. In order to better train the deep learning model, it is necessary to decompose comments into multiple words and then train the words into word vectors. The data processing stage of this paper is shown in Figure 2.
The dataset utilized in this paper is weibo_senti_100k, a CSV file containing two columns. The first column indicates the sentiment of the comment with 1 or 0, and the second column is the content of the comment. To facilitate processing, the data was converted into a TXT file in which the two columns are separated by a TAB character, and word segmentation was then performed. English sentences consist of multiple English words separated by spaces. Although Chinese sentences are also composed of multiple words, adjacent words are written together without any separator. A sentence must therefore be split into words if words are to serve as the most basic input in deep learning. The Jieba ("stutter") tokenizer is commonly applied for word segmentation. After word segmentation is completed, the stop words are removed and the word vectors are then trained.
Word2Vec is commonly applied to train Chinese word vectors. Word2Vec is a word embedding model proposed by Mikolov in 2013. It can be utilized for word vector calculation and word vector generation. The algorithm used by Word2Vec is a shallow neural network with three layers. The generated word vectors can be applied as input to other neural networks in numerous tasks. Word2Vec mainly includes two models: CBOW (continuous bag of words) and Skip-gram. CBOW generates the current center word from the context of the word [23], while Skip-gram generates the context words from the current center word. In this way, word vectors carrying certain semantic information can be obtained. The Word2Vec model utilized in this paper is the Skip-gram model, which consists of a three-layer structure of input layer, mapping layer, and output layer. The architecture of the model is shown in Figure 3.
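The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the header names `label` and `review`, the sample comments, and the two-word stop list are hypothetical, and a character-level placeholder stands in for the word segmenter (a Jieba function such as `jieba.lcut` could be passed as `segment` instead).

```python
import csv
import io

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"的", "了"}


def char_segment(text):
    """Placeholder segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def csv_to_tsv_lines(csv_text, segment=char_segment, stop_words=STOP_WORDS):
    """Turn 'label,comment' CSV rows into 'label<TAB>tok tok ...' lines."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or row[0] not in ("0", "1"):
            continue  # skip the header row and malformed rows
        label, comment = row[0], row[1]
        tokens = [t for t in segment(comment) if t not in stop_words]
        lines.append(label + "\t" + " ".join(tokens))
    return lines


# Assumed sample data; the real file is not reproduced here.
sample = "label,review\n1,今天的天气真好\n0,太差了\n"
for line in csv_to_tsv_lines(sample):
    print(line)
```

Passing the segmenter as a parameter keeps the format conversion independent of the tokenizer, so the same function works whichever segmentation tool is chosen.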
For a known word $w_t$ in the sequence $(w_{t-n}, w_{t-n+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+n-1}, w_{t+n})$, there are $2n$ context words as target words, and the probability of obtaining the target words is $p(\mathrm{Context}(w) \mid w)$. Its objective function is shown in formula (1):

$$L = \sum_{w \in C} \log p(\mathrm{Context}(w) \mid w), \tag{1}$$

where $C$ denotes the corpus. The idea of the Skip-gram model is to generate a dataset of (input, output) pairs. First, a dataset is established from the known words and their contexts, and the window size is set. Afterward, the pairs are generated by combining each input word with the target words that fall within the window.
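The generation of (input, output) pairs for Skip-gram can be sketched as below. This is an illustrative sketch only: the function name `skipgram_pairs` and the sample tokens are assumptions, and words near the sentence boundary simply receive fewer than $2n$ context words.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Assumed example sentence, already segmented into words.
tokens = ["我", "喜欢", "这", "部", "电影"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

With a window size of $n$, an interior word contributes $2n$ pairs, which matches the $2n$ target words mentioned above.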

CNN.
CNN is one of the representative algorithms of deep learning [24]. It is a feedforward neural network with a deep structure that includes convolution calculations. Research on convolutional neural networks began in the 1980s and 1990s. In the 21st century, CNN has developed rapidly with the introduction of deep learning theory and the improvement of computing equipment, and it has been applied to computer vision and natural language processing. CNN was constructed by imitating the biological visual perception mechanism and can perform both supervised and unsupervised learning. The sharing of convolution kernel parameters within the hidden layer and the sparsity of the connections between layers enable CNN to learn grid-like features with a small amount of calculation. The structure of the CNN is shown in Figure 4. The whole structure is composed of an input layer, a convolutional layer, a pooling layer, and a fully connected layer.
Each input in the input layer is a sentence [25] after word segmentation. The input consists of the word vector of each word in the sentence, and one word vector corresponds to one row of the input layer in Figure 4. Suppose that the comment text is preprocessed into $n$ words and each word is converted through Word2Vec word embedding into an $m$-dimensional vector; the word sequence of the sentence is then spliced into an $n \times m$ matrix:

$$X = [x_1; x_2; \ldots; x_n] \in \mathbb{R}^{n \times m}. \tag{2}$$

The convolutional layer performs convolution on the input with convolution kernels to obtain feature maps. One convolution kernel is one feature extractor, and a plurality of convolution kernels are a plurality of feature extractors. To extract features better, multiple convolution kernels are generally used. $[P_1, P_2, P_3, \ldots, P_z]$ is used to represent a combination of $z$ convolution kernels, where $P_z$ is the size of the $z$-th convolution kernel, that is, the vertical dimension of the kernel window. The horizontal size of the kernel window is the dimension of the word vector. Through the calculation of $z$ convolution kernels, $z$ feature map vectors are obtained. For sentence information given as an $n \times m$ matrix, assuming that the size of the convolution window is $h$, the size of the convolution kernel is $h \times m$. Specifically, the window of $h$ words slides in accordance with the step length $t$, and the convolution kernel is applied to the input word windows $x_{1:h}, x_{2:h+1}, x_{3:h+2}, \ldots, x_{n-h+1:n}$ to extract the local features of the text. Assuming that the input sentence $d$ is composed of $n$ word vectors $x_1, x_2, \ldots, x_n$, the operation of the convolutional layer can be expressed as

$$y_i = f(W \cdot x_{i:i+h-1} + b), \tag{3}$$

where $W$ is the weight matrix of the convolution kernel, $b$ is the bias, and $f$ is the activation function. The eigenvector $y$ obtained after convolution kernel extraction is

$$y = [y_1, y_2, y_3, \ldots, y_{n-h+1}]. \tag{4}$$

After the convolution operation, the pooling layer performs pooling on each eigenvector, converting a multidimensional vector into a single value that is used as an element of the pooled vector. The pooling layer uses maximum pooling; that is, the sequence output from the convolutional layer is input to the pooling layer, which selects the largest element in the sequence $y_1, y_2, y_3, \ldots, y_{n-h+1}$ to obtain a new value $\hat{y}$:

$$\hat{y} = \max(y_1, y_2, y_3, \ldots, y_{n-h+1}). \tag{5}$$
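The convolution and maximum pooling steps can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the experiments: a single kernel, a step length of 1, ReLU as the activation function, and hand-picked toy values for the sentence matrix and the kernel.

```python
def conv_feature_map(X, W, b, h):
    """Slide an h x m kernel W over the n x m sentence matrix X (step length 1)."""
    n, m = len(X), len(X[0])
    feature_map = []
    for i in range(n - h + 1):
        total = b
        for r in range(h):
            for c in range(m):
                total += W[r][c] * X[i + r][c]
        feature_map.append(max(0.0, total))  # ReLU, an assumed activation
    return feature_map


def max_pool(feature_map):
    """Maximum pooling: keep only the largest element of the feature map."""
    return max(feature_map)


# Toy sentence with n = 4 words and m = 2 dimensions (assumed values).
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # one kernel with window size h = 2
y = conv_feature_map(X, W, b=0.0, h=2)
print(y, max_pool(y))
```

The feature map has $n - h + 1$ elements, and pooling reduces it to one value per kernel, so $z$ kernels yield a pooled vector of length $z$.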

BiGRU.
The Gated Recurrent Unit (GRU), a kind of RNN, was proposed by Cho et al. [26]. Similar to LSTM, it was proposed to solve problems such as long-term memory and gradients in backpropagation. RNN is a class of neural networks that takes sequential data as input and performs recursion along the direction in which the sequence evolves, with all neurons connected in a chain. Because of the recurrent connections added in the hidden layer, neurons can receive information from their own past moments as well as from other neurons at the same moment. Therefore, RNN has the characteristics of memory and parameter sharing. In addition, RNN is well suited to learning the nonlinear features of sequence data [27]. To address the problem that the RNN gradient vanishes and long-term historical load features cannot be learned, scholars proposed LSTM, which can learn the correlations within both long-term and short-term sequence data. In recent years, in response to the excessive parameters and slow convergence of LSTM [28], GRU has been derived. GRU is a variant of LSTM that has fewer parameters and converges faster while maintaining the good learning performance of LSTM. The GRU model is internally composed of an updating gate and a resetting gate. Unlike LSTM, GRU replaces the input gate and forgetting gate of LSTM with the updating gate, which represents the influence of the output of the hidden layer neurons at the previous moment on the hidden layer neurons at the current moment. The larger the value of the updating gate, the greater the influence. The resetting gate represents the degree to which the hidden layer neuron output at the previous moment is ignored. The larger the value of the resetting gate, the less information is ignored. The structure of GRU is shown in Figure 5.
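The hidden layer unit A can be calculated by the following formulae:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1}),$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})),$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,$$

where $z_t$ and $r_t$ are the updating gate and resetting gate, respectively; $\sigma$ is the Sigmoid function; $\tanh$ is the hyperbolic tangent function; $\odot$ denotes element-wise multiplication; and $W_r$, $U_r$, $W_z$, $U_z$, $W$, and $U$ are all training parameter matrices. The candidate activation state $\tilde{h}_t$ at the current moment is jointly determined by the resetting gate $r_t$, the output $h_{t-1}$ of the hidden layer neuron at the previous moment, the input $x_t$ at the current moment, and the training parameter matrices $W$ and $U$.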
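A single GRU step can be sketched in plain Python as follows. This is a hedged illustration, not the implementation used in the experiments: bias terms are omitted, the dictionary keys for the parameter matrices are hypothetical names, and the state update assumes the convention in which a larger updating gate keeps more of the previous hidden state, in line with the description of the updating gate above.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU step; `p` maps assumed names to weight matrices."""
    # Resetting gate r_t and updating gate z_t.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    # Candidate state built from the input and the reset-scaled previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))]
    # Assumed convention: a larger z keeps more of the previous state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("W_r", "U_r", "W_z", "U_z", "W", "U")}
print(gru_step([1.0, 2.0], [0.4, -0.2], params))
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one, which gives a quick sanity check of the update rule.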
The BiGRU network has the capacity to learn the relationship between past and future load influencing factors and the current load, which is more conducive to extracting the deep features of load data [29]. The structure of BiGRU is shown in Figure 6.
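Taking the second time step as an example, the output $y_2$ is calculated as

$$y_2 = g(V A_2 + V' A_2'),$$

where $A_2$ and $A_2'$ are calculated as

$$A_2 = f(W A_1 + U x_2), \qquad A_2' = f(W' A_3' + U' x_2).$$

In the forward calculation, the hidden layer value $s_t$ is related to $s_{t-1}$. In the reverse calculation, the hidden layer value $s_t'$ is related to $s_{t+1}'$. The final output depends on the sum of the forward and reverse calculations. The calculation method of the bidirectional recurrent neural network is

$$o_t = g(V s_t + V' s_t'), \qquad s_t = f(U x_t + W s_{t-1}), \qquad s_t' = f(U' x_t + W' s_{t+1}').$$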

Cross-Entropy Loss Function.
The cross-entropy loss function is often used as the loss function for classification problems, especially in neural networks [30]. In addition, since cross-entropy involves calculating the probability of each category, it almost always appears together with the Sigmoid (or softmax) function [31]. The expression of the Sigmoid function is as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

Differentiating the Sigmoid function $\sigma(z)$ gives

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)).$$

When the absolute value of $z$ is large, the curve of the Sigmoid function becomes flatter, which indicates that the derivative $\sigma'(z)$ is closer to zero. In the binary case, the model ultimately needs to predict only two outcomes [32], whose predicted probabilities are $p$ and $1 - p$. The cross-entropy loss function can then be expressed as

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$

where $N$ is the number of samples, $y_i$ represents the label of sample $i$ (1 for the positive class and 0 for the negative class), and $p_i$ represents the probability that sample $i$ is predicted to be positive. Learning tasks are divided into binary and multiclass cases, and the learning processes of the two are discussed separately. Take gradient descent on a single sample as an example:

$$z = wx + b,$$
$$a = \sigma(z),$$
$$C = \frac{1}{2}(a - y)^2,$$
$$L = -\left[ y \log a + (1 - y) \log (1 - a) \right],$$

where $x$ is the input, $y$ is the true label, and $a$ is the predicted value. The first two formulae are the linear and nonlinear parts of forward propagation, respectively. The third formula is the mean square error loss function, and the fourth is the cross-entropy loss function. The purpose of gradient descent is to reduce the distance between the true value and the predicted value, and the loss function measures that distance. Therefore, the purpose of gradient descent is to reduce the value of the loss function. How can the value of the loss function be reduced?
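The only variables are $w$ and $b$; thus, what we have to do is to keep modifying the values of $w$ and $b$ so that the loss function becomes smaller and smaller. The update process of $w$ and $b$ is as follows:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b},$$

where $L$ is the loss function and $\alpha$ represents the learning rate, which controls the step length, that is, the length of one step of descent.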
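The update of $w$ and $b$ on a single sample can be sketched as follows. This is a minimal illustration under assumptions: scalar $w$, $b$, and $x$, a Sigmoid output with the cross-entropy loss, and arbitrarily chosen values for the sample and the learning rate. For this combination the gradient with respect to the linear part simplifies to $a - y$, where $a$ is the predicted value.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, a):
    """Binary cross-entropy for one sample with label y and prediction a."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def gradient_step(w, b, x, y, alpha):
    """One gradient descent update of w and b on a single sample."""
    a = sigmoid(w * x + b)
    # For sigmoid + cross-entropy, dL/dz = a - y.
    grad_w = (a - y) * x
    grad_b = a - y
    return w - alpha * grad_w, b - alpha * grad_b


w, b = 0.0, 0.0
x, y = 2.0, 1  # assumed toy sample
losses = []
for _ in range(5):
    losses.append(cross_entropy(y, sigmoid(w * x + b)))
    w, b = gradient_step(w, b, x, y, alpha=0.5)
print(losses)
```

Each update moves $w$ and $b$ against the gradient, so the recorded loss shrinks from one iteration to the next in this example.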

Multihead Attention Mechanism.
The visual attention mechanism is a brain signal processing mechanism unique to human vision. Human vision scans the global image quickly to obtain the target area that needs to be focused on, which is commonly known as the focus of attention. Afterward, more attention resources are devoted to this area to obtain more detailed information about the target, while other useless information is suppressed [33].
This is a means for human beings to quickly screen out high-value information from a large amount of information with limited attention resources. It is a survival mechanism formed in the long-term evolution of human beings. The human visual attention mechanism greatly improves the efficiency and accuracy of visual information processing.
In recent years, the attention mechanism has been widely applied in various fields of deep learning, including image processing, speech recognition, and natural language processing. Therefore, technicians who are concerned with the development of deep learning technology need to understand the working principle of the attention mechanism.
In the task of Chinese text classification, it is necessary to pay attention to the word vectors of key Chinese words and ignore the word vectors that are not related to the context. Applying the attention mechanism to the input text data makes the word vectors of key Chinese words the dominant information, thereby improving the efficiency and accuracy of the entire neural network model. The attention mechanism also prompts the model to focus on these Chinese words when similar sentences appear again in the future, which improves the learning and generalization capabilities of the model. The self-attention mechanism is a special case of the general attention mechanism. $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix of attention, respectively. In the self-attention mechanism, $Q = K = V$. Its advantage is that it ignores the distance between words and calculates their dependency relationship directly. In addition, it can learn the internal structure of a sentence [34] and pay attention to the connections between the words within it. Applied in combination with RNN and CNN models, it is conducive to improving the learning ability of the model and enhancing the interpretability of the neural network. The basic structure of multihead attention is shown in Figure 7. The scaled dot-product attention at the central position is a variant of general attention. Given matrices $Q \in \mathbb{R}^{n \times d}$, $K \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{n \times d}$, scaled dot-product attention can be calculated by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V,$$

where $d$ is the number of hidden units in the neural network. Multihead attention adopts the self-attention mechanism, which means $Q = K = V$ in the figure. The advantage of this is that the information of the current position can be related to the information of all other positions, capturing the dependencies within the entire sequence.
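Scaled dot-product attention can be sketched in plain Python as follows. This is an illustrative sketch only: matrices are represented as lists of rows, the toy input is an assumption, and the linear projections used by each head of multihead attention are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with matrices given as lists of rows."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Self-attention on a toy sequence of 3 positions with d = 2 (assumed values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in scaled_dot_product_attention(X, X, X):
    print(row)
```

Each output row is a weighted average of the rows of $V$, with weights determined by how strongly the corresponding query matches every key, which is how every position attends to all other positions.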
For example, if the input is a sentence, attention must be calculated between each word and all words in the sentence. In this model, multihead attention applies linear transformations to the three inputs $Q$, $K$, and $V$ before performing the calculation. Since it is a "multihead" mechanism, the scaled dot-product attention needs to be computed multiple times. The number of "heads" is the number of these computations, and the linear projections of $Q$, $K$, and $V$ are different for each head. Take the $i$-th head as an example:

$$Q_i = Q W_i^{Q}, \qquad K_i = K W_i^{K}, \qquad V_i = V W_i^{V},$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the projection matrices of the $i$-th head. Since this layer receives the output of the BiGRU layer, denoted $H$, we have

$$Q = K = V = H.$$

The final result of this head is

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i).$$

CBMA Model.
The CBMA model is shown in Figure 8. Before model training, microblog comments were segmented into words. Afterward, the comments were converted into word vectors through Word2Vec embedding, and the trained word vectors were taken as the input of the model. A convolutional network is applied first to perform feature extraction on the input word vectors. The output of the convolutional feature extraction is used as the input of BiGRU. The BiGRU is followed by an attention mechanism module, whose output is pooled by a maximum pooling layer and passed to a fully connected layer. Finally, the Sigmoid function is applied for classification. In addition, the cross-entropy loss function is utilized to evaluate the model, and the category of the input sentence is obtained in the end.

Evaluation Index.
Accuracy, precision, recall, and F-measure (F1) were applied as the evaluation indexes of the model in this paper. Accuracy is the fraction of all microblog comments whose sentiment is correctly predicted [35], that is, the proportion of examples that the classifier labels correctly out of the total number of examples. Precision is the fraction of relevant instances among all retrieved instances. Recall is the fraction of all relevant instances that are actually retrieved. F-measure (F1) is the harmonic mean of precision and recall. Their calculation formulae are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $TP$ is the number of positive samples correctly predicted as positive, $FP$ is the number of negative samples incorrectly predicted as positive, $FN$ is the number of positive samples incorrectly predicted as negative, and $TN$ is the number of negative samples correctly predicted as negative.
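The four evaluation indexes can be computed from the confusion-matrix counts as in the sketch below. The function name and the example counts are hypothetical and are not results from the experiments; returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts, for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(acc, prec, rec, f1)
```

Because F1 is the harmonic mean of precision and recall, it is pulled towards the lower of the two values, which makes it more informative than accuracy when the two kinds of error are unbalanced.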

Dataset.
The dataset utilized in this experiment is the microblog comment corpus weibo_senti_100k from Sina Weibo, in which all the data carry sentiment annotations. The distribution of positive and negative data in the dataset is shown in Table 1. The dataset contains a total of 119,988 entries, including 59,994 positive comments and 59,994 negative comments. The dataset was segmented, and the positive and negative data were divided into training data, test data, and validation data in a certain proportion. The specific ratio of the division is shown in Table 2.

experimental results. To address this problem, 7 convolution kernel configurations with different quantities and sizes were tested experimentally in this paper. The accuracy, precision, recall rate, and F1 value of the test results are shown in Table 3.

Experiment with Different Convolution Kernels.
From the experimental data in the above table, it can be seen that when the four convolution kernels [1,3,5,7] were utilized, the accuracy, precision, recall rate, and F1 value of the model were slightly higher than with the other convolution kernel combinations. An analysis of the experimental process for these 7 types of convolution kernels is shown in Figure 9. First of all, the variation trend of the validation-set accuracy during training is presented. The six figures above correspond to the results of the convolution kernel allocation experiments in Table 4. These results mainly include the change of the training set with the number of iterations and the change of the validation-set accuracy as the number of iterations increases.

Experiments with Different Layers of BiGRU.
In the CBMA model utilized in this paper, the BiGRU can have one layer or multiple layers. To verify the influence of the number of BiGRU layers on the model and to find the optimal number, experiments were conducted with several different numbers of BiGRU layers. The experimental results are shown in Table 5.
As the number of BiGRU layers increased, the training time also changed. The change in the training time of each round for the different numbers of BiGRU layers in this experiment is shown in Figure 10.

Experiments with Different Learning Rates.
As an important hyperparameter in supervised learning and deep learning, the learning rate determines whether and when the objective function converges to a local minimum. A suitable learning rate allows the objective function to converge to a local minimum in a reasonable time. If the learning rate is too small, convergence will be excessively slow. If the learning rate is too large, the cost function will oscillate. To find an optimal learning rate, experiments with multiple learning rates were conducted, and the accuracy, precision, recall, and F1 value of each experiment were recorded. The experimental results are shown in Table 6. For training under these five learning rates, the change in accuracy as the number of iterations increased is shown in Figure 11, and the change in the loss function is shown in Figure 12.
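The effect of the learning rate described above can be reproduced on a toy problem. The following sketch minimizes f(x) = x^2 with plain gradient descent at several step sizes: a very small rate barely moves, a moderate rate converges, a large rate oscillates around the minimum, and a still larger one diverges. The function and the learning rates are illustrative assumptions and are unrelated to the five rates tested in the paper.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with fixed-step gradient descent; return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = 2.0 * xs[-1]  # derivative of x**2
        xs.append(xs[-1] - lr * grad)
    return xs


for lr in (0.001, 0.1, 0.9, 1.1):
    xs = gradient_descent(lr)
    # A sign change between consecutive iterates means the update overshot
    # the minimum at x = 0, i.e. the iterates oscillate.
    sign_flips = sum(1 for a, b in zip(xs, xs[1:]) if a * b < 0)
    print(f"lr={lr}: final |x|={abs(xs[-1]):.3e}, sign flips={sign_flips}")
```

Each update multiplies x by (1 - 2·lr), so the behavior follows directly from whether that factor is close to 1, small in magnitude, negative, or larger than 1 in magnitude.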

Experiments with Different Methods.
A variety of methods were tested for comparison. These methods included traditional machine learning algorithms, namely decision tree, KNN, Naive Bayes, random forest, GBDT, SVM, and logistic regression. Accuracy, precision, recall rate, and F1 value were recorded for each experiment, and these four evaluation indexes were applied to evaluate each model. The experimental results are shown in Table 4.
In addition, several deep learning algorithms were tested: the combined model of GRU and multihead attention, the BiGRU and multihead attention model, and the convolution, GRU, and multihead attention model. The same four evaluation indexes were recorded and applied to evaluate each model. The experimental results are shown in Table 7.
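For reference, the four evaluation indexes can be computed from the counts of true and false positives and negatives. The sketch below uses the standard binary definitions with the positive sentiment class as the target; whether the paper reports positive-class or averaged scores is not stated, so this choice, the function name `binary_metrics`, and the toy labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 1 = positive comment, 0 = negative comment.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

On a balanced corpus such as this one, accuracy and F1 tend to move together, which is consistent with reporting all four values side by side.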

Experiment Analysis of Experiment 1.
In order to verify the influence of the convolutional network layer on the CBMA algorithm model in the feature extraction process, the convolution kernel with the same number and size was applied to carry out experiments.

Figure 9: Experiments with different convolution kernels: accuracy of the training set and the validation set versus the number of iterations. (a) Convolution kernel: [1,3], (b) convolution kernel: [1,3], (c) convolution kernel: [1,3,5], (d) convolution kernel: [1,3,5,7], (e) convolution kernel: [1,3,5,7,9,11], and (f) convolution kernel: [1,3,5,7,9,11,13].

Contrast Test of Multilayer BiGRU.
The multilayer BiGRU test results are shown in Table 5, and the model training time is shown in Figure 10. It can be seen from Table 5 that as the number of GRU layers increased, the accuracy rate increased slightly. In addition, the accuracy of the model was the same when the number of layers was 2 and 3. It can be seen from Figure 10 that as the number of BiGRU layers increased, the training time for each iteration cycle also increased significantly, so the total training time grows greatly as the number of iterations increases. For this reason, a single BiGRU layer was selected for training in the CBMA algorithm model.

Contrast Test of Various Learning Rates.
The contrast test results of various learning rates are shown in Table 6. A total of 5 learning rates were utilized in the experiment. It can be seen from the experimental results that as the learning rate decreased, the accuracy first increased and then decreased. When the learning rate was 0.001, the accuracy rate [...]

To verify the feasibility and reliability of the algorithm proposed in this paper for Chinese microblog sentiment detection, the same embedded word vectors were fed into a variety of models for training and testing. The models in this test were traditional machine learning algorithms, and the experimental results are shown in Table 4. It can be seen from Table 4 that, among the traditional machine learning methods, the decision tree had the lowest test accuracy, at only 72.47%, and logistic regression had the highest, at 97.65%. The experimental results showed that the proposed CBMA model had a great advantage in accuracy.

Comparison of Traditional Deep Learning Methods.
In addition, four deep learning models were tested. The test results are shown in Table 7. It can be seen from the test results that the GRU-MAttention model had the lowest accuracy, at 97.35%, while the CBMA model had the highest accuracy, at 97.65%, which shows that the CBMA model achieved better results than the other deep learning models.

Conclusions
For the sentiment detection of Chinese microblogs, this paper proposed the CBMA algorithm model, which combines CNN and BiGRU networks and introduces a multihead attention mechanism, based on the respective characteristics of CNN, bidirectional GRU networks, and the multihead attention mechanism. The model takes full account of the advantages of the CNN in extracting local features of the text and of the BiGRU network in extracting global features of the text, as well as the contextual information of the text, so that text features are extracted effectively. Each component of the model was analyzed experimentally, including tests of various convolution kernel combinations and various numbers of BiGRU layers. Moreover, various traditional machine learning methods were tested as well. Extensive experiments showed that the CBMA algorithm model performs better on the weibo_senti_100k dataset of microblog comments. We hope that this study can contribute to the field of microblog sentiment detection.
Data Availability
The datasets used in this paper to produce the experimental results are publicly available. Weibo_senti_100k can be downloaded from https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/weibo_senti_100k/intro.ipynb.