Weighted Joint Sentiment-Topic Model for Sentiment Analysis Compared to ALGA: Adaptive Lexicon Learning Using Genetic Algorithm

Latent Dirichlet Allocation (LDA) is an unsupervised learning approach that investigates the semantics among words in a document as well as the influence of a subject on a word. As an LDA-based model, Joint Sentiment-Topic (JST) examines the impact of topics and emotions on words. The emotion parameter alone is insufficient, and additional parameters may play valuable roles in achieving better performance. In this study, two new topic models, Weighted Joint Sentiment-Topic (WJST) and Weighted Joint Sentiment-Topic 1 (WJST1), are presented that extend and improve JST through two new parameters and can generate a sentiment dictionary. In the proposed methods, each word in a document affects its neighbors, and different words in the document may be affected simultaneously by several neighbor words. Therefore, the proposed models consider the effect of words on each other, which, in our view, is an important factor and can increase the performance of the baseline methods. According to the evaluation results, the new parameters have an immense effect on model accuracy. While not requiring labeled data, the proposed methods are more accurate than discriminative models such as SVM and logistic regression. The proposed methods are simple and have a low number of parameters. While providing a broad perception of the connections between different words in documents of a single collection (single-domain) or multiple collections (multidomain), the proposed methods offer solutions for both situations: WJST is suitable for multidomain datasets, and WJST1 is a version of WJST suitable for single-domain datasets. While detecting emotion at the document level, the proposed models improve the evaluation outcomes of the baseline approaches. Thirteen datasets of different sizes have been used in the implementations.
In this study, perplexity, opinion mining at the document level, and topic_coherency are employed for assessment. In addition, the Friedman statistical test is used to check whether the results of the proposed models differ statistically from those of the other algorithms. As the results show, the accuracy of the proposed methods is above 80% for most of the datasets. WJST1 achieves its highest accuracy on the Movie dataset with 97%, and WJST achieves its highest accuracy on the Electronic dataset with 86%. The proposed models obtain better results than Adaptive Lexicon learning using Genetic Algorithm (ALGA), which employs an evolutionary approach to build an emotion dictionary. The results also show that the proposed methods perform well under different topic number settings, especially WJST1 with 97% accuracy at |Z| = 5 on the Movie dataset.


Introduction
Opinion extraction is one of the main branches of natural language processing (NLP) research. Comment extraction (emotion analysis) is now widely used on websites offering different types of merchandise. Online product reviews can help customers buy a product and help manufacturers discover new opportunities by analyzing user feedback. Consequently, automated analysis of reviews is critical. An emotion analyzer can browse comments on the web and tag large numbers of comments as positive or negative. This research is important because it makes managing customer requests easier and more efficient: product owners can automatically extract customer feedback and use it to sell products. There are different methods for extracting and analyzing opinions, and in this research, an intelligent method has been used [1][2][3][4][5][6][7]. Topic modeling presumes that the input set of text documents contains several unknown subjects that need recognition. Each subject (topic) is an unknown distribution over words, and each review (text document) is a distribution over subjects. The aim is to detect knowledge concealed in textual data related to users' comments. Several methods perform subject modeling, such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA). PLSA is a method that can produce the data perceived in a document-term matrix. LDA is a probabilistic method because it is expressed in a probabilistic language, and it is a generative model because it describes how documents are produced. LDA is based on the premise that a review is a combination of subjects in which each topic is a distribution over words. The linear growth of PLSA parameters indicates that the method is prone to overfitting. LDA can easily be extended to new documents. In addition, increasing the training data size does not lead to growth in the number of LDA parameters [7].
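As a brief illustration of the topic-modeling setting described above (not the implementation used in this study), the following sketch fits LDA with scikit-learn: each review becomes a distribution over topics, and each topic a distribution over vocabulary words. The toy reviews and the choice of two topics are assumptions for the example only.

```python
# A minimal LDA sketch using scikit-learn (illustrative only; the models in
# this study are custom Gibbs-sampling implementations, not this library call).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "the battery life of this phone is great",
    "terrible battery and poor screen quality",
    "great movie with a wonderful story",
    "the story was boring and the acting poor",
]

# Build the document-term matrix: each review becomes a bag of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Fit LDA with 2 latent topics; each document becomes a distribution over
# topics, and each topic a distribution over vocabulary words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)

print(doc_topic.shape)        # (4, 2)
print(doc_topic.sum(axis=1))  # per-document topic proportions sum to ~1.0
```

Unlike PLSA, this model generalizes to unseen documents via `lda.transform`, which matches the extensibility argument made above.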
In LDA, subjects are related to documents, and words are related to subjects. To model the emotion of reviews, Joint Sentiment-Topic (JST) [8] establishes an extra layer of emotion between the document and subject layers, where the emotion labels are related to the documents, the subjects are related to the emotion labels, and words are tagged with emotions and related topics. This study assumes that each word in a document affects its neighbors and that different words in the document may be affected simultaneously by several neighbor words.
Thus, the proposed models consider the effect of words on each other. The proposed models add two parameters (weight and window) to JST. The window parameter represents the range of a word's effect, and the weight parameter represents the strength of that effect. These two parameters play an important role in better classification, as seen in the evaluation section. Using the weight and window parameters, two new methods are introduced that reveal notable dominance over baseline algorithms such as JST, Topic Sentiment modeling (TS) [9], Reverse-JST (RJST) [10], and the Tying-JST model (TJST) [8].
More and more improved algorithms and strategies are used to solve sentiment analysis problems. However, no previous work has improved accuracy while also generating a sentiment dictionary. Different from other related studies, the proposed models improve topic-model-based sentiment classification using two parameters (weight and window). The proposed models consider the effect of words on each other. They can also generate a sentiment dictionary that includes words and scores specifying positive and negative labels and their weights. Accuracy is calculated using two formulas. Finally, by evaluating the proposed methods and comparing them with other algorithms on thirteen datasets of different sizes, the results show that the algorithms presented in this study are superior to the compared algorithms in terms of accuracy, perplexity, and topic_coherency. The rest of this article is arranged as follows: Section 2 gives a summarized overview of previous work on emotion analysis and the use of topic modeling in emotion analysis. The proposed models are described in Section 3. The evaluation results are discussed in Section 4, and Section 5 concludes this article.

Related Works
The value of emotion analysis may be highlighted by analyzing customer happiness with online services such as email. It is also feasible to employ emotion mining to evaluate the opinions of various people in order to make them aware of things that have favorable reviews. The major levels of classification in emotion analysis are document, sentence, and aspect. An opinion is a quadruple (g, s, h, t), where g is the sentiment target, s is the sentiment, h is the opinion holder, and t is the time the opinion was expressed [11,12,13]. Many attempts have been made to detect emotions and explore the knowledge embedded in text data. Topic modeling obtains the concealed subjects of documents. In topic modeling, the aim is to discover the best set of hidden variables that can express the observed data. LDA has been used as a topic model to effectively explore subjects in documents [7]. LDA has motivated countless algorithms extended to solve different problems [14][15][16][17]. In [18], the authors exhibit three topic models that improve LDA using date, helpfulness, and subtopic parameters. Articles [8,10,19] describe the JST methodology. This model expands LDA with a sentiment layer.
This method cannot accurately identify the different emotions and is used as a baseline method in most articles. Several methods are similar to JST [8,10,20]. The Aspect and Sentiment Unification Model (ASUM) [20] is similar to JST. JST assumes that each word represents an aspect, but ASUM assumes that each sentence represents a description of an aspect. A variation of the JST model is TJST [8]. The main difference between JST and TJST is that, to sample a word in a document during the generative process, JST selects a subject-document distribution for each document, whereas TJST uses one subject-document distribution for all documents. According to [10], the emotion influences the subject in JST, whereas in RJST, the subject influences the emotion. According to [9], there is only one topic-sentiment distribution for all documents in TS, while there is a distribution for each document in RJST.
Several methods that use topic modeling have been introduced for text emotion analysis [21][22][23][78]. In [24], the authors introduce an algorithm that creates a review containing both shared subjects and subjects distributed over words as special data. Two topic models are proposed in [79]: Multilabel Supervised Topic Model (MSTM) and Sentiment Latent Topic Model (SLTM). Both methods can be used to categorize social emotions. In [25], the authors introduce Sentiment Enriched and Latent-Dirichlet-Allocation-based review rating Prediction (SELDAP) to predict ratings using the topics and sentiments of reviews. In [26], the authors introduce a method named Hierarchical Clinical Embeddings combined with Topic modeling (HCET), which can integrate five types of Electronic Health Record (EHR) data over several visits to predict depression. The authors of [80] presented the word Sense aware LDA (SLDA) approach, which uses word sense in topic formation. In [27], the authors present a survey of different short-text topic modeling methods. They provide a detailed analysis of the algorithms and discuss their performance. The authors proposed a segment-level joint topic-sentiment model (STSM) in [81], where each sentence is divided into parts by conjunctions under the assumption that all terms in a segment convey the same emotion. In [28], the authors provided a thorough examination of subject modeling methods.
Deep learning provides an approach to utilizing large volumes of computation and data with little manual engineering. Recently, deep learning approaches to analyzing emotions have achieved considerable success [29,30,47,77]. Optimization methods have developed significantly in recent years [31][32][33][34][35][36][37]. Optimization methods are widely used in feature selection, notably for text. In [38], the authors proposed a multiobjective grey wolf optimization algorithm to categorize sentiments. In [39], the authors proposed a binary grey wolf optimizer method to classify labels in text. In [40], the authors introduced a new optimization method that mimics the model of a successful person in society. Their article used this method to categorize emotions and achieved very good results. There are several works on using user behavior for sentiment analysis. The Tag Sentiment Aspect (TSA) framework, a new probabilistic generative topic framework, was presented by [48] with three implementation editions. TSA is based on LDA. In [41], the authors concentrate on user-based methods on social networks, where users create text data to show their views on different topics and make connections with other users to create a social network. In [42], the authors used a signed social network to detect the emotions of reviews as an unsupervised approach. Various works use other techniques for sentiment analysis problems [43][44][45]. In Adaptive Lexicon learning using a Genetic Algorithm (ALGA) [46], several emotion dictionaries for a dataset are constructed in the training stage using the genetic method. These sets are utilized in the testing stage. Each lexicon comprises both words and their scores. In the genetic approach, a chromosome is modeled as a vector of emotional words and scores. Scores lie in the range (the lowest score of an emotional word, the highest score of an emotional word).
The main goal of ALGA is to create a lexicon that minimizes the error in the training stage.
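The chromosome-as-lexicon idea can be sketched as follows; this is a rough illustration of the representation and fitness described above, not ALGA's exact setup. The toy word list, scores, reviews, and sign-based scoring rule are all assumptions made for the example.

```python
# Sketch of ALGA's representation: a chromosome is a vector of scores over
# emotional words, and fitness is the classification error on training data.
# All values below are illustrative, not from the ALGA paper.
words = ["good", "great", "bad", "poor"]
chromosome = [2.0, 3.0, -2.5, -1.5]          # one score per emotional word

train = [
    ("good camera great battery", 1),        # 1 = positive, -1 = negative
    ("poor screen bad sound", -1),
]

def classify(text, scores):
    # Sum the scores of the lexicon words found in the review; sign -> label.
    total = sum(scores[words.index(t)] for t in text.split() if t in words)
    return 1 if total >= 0 else -1

def error(scores, data):
    # Training error that the genetic algorithm would minimize.
    wrong = sum(1 for text, label in data if classify(text, scores) != label)
    return wrong / len(data)

print(error(chromosome, train))  # 0.0 -> this lexicon fits the toy training set
```

A genetic algorithm would then mutate and recombine such score vectors, keeping those with lower training error.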
In [47], the authors proposed a deep learning-based topic-level opinion mining method. The approach is novel in that it works at the level of the sentence to explore the subject using online latent semantic indexing and then employs a subject-level attention method in an extended short-term memory network to detect emotion. In [62], the authors proposed a joint aspect-based sentiment topic model that extracts multigrained aspects and emotions. In [49], parts-of-speech (POS) tagging is performed via a hidden Markov model, and unigram, bigram, and bi-tagged features are extracted. In addition, the nonparametric hierarchical Dirichlet process is employed to extract the joint sentiment-topic features. In [50], the authors used an unsupervised machine learning method to extract emotion at the document and word levels. In [51], the authors proposed a new framework for joint sentiment-topic modeling based on the Restricted Boltzmann Machine (RBM), a type of neural network. In [52], the authors proposed a probabilistic method to incorporate textual reviews and overall ratings, considering their natural connection, for joint sentiment-topic prediction. In [53], the authors proposed a hybrid topic-model-based method for aspect extraction and emotion categorization of reviews. LDA is used for aspect extraction, and a two-layer bidirectional long short-term memory network is used for emotion categorization. In [54], the authors proposed a joint sentiment-topic model that uses a Markov Random Field regularizer and can extract more coherent and diverse topics from short texts. In [55], the authors proposed a topic model with a new document-level latent sentiment variable for each topic, which moderates the word frequency within a topic. In [56], the authors proposed a new method for text emotion detection, aiming to improve the LSTM network by integrating emotional intelligence and an attention mechanism. In [57], the authors proposed a new model for aspect-based emotion detection.
The model is a novel adaptation of the LDA algorithm for product aspect extraction.
In [58], the authors introduced a new deep learning-based algorithm for emotion detection that uses available ratings as weak supervision signals. In [59], the authors introduced a new deep learning-based algorithm for emotion detection using two hidden layers. The first layer learns sentence vectors to represent the semantics of sentences, and the second layer encodes the relations between sentences. In [60], the authors introduced a transformer-based model for emotion detection that encodes representations from a transformer and applies deep embedding to improve the quality of tweets. In [61], the authors introduced an attention-based deep method using two independent layers. By considering the temporal information flow in both directions, it retrieves both past and future contexts.
In this study, the proposed methods aim to increase accuracy with fewer parameters while remaining simpler than existing methods. The proposed methods analyze emotions at the document level and create an emotional dictionary. They are also the first methods that create an emotional dictionary through a topic modeling technique automatically and accurately, and the first that consider the words in the text and their effect on each other in a dynamic, weighted way. Table 1 compares a number of articles on emotion analysis presented in recent years in terms of method, language, and dataset. As can be seen in the method column, the combination of topic modeling and deep learning methods has recently received attention. The language column specifies the language in which each proposed method has been tested, and the dataset column gives the name of the dataset used.

Proposed Models
This study proposes two novel topic sentiment models called Weighted Joint Sentiment-Topic (WJST) and Weighted Joint Sentiment-Topic 1 (WJST1). The proposed models improve JST using two extra parameters (weight and window).

Computational Intelligence and Neuroscience
According to Figure 1, the data type of the dataset is text. Preprocessing is performed by lowercasing all words, removing stop words and words with too low or too high frequency, stemming, removing digits, and removing nonalphabetic characters such as (#, !, ...). The proposed models can be summarized as follows: (1) the Generative Model part illustrates the procedure of generating a word in a document under a topic model. (2) The Plate Notation part provides a graphical representation of the subject model (in the style of plate notation). (3) The Model Inference part uses Gibbs sampling (to perform approximate inference). In the Evaluation phase, the model's performance is evaluated using accuracy, perplexity, and topic_coherency.
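The preprocessing steps listed above can be sketched as a small pipeline. This is an illustrative sketch only: the stop-word list, the frequency thresholds, and the crude suffix-stripping stemmer are assumptions standing in for the actual choices (e.g. a Porter stemmer) made in the implementation.

```python
# Sketch of the preprocessing pipeline: lowercasing, removal of digits and
# non-alphabetic characters, stop-word and frequency filtering, and stemming.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "it", "this", "my", "has", "its"}

def simple_stem(word):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(docs, min_freq=1, max_freq=1000):
    tokenized = []
    for doc in docs:
        doc = doc.lower()
        doc = re.sub(r"[^a-z\s]", " ", doc)  # drop digits and chars like #, !
        tokens = [simple_stem(t) for t in doc.split() if t not in STOP_WORDS]
        tokenized.append(tokens)
    # Frequency filtering: drop words that are too rare or too common.
    freq = Counter(t for doc in tokenized for t in doc)
    keep = {w for w, c in freq.items() if min_freq <= c <= max_freq}
    return [[t for t in doc if t in keep] for doc in tokenized]

docs = ["My phone has a small memory!", "Its pictures quality is low 123."]
print(preprocess(docs))
# [['phone', 'small', 'memory'], ['pictur', 'quality', 'low']]
```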

Motivation.
The proposed models add two parameters to JST as latent variables. In our view, the words in a document affect their neighbors, and different words in the document may be affected simultaneously by several neighbor words. For example, in the sentence "My phone has a small memory, and its pictures quality is low," the unigram small affects the unigram memory, and the bigram small memory affects the unigrams phone and pictures. So, the unigram small affects the unigrams memory, phone, and pictures.
According to Figure 2, reviews as input text data are used for sentiment classification. The proposed models consider the effect of words on each other. They adopt Gibbs sampling to perform approximate inference of the distributions. After the sampling in the Gibbs sampling algorithm is complete, the distributions of the latent variables can be calculated. Sentiment classification at the document level is based on the probability of a sentiment label given a document.
As in the above example, a word can affect neighboring words in many sentences. So, in the proposed models, we consider the effect of words on each other using two parameters. The window parameter represents the range of a word's impact, and the weight parameter represents the strength of that effect. In the proposed models, each word has a weight, a sentiment label, and a topic and affects its neighbors as far as its window size, which means that each word has a window. For instance, as can be seen in Figure 3, word w3 has a window size of 1 and affects words w2 and w4, and w6 has a window size of 2 and affects words w4, w5, w7, and w8. If word w3 had weight h and negative sentiment, words w2 and w4 would have weight h and negative sentiment as well. Each word is affected by its neighbors, so different words in a document may be affected simultaneously by several neighbor words. Furthermore, |Q| is the number of distinct windows, |E| is the number of distinct weights, |S| is the number of distinct sentiment labels, and |Z| is the number of distinct topics. In the present study, five sets of latent variables θ, φ, π, ψ, and ξ need to be inferred. The hyperparameters α, β, γ, δ, and μ are set based on experience and can be regarded as prior observation counts before observing any actual words, where α is the Dirichlet prior for θ, β is the Dirichlet prior for φ, γ is the Dirichlet prior for π, δ is the Dirichlet prior for ψ, and μ is the Dirichlet prior for ξ. The latent parameters z, s, q, e, φ, θ, π, ξ, and ψ need to be approximated using the observed variables, where z is the topic variable, s is the sentiment variable, q is the window variable, and e is the weight variable. The proposed models demonstrate the process of generating words in documents. Furthermore, they can approximate the latent variables.
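The window mechanic of Figure 3 can be made concrete with a short sketch. This is an illustrative reading of the window and weight parameters only; positions are 0-indexed here, so Figure 3's w3 and w6 correspond to indices 2 and 5.

```python
# Sketch of the window parameter: a word at position i with window size q
# affects the words at positions i-q .. i+q, excluding itself.
def affected_positions(position, window, n_words):
    lo = max(0, position - window)
    hi = min(n_words - 1, position + window)
    return [p for p in range(lo, hi + 1) if p != position]

def word_effect(weight, window):
    # The per-word "effect" used later in the F counts: |e * (1 + 2q)|,
    # i.e. the weight times the number of positions the window touches.
    return abs(weight * (1 + 2 * window))

print(affected_positions(2, 1, 8))  # [1, 3]        -> w2 and w4
print(affected_positions(5, 2, 8))  # [3, 4, 6, 7]  -> w4, w5, w7, w8
print(word_effect(2, 1))            # 6: weight 2 over a window touching 3 positions
```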
In the present study, the main aim of the proposed topic models is to categorize sentiments at the document level.

3.2.1. The Problem We Are Trying to Solve or Improve.
Analyzing user satisfaction with various services, products, or movies is the main problem in this study, mainly reflected in users' comments. A user's comment takes the form of a text message on the Internet, which can be a tweet or a simple message on a website. So, for example, it is feasible to employ emotion mining to evaluate the opinions of various people in order to make them aware of things that have favorable reviews.

The Solution to the Problem.
Many attempts have been made to detect emotions and explore the knowledge embedded in text data. Topic modeling, as a well-known method, can obtain the concealed subjects of documents. LDA has been used as a topic model to effectively explore subjects in documents. As an LDA-based model, JST examines the impact of topics and emotions on words. The emotion parameter alone is insufficient, and additional parameters may play valuable roles in achieving better performance.
This study presents two new topic models that extend and improve JST through two new parameters and generate a sentiment dictionary. The proposed models consider the effect of words on each other, which, in our view, is an important factor and can increase the performance of the baseline methods. Several methods that use topic modeling have been introduced for text emotion analysis. However, no previous work has improved accuracy while also generating a sentiment dictionary. Different from other related studies, the proposed models improve topic-model-based sentiment classification using two parameters (weight and window). The proposed models are described step-by-step in the following sections.

The General Structure of WJST.
This subsection introduces a new model named WJST, which improves JST using two parameters (weight and window). The primary goal of WJST is to classify sentiments at the document level. A summary of the symbols used in the model is given in Table 2. The process of generating a word of a document in WJST can be outlined as follows: (1) for each document, an author first decides the distribution of sentiments. For example, if sentiments are 70% positive and 30% negative, the proposed model chooses a sentiment label from the per-document sentiment distribution; it then chooses a topic conditioned on the sentiment label, along with a weight and a window. So, WJST draws a word from the per-corpus word distribution that depends on the topic, sentiment label, weight, and window. Words with different topics may have different window sizes. For example, a word with the topic memory has a smaller window size than a word with the topic mobile because the topic mobile is more general than the topic memory and can cover it. So, the topic affects window size. Words with different topics may also have different weights. For example, the word size in the topic memory carries considerable weight because all customers like memories with larger capacity. The word size in the topic mobile is not as important as in the topic memory and has a small weight there, because some customers may like mobile phones with a small size (iPhone 6s) and some customers may like mobile phones with a large size (iPhone 6s+). So, the topic affects weight. Words with different sentiment labels may have different weights. For example, suppose the topic memory contains the two words size and cost. If the word size is positive, positive size will be more important than the word cost, and its weight will be larger than that of cost. If the word size is negative, positive cost will be more important than the word size, and its weight will be larger than that of size. Positive size means using words like large and big because customers like memories with larger capacity.
Negative size means using words like small because customers do not like memories with smaller capacity. Positive cost means using words like low and cheap because customers like low-priced memories. Negative cost means using words like high and expensive because customers do not like high-priced memories. So, the sentiment label affects weight. The proposed model in this study is parametric [63]. Furthermore, the number of topics is constant.
The generative model of WJST is demonstrated in Figure 4. The symbols Multi and Dir denote the Multinomial and Dirichlet distributions, respectively. Five sets of latent variables θ, φ, π, ψ, and ξ need to be inferred. The hyperparameters α, β, γ, δ, and μ are set based on experience and can be regarded as prior observation counts before observing any actual words. The latent parameters z, s, q, e, θ, φ, π, ψ, and ξ need to be approximated using the observed variables. The plate notation of WJST is exhibited in Figure 5. Plate notation is a method for expressing repeating variables in a graphical model. Furthermore, a probabilistic model shows the conditional dependency layout among the random variables as a graph.
According to Figure 5, the joint probability distribution for the model WJST can be factored as follows:

P(w, z, s, q, e) = P(w|z, s, q, e) × P(z|s, r) × P(s|r) × P(q|z, r) × P(e|z, s, r),   (1)

where by integrating out φ, we achieve

P(w|z, s, q, e) = ∏_{z,s,q,e} [Γ(|V|β) / Γ(β)^|V|] × [∏_w Γ(N_{w,z,s,q,e} + β)] / Γ(N_{z,s,q,e} + |V|β),   (2)

where |V| is the vocabulary size, |S| is the number of sentiment labels, |Z| is the number of topics, |Q| is the number of distinct windows, and |E| is the number of weights. The symbol N_{w,z,s,q,e} is the number of times the word w has been assigned to topic z, window q, weight e, and sentiment s. The symbol N_{z,s,q,e} is the number of words with topic z, window q, weight e, and sentiment s. The symbol β is the Dirichlet prior for φ. The symbol Γ is the gamma function. In addition, by integrating out θ, we achieve

P(z|s, r) = ∏_{r,s} [Γ(|Z|α) / Γ(α)^|Z|] × [∏_z Γ(N_{z,s,r} + α)] / Γ(N_{s,r} + |Z|α),   (3)

where |R| is the number of documents and N_{z,s,r} is the number of words with topic z and sentiment s in document r. The symbol N_{s,r} is the number of words with sentiment s in document r. The symbol α is the Dirichlet prior for θ. And by integrating out π, we achieve

P(s|r) = ∏_{r} [Γ(|S|γ) / Γ(γ)^|S|] × [∏_s Γ(F_{s,r} + γ)] / Γ(F_r + |S|γ),   (4)

where F_{s,r} is the effect of the words with sentiment s in document r, which is equal to Σ_{w∈r} |e_{w,s,r} × (1 + 2 × q_{w,s,r})|, where e_{w,s,r} is the weight of word w with sentiment s in document r and q_{w,s,r} is the window size of word w with sentiment s in document r. The symbol F_r is the sum of the effects of words with different sentiments (positive and negative) in document r, which is equal to Σ_{s∈{positive,negative}} F_{s,r}. The symbol γ is the Dirichlet prior for π. And by integrating out ξ, we achieve

P(q|z, r) = ∏_{r,z} [Γ(|Q|μ) / Γ(μ)^|Q|] × [∏_q Γ(N_{q,z,r} + μ)] / Γ(N_{z,r} + |Q|μ),   (5)

where |Q| is the number of distinct windows. The symbol N_{q,z,r} is the number of words with topic z and window q in document r. The symbol N_{z,r} is the number of words with topic z in document r. The symbol μ is the Dirichlet prior for ξ. And by integrating out ψ, we achieve

P(e|z, s, r) = ∏_{r,z,s} [Γ(|E|δ) / Γ(δ)^|E|] × [∏_e Γ(N_{e,z,s,r} + δ)] / Γ(N_{z,s,r} + |E|δ),   (6)

where |E| is the number of weights and N_{e,z,s,r} is the number of words with topic z, weight e, and sentiment s in document r. The symbol N_{z,s,r} is the number of words with sentiment s and topic z in document r.
The symbol δ is the Dirichlet prior for ψ. To estimate the parameters φ, θ, π, ξ, and ψ, we need to evaluate the above distributions. These distributions are difficult to assess directly, so we adopt Gibbs sampling to perform approximate inference. Gibbs sampling is a widely used inference technique and a popular approach for parameter estimation and inference in many topic models such as LDA [7]. The advantage of the Gibbs sampling method is that it is simple and easy to implement. In this study, Gibbs sampling is used to estimate the distributions of the latent variables. The pseudocode of the Gibbs sampling algorithm for the proposed model is given in Figure 6, and the meanings of all variables are listed in Table 2. The algorithm samples each variable (z, s, q, and e) based on the following formula, obtained by canceling terms in equations (2)-(6) (by replacing the terms in (1) with those in equations (2)-(6)):

P(z_i = z, s_i = s, q_i = q, e_i = e | w, z_{r,¬i}, s_{r,¬i}, q_{r,¬i}, e_{r,¬i}) ∝ (N_{w_i,z,s,q,e,¬i} + β) / (N_{z,s,q,e,¬i} + |V|β) × (N_{z,s,r,¬i} + α) / (N_{s,r,¬i} + |Z|α) × (F_{s,r,¬i} + γ) / (F_{r,¬i} + |S|γ) × (N_{q,z,r,¬i} + μ) / (N_{z,r,¬i} + |Q|μ) × (N_{e,z,s,r,¬i} + δ) / (N_{z,s,r,¬i} + |E|δ),   (7)

where z_{r,¬i}, s_{r,¬i}, q_{r,¬i}, and e_{r,¬i} are the topic, sentiment, window, and weight assignments for all the words in the collection except the word at position i in document r. Posterior inference of the parameters is performed using Gibbs sampling, as demonstrated in Figure 6.
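The core mechanic of each Gibbs sweep, computing an unnormalized conditional weight for every candidate assignment and drawing one, can be sketched as follows. The count tables and the four candidate assignments below are toy values standing in for the real (sentiment, topic, window, weight) combinations, not data from the model.

```python
# Sketch of the sampling step at the heart of collapsed Gibbs inference:
# draw one assignment in proportion to its unnormalized conditional weight,
# e.g. the product of count ratios in equation (7). Weights here are toy values.
import random

random.seed(0)

def sample_assignment(unnormalized):
    # Inverse-CDF sampling from an unnormalized discrete distribution.
    total = sum(unnormalized)
    r = random.random() * total
    acc = 0.0
    for idx, p in enumerate(unnormalized):
        acc += p
        if r <= acc:
            return idx
    return len(unnormalized) - 1

# Toy conditional weights for 4 joint (sentiment, topic) candidates.
weights = [0.5, 1.5, 0.25, 0.75]
draws = [sample_assignment(weights) for _ in range(10000)]

# Candidate 1 has weight 1.5 out of 3.0, so it should be drawn ~50% of the time.
print(round(draws.count(1) / 10000, 2))
```

In the full algorithm, this draw is repeated for every word in every document on each iteration, after which the count tables (and the F counts) are updated.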
In the initialization step, the method sets the parameters randomly. A sentiment dictionary is employed for initializing the sentiment labels. The sentiment dictionary contains words and scores that specify positive and negative labels and their weights. In this study, AFINN [64] is used as the sentiment dictionary, which improves the model's accuracy. At the end of the sampling algorithm, each word has a weight and a sentiment label. Therefore, a dictionary of words and sentiment scores (weights and sentiment labels) can be generated. The scores are extracted from a dataset based on P(w|s, e): for each word, the weight and sentiment with the highest probability across all documents are selected as its sentiment score. Adaptive Lexicon learning using Genetic Algorithm (ALGA) [46] uses the genetic algorithm to generate a sentiment dictionary; in WJST, we instead use topic modeling to generate this dictionary. In WJST, the window size differs across words. At each step of the sampling algorithm, count variables such as F_{s,r} and F_r are updated after sampling the sentiment label, weight, and window size. After the sampling is complete, the distributions of the latent variables (φ, θ, π, ξ, and ψ) can be calculated as follows:

φ_{w,z,s,q,e} = (N_{w,z,s,q,e} + β) / (N_{z,s,q,e} + |V|β),   (8)

θ_{z,s,r} = (N_{z,s,r} + α) / (N_{s,r} + |Z|α),   (9)

π_{s,r} = (F_{s,r} + γ) / (F_r + |S|γ),   (10)

ξ_{q,z,r} = (N_{q,z,r} + μ) / (N_{z,r} + |Q|μ),

ψ_{e,z,s,r} = (N_{e,z,s,r} + δ) / (N_{z,s,r} + |E|δ).

The probability of a word given a topic is then equal to Σ_{s,q,e} P(w|z, s, q, e), and the probability of a sentiment label given a document, used for sentiment classification at the document level, is calculated using π.
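Reading the sentiment dictionary off the completed sampler can be sketched as an argmax over assignment counts. The counts below are toy values assumed for the example; in the model they would come from the accumulated (sentiment, weight) assignments of each word across all documents.

```python
# Sketch of sentiment-dictionary extraction after sampling: for each word,
# pick the (sentiment, weight) pair it was most often assigned, mirroring
# the argmax over P(w | s, e). All counts below are toy values.
from collections import defaultdict

# counts[word][(sentiment, weight)] = number of assignments across documents
counts = defaultdict(lambda: defaultdict(int))
counts["great"][("positive", 3)] = 40
counts["great"][("negative", 1)] = 2
counts["small"][("negative", 2)] = 25
counts["small"][("positive", 1)] = 5

def extract_dictionary(counts):
    lexicon = {}
    for word, assignments in counts.items():
        sentiment, weight = max(assignments, key=assignments.get)
        lexicon[word] = (sentiment, weight)
    return lexicon

print(extract_dictionary(counts))
# {'great': ('positive', 3), 'small': ('negative', 2)}
```

This is the step where WJST produces a lexicon comparable to the one ALGA evolves with a genetic algorithm.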
The time complexity of the proposed method quantifies the amount of time taken by the Gibbs sampling algorithm, the main procedure, to run. Given the number of words in all documents w_ALL (w_ALL = Σ_{r∈R} N_r, where N_r is the number of words in document r), the number of topics |Z|, the number of distinct windows |Q|, the number of weights |E|, and the total number of sentiment labels |S|, the time complexity of each Gibbs sampling iteration is O(w_ALL·|S|·|Z|·|Q|·|E|). Furthermore, given the number of iterations G, the total time complexity of WJST is O(G·w_ALL·|S|·|Z|·|Q|·|E|). Table 3 compares different methods in terms of time complexity.

The General Structure of WJST1.
A version of WJST called WJST1 is presented in Figure 7. The distributions θ, ξ, and ψ in WJST depend on the document, but in WJST1 they do not. The dependency between documents of one domain is stronger than between documents of different domains, and a pattern in the documents of one domain may not exist in the documents of other domains. So, on multidomain datasets, the calculations should be local and not cover all domains. For example, consider the distributions P(z|s) and P(z|s, r), where z is the topic, s is the sentiment, and r is the document. In the first case, P(z|s), the topic depends only on the sentiment. This distribution covers all documents in all domains. But a topic may be positive in one domain and negative in another, so it is better to condition the topic on the documents of a domain rather than on all domains. Thus, in P(z|s, r), the topic is limited to the document (and domain), and contradictions between different domains are eliminated. So, WJST is suitable for multidomain datasets, and WJST1 is a version of WJST suitable for single-domain datasets. According to Figure 7, ξ is the probability of q given z, θ is the probability of z given s, and ψ is the probability of e given z and s, and the joint probability distribution for WJST1 can be factored as follows:

P(w, z, s, q, e) = P(w|z, s, q, e) × P(z|s) × P(s|r) × P(q|z) × P(e|z, s),   (12)

where by integrating out θ, we achieve

P(z|s) = ∏_{s} [Γ(|Z|α) / Γ(α)^|Z|] × [∏_z Γ(N_{z,s} + α)] / Γ(N_s + |Z|α),

where N_{z,s} is the number of words with topic z and sentiment s. The symbol N_s is the number of words with sentiment s. The symbol α is the Dirichlet prior for θ. And by integrating out ξ, we achieve

P(q|z) = ∏_{z} [Γ(|Q|μ) / Γ(μ)^|Q|] × [∏_q Γ(N_{q,z} + μ)] / Γ(N_z + |Q|μ),

where N_{q,z} is the number of words with topic z and window q. The symbol N_z is the number of words with topic z. The symbol μ is the Dirichlet prior for ξ. And by integrating out ψ, we achieve

P(e|z, s) = ∏_{z,s} [Γ(|E|δ) / Γ(δ)^|E|] × [∏_e Γ(N_{e,z,s} + δ)] / Γ(N_{z,s} + |E|δ),

where N_{e,z,s} is the number of words with topic z, weight e, and sentiment s. The symbol N_{z,s} is the number of words with sentiment s and topic z. The symbol δ is the Dirichlet prior for ψ. The terms P(w|z, s, q, e) and P(s|r) are calculated using equations (2) and (4), respectively. After the sampling is complete, the distributions of the latent variables (θ, ξ, and ψ) are calculated as follows:

θ_{z,s} = (N_{z,s} + α) / (N_s + |Z|α),

ξ_{q,z} = (N_{q,z} + μ) / (N_z + |Q|μ),

ψ_{e,z,s} = (N_{e,z,s} + δ) / (N_{z,s} + |E|δ).

And φ and π are computed through equations (8) and (10), respectively.

Experimental Results
The present study executes the methods on a computer with an Intel Core i7 CPU and 8 GB RAM. The proposed models are compared on 13 datasets. Four datasets crawled from Amazon (https://www.amazon.com) opinions are Electronic, Movie, Android, and Automotive. Two MDS datasets [65] contain Magazines and Sports. MR is a dataset crawled from the IMDB movie archive [3]. Three UCI datasets [66] are Amazon, Yelp, and IMDB. Three Twitter datasets [46] are STS-Test, SOMD, and Sanders. Data preprocessing consists of (1) lowercasing all words, (2) removing digits, nonalphabetic characters, stop words, and words with too low or too high frequency, and (3) stemming. The details of the datasets are provided in Table 4. The number of topics is unknown and is provided as a constant value at the beginning of the Gibbs sampling algorithm. In this study, the Dirichlet distributions with parameters α, γ, β, and δ are symmetric, and we set the parameter values empirically; this setting demonstrates fairly good performance in our experiments. Table 5 exhibits the initialization of the parameters used in the different algorithms.
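The three preprocessing steps above can be sketched as follows; the stop-word list and the suffix-stripping "stemmer" are toy stand-ins for illustration, not the tools used in the paper.

```python
import re

# Sketch of the preprocessing pipeline: (1) lowercase, (2) remove digits
# and nonalphabetic characters and stop words, (3) stem. The stop-word
# list and the crude suffix-stripping stemmer are assumptions.
STOP_WORDS = {"the", "was", "a", "of", "and", "is"}

def crude_stem(word):
    """Strip a few common English suffixes (toy Porter-style rule)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                        # (1) lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # (2) drop digits / nonalphabetic
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]     # (3) stem

tokens = preprocess("The last season was terrible!")
```

Frequency-based filtering (too rare or too frequent words) would be applied afterwards over the whole corpus, since it needs global counts.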
A sentiment dictionary is employed for initializing the sentiment labels. Sentiment dictionaries such as AFINN [64], IMDB [67], 8-K [67], and Bing Liu [68, 69] contain words and scores that specify positive and negative labels as well as their weights. In the present study, AFINN is used as the sentiment dictionary, which improves the model's accuracy. Sentiment detection at the document level, perplexity, and topic_coherency are used to compare the efficacy of the proposed models; these three standard measures are used in many papers [7, 70-73].
In the present study, the accuracy is computed as (TP + TN) / (TP + FP + TN + FN), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
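The accuracy formula can be sketched directly from the confusion-matrix counts; the +1/-1 label encoding below is an illustrative assumption.

```python
# Minimal sketch of accuracy = (TP + TN) / (TP + FP + TN + FN), computed
# from true and predicted document labels (+1 / -1 encoding is assumed).
def accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == +1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == +1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == -1)
    return (tp + tn) / (tp + fp + tn + fn)

# 3 of 4 documents classified correctly -> accuracy 0.75, error 0.25
acc = accuracy([+1, -1, +1, -1], [+1, -1, -1, -1])
```

The error discussed later in this section is then simply 1 − accuracy.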
The π distribution (equation (10)) determines how likely each comment is to be positive or negative; for example, if the value of P(+) is greater than the value of P(−) for a comment, the comment is classified as positive. The accuracy formula uses the π distribution (equation (10)) to calculate TP, TN, FP, and FN. Thus, sentiment analysis (sentiment detection) at the document level is realized using the π distribution (equation (10)), and the formula (TP + TN) / (TP + FP + TN + FN) is used to compute the accuracy; the error is 1 − accuracy. Accuracy, perplexity, and topic_coherency are used for the evaluations in the present study. Future research can investigate additional metrics such as MSE, MAE, and RMSE.
Furthermore, better methods have lower perplexity and higher topic_coherency. Given a test dataset D_Test, the perplexity is computed through

Perplexity(D_Test) = exp(−(Σ_r log P(w_r)) / (Σ_r N_r)), (17)

where w_r are the words in document r, N_r is the length of document r, and P(w_r) is the probability of the words in document r. A lower value of this formula over a held-out document demonstrates better generalization efficacy. The evaluation results are shown in Tables 6-14, and the proposed models demonstrate better results: the perplexity of the proposed methods is lower than that of the baseline models. As Tables 9-12 report, the perplexity is reduced as the number of topics increases. Topic_coherency is calculated using

C(V^(z_i)) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log((CODF(v_m^(z_i), v_l^(z_i)) + 1) / DF(v_l^(z_i))), topic_coherency = (1/|Z|) Σ_{z_i∈Z} C(V^(z_i)), (18)

where V^(z_i) = (v_1^(z_i), ..., v_M^(z_i)) is the list of the M words that have the highest probability in topic z_i, C(V^(z_i)) is the topic_coherency of topic z_i, Z is the set of all topics, |Z| is the number of distinct topics, DF is the document frequency, and CODF is the co-occurrence frequency of two words across documents. A smoothing count of 1 is included to avoid taking the logarithm of zero. In the present study, topic_coherency is computed through (18), that is, as the average of the topic_coherency values over Z. A higher value of topic_coherency reflects a better quality of the detected topics. M is set to 10, and results are demonstrated in Tables 9-14. Different numbers of topics (5, 10, 15, and 20) and of distinct windows (1, 2, 3, 4, 5, and 6) are applied for evaluating the models. In this part, the baseline methods are JST [8], RJST [10], TJST [8], and TS [9]. In the present section, the Friedman test [74, 75] is used to examine the achievements of the comparison methods. The Friedman test is a nonparametric multiple comparison test utilized to examine the differences between algorithms by assigning the lowest rank to the best approach in minimization problems and the highest rank to the best approach in maximization problems.
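The two metrics can be sketched in a few lines, assuming the standard normalized log-likelihood perplexity and a UMass-style coherence with the smoothing count of 1 mentioned above; the document representation (sets of words) is illustrative.

```python
import math

# Sketches of the two evaluation metrics: log-likelihood perplexity and a
# UMass-style topic coherence with +1 smoothing (illustrative structures).

def perplexity(log_probs, doc_lengths):
    """log_probs[r] = log P(w_r); doc_lengths[r] = N_r."""
    return math.exp(-sum(log_probs) / sum(doc_lengths))

def coherence(top_words, docs):
    """Coherence C for one topic's list of top words over a set of docs."""
    def df(w):                      # document frequency
        return sum(1 for d in docs if w in d)
    def codf(w1, w2):               # co-occurrence document frequency
        return sum(1 for d in docs if w1 in d and w2 in d)
    return sum(
        math.log((codf(top_words[m], top_words[l]) + 1) / df(top_words[l]))
        for m in range(1, len(top_words))
        for l in range(m)
    )

docs = [{"screen", "battery"}, {"battery", "charge"}, {"screen", "charge"}]
c = coherence(["battery", "charge"], docs)
```

The overall topic_coherency of a model would then be the average of `coherence` over all topics, as in equation (18).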
There are several methods for the validation of classification and topic modeling-based problems, but the methods used in this study are the most common and are used in most articles related to ours. There are also other validation methods that we will try to use in future studies to evaluate the proposed methods. The validation methods used in this study were chosen for the following reasons. We chose accuracy, perplexity, and coherence score as evaluation metrics because of their popularity in classification and topic modeling problems. Perplexity is an essential metric that, in theory, represents how well a model behaves on unseen data and is obtained from the normalized log-likelihood. Meanwhile, the coherence score measures the degree of semantic similarity between high-scoring words and helps assess the semantic interpretability of topics based on statistical inference. The main question we want to answer is whether the proposed methods can improve the performance of text sentiment classification.
This study compares the proposed methods with different baselines, including JST and recent representative approaches. Consider a confusion matrix for a classification problem that predicts whether a comment carries positive sentiment or not. The total number of correctly detected cases is one of the most obvious measures; it is typically utilized when all of the classes are equally important, and accuracy is employed when true positives and true negatives matter most. Accuracy, the proportion of correctly predicted cases to all cases, is the most popular measurement for classification problems, and it immediately indicates whether the model is adequately trained and how it works in general. Its opposite, the error, can be calculated as 1 − accuracy. In machine learning, accuracy is an excellent option for sentiment classification when the classes in the dataset are almost evenly distributed. We will also try to use other metrics, such as recall and precision, in future studies to evaluate the proposed methods.
We use the Friedman test to compare the results produced by the proposed methods and the competitors and to verify the classification performance. The Friedman test is a nonparametric multiple comparison test that explores the differences between algorithms by assigning the lowest rank to the best approach in minimization problems and the highest rank to the best approach in maximization problems.
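The Friedman statistic used throughout the evaluations can be sketched in a few lines; this simplified version ignores tied scores (an assumption) and is not the exact procedure of a statistics package.

```python
# Pure-Python sketch of the Friedman statistic: rank the k algorithms on
# each of n datasets (rank 1 = best for minimization metrics such as
# perplexity; invert the scores for maximization metrics), then compare
# the rank sums. Ties are ignored for simplicity.
def friedman_statistic(scores):
    """scores[i][j] = metric of algorithm j on dataset i (lower = rank 1)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(R * R for R in rank_sums) \
           - 3.0 * n * (k + 1)
    return chi2

# three algorithms on four datasets; algorithm 0 always ranks first
perplexities = [[10, 20, 30], [11, 21, 31], [9, 19, 29], [12, 22, 32]]
chi2 = friedman_statistic(perplexities)
```

With a perfectly consistent ordering, the statistic reaches its maximum n(k − 1); the p-values reported in this paper would come from comparing the statistic to a χ² distribution with k − 1 degrees of freedom.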
Topic modeling is one of the most important NLP fields. It aims to explain a textual dataset by decomposing it into two distributions: topics and words. The appeal of quantitative metrics such as perplexity is the ability to standardize, automate, and scale the evaluation of topic models; in natural language processing, perplexity is a traditional metric for evaluating topic models.
The lower the value of this formula over a held-out document, the better the generalization efficacy.
Perplexity's inability to capture context and the relationships between words within a topic, or across topics within a document, is one of its drawbacks. Semantic context is important for human understanding. Approaches such as topic coherency have been designed to tackle this problem by capturing the context between the words of a topic. Extracting topic words is one of the main tasks in topic modeling. In most articles about topic modeling, topic_coherency is reported as a number that represents the overall interpretability of the topics and is used to assess their quality. The higher the topic_coherency value, the better the quality of the extracted topics.

Sentiment Scores for the Words in a Dataset.
In this section, a dictionary is generated, including sentiment scores (weights and sentiment labels) and words. The scores are extracted from the datasets based on P(w|s, e): for each word, the weight and sentiment with the highest probability are selected as its sentiment score. The extracted scores for some unigram phrases can be seen in Tables 15 and 16. ALGA [46] uses a genetic algorithm to generate a sentiment dictionary, whereas the proposed models use topic modeling to create this dictionary. In Tables 15 and 16, ten words from each dataset are selected and scored by the proposed models. For example, the word nice obtains a score of 4 in WJST and a score of 5 in WJST1. The scores differ between the proposed methods; for example, the word serious achieves a score of 1 in WJST and a score of −2 in WJST1. Table 15 is related to the Android, Automotive, Electronic, and Movie datasets, and Table 16 to the STS, Sanders, and SOMD datasets.

The topics are extracted from the datasets based on P(w|z) in this section. A topic is a multinomial distribution over words based on topics, sentiments, weights, and window sizes, and its top words approximately reflect its meaning. Tables 17-19 show examples of topics extracted from the Movie, Android, and Electronic datasets by the different models. Each row shows the top 10 words for the corresponding topic and sentiment label; these top 10 words were extracted from each topic and then used for topic_coherency. Extracting topic words is one of the main tasks in topic modeling. The listed words describe their topic, and the words listed for the proposed methods achieve higher topic_coherency values than those of the baseline methods. The higher the topic_coherency value, the better the quality of the extracted topics.
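The dictionary-extraction step above can be sketched as follows; the probability table is a toy example, not values learned from the paper's datasets.

```python
# Sketch of extracting a weighted sentiment dictionary from the learned
# model: for each word, pick the (sentiment, weight) pair with the highest
# estimated probability under P(w | s, e). The table below is a toy example.
def build_dictionary(p_w_given_se):
    """p_w_given_se[word][(sentiment, weight)] = estimated probability."""
    lexicon = {}
    for word, table in p_w_given_se.items():
        sentiment, weight = max(table, key=table.get)
        lexicon[word] = (sentiment, weight)
    return lexicon

toy = {
    "nice":    {("+", 4): 0.7, ("+", 5): 0.2, ("-", 1): 0.1},
    "serious": {("+", 1): 0.4, ("-", 2): 0.5, ("-", 1): 0.1},
}
lex = build_dictionary(toy)
```

The resulting lexicon maps each word to one sentiment label and one weight, which is the form in which Tables 15 and 16 report the scores.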

Sentiment Classification at Document-Level.
In this section, the number of distinct windows is three, and the models use the AFINN sentiment dictionary in the initialization part of the Gibbs sampling algorithm. A document is classified based on P(s|r), the probability of a sentiment given a document: a document is classified as negative if P(+|r) < P(−|r) and vice versa. The sentiment is calculated using two formulas in this paper. In the first formula, P(s|r) = N_{s,r}/N_r, where N_{s,r} is the number of words with sentiment s in document r and N_r is the number of words in document r. In the second formula, P(s|r) = F_{s,r}/F_r, where F_{s,r} is the effect of the words with sentiment s in document r and F_r is the sum of the effects of the words with all sentiments in document r. In all evaluations, accuracy1 is calculated based on the first formula and accuracy2 based on the second formula. As shown in Figure 8, the same document can be negative according to the first formula but positive according to the second: although the number of negative words is larger, the weight of the positive words is greater, so the positive words dominate the sentiment analysis at the document level.
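The two document-level formulas can be sketched as follows; representing each word as a (sentiment, weight) pair and taking the effect F_{s,r} to be the sum of the weights carrying sentiment s is our illustrative assumption.

```python
# Sketch of the two document-level classification formulas. Each word is a
# (sentiment, weight) pair; the "effect" of a sentiment is assumed here to
# be the sum of the weights of its words.
def classify(words):
    n_pos = sum(1 for s, _ in words if s == "+")
    n_neg = sum(1 for s, _ in words if s == "-")
    f_pos = sum(w for s, w in words if s == "+")
    f_neg = sum(w for s, w in words if s == "-")
    by_count  = "+" if n_pos > n_neg else "-"   # P(s|r) = N_{s,r} / N_r
    by_effect = "+" if f_pos > f_neg else "-"   # P(s|r) = F_{s,r} / F_r
    return by_count, by_effect

# two strong positive words vs three weak negative ones: the formulas disagree,
# as in the Figure 8 example (count says "-", effect says "+")
doc = [("+", 5), ("+", 4), ("-", 1), ("-", 1), ("-", 1)]
labels = classify(doc)
```

This is exactly the situation the Figure 8 discussion describes: the first formula follows the majority count, while the second lets a few heavily weighted words flip the label.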
In this section, the best values for each method (the highest accuracy, the lowest perplexity, and the highest topic_coherency) are selected from Tables 9-12 and listed in Tables 6 and 7. Table 6 compares the models on four datasets (Android, Automotive, Movie, and Electronic), and Table 7 compares them on six datasets (Magazine, Sports, MR, Amazon, IMDB, and Yelp). Tables 6 and 7 are evaluated on unigram words. The AFINN method classifies each document according to P(s|r) = N_{s,r}/N_r, where the word sentiment label is obtained directly from the AFINN sentiment lexicon. The RND method classifies each document according to P(s|r) = N_{s,r}/N_r, where the word sentiment label is determined randomly, and the AFINN + RND method uses both the AFINN and RND methods. The improvement over these methods reflects how much the proposed and baseline methods can learn from a dataset. Tables 6 and 7 show that the proposed models perform better than JST. Based on the results, the proposed methods show a significant improvement over AFINN and the baseline methods on all datasets. As the AFINN-based results show, the accuracies calculated solely from the sentiment lexicon are below 70% for most datasets. In this study, perplexity and topic_coherency are not calculated for the AFINN, RND, and AFINN + RND methods. The TS and RJST methods have lower accuracy than the other methods on all datasets, but JST and TJST achieve better performance. TJST outperforms JST on all datasets because, in JST, the distribution θ depends on the document, whereas in TJST it does not and is estimated globally from all documents. According to Tables 6 and 7, WJST1 has higher accuracy than the other methods.
WJST1 outperforms WJST because, in WJST, the distributions θ, ξ, and ψ depend on the document, but in WJST1 they do not and are estimated globally from all documents. The perplexity value varies across datasets because the sizes of the datasets differ (Table 4). The analysis of the Friedman test on the results of Tables 6 and 7 demonstrates that there is a statistically significant difference between the performances of the algorithms in terms of accuracy with χ²(10) = 92.091 and p < 0.01 and in terms of perplexity with χ²(5) = 38.629 and p < 0.01, but not in terms of topic_coherency with χ²(5) = 5.508 and p > 0.1. The mean ranks of the algorithms based on the Friedman test, demonstrated in Figure 9, indicate that WJST1 ranks first among all the algorithms in Tables 6 and 7 in terms of accuracy and topic_coherency. In Figure 9, if the experiment intends to find the minimum value (perplexity), the Friedman test assigns the lowest rank to the best-performing algorithm; if it intends to find the maximum value (accuracy and topic_coherency), the Friedman test assigns the highest rank to the best-performing algorithm.
In Figure 9, -1 denotes accuracy1 and -2 denotes accuracy2. As shown in Figure 10, the average values of accuracy, perplexity, and topic_coherency are equal to the average values in each column of Tables 6 and 7 for each method, calculated on the Android, Automotive, Electronic, Movie, Magazine, Sport, MR, Amazon, IMDB, and Yelp datasets. According to the results, WJST has a lower perplexity value than the other methods, while WJST1 outperforms WJST and the baseline methods in terms of accuracy and topic_coherency. In Figure 10, -1 denotes accuracy1 and -2 denotes accuracy2.

Evaluation Results According to the Different Situations, with AFINN and NO_AFINN States.
In this section, the study examines the impact of the AFINN dictionary in the initialization part of Gibbs sampling on the proposed models. The evaluation results are shown in Table 8. In this section, the number of distinct windows is three and the number of topics is ten. The strongest effect is visible for WJST1 on the Movie dataset, where the accuracy is 0.48 in the NO_AFINN state and 0.95 in the AFINN state. Prior sentiment information affects perplexity and topic_coherency less than accuracy. Table 8 shows that using the AFINN dictionary is more effective than the NO_AFINN state, in which no prior sentiment information was incorporated into the models for sentiment words in the initialization part of the Gibbs sampling algorithm.

Evaluation Results According to the Different Sentiment Dictionaries.

In this subsection, the study compares the different dictionaries produced by the proposed models. The output of WJST and WJST1 can be a weighted sentiment dictionary; using the dictionary obtained by each method on each dataset, the other datasets are evaluated. Each document is classified according to P(s|r), where the word sentiment label is obtained from the corresponding lexicon, and the evaluation results are shown in Tables 20 and 21. The methods AFINN + w, Android + w, ELEC + w, Auto + w, and MOV + w classify each document according to P(s|r) = F_{s,r}/F_r, where the weight and sentiment label are obtained directly from the AFINN, Android, Electronic, Automotive, and Movie lexicons, respectively, and the window size is considered to be one for all words in all documents. The methods Bing_Liu, 8K, Android, Automotive, ELEC, MOV, and IMDB classify each document according to P(s|r) = N_{s,r}/N_r, where the word sentiment label is obtained directly from the Bing_Liu, 8K, Android, Automotive, Electronic, Movie, and IMDB lexicons, respectively. Similarly, the Bing_Liu + RND, IMDB + RND, 8K + RND, Android + RND, Auto + RND, ELEC + RND, and MOV + RND methods combine the corresponding lexicon with the RND method. According to Table 20, the AFINN method achieves the highest accuracy on one dataset, while the proposed methods achieve the highest accuracy on six datasets; in Table 21, the proposed methods achieve the highest accuracy on seven datasets. In the AFINN, Bing Liu, IMDB, and 8-K dictionaries, the sentiment and score values are set manually for each word, whereas the proposed models use topic modeling to generate the dictionaries. Dictionary-based methods such as MOV and ELEC perform well on the datasets from which they were created.
The dictionaries produced by the proposed models depend on the application domain. The analysis of the Friedman test on the results of Table 20 demonstrates that there is a statistically significant difference between the performances of the competitors in terms of accuracy with χ²(27) = 70.070 and p < 0.01. The analysis also shows a statistically significant difference between the performances of the algorithms in Table 21 in terms of accuracy with χ²(27) = 79.740 and p < 0.01. The mean ranks of the algorithms can be seen in Figures 11 and 12.

Figure 9: The mean ranks of the algorithms in Tables 6 and 7 according to the Friedman test.

Evaluation Results According to the Different Number of Topics.

In this subsection, the proposed models are examined with different numbers of topics (5, 10, 15, and 20). The AFINN dictionary is utilized in the methods, and the number of distinct windows is three. The evaluation results are shown in Tables 9-12, and the proposed methods outperform the baseline methods. The results show that increasing the number of topics decreases the perplexity value. WJST1 achieves its highest accuracy on the Movie dataset with 97 percent, whereas the highest accuracy of WJST on the Movie dataset is 84 percent. The results show that the proposed methods perform well across different topic-number settings, especially WJST1 with 97% accuracy at |Z| = 5 on the Movie dataset. Based on the results, WJST has a lower perplexity than the other methods, and WJST1 outperforms WJST and the baseline methods in terms of accuracy and topic_coherency. TJST performs better than WJST in terms of accuracy, but WJST achieves higher accuracy than JST and the other baseline methods. This observation shows that modeling the weight and window parameters improves sentiment classification at the document level. According to (18), a lower topic_coherency value indicates that the retrieved topics are of worse quality than those with a higher topic_coherency; the words of a high-quality topic accurately describe the topic and have a stronger association with one another.

Evaluation Results According to the Different Number of Distinct Windows.
In this subsection, the proposed models are evaluated with different numbers of distinct windows (1, 2, 3, 4, 5, and 6), which are effective for improving the proposed models. In this experiment, the number of topics is five, and the models use the AFINN sentiment dictionary. In Table 13, the proposed methods are compared according to accuracy, perplexity, and topic_coherency. The results show that increasing the number of distinct windows decreases the perplexity value. As Table 13 reports, increasing the size of a window reduces the accuracy because it increases the number of words in the window, and each term may not affect all neighbors in its window.
For instance, as shown in Figure 13, the word terrible has a window size of 3. Table 13 assumes that each word affects all neighbors in its window, so it takes the unigram terrible to affect the unigrams film, last, season, sophie, best, and actress. As Figure 13 shows, however, the word terrible can affect the unigrams film, last, and season, but not the unigrams sophie, best, and actress. Thus, finding the words that can be involved in a window is a new challenge that we introduce in this study, and two methods are presented: the first method assumes that each word affects all neighbors in its window, and the second assumes that each word affects some random neighbors in its window. In other words, the first method selects all neighbors, whereas the second selects some neighbors randomly. In this study, all evaluations are calculated based on the first method; the second method is considered in the evaluation of Table 14.
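The two neighbor-selection methods can be sketched as follows; the tokenized sentence mirrors the Figure 13 example, and the rule for how many random neighbors to keep is our assumption.

```python
import random

# Sketch of the two neighbor-selection methods for a word at position i
# with window size w: the first takes every neighbor inside the window,
# the second a random subset of them (subset size chosen at random here,
# which is an assumption).
def all_neighbors(tokens, i, w):
    left = tokens[max(0, i - w): i]
    right = tokens[i + 1: i + 1 + w]
    return left + right

def random_neighbors(tokens, i, w, rng):
    candidates = all_neighbors(tokens, i, w)
    k = rng.randint(1, len(candidates))        # how many neighbors to keep
    return rng.sample(candidates, k)

tokens = ["film", "last", "season", "terrible", "sophie", "best", "actress"]
full = all_neighbors(tokens, tokens.index("terrible"), 3)
# full == ["film", "last", "season", "sophie", "best", "actress"]
```

The first method corresponds to the evaluations of Table 13 and the second to those of Table 14.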
As shown in Table 13, the accuracy, perplexity, and topic_coherency values are calculated on the Android, Automotive, Electronic, and Movie datasets before random selection, using the first method. As shown in Table 14, the same values are calculated on these datasets after random selection, using the second method. So, Table 13 assumes that each word affects all neighbors in its window, while Table 14 assumes that each word affects some random neighbors in its window. In Tables 13 and 14, the average values of accuracy, perplexity, and topic_coherency are equal to the average values in each column for each window size, calculated on the Android, Automotive, Electronic, and Movie datasets. According to the results, the second method is more stable than the first in terms of accuracy, but the first method achieves higher accuracy. The second method outperforms the first in terms of perplexity, while the first performs better in terms of accuracy and topic_coherency. The analysis of the Friedman test on the results of Table 13 demonstrates that there is a statistically significant difference between the performances of the algorithms in terms of accuracy with χ²(5) = 27.608 and p < 0.01 and in terms of perplexity with χ²(5) = 35.143 and p < 0.01, but not in terms of topic_coherency with χ²(5) = 6.232 and p = 0.284. The mean ranks based on the Friedman test, demonstrated in Figure 14, indicate that (w = 1) outperforms the other windows in terms of accuracy but performs worse in terms of perplexity and topic_coherency, whereas (w = 6) outperforms the other windows in perplexity and topic_coherency but has a lower accuracy value than the other windows.
(w = 3) provides a balanced setting for the proposed algorithms, in which the accuracy, topic_coherency, and perplexity values lie between the highest and lowest values obtained with the other windows (w = 1, 2, 4, 5, and 6). Therefore, in this study, all evaluations are calculated with (w = 3). The mean ranks based on Table 14, demonstrated in Figure 15, indicate that (w = 1) ranks first among all settings in Table 14 in terms of accuracy. The analysis of the Friedman test indicates a statistically significant difference in terms of accuracy with χ²(5) = 17.550 and p < 0.01 and in terms of perplexity with χ²(5) = 37.857 and p < 0.01, but not in terms of topic_coherency with χ²(5) = 5.857 and p = 0.320.

Sentiment Classification Using Proposed Methods in Comparison to ALGA.

In this subsection, WJST and WJST1 are compared to ALGA [46]. Three datasets have been selected for evaluating the methods. The evaluation results are shown in Figure 16, which compares the models according to the accuracy, perplexity, and topic_coherency metrics. In ALGA [46], several sentiment lexicons are created for a dataset during the training stage using a genetic algorithm, and these dictionaries are employed during the testing process. Every dictionary contains words and scores. In the genetic algorithm employed by the method, each chromosome is represented as a vector of sentiment words and their scores, and the scores are spread between a sentiment word's minimum and maximum scores. The primary goal of ALGA is to create a dictionary that reduces errors on the training datasets. The sum of the scores of the words of each instance T_i in dataset D_m under dictionary L_k is calculated using equation (19) and treated as a feature [46]:

ALGA(T_i) = Σ_{W_j∈T_i} v_k(W_j), (19)

where W_j represents the words of T_i and v_k(W_j) is the score of W_j in L_k; the ALGA value of each instance is thus calculated by looking up the values of its words in the dictionaries (chromosomes) and adding them together. As mentioned in [46], ALGA predicts a positive instance when the ALGA feature is positive and a negative instance when it is negative. ALGA's accuracy is calculated by dividing the number of correctly predicted instances of a given dataset by the total number of instances. In this subsection, the proposed methods are compared with ALGA [46] because it can automatically generate a sentiment dictionary: ALGA generates the dictionary using a genetic algorithm, whereas the proposed methods generate it using topic modeling. In the proposed models, each document is classified based on P(s|r), the probability of a sentiment label given a document.
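The ALGA feature of equation (19) and its sign-based prediction rule can be sketched as follows; treating words absent from the dictionary as contributing 0 is our assumption.

```python
# Sketch of the ALGA feature (equation (19)): the score of an instance is
# the sum of v_k(W_j) over its words W_j under dictionary L_k. Words not
# in the dictionary are assumed to contribute 0.
def alga_feature(words, lexicon):
    return sum(lexicon.get(w, 0) for w in words)

def alga_predict(words, lexicon):
    """Positive instance when the ALGA feature is positive, else negative."""
    return "+" if alga_feature(words, lexicon) > 0 else "-"

lexicon = {"nice": 4, "terrible": -3, "serious": 1}
score = alga_feature(["nice", "terrible"], lexicon)   # 4 + (-3) = 1
```

Accuracy is then the fraction of instances whose predicted sign matches the true label, as described above.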
In the proposed models, two labels (+, −) are considered, and a document is classified as negative if P(+|r) < P(−|r) and vice versa. The evaluation results can be seen in Figure 16. In this subsection, the number of distinct windows is three, the number of topics is five, and the models use the AFINN sentiment dictionary. In Figures 16(a)-16(c), each column compares the different methods on a dataset; the details of the datasets used in this section are given in Table 4. In this subsection, only accuracy is considered for the evaluation of ALGA. The ALGA-SW value is achieved by executing ALGA without taking stop words into account. According to the results, WJST has higher accuracy than TJST on all datasets, and WJST1 outperforms WJST and TJST on all datasets. ALGA and ALGA-SW perform better than the other methods in terms of accuracy, but WJST1 achieves higher accuracy than ALGA and ALGA-SW on the STS and Sanders datasets. The RND method achieves the lowest accuracy on the Sanders and STS datasets. The AFINN + RND method has higher accuracy than the RND method but lower accuracy than the AFINN method on the Sanders and STS datasets, while the RND method outperforms the TJST, AFINN, and AFINN + RND methods on the SOMD dataset. In this study, perplexity and topic_coherency are not calculated for the ALGA, ALGA-SW, AFINN, RND, and AFINN + RND methods. In Figure 16, -1 denotes accuracy1 and -2 denotes accuracy2.

Sentiment Classification Using Proposed Methods on Multidomain Datasets.

In this subsection, the performance of the proposed methods is compared with that of the baseline methods on a multidomain dataset. In this experiment, the numbers of distinct windows and topics are three and five, respectively, and the models use the AFINN sentiment dictionary. The multidomain dataset contains reviews taken from multiple domains (product types); its details are illustrated in Table 22.
As shown in Figure 17, the accuracy, perplexity, and topic_coherency values are calculated on a multidomain dataset that contains the Android, Automotive, Electronic, and Movie domains. The methods WJST-dictionary and WJST1-dictionary classify each document according to P(s|r) = N_{s,r}/N_r, where the word sentiment label is obtained directly from the WJST and WJST1 lexicons produced by WJST and WJST1 on the multidomain dataset. In Figure 17, -1 denotes accuracy1 and -2 denotes accuracy2. Based on the results, WJST has a lower perplexity value than the other methods, and WJST1 outperforms WJST in terms of topic_coherency. WJST performs better than WJST1 in terms of accuracy because the distributions θ, ξ, and ψ in WJST depend on the document, but in WJST1 they do not. The dependency between documents of the same domain is stronger than that between documents of different domains, and a pattern in the documents of one domain may not exist in the documents of other domains. Therefore, calculations on multidomain datasets should be local and not cover all domains. For example, consider the distributions P(z|s) and P(z|s, r), where z is the topic, s is the sentiment, and r is the document. In the first state (P(z|s)), the topic depends only on the sentiment, and the distribution covers all documents in different domains; a topic might be positive in one domain and negative in another. Therefore, it is better to condition the topic on the documents of a single domain, not all domains. Thus, the topic is limited to the document (and domain), and the contradiction between different domains is eliminated. Therefore, WJST is suitable for multidomain datasets, and WJST1 is a version of WJST suitable for single-domain datasets. Sentiment classification on multidomain datasets is a challenge, and our solution in this study is to use WJST, whose distributions (θ, ξ, and ψ) depend on the document; further studies can investigate this problem in future research.

Figure 13: An example sentence before preprocessing ("Game-of-thrones was the best film. The last season was terrible. Sophie was the best actress.") and after preprocessing ("game-of-thrones was the best film last actress best sophie terrible season").

Figure 14: The mean ranks of the window settings in Table 13 according to the Friedman test. Figure 15: The mean ranks of the window settings in Table 14 according to the Friedman test.

Comparison with Discriminative Models.
The proposed methods are compared with discriminative baselines such as logistic regression and SVM on four datasets in the following experiment.
The multidomain dataset contains the Android, Automotive, Electronic, and Movie domains. As shown in Table 24, the accuracy is calculated on four datasets. The results demonstrate that the proposed methods improve notably over AFINN and the baseline methods on all datasets. In Table 24, the methods use two feature systems in the preprocessing phase: Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF). In the BOW system, a higher word frequency reflects a higher importance of the word, whereas the TF-IDF system assumes that a high frequency may not provide much information and gives rare words more weight. According to the evaluation results, the TF-IDF system performs better than the BOW system.
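The difference between the two feature systems can be illustrated with a toy example; the unsmoothed idf = log(N/df) used here is a simplification, not necessarily the exact variant used in the experiments.

```python
import math

# Toy illustration of the two feature systems: Bag of Words counts raw
# frequency, while TF-IDF down-weights words that appear in many documents
# (unsmoothed idf = log(N / df) is assumed for simplicity).
def bow(doc):
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    return counts

def tf_idf(doc, corpus):
    n = len(corpus)
    counts = bow(doc)
    return {
        w: tf * math.log(n / sum(1 for d in corpus if w in d))
        for w, tf in counts.items()
    }

corpus = [["screen", "good"], ["battery", "good"], ["good", "bad"]]
weights = tf_idf(corpus[0], corpus)
# "good" appears in every document, so its TF-IDF weight drops to zero,
# while the rarer "screen" keeps a positive weight
```

This is the intuition behind the observation above: words that occur everywhere carry little discriminative information, which is why TF-IDF features tend to outperform raw BOW counts.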

Comparison with JST According to Extended Features.
To prepare prior sentiment information, the sentiment label of a unigram is identified as follows: if the unigram occurs in the sentiment lexicon, its polarity is set equal to the subjectivity recorded in the lexicon. The following technique decides the sentiment label of a bigram: if both words of the bigram have the same polarity, the bigram takes that polarity; if only one of the words is in the lexicon, the bigram's polarity equals that word's subjectivity in the lexicon; and if the first word is 'not,' the bigram's polarity is the opposite of the second word's polarity. The sentiment label of a trigram is decided analogously: if the words of the trigram have the same polarity, the trigram takes that polarity; if one of them is in the lexicon, the trigram's polarity equals the lexicon's subjectivity; and if the first or second word is 'not,' the trigram's polarity is the opposite of the polarity of the second or third word, respectively. In the following experiment, the proposed methods are compared with JST on four single-domain datasets using extended features (bigrams and trigrams). As shown in Table 25, the accuracy value is calculated on four datasets, and the experiment extends the features to bigrams and trigrams. According to the results, WJST1 outperforms WJST. The proposed models also outperform JST because the additional parameters appropriately influence the process of producing words in a review. The perplexity value varies across datasets because the datasets differ in size.
As the number of grams increases, perplexity increases because, in addition to the unigrams, the bigrams (and then trigrams) are added to the data, which enlarges the dataset. Accuracy generally improves as the number of grams increases, although in some cases it worsens because higher-order grams (bigrams or trigrams) are sometimes meaningless.
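The n-gram prior-polarity rules above can be sketched as follows. This is a toy reading of the rules, not the paper's code: the paper uses a sentiment lexicon such as AFINN, whereas the entries and tie-breaking for conflicting polarities here are illustrative assumptions.

```python
# Illustrative lexicon; +1 = positive subjectivity, -1 = negative.
LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def unigram_polarity(word):
    return LEXICON.get(word, 0)            # 0 means neutral / not in lexicon

def bigram_polarity(w1, w2):
    if w1 == "not":                        # negation: oppose the second word
        return -unigram_polarity(w2)
    p1, p2 = unigram_polarity(w1), unigram_polarity(w2)
    if p1 == p2:                           # both words agree
        return p1
    return p1 or p2                        # fall back to a word in the lexicon

def trigram_polarity(w1, w2, w3):
    if w1 == "not":                        # oppose the next sentiment word
        return -(unigram_polarity(w2) or unigram_polarity(w3))
    if w2 == "not":                        # oppose the third word
        return -unigram_polarity(w3)
    pols = [unigram_polarity(w) for w in (w1, w2, w3)]
    if len(set(pols)) == 1:                # all three agree
        return pols[0]
    return next((p for p in pols if p != 0), 0)
```

For instance, the bigram "not good" is labeled negative, while the trigram "was not terrible" is labeled positive.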

Discussions on the Limitations of the Proposed Methods.
Although the analysis of the evaluation results demonstrates the strong performance of the proposed methods, they have some limitations, as follows: (1) The first limitation is the time complexity of the proposed methods, O(G·w_ALL·|S|·|Z|·|Q|·|E|), which is higher than that of the baseline methods, O(G·w_ALL·|S|·|Z|), according to Table 3 in Section 3.3. (2) The second limitation is the window size. As reported in Table 13, increasing the size of a window decreases the accuracy because it increases the number of words in the window, and each term may not affect all neighbors in its window. Therefore, finding the words that can be involved in a window is a new challenge that we introduce in this study, and two methods are presented in Section 4.7. The first method assumes that each word affects all neighbors in its window, whereas the second method assumes that each word affects some randomly chosen neighbors in its window; thus, the first method selects all neighbors, while the second selects some neighbors randomly. According to the results, the second method is more stable than the first in terms of accuracy and outperforms it in terms of perplexity, but the first method performs better in terms of accuracy and topic_coherency.
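The two neighbor-selection strategies can be sketched as follows, under our reading of Section 4.7 (function names and the sample text are illustrative, not the paper's implementation).

```python
import random

def neighbors_all(tokens, i, window):
    """Method 1: every word within +/- window positions of i is a neighbor."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [j for j in range(lo, hi) if j != i]

def neighbors_random(tokens, i, window, k, rng=random):
    """Method 2: k neighbors drawn at random from the same window."""
    candidates = neighbors_all(tokens, i, window)
    return sorted(rng.sample(candidates, min(k, len(candidates))))

tokens = "the battery life of this phone is great".split()
```

Method 1 is deterministic and uses every neighbor, which matches its higher accuracy and topic_coherency; Method 2 samples a subset, which reduces the number of word-pair updates and matches its lower perplexity.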

A Concise Description of the Proposed Solutions and the Results.
The main problem in this study is to examine a user's opinion about a subject such as a product or a movie, that is, to identify whether the user has a positive or negative opinion about it.
Two novel models have been proposed that use topic modeling to solve this problem. The proposed models extend and improve JST (as a topic model) through two new parameters; in particular, they consider the effect of words on each other. Regarding the evaluation results, the new parameters have an immense effect on model accuracy. The proposed models outperform JST because the additional parameters appropriately influence the process of producing words in a review, and they improve sentiment classification at the document level. The evaluation results also show that the proposed methods are more accurate than discriminative models such as SVM and logistic regression. Moreover, the proposed methods are more flexible than discriminative models because additional information, such as the top 10 words per topic, can be extracted directly from the data.

Conclusion
In this study, two new models called WJST and WJST1 have been presented that extend JST and improve the accuracy metric. A review of the various articles on sentiment analysis indicates that the proposed models are innovative and lead to remarkable results compared to the baseline methods. The proposed models can generate a sentiment dictionary. According to the evaluation results, the proposed models consider the effect of words on each other through the extra parameters, which are important and influential. The evaluation results indicate that accuracy has improved compared to baseline methods such as JST, RJST, TS, TJST, and ALGA, and that the proposed methods perform better under different topic-number settings. WJST1 outperforms the other methods in terms of accuracy, demonstrating its effectiveness. Prior sentiment information affects perplexity and topic_coherency less than it affects accuracy.
According to the evaluation results, using the AFINN dictionary as prior sentiment information is more effective than the NO_AFINN setting. ALGA uses a genetic algorithm to generate a sentiment dictionary, whereas the proposed methods use topic modeling to generate this dictionary. The proposed models outperform JST because the additional parameters appropriately influence the process of producing words in a review, and they can increase the accuracy of emotion detection at the document level. The proposed methods are unsupervised, so no labeled data are required; they can automatically assess web comments and categorize reviews as positive or negative. The proposed methods aim to increase accuracy with fewer parameters while remaining simpler than the existing methods. They both analyze emotions at the document level and create a sentiment dictionary, and they are the first methods to create a sentiment dictionary through a topic modeling technique in an automatic and accurate way. They are also the first methods that consider the words in the text and their effect on each other in a dynamic and weighted way. In addition, they are parametric.

Future Work
The proposed models in the present study are parametric, and further studies will be conducted to investigate nonparametric models. Sentiment classification on multidomain datasets is a challenge, and further studies can investigate this problem in future research. The proposed methods can also be evaluated on more datasets, and more parameters can be assessed. Twitter data have attracted significant attention in natural language processing studies and have specific characteristics, such as short text length; in future work, the proposed methods can be modified to analyze emotions in Twitter data.
Data Availability
The datasets used during the current study are available from the corresponding author (a.osmani@qiau.ac.ir) upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.