Extraction of Psychological Effects of COVID-19 Pandemic through Topic-Level Sentiment Dynamics



Introduction
COVID-19 is more than just an infectious disease spread by droplets emitted when people cough, sneeze, or talk; misinformation promoted on social media has made it a source of stress, depression, and anxiety. Fake information spreads quickly on social media, which negatively impacts mental health. During this period of social distancing and lockdown, individuals rely primarily on the Internet, and most activity is reported on social media. Opinion mining and sentiment analysis are emerging Natural Language Processing applications of steadily growing importance. They analyze textual datasets for people's opinions, assessments, sentiments, attitudes, and emotions. By applying sentiment analysis to almost any social media platform, such as Twitter, Facebook, YouTube, and Tumblr, it is possible to determine people's sentiments.
It has been proven that comprehending the emotions expressed through certain resources, such as tweets, blogs, reports, documents, or segments from political speeches, is significant for humans [1]. However, an enormous number of opinions is a challenging task for human processing. The extraction of sentiments from multiple sources that keep growing in volume, complexity, and diversity requires automated processes. Online social networks (OSNs) provide a medium where different people can engage, demonstrate, and express their ideas. Microblogging is a fast way of communication; Twitter, Tumblr, and Facebook are the most popular microblogging platforms, on which millions of messages appear. By analyzing content over social media, a notable transformation in the shaping of public perceptions has been made in society, revealing dominant users [2]. Recently, the data available on social media platforms related to the COVID-19 pandemic have needed to be processed to extract meaningful information and create awareness among people. Due to this pandemic situation, governments imposed lockdowns from January 2020 to September 2020 to save people's lives. The period of lockdown was very tough for people, as they were confined to their homes and used social media apps to exchange information and awareness. Twitter has become a popular social media platform on which people share their thoughts, views, audio, videos, and comments on various topics and ideas. Most viral tweets on COVID-19 carried a hashtag symbol [3]. A hashtag is a set of keywords that is helpful for finding useful information. The hashtag (#) symbol indicates posted information, comments, and ideas; hashtags are important, as 50 million pieces of information are organized with keyword hashtags on Twitter [4]. People from all over the world were affected badly by this pandemic. In the present study, we target "#sentences" on the COVID-19 pandemic from people all over the world on Twitter.
Six important keywords or trends related to the COVID-19 pandemic are targeted: (1) #Quarantine, (2) #Covid-19, (3) #QuarantineDays, (4) #QuarantineLife, (5) #MyPandemicPlan, and (6) #QuarantineAndChill. Data related to these trends for the duration January-September 2020 were collected to study people's everyday life and daily routines using sentiment analysis and topic modeling. The objectives of this proposed study are as follows: (1) collection of data from Twitter on the COVID-19 pandemic; (2) analysis of people's views using polarity tendency; (3) comparison between algorithms for better visualization of results; and (4) identification of the trend on which social media users mainly focus, using LDA models.

Related Work
With the tremendous increase of COVID-19 all over the world, researchers have applied sentiment analysis methods to social media data to observe people's mental well-being.
This section contains a summary of work related to COVID-19 research based on social media data. A. Jenifer describes hybrid approaches to sentiment analysis based on Twitter data. A sentiment lexicon was created and enhanced by Senti-WordNet, along with semantic rules, unsupervised machine learning methods, and fuzzy sets [5]. A hybrid standard classification was first carried out and was then upgraded to a hybrid advanced classification [6]. They built the hybrid advanced approach into a linguistic semantic polarity classification modeled using fuzzy sets. The new sentiment analysis methodology was used to compute the polarity of a given sentence for the movie review dataset.
Suresh et al. [5] described a fuzzy clustering model using real tweets collected over a one-year period for the purpose of analyzing the sentiments associated with a particular brand name. They conducted a comparative study with the K-means clustering algorithm and expectation-maximization techniques, using accuracy, precision, recall, and time complexity as metrics. According to the experimental analysis, the proposed method proved effective in performing high-quality sentiment analysis on Twitter. Compared to the other two methods, this model gave an accuracy of 76.4% and required less time to build. Supriya et al. [7] presented a three-step algorithm for analyzing public sentiment in Twitter tweets. The algorithm steps consisted of cleaning, entity identification, and classification for sentiment analysis. The performance of the classifier was measured using precision, recall, and accuracy. Elaziz et al. [8] proposed a novel approach to the visual diagnosis of COVID-19 through machine learning by classifying chest X-ray images into two classes: positive COVID-19 patient or negative COVID-19 person. They used new fractional multichannel exponent moments (FrMEMs) for feature extraction from the chest X-ray images and utilized a framework to accelerate the computational process. After that, they used a modified Manta-Ray Foraging Optimization (MRFO) based on differential evolution to extract the most important features. They applied this methodology to two COVID-19 datasets and obtained accuracies of 96.09% and 98.09%. Jain and Sinha [2] proposed a weighted correlated influence (WCI) approach in order to integrate the relative impact of trend-specific and timeline-based features of Twitter users. They used the Twitter trend #Coronavirus Pandemic to quantify the performance of their proposed approach. The proposed WCI showed better performance than existing methods. A. Sharma et al.
[9] gave insights into the foremost issues firms are facing due to COVID-19 and how they are examining their strategic options. They took Twitter data about NASDAQ 100 firms and used text analytics tools to find out the issues firms are facing and the strategies they are adopting. They also recommended some futuristic strategies for supply chain innovation. Samuel et al. [10] provided insight into the progression of COVID-19 pandemic fear sentiment. They also outlined interrelated methods, implications, opportunities, and limitations. Their analysis was based on COVID-19-linked tweets and R statistical tools along with text mining packages. By applying descriptive text analytics, they also established evidence that growth of fear sentiment existed from the beginning of COVID-19, as the outbreak reached its peak in the US. Furthermore, they provided a methodological overview of two fundamental machine learning classification approaches (Naïve Bayes and logistic regression), as applied to textual analytics, and compared their efficiency in categorizing coronavirus tweets. The Naïve Bayes and logistic regression classification methods provided accuracies of 91% and 74%, respectively, with short tweets, but both approaches showed relatively lower accuracy with lengthy tweets. Li et al. [11] examined the impact of COVID-19 on mental health. They used the method of online ecological recognition [12], based on several machine learning predictive models, to evaluate Weibo (a Twitter-like microblogging framework in China) articles. They used the collected data to calculate word frequency, scores of emotional indicators (depression, anxiety, indignation, and Oxford happiness), and cognitive indicators (life satisfaction and risk judgment). They performed sentiment analysis and a sample t-test to examine the differences before and after the confirmation of COVID-19.
The results showed that scores of negative emotions increased compared to positive ones. Cinelli et al. [13] used different social media platforms (Twitter, Instagram, Reddit) to analyze awareness and concern about COVID-19 and provided a differential assessment of the evolution of the global discourse on each platform and among their users. They found similar spreading patterns from reliable and suspicious information sources. Zhou et al. [14] analyzed the sentiment dynamics of the people of New South Wales (NSW), Australia, during the COVID-19 period by exploiting tweets on Twitter. They analyzed sentiments at the local government area (LGA) level based on more than 94 million tweets collected from Twitter over a 5-month period starting from 1st January 2020. The results showed that positive sentiments decreased due to the massive increase in confirmed COVID-19 cases. Han et al. [15] proposed a topic extraction and classification model to analyze media data in the early stage of COVID-19 in China. They generalized COVID-19-related microblogs into 7 topics and 13 more detailed subtopics. However, their study had some limitations.
They used social media to analyze text only, but pictures and videos could also be informative.

Proposed Methodology
This research is divided into a series of steps, as shown in Figure 1. The first step is to collect the data through the Twitter API; after a significant number of tweets have been collected, all of them are stored in a text file. In the second step, classification accuracy was improved by performing preprocessing techniques such as case folding, cleansing, word formalization, and stemming. These processes are conducted for the lexicon-based machine learning approaches. In the lexicon-based approaches, TextBlob, VADER, and AFINN have been used to determine the polarity of each Twitter user. Topic modeling has been used to find useful information in the group of tweets. Furthermore, we adopt the t-distributed stochastic neighbor embedding (t-SNE) technique, which partly compensates for the fact that humans cannot perceive vector spaces beyond 3D.

Data Set.
Real-time data were collected from Twitter using the scripting language Python. The data were collected from 1st January 2020 to 19th August 2020. For the collection and distribution of the datasets, the Twitter API (Tweepy) was used.
The API collects real-time Twitter data across the geographical regions of all countries, as illustrated in Figure 2.
There should be a valid Twitter account, and the application should be registered on Twitter to extract the tweets. The user sends a request to the API for Twitter data, and it returns data according to the user-defined query. A sample of 16,696 tweets was extracted. In this work, the query was "Quarantine Life", and all data belonging to this keyword were extracted.
The extracted data included tweet IDs, names, screen names, locations, descriptions, and followers and following counts of users.

Preprocessing.
Text mining needs a preliminary stage that, in essence, prepares the document by converting it into a more structured form. In this analysis, the preprocessing steps are as follows: (1) Case folding is the first step; it converts the whole text into lowercase, e.g., "Alan, self-isolation, Day 4" into "alan, self-isolation, day 4." (2) Cleansing is required to derive the figures used in this analysis. At this stage, we exclude punctuation, symbols, abbreviations, user mentions, and retweets. The only characters left after this phase are words.
(3) Formalization: tweets are limited to a small number of characters, so Twitter users tend to write unorthodox sentences. To resolve this, a formalization step is required in order to restore each word to its default form. (4) Stemming is the step that processes words using a tool that converts the words in a document into their fundamental forms using fixed rules.

Lexicon-Based Approaches.
Lexicon-based algorithms were applied to check the polarity of the text. We used three algorithms, AFINN, VADER (Valence Aware Dictionary and sEntiment Reasoner), and TextBlob, to check polarity in text and for semantic analysis.

VADER.
VADER is a lexicon- and rule-based analysis tool that is particularly used for sentiment analysis. It is used to extract sentiments expressed in social media, and it performs exceptionally well in this domain. VADER sentiment analysis [16] is primarily based on key factors such as punctuation, capitalization, conjunctions, degree modifiers, and the preceding trigram. VADER classifies sentiments into positive, neutral, and negative categories and produces a compound score, which is determined by summing up each word's valence score in the lexicon and normalizing into the range (−1, 1), where the most extreme positive is 1 and the most extreme negative is −1. If the compound score is less than −0.05, the text is considered negative; if the score is greater than 0.05, the text is considered positive; and if the score is between −0.05 and 0.05, the polarity of the text is neutral. A major advantage of VADER is that it does not entail preprocessing of data or training of a model; it can be applied directly to raw tweets to generate sentiment polarity. It also supports emoji for sentiment classification and is fast enough to be used online without affecting performance.
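The scoring rule just described can be illustrated with a minimal sketch: VADER normalizes the summed lexicon valences into (−1, 1) using the constant α = 15 and then applies the ±0.05 thresholds. The valence sum used in the example is an invented value, not taken from a real tweet.

```python
import math

ALPHA = 15  # VADER's normalization constant

def compound(valence_sum):
    """Normalize a summed valence score into (-1, 1), as VADER does."""
    return valence_sum / math.sqrt(valence_sum ** 2 + ALPHA)

def label(score):
    """Apply the +/-0.05 thresholds described above."""
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

score = compound(3.1)  # e.g., lexicon valences summing to 3.1
print(round(score, 4), label(score))
```

Note that the normalization keeps the compound score strictly inside (−1, 1) no matter how many scored words a tweet contains.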

AFINN.
AFINN is an English word list in which each word is assigned an integer between −5 and 5; it has been specifically designed for microblogs such as tweets. Its biggest advantage is that it is updated with new terms and phrases every year.
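A word list of this kind reduces to a dictionary lookup summed over the words of a tweet. The mini-lexicon below is an illustrative excerpt with made-up valences, not the real AFINN list.

```python
# Toy AFINN-style scorer: each word carries an integer valence in [-5, 5].
AFINN_SAMPLE = {"good": 3, "great": 3, "bad": -3, "fear": -2, "happy": 3, "sad": -2}

def afinn_score(text):
    """Sum the valences of all known words; unknown words score 0."""
    return sum(AFINN_SAMPLE.get(w, 0) for w in text.lower().split())

print(afinn_score("staying happy despite the fear"))  # -> 1 (3 - 2)
```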

TEXTBLOB.
TextBlob is a Python library (a TextBlob object behaves much like a Python string) used for processing textual data. It aims to provide a consistent API for dealing with common NLP (natural language processing) tasks such as part-of-speech tagging, noun phrase extraction, translation, text mining, text processing, text analysis, sentiment analysis, classification, and more. TextBlob analyzes text at the sentence level [16]. First, it takes input from the dataset, and then it splits each review into sentences. The polarity of the entire dataset can be determined by counting the number of negative and positive sentences and deciding whether the response is positive or negative based on the total number of negative and positive reviews. A sentiment() function can be used to find the polarity and subjectivity of a given review. It returns a tuple with two parameters, polarity and subjectivity, where the polarity score ranges from −1 to 1. The subjectivity range is 0 to 1, where 1 is the most subjective and 0 is the most objective.
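The sentence-counting procedure described above can be sketched as follows; the mini-lexicons and the `sentence_polarity` helper are hypothetical stand-ins for TextBlob's sentiment() scoring, used only to make the aggregation step concrete.

```python
# Illustrative mini-lexicons (not TextBlob's real pattern lexicon):
POSITIVE_WORDS = {"love", "great", "happy"}
NEGATIVE_WORDS = {"hate", "bad", "scared"}

def sentence_polarity(sentence):
    """Stand-in for a per-sentence polarity score."""
    words = set(sentence.lower().split())
    return len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)

def review_polarity(review):
    """Split a review into sentences and decide overall polarity by majority count."""
    sentences = [s for s in review.split(".") if s.strip()]
    pos = sum(1 for s in sentences if sentence_polarity(s) > 0)
    neg = sum(1 for s in sentences if sentence_polarity(s) < 0)
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

print(review_polarity("I love quarantine baking. The news is bad. Staying happy."))
# -> positive (two positive sentences outvote one negative sentence)
```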

Topic Modeling.
Topic modeling [17] consists of finding the information contained in textual documents and presenting it in the form of themes (depending on the technique used, the relative importance of the themes can also be found). Topic modeling is therefore an unsupervised technique for classifying documents into multiple themes. From the perspective of the representation space, topic modeling is a dimensionality reduction of the vector representation: instead of representing a document of a corpus by a vector in the space of the words composing the vocabulary of that corpus, the document is represented by a vector in the space of the themes of that corpus, where each value of this vector corresponds to the relative importance of the theme in the document. The popular modern technique Latent Dirichlet Allocation (LDA) was used in this study [18]; LDA is used for topic recognition in documents. It essentially tells how many topics exist in each document on a similarity basis.
This model observes all words and produces a topic distribution with probability P, as shown in Figure 3. Researchers prefer the LDA method for finding topics within context-based documents or text-based data [19, 20].
The mathematical representation of LDA is

p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) = ∏_{k=1}^{K} p(β_k) ∏_{d=1}^{D} ( p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) ),

where β_k is the word distribution of topic k, θ is the topic proportions of the document, and z is the topic assignment of a word in a document.

Polarity Calculation.
A total of 16,696 tweets were collected from the Twitter API. The collected records do not have a target group of people. To build the target-group view, three lexicon algorithms have been used. Every analyzer describes whether a tweet is positive, neutral, or negative. The authors' claim that social media has been incapable of presenting the proper direction in which netizens should combat a pandemic like COVID-19 is confirmed by the word clouds shown in Figures 4(a)-4(c). Most of the words described in each of the sentiments have been visualized using the WordCloud modules. These also present words that do not demonstrate any efficiency in representing a possible solution during crises. Among the three sentiment analyzers, we found that TextBlob had the highest rate of tweets with neutral sentiment, 44.42%. VADER gave the highest negative sentiment rate, 86.46%. However, AFINN gave the highest positive sentiment rate, 42.08%, as shown in Table 1 and Figure 5. Table 2 shows some random tweets with their target class: positive, negative, or neutral. We performed experiments with AFINN, VADER, and TextBlob by taking random tweets, and each algorithm then analyzed which tweets contain positive, negative, and neutral sentiments.
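The per-analyzer percentages of the kind reported in Table 1 are simple label tallies over the classified tweets, which can be sketched as follows; the label list is made-up sample data, not the study's tweets.

```python
from collections import Counter

def sentiment_rates(labels):
    """Convert a list of per-tweet sentiment labels into percentage rates."""
    counts = Counter(labels)
    total = len(labels)
    return {k: round(100 * v / total, 2) for k, v in counts.items()}

# Illustrative sample: 16 labels as an analyzer might emit them.
labels = ["neutral"] * 7 + ["positive"] * 5 + ["negative"] * 4
print(sentiment_rates(labels))
# -> {'neutral': 43.75, 'positive': 31.25, 'negative': 25.0}
```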

Topic Modeling.
After finding the sentiment from the data, the next step is to identify the topics. Topic modeling is the best way to discover how many abstract topics exist in the corpus. In LDA models, a document contains multiple topics, and one of them is usually dominant. Table 3 shows the extracted dominant topic for each sentence. Every keyword in LDA topic modeling carries a weight, and these weights show how important a specific keyword is within a topic, whereas word counts represent the frequency of repeated words in a specific topic. Figure 5 shows the weights of the topics and the keywords. Our goal is to find words that occur across multiple topics and whose relative frequency outweighs their weight; in many cases, such words are not as important as they seem. In Figure 6, the t-SNE plot projects high-dimensional data, such as word embeddings, which humans find difficult to understand, into a lower dimension. It categorizes words according to the four topics and overlays the word-level sentiment by color. For Figures 7 and 8, we selected four topics to analyze using Python 3.6.1 and LDAvis [16]. We set λ = 1 and examined the 4 topics and their keywords. Topic names were generated according to their similar keywords to explicate the topics. Bubbles represent the topics, and the size of a bubble is proportional to its prevalence in the corpus. Similar topics take shape close to each other; topics further apart are less similar. The inter-topic distance is used to determine their centers [16].
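Extracting the dominant topic of a document reduces to taking the argmax over the LDA document-topic distribution; the topic names and probabilities below are illustrative only, not values from the study's model.

```python
def dominant_topic(dist):
    """Return the topic with the highest probability in a document-topic distribution."""
    return max(dist, key=dist.get)

# Illustrative document-topic distribution (probabilities sum to 1).
doc_topic = {"quarantine life": 0.12, "covid-19 news": 0.55,
             "family time": 0.08, "lockdown rules": 0.25}
print(dominant_topic(doc_topic))  # -> covid-19 news
```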

Conclusion
The VADER, AFINN, and TextBlob polarities show positive, negative, and neutral frequencies. Most results among these dictionary-based algorithms are neutral. This study described that the majority of people were afraid of the COVID-19 pandemic situation, while on the other side some people were enjoying their lockdown period; for instance, they preferred to stay at home, play, watch movies, and read. The proposed study suggests that physical activities and exercise can refine cognition during the pandemic. The COVID-19 outbreak has been studied for its etiology, clinical features, transmission patterns, and management, but little has been done to explore its effects on mental health and ways to prevent stigmatization. People's behaviors can significantly influence the dynamics of the pandemic by altering its severity, transmission, spread, and consequences. Raising public awareness can help deal with this calamity in the present situation. Although high-frequency interventions had little concomitant effect on cognitive functions, the threshold remains to be worked out. In the future, we will extract various kinds of vaccination tweet datasets to study and analyze vaccine efficacy and effectiveness.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.