Smart Sentiment Analysis-based Search Engine Classification Intelligence

Search engines are widely used for finding information on the internet. However, there are limitations in the current search approach, such as providing popular but not necessarily relevant results. This research addresses the issue of polysemy in search results by implementing a search function that determines the sentimentality of the retrieved information. The study utilizes a web crawler to collect data from the British Broadcasting Corporation (BBC) news site, and the sentimentality of the news articles is determined using the Sentistrength program. The results demonstrate that the proposed search function improves recall value while accurately retrieving nonpolysemous news. Furthermore, Sentistrength outperforms deep learning and clustering methods in classifying search results. The methodology presented in this article can be applied to analyze the sentimentality and reputation of entities on the internet.


Introduction
Over the past few years, there have been significant advancements in natural language processing (NLP) [1]. These advancements have been driven by improvements in linguistic models that can predict words, characters, and sentences from textual data [1,2]. Chatbot models are considered the most efficient NLP programs, demonstrating accurate performance across various existing datasets for different NLP problems, including question answering, translation, news article generation, sentiment analysis (opinion mining), and unscrambling words. This has been highlighted in the literature by several researchers [3][4][5].
Nevertheless, the current evaluation and training approach for NLP favors only algorithms that have access to large datasets. In addition, the polysemy of textual patterns can negatively impact the performance of deployed NLP models.
To achieve human-like language capability, an NLP program must employ complex and disruptive technologies, while also addressing the need for feature engineering [1,2]. The study aimed to address the challenges in generating accurate news using NLP by bridging the gap between NLP expectations and the difficulties encountered. The main methodology used in NLP is the engineering of textual patterns. However, the study recognized the shortcomings of applying standard NLP to textual data without feature engineering. To address this, the study proposed a methodology that incorporates feature engineering by utilizing NLP models like embeddings and tokenizers to extract relevant features from existing data.
The proposed methodology aims to improve computational efficiency by reducing empirical errors and increasing accuracy levels. This article advocates for the adoption of an NLP system that focuses on extracting relevant textual data from the web for sentiment analysis. Although NLP has achieved notable progress, it is not yet equipped to address real-world problems with accuracy [3,6], since its success is dependent on big data, evaluation metrics, and training approaches that favor probabilistic and heuristic learning. There are currently numerous well-known news websites available [7,8], and the majority of them include an integrated search feature. Two popular news sites, China (https://www.chinanews.com) and BBC (https://www.bbc.co.uk), have built-in search functions, but these are limited and do not include advanced options like sentiment analysis. This study aims to design and implement a smart search engine with sentiment analysis capabilities to determine the opinion of search results and categorize them as negative, positive, or neutral. Such a search engine can automatically extract brand visibility or reputation from the Internet in real time by scoring search mentions positively or negatively.
Thus, the implementation of a smart search engine that incorporates sentiment analysis can provide a real-time understanding of people's attitudes towards specific brands during their Internet news search. The sentiment analysis results can provide relevant insights to improve a brand's market share, competitive advantage, and reputation. Additionally, search engines with smart functions and sentiment analysis can influence consumers to purchase products with a positive reputation.
In the context of news classification and categorization, a smart function based on sentiment analysis can foster brand trust and elevate a brand's reputation if the sentiment analysis results are positive. Text categorization and classification share many similarities, with the latter being a subset of the former [9][10][11]. In text categorization, the first step is to represent the text by preprocessing the documents and creating a vector space containing the words present in the documents using the bag-of-words (BoW) model.
Consequently, the classification of text documents relies on the proximity of keyword vectors, with the significance of keywords in the documents often determined by weighting schemes such as term frequency or word frequency. In contrast, sentiment analysis involves identifying relevant keywords in a textual document using linguistic patterns [11].
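As an illustration of the bag-of-words representation and the vector proximity described above, the following minimal sketch builds term-frequency vectors and compares them with cosine similarity. The whitespace tokenizer, the function names, and the sample sentences are assumptions made for this example and are not taken from the study's implementation.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count term frequencies."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents illustrating the "apple" ambiguity.
doc_fruit = bag_of_words("apple harvest fruit orchard apple")
doc_phone = bag_of_words("apple phone launch screen")
query = bag_of_words("apple fruit")

sim_fruit = cosine_similarity(query, doc_fruit)
sim_phone = cosine_similarity(query, doc_phone)
```

Under these assumptions the query vector lies closer to the fruit document than to the phone document, which is the kind of proximity a vector-space classifier relies on.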
The sentiment classification field has adopted the n-gram technique, which involves dividing sentences into tokens and using the sequence of tokens for text representation. Additionally, the part of speech (POS) technique is employed to tag words with their grammatical attributes, such as nouns, adverbs, verbs, or adjectives. These types of representations are commonly used in sentiment classification research [9,11,12]. This study employed a smart function to obtain unambiguous search results from the BBC search engine and then conducted sentiment analysis to assess the reputation of BBC news.
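The n-gram representation mentioned above can be sketched in a few lines. The helper name `ngrams` and the sample sentence are hypothetical, and a whitespace split stands in for a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace split used as a simple stand-in for a real tokenizer.
tokens = "the vaccine rollout was not slow".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

Bigrams such as ("not", "slow") keep local word order, which single tokens lose and which can matter for polarity.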
Users typically use keywords to search for news and expect the most relevant results. However, search engines often prioritize results based on their popularity rather than relevance, leading to a multiconnotation issue where desired results cannot be obtained due to the multiple meanings of the keywords. Search engines such as Google, Baidu, and Yandex prioritize recent or credible results, but these may not always be relevant to the user's needs. For example, searching for information about "apples" may yield results about both fruits and phones. The issue described is related to the problem of polysemy [8], which refers to multiple meanings of a word.
This challenge can be addressed by enabling users to provide more specific information to refine their search. However, only a few studies have addressed this issue. Some studies have utilized techniques such as anomaly detection and neural networks to classify search results that were manually collected [13].
This study aimed to optimize the categorization and classification of search results by implementing a search function on the search engine, resulting in fast and automated data collection. The proposed method also integrated sentiment analysis to determine the sentimentality of BBC news. The goal was to present a computational pipeline capable of quickly finding the desired data on the Internet to classify patterns and analyze an entity's web presence. To achieve this, a search function was employed to extract BBC news and classify it into three categories (COVID-19, vaccine, and travel), and sentiment analysis algorithms were used to detect the news polarity as either negative, neutral, or positive. By using sentiment analysis as a proxy, this research utilized a classified search engine to understand the polarity assigned to a web entity. A database with various categories of news was created, but the search function was not able to search the entire Internet.
However, it worked together with the search engine to automatically gather data. Each news item was tagged in the database to make sentiment analysis computation easier on the labeled data. The tags were fixed and divided into different categories. This article proposes a framework for processing textual data that brings engineering practices and paradigms to NLP in order to extract sentiment from web news. The proposed search function is envisioned as an optimized algorithmic pipeline that provides the most relevant results from the search engine. This article discusses how data quality and quantity can be addressed in engineering textual patterns and emphasizes the importance of including unstructured feature engineering, which is currently not widely available due to limited packages, in NLP problems [14].

Research Aim.
The main goal of this study was to create a search function that could gather BBC news from the Internet while minimizing polysemous results. This search function was integrated into the BBC search engine, and the collected data were stored in a database. Sentiment analysis algorithms were then applied to this database to determine the polarity of the BBC news. The performance of the sentiment analysis was evaluated using metrics such as precision, accuracy, F1 score, and recall. The following research questions were the focus of this study: (i) What effect do smart functions have on a search engine for news categorization? (ii) Which sentiment analysis model is the most suitable to classify news? (iii) How does NLP preprocess news data? The implementation of the smart function not only aims to improve the user's search experience but also facilitates the automated collection and sentiment analysis of relevant BBC news data. By reducing the polysemous issues, the function enables quick and efficient collection of data, which can be further categorized through automated sentiment analysis. The results of the study demonstrate that VADER was the most accurate and precise sentiment analysis model suitable for integration into a search engine for automated sentiment analysis.
Moreover, this research highlights the importance of NLP techniques in the processing of unstructured data. By engineering unstructured patterns and standardizing file formats in the database, data preparation and analysis become more manageable. This NLP-based approach enables the utilization of native formats of unstructured patterns, which can be processed and analyzed efficiently to provide insights into the sentimentality of BBC news data. Therefore, this study provides a comprehensive framework for the collection, processing, and sentiment analysis of unstructured data, which can be adapted and utilized for other similar applications.

Research Rationale and Contribution.
Polysemy is a phenomenon in language where a word can have multiple meanings or interpretations [15]. This can lead to ambiguity and confusion in NLP tasks like search queries, where the context of the query may not always be clear. Sentiment analysis of search results can help with this problem by providing additional context and clues to disambiguate the meaning of a query. Sentiment analysis involves analyzing the emotional tone or sentiment of a piece of text by using NLP techniques to classify it as positive, negative, or neutral [16]. When applied to search results, sentiment analysis can provide valuable information about the context and connotations of the words in the results. For example, consider a search query for the word "bank".
Depending on the context, this could refer to a financial institution or the side of a river. Sentiment analysis of the search results could help determine which meaning is most likely based on the emotional tone of the text. If the majority of the results are associated with positive financial news or reviews, it is more likely that the user was searching for a financial institution. Conversely, if the results are associated with negative river pollution or natural disaster news, it is more likely that the user was searching for the side of a river. In summary, sentiment analysis can help with the problem of polysemy by providing additional context and emotional clues to disambiguate the meaning of a search query. By analyzing the sentiment of search results, it is possible to infer the user's intended meaning and provide more accurate and relevant results.
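The "bank" example can be made concrete with a toy lexicon-based polarity scorer. This is only a sketch under strong assumptions: the two word lists are invented for illustration, and the scoring is far simpler than SentiStrength, VADER, or any model evaluated in this study.

```python
# Hypothetical mini-lexicons; real sentiment tools use much larger, weighted lists.
POSITIVE = {"growth", "profit", "gain", "strong", "recovery"}
NEGATIVE = {"flood", "pollution", "loss", "crisis", "damage"}

def polarity(text):
    """Label text by the difference between positive and negative word counts."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Invented search results for the polysemous keyword "bank".
results = [
    "The bank reported strong profit growth.",
    "Flood damage reached the river bank.",
    "The bank opens at nine.",
]
labels = [polarity(r) for r in results]
```

In this toy setting the financial result is labeled positive and the river result negative, which is the kind of contextual signal the paragraph above proposes to exploit.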
The aim of this research paper is threefold. First, it aims to present a solution for the polysemous problems encountered in search engines. Second, it seeks to identify the requirements that optimized search functions must meet. Third, it argues for the use of optimized search functions to retrieve relevant data and conduct sentiment analysis to evaluate the strength of an entity's web presence. The paper introduces a search function that can be utilized for search engine optimization in practical applications. The research contributes to the fields of opinion mining and search engine optimization by presenting a relevant theory that can be easily implemented and tested for news categorization and classification. The primary objective is to present a computational methodology that analyzes the web presence of an entity to study its sentimentality using sentiment analysis as a proxy. The paper introduces such a framework and evaluates its performance on BBC news for news categorization and classification. The design of the study is depicted in Figure 1. The article is organized into several sections, including the background (Section 2), research methodology (Section 3), and results (Section 4). The conclusion and recommendations are presented in Section 5.

Background
Although NLP has made significant advancements in achieving human-level performance in various tasks, researchers who seek to apply NLP to real-world problems still encounter challenges that demand more data annotation and training. To address this issue, an iterative and human-centered framework for NLP that incorporates feature engineering has been proposed.
This study recommends involving human participation in each stage of the data analysis process by focusing on the engineering of textual data [12] to enable the mitigation of technical issues in the proposed computational framework. The proposed framework, which adopts an NLP approach that focuses on the challenges of engineering textual data, holds promise for analyzing BBC news to determine their sentimentality. With this approach, it is expected that systems can be implemented to capture or understand the meaning of the analyzed text, thus mitigating technical issues in the proposed computational framework [17].
One major obstacle in developing robust NLP systems is the "symbolic problem," in which symbols are interpreted based on their morphology rather than their intended meaning. This complication presents a significant challenge in assessing the capacity of computational machines to comprehend word meanings [17,18].
NLP systems assume that the meaning of a sentence can be determined by analyzing a set of words in it. However, in technical and textual contexts, this assumption may not hold true since generalization is not always closely associated with sentence meaning. Furthermore, NLP systems do not learn effectively from training features and require significant retraining to improve the model's performance, as reported in recent studies [18][19][20].

International Journal of Intelligent Systems
The semantic disadvantage can be attributed to the evaluation metrics and current training methods that do not promote the implementation of human-like generalization models. As a result, NLP systems learn heuristically rather than learning the expected generalizations [19]. To address this issue, the proposed engineering approach involves human supervision and intervention, along with computational techniques such as annotation, tagging, and textual structuring, to facilitate sentiment analysis computation. Additionally, machine learning algorithms can be implemented as a tagging and forecasting approach to mitigate spurious correlations and heuristics that affect NLP learning.
In a study conducted by Bi [11], a novel approach was introduced for sentiment analysis that utilized a belief functions framework to merge sentiment analysis outputs. The study evaluated the effectiveness of sentiment stratification individually and in combination. The results showed that the combined algorithms performed better than single classifiers across five review datasets. Bi [11] developed a dichotomous ensemble learning method for sentiment analysis by using a triplet evidential scheme [9] to formulate negative and positive polarities, and negative and positive propositions for neutral polarity representation. This approach reduced the negative effects of neutral sentiments through ensemble learning and classified patterns with evidential reasoning schemes.
The proposed approach achieved a reduction in both the training time's complexity and cost, even when the neutral polarity was not considered. The method represented text reviews using a bag-of-words (BoW) approach and performed tenfold cross-validation. The proposed methodology was evaluated using the F score metric, and a total of 10,000 algorithms of triplet belief functions were utilized through three combination techniques with two groups of nine and eight learning classifiers. The results showed that the proposed method achieved an accuracy of 87%-90%, outperforming existing techniques [11]. Jalil et al. [21] employed both deep learning and machine learning techniques to perform sentiment analysis on data related to the coronavirus disease 2019 (COVID-19). Their experiments yielded an accuracy of 93% to 95%, which was higher than that of other techniques used in the study. In another study, Reddy et al. [22] applied an NLP algorithm to segment a collection of medical publications. The methodology involved segmenting all the words and the associated information.
Yogesh Pawade [23] demonstrated the benefits of a search engine optimization approach by using Google to extract essential information such as users' location and time spent on each website. Although the study did not involve pattern matching, it highlighted the potential of NLP algorithms in enhancing search engine optimization.
In the proposed methodology, user behavior analysis can be utilized to enhance the search engine experience by predicting user intent when searching for polysemous keywords. For example, if a user frequently searches for animal-related information and inputs the keyword "Jaguar", the search engine can predict that the user is likely searching for animal-related data, rather than information about the car brand. By analyzing user behavior and preferences, the search engine can improve its ability to provide relevant and accurate search results. This approach is similar to previous studies that have used user behavior analysis to enhance search engine optimization and improve the quality of user experience [24].
Bi [10] conducted an experiment to evaluate the relationship between accuracy and diversity of classifiers using pairwise and nonpairwise diversity measures and evidential combination rules. The study utilized Yager's and Proportion's rules to generate ensemble learning negative classifiers, which did not prioritize minimizing the error of selected classifiers. Empirical results showed that increasing diversity decreased the accuracy of ensemble learning. However, the study did not investigate the behaviors of member classifiers concerning the efficiency of the ensembling scheme. Huang et al. [25] introduced a novel approach, called the inverted index method, for efficient execution of temporal queries, which was demonstrated on a COVID-19 dataset.
The proposed method utilized a reverse sort-index approach, enabling real-time query processing to facilitate COVID-19 research.
The authors developed several categories of queries, including nontemporal, relative temporal, and absolute temporal queries. The results of the experiment suggest that the reverse sort indexing method is more efficient than current techniques in facilitating fast query execution for search engines. The inverted index is capable of rapidly retrieving nonpolysemous big data. The methodology proposed in this study is limited to searching and storing BBC news search results in the database. In the future, this approach can be applied to other search engines to extract more data. Nevertheless, sorting indexes, as highlighted by Huang et al. [25], remain an essential aspect of search engine optimization.
Xu et al. [20] devised a text-based approach for aircraft fault diagnosis by employing Word2vec and convolutional neural networks (CNNs). The experiments utilized a large corpus of text files, and Word2vec was employed to retrieve textual pattern vectors which were then passed to the CNNs for final decision-making. A cloud similarity measure (CSM) was used by the CNN model to detect faulty knowledge, resulting in improved classifier performance and supporting aircraft maintenance. By combining structured and unstructured patterns for fault diagnosis, Xu et al. [20] were able to identify the underlying cause of the aircraft fault.
The proposed framework utilized Word2vec to extract the contents of search results and suggest results based on the similarity of their vectors. This approach allows users to view results relevant to the keywords they have entered. Similar techniques have been employed by many search engines but with various optimization search functions.
Reddy et al. [22] utilized NLP to extract nonpolysemous information from a vast medical corpus, while with the proposed methodology, NLP was used to select relevant information from BBC news. Yogesh Pawade [23] highlighted the importance of search engine optimization using an inverted index on a large dataset. The proposed methodology also employs the inverted index technique but with fewer entries than Yogesh Pawade [23]. However, the proposed technique supports the idea that the search function can be scaled to search big data efficiently.
The study contends that a large dataset may contain spurious correlations as a result of its size. These correlations can overpower the identification of pertinent information that cannot be differentiated algorithmically. The more extensive the features examined, the higher the likelihood that spurious correlations will result in inaccurate conclusions and dominate the ultimate outcomes.
Dilrukshi and De Zoysa [26] developed a news classification and categorization system based on stratifying news by relevance. The study utilized a Twitter dataset and focused on articles related to Sri Lanka, utilizing dimension-reduction techniques. Frequent words were found to be less informative for text categorization, so irrelevant words were removed as noisy data. The models were evaluated using precision and recall values for each category, and the evaluation was independently conducted for each category.
Bun and Ishizuka [27] conducted a study on news articles to group them based on their topics. They utilized the term frequency proportional document frequency (TF-PDF) approach, which calculated the relevance of a word in a specific text. The study revealed that the significance of a word proportionally increases based on its frequency of occurrence in the text.
Kapusta and Obonya [28] conducted research on identifying fake news and categorizing news articles as either authentic or fake. Their objective was to develop a feature set from floating languages, such as Slovak, and apply it to detect fake news. They created a dataset using news articles from various publishers and labeled the features after scrutinizing the authenticity of the information. The study emphasized morphological learning approaches over contextual learning approaches. They introduced a technique to classify floating languages using part-of-speech (POS) tagging, which yielded reasonable accuracy.
Daud et al. [29] utilized hyperparameter-optimized support vector machines (HOSVMs) to categorize online news articles into different categories, highlighting the cost-effectiveness and simplicity of this approach. Five other adaptive computation methods were also optimized and compared for news categorization. The results showed that the HOSVM performed better than the other models, while the nonoptimized scheme performed worse than the alternative methods.
Yousef and Voskergian [30] introduced TextNetTopics, a feature selection model that employs bag-of-topics (BoT) instead of the bag-of-words (BoW) technique. The BoT method selects topics instead of individual words. The model utilized neural network layers to capture and analyze relevant information from sentences. The experiments demonstrated that TextNetTopics outperformed other feature selection models and achieved better results on various textual datasets.
Bi et al. [9] proposed a class-indifferent approach for merging classifier outputs using evidential structures (triplet and quartet) with Dempster's rule of combination. This ensemble methodology aimed to distinguish relevant observations from irrelevant ones by representing classifier results and providing pragmatic ways to apply the Dempster-Shafer theory of evidence to the ensemble learning scheme. To combine the mass functions, a formalism modeling classifier outputs as triplet mass functions was designed to provide decision support. In addition, a comparative analysis was conducted with dichotomous structures to compare the proposed method with majority voting and Dempster's rule.
The experiment was conducted using the UCI dataset, which demonstrated the advantages of the proposed approach. Table 1 presents a literature-based comparative analysis to validate the superior performance of the proposed model. According to Table 1, recent studies have not extensively explored news classification/categorization based on sentiment analysis. While many studies have utilized benchmarked datasets to discover news categories, these datasets may not be applicable in real-world scenarios due to the presence of unstructured data. Furthermore, while much attention has been given to classifying news as authentic or fake, very few studies have focused on news classification and categorization using sentiment analysis. The proposed function, as demonstrated in Table 1, outperforms most of the papers discussed in this section. In summary, the literature review highlights these key findings:
There is no single sentiment analysis method that performs better than others across different datasets. Additionally, there are only a few studies that explicitly apply established fusion techniques to combine classifier results for sentiment classification. Therefore, more extensive experimental work is needed to apply evidential reasoning approaches to the combination of classifiers for sentiment classification [9,10].
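To make the evidential combination of classifier outputs discussed above more tangible, the following is a generic sketch of Dempster's rule of combination for two mass functions. It is not the triplet or quartet formalism of [9]; the frame of discernment, the mass values, and the function name are assumptions chosen for illustration.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions keyed by frozenset focal elements."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence accumulates on the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to disjoint sets counts as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Normalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

POS, NEG = frozenset({"positive"}), frozenset({"negative"})
FRAME = POS | NEG  # ignorance: mass not committed to either polarity

# Hypothetical outputs of two sentiment classifiers for one article.
m1 = {POS: 0.6, NEG: 0.1, FRAME: 0.3}
m2 = {POS: 0.5, NEG: 0.2, FRAME: 0.3}
fused = dempster_combine(m1, m2)
```

With these invented masses, both sources lean positive, so the fused belief in the positive polarity is higher than either source's individual mass.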
To determine sentiment classification, it is necessary to address the uncertainty that arises due to the ambiguity between positive, negative, and neutral categories. Moreover, there is a current lack of effective methods to deal with this uncertainty, particularly in the context of sentiment classification.

Research Methodology
A search function was developed to retrieve BBC news information based on specific keywords (Figure 2), aiming to address the problem of polysemy and enhance search function accuracy. The web mining method was used to collect data by extracting content from web documents. A web crawler was employed to crawl the BBC news website and retrieve the required data [13,31]. The collected data were subjected to NLTK preprocessing and inverted index phases before being recorded in the database. To improve search accuracy, a search function based on a search engine was developed to crawl the BBC news website for specific keywords. When these keywords matched the database features, corresponding results were produced. Otherwise, the most similar results were provided. The NLTK was chosen to implement the search function by tokenizing sentences into words. The inverted index helped in collecting relevant information, while Word2vec computed news similarity. The BM25, a probabilistic retrieval model developed by Stephen E. Robertson in the 1970s and 1980s, was used in the optimization scheme [32]. The BM25 method is utilized to tokenize the user's keywords into distinct words and then apply a ranking function to arrange matching information based on their significance. This method utilizes a probabilistic retrieval approach to match patterns with their corresponding indexed information [32].
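The tokenization and inverted index phases described above can be sketched as follows. This is a minimal illustration, not the study's implementation: a regular-expression tokenizer stands in for NLTK, and the document identifiers and headlines are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split into alphanumeric terms (stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

# Hypothetical headlines keyed by document id.
docs = {
    1: "COVID-19 vaccine rollout begins",
    2: "Travel rules change after vaccine approval",
    3: "Airline travel demand recovers",
}
index = build_inverted_index(docs)
hits = search(index, "vaccine travel")
```

Because the lookup only touches the posting sets of the query terms, retrieval cost depends on those sets rather than on the size of the whole collection.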
To determine the similarity of each document, a score is generally calculated. The primary purpose of BM25 is to rank web documents based on specific queries, and it can also be seen as a measure of relevant information in some cases [33]. In the preprocessing phase, BM25 was implemented as a ranking method for the search engine to estimate a document's relevance for a given query. The BM25 has been found to be more adaptable and flexible than the traditional term frequency-inverse document frequency (TF-IDF) method [23,24]. The mathematical expression for BM25 is presented as follows:

$$\mathrm{BM25} = \frac{\text{Number of documents containing } k_i}{\text{Total number of all documents}}. \tag{1}$$

The feature's weight $w_{i,j}$ is computed for $k_{i,j} > 0$ and $f_{i,j} > k_{i,j}$, where the weight of documents containing keywords $k_i$ is represented by $w_{i,j}$ and the total number of all documents is represented by $f_{i,j}$. Here, $i$ is the number of characters and $j$ is the index of a character. Moreover, the normalization of term frequency by document length is not accurately lower-bounded with the BM25 technique. The current limitation of this approach is that long documents that do not match the user query may receive an unfair score and be deemed similar to shorter documents that do not include the query.

The Classification Complexity of the BM25.
The classification complexity of the BM25 technique can be demonstrated by analyzing its time complexity in terms of the number of documents and the length of the documents. Let $N$ be the number of documents, $n$ be the average length of the documents, and $m$ be the number of keywords. The BM25 technique involves the following steps:
(i) Preprocessing: the documents are preprocessed to tokenize and stem the words, remove stop words, and calculate the term frequency
(ii) Query Processing: the query is also preprocessed, and the query terms are ranked based on their relevance to the query
(iii) Scoring: the documents are scored based on their relevance to the query using a scoring formula in which $D$ is a document, $Q$ is a query, $k_i$ is a keyword, $f_{i,j}$ is the frequency of $k_i$ in document $j$, $k_{i,j}$ is a parameter that determines the saturation point of the score function for $k_i$ in document $j$, $b$ is a parameter that determines the importance of the document length, and avgdl is the average document length
(iv) Ranking: the documents are ranked based on their scores, and the top-k documents are selected as the search results

The time complexity of each step can be analyzed as follows:
(i) Preprocessing: the preprocessing step has a time complexity of $O(Nn)$, as each document needs to be processed individually
(ii) Query Processing: the query processing step has a time complexity of $O(m \log N)$, as the query terms need to be ranked based on their relevance to the query
(iii) Scoring: the scoring step has a time complexity of $O(Nm)$, as each document needs to be scored individually for each keyword
(iv) Ranking: the ranking step has a time complexity of $O(N \log N)$, as the documents need to be sorted based on their scores

Therefore, the overall time complexity of the BM25 technique can be expressed as $O(Nn + m \log N + Nm + N \log N)$. In practice, the value of $m$ can be much larger than the value of $n$ or $N$, which means that the time complexity of the scoring step dominates the overall time complexity of the algorithm. As a result, the BM25 technique can be computationally expensive for large datasets and queries with many keywords.

Table 1 (recoverable rows): literature-based comparative analysis.
[25] | Inverted index | COVID-19 | 64% | 70% | Weighting
Dilrukshi and De Zoysa [26] | Naive Bayes | Twitter | 90% | 62% | Ambiguous
Bun and Ishizuka [27] | TF-PDF | News archive | - | - | Stop words
Kapusta and Obonya [28] | Decision tree | Slovak text | 68% | 75% | Noise
Zhu et al. [29] | Sentiment analysis | Chinese text | 90% | 68% | Feature engineering
Yousef and Voskergian [30] | TextNetTopics | Textual data | 80% | 71% | Misclassification
This work | Search function-based sentiment analysis | BBC news | 94% | 97% | Embedded category
Bi [11] | Triplet belief functions | Review | - | - | Not self-adaptable
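The scoring and ranking steps can be illustrated with the commonly published Okapi BM25 form, which is assumed here because it matches the symbols described in the scoring step (term frequency, a saturation parameter, $b$, and avgdl):

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

The sketch below implements this form in Python. The parameter defaults $k_1 = 1.5$ and $b = 0.75$, the non-negative IDF variant, and the sample documents are assumptions for illustration and are not taken from the study.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Assumes a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative (an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical tokenized headlines.
docs = [
    "vaccine rollout speeds up across the uk".split(),
    "travel rules eased for vaccinated passengers".split(),
    "vaccine travel pass planned for summer travel".split(),
]
scores = bm25_scores(["vaccine", "travel"], docs)
# Ranking step: sort document positions by descending score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The nested loop over documents and query terms corresponds to the $O(Nm)$ scoring step, and the final sort to the $O(N \log N)$ ranking step.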

The Experimental Data.
The study selected BBC data for crawling because the BBC is the world's largest news broadcaster [16], which lends credibility and authority to the news examined in this study [16,34]. The BBC news website encompasses a wide range of topics that are appropriate for the proposed search function. The study randomly selected a sample of 800 articles from the website to ensure the reliability of the search results. The data were stored in a database table called news (refer to Figure 3). The study retrieved several attributes from multiple BBC articles, including the uniform resource locator (URL), content, date, title, and label, which displays the categorized information of BBC news (Figure 3). Afterward, sentiment analysis computation was performed on the database table's content attribute to determine the polarity of textual data. This method proved to be useful in predicting the sentimentality strength of BBC news.
Figure 4 displays the stop word types that were detected and ignored during the feature engineering process. The frequency of stop word occurrence in the BBC dataset is shown in Figure 5. Removing stop words during feature engineering reduced both the size of the original dataset and the time required for sentiment analysis prediction [35,36]. The elimination of stop words enhanced the prediction accuracy by retaining only relevant tokens or patterns in the BBC dataset, which could improve the values of the evaluation metrics. The dataset contains three categorical variables or columns (COVID-19, vaccine, and travel). The descriptions (desc) of the features extracted using the proposed search function are presented in Figure 6.

Crawling and Data Cleaning.
A web crawler is a script or program that automatically searches for information on the web based on certain rules or conditions. Web crawling is advantageous for downloading web pages and lends itself to multithreading. This study utilized a web crawler, specifically the Python-based Google App Engine (PGAE) introduced by Dominic [37]. The configuration container for the web crawler included a core file that specified the interval time between two fetches, the database path directory, and the starting and ending time of crawling. The interval time between two fetches controls the crawling frequency from the website. If the frequency is set too high, the IP address may be blocked. In the data engineering process, the structure of the documents was determined using HTML div and span tags to make them identifiable.
Other HTML components such as the paragraph (p) and emphasis (em) tags accurately represented the semantics of the web content. However, the use of div and span tags made the web content more accessible. The crawled features were cleaned by removing any div tags from the web pages, and the cleaned data were stored in a JSON file before being transferred to a database.
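As a rough illustration of the crawl-and-clean step, the sketch below keeps only the text inside paragraph tags, discards div and span markup, and pauses between fetches. It uses only the Python standard library rather than the PGAE crawler used in the study; the names `ParagraphExtractor`, `clean_page`, and `crawl`, the caller-supplied `fetch` function, and the 5-second default interval are all hypothetical.

```python
import json
import time
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements, ignoring div/span wrappers."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


def clean_page(url, html):
    """Return a JSON-serializable record holding the cleaned page content."""
    parser = ParagraphExtractor()
    parser.feed(html)
    content = " ".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
    return {"url": url, "content": content}


def crawl(urls, fetch, interval_seconds=5.0):
    """Fetch each URL with the caller-supplied `fetch`, pausing between fetches."""
    records = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(interval_seconds)  # fetch interval limits the request rate
        records.append(clean_page(url, fetch(url)))
    return records
```

The resulting list of records can be written with `json.dump` before being loaded into a database table.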

Feature Weighting with NLTK and Inverted Indexing.
The manual term weighting method was utilized, as Salton et al. [38,39] demonstrated that manually assigning term weights is as effective as automatic weighting methods. To determine the weight of a term based on its relative frequency, the proposed approach applied Zipf's law [40]. The term weighting was computed using the following formula:

$$w = \log\left(\frac{N}{n}\right)$$

where N represents the total number of documents and n is the total number of documents containing a particular term or keyword. The NLTK library in Python was used in this study with two methods, namely, NLTK tokenize and NLTK stem.
The NLTK tokenize method was used to segment sentences into words, using spaces to split sentences into different words. Similarly, the NLTK stem method was used to normalize words by reducing them to a common form, such as changing the past tense to the present tense. The proposed search function tokenized the search information that users entered into the search engine using the NLTK tool, and the search was performed based on the tokenized words. Irrelevant words, known as stop words, were removed during the data-cleaning phase. Stop words are words that are considered to be of little significance and are thus ignored during text analysis. The most commonly found stop words include six function words (a, that, the, an, and, those), most of which are determiners used to describe nouns and express concepts related to localization or numbers in the text. Removing stop words can improve computation efficiency and retrieval performance by reducing the number of indexes in the corpus. In this study, stop words were removed to improve the accuracy of the search function.
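The tokenization, stop-word removal, and stemming steps can be sketched in plain Python as follows. This is only an approximation of what the NLTK tokenize and stem modules do: the six-word stop list is taken from the text above, while `crude_stem` is a toy suffix stripper standing in for a real stemmer, and all function names are hypothetical.

```python
import re

# Small illustrative stop-word list; NLTK ships a much longer one.
STOP_WORDS = {"a", "an", "and", "that", "the", "those"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Toy suffix stripper (assumption); a real stemmer handles far more cases."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

For example, `preprocess("Bananas and apples")` drops the stop word "and" and keeps two stemmed tokens, so the query is matched on two index terms rather than three.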
For instance, if a user searches for apples, a search engine might return 100 results. But if the user searches for bananas and apples, the search function may tokenize the sentence into three parts [(bananas) (and) (apples)], resulting in more than 100 results. The Python library used in this study contains pretrained models and corpora that were utilized in implementing the search function. Inverted indexing was necessary for retrieving information from the database through indexes. Each entry in the news table includes a specific index that determines the location of textual data. An inverted index file was created beforehand, which allowed for efficient searching of information by the user.
The BBC articles were tokenized using NLTK, cleaned before being linked to a specific index, and stored in the database for sentiment analysis. The index facilitates rapid retrieval of search results when a user searches for certain words. The table shown in Figure 8 contains the indexes, where the keywords are recorded in the term column, the number of indexes matching the keywords in the df column, and the weighted index matches in the docs column. For example, a user searching for BBC will receive only 697 indexes and articles, as shown in Figure 8. This inverted indexing technique improves concurrency and assists in automatically generating attribute values to determine the record location.
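A minimal sketch of an inverted index with term, df, and docs fields is given below. The exact layout of the table in Figure 8 is not reproduced here; the dictionary structure, the function name `build_inverted_index`, and the use of a log(N/n) weight (one common reading of the term weighting described earlier) are assumptions.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its document frequency (df), weight, and posting list.

    `docs` maps a document id to its list of preprocessed tokens.
    """
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            postings[term].add(doc_id)
    n_docs = len(docs)
    index = {}
    for term, doc_ids in postings.items():
        df = len(doc_ids)
        index[term] = {
            "df": df,
            "weight": math.log(n_docs / df),  # assumed log(N / n) term weight
            "docs": sorted(doc_ids),
        }
    return index
```

A term that appears in every document, such as "bbc" in a corpus of BBC articles, receives a weight of zero under this scheme, which reflects its lack of discriminating power.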

Search Function with Sentiment Analysis.
The BBC news website was crawled using a web crawler to extract patterns such as date, title, and content, and the extracted news features were then stored in a local database as shown in Figure 9. In the data preprocessing phase, the news data were preprocessed and relevant keywords were matched using the NLP module of the framework, which includes the NLTK and inverted index. The NLTK module was utilized to tokenize the BBC news into words, remove punctuation and full stops using regular expressions, convert all words to lowercase, and remove irrelevant words with no meaning. Additionally, stemming was used to normalize words to their primitive form, which reduces incomplete search results. The inverted index was then used to efficiently and accurately search for keywords and point to relevant BBC articles.
The data wrangling phase of the framework involves a search function that is activated when keywords are entered, and the BM25 algorithm is employed to extract matching indexes from the JSON file. Each search result is assigned a score by the BM25 algorithm, which is used to measure the similarity between the keywords and the web articles. The search results are then ranked based on their scores. In the search function module, the user's keywords are compared with the prestored words in the index table. If there is a match, the corresponding index is retrieved and the matching news is located. If the entered keywords do not match any of the indexed words, the most similar index and words are located using a regular expression [41]. The interface module acts as the front end of the framework, and HTML is used to search for content pages. When users enter their keywords, the search results are displayed on the content page.
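The keyword lookup with a regular-expression fallback might look like the sketch below. The text does not specify how similarity is measured, so the longest-shared-prefix rule, the minimum prefix length of three characters, and the name `lookup` are all assumptions made for illustration.

```python
import re


def lookup(index, keyword, min_prefix=3):
    """Return (matched_term, postings) for `keyword`.

    Exact matches are preferred. Otherwise the indexed term sharing the
    longest prefix with the keyword is returned (assumed similarity rule).
    """
    keyword = keyword.lower()
    if keyword in index:
        return keyword, index[keyword]
    # Try progressively shorter prefixes of the keyword as a regex pattern.
    for length in range(len(keyword), min_prefix - 1, -1):
        pattern = re.compile(re.escape(keyword[:length]))
        candidates = sorted(term for term in index if pattern.match(term))
        if candidates:
            return candidates[0], index[candidates[0]]
    return None, []
```

With an index containing "vaccine", a query for "vaccination" falls back to the shared prefix "vaccin" and returns the postings stored under "vaccine".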
Figure 9 illustrates the process of the search function that utilizes sentiment analysis. In the data collection phase, a web crawler retrieves news from the BBC website and stores it in a JSON file, which is subsequently transferred to the database. The preprocessing module is automated: when users enter new keywords, the system stores them in the database if they match the embedded categories of the search function. The news features from the website are tokenized into words using the NLTK, punctuation is removed with a regular expression, and the words are converted to lowercase.
Stop words are removed using sentence tokenization. The text is then normalized using stemming, and cleaned words are transformed into indexes that correspond to the matched BBC article. The generated indexes are stored in a posting table in the database. In the search process, users input keywords into the web interface, which are passed to the database to search the index table for matching keywords. The system then matches the keywords found in the database with specific web articles to provide search results, which are sorted based on relevance. Finally, the sorted search results are displayed on the web page and presented to users. Overall, the search function allows users to enter a set of keywords into the web interface, which are stored in the database and evaluated for polarity using sentiment analysis.

Sentiment Lexicons.
A sentiment lexicon is a collection of words or phrases that are associated with specific emotions or sentiments and is commonly used in sentiment analysis algorithms. The lexicon enables the algorithm to compare input words with previously labeled words in the lexicon to predict the sentiment or polarity of sentences. The lexicon-based algorithms used in the proposed framework are VADER (Valence Aware Dictionary and sEntiment Reasoner) [42], SentiWordNet [43], SentiStrength [44], Li et al. lexicon [45], and AFINN-111 [46]. Each sentiment analysis model used in the study had its own lexicon. The framework applied each model to the BBC data and utilized the respective lexicon to match the keywords in the news articles to their corresponding polarity labels, in order to predict the overall sentiment of the BBC sentences.
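The general idea of lexicon matching can be sketched in a few lines of Python. The lexicon below is a toy example with made-up scores, not an excerpt from any of the lexicons named above, and the simple sum-and-sign rule is an assumption that ignores the extra rules each real model applies.

```python
# Toy lexicon with made-up scores; real lexicons hold thousands of rated entries.
TOY_LEXICON = {"good": 2, "great": 3, "safe": 1, "bad": -2, "crisis": -3, "death": -3}


def lexicon_polarity(tokens, lexicon=TOY_LEXICON):
    """Sum the scores of tokens found in the lexicon and map the sign to a label."""
    total = sum(lexicon.get(token, 0) for token in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

Tokens that are absent from the lexicon contribute nothing, which is why lexicon coverage strongly affects the predicted polarity.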

Sentiment Analysis Algorithms.
This section presents the experimental sentiment analysis algorithms and the evaluation metrics used to assess the sentiment analysis computation. The search methodology comprises two types of searches: the proposed optimized search function and the simple normal search, both of which produce a dataset. To address the class imbalance in the datasets, the study used feature engineering. The sentiment analysis on these datasets was then evaluated. The dataset was divided into training and testing samples, with 70% of the entire set assigned to data-70 and the remaining 30% to data-30, which served as a validation set. This approach ensured that the sentiment analysis model was not trained using the testing sets and that the validation results were independent of any discrepancies and biases. The study compared the proposed search function to the normal search procedure based on the data collected from the web. The evaluation metrics used to assess the overall experiments were precision, accuracy, recall, and F1 score. Figure 10 depicts the differences between the two searches.
The primary difference between the two search methods is the inclusion of embedded categorization. The same set of keywords (Covid, Vaccine, and Travel) is utilized for both search methods to ensure result reliability. The search process is divided into three clusters: Covid, Vaccine, and Travel. The Covid category contains coronavirus news stored using the Covid keywords (as shown in Figure 11). The Vaccine category includes Covid vaccine news stored using the vaccine keywords (as shown in Figure 11).
Lastly, the proposed search function stores Covid-related travel news using the Travel keyword in the Travel category (as shown in Figure 12). The computing environment used in this study is presented in Table 3.
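The 70/30 division into data-70 and data-30 described in this section can be sketched as a seeded shuffle followed by a cut. The function name `split_70_30` and the fixed seed are assumptions; the study does not state how the split was randomized.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle the records reproducibly and split them into 70% and 30% parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Because the two parts are slices of one shuffled list, no record can appear in both, which keeps the validation set independent of the training set.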

VADER.
VADER is a sentiment analysis program that uses a rule-based lexicon designed specifically for social media analysis. Its developers, Hutto and Gilbert [47], created the algorithm to address the challenges in analyzing symbols, languages, and text styles that are commonly found in social media. VADER is capable of detecting positive, negative, and neutral polarities of textual data. The VADER lexicon is built by selecting textual patterns from the General Inquirer (GI) and Linguistic Inquiry and Word Count (LIWC) preset lexicons and includes slang, social media abbreviations, emojis, and emoticons such as :-) to represent positive, negative, or neutral sentiments. The lexicon consists of 7,500 textual features. To compute the polarity of a sentence using VADER, a compound score is calculated by summing the valence values of each word found in the lexicon and adjusting them according to specific rules. The polarity score is then normalized to lie between the most negative (-1) and most positive (+1) values. The assigned polarity values for each row in the analyzed text are positive (pos), negative (neg), and neutral (neu), as shown in Figure 13. VADER uses a set of thresholds to assign positive, negative, or neutral polarity, as shown in Figure 14.
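The normalization and thresholding of the compound score can be sketched as follows. The normalization constant of 15 and the cut-offs of +0.05 and -0.05 are the values conventionally used with VADER; whether the study applied exactly these thresholds (Figure 14) is an assumption, and the function names are hypothetical.

```python
import math


def normalize_compound(raw_sum, alpha=15.0):
    """Squash a summed valence score into the open interval (-1, +1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound, threshold=0.05):
    """Map a compound score to pos, neg, or neu using symmetric thresholds."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"
```

Because the raw sum is divided by a quantity that always exceeds its absolute value, the compound score approaches but never reaches -1 or +1.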

SentiWordNet.
SentiWordNet is an opinion-mining tool widely used in sentiment analysis that employs a lexicon derived from WordNet. This lexicon is organized into sets of synonyms (synsets) covering nouns, verbs, adjectives, and other grammatical categories. The SentiWordNet algorithm utilizes the WordNet synset dictionary to combine polarity scores and determine the sentiment of the text as either negative, objective (neutral), or positive.
The algorithm generates three scores, each ranging from zero to one, using supervised machine learning methods [43]. The PosScore and NegScore represent the level of positivity and negativity associated with the text, and the third score represents its objectivity. The process depicted in Figure 15 was used to assign sentimentality with SentiWordNet version 3.0. To determine whether a given text feature has a positive, negative, or neutral sentiment, the average scores of its associated synsets were used. If the positivity average score exceeds the negativity average score, then the sentiment is considered positive.
The lexicon used contained 64,000 features, and Python code was written utilizing the NLTK package. The calculation of sentimentality for expressions in the lexicon is shown in Figure 16.
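The synset-averaging rule can be illustrated with the sketch below. It takes the positive and negative scores as plain number pairs instead of reading them from SentiWordNet through NLTK, so the input values are supplied by the caller; the function names and the tie-breaking rule (equal averages count as neutral) are assumptions.

```python
def average_scores(synsets):
    """Average the (pos_score, neg_score) pairs of a word's synsets."""
    count = len(synsets)
    if count == 0:
        return 0.0, 0.0
    pos = sum(p for p, _ in synsets) / count
    neg = sum(n for _, n in synsets) / count
    return pos, neg


def word_sentiment(synsets):
    """Compare the averaged scores; equal averages are treated as neutral."""
    pos, neg = average_scores(synsets)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

A word with no synsets in the lexicon, or with only objective synsets, is therefore labeled neutral.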

Sentistrength.
Sentistrength is an open-source sentiment analysis tool that can detect the emotions expressed in text. It has been used in the Cyber Emotions project to analyze over 14,000 social media posts with a level of accuracy comparable to that of a human. The Sentistrength lexicon is based on terms and expressions derived from the Linguistic Inquiry and Word Count (LIWC) [44,48,49]. Each textual pattern is assigned scores for negative, positive, and neutral polarities. Sentistrength predicts the negative and positive polarities in short texts using specific rules.
The Sentistrength algorithm requires placing textual features in a plain text file (one text per line) to predict text polarities. The resulting output is a copy of the file containing negative and positive predictions at the end of each line. As shown in Figure 17, the negative and positive values of each text can be predicted using Sentistrength. The final and averaged polarity of the file can be obtained with different point scales ranging from −5 (negativity) to +5 (positivity), where 0 represents neutral polarity. Previous experiments have demonstrated the robust performance of Sentistrength in web mining [44,49]. This tool is available in both Java and Windows formats, with this study utilizing the publicly accessible Windows version, which can be found at http://sentistrength.wlv.ac.uk/. The Sentistrength lexicon is presented in Figure 18.
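Post-processing of such an output file could be sketched as below. The layout assumed here (the text followed by two tab-separated integers, a positive and a negative score, at the end of each line) is an assumption about the output format, as is combining the two scores by simple addition; the function names are hypothetical.

```python
def parse_line(line):
    """Split one output line into (text, positive_score, negative_score).

    Assumes the two scores are the last two tab-separated fields.
    """
    text, pos, neg = line.rstrip("\n").rsplit("\t", 2)
    return text, int(pos), int(neg)


def overall_polarity(pos, neg):
    """Combine the two scores by addition and map the sign to a label."""
    total = pos + neg
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


def average_polarity(lines):
    """Average the combined score over all lines of the file."""
    totals = [pos + neg for _, pos, neg in map(parse_line, lines)]
    return sum(totals) / len(totals) if totals else 0.0
```

Keeping the positive and negative scores separate until the last step preserves the information that a text can express both polarities at once.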

Liu and Hu Lexicon.
The Liu and Hu lexicon used in this study consists of two word lists: one for positive predictions and the other for negative predictions [50]. It was developed by researchers from the Computer Science Department at the University of Illinois [50]. The Python code used to access the negative and positive lexicon words is demonstrated in Figure 19. To predict the sentiment of texts, Python code was created to compare the BBC texts with the labeled lexicon words using the NLTK.

AFINN-111.
AFINN-111 consists of a collection of English words that have been assigned a sentiment score ranging from −5 (negative) to +5 (positive) [51]. This lexicon was developed manually by Finn Årup Nielsen over the course of 11 years from 2000 to 2011 [51]. The lexicon includes two lists, AFINN-111 containing 2,477 phrases and words, and AFINN-96 with 1,468 unique phrases and words consisting of 1,480 lines. The list includes two columns: the word itself and its corresponding polarity value, which lies in the range [−5, +5]. Figure 20 displays the AFINN-111 list.
Based on the analysis conducted, SentiWordNet was found to have the highest occurrence of positive and neutral textual patterns (Figure 21), whereas Sentistrength exhibited the highest frequency of negative patterns (Figure 21). This finding suggests that the size of sentiment analysis lexicons can significantly impact the classification or prediction results obtained, and researchers should carefully consider the size and characteristics of each lexicon before making a selection.
Selecting appropriate lexicons is important because the accuracy and effectiveness of sentiment analysis largely depend on the quality and suitability of the lexicon used. A lexicon is a dictionary or database that contains words and phrases along with their assigned sentiment polarity (positive, negative, or neutral) [11]. Using an inappropriate lexicon can lead to inaccurate sentiment analysis results, which can have serious consequences in various domains such as business, politics, and healthcare. For example, misinterpreting customer sentiments in business can result in a loss of revenue and reputation, while misinterpreting patient sentiments in healthcare can lead to incorrect diagnoses and treatments. Therefore, selecting a lexicon that is tailored to the specific domain and language is crucial for obtaining accurate sentiment analysis results. It is also important to continuously update and improve the lexicon as language and context evolve over time.

Evaluation Metrics.
In the experiments, precision (P), accuracy (A), F1 score (F), and recall (R) [52,53] are used as evaluation metrics. Precision is the number of instances correctly predicted as belonging to a class divided by the total number of instances predicted as belonging to that class. Accuracy, on the other hand, determines the percentage of correctly classified instances [52,54]. It is the number of observations that the sentiment analysis model correctly predicted divided by the total number of observations [55,56].
Recall measures the number of instances that the sentiment analysis model predicted correctly as belonging to a particular class, divided by the total number of instances that actually belong to that class [55,57]. True positive (TP) refers to instances where the model accurately predicts the positive class, while true negative (TN) refers to instances where the model accurately predicts the negative class [58]. False positive (FP) refers to instances where the model incorrectly predicts the positive class, and false negative (FN) refers to instances where the model incorrectly predicts the negative class [59,60]. Accuracy measures the overall classification/prediction success of the sentiment analysis and is calculated using the following equation:

$$A = \frac{TP + TN}{TP + TN + FP + FN}$$

The F1 score is a metric that combines both precision and recall values of the sentiment analysis model. This score is calculated as the harmonic mean of the model's precision and recall, as shown in the following equation:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$
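The four metrics can be computed from the confusion-matrix counts as in the following sketch. The function name `evaluation_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

As a usage example with made-up counts, `evaluation_metrics(tp=40, tn=45, fp=10, fn=5)` gives an accuracy of 0.85 and a precision of 0.80.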

Results
The results of sentiment analysis on the BBC data are presented in this section. The experiment employed sentiment analysis lexicons without any modifications to analyze the data. Additionally, a comparative analysis was conducted to determine the best-performing sentiment analysis algorithm. Table 4 displays the evaluation metrics obtained from both BBC samples. The sentiment analysis performance is summarized in Figure 22. For the data obtained using the proposed search function, the results showed that the VADER model had the highest accuracy of 85%, whereas the precision performance of the Sentistrength, AFINN-111, and Liu and Hu's models was similar (as shown in Table 4). The F1 score of the SentiWordNet model was 65% using the data obtained through the proposed search (Table 4). The AFINN-111 model performed well in the positive sample rating, achieving an accuracy of 78% with the proposed search data (Table 4). However, the Sentistrength model did not perform well in classifying positivity/neutrality with the proposed search data, with an accuracy range of 10%-15% (as shown in Table 4). When analyzing data obtained through a normal search, the precision of the AFINN-111 and VADER models was higher than that of the other three models, and the two values were close to each other.
The precision for negative and positive classification using normal search data was best for the VADER and SentiWordNet models, with 69% and 65%, respectively (Table 4). However, AFINN-111 showed the highest precision of 60% for neutral classification using both types of data (Table 4).
The performance of Liu and Hu's model was low, as evident from Table 4.
It is worth noting, however, that the performance of the Liu and Hu lexicon improved when using data extracted with the proposed search, as compared to a normal search. The utilized sentiment analysis lexicons and models showed promising results, even without preprocessing the data extracted with a normal search. The comparative analysis suggests that the VADER lexicon is a strong candidate for accurate classification of positive and negative news. The experiment also demonstrates the AFINN-111 model's proficiency in rating more positive samples when using the proposed search data.
Furthermore, the Sentistrength model achieved the highest accuracy (75%) in classifying negative samples, as shown in Table 4.
The performance of the VADER model varied with the two different BBC samples, and the model's performance decreased when using data extracted with a normal search. Figure 23 illustrates that the VADER model had a misclassification rate of 15.8% for features extracted by the proposed search and a misclassification rate of 20.3% for patterns retrieved by a normal search.
These results demonstrate the effectiveness of the proposed search function in enhancing the structure and quality of BBC data compared to a normal search. The NLP techniques implemented in the data preprocessing stage led to a decrease in the misclassification rate of sentiment analysis.
Table 5 displays the quantity of textual characteristics obtained by the two types of searches, covering the period from 2020 to 2023. The table includes the search terms used for data collection, along with the number of URLs retrieved and used in the process. The content column shows the number of pertinent words (text) employed by the sentiment analysis model for news categorization. This table complements Table 4 by providing further information. As an example, news categories such as Covid, Vaccine, and Travel, as well as website titles, were considered. The optimal confusion matrix for sentiment analysis is depicted in Figure 24, illustrating the number of textual characteristics that have been accurately predicted/classified by the sentiment analysis as belonging to the previously mentioned news categories. Specifically, the confusion matrix for the best sentiment analysis technique, SentiWordNet, reveals that out of 16,744 textual features, the majority were correctly predicted to belong to the Vaccine category.
According to Tables 4 and 5, this prediction has a negative polarity of 68%. Notably, SentiWordNet outperformed other models by predicting 16,744 words concerning the COVID-19 vaccine as having negative polarity, with an accuracy or precision of 68%. Among the sentiment analysis techniques, AFINN-111 required the longest execution time (50 seconds), as shown in Table 4.
Therefore, the efficiency of sentiment analysis models in terms of time complexity and misclassification rates can serve as a crucial performance indicator for news categorization and classification. In general, the evaluation metrics exhibit a minor improvement when utilizing data obtained through the proposed search function, as illustrated in Figure 25.
The outcomes indicate that the proposed search function retrieves a greater number of URLs, content, and indexes than a standard search, as presented in Table 5. This finding highlights the significance of feature engineering in news classification and categorization. By utilizing NLP for feature engineering, the accuracy and precision of sentiment analysis classification can be enhanced. Crawling with a standard search required more time compared to the proposed search, mainly due to the absence of preprocessed patterns in a standard search. The primary reason for the superior performance of the proposed search function compared to a standard search is the categorization of features recorded by the proposed search function. As a result, sentiment analysis models were able to achieve a higher level of accuracy and precision.
Among the five models, the VADER, SentiWordNet, and Sentistrength techniques achieved the most promising outcomes in terms of the number of URLs and textual content classified. VADER employed 8,000 textual patterns, followed by the SentiWordNet and Sentistrength models, as indicated in Table 5. Adjectives were found to be the most relevant linguistic pattern for sentiment analysis. Nevertheless, combining adverbs and adjectives can also enhance the sentiment analysis classification. An essential issue, however, is that adjectives and adverbs should not carry equal weight in predicting the sentiment of a given text. Rather, the scores of adjectives and adverbs should be combined using appropriate feature weighting techniques to achieve optimal results. The evaluation of five sentiment analysis models using available lexicons revealed that Sentistrength and SentiWordNet obtained comparable accuracy and precision for both types of search, while Liu and Hu's implementation performed poorly. Adverbs were identified as the most critical linguistic feature for sentiment extraction. The use of NLP lexicons facilitated sentiment analysis computation, highlighting the importance of using a smart search function that can retrieve nonpolysemic search results and provide accurate automated sentiment analysis.

Discussion.
To summarize, the study found that VADER was the most efficient and accurate sentiment analysis model for news classification (Figure 26). AFINN-111 performed well in predicting positive polarity, with a high accuracy of 78% when using structured data retrieved with the proposed search function (Figure 26). Liu and Hu's model did not perform well due to the limitations of its lexicons. Overall, the VADER, SentiWordNet, Sentistrength, and AFINN-111 lexicons showed promise for accurately classifying news based on their positive and negative sentiments. The best-performing model in classifying positivity for the Covid category was AFINN-111, with an accuracy of 78% (Figure 26). This result confirms the effectiveness of the proposed search function in improving the structure and quality of textual data compared to a normal search. The SentiWordNet technique had the best confusion matrix and correctly predicted more textual features as belonging to the Vaccine category. The study suggests that the time complexity, lexicon structure, and misclassification rates of sentiment analysis models are important indicators for gauging the efficiency of a sentiment analysis framework. Table 6 presents a comparative analysis with existing studies to demonstrate the superior performance of the proposed methodology. Among the five models, VADER, SentiWordNet, and Sentistrength obtained the best results, outperforming some existing techniques. However, the study's focus was on using sentiment analysis as a subset of NLP to enhance the search engine's performance and provide users with a better perspective on the crawled data.
The study used sentiment analysis in combination with other NLP techniques such as text segmentation and POS (Part-of-Speech) tagging to optimize the search engine's smart search function. The aim was to develop a search engine with embedded sentiment analysis that could retrieve nonpolysemic results and classify them using sentiment analysis models [43,47,[63][64][65][66]. A search engine with embedded sentiment analysis can automatically classify emotions from search results into negative, positive, or neutral categories. However, accurately determining the valence or emotions of text can be difficult due to the subjectivity of language, leading to false positives and false negatives. This means that even with updated lexicons, sentiment analysis models may still have issues predicting the polarity of statements. Sentiment analysis has various practical applications in search engine optimization. For instance, it can be used for real-time assessment of brand reputation, improving customer support interactions, identifying customer needs and anomalies, and monitoring customer behavior using web intelligence. Nevertheless, the absence of lexicons in languages such as French, Spanish, Chinese, and Swahili will impede the deployment of smart search engines and the contextualization of sentiment analysis outcomes. This will lead to inaccuracies in sentiment computation due to false positives and negatives. Moreover, building a robust and smart sentiment analysis framework for search engines requires tackling challenges such as polysemy, irony, sarcasm, and multipolarity.

Conclusion
This manuscript presents a topic that involves several aspects of NLP and sentiment analysis. The author employed a web crawler to collect BBC news across the Internet and preprocessed the text using NLP and sentiment analysis methods to determine the polarity of the processed text data.
The BBC website was utilized to crawl news data using a proposed smart search function. The experimental results show this search function collecting more than 2,000 textual patterns for sentiment analysis. The proposed function outperformed a normal search in terms of feature quality. However, this study can be extended by embedding more categories in the search function to collect a large amount of data from the entire Internet. In fact, the development of effective methods and functions for each aspect mentioned in this study is always challenging. Even though the experiments were performed with a single database, the author still believes that some of the results presented in this article will be used as a reference in the field of search engine optimization. Different users will have different needs, and traditional search methods cannot satisfy all needs due to polysemy problems, which lead to poor user experiences. To address this issue, the research implemented a search function enabling users to extract relevant BBC information that meets their needs. In addition, the web has a rich and diverse set of news, including fake news. Unfortunately, the proposed search function cannot remove fake news from the search results. Another limitation is linked to the number of categories embedded in the proposed search, which do not include all groups that are relevant to users. However, the proposed list of categories such as Covid, Vaccine, and Travel can still be extended to improve the search efficiency and reliability. Furthermore, the proposed function can be improved by adding an advanced classification scheme to the search with intelligent tagging. The tagging mechanism should be automated with machine learning, where the classifier can be trained to tag searched contents. Moreover, the optimized search function using machine learning can also be used to study the strength of an entity's web presence. In future works, one can automatically create search categories using artificial intelligence and predict users' behavior with machine learning. An online corporate reputation analysis can also be implemented with the proposed methodology. Additional lexicons such as Google Play Opinion Mining Score (GPOMS) and Opinion Finder can also be added to the sentiment analysis models to improve the sentiment analysis performance. Lastly, ensemble learning can also be utilized to improve the machine learning accuracy on the BBC dataset.

Figure 4: The types of words ignored in the feature engineering process.

Figure 6: Data extracted with the proposed search function.

Figure 7: Data extracted with a normal search function.

Figure 9: Search function and sentiment analysis.

Figure 8: The index table with weight.

Figure 11: The search results and data collected.

Figure 25: Summary of the sentiment analysis performance.

Figure 26: Time execution and precision of sentiment analysis.

Table 4: Evaluation metrics in the BBC dataset.

Table 5: Prediction category in the BBC dataset.

Figure 24: The confusion matrix of the SentiWordNet model.