Assessing the Influence Level of Food Safety Public Opinion with Unbalanced Samples Using Ensemble Machine Learning

Assessing public opinion on food safety events is an important task for government regulators. A promising way to optimize the government's management of food safety affairs is to use artificial intelligence to improve the efficiency of food safety public opinion assessment. In this paper, we model the assessment of public opinion influence as a text classification task. The whole model adopts the ensemble learning framework, and it integrates naive Bayes, support vector machine, extreme gradient boosting, convolutional neural network, long short-term memory network, FastText, and BERT classification methods into the framework to form an ensemble learner. The ensemble learner is able to classify textual public opinion into high, medium, and low influence levels by learning from the samples assessed by human experts. To overcome the problem of unbalanced samples, we propose a sample generation method consisting of synonym replacement and semantic filtering to increase the number of high-influence samples. Real public opinion data collected from the Food Safety Department of the Chinese government are used for the experiments. Extensive comparison of the proposed method with baseline methods demonstrates the effectiveness of the ensemble learner and the sample generation steps.


Introduction
Nowadays, people are used to expressing opinions on the Internet, which has led to explosive growth in the volume of online public opinion. Because food safety is closely related to everyone's daily life, public opinion on this topic is very likely to develop into hot events in society. For example, La Tourangelle is a walnut oil brand welcomed by the most discerning customers in China. In 2019, the news that this brand of oil contained plasticizer exceeding the standard triggered vast public opinion on the Web, as this oil was mainly used for feeding babies. It has been shown that the interaction of government agencies with public opinion through social media can help the government respond to public events efficiently [1]. The government can use public opinion assessment to explore people's attitudes towards an event [2][3][4] and predict events that may lead to serious consequences [5]. Therefore, it is meaningful to assess the influence of food safety public opinion in the early stage of its formation. The importance of food safety public opinion has been pointed out in various regulatory documents issued by the government [6,7]. However, unlike many other management optimization fields which have been intensively studied [8,9], there is currently little research dedicated to food safety public opinion assessment. We therefore survey the literature on general public opinion assessment rather than on food safety public opinion assessment specifically. Moreover, since we formalize the public opinion assessment problem as a text classification problem, the literature on text classification is also analyzed to show the characteristics of our research.

Public Opinion Assessment.
To assess the influence of general public opinion, researchers often construct an index system for public opinion evaluation. For example, considering the influence of microblog messages and the dynamic role of the target audience of online public opinion, a microblog public opinion indicator system is established based on the Information Source Index (ISI), Geographic Index (GI), Subject Index (SI), and Industry Index (II) [10]. Simple analysis methods such as principal component analysis and the analytic hierarchy process [11] are also often used in the construction of public opinion index systems. However, the manually selected indexes in this kind of study cannot fully measure the characteristics of public opinion influence. At the same time, index selection depends strongly on the opinions of experts and is therefore highly subjective. Some scholars believe that user behaviors such as forwarding and commenting can be used as the basis for evaluating the future development of public opinion. Li and Li use cloud models and the analytic hierarchy process to analyze user behaviors in public opinion dissemination, and they use this method to accurately predict hot public opinions [12]. Considering the indicators that affect the amount of user forwarding, Zheng et al. build a prediction model of network public opinion forwarding behavior using BP neural networks [13]. Because of the randomness and ambiguity of user forwarding behavior, Liu et al. use cloud theory to optimize the activation function of RBF neural networks [14]. When the information publisher has some professional authority or high popularity, users may ignore the actual content of the information when forwarding it [15]. Therefore, relying only on the statistics of user behavior to assess the influence of public opinion will produce a certain deviation.
Seeing the problems with using user behaviors and manually constructed indexes, scholars have begun to turn to the textual content of public opinion to assess its influence.

Text Classification for Public Opinion Assessment.
Text classification plays an important role in public opinion assessment. Some scholars use text classification to identify the sentiment of public opinion, as sentiment affects the behavior of people interacting with public opinion [16,17]. Other scholars use text classification to directly classify public opinion into different influence categories [18]. Text classification has been intensively studied in recent years. Once text has undergone feature extraction and been turned into numerical features, various machine learning methods can be used to classify it. Al-Tabbakh et al. use support vector machine, k-nearest neighbors, naive Bayes, and decision trees to classify the same text collection, and their results show that k-nearest neighbors performs best in the experiment [19]. In contrast, deep learning algorithms do not require text feature extraction [20]. Deep learning models complete text classification by autonomously acquiring the relationship between text and label [21]. Another way to enhance text classification performance is to use multiple classifiers, which is called ensemble learning. Ensemble learning can effectively improve the accuracy and generalization ability of machine learning by accommodating more model assumptions. Cotelo et al. use a stacking framework of ensemble learning to integrate the content features and structure features of text for classification [22]. Song et al. propose an ensemble learner to assess the impact of food safety news, which improves the accuracy of impact prediction [18]. However, their work does not take into consideration the unbalanced sample distribution across different impact levels, so the result is not satisfactory for high-impact news.

Dealing with Unbalanced Samples in Text Classification.
Using machine learning for public opinion classification requires attention to the distribution of data samples, as high-influence samples make up only a very small part of the whole. Studies have shown that when the imbalance of the data set reaches 4:1 or higher, the predictive ability of the model is lost [23]. Methods of processing unbalanced samples can be divided into oversampling and undersampling. Oversampling tries to enrich the minority type of samples by generating more samples of this type, and undersampling uses a subset of the majority type to make the number of each type equal. The classic oversampling method SMOTE maps the original samples to a certain vector space and then uses the samples in the space that are close to each other to construct new samples [24]. SMOTE-IPF [25] optimizes the classic method by considering noisy and borderline examples. ADASYN [26] considers the distribution of minority data and generates new samples corresponding to the actual distribution of minority samples. In addition to oversampling and undersampling, classification algorithms themselves can be adapted to fit unbalanced samples. Datta and Das propose an approximate Bayesian support vector machine based on boundary transition and asymmetric cost to minimize the classification error [27]. Ando proposes a nearest neighbor model based on class weighting to compensate for the sparsity of minority classes by adjusting the k radius [28]. Cheng et al. introduce cost-sensitive marginal mean, variance, and penalty to adjust the proportional distribution between different categories, so as to obtain a balanced detection rate [29]. Despite the usefulness of adapting algorithms to unbalanced data, the adapted algorithms are often only applicable to a specific model. Under the framework of ensemble learning, adapting different base models one by one to unbalanced samples would increase the complexity and cost of the framework.
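To make the SMOTE idea described above more concrete, the following is a minimal sketch of its interpolation step in plain Python. It is not the reference implementation: the function name `smote_like`, the toy points, and the choices of `k` and `n_new` are assumptions made purely for illustration.

```python
import random


def smote_like(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D minority samples (already mapped to a vector space).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
new_points = smote_like(minority)
```

Each synthetic point lies on the segment between two existing minority points, which is why this family of methods works in a numeric feature space rather than directly on text.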

Ensemble Learning Framework
In this paper, we propose a food safety public opinion assessment model that considers unbalanced sample distribution using the ensemble learning framework. The structure of the model is shown in Figure 1. In order to make full use of the data samples labeled by domain experts, we retain all the available data and adopt a "replacement-filtering" oversampling method to replenish the minority samples. The first step of oversampling is to build a synonym dictionary based on the vectorized word representation acquired through the word embedding operation. Then, we replace some words in a sample with similar words to generate more samples of the minority type. At the last step of oversampling, we train a Siamese LSTM network to filter out the newly constructed samples that are too dissimilar to real samples. To improve the accuracy and robustness of influence level classification, we use the stacking ensemble learning framework. The framework integrates naive Bayes (NB), support vector machine (SVM), extreme gradient boosting (XGBoost), convolutional neural network (CNN), long short-term memory network (LSTM), FastText, and BERT as base learners. Each base learner has its corresponding text preprocessing step: for NB, SVM, and XGBoost, each public opinion sample is turned into a vector of TF-IDF weights, together with the influence level label of this sample; for CNN and LSTM, each public opinion sample is turned into a matrix whose columns correspond to the embeddings of words; FastText takes the original text as input; for BERT, as it limits the length of input text for efficiency reasons, we apply automatic summarization to shorten oversized public opinion samples. The stacking ensemble learning framework includes a meta-learner to synthesize the influence levels rated by each of the base learners. We test k-nearest neighbors (KNN), SVM, and NB as three candidate meta-learners and select the best to use.
To test the method proposed in this paper, we obtain food safety public opinion samples from the Risk  Scientific Programming piece of text showing the original content of the public opinion, and an influence label ranging from high, medium, and low influence levels.

Processing Unbalanced Samples
In this study, we propose a "replacement-filtering" oversampling method to deal with the unbalanced data. The flowchart of the proposed oversampling method is shown in Figure 2. Details of the method are introduced in the following sections.

Minority Sample Generation Based on Synonym Replacement.
In order to retain the information in the original data to the greatest extent, this paper balances the sample set by increasing the number of minority samples. For a newly added sample to serve the purpose of balancing the sample set, the new sample and the corresponding original sample should have similar characteristics. For textual samples, this means that the new sample and the corresponding original sample should be highly similar in terms of content and semantic meaning. Using synonym replacement to modify the original sample serves the goal of keeping text content similar. Synonym replacement replaces each word in the original sample with a synonym of the word from the synonym dictionary, which yields a new sample corresponding to the original sample. Although there is an existing platform that can do Chinese synonym replacement [30], the synonyms in this platform only include common terms and lack professional vocabulary from fields such as law, medicine, and food safety. In this paper, we propose a synonym replacement method based on computed word vectors. The implementation steps are as follows.

Word Vector Computation.
We use the Python package jieba to segment the original Chinese text. Then, we input the segmented text to the Word2Vec model implemented in the Python package gensim to get word vectors. This step is also called word embedding. Word2Vec is a widely used word embedding model capable of capturing word meaning through self-supervised learning.

Top N Synonym Dictionary Construction.
The cosine similarity of two words is calculated according to (1), where $x = (x_1, \ldots, x_i, \ldots, x_n)$ and $y = (y_1, \ldots, y_i, \ldots, y_n)$ represent two word vectors.

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \qquad (1)$$

The closer the cosine similarity value is to 1, the higher the similarity between the two words. We calculate the cosine similarity of each word pair and choose the top N words most similar to a specific word to construct the synonym dictionary.
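The dictionary construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `cosine` and `build_synonym_dict` and the two-dimensional toy vectors are hypothetical, whereas the paper obtains its vectors from Word2Vec.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def build_synonym_dict(vectors, top_n):
    """Map each word to its top_n most similar other words."""
    synonyms = {}
    for word, vec in vectors.items():
        scored = [(cosine(vec, other_vec), other)
                  for other, other_vec in vectors.items() if other != word]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        synonyms[word] = [other for _, other in scored[:top_n]]
    return synonyms


# Toy 2-dimensional vectors (hypothetical stand-ins for Word2Vec output).
toy_vectors = {
    "oil": [1.0, 0.1],
    "grease": [0.9, 0.2],
    "milk": [0.1, 1.0],
    "yogurt": [0.2, 0.9],
}
synonym_dict = build_synonym_dict(toy_vectors, top_n=1)
```

Comparing every word pair is quadratic in the vocabulary size, so for a vocabulary of tens of thousands of words a vectorized or approximate nearest-neighbour search would likely be preferred in practice.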

Generating Samples for a Minority Sample.
We traverse each word in the minority sample and replace the word with its $i$-th most similar word in the synonym dictionary, where $i$ is a random number ranging from 1 to N. A new minority sample is generated after each word in the original minority sample has been replaced.
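A minimal sketch of this replacement step is given below. The function name `generate_sample`, the toy dictionary, and the rule of keeping words that have no dictionary entry are assumptions for illustration; the text does not state how out-of-dictionary words are handled.

```python
import random


def generate_sample(tokens, synonym_dict, rng):
    """Replace each token with a randomly ranked synonym from the dictionary."""
    new_tokens = []
    for tok in tokens:
        candidates = synonym_dict.get(tok)
        if candidates:
            i = rng.randint(1, len(candidates))  # 1-based rank, as in the text
            new_tokens.append(candidates[i - 1])
        else:
            # Assumption: words without a dictionary entry are kept unchanged.
            new_tokens.append(tok)
    return new_tokens


# Hypothetical top-2 synonym dictionary and a segmented minority sample.
synonym_dict = {
    "oil": ["grease", "fat"],
    "unsafe": ["harmful", "dangerous"],
}
rng = random.Random(42)  # fixed seed so the sketch is repeatable
original = ["walnut", "oil", "unsafe"]
generated = [generate_sample(original, synonym_dict, rng) for _ in range(3)]
```

Passing an explicit `random.Random` instance keeps the generation repeatable, which is relevant to the reproducibility concern raised in the conclusion.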

Sample Filtering Based on Siamese LSTM.
Using the above method, we can generate any number of new samples for a given sample. However, this process only pays attention to the similarity of words but not the similarity of semantic meaning between the new samples and the original sample.
To improve the quality of generated samples, we use Siamese neural networks to filter the new samples to ensure the semantic similarity between the new samples and the original sample.
The Siamese neural network [31] is composed of two identical neural networks with shared weights. It can be used to assess the similarity of two samples. Due to the excellent performance of the LSTM model in text understanding, this paper uses the Siamese LSTM model to complete sample filtering. The process is as follows:

Construction of a Siamese LSTM Model.
Construct two LSTM models with identical structure and shared weights. This is done by training an LSTM text classifier using the original public opinion samples and duplicating the trained classifier.

Sample Filtering.
For each newly generated sample, we input it together with its corresponding original sample to the Siamese LSTM model. We obtain the vector representations of the two samples at the LSTM layer prior to the softmax layer. Then, we compute the cosine similarity between the two vectors and check whether the similarity value exceeds a predefined threshold. If so, we retain the generated sample; otherwise, the generated sample is discarded.
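The thresholding logic of this step can be sketched independently of the network itself. In the sketch below, `toy_encode` is a hypothetical stand-in for the representation taken from the LSTM layer (it simply sums hand-made word vectors), and `filter_generated` and the toy vectors are likewise illustrative assumptions.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def filter_generated(original, generated, encode, threshold=0.8):
    """Keep generated samples whose encoding is similar enough to the original's."""
    reference = encode(original)
    return [g for g in generated if cosine(reference, encode(g)) > threshold]


# Hand-made word vectors; a real system would use the trained LSTM encoder.
TOY_WORD_VECTORS = {
    "oil": [1.0, 0.0, 0.0],
    "grease": [0.9, 0.1, 0.0],
    "unsafe": [0.0, 1.0, 0.0],
    "harmful": [0.0, 0.9, 0.1],
    "delicious": [0.0, -1.0, 0.0],
}


def toy_encode(tokens):
    """Stand-in encoder: sum the toy vectors of the tokens."""
    total = [0.0, 0.0, 0.0]
    for tok in tokens:
        total = [t + v for t, v in zip(total, TOY_WORD_VECTORS[tok])]
    return total


original = ["oil", "unsafe"]
candidates = [["grease", "harmful"], ["grease", "delicious"]]
kept = filter_generated(original, candidates, toy_encode, threshold=0.8)
```

In this toy case the first candidate preserves the meaning of the original and is kept, while the second reverses it and is discarded.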

Construction of an Ensemble Learner
Constructing an ensemble learner involves three basic steps. The first is to select a group of base learners that are differently structured or differently trained. The second step is to divide the data set to properly train the base learners. The last step is to select a meta-learner to synthesize the results of the base learners to get the final prediction result.

Base Learner Selection.
To ensure the superiority and robustness of the final result, the selection of base learners follows the principle of accurate results and model diversity. The chosen base learners are listed below.

Naive Bayes.
Naive Bayes (NB) is a classic machine learning model based on Bayes' theorem and the assumption that sample features are independent. If the sample features satisfy this assumption, an NB learner performs very well. In food safety public opinion assessment, an NB learner uses TF-IDF weighted words in the public opinion as features, and we use the sklearn package to carry out training and prediction with the NB learner.

SVM.
SVM learns to classify by solving an optimization problem. It maximizes the distance between a separating hyperplane and the support vectors in the sample space. Due to its good performance, SVM has been used as a benchmark in many classification tasks. When there are more than two classes, a one-versus-rest method is usually adopted: by treating one of the n classes as one class and the other n − 1 classes as another class, a total of n SVM classifiers are constructed for an n-class classification problem.
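The one-versus-rest decomposition can be illustrated without any SVM library. In the sketch below, the helper names and the decision scores are hypothetical; choosing the class with the highest binary score is a common convention and is an assumption here, as the text does not say how the n classifiers are combined.

```python
def one_vs_rest_labels(labels, classes):
    """For each class c, build binary targets: 1 if the sample is c, else 0."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


classes = ["low", "medium", "high"]
labels = ["low", "medium", "high", "low"]
binary_targets = one_vs_rest_labels(labels, classes)

# Hypothetical decision scores from the three binary classifiers for one sample.
predicted = one_vs_rest_predict({"low": -0.3, "medium": 0.2, "high": 1.1})
```

With three influence levels, this yields three binary problems, one per level.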

XGBoost.
XGBoost itself is an ensemble learner integrating multiple CART (classification and regression tree) models based on the boosting mechanism. The training process of XGBoost is to create a series of CARTs and let each tree learn to fit the prediction error of the previous trees. Leveraging the different assumptions in the constructed trees, XGBoost can improve the generalization ability of a learned model.

FastText.
FastText is a simple three-layer neural network designed for natural language processing tasks. FastText can achieve text classification precision comparable to that of deep neural networks but is many orders of magnitude faster in training time. At the input layer of FastText, n-grams in the text undergo a bucket hashing process and become embedding vectors. Since FastText generates word vectors by itself, we do not apply Word2Vec to the FastText classifier.
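The bucket hashing of n-grams mentioned above can be sketched as follows. This is an illustration only: `zlib.crc32` is used as a stand-in hash (FastText uses its own hash function), and the function names, tokens, and bucket count are assumptions.

```python
import zlib


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bucket_ids(ngrams, num_buckets):
    """Map each n-gram to a bucket index; each bucket owns one embedding row.

    Assumption: CRC32 stands in for the library's actual hash function.
    """
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in ngrams]


tokens = ["walnut", "oil", "plasticizer"]
bigrams = word_ngrams(tokens, 2)
ids = bucket_ids(bigrams, num_buckets=1000)
```

Hashing bounds the size of the embedding table regardless of how many distinct n-grams occur, at the cost of occasional collisions between unrelated n-grams.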

CNN.
CNN is a deep neural network architecture originally proposed for image classification. Yoon Kim proposed a variant of CNN, namely TextCNN, for text classification [32]. In this paper, we use the word vectors generated by Word2Vec to replace the random word embedding used in TextCNN, so as to incorporate more prior knowledge in the classification model.

LSTM.
LSTM adds an input gate, a forget gate, an output gate, and a memory unit to an RNN neuron, making the modified model capable of memorizing important information and forgetting unimportant information in a time series [33]. LSTM is very useful for modeling text, as the word sequences in text represent time series signals. The training of LSTM requires a vector representation of each word in the text, which in this paper is acquired using Word2Vec.

BERT.
BERT [34] and its variations are among the state-of-the-art techniques for natural language processing. BERT is based on Transformer [35], an encoder-decoder architecture built on the multihead self-attention mechanism. BERT is structured as a multilayer bidirectional Transformer encoder and is pretrained with two types of tasks: masked language modeling and next sentence prediction. Using BERT to classify is called fine-tuning, which is to learn only the weight matrix of the softmax layer. BERT consumes significantly more computing resources as the length of input text grows. To fit the capacity of our computing resources, we set 128 Chinese characters as the maximum length of input text for BERT and use TextRank [36] to summarize the food safety public opinion, so as to keep as much information in the original text as possible.

Data Processing for the Ensemble Model.
To train the 7 base learners and the meta-learner, the sample data set should be properly divided and fed to the model following the process depicted in Figure 3. The overall training set is divided into 7 sets of equal size, shown as $a_1, \ldots, a_7$ in Figure 3. Each base learner is trained 7 times. In the first round, $a_1$ is used as the inner test set and the remaining sets are used as the training set; in the second round, $a_2$ is used as the inner test set and the remaining sets are used as the training set, and so on. The class labels predicted for inner test set $a_i$ by base learner $j$ are denoted as $b_{ij}$. By merging $b_{i1}, \ldots, b_{i7}$ along the same test samples, we get $B_i$, and $B_1, \ldots, B_7$ comprise the training set for the meta-learner. Each base learner has 7 differently trained versions, but when testing the ensemble learner, the meta-learner needs a single class label from each base learner. We therefore test the seven versions of a base learner one by one on the overall test set and choose the most frequently appearing class label for each test sample to get $T_i$, the test result of base learner $i$. Finally, by merging $T_1, \ldots, T_7$, we get $T$, the test set for the meta-learner.
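The data flow described above can be sketched in plain Python with toy learners. Everything named here is hypothetical: `stacking_features`, the two toy learner factories, and the one-dimensional data are illustrative assumptions, and only two base learners are used instead of seven to keep the sketch short.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stacking_features(factories, X_train, y_train, X_test, k):
    """Build meta-learner inputs: out-of-fold labels for training samples and
    majority-voted labels (over the k trained versions) for test samples."""
    folds = k_fold_indices(len(X_train), k)
    meta_train = [[None] * len(factories) for _ in X_train]
    meta_test = [[None] * len(factories) for _ in X_test]
    for j, fit in enumerate(factories):
        votes = [[] for _ in X_test]
        for fold in folds:
            held_out = set(fold)
            fit_X = [x for i, x in enumerate(X_train) if i not in held_out]
            fit_y = [y for i, y in enumerate(y_train) if i not in held_out]
            predict = fit(fit_X, fit_y)  # one trained version of learner j
            for i in fold:
                meta_train[i][j] = predict(X_train[i])
            for t, x in enumerate(X_test):
                votes[t].append(predict(x))
        for t, labels in enumerate(votes):
            meta_test[t][j] = Counter(labels).most_common(1)[0][0]
    return meta_train, meta_test


def majority_learner(X, y):
    """Toy learner: always predicts the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour_learner(X, y):
    """Toy learner: predicts the label of the closest training point."""
    return lambda x: y[min(range(len(X)), key=lambda i: abs(X[i] - x))]


X_train = [0.1, 1.0, 0.2, 1.1, 0.3, 0.9, 0.15,
           1.05, 0.25, 0.95, 0.05, 1.2, 0.12, 1.15]
y_train = ["low", "medium"] * 7
X_test = [0.0, 1.3]
meta_train, meta_test = stacking_features(
    [majority_learner, nearest_neighbour_learner], X_train, y_train, X_test, k=7)
```

Each row of `meta_train` and `meta_test` holds one label per base learner, which is the vector the meta-learner consumes. Because every training row is predicted by a version that never saw that sample, the meta-learner is not trained on labels the base learners have memorized.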

Meta-Learner Selection.
A meta-learner uses the output of all the base learners as input and makes the final decision on the class label of a sample. Since the output of the base learners for a given sample forms a numeric vector, and the results of different base learners do not affect each other, the meta-learner predicting with this vector need not be complicated. In this paper, we choose KNN, SVM, and NB as the candidate meta-learners. They have good performance in the ensemble learning framework and relatively short training time. We test the performance of the three models and select the best one to use.

Data Collection.
The experiment data in this article come from the Food Safety Department of China Customs. Each public opinion has been manually rated by the customs officers. The original data collection includes 21,145 samples of food safety public opinion. After deleting invalid information, the total number of data samples is 21,065, including 10,247 low-influence samples, 10,314 medium-influence samples, and 504 high-influence samples.

Model Settings.
The settings of each component machine learning model are shown in Table 1. Among the models, TextCNN, LSTM, and BERT need the reading length of the text to be set. As samples shorter than 1000 characters account for 98% of the total samples, the reading length of TextCNN and LSTM is set to 1000 characters, and any text beyond this length is cut off. For the reason explained in Section 5.1, the reading length of the BERT model is set to 128 characters, and we apply automatic summarization to compensate for the information loss.
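Fixing the reading length amounts to cutting long inputs and, for batch training, bringing short inputs up to the same length. The sketch below illustrates this; the function name `fit_length`, right-side padding, and the padding id 0 are assumptions, since the text only states that excess text is cut off.

```python
def fit_length(token_ids, max_len, pad_id=0):
    """Cut off tokens beyond max_len and right-pad shorter sequences.

    Assumption: padding side and padding id are illustrative choices.
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_sample = fit_length([7, 8, 9], max_len=5)          # padded to length 5
long_sample = fit_length([1, 2, 3, 4, 5, 6, 7], max_len=5)  # cut to length 5
```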

Evaluation Index.
To conduct a comprehensive evaluation of the model results, four evaluation indexes are used: accuracy, precision, recall, and F1-score. The calculation of these indicators is given in (2), where TP, TN, FP, and FN stand for the number of true positive, true negative, false positive, and false negative predictions of sample influence level.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (2)$$

Accuracy reflects the overall performance of a model. Precision reflects the share of samples predicted as a given class that truly belong to that class, and recall reflects the share of samples of a given class that are correctly predicted. The F1 value is the harmonic mean of precision and recall.
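As a worked illustration of the four indexes, the sketch below computes them for one class of a three-class problem by treating that class as positive and the others as negative. The function names and the toy labels are assumptions, and accuracy is computed as the fraction of correct predictions over all classes.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical expert labels and model predictions for six samples.
y_true = ["high", "high", "low", "medium", "low", "high"]
y_pred = ["high", "low", "low", "medium", "high", "high"]
precision, recall, f1 = class_metrics(y_true, y_pred, positive="high")
overall_accuracy = accuracy(y_true, y_pred)
```

For the high-influence class in this toy data, there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3, while 4 of the 6 predictions are correct overall.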

Minority Sample Generation and Filtering.
We use Word2Vec to train a word vector table of size 88,296 × 50 from the original 21,145 public opinion samples. After removing useless tokens such as punctuation and numbers, we get a table of size 62,981 × 50. Each row in the table corresponds to a Chinese word, and its similarity with another word is calculated through cosine similarity. For a real minority sample, we traverse each word in it and replace the word with its $i$-th most similar word in the word vector table. By ranging $i$ from 1 to 20, we obtain 10,080 new samples from the 504 real high-influence samples.
Word replacement only ensures the word-level similarity between a generated sample and the original sample, but what we actually need is for the semantic meaning of the two samples to be similar. To achieve this goal, we train Siamese LSTM networks to filter out the generated samples whose semantic similarity with the original sample is low. Adopting a semantic similarity threshold of 0.8, we retain 5544 high-influence samples constructed from word replacement. Together with the 504 real samples, the final set of high-influence samples has a size of 6048.

Meta-Learner Selection Based on Performance.
Three types of meta-learner have been tested using the balanced samples. Their performances are shown in Table 2. From Table 2 we can see that NB has the best accuracy, which is 0.8530. So we choose NB as the meta-learner in our ensemble learning model.

Result and Analysis
To show the effectiveness of the proposed ensemble learning framework and sample balancing method, we present results of influence assessment in three scenarios. The first is the result of base learners and the ensemble model with original and balanced samples as input (Table 3). The second is the result of some base learners under 3 ways of sample balancing: none, SMOTE, and replacement-filtering (Table 4). The third is the performance of influence assessment for each sample class (Figure 4).
It can be seen from Table 3 that before sample balancing, only FastText and LSTM achieve an accuracy of more than 0.8. After using the replacement-filtering measure to process unbalanced samples, all base learners achieve an accuracy of more than 0.8, and the ensemble model proposed in this paper has the highest score of 0.8530. After processing unbalanced samples, the best performance of a single learner is an accuracy of 0.8494, achieved by LSTM. BERT represents a more advanced model than LSTM in processing text, but in this paper BERT outperforms only the NB model in predicting public opinion influence. This is due to the limitation of input text length caused by hardware constraints. From the above results we can see that both the ensemble learning framework and the replacement-filtering oversampling measure improve the performance of public opinion assessment. While the ensemble learning framework achieves a 0.42% improvement over the best single learner, the oversampling measure achieves an improvement of 5.8%. Moreover, in real applications of artificial intelligence, people usually consider an accuracy above 0.85 as the baseline, so in this sense, only the result of the ensemble learning model with balanced samples meets the requirement of real application.
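As a quick check, the 0.42% figure is consistent with a relative gain of the ensemble accuracy (0.8530) over the best single learner (0.8494). Reading the improvement as relative rather than absolute is an assumption, since the text does not state which is meant:

```latex
\[
\frac{0.8530 - 0.8494}{0.8494} = \frac{0.0036}{0.8494} \approx 0.0042 = 0.42\%
\]
```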
Table 1: Settings of each component machine learning model.

TextCNN
Keras-based implementation of a TextCNN [32]-like CNN, with a dropout layer after the embedding layer (dropout rate = 0.2); the 1D convolutional layer has 250 filters (kernel length = 3); a 3-max pooling layer follows and is followed by a flatten layer, a 50-unit dense layer, and a 3-unit softmax layer; the activation function of the convolutional layer and the dense layer is ReLU; input length = 1000, batch size = 256, epochs = 5.

LSTM
Keras-based implementation of LSTM; the embedding layer is connected to an LSTM layer with 200 neurons, where a 0.2 dropout rate is applied to the input and recurrent state; following the LSTM layer is a dropout layer (dropout rate = 0.2), a 64-unit dense layer (ReLU activation function), and a 3-unit softmax layer; input length = 1000, batch size = 128, epochs = 5, Adam optimizer, learning rate = 0.01.

BERT
Chinese pretrained model, L = 12, H = 768, A = 12; batch size = 32, epochs = 5, learning rate = 2e-5; input length = 128.

KNN
Parameters follow the default setting of the sklearn neighbors model.

To verify the advantage of the proposed oversampling method over traditional oversampling methods, we compare the results of some base learners under three ways of sample balancing: none, SMOTE, and replacement-filtering (Table 4). From Table 4 we can see that both oversampling methods can improve the accuracy of influence level prediction. From the perspective of improved performance, the SMOTE method has a better improvement effect on non-deep learning models, and the replacement-filtering method proposed in this paper has a good improvement effect on both non-deep learning models and deep learning models.
An important capability of a public opinion assessment model is to recognize high-influence samples, as these samples are more likely to trigger public events. Figure 4 shows the class-level results of influence assessment. It can be seen that the abilities of different models to distinguish between low- and medium-influence-level samples are close. For high-influence-level samples, however, the performances of different models vary greatly. Among single models, SVM and LSTM are better than NB and CNN at recognizing high-influence public opinion with unbalanced samples. The ensemble model has high precision and low recall when recognizing high-influence public opinion with unbalanced samples, but when the samples are balanced, it has the highest precision, recall, and overall performance in recognizing high-influence public opinion.

Conclusion
In this paper we study the problem of assessing the influence level of food safety public opinion. An ensemble machine learning model is proposed to classify food safety public opinions into three influence levels: high, medium, and low. Given that the number of high-influence public opinion samples is much smaller than that of low-influence samples, an oversampling method is proposed to balance the sample numbers and improve the assessment accuracy. The oversampling method includes using synonym replacement to generate pseudo-high-influence samples and using a Siamese LSTM neural network to filter out low-quality pseudo-samples. Experiments with real data collected from the Food Safety Department of China Customs show that the ensemble machine learning model outperforms single machine learning models including NB, SVM, XGBoost, FastText, TextCNN, LSTM, and BERT in terms of assessment accuracy. The oversampling operation is also shown to be beneficial, as after sample balancing, the accuracy of recognizing high-influence samples reaches more than 0.9 and the F1-score rises from below 0.7 to above 0.9. The result of the study shows that the proposed method can be used in real life to optimize the trade-off between accuracy and efficiency of food safety public opinion assessment.
As pointed out by Zheng [37], it is important to do reproducible research by making the data and model definite. Regarding the oversampling model and ensemble learning model proposed in this paper, the result can be made reproducible if the random seeds used in these models are fixed. However, we have not studied how to tactically eliminate the randomness of the model to achieve beneficial effects such as avoiding the selection of outliers during sampling [38]. This kind of study is left for future research.

Data Availability
Data are available upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.