Towards Accurate Deceptive Opinion Spam Detection based on Word Order-preserving CNN

As a major form of malicious activity on the Internet, deceptive opinion spam does great harm. With the rapid development of the Internet, identifying deceptive opinion spam has become increasingly important, and effectively distinguishing real opinions from deceptive ones plays an important role in maintaining and improving the online environment. Deceptive opinions are very short and vary widely in type and content. To identify them effectively, besides the textual semantics and emotional polarity that are widely used in text analysis, we need to further capture the deep features of deceptive opinions. In this paper, we start from the traditional convolutional neural network (CNN) and improve it with respect to word order through a method called word order-preserving k-max pooling, which makes the CNN more suitable for text classification. Experiments show that the improved model achieves better deceptive opinion spam detection.


I. INTRODUCTION
Nowadays, with the rapid development of the Internet, online opinions on products and services have become closely tied to people's lives. A study has shown that although more than 90% of users have published only one opinion [1], these opinions are numerous enough, given the huge number of users, to influence the rating and reputation of a product. These opinions contain rich subjective information on particular topics, and this information has become an important resource that influences customer decisions across an extremely wide spectrum of daily and professional activities. As a result, sentiment analysis and opinion mining based on product opinions have become a hot topic in text classification.
Because opinions guide people's purchasing behavior, positive opinions can bring substantial economic benefit and reputation to organizations or individuals. This gives a powerful incentive for producing deceptive opinion spam [2], [3], [4]. A deceptive opinion is a fictitious opinion, deliberately written to sound authentic, that promotes or defames a product in order to mislead consumers' views and behavior [5]. The existence of deceptive opinions therefore makes it difficult for customers who lack relevant experience to make accurate judgments. It affects the consumer experience and product sales, and it ultimately damages the Internet environment.
Therefore, correctly distinguishing deceptive opinions from real opinions is an urgent need for the development of the Internet and has become a hot research topic.
Research on deceptive opinion spam detection mainly covers three aspects: the choice of data sets, the choice of classification methods, and the exploration of features. For the training data set, researchers either construct the data set through semantic analysis of the comments [8] or obtain opinion data directly from websites [24,25,27]. They can also build data sets with labels based on rules such as emotional polarity and semantic content similarity. The classification methods draw on many classifiers from machine learning, such as the support vector machine [11]. Studies of opinion features are mainly based on the opinion text [12,13,14,15], on the combination of opinion text and user behavior [9,10], and on the behavioral similarity between a user and other users [15]. These works describe deceptive opinions from different angles and build models that identify them effectively. However, deceptive opinions are short and vary widely in type and content. To identify deceptive opinion spam effectively, besides the textual semantics and emotional polarity that are widely used in text analysis, we need to further capture the deep features of deceptive opinions.
To address these problems, this paper uses a deep learning model for opinion spam detection. Deep learning can select and organize features from high-dimensional data and update the model parameters dynamically according to feedback, so it can detect deceptive opinion spam flexibly. However, deceptive opinions are short and diverse in form and content, and existing deep learning text models are difficult to apply directly to deceptive opinion detection. To identify deceptive opinions effectively, we need to capture their fundamental features beyond the textual semantics and emotional polarity that are widely used in text analysis. Considering the short length and varied forms of opinions, we introduce word order into deceptive opinion analysis and thereby expand the scope of feature mining. To mine and merge the text features of opinions effectively, this paper proposes a word order-preserving k-max pooling operation based on the CNN, which keeps the word order features during feature mining with the CNN. It deepens the characterization of the comments and thus improves deceptive opinion spam detection.
Contributions. The main contributions are as follows:
• Considering that deceptive opinion texts are short and take various forms, the paper introduces word order into the deceptive opinion analysis process and extends the scope of deceptive opinion features. To mine and merge the text features of the comments effectively, it proposes a word order-preserving k-max pooling operation based on the CNN, which preserves word order characteristics during text feature mining.
• We implement a word order-preserving CNN model with TensorFlow, a deep learning framework. The experimental results show that, compared with the standard CNN, the improved CNN achieves better deceptive opinion spam detection.
Organization. Section II introduces the related work. Section III gives the details of our proposed neural model. Section IV introduces the experimental setup and then reports the experimental results and analysis. We conclude this work in Section V.

A. Background
In recent years, deep learning models have been used to study deceptive opinion spam detection. Among them, the CNN has become a research hotspot. The CNN offers good fault tolerance, parallel processing and self-learning ability [6], and is widely used in image processing, speech recognition, natural language processing and other fields, including text classification. Compared with another popular neural network, the recurrent neural network (RNN) [7], the CNN produces similar results in text classification. Moreover, because an opinion is generally a short sentence and convolution is able to summarize the overall structure of a sentence, the CNN is slightly more accurate on short text. The CNN also trains faster than the RNN, which saves time. Since deceptive opinions are limited in length, compact in structure, and self-contained in meaning, which are the characteristics of a short-text analysis task, the CNN is well suited to deceptive opinion spam detection.

B. Related Work
In research on deceptive opinion spam detection, Jindal and Liu [4] first studied the problem and trained models using features based on the opinion content, the user, and the product itself. Ott et al. [8] created a benchmark dataset by employing Turkers to write fake opinions. Fei et al. [9] observed that opinions often arrive in sudden bursts, caused either by the sudden popularity of a product or by a sudden influx of fake opinions, some of which mimic the features of real users. They used a Markov Random Field (MRF) to model reviewers and their co-occurrence in bursts by building a network of the reviewers who appear in different bursts, and then used Belief Propagation (BP) to infer whether a user is a fake user. Wang et al. [10] proposed a heterogeneous opinion graph model to capture the relationships among users, their opinions and the shops, used the interactions and roles of the nodes in the graph to reveal the causes of deceptive opinions, and designed an iterative algorithm to identify deceptive opinion spam. Mukherjee et al. [11] found that, in the Yelp data set, more than 70% of deceptive opinion publishers issued opinions whose mutual similarity is greater than 0.3, whereas the opinions of real publishers had a similarity below 0.18. The content similarity among opinions written by the same reviewer can therefore reflect the reviewer's behavior.
A neural network connects a large number of neurons into a complex system with adaptive and self-learning ability, and is suitable for data whose inherent characteristics are unclear. As a newer class of neural network model, deep learning models can learn the characteristics of real-world objects from large-scale data sets, and these features can be applied directly in various computing models. There have been some studies using deep learning models to identify deceptive opinion spam. Raymond's team [12] built a semantic language model to identify semantically repetitive opinions and detect deceptive ones. However, because opinions naturally share a certain degree of semantic similarity and repeated content, this approach may produce misclassifications. Li et al. [13] took word vectors as input to a CNN and showed that the emotional polarity feature can also be applied in unsupervised methods for deceptive opinion text detection. However, emotional polarity alone is not sufficient for identifying deceptive opinions, and the local sampling of the CNN model cannot take the word order of the text into account. Jindal [14] argued that a user who gives all-positive or all-negative opinions to products of the same brand behaves abnormally, and that the corresponding opinions may be deceptive. The researchers proposed "one-condition rules" and "two-condition rules" models to predict the falseness of the text probabilistically. Yapeng Jing [15] built a data set of hotel opinions on AMT, used information gain to select bag-of-words features, and then detected deceptive opinion spam with an ordinary neural network, a DBN-DNN network and an LBP network. However, an artificially constructed data set cannot accurately reflect real opinions.
Moreover, selecting features with the bag-of-words method affects the recognition results, since it ignores the word order information of the text. To address these issues, we collect real opinion data from public online reviews. Considering that opinions are short and varied in form, we introduce word order into the deceptive opinion spam analysis process to obtain rich and deep features during feature extraction.
In essence, the detection of deceptive opinion spam is a text classification problem. Deceptive opinions are very short and vary widely in type and content. To identify them effectively, besides the textual semantics and emotional polarity that are widely used in text analysis, we need to further capture their deep features. Therefore, we introduce word order into the deep learning model, design a word order-preserving k-max pooling technique, and expand the range of deceptive opinion feature mining, in order to overcome the difficulties in identifying deceptive opinions and improve the accuracy of deceptive opinion spam detection.

III. DESCRIPTION OF OPCNN MODEL
This paper proposes the OpCNN model, which incorporates word order into the CNN, and feeds the opinion data into the OpCNN model to distinguish deceptive opinions from real opinions.

A. Chinese Word Order
Many languages have a basic ordering of the subject (S), object (O) and verb (V). Among the languages of the world, all six possible basic word orders exist [16], in particular SVO (Subject-Verb-Object) and SOV (Subject-Object-Verb). Studies have suggested that the earliest human language had rigid SOV order. Today, SOV basic word order is common among the languages of the world, and many other word orders can be reconstructed back to an SOV stage, which suggests that SOV was the word order of the 'ancestral language' among the six possible word orders [17], [18]. Some studies have also shown that, besides SOV, SVO is a prominent word order. For example, in a sentence like 'fireman kicks boy', both nouns could in principle be the agent; SVO avoids placing two plausible agents ('fireman' and 'boy') on the same side of the verb, as SOV would [19]. Whether it is SOV or SVO, word order is a very important feature of language. Chinese text also has word order, and this word order is an inherent feature for text classification. In this paper, to describe short opinion text, we introduce the word order feature into deceptive opinion feature mining and optimize the CNN model to identify deceptive opinions. Moreover, we use sentences with their word order as the input of our model to support the idea of word order preservation.

B. OpCNN model
Typically, a CNN includes four parts: the input layer, convolution layer, pooling layer and output layer. On this basis, this paper presents an improved four-layer OpCNN model that takes Chinese word order into account. The input layer takes sentences with their word order as input. In the pooling layer, we use the k-max pooling method instead of the original max pooling method and tune the OpCNN model parameters. Unlike a CNN used to process images, the OpCNN model used for text analysis in this paper treats the word as the minimum granularity. Fig.1 shows the OpCNN. We now describe the network in detail.

1) Input Layer:
In the input layer, we use word vectors, which represent each word in a dense space [6], as the input for text classification. The input layer consists of the word vectors of the sentence, stacked from top to bottom into a two-dimensional matrix. By analogy with image processing, this matrix can be regarded as an image of size n × k with 1 × 1 pixels, where n is the number of words and k is the dimension of the word vector. Unlike in image processing, however, the k dimension cannot be split in text analysis: each 1 × k row is treated as a unit. For an opinion statement, each word in the sentence has a corresponding word vector representation, as shown in Eq.1. In this way, each opinion can be expressed as a two-dimensional word vector matrix, which we feed into the OpCNN model.
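As a minimal sketch of this input layer, the snippet below stacks word vectors into an n × k matrix. The toy embedding table, the function name `sentence_to_matrix`, and the choice to map unknown words to a zero vector are illustrative assumptions; the paper loads pretrained word vectors and does not state how out-of-vocabulary words are handled.

```python
from typing import Dict, List


def sentence_to_matrix(tokens: List[str],
                       embeddings: Dict[str, List[float]],
                       dim: int) -> List[List[float]]:
    """Stack one word vector per token, top to bottom, into an n x k matrix.

    Words missing from the table become zero vectors (an assumption made
    for this sketch, not something the paper specifies).
    """
    zero = [0.0] * dim
    return [list(embeddings.get(tok, zero)) for tok in tokens]


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "hotel": [0.1, 0.3, -0.2],
    "very": [0.0, 0.5, 0.4],
    "clean": [0.7, -0.1, 0.2],
}

matrix = sentence_to_matrix(["hotel", "very", "clean", "indeed"],
                            toy_embeddings, dim=3)
# Each row is one word, in sentence order; each row has k = 3 entries.
```

Because the rows follow the sentence from first word to last, the word order of the opinion is carried into the model by the row index.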
2) Convolution Layer: The convolution layer is the second layer of the OpCNN model. The input layer passes the word vector matrix to the convolution layer for the convolution operation, as shown in Eq.2. The size of the convolution window is h × k, where h is the number of words covered vertically and k is the dimension of the word vector. Sliding this window over the matrix yields several single-column feature maps. Since the length of the word vector is fixed, one dimension of the convolution kernel is fixed as well, and only the window size h needs to be set and tuned in the experiments.
For each window position, the input values in the window are converted into a feature value by the nonlinear transformation of the network. As the window moves down, the feature values for the convolution kernel are generated one by one and together form the feature vector of that kernel. We use the ReLU activation function as the nonlinear transformation, as shown in Eq.3. Compared with sigmoid or tanh, ReLU needs only a threshold to obtain the activation value, rather than more expensive computations.
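The sliding-window convolution and the ReLU activation described above can be sketched in plain Python as follows. The function names, the toy matrix and the kernel values are assumptions made for illustration; the paper's model is implemented in TensorFlow and learns its kernels during training.

```python
from typing import List


def relu(x: float) -> float:
    """ReLU activation: threshold the value at zero."""
    return max(0.0, x)


def conv_feature_map(matrix: List[List[float]],
                     kernel: List[List[float]],
                     bias: float = 0.0) -> List[float]:
    """Slide an h x k kernel down an n x k sentence matrix.

    The kernel spans the full word-vector dimension, so it only moves
    vertically and produces a single-column feature map of length n - h + 1.
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for offset in range(h):
            row = matrix[start + offset]
            total += sum(a * b for a, b in zip(row, kernel[offset]))
        feature_map.append(relu(total))
    return feature_map


# Toy 4-word sentence with 2-dimensional word vectors and a window of h = 2.
toy_matrix = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
toy_kernel = [[1.0, 1.0], [1.0, -1.0]]
fmap = conv_feature_map(toy_matrix, toy_kernel)  # [0.0, 1.0, 2.0]
```

The first window has a pre-activation of -1, which ReLU clips to 0; the other two windows pass through unchanged.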

3) Pooling Layer:
The pooling layer reduces the number of parameters by reducing the dimension of the convolution layer's output. Conventionally, the pooling layer uses the max pooling method, whose idea is to extract the maximum value from the one-dimensional feature map produced by the preceding convolution layer and discard the other feature values. The final output of the pooling layer is then the maximum value of each feature map, collected in a one-dimensional vector. Max pooling provides invariance to the position and rotation of a feature, which makes it very suitable for image processing. In text analysis, however, this property hurts accuracy, because Chinese text has word order and is therefore position-dependent. The position of each word in a sentence is a very important feature in text analysis, so it is particularly important to preserve positional information. Furthermore, if a strong feature value occurs multiple times, taking a single maximum yields only one result and loses the information about how often the feature is strongly activated. Thus, the k-max pooling method is used in the pooling layer to replace the original max pooling method.
For a value k and a sequence c whose length is at least k, the k-max pooling method selects the subsequence $c^{k}_{\max}$ consisting of the k highest values in c. The order of the values in $c^{k}_{\max}$ corresponds to their original order in c, as shown in Eq.4. Compared with max pooling, the k-max pooling method can discern more finely the number of times a feature is highly activated in c and the progression by which the high activations of the feature change across c [20]. Perhaps more importantly, it can pool the top k features in c even when they are a number of positions apart, while preserving their order. The method thus maintains the original order of the words in the Chinese text to some extent and remedies the shortcomings of the original pooling method in Chinese text classification.

Algorithm 1 Deceptive opinion spam detection
Input: preprocessed comments input, parameter k
Output: classification results result
    file = getfile()                        // Get the sample file
    label = getlabel(file)                  // Get the label
    test = gettest(file)                    // Get the text
    vec = getword2vec()                     // Load the word vector
    random = random(label)                  // Randomized
    while condition do
        kf = CV(len(xshuffle), nf)          // Cross-validation
        for trindex, teindex in kf do
            xtotal, ytotal = xshuffle[trindex], yshuffle[trindex]
            xtrain, xdev, ytrain, ydev = split(xtotal, ytotal)   // Split the data set
            for i < k do
                conv = getconv()            // Convolution layer
                h = relu(conv)
                k = getk()                  // Get the value of k
                tensorr = gettensor()
                for x, y in xtrain, ytrain do
                    value, indice = topk(tensorr)   // Get the feature and location information
                    tensors = get(value, indice)    // Get the corresponding tensor
                    tensora = append(tensors)
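The word order-preserving k-max pooling step can be illustrated with a short pure-Python sketch. The function name `kmax_pooling` and the sample feature map are assumptions for illustration; the paper's TensorFlow implementation relies on a top-k operation that returns both values and indices, and this sketch mirrors that idea by selecting indices first and then restoring their original order.

```python
from typing import List


def kmax_pooling(feature_map: List[float], k: int) -> List[float]:
    """Return the k largest values of feature_map in their original order."""
    if not 0 < k <= len(feature_map):
        raise ValueError("k must be between 1 and the feature map length")
    # Indices of the k largest activations (ties keep the earlier position,
    # because Python's sort is stable even with reverse=True).
    top_indices = sorted(range(len(feature_map)),
                         key=lambda i: feature_map[i],
                         reverse=True)[:k]
    # Re-sorting the indices restores the order in which the features occurred.
    return [feature_map[i] for i in sorted(top_indices)]


activations = [0.2, 0.9, 0.1, 0.7, 0.9, 0.3]
pooled = kmax_pooling(activations, k=3)   # [0.9, 0.7, 0.9]
max_pooled = max(activations)             # 0.9, plain max pooling
```

Plain max pooling keeps a single 0.9 and cannot tell that the feature fired strongly twice, with a weaker activation in between; the k-max output retains both the repetition and the sequence.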

4) Output Layer:
The last layer of the OpCNN model is the output layer. The features obtained from the pooling layer are passed through a fully connected layer, and the result is fed into a logistic regression model to estimate the probability that the comment is deceptive. The paper uses the softmax function as the regression function and cross entropy as the loss function of the model.
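A minimal sketch of this output layer, assuming two classes with index 0 for "deceptive" and index 1 for "real" (the class ordering, the helper names and the weight values are all illustrative assumptions, not taken from the paper):

```python
import math
from typing import List


def dense(features: List[float],
          weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """Fully connected layer: one weight row and one bias per output class."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, biases)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the class scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Cross-entropy loss for a single example with a one-hot label."""
    return -math.log(probs[true_class])


pooled_features = [0.9, 0.7, 0.9]                 # e.g. output of k-max pooling
toy_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # hypothetical, untrained
toy_biases = [0.0, 0.0]

logits = dense(pooled_features, toy_weights, toy_biases)
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
```

During training, the loss would be averaged over a mini-batch and minimized with respect to the kernel and fully connected weights; that optimization loop is omitted here.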

C. Spam Detection Algorithm
The OpCNN model was introduced in detail above. For deceptive opinion spam detection, we manually annotate the opinion data collected from public online reviews, construct the word vector model, and preprocess the experimental data. We then feed the word vectors into the OpCNN to obtain the final text classification results, which separate the opinions into two classes, deceptive and real. The complete process of deceptive opinion spam detection in this paper is shown in Algorithm 1, which proceeds by iteration. Let the number of iterations be k, the number of samples per input to the OpCNN model be m, the number of words in each sentence be v, the word vector dimension be d, the convolution window size be w, and the number of output channels be n. The model processes one sentence with a time complexity of $O(v \cdot n \cdot (2d \cdot w^2 + w - 1))$. When the model performs k iterations with m samples per input, the time complexity of the OpCNN model can be expressed as $O(w^2 \cdot k \cdot m \cdot n \cdot d \cdot v)$.

A. Experimental Data Set
At present, the most widely used data set for identifying deceptive opinions is the Ott data set [8]. However, the study in [21] shows that the distribution of the fake opinions in the Ott data set, which was constructed with AMT, differs considerably from the true distribution of deceptive opinions in reality. The data used in this paper were crawled from public online reviews: 23,166 hotel opinions, each with its user, rating (1 to 5 points) and publication time. On dianping.com, anyone can write an opinion and give a rating regardless of whether they have bought the product or service. Review and rating manipulation therefore become easier, and fake opinions are more likely to appear, which provides data support for the deceptive opinion identification in this paper. The collected data are then manually annotated. We use the annotation approach presented by Li [22], a relatively rigorous strategy, to label the 23,166 opinions; we also refer to the manual labeling scheme provided in [23], with some improvements. Eventually, all 23,166 hotel opinions, with their word order preserved, are labeled, and 2132 of them are fake. The data used in the experiments are shown in Table 1. In this experiment, 80% of the samples are used as the training set and 20% as the test set. In addition, to illustrate the generalization ability of the method used in this paper, the data set proposed in [8] is used as a control.

B. Implementation
In order to verify the accuracy of the proposed spam detection scheme, we construct three baselines and one experimental group: (1) The first baseline uses the classical statistical method tf-idf for feature extraction and a support vector machine (SVM) as the classifier [24], trained in a supervised manner on the labeled data described above.
(2) The second baseline uses bigrams to extract the features [25]. The bigram model assumes that the probability of a word in a statement depends only on the one word preceding it; that is, the context of a word is defined as the word that appears immediately before it [26]. Two consecutive characters often have the ability to represent the features of the text. A support vector machine (SVM) is then used as the classifier to obtain the classification result.
(3) The third baseline uses the convolutional neural network (CNN) within a deep learning framework [27], combined with short-text feature extraction, to apply the CNN to deceptive comment identification. The experiment uses 3-fold cross-validation to tune the hyperparameters of the classifier. The specific parameters are shown in Table 2. We use ReLU as the nonlinear function, and the L2 weight decay hyperparameter is set to 0.5. Other parameters include a dropout rate of 0.5 and a mini-batch size of 50.
(4) In the experimental group, we implement the OpCNN described in the section above. We set the parameters of OpCNN to the same values as the CNN in the third baseline so that the comparison is fair.
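The feature extraction behind the first two baselines can be sketched as follows. The paper does not state which tf-idf variant it uses, so the formula here (relative term frequency times the logarithm of the inverse document frequency) is an assumption, as are the function names and the toy documents; the SVM classifier itself is omitted.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pairs of adjacent tokens (words or characters), in order."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document weights: tf(t, d) * log(N / df(t)), one common variant."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


toy_docs = [["good", "hotel"], ["bad", "hotel"]]
toy_weights = tf_idf(toy_docs)
toy_bigrams = bigrams(["the", "room", "was", "clean"])
```

In this toy corpus "hotel" appears in every document and receives weight 0, while "good" and "bad" are the discriminative terms. The bigram features keep local word order, which tf-idf over single terms discards.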

C. Evaluation Metrics
To evaluate the experimental scheme, we use five metrics: accuracy, precision, recall, F1-score and accuracy gain.
Accuracy (A): The ratio of the samples correctly classified by the classifier to the total number of samples in a given test data set, that is, the accuracy on the test data set under the 0-1 loss. True positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) are the related counts.
$$A = \frac{TP + TN}{TP + FP + FN + TN}$$
Precision (P): The ratio of all correctly retrieved items (TP) to all actually retrieved items (TP + FP).
$$P = \frac{TP}{TP + FP}$$
Recall (R): The ratio of all correctly retrieved items (TP) to all items that should be retrieved (TP + FN).
$$R = \frac{TP}{TP + FN}$$
F1-score: The harmonic mean of precision and recall.
$$F_1 = \frac{2PR}{P + R}$$
Accuracy gain (α): The ratio of the accuracy of the experimental group method, $F_e$, to the accuracy of the control group method, $F_c$.
$$\alpha = \frac{F_e}{F_c}$$
The larger α is, the more the accuracy of the experimental group exceeds that of the control group: a value of α above 1 means the experimental group is more accurate than the control group, and a value below 1 means it is less accurate.
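The five metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The function names and the example counts are illustrative assumptions; the two accuracy values passed to `accuracy_gain` are the OpCNN and CNN accuracies reported in this paper (70.10% and 68.91%).

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def accuracy_gain(f_e: float, f_c: float) -> float:
    """alpha = F_e / F_c: experimental accuracy relative to the control."""
    return f_e / f_c


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)

# OpCNN accuracy versus CNN accuracy as reported in the paper.
alpha = accuracy_gain(0.7010, 0.6891)
```

With these counts the accuracy is 0.7, the precision 0.8 and the recall 2/3; α is slightly above 1, consistent with OpCNN being the more accurate of the two models.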

1) K Value Selection:
In the previous section we mentioned that the k-max pooling method is used in the pooling layer instead of the original max pooling method. The essence of the k-max pooling method is the top-k function, so the choice of k is particularly important. In this experiment, we discuss the effect of the value of k on the OpCNN model, using the accuracy rate as the evaluation index.
2) Accuracy Analysis:
In this experiment, the classification results of the three baselines are evaluated with three indexes: accuracy, recall and F1-score [25]. The specific experimental results are shown in Table 4. Table 4 shows that the accuracy, recall and F1-score of CNN are 68.91%, 64.70% and 69.65% respectively, an improvement over tf-idf and Bigram. Compared with tf-idf+SVM, Bigram takes word order into account to a certain extent when segmenting the text. Compared with these two methods, the CNN can explore higher-dimensional features and reduce the impact of data sparseness, which improves text classification. This leads to the hypothesis that combining the CNN with word order will yield even better results.
Because Chinese word order plays an important role in deceptive opinion spam detection, the k-max pooling method is used to improve the pooling layer of the traditional CNN, making it more suitable for text classification. The value of k is set according to the experiment above. The experimental group uses the OpCNN model with parameters consistent with those of the CNN. The classification results of the CNN and OpCNN models are evaluated by accuracy, recall and F1-score. The specific experimental results are shown in Table 3.
As Table 3 shows, compared with the 68.91%, 64.70% and 69.65% of the CNN, our method achieves an accuracy, recall and F1-score of 70.10%, 66.83% and 69.88% respectively. In Chinese text classification, compared with the max pooling method used by the pooling layer, the k-max pooling method addresses the word order problem of Chinese text to some extent and, as argued earlier in this paper, improves the classification results.
3) Scalability Analysis: In order to validate the generalization ability of the proposed method, as mentioned at the beginning of this section, this experiment applies the CNN and OpCNN models to the data set proposed by Ott et al. [8]. We use the same parameters and take precision, recall and F1-score as the evaluation indexes. The specific experimental results are shown in Table 4. [...] not due to a dependency of the method on the data set used in this paper.

E. Effect of Sample Size
To further verify the performance advantage of OpCNN over other classification methods in deceptive opinion spam detection, we compare the classification results while varying the size of the training set. The evaluation index is the accuracy gain (α). To prevent class imbalance from affecting the experimental results, this experiment uses the same number of deceptive and real opinions. The effect of the number of training samples on the experimental results is shown in Fig.3.
The experimental results show that, compared with the other methods, the classification method used in this paper obtains a value of α greater than 1. Moreover, as the number of samples increases, the accuracy of the OpCNN model increases relative to the three control groups, as shown in Fig.4. Since OpCNN and CNN are data-driven, the more training samples there are, the stronger the representational ability of the deep model and the higher the accuracy. Compared with CNN, OpCNN addresses the influence of word order on Chinese text classification to a certain extent, so its accuracy is higher. When the number of samples reaches 3000, the accuracy of the OpCNN model levels off and tends to be stable.
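The balanced sampling used in this experiment, with equal numbers of deceptive and real opinions at each training-set size, could be drawn as in the sketch below. The label convention (1 for deceptive, 0 for real), the function name, the fixed seed and the toy data are assumptions for illustration; the paper does not describe its sampling procedure in detail.

```python
import random
from typing import List, Sequence, Tuple


def balanced_subset(samples: Sequence[str],
                    labels: Sequence[int],
                    per_class: int,
                    seed: int = 0) -> List[Tuple[str, int]]:
    """Draw per_class deceptive (label 1) and per_class real (label 0) samples."""
    rng = random.Random(seed)
    deceptive = [s for s, y in zip(samples, labels) if y == 1]
    real = [s for s, y in zip(samples, labels) if y == 0]
    if per_class > min(len(deceptive), len(real)):
        raise ValueError("not enough samples in the smaller class")
    chosen = ([(s, 1) for s in rng.sample(deceptive, per_class)]
              + [(s, 0) for s in rng.sample(real, per_class)])
    rng.shuffle(chosen)  # avoid presenting one class before the other
    return chosen


# Toy corpus: 3 deceptive and 7 real opinions.
toy_samples = [f"opinion {i}" for i in range(10)]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
subset = balanced_subset(toy_samples, toy_labels, per_class=3)
```

Repeating this draw with increasing `per_class` values gives the series of training-set sizes over which the accuracy gain α is compared.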

V. CONCLUSION
In this paper, the CNN, a deep learning model, is used to identify deceptive opinion spam. Because opinion texts are short and varied in form, we introduce word order into the deceptive opinion analysis process and extend the scope of opinion features. To mine and merge the features of the opinion text effectively, this paper proposes a word order-preserving k-max pooling operation on the basis of the CNN. The word order feature is preserved during text feature mining with the CNN, and the depth of the opinion features is improved. Experiments show that the improved CNN model proposed in this paper improves deceptive opinion spam detection. However, this work still has some shortcomings. Manual annotation is labor-intensive, and because annotation is subjective, the manually labeled data may be biased by each annotator's own judgment. In future work, we will address these deficiencies to further improve the accuracy of opinion spam identification.