Automatic Approach of Sentiment Lexicon Generation for Mobile Shopping Reviews

The dramatic increase in the use of smartphones has allowed people to comment on various products at any time. The analysis of the sentiment of users’ product reviews largely depends on the quality of sentiment lexicons. Thus, the generation of high-quality sentiment lexicons is a critical topic. In this paper, we propose an automatic approach for constructing a domain-specific sentiment lexicon by considering the relationship between sentiment words and product features in mobile shopping reviews. The approach first selects sentiment words and product features from original reviews and mines the relationship between them using an improved pointwise mutual information algorithm. Second, sentiment words that are related to mobile shopping are clustered into categories to form sentiment dimensions. At each sentiment dimension, each sentiment word can take the value of 0 or 1, where 1 indicates that the word belongs to a particular category whereas 0 indicates that it does not belong to that category. The generated lexicon is evaluated by constructing a sentiment classification task using several product reviews written in both Chinese and English. Two popular non-domain-specific sentiment lexicons as well as state-of-the-art machine-learning and deep-learning models are chosen as benchmarks, and the experimental results show that our sentiment lexicons outperform the benchmarks with statistically significant differences, thus demonstrating the effectiveness of the proposed approach.


Introduction
With the rapid development of smartphones, mobile shopping, which is already popular, is expected to grow faster. After shopping, people provide a large number of reviews about different kinds of products on the Internet. Different products could be preferred by different consumer groups. Hence, it is becoming increasingly important to learn about a customer's emotional inclinations and favorite products through online reviews. Sentiment classification can be performed using machine-learning, lexicon-based, and hybrid approaches. Sentiment lexicons are important resources for these approaches. The analysis of sentiment orientation is widely known as a domain-specific task. However, almost all the existing sentiment lexicons are general lexicons, which are not suitable for the analysis of product reviews on the Internet. Thus, automatic construction methods for sentiment lexicons have attracted increasing attention recently, especially methods for constructing sentiment lexicons aimed at mobile shopping.
Sentiment analysis, which is also called opinion mining, review mining (appraisal extraction), or attitude analysis, is the task of detecting, extracting, and classifying opinions, sentiments, and attitudes concerning different topics [1]. In a machine-learning approach, sentiment analysis can be considered as a supervised classification task. Pang et al. [2] solved the sentiment classification problem by training the classifier. However, most machine-learning approaches rely on features that are engineered by machine-learning methods. In a lexicon-based approach, a dictionary is created to judge whether the polarity of words in the text is positive or negative. For example, Turney [3] scanned a review for phrases that matched certain patterns (adjectives and adverbs) and then added up all sentiment orientations to compute the orientation of a document. A hybrid approach combines both the above approaches and has a relative advantage in sentiment analysis. Ortigosa et al. [4] developed a lexicon from a corpus and then chose sentiment words along with the labeled class as the input features for a machine-learning classification method. Sentiment lexicons play a key role in a majority of the above methods.
Wireless Communications and Mobile Computing
A sentiment lexicon (or an opinion lexicon) is a list of words and phrases that are commonly used to express positive or negative sentiments [5]. Researchers have proposed many approaches to compile these sentiment words. Technically, the existing automatic lexicon construction methods for both English and Chinese languages are mainly divided into corpus-based and knowledge-based methods. Turney [3] developed a corpus-based method in which the sentiment orientation of a word was judged by using pointwise mutual information (PMI) to describe the closeness of the word and seed words. Knowledge-based methods require a relatively complete knowledge base. Hu and Liu [6] constructed a sentiment lexicon by searching for the synonyms and antonyms of a word in WordNet. For a specific domain, the sentiment lexicon constructed from the corresponding domain corpus is more practical. When building a sentiment lexicon for online product reviews, the product features modified by sentiment words are also very important factors [7]. However, the existing general sentiment lexicons usually include only limited common words, and these words are divided into binary or other fixed categories according to the sentiment orientation.
In this paper, we present a novel method to construct a domain-specific sentiment lexicon by mining the relationship between sentiment words and product features in a specific corpus. In our approach, first, a sentiment matrix is constructed based on the relationship between sentiment words and product features. Every row of the sentiment matrix is regarded as a vector representation of the sentiment word. The sentiment words in the matrix space are clustered based on the distance between the vectors. Second, sentiment words that are related to mobile shopping are clustered into categories to form sentiment dimensions. In the process of building the sentiment matrix, the idea of term frequency-inverse document frequency (TFIDF) is utilized to screen the product features. Furthermore, the traditional PMI algorithm is improved to obtain a new algorithm called EPMI, which is more suited to mobile shopping reviews. Extensive experiments are performed on seven different domain product reviews, which include reviews in both Chinese and English. Compared to two popular general lexicons as well as state-of-the-art machine-learning and deep-learning models, our lexicon can obtain satisfactory classification performance. The experimental results also show that the filtering of product features and the application of the EPMI algorithm can greatly improve the performance of our lexicon for mobile shopping reviews.
The rest of the paper is structured as follows. Discussions on sentiment classification and lexicon generation and a review of the most recent research are presented in Section 2. Our methods for constructing the sentiment lexicon for mobile shopping reviews and a walk-through example of our methods are presented in Section 3. The experimental setup and results are described in Section 4. The conclusions of the paper are summarized in Section 5.

Related Work
This section is structured as follows. In the first part of this section, we review previous works on sentiment classification approaches. In the second part, we summarize works on approaches for sentiment lexicon creation. In addition, we briefly introduce the sentiment dimensions considered in the lexicon and product feature identification for product reviews.
2.1. Sentiment Classification. Sentiment classification aims to automatically classify the text of reviews written by customers into positive or negative opinions. Sentiment classification techniques can be roughly divided into machine-learning, lexicon-based, and hybrid approaches [8].
Machine-Learning Approaches. In such approaches, the analysis of customers' emotional inclinations is considered to be a problem of polarity classification. Pang et al. [2] applied three machine-learning methods (naive Bayes (NB), maximum entropy, and a support vector machine (SVM)) to sentiment classification as a form of traditional topic-based categorization. Zhang et al. [9] used machine learning (NB and SVM) to classify the sentiments expressed in restaurant reviews written in Cantonese. Li et al. [10] adopted extreme learning machine and deep-learning architecture to improve feature representations for text classification. Enríquez et al. [11] showed how a vector-based word representation obtained via Word2Vec can help in improving the results of a document classifier based on the bag-of-words model. However, these supervised machine-learning techniques require a large corpus of training data, and their performance is acceptable only if the match between the training and test data is good.
Lexicon-Based Approaches. These approaches adopt a lexicon to perform sentiment analysis by counting and weighting sentiment words that have been evaluated and tagged [12]. Nasukawa and Yi [13] developed a method to determine subject favorability by creating a sentiment lexicon containing 3513 sentiment terms. Qiu et al. [14] used a lexicon-based approach to identify sentiment sentences in contextual advertising. The most common lexicon resources are SentiWordNet, WordNet, and ConceptNet, and among these resources, SentiWordNet is the most widely used [15].
Hybrid Approaches. Nowadays, researchers are also using combined approaches, in which two or more approaches are combined to achieve better accuracy. Sindhwani and Melville [16] presented a unified framework in which lexical background information, unlabeled data, and labeled training examples can be effectively combined. Li et al. [17] set up a system to analyze the market impact by combining the stock price and news sentiment. Ortigosa et al. [4] performed sentiment classification and sentiment change detection on Facebook comments using a hybrid approach. They combined lexicon-based and machine-learning methods by considering a lexicon as the source of features and using a classification model to evaluate the lexicon; this approach is similar to the one used in our experiments in this study.

Lexicon Creation.
A sentiment lexicon is an important tool for identifying the sentiment polarity of reviews provided by mobile users [18]. Two methods are commonly used to generate sentiment lexicons: knowledge-based and corpus-based methods.
Knowledge-Based Methods. These methods exploit available lexicographical resources such as WordNet or HowNet. Hu and Liu [6] developed a lexicon by searching for the synonyms and antonyms of a word in WordNet. Kamps [19] inferred that the greater the closeness of two words, the smaller the number of iterations required to determine the synonymous relationship between the words. Both these studies used the relationship between words in a knowledge base. The main strategy in these methods is to first manually collect an initial seed set of sentiment words and their orientations and then search for their synonyms and antonyms in a knowledge base to expand this set [12]. However, very few complete and robust knowledge bases are available for the Chinese language.
Corpus-Based Methods. These methods depend on syntactic patterns or patterns that occur together along with a seed list of opinion words to find other opinion words in a large corpus [20]. Hatzivassiloglou and McKeown [21] found that, with a change in the emotional polarity in the text, the turning point appears but concatenation does not. Based on the idea that the emotional polarity of a word tends to be consistent with the emotional polarity of its neighboring words, Turney and Littman [22] constructed a dictionary from a large corpus. Both these works [21,22] are based on a corpus rather than a knowledge base. The corpus-based approach has a major advantage in that it can find domain-specific words and their orientations if a domain-specific corpus is used in the discovery process. Therefore, our work also focuses on a corpus-based approach. In addition, PMI is commonly used in this approach to exploit syntactic or cooccurrence patterns. Turney and Littman [22] used PMI and latent semantic analysis to measure the correlation between two words, and this method, which uses PMI to calculate the correlation between a word and a seed word, is called semantic-orientation PMI (SO-PMI). Yang et al. [23] introduced a method based on SO-PMI to construct a sentiment lexicon and improved the SO-PMI model based on user behavior. In the process of our lexicon construction, we improve the traditional PMI to make it more suitable for mobile shopping reviews.
In the process of lexicon construction, we focus on two issues: the sentiment dimensions of the lexicon, and feature or topic identification in product review domains.
Sentiment Dimensions. Ekman [24] found that humans have six basic emotional categories: happiness, sadness, fear, surprise, anger, and disgust. Ekman's theory, which is accepted by numerous psychologists and linguists, is widely used in the field of sentiment analysis. Rubin et al. [25] presented an empirically verified model on the basis of the idea [26] that an emotion can be divided into eight categories with two major bipolar dimensions: positive and negative affect. Although early approaches simply focused on this binary classification [27], we not only consider the two polarities but also anticipate that sentiment words can be reasonably clustered into finer-grained categorizations.
Feature Identification. Considering that many words in different fields may have different sentiment polarities, it is necessary to explicitly extract the sentiment words and topics or product features, especially in the mobile review domain. Fast et al. [28] found out that using experts or crowdsourcing to construct domain-specific sentiment lexicons is very difficult. Zhang et al. [29] proposed a hybrid method that combined Apriori and PMI to extract product features. Mishne [30] chose the part of speech (POS) and word counts as features in a text classification task. In our research, the primitive product feature extraction also uses the POS as a selection criterion.

Methods
In this section, we present our proposed framework to generate domain-specific sentiment lexicons for mobile shopping. Figure 1 shows the framework of our method. The domain-specific lexicon is based on the relationship between sentiment words and product features modified by the sentiment words. A sentiment matrix is adopted to represent the relationship between the sentiment words and product features. First, we use PMI to express the relationship between sentiment words and product features. Second, we use TFIDF to filter product features so as to reduce the matrix dimension. Finally, we improve the traditional PMI to develop a new algorithm called EPMI, which is used to build a new sentiment matrix. Each row in the sentiment matrix is a vector representation of the sentiment word. After obtaining the sentiment matrix, we cluster the sentiment words into several categories based on the distance between their vector representations. The mathematical symbols used in the process of construction are listed in Table 1.

Building of Primitive Sentiment Matrix.
To perform the key step of mining the relationship between sentiment words and product features, we need to determine the sentiment words and product features in the corpus. To choose the terms from the corpus as candidate words, we use the POS. Sentiment words are commonly used to express positive or negative sentiments. Sentiment lexicons usually contain such words, which can indicate the sentiment polarity (e.g., "good" and "wonderful" indicate positive opinions, whereas "rubbish," "cheap," and "terrible" indicate negative opinions). In mobile shopping reviews, a number of verbs can also indicate the sentiment polarity (e.g., "like" and "love" indicate positive opinions, whereas "dislike" and "refund" indicate negative opinions). In some previous studies [31,32], the words whose POS is an adjective or adverb are considered as sentiment words. The sentiment lexicons developed or used in some other studies [6,33] are also mainly concerned with adjectives and adverbs. In addition, product features in the product review domain are usually nouns or noun phrases found in review sentences [6]. Therefore, we choose adjectives, adverbs, and verbs as sentiment words and choose nouns as primitive product features. For instance, in the hotel review "The food in the dining room is really good, the breakfast tastes good," the product features are "dining," "breakfast," and "food," and the sentiment words are "good" and "tastes." If a sentiment word modifies a product feature, we consider that there is a relationship between them. In mobile shopping reviews, this relationship can be shown as a phenomenon of cooccurrence. We use PMI to quantify this type of cooccurrence relationship. PMI is defined as

$$\mathrm{PMI}(word_1, word_2) = \log \frac{p(word_1, word_2)}{p(word_1)\, p(word_2)}. \tag{1}$$

Here, $p(word_1, word_2)$ is the cooccurrence probability of $word_1$ and $word_2$ in the local window and is expressed as

$$p(word_1, word_2) = \frac{\mathrm{count}(word_1, word_2)}{N}, \tag{2}$$

where $N$ is the total number of words contained in the corpus and $\mathrm{count}(word_1, word_2)$ represents the number of occurrences of the two words in the local window. Similarly, the frequency of each word can be obtained as

$$p(word) = \frac{\mathrm{count}(word)}{N}. \tag{3}$$

In (1), $p(word_1)\, p(word_2)$ gives the probability of cooccurrence if these two words are statistically independent. The ratio of $p(word_1, word_2)$ to $p(word_1)\, p(word_2)$ is thus a measure of the degree of statistical dependence between the words. The PMI value between the sentiment words and product features can reflect the relationship between them. By calculating the PMI value between all the sentiment words and product features, we can obtain a sentiment matrix that contains the relationship between the sentiment words and product features. Let us denote $S = \{s_1, s_2, \ldots, s_n\}$ as the set of sentiment words and $P = \{f_1, f_2, \ldots, f_m\}$ as the set of product features. Matrix $A$, as shown below, consists of $n$ rows and $m$ columns:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}, \qquad a_{ij} = \mathrm{PMI}(s_i, f_j).$$

In the above matrix, each sentiment word $s_i$ can be represented as a vector $\vec{s}_i = [a_{i1}, a_{i2}, \ldots, a_{im}]$. Sentiment matrix $A$ is the primitive sentiment matrix, and this matrix is optimized, as described in the next subsection.
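The construction of the primitive sentiment matrix can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `pmi_matrix`, the symmetric window, the base-2 logarithm, and the choice to store 0 for pairs that never cooccur (instead of negative infinity) are all assumptions made for the example.

```python
import math
from collections import Counter


def pmi_matrix(reviews, sentiment_words, features, window=3):
    """Return {sentiment_word: [PMI with each feature]} for tokenized reviews.

    reviews is a list of token lists. Cooccurrence is counted when a feature
    lies within `window` tokens of a sentiment word in the same review.
    """
    sentiment_set, feature_set = set(sentiment_words), set(features)
    word_count, pair_count, total = Counter(), Counter(), 0

    for tokens in reviews:
        total += len(tokens)
        word_count.update(tokens)
        for i, token in enumerate(tokens):
            if token not in sentiment_set:
                continue
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i and tokens[j] in feature_set:
                    pair_count[(token, tokens[j])] += 1

    matrix = {}
    for s in sentiment_words:
        row = []
        for f in features:
            joint = pair_count[(s, f)]
            if joint == 0:
                # Assumption: unseen pairs are stored as 0 rather than -infinity.
                row.append(0.0)
                continue
            p_joint = joint / total
            p_independent = (word_count[s] / total) * (word_count[f] / total)
            row.append(math.log2(p_joint / p_independent))
        matrix[s] = row
    return matrix


reviews = [["food", "good"], ["breakfast", "good"], ["food", "bad"]]
matrix = pmi_matrix(reviews, ["good", "bad"], ["food", "breakfast"])
```

Each value in `matrix` is one row of the sentiment matrix, that is, the vector representation of one sentiment word over the chosen features.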

Filtering of Product Features.
So far, we have obtained the primitive sentiment matrix $A$, and each sentiment word in the matrix can be represented as a vector. According to our approach, these vectors should be clustered into several categories. However, we found that the number of product features is very large because we consider all nouns as product features. Consequently, the word vectors become very high-dimensional. The clustering of high-dimensional data is still a challenging problem because of the curse of dimensionality [34]. In addition, the use of high dimensions will result in low computational efficiency, especially in mobile computing. In Hu and Liu's study [6], only those product features regarding which many people have expressed their opinions are retained. Similarly, we also select key product features from the primitive nouns. Next, we will describe our feature selection method in detail.
The high-dimension problem stems from the large number of nouns in the corpus. The number is large because we choose all nouns as product features. For instance, consider the product review "This hotel is great, I can recommend my mom to live next time." The words "mom" and "time" will be treated as product features, but these words do not represent any features of the hotel. In addition, nouns of this type can be found everywhere in mobile shopping reviews. Therefore, it is necessary to filter out the key product features rather than choose all nouns as the product features. Product features should be nouns that frequently appear in a particular category of product reviews and rarely appear in other categories. Therefore, we use the idea of TFIDF to select real product features. TFIDF is defined as

$$\mathrm{TFIDF}(word) = \mathrm{TF}(word) \times \mathrm{IDF}(word). \tag{5}$$

Here, TF(word) means the term frequency of the word in the document. IDF(word) means the inverse document frequency, that is, whether the word is common or rare across all documents. It is important to note that the TFIDF value of the same word may be different in different documents. However, TFIDF is usually used for documents rather than for individual reviews. There may be thousands of comments about a single product. We therefore merge all comments of the same kind to form the corresponding document.
From (5), we can obtain the TFIDF value of words in different documents. Unlike the analysis described in the previous subsection, here, we choose the nouns whose TFIDF values are relatively high in the document as product features of the product. Unexpectedly, we find that the nouns whose TFIDF values are relatively high happen to be words that are closely related to the reviewed product. For example, if there are numerous reviews about a hotel, we can retrieve words such as "bathroom" and "air-conditioning" from the corresponding document. When we are commenting on a hotel, we often refer to the "bathroom" or "air-conditioning" in the hotel. However, these two words rarely appear in the reviews of products from other domains such as the electronics domain. We can certainly define a threshold that the TFIDF value of real product features must reach. Let $P' = \{f'_1, f'_2, \ldots, f'_{m'}\}$ be the set of the remaining product features after filtering by TFIDF. Accordingly, we can obtain another sentiment matrix that is similar to sentiment matrix $A$. This matrix (sentiment matrix $B$) consists of $n$ rows and $m'$ columns.
Definition 2. Sentiment matrix $B$: this matrix can be considered to be part of sentiment matrix $A$. The rows represent the sentiment words, whereas the columns represent the product features after filtering by TFIDF. The value of each cell $b_{ij}$, which is the same as that in sentiment matrix $A$, is given by $\mathrm{PMI}(s_i, f'_j)$.
In matrix $B$, each sentiment word can be represented as a vector $\vec{s}_i = [b_{i1}, b_{i2}, \ldots, b_{im'}]$. Here, $m'$ can be considerably less than $m$ when the threshold is set appropriately. Compared to sentiment matrix $A$, sentiment matrix $B$ can effectively solve the high-dimension problem in word embedding. However, there are still some defects in the sentiment matrix, which will be elaborated in the next subsection.
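The filtering step can be illustrated with a short Python sketch. It assumes that all reviews of one product category are merged into a single document, that TF is the relative frequency of a noun in that document, and that IDF is the natural logarithm of the number of documents divided by the number of documents containing the noun. The function name `select_features` and the exact TF and IDF variants are assumptions, since the text only states that the idea of TFIDF is used.

```python
import math
from collections import Counter


def select_features(reviews_by_domain, candidate_nouns, threshold):
    """Keep, per domain, the candidate nouns whose TFIDF reaches the threshold.

    reviews_by_domain maps a domain name to a list of tokenized reviews.
    """
    # Merge all reviews of one domain into a single document.
    documents = {
        domain: [token for review in reviews for token in review]
        for domain, reviews in reviews_by_domain.items()
    }
    counts = {domain: Counter(tokens) for domain, tokens in documents.items()}

    # Document frequency: in how many merged documents does each word occur?
    document_frequency = Counter()
    for counter in counts.values():
        document_frequency.update(set(counter))

    selected = {}
    for domain, tokens in documents.items():
        kept = []
        for noun in candidate_nouns:
            if counts[domain][noun] == 0:
                continue
            tf = counts[domain][noun] / len(tokens)
            idf = math.log(len(documents) / document_frequency[noun])
            if tf * idf >= threshold:
                kept.append(noun)
        selected[domain] = kept
    return selected


reviews_by_domain = {
    "hotel": [["bathroom", "great"], ["time", "bathroom"]],
    "electronics": [["battery", "great"], ["time", "battery"]],
}
features = select_features(reviews_by_domain, ["bathroom", "time", "battery"], 0.01)
```

In this toy corpus, "time" occurs in both merged documents, so its IDF is zero and it is dropped, whereas "bathroom" and "battery" survive in their own domains.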

Optimization of Sentiment Matrix by EPMI.
Here, we introduce an example from hotel reviews to further explain the defect in sentiment matrices $A$ and $B$. We focus on two sentiment words ($s_1$ = "rich" and $s_2$ = "hearty") and two product features ($f_1$ = "food" and $f_2$ = "breakfast"). Both these sentiment words can be used to express opinions about a wide variety of foods. The meanings of these two sentiment words are very similar, and these words are commonly used in the hotel review domain. If we just consider the two features $f_1$ and $f_2$, $s_1$ and $s_2$ can be represented as $\vec{s}_1 = [a_{11}, a_{12}]$ and $\vec{s}_2 = [a_{21}, a_{22}]$ in the sentiment matrix, where $a_{ij}$ is given by $\mathrm{PMI}(s_i, f_j)$.
As is well known, the distance or angle between word vectors can be taken as a measure of the similarity between words. The greater the similarity between two words, the shorter the distance between them. However, in a hotel review, the two sentiment words ($s_1$, $s_2$) and two product features ($f_1$, $f_2$) can be matched with each other flexibly. Although some customers may usually modify $f_1$ with $s_1$ and $f_2$ with $s_2$, they may rarely modify $f_1$ with $s_2$ and $f_2$ with $s_1$. This means that the PMI values of ($s_1$, $f_1$) and ($s_2$, $f_2$) are relatively high, but the PMI values of ($s_1$, $f_2$) and ($s_2$, $f_1$) are very low. Therefore, an illusion is created that $s_1$ and $s_2$ are irrelevant in the two dimensions of $f_1$ and $f_2$. This irrational result stems from the flexibility of product reviews and the diversity of vocabulary in mobile shopping reviews. Although $s_1$ rarely modifies $f_2$, it cannot be simply considered to be irrelevant. When we consider the relationship between a sentiment word and product feature, it is not sufficient to just calculate the PMI value of these two words directly. We still need to consider the relationship between the sentiment word and other product features that are related to the initial product feature. In the mobile shopping reviews about a hotel, there are many features related to $f_2$ ("breakfast"), such as "food" and "dining." Therefore, when we calculate the PMI value of $s_1$ and $f_2$, we consider the cooccurrence of not only $s_1$ and $f_2$ but also $s_1$ and $f_1$ or other product features related to $f_2$. We use $c_{jk} \in [0, 1]$ to reflect the degree of correlation between the two product features $f_j$ and $f_k$. The larger the value of $c_{jk}$, the more relevant $f_j$ is to $f_k$. $c_{jk} = 0$ indicates that the two features $f_j$ and $f_k$ are irrelevant. In particular, if $f_j$ and $f_k$ represent the same feature, the value of $c_{jk}$ between them is zero. Considering all the product features contained in the corpus, we define EPMI as

$$\mathrm{EPMI}(s_i, f_j) = \mathrm{PMI}(s_i, f_j) + \sum_{k=1}^{m} c_{jk}\, \mathrm{PMI}(s_i, f_k). \tag{7}$$

Once we know the value of $c_{jk}$ between any two features $f_j$ and $f_k$, we can obtain the EPMI value on the basis of the PMI value.
Considering that we can screen the features according to the method described in the previous subsection, we focus on the correlation between the remaining product features after filtering and all the primitive features. We assume that the more frequently two features appear in the same review, the higher the correlation between them is. The pseudocode for mining the relationship between them is presented in Algorithm 1.
Following the earlier definitions, $P$ is still the set of all product features for a given kind of product, and $P'$ is the set of product features obtained by carrying out filtering using the approach described in the previous subsection. $R$ is the set of reviews related to the product. $(f'_j, f_k) \in r$ means that features $f'_j$ and $f_k$ appear together in review $r$. The function normal() is a simple normalization function that is used to ensure that every element in the vector belongs to [0, 1]. This algorithm is effective in the sense that it can find the features that are the most relevant to a specific feature. We can obtain matrix $C$ using the above algorithm. After obtaining matrix $C$, we can use EPMI to build a new sentiment matrix.
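Because Algorithm 1 is only described in prose here, the following Python sketch shows one way the feature correlation matrix could be computed. It counts, for every filtered feature, how many reviews it shares with each primitive feature and then normalizes each row. The function name `feature_correlation` is hypothetical, and dividing by the row maximum is an assumed form of normal(), which the text specifies only as mapping every element into [0, 1].

```python
def feature_correlation(reviews, all_features, key_features):
    """Return {key_feature: {feature: correlation in [0, 1]}}.

    reviews is a list of tokenized reviews. A pair of features is counted
    once per review in which both appear. A feature's correlation with
    itself is left at 0, as stated in the text.
    """
    counts = {k: {f: 0 for f in all_features} for k in key_features}
    for tokens in reviews:
        present = set(tokens)
        for k in key_features:
            if k not in present:
                continue
            for f in all_features:
                if f != k and f in present:
                    counts[k][f] += 1

    correlation = {}
    for k, row in counts.items():
        peak = max(row.values(), default=0)
        # Assumption: normal() divides each row by its largest count.
        correlation[k] = {f: (v / peak if peak else 0.0) for f, v in row.items()}
    return correlation


reviews = [["food", "breakfast", "rich"], ["food", "breakfast"], ["food", "room"]]
corr = feature_correlation(reviews, ["food", "breakfast", "room"], ["food", "breakfast"])
```

Here "food" shares two reviews with "breakfast" and one with "room", so its row becomes 1.0 for "breakfast" and 0.5 for "room".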
Definition 3. Sentiment matrix $F$ can be determined by (7). The only difference between sentiment matrices $B$ and $F$ is that matrix $F$ is obtained using our approach (EPMI) rather than the traditional PMI. In other words, each cell of sentiment matrix $F$ represents the EPMI value between the sentiment words and product features rather than the PMI value between them.

$$F[i][j] = F[i][j] + C[j][k] \cdot A[i][k]$$
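The effect of EPMI on the "rich" and "hearty" example can be checked with a small Python sketch. It assumes that EPMI adds to each PMI value the correlation-weighted PMI values of the related features, which is our reading of the definition above. The function name `epmi_matrix` and the dictionary layout are illustrative choices.

```python
def epmi_matrix(pmi, correlation):
    """Combine PMI values with feature correlations.

    pmi maps sentiment_word -> {feature: PMI value} over all primitive features.
    correlation maps key_feature -> {feature: weight in [0, 1]}.
    Returns sentiment_word -> {key_feature: EPMI value}.
    """
    result = {}
    for word, row in pmi.items():
        result[word] = {}
        for key_feature, weights in correlation.items():
            value = row.get(key_feature, 0.0)
            for feature, weight in weights.items():
                # Add the PMI with each related feature, scaled by its correlation.
                value += weight * row.get(feature, 0.0)
            result[word][key_feature] = value
    return result


# Hypothetical PMI values: each word cooccurs with only one of the two features.
pmi = {
    "rich": {"food": 2.0, "breakfast": 0.0},
    "hearty": {"food": 0.0, "breakfast": 2.0},
}
correlation = {
    "food": {"food": 0.0, "breakfast": 1.0},
    "breakfast": {"food": 1.0, "breakfast": 0.0},
}
epmi = epmi_matrix(pmi, correlation)
```

With plain PMI, the two word vectors are [2, 0] and [0, 2] and look unrelated. After the correlation between "food" and "breakfast" is taken into account, both become [2, 2], so the two near-synonyms end up close together in the matrix space.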
So far, we have obtained three sentiment matrices A, B, and F. The sentiment words can be represented by the vectors in each of the sentiment matrices.
In mobile shopping reviews, customers often use different sentiment words to modify different product features. In addition, customers may have a good feeling about some features of a product, but they may be dissatisfied with some other features at the same time. Therefore, different product features also reflect different feelings. We assume that sentiment words can be divided into different categories according to the relationship between them and the product features. Therefore, we cluster the sentiment words into several categories rather than into binary or other fixed categories. In other words, the sentiment dimension of a word in our domain-specific lexicon is flexible rather than having only limited emotional polarity. For each sentiment dimension, each sentiment word can take the value of 0 or 1, where 1 indicates that it belongs to a particular category whereas 0 indicates that it does not belong to that category. If we cluster the sentiment words into five categories, the representation

Walk-Through Example.
Here, we will elaborate on the differences between EPMI and PMI using an example. Suppose that we want to determine the semantic correlations between the sentiment words $s_1$ = "丰富" (rich) and $s_2$ = "丰盛" (hearty), and the five sentences listed in Table 2 are our corpus. This small corpus is a part of Chinese mobile shopping reviews about hotels.
We calculate $C$ using Algorithm 1. First, by iterating through these five comments, we obtain the number of instances of cooccurrence of each pair of product features (Table 3). In this table, each cell shows the number of times that two features appear together in the same comment. The values in this table are similar to those in matrix $C$ obtained in Algorithm 1.

Experiments
To evaluate the domain-specific lexicon developed using our approach, we design an experimental setup in which we compare the proposed domain-specific lexicon with two popular general lexicons and with state-of-the-art machine-learning and deep-learning approaches that do not use a lexicon. We mainly evaluate different lexicons and approaches using document-level classification tasks in the domain of online product reviews. For hybrid sentiment classification methods, we consider the features of the document vector representation as the lexicon. We use the F1-measure as our main evaluation index and choose NB and SVM as the classifiers. In the following subsections, the details of the experiments and their results are described.

Dataset.
The dataset includes both Chinese and English shopping reviews. These reviews are for seven types of products. The detailed statistics of this dataset are listed in Table 4.
The Chinese product reviews include three domains: hotel, cloth, and fruit. The hotel reviews are provided by Dr. Tan (http://download.csdn.net/download/lssc4205/9903298), and the cloth and fruit reviews are crawled from a mobile shopping application JD (https://www.jd.com/). The English reviews are obtained from the well-known Amazon product review dataset collected by Blitzer et al. [35]. It is widely used as a benchmark dataset for cross-domain sentiment classification. Four domains (book, DVD, electronics, and kitchen) are included in this dataset. For each domain, 1000 positive and 1000 negative reviews are included.

Table 4: Number of positive and negative reviews per domain.
Language  Domain       Positive  Negative
Chinese   Hotel        5321      2444
Chinese   Cloth        5000      5000
Chinese   Fruit        5000      5000
English   Book         1000      1000
English   DVD          1000      1000
English   Electronics  1000      1000
English   Kitchen      1000      1000

Experimental Design.
We use the open-source software jieba (https://pypi.python.org/pypi/jieba/) to carry out preprocessing tasks on the Chinese product reviews, including Chinese word segmentation and POS tagging. For the sentiment classification approaches that do not use a lexicon, we compare our method with the classical bag-of-words model and the deep-learning model Word2Vec [36]. Furthermore, we compare the domain-specific sentiment lexicon with two popular general sentiment lexicons. We use the scikit-learn [37] Python library implementation of the classifier. The detailed differences between the three test groups are described below. [38]. To use Word2Vec for document-level tasks, a method is required that can unify all word vectors and generate a single vector representing the entire document [11]. Thus, the final representation is obtained according to the number of words contained in the document as follows:

$$\vec{d} = \frac{1}{n} \sum_{i=1}^{n} \vec{w}_i,$$

where $\vec{w}_i$ is the vector of the $i$th word and $n$ is the number of words in the document. We use the gensim (https://radimrehurek.com/gensim/models/word2vec.html) Python library implementation of Word2Vec. We use the default values for almost all the parameters and use vectors with 200 dimensions.
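The averaging of word vectors into a single document vector can be sketched as follows. The snippet uses a plain dictionary of word vectors instead of a trained gensim model so that it stays self-contained. The function name `document_vector` and the decision to skip words without a vector are assumptions made for the example.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of all tokens that have an embedding.

    word_vectors maps a word to a list of `dim` floats. Tokens without a
    vector are skipped (an assumption), and an all-zero vector is returned
    when no token of the document is known.
    """
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings; the experiments use 200 dimensions.
word_vectors = {"good": [1.0, 0.0], "food": [0.0, 1.0]}
doc = document_vector(["good", "food", "really"], word_vectors, dim=2)
```

The resulting fixed-length vector can then be passed to a classifier such as NB or SVM in the same way as a bag-of-words vector.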

(b) General Lexicon
For this test group, we use the hybrid sentiment classification approaches. That is, we consider the words in the lexicon, the sentiment dimensions of the lexicon, and a combination of the words and dimensions as the features for the machine-learning classifier. First, we choose a general sentiment lexicon DUTIR [39] for Chinese reviews. The DUTIR lexicon contains 27446 common Chinese words. The sentiment polarity of these words is labeled as positive, negative, or neutral.

(DUTIR) We consider only the words contained in the DUTIR lexicon as features, as in the case of the bag-of-words model. Therefore, the review can be represented as $\vec{v}_0 = [0, 1, \ldots, 1]$.
(Only 3) We consider the three polarities of the sentiment words in the DUTIR lexicon. We represent the product review by a three-dimensional vector $\vec{v}_1 = [m_0, m_1, m_2]$, where $m_0$, $m_1$, and $m_2$ are the number of words with the three types of polarities in the review.
(DUTIR+3) Here, we combine the above two representations. The product review can be represented as $\vec{v}_2 = [0, 1, \ldots, 1, m_0, m_1, m_2]$.
For the English reviews, we choose a general sentiment lexicon SentiWordNet (http://sentiwordnet.isti.cnr.it/). SentiWordNet assigns three sentiment scores to each synset of WordNet: positivity, negativity, and objectivity. In other words, the sentiment dimension of the words is 3 for both general sentiment lexicons, Chinese and English.
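The three review representations for a general lexicon can be sketched as below. The tiny lexicon is invented for illustration and is not taken from DUTIR or SentiWordNet. The function name `lexicon_features` is hypothetical, and counting every occurrence of a word (rather than each distinct word once) is an assumption.

```python
def lexicon_features(tokens, lexicon, polarities=("positive", "negative", "neutral")):
    """Build the word-only, polarity-only, and combined representations.

    lexicon maps a word to one of the polarity labels.
    """
    vocabulary = sorted(lexicon)
    present = set(tokens)
    # Word features: 1 if the lexicon word occurs in the review, else 0.
    word_vector = [1 if word in present else 0 for word in vocabulary]
    # Polarity features: number of review tokens carrying each polarity.
    polarity_vector = [
        sum(1 for token in tokens if lexicon.get(token) == polarity)
        for polarity in polarities
    ]
    return word_vector, polarity_vector, word_vector + polarity_vector


lexicon = {"good": "positive", "bad": "negative", "ok": "neutral"}
words_only, only_3, combined = lexicon_features(["good", "good", "room", "bad"], lexicon)
```

The combined vector simply appends the three polarity counts to the word features, so a classifier sees both which lexicon words occur and how the review is distributed over the polarities.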

(c) Domain-Specific Lexicon
We use hybrid approaches to evaluate the domain-specific lexicon developed using our method. We set the window size to 3 and the other parameter described in Section 3.2 to 0.01. We use k-means (http://scikit-learn.org/stable/modules/clustering.html#k-means) to cluster the sentiment words into $k$ categories based on their distance in the matrix space. Unlike the general lexicons, the sentiment dimension of the words in our domain-specific lexicon is $k$. Note that we select $k$ through fivefold cross-validation on the training set. The details of the selection of $k$ are explained in the next subsection. In our experiment, the number of clusters is not more than 30. In the following discussion, the domain-specific lexicon built using our method is denoted as DS.
(DS) We consider the sentiment words contained in the domain-specific lexicon as features, as in the case of the bag-of-words model. Accordingly, the review can be represented as a vector of the form [0, 1, . . . , 1].
(Only k) We cluster the sentiment words into $k$ categories using k-means. We represent the product review by a $k$-dimensional ($k \geq 2$) vector $[m_0, m_1, \ldots, m_{k-1}]$, where $m_t$ represents the number of sentiment words that belong to the $(t + 1)$th category in the review.
(DS+k) Here, we combine the above two representations to represent the product review.

Three different sentiment matrices are considered in our lexicon construction process. The sentiment word representations in the different matrices differ considerably, and therefore the clustering results also differ. We use k(PMI), k(TFIDF-PMI), and k(EPMI) to represent the clustering results of sentiment matrices A, B, and F, respectively. We discuss the different results of the three matrices in the next subsection.
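The Only k and DS+k representations can be sketched as follows, using the fact that each sentiment word takes the value 1 in exactly one sentiment dimension and 0 in the others. This is an illustrative sketch only: the cluster assignments in `toy_clusters` are invented rather than produced by k-means, the function names are hypothetical, and the binary encoding of the word features and the concatenation used for DS+k are assumptions.

```python
def one_hot(cluster_id, k):
    """Lexicon entry of a sentiment word: 1 in its own category, 0 elsewhere."""
    return [1 if t == cluster_id else 0 for t in range(k)]


def only_k_features(tokens, cluster_of, k):
    """Vector [m_0, ..., m_{k-1}]: m_t counts the sentiment words of the
    review that belong to the (t + 1)th category."""
    m = [0] * k
    for tok in tokens:
        if tok in cluster_of:
            for t, bit in enumerate(one_hot(cluster_of[tok], k)):
                m[t] += bit
    return m


def ds_plus_k_features(tokens, cluster_of, k):
    """Word-level presence features followed by the category counts."""
    present = set(tokens)
    ds = [1 if w in present else 0 for w in sorted(cluster_of)]
    return ds + only_k_features(tokens, cluster_of, k)


# Hypothetical cluster assignments for a tiny domain-specific lexicon.
toy_clusters = {"cheap": 0, "fast": 1, "slow": 2, "pricey": 0}
review = ["cheap", "and", "fast", "delivery", "cheap"]

only_k = only_k_features(review, toy_clusters, k=3)
ds_k = ds_plus_k_features(review, toy_clusters, k=3)
print(only_k, ds_k)
```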
In addition, we evaluate the lexicon in terms of coverage and usage. We assume that the test set contains $N$ unique words and that these words include $n$ sentiment words that are contained in the sentiment lexicon. We also assume that the size of the lexicon that is used to train the classification model is $M$. Therefore, the coverage of the lexicon is $n/N$, and the usage of the lexicon is $n/M$. If the coverage of the lexicon is low, the classification performance will be unsatisfactory. If the usage of the lexicon is low, computing resources will be wasted, which should be avoided especially on mobile devices. Considering these two evaluation indexes, we propose an average evaluation index in the manner of the F1-measure. Let $C$, $U$, and $A$ represent the coverage, usage, and average of the lexicon, respectively. Then, $A$ is defined as

$$A = \frac{2 \cdot C \cdot U}{C + U}.$$
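A small worked example may make the three indexes concrete. The sketch below assumes that the average is the harmonic mean of coverage and usage, by analogy with the F1-measure; the function name `lexicon_metrics` and the toy data are hypothetical.

```python
def lexicon_metrics(test_tokens, lexicon):
    """Return (coverage, usage, average) of a lexicon on a test set.

    coverage = matched words / unique words in the test set
    usage    = matched words / lexicon size
    average  = harmonic mean of the two (assumed F1-style combination)
    """
    unique_words = set(test_tokens)
    matched = unique_words & set(lexicon)
    coverage = len(matched) / len(unique_words) if unique_words else 0.0
    usage = len(matched) / len(lexicon) if lexicon else 0.0
    if coverage + usage == 0:
        average = 0.0
    else:
        average = 2 * coverage * usage / (coverage + usage)
    return coverage, usage, average


tokens = ["good", "phone", "bad", "battery", "good"]  # 4 unique words
lexicon = {"good", "bad", "awful", "great", "poor"}   # 5 lexicon words
coverage, usage, average = lexicon_metrics(tokens, lexicon)
print(coverage, usage, round(average, 4))
```

Here two lexicon words occur among four unique test words, so the coverage is 2/4 and the usage is 2/5; a large general lexicon would raise the first ratio while lowering the second.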

Results and Discussion
Overall Results. Table 5 lists the overall classification results. All the tasks are balanced two-class problems. The best result for each domain review is marked in bold font, and the second-best result is underlined. First, for the domain-specific and general lexicons, DS+k achieves the best results for all seven domain reviews, whereas DS achieves the second-best results for four of the reviews. DS outperforms the general lexicons DUTIR and SentiWordNet. These results indicate that the domain-specific lexicon, which is constructed from the corresponding corpus, shows better performance for sentiment classification tasks on shopping reviews.
Second, for the no-lexicon approaches, the classical bag-of-words model clearly performs better than the deep-learning model Word2Vec in sentiment classification tasks. BOW achieves the second-best results for three of the reviews, whereas W2V shows nearly the worst performance for both Chinese and English reviews. This poor performance is unexpected; a larger corpus of training data is perhaps required for training Word2Vec [40].
Third, for the sentiment dimensions, the performances of Only 3 and Only k are relatively poor for both Chinese and English reviews. That is, it is not sufficient to consider only the sentiment dimensions when we use the lexicon as the source of the features to express the reviews. However, the performance of DS+k is better than those of DS and Only k for both Chinese and English reviews. This result indicates the effectiveness of combining the words and sentiment dimensions of the lexicon.
Note that, in Table 5, k represents k(EPMI).

EPMI versus PMI and TFIDF-PMI. Considering that DS+k provides the best performance, we analyze the differences among the results of DS+k(PMI), DS+k(TFIDF-PMI), and DS+k(EPMI) in detail. Table 6 lists the classification results of DS+k with the three different methods mentioned in Section 3. First, we find that the classification performance of DS+k(EPMI) is better than those of DS+k(PMI) and DS+k(TFIDF-PMI). In particular, the performance of DS+k(PMI) is relatively poor. According to a t-test, the differences among the results of the three methods are significant ($p < 0.05$). This result reflects the advantages of EPMI over traditional PMI in sentiment classification. Second, we discuss the differences among the three methods in terms of time efficiency. Figure 2 shows the average clustering times required by the three different sentiment matrices. The time consumed by sentiment matrices B and F is considerably less than that consumed by sentiment matrix A. This is because in matrices B and F, the dimension of the vector in the matrix space is reduced by using the method described in Section 3.2. The dimension reduction leads to a substantial increase in the efficiency and accuracy of classification. Therefore, sentiment matrix F constructed using EPMI shows the best performance.
Selection of k. The sentiment dimension of the domain-specific lexicon is $k$. Now, we analyze the influence of different $k$ values on the classification performance. Figure 3 shows the performance of Only k(EPMI) with the change in $k$ for the English product reviews. When $k$ is 2, Only k(EPMI) shows the best performance for the books and DVD domains. For the kitchen and electronics domains, a larger $k$ improves the classification performance of Only k(EPMI). The appropriate value of $k$ for the domain-specific lexicon is different for different fields. We select the value of $k$ through fivefold cross-validation on the training set in our experiments.
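The selection of the number of clusters by fivefold cross-validation can be sketched as follows. This is a schematic outline under stated assumptions, not the experimental code: the `evaluate` callable, which would cluster the sentiment words into k categories, train the classifier on the training folds, and return the accuracy on the held-out fold, is replaced here by a hypothetical stand-in, and the folds are assigned round-robin without shuffling or stratification.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Assign sample indices to folds in round-robin order (an assumption)."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_samples):
        folds[i % n_folds].append(i)
    return folds


def select_k(n_samples, candidate_ks, evaluate):
    """Return the k with the highest mean held-out score over five folds."""
    folds = five_fold_indices(n_samples)
    best_k, best_score = None, float("-inf")
    for k in candidate_ks:
        scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(n_samples) if i not in held]
            scores.append(evaluate(k, train_idx, held_out))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_k, best_score = k, mean_score
    return best_k, best_score


def toy_evaluate(k, train_idx, valid_idx):
    """Hypothetical stand-in for 'cluster, train, and score on the fold'.
    It simply returns a score that peaks at k = 7."""
    return 1.0 - abs(k - 7) / 30.0


# Candidate values respect k >= 2 and the upper bound of 30 clusters.
best_k, best_score = select_k(100, range(2, 31), toy_evaluate)
print(best_k, best_score)
```

Because only the training set is split, the test set plays no part in choosing the number of clusters.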
However, we find that the performance of Only k(EPMI) is worse than that of Only 3 in the books and DVD domains (Table 5). We believe that this is because $k = 2$ is not a good choice for Only k(EPMI) in sentiment classification tasks. To check this point, we look at the performance results of Only k ($k = 2$) for all English product reviews (Table 7). Table 7 indicates that the performance of Only k(EPMI) is not good when $k$ is fixed at 2. In our domain-specific lexicon, a low sentiment dimension such as $k = 2$ is not suitable for DS. The value of $k$ has a substantial influence on our domain-specific lexicon for sentiment classification tasks. Therefore, it is necessary to select the value of $k$ by cross-validation.

NB versus SVM. To obtain all the above results, we chose NB as the classifier. However, the classification algorithm influences the classification performance. Therefore, we choose another popular classification algorithm, SVM, as the classifier for the best-performing method of each type. The results are listed in Table 8, where ↑ indicates an improvement in performance compared to that when NB is used and ↓ indicates a deterioration in performance. SVM performs better than NB when using DUTIR+3 for Chinese reviews and when using BOW for the books and DVD domains. In contrast, NB yields better performance when using the other approaches. Sentiment classification is perhaps one of the domains that have clear feature dependence, and hence, NB often performs unexpectedly well [41]. Although the domain-specific lexicon performs better with both NB and SVM, different types of text classification models are probably required for documents with different properties. Hence, further empirical and theoretical study is required to understand the relationship between sentiment classification tasks and classification models.
Lexicon Coverage. Finally, we discuss the classification performance in terms of the coverage, usage, and average of the lexicons. The results for the test set are listed in Tables 9 and 10. In both the Chinese and English domains, the average of BOW is relatively high. For Chinese product reviews, both the coverage and usage of DUTIR are the worst because DUTIR is a general lexicon that contains only a few words that often appear in shopping reviews. The coverage of SentiWordNet is considerably higher than that of DUTIR. This partly explains why the performance of SentiWordNet is better than that of DUTIR for sentiment classification tasks. The better performance is also probably because SentiWordNet contains more words related to mobile shopping reviews than DUTIR. The coverage of SentiWordNet is higher than that of the domain-specific lexicon for English product reviews, whereas the usage of SentiWordNet is considerably lower than that of DS. A very low lexicon usage may impair performance and waste the computing resources of mobile devices. The average of DS is considerably higher than that of the general lexicon for both Chinese and English product reviews. This result reflects the advantage of our domain-specific lexicon for mobile shopping reviews in another way.

Conclusions
The analysis of the sentiment of users' product reviews largely depends on the quality of sentiment lexicons. This paper presents a sentiment lexicon construction approach for mobile shopping. In this approach, a sentiment matrix that considers the relationship between sentiment words and product features is built. The sentiment words are clustered based on the distance between them in the matrix space. One characteristic of our lexicon is that the sentiment words are clustered into several categories rather than into binary or other fixed categories. In other words, the sentiment dimension of the words in our lexicon is flexible. In addition, the product features are filtered based on the idea of TFIDF. Moreover, the EPMI algorithm is proposed, which is more appropriate for the mobile review domain. The experimental results show that our sentiment lexicons outperform the benchmarks with statistically significant differences in terms of sentiment classification tasks, thus proving the effectiveness of the proposed approach.

Data Availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.