Technology and the Internet have changed how travel is booked, the relationship between travelers and the tourism industry, and how tourists share their travel experiences. As a result of this multiplicity of options, mass tourism markets have been dispersing. Global demand, however, has not fallen; on the contrary, it has increased. Another important factor is the digital transformation, which is taking hold as a way to reach new client profiles, especially the so-called third generation of tourism consumers: digital natives who understand the world only through their online presence and who make the most of every one of its advantages. In this context, the digital platforms where users publish their impressions of tourism experiences are starting to carry more weight than the corporate content created by companies and brands. In this paper, we propose using different deep-learning techniques and architectures to classify the comments that tourists publish online and that new tourists use to decide how best to plan their trip. Specifically, we propose a classifier to determine the sentiments reflected on the
That the world of tourism is changing is not news. More and more data, both structured and unstructured, are being generated at ever higher rates and, once transformed into information, provide tangible value to businesses. Opinion mining and sentiment analysis has been a very active field of study in recent years [
The use of automatic tools in social networks for the tourism sector has generated ample literature due to the importance of influencing the consumer’s participation and affecting the way in which tourists perceive their experience [
In contrast to the traditional conversations that take place in specific physical locations, the digital conversation is shaped using new methods and tools for engaging the public, whose social interaction characteristic is the focus of its dynamic [

- Great ability to disseminate, where users can access opinions from strangers
- Massive engagement by users of different ages and groups, all sharing different points of view
- The message spreading quickly in several ways: blogs, websites, social media, messages posted in online groups, etc.
- Multidirectional discussion among users, who play an active role by answering questions on the information presented
- Persistence over time, since the discussions are uploaded for the current and future reference
- Credibility, since the information is offered by users spontaneously and, in theory, with no profit motive
These features make monitoring of eWOM by tourism companies particularly relevant [
According to [
The tourism industry is of great importance in the Canary Islands: it is the real engine of growth and development in the archipelago and accounts for a high percentage of its GDP. This has a knock-on effect on the remaining industries and services in the islands, especially on the development of trade, transportation, food production, and industry. Tourism is also a very important source of jobs in the service sector of the archipelago, which encompasses direct employment in the sun and beach sector, as well as workers in activities that support tourism, such as restaurants, hotels, travel agencies, passenger transport, car rental, and recreational, cultural, and sports activities.
In 2016, Tenerife, one of the Canary Islands, received over seven million tourists, most of them from the United Kingdom, which accounted for thirty-one percent of the passengers arriving at Tenerife’s airports that year. These figures indicate that the opinion that British tourists have of Tenerife is particularly relevant. As a result, we conduct our experiments on the comments written in English about the hotels in Tenerife.
Below, we describe the estimators implemented, the data used, and the procedures employed to train, evaluate, and compare the estimators.
The techniques used, which will be detailed in the sections that follow, fall under the category of supervised learning methods. This requires having a set of previously classified data before the prediction system can be trained and a test sample in order to validate how accurately the technique behaves. To satisfy this requirement and in order to use real data in the methods, we extracted information from reviews in English on the Booking
- The comment will have a general score between 0 and 10
- The comment’s author is able to separate positive and negative aspects

TripAdvisor

- The comment features a rating system using bubbles or stars to assign a score between 1 and 5
- There is a single field for expressing an opinion
In order to extract the information, we developed Python scripts based on the Scrapy framework and adapted them to the domain using the scripts offered by the MonkeyLearn project [
As a result of this process, we obtained more than 40,000 records with different fields, depending on the source portal for the data: title, comment, score, date, and location of the visitor.
The preparation of the data set so that it can be used in the deep-learning techniques studied in this paper is an important part of the work. In the initial phase and in an effort to standardize the information taken from the two sources used, we developed programs in Python to build a CSV file with two columns:
- Comment (free text)
- Label (“Bad” or “Good”)
Depending on the portal from which the data were sourced, the scripts had different functionalities.
In the case of the samples taken from TripAdvisor, the original title and comment were combined into a single text containing the visitor’s full opinion. Those reviews scoring four or higher were labeled “Good,” and those scoring two or below were labeled “Bad.” Any samples with an intermediate rating were discarded so as not to hamper the training.
In the case of the samples taken from Booking, we evaluated the number score awarded by the visitor. For scores higher than six, which were labeled “Good,” the comment was generated by concatenating the title and comment fields. For scores of four and below, the title and negative comment were concatenated and labeled “Bad.”
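The Booking rule above can be sketched as a small function. This is an illustration only, not the original scripts: the function name and the dictionary keys (`title`, `comment`, `negative_comment`, `score`) are hypothetical, and skipping reviews whose score falls between the two thresholds is an assumption.

```python
from typing import Optional, Tuple


def label_booking_review(review: dict) -> Optional[Tuple[str, str]]:
    """Return (text, label) for a Booking review, or None if it is skipped.

    The keys used here are hypothetical field names for illustration.
    """
    score = review["score"]  # general score between 0 and 10
    if score > 6:
        # "Good": title concatenated with the comment field
        text = f'{review["title"]} {review["comment"]}'
        return text.strip(), "Good"
    if score <= 4:
        # "Bad": title concatenated with the negative comment
        text = f'{review["title"]} {review["negative_comment"]}'
        return text.strip(), "Bad"
    return None  # assumption: scores in between are left out


example = {
    "title": "Noisy",
    "comment": "",
    "negative_comment": "Thin walls and no air conditioning",
    "score": 3.5,
}
print(label_booking_review(example))
```

Each returned pair corresponds to one row of the two-column CSV file described earlier.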
Once the structure of the samples was standardized, the data set was divided into three parts. For the first set, a random balanced selection (half of the samples labeled “Good” and the other half labeled “Bad”) of 9640 samples was carried out. The second set was used to evaluate the models after each training epoch, and finally, the test sample was created using 2785 samples in which the scores assigned by the tourists were known and which we used to compare the accuracy of the various models.
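A balanced random selection of this kind could be drawn as in the sketch below. It assumes the samples are `(comment, label)` pairs, as in the CSV file; the function name `balanced_sample` and the fixed seed are illustrative choices, not part of the original code.

```python
import random


def balanced_sample(rows, per_class, seed=0):
    """Pick `per_class` random "Good" rows and `per_class` random "Bad" rows."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    good = [row for row in rows if row[1] == "Good"]
    bad = [row for row in rows if row[1] == "Bad"]
    sample = rng.sample(good, per_class) + rng.sample(bad, per_class)
    rng.shuffle(sample)  # avoid having all "Good" rows first
    return sample


# Toy data; in the paper the training set has 4820 samples per class.
rows = [(f"good comment {i}", "Good") for i in range(10)]
rows += [(f"bad comment {i}", "Bad") for i in range(6)]
training = balanced_sample(rows, per_class=4)
print(len(training))  # 8
```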
The second phase to prepare the data sets for the deep-learning models involved adapting them to the data domain that can be input into the models. Each comment used from the training set was subject to preprocessing before it could be used. Fortunately, Python offers an ecosystem of libraries that can be used in several machine learning applications [
The figures below offer a visual representation of those words that appeared most frequently in the positive and negative comments:
As we can see by analyzing Figures
Negative review words.
Positive review words.
The algorithms used require as input a fixed-length vector in which each component is a number. In the technique for coding text into numeric vectors known as bag of words (BoW), a dictionary is created with the words found most frequently in all of the training comments. Each comment is then coded into a fixed-length vector whose length equals the number of words in the dictionary, and in which each component counts how many times the corresponding dictionary word appears in the comment. We ruled out this coding method because, even though it represents the frequency of words in the comment, it discards the order in which those words appear.
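The loss of word order can be seen in a minimal bag-of-words sketch. The helper names `build_dictionary` and `bow_vector` are hypothetical, and splitting on whitespace is a simplification assumed here for brevity.

```python
from collections import Counter


def build_dictionary(comments, size):
    """Keep the `size` most frequent words found in the comments."""
    counts = Counter(word for c in comments for word in c.lower().split())
    return [word for word, _ in counts.most_common(size)]


def bow_vector(comment, dictionary):
    """Count how many times each dictionary word appears in the comment."""
    counts = Counter(comment.lower().split())
    return [counts[word] for word in dictionary]


comments = ["the room was good not bad", "the room was bad not good"]
dictionary = build_dictionary(comments, size=10)
first, second = (bow_vector(c, dictionary) for c in comments)
print(first == second)  # True: same counts, opposite meaning
```

Both comments map to the same vector even though they express opposite opinions, which is the limitation described above.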
The word embedding technique is currently one of the best for representing texts as number vectors. It is a learned representation in which words with similar meanings are given a similar representation. Each vocabulary word is represented with a vector, and its representation is learned based on the use it is given in the training comments. As a result, words that are used in a similar way will have a similar representation. The learning process for embedding is carried out in this paper by adding a layer at the front of the neural network in each of the models. In order to be able to generate our models in Python, we resorted to the Keras library [
In order to use this library and add the embedding layer to our models, we have to first transform the tourists’ comments into integer vectors in which each word is represented by an index in a list of words. We do so by using the Tokenizer class in Keras, creating an instance based on the training data set and limiting the size of the vocabulary, in our case, to the 5000 most common words. By using the Tokenizer instance, we transform each comment into a vector of variable length in which each word is an integer where the value
Converting review to fixed-length vectors.
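The conversion from text to fixed-length integer vectors can be sketched in plain Python. This is a stand-in written for illustration, not the Keras `Tokenizer` itself: the helper names are hypothetical, and reserving index 0 for padding and padding at the front of the vector are assumptions.

```python
from collections import Counter


def fit_word_index(comments, num_words):
    """Map the `num_words` most common words to integers starting at 1."""
    counts = Counter(w for c in comments for w in c.lower().split())
    # Index 0 is reserved for padding, so word indices start at 1.
    return {w: i for i, (w, _) in enumerate(counts.most_common(num_words), start=1)}


def to_sequence(comment, word_index):
    """Variable-length vector of word indices; unknown words are skipped."""
    return [word_index[w] for w in comment.lower().split() if w in word_index]


def pad(sequence, length):
    """Left-pad with zeros (or keep the last `length` items) to a fixed length."""
    seq = sequence[-length:]
    return [0] * (length - len(seq)) + seq


train = ["nice hotel nice pool nice", "hotel bad"]
word_index = fit_word_index(train, num_words=2)       # {'nice': 1, 'hotel': 2}
sequence = to_sequence("bad hotel nice", word_index)  # [2, 1]
print(pad(sequence, length=4))                        # [0, 0, 2, 1]
```

In the experiments the vocabulary is limited to the 5000 most common words rather than the two used in this toy example.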
The embedding layer is initialized with random weights and trained at the same time as the rest of the model, using the training data set. In every model implemented for this work, the embedding layer was used as the first layer, with the following characteristics:
- Input dim (the maximum integer value of the vector component input): its value, based on how we coded the comments, is 5000
- Output dim (the length of the vectors that will represent the words after embedding): in most of the experiments, this length is set to 300
- Input length (the length of the vectors in the layer): as defined earlier, it is the maximum length of the comments in words. In the experiments conducted for this work, the maximum comment length was 582 words
As shown in Figure
Embedding layer.
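Conceptually, the embedding layer is a lookup table with one row per word index. The sketch below shows only the lookup and the resulting shape, not the training of the weights; the helper names, the toy sizes, and the range of the random initial values are assumptions made for illustration.

```python
import random


def make_embedding(input_dim, output_dim, seed=0):
    """Table of `input_dim` rows, each a random vector of `output_dim` values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
        for _ in range(input_dim)
    ]


def embed(sequence, table):
    """Replace every word index with its row of the table."""
    return [table[index] for index in sequence]


# Toy sizes; the paper uses input dim 5000, output dim 300, input length 582.
table = make_embedding(input_dim=50, output_dim=4)
padded_comment = [0, 0, 3, 7, 3, 1]  # input length 6
embedded = embed(padded_comment, table)
print(len(embedded), len(embedded[0]))  # 6 4
```

With the values used in the experiments, each comment therefore becomes a 582 × 300 matrix, which matches the output of the embedding layer in the model descriptions.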
To predict the sentiment of the comments, we use models based on neural networks. Each comment is a sequence of encoded words that can be processed as a time series. However, the most common neural networks (e.g., feed-forward neural networks) lack the memory to store information over time. Recurrent neural networks [
Figure
LSTM unfolded into a full network.
In an RNN, the relationship between output
As (
The
As Figure
Output
Another type of neural network that can be used to predict time series is a convolutional neural network (CNN). These are biologically inspired variants of feed-forward neural networks used primarily in computer vision problems [
In these models, the output of each neuron
For the
Figure
General diagram of the CNNs used.
Each convolutional layer with a ReLU activation function is followed by a max-pooling layer, which partitions the input into a set of nonoverlapping ranges and, for each range, outputs the maximum value. Following the convolutional and max-pooling layers is a feedforward layer (as described in (
In order to compare some of the deep-learning techniques mentioned in this paper, we conducted a series of experiments on different models based on LSTM neural networks and CNN. Table
Model structure.
Model | Model description |
---|---|
1 | (582) embedding→(582 × 300) LSTM→(30) dense [sigmoid]→(1) |
2 | (582) embedding→(582 × 300) LSTM→(50) dense [sigmoid]→(1) |
3 | (582) embedding→(582 × 300) LSTM→(70) dense [sigmoid]→(1) |
4 | (582) embedding→(582 × 300) LSTM→(100) dense [sigmoid]→(1) |
5 | (582) embedding→(582 × 300) LSTM→(200) dense [sigmoid]→(1) |
6 | (582) embedding→(582 × 300) LSTM→(300) dense [sigmoid]→(1) |
7 | (582) embedding→(582 × 300) LSTM→(500) dense [sigmoid]→(1) |
8 | (582) embedding→(582 × 300) Conv1D→(575 × 64) MaxPooling1D→(287 × 64) flatten→(18,368) dense [relu]→(10) dense [sigmoid]→(1) |
9 | (582) embedding→(582 × 300) Conv1D→(575 × 128) MaxPooling1D→(287 × 128) flatten→(36,736) dense [relu]→(10) dense [sigmoid]→(1) |
10 | (582) embedding→(582 × 300) Conv1D→(575 × 32) MaxPooling1D→(287 × 32) Conv1D→(280 × 64) flatten→(17,920) dense [relu]→(10) dense [sigmoid]→(1) |
11 | (582) embedding→(582 × 300) Conv1D→(582 × 32) MaxPooling1D→(291 × 32) LSTM→(100) dense [sigmoid]→(1) |
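The layer sizes in the table can be checked with some simple arithmetic. The kernel size of 8, the pool size of 2, and the use of unpadded ("valid") convolutions in models 8 to 10 and padded ("same") convolutions in model 11 are not stated in the text; they are inferred here from the shapes in the table and should be read as assumptions.

```python
def conv1d_length(length, kernel_size, padding="valid"):
    """Output length of a 1D convolution with stride 1."""
    if padding == "same":
        return length
    return length - kernel_size + 1


def maxpool1d_length(length, pool_size=2):
    """Output length of non-overlapping max pooling."""
    return length // pool_size


# Model 8: Conv1D (64 filters) -> MaxPooling1D -> flatten
conv = conv1d_length(582, kernel_size=8)   # 575
pooled = maxpool1d_length(conv)            # 287
print(conv, pooled, pooled * 64)           # 575 287 18368

# Model 10: a second Conv1D (64 filters) after pooling, then flatten
second_conv = conv1d_length(pooled, kernel_size=8)  # 280
print(second_conv, second_conv * 64)                # 280 17920

# Model 11: padded convolution keeps the length, pooling halves it
print(maxpool1d_length(conv1d_length(582, 8, padding="same")))  # 291
```

Under these assumptions the computed sizes agree with the entries for models 8, 9 (287 × 128 = 36,736), 10, and 11.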
First, we prepared the data as explained in Section
To make the models comparable, we used the same training data set (Table
Training data set.
Training data set | |
---|---|
Positive reviews | 4820 |
Negative reviews | 4820 |
Total reviews | 9640 |
Mean review length (words) | 53 |
Max review length (words) | 582 |
Test data set.
Test data set | |
---|---|
Positive reviews | 1408 |
Negative reviews | 1377 |
Total reviews | 2785 |
So as not to use a different number of training epochs based on the model, we decided to use a fixed number, 10, since, for every model, the loss value did not improve significantly with longer training.
Table
Test results.
Model | Training time | Good hits | Bad hits | False good | False bad | Accuracy (%) |
---|---|---|---|---|---|---|
1 | 2208 | 1227 | 1228 | 149 | 181 | 88.15 |
2 | 2734 | 1211 | 1239 | 138 | 197 | 87.97 |
3 | 4658 | 1215 | 1237 | 140 | 193 | 88.04 |
4 | 4406 | 1198 | 1261 | 116 | 210 | 88.29 |
5 | 4388 | 1213 | 1256 | 121 | 195 | 88.65 |
6 | 10,630 | 1221 | 1263 | 114 | 187 | 89.19 |
7 | 13,574 | 1190 | 1261 | 116 | 218 | 88.01 |
8 | 6994 | 1210 | 1247 | 130 | 198 | 88.22 |
9 | 9139 | 1206 | 1235 | 142 | 202 | 87.65 |
10 | 1765 | 1166 | 1268 | 109 | 242 | 87.40 |
11 | 1602 | 1187 | 1272 | 105 | 221 | 88.29 |
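The accuracy column follows directly from the four counts in each row. The sketch below recomputes it for two rows of the table; the function names are illustrative, and the per-class hit rates are an additional reading of the same counts rather than figures reported in the paper.

```python
def accuracy(good_hits, bad_hits, false_good, false_bad):
    """Percentage of test reviews classified correctly."""
    total = good_hits + bad_hits + false_good + false_bad
    return 100.0 * (good_hits + bad_hits) / total


def hit_rates(good_hits, bad_hits, false_good, false_bad):
    """Percentage of positive and of negative reviews classified correctly."""
    positives = good_hits + false_bad  # reviews that are actually "Good"
    negatives = bad_hits + false_good  # reviews that are actually "Bad"
    return 100.0 * good_hits / positives, 100.0 * bad_hits / negatives


model_1 = (1227, 1228, 149, 181)
model_6 = (1221, 1263, 114, 187)
print(round(accuracy(*model_1), 2))  # 88.15
print(round(accuracy(*model_6), 2))  # 89.19
```

In every row, good hits plus false bad adds up to the 1408 positive test reviews, and bad hits plus false good adds up to the 1377 negative ones.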
Models 1 to 7 have the same structure, the number of memory units varying as shown in Table
LSTM comparison.
Figure
LSTM comparison.
The second part of the experiment consisted of checking for a significant variation when the number of filters was changed in the CNN models. Specifically, we compared models 8 and 9 with one another, yielding the results shown in Figure
CNN comparison.
To complete the study, we compared the results for the previous models that yielded the best outcome (model 6 LSTM and model 8 CNN) with models 10 (two-layer CNN) and 11 (CNN and LSTM). As Figure
Comparison of LSTM and CNN models.
Figure
LSTM comparison.
In this paper, we considered the problem of predicting sentiment in tourist reviews taken from eWOM platforms for hotels at an important international tourist destination. The use of techniques and automatic tools such as those considered in this paper is very useful for tourism industry practitioners [

- Comparison of feeling about another local competitor or tourist destination (market positioning)
- Performing proactive customer service management, generating a job ticket when a negative review is detected (customer management)
- As a measure of indicators to start a campaign to improve the reputation (marketing management)
- As a measure of risk indicators that affect the hotel or destination image (risk management)
Once the models studied in this article have been trained, they can be used in combination with other tools (review extraction or dashboards).
We used deep-learning techniques to devise different predictors based on neural networks, which were trained with the extracted data to compare the accuracies of each.
The predictors evaluated were based on recurrent neural networks with LSTM cells and on convolutional neural networks. Different designs were considered for each. The methodology was checked by training and validating the models with samples taken from Booking and TripAdvisor.
The results show that LSTM neural networks outperform CNN. The optimum result for CNN is attained with a single convolutional layer and 64 channels. More layers or more channels result in symptoms of overfitting. The LSTM neural networks yield higher accuracies, with one LSTM with a vector length of 300 for the internal state yielding an accuracy just over 89%, the highest for any model.
Finally, the results show that the better results of the neural networks are due primarily to their advantage when classifying the positive comments. They also show that combining convolutional layers with recurrent LSTM layers does not yield any advantages.
The comment tourist review data used to support the findings of this study have been deposited in the GitHub repository
The authors declare that there is no conflict of interest regarding the publication of this paper.
This work is sponsored by the “VITUIN: Vigilancia Turística Inteligente de Tenerife en Redes Sociales” project, through research funds of the Fundación CajaCanarias.