Using Internet Search Trends and Historical Trading Data for Predicting Stock Markets by the Least Squares Support Vector Regression Model

Historical trading data, which are inevitably associated with the framework of causality both financially and theoretically, have been widely used to predict stock market values. With the popularity of social networking and Internet search tools, the ways of collecting information have diversified. Instead of relying only on theoretical causality in forecasting, the importance of data relations has risen. Thus, the aim of this study was to investigate the performance of forecasting stock markets with data from Google Trends, historical trading data (HTD), and hybrid data. The keywords employed for Google Trends were collected in three different ways: users' definitions (GTU), trending searches of Google Trends (GTTS), and tweets (GTT). The hybrid data include Internet search trends from Google Trends and historical trading data. In addition, the correlation-based feature selection (CFS) technique was used to select independent variables, and a one-step-ahead policy was adopted by the least squares support vector regression (LSSVR) for predicting stock markets. Numerical experiments indicate that using hybrid data can provide more accurate forecasting results than using only historical trading data or only data from Google Trends. Thus, using hybrid data of Internet search trends and historical trading data with LSSVR models is a promising alternative for forecasting stock markets.


Introduction
With the advances of the Internet and communication in recent years, the increasing amount of data from social networks has changed the ways of collecting and analyzing data. Google Trends (http://www.google.com/trends) can be used to examine search trends of keywords. Hence, data from Google Trends have been applied to many fields such as economics, elections, and medicine. Compared with structured data, data collected from social networks offer another way to depict the issues concerned, and thus, some other interesting and essential insights that are not captured by traditional data collection may be discovered. Stock markets have been hard to predict ever since they began, yet they have profound effects on a country. In the past, the forecasting of stock markets has relied heavily on historical trading data. Most forecasting models using historical trading data are theoretically based on causality. Due to the popular use of Internet search, people tend to seek data or information from the Internet and express opinions on social networks. Stephens-Davidowitz [1] indicated that when social censoring issues are studied, Internet search behaviors can better reflect the real thinking of people than survey data, and the data are obtained closer to real time [2][3][4][5][6]. However, the importance of historical trading data in forecasting stock market values should not be disregarded. This study attempts to incorporate the data from Google Trends and historical trading data together to predict stock markets. The performance of hybrid data and of each single data type in forecasting stock market closing values was examined in this investigation.
Five stock market indices, namely, the Dow Jones Industrial Average Index (DJIA), Nasdaq Composite Index (IXIC), Russell 2000 Index (RUT), Standard & Poor's 500 Index (S&P 500), and Chicago Board Options Exchange Volatility Index (VIX), and three companies, the Apple corporation (AAPL), the Alphabet corporation (GOOGL), and the Microsoft Corporation (MSFT), were forecasted by least squares support vector regression models with different data types. The rest of this article is organized as follows: Section 2 provides the related work. Section 3 introduces the methods employed in this study. Section 4 illustrates the proposed stock-forecasting framework and numerical examples. Section 5 draws conclusions.

Related Work
Hassan [7] noted that predicting stock markets using complex calculations does not help much. The author proposed a forecasting technique combining the hidden Markov model and fuzzy concept to predict stock markets. The results showed that the presented model outperformed the autoregressive integrated moving average model, the neural network model, and other hidden Markov models. Hadavandi et al. [8] claimed that a successful forecasting model for stock markets is one that can obtain accurate forecasting results with the smallest amount of input data and the simplest stock market model. Their article combined genetic fuzzy systems and neural networks to forecast stock markets for information technology companies and airline companies. In the data-preprocessing stage, stepwise regression analysis was used to pick factors, and the self-organizing map approach was then employed to cluster the data. The experimental results showed that the proposed approach can obtain more accurate results than some other forecasting methods. Singh and Borah [9] designed a forecasting model consisting of fuzzy theory and the particle swarm optimization technique to predict stock markets by using historical data from the State Bank of India. The numerical results illustrated that the proposed forecasting model is superior to the grey model, artificial neural networks, and regression models.
Another tendency in forecasting stock markets is to put financial indicators into forecasting models. Laboissiere et al. [10] developed a model including correlation analysis and artificial neural networks to predict stock prices of Brazilian electric companies. In addition to the historical trading data, some indices such as the Ibovespa index, the Electric Power index, and the American dollar quote were employed to predict stock prices. The numerical results were promising in terms of forecasting accuracy. Lincy and John [11] presented a multiple fuzzy inference systems model to predict prices of selected stocks on the Nasdaq stock exchange. Four indicators, Moving Average Convergence/Divergence, Relative Strength Index, Stochastic Oscillator, and Chaikin Oscillator, were used by the proposed model, and decision rules were generated by using fuzzy set theory and multicriteria decision-making approaches. Simulation results revealed that the presented model is a useful way to analyze stock prices in terms of profit return. de Oliveira et al. [12] used artificial neural networks to forecast Petrobras' PETR4 stock with fundamental and technical factors which may influence stock markets. After the data-preprocessing procedure, the essential factors that remained were used by artificial neural networks. This study reported that the testing accuracy of stock market directions was more than ninety percent. Göçken et al. [13] applied metaheuristics, which were employed to select essential indicators, and artificial neural networks to stock price prediction. In addition, this study examined the suitable number of hidden neurons in the hidden layer in order to deal with the overfitting or underfitting problems of artificial neural networks. The results indicated that the proposed forecasting model was a superior way to predict stock markets.
Because the use of social networks is booming, data from social networks offer valuable insights into what people think and want. Thus, these data have become more and more popular for collecting opinions and for forecasting. Stephens-Davidowitz [1] studied the relation between voting in the American presidential election and racially charged language. The author pointed out that Google search queries were more useful than survey data when social censoring issues were investigated. The results showed that there was a relation between voting and the search queries of racial animus. Gunn III and Lester [5] employed Google Trends with three terms to analyze the relation between the three terms and monthly suicide rates. They reported that the information from Internet searches is correlated with the number of suicides, and thus, it is a faster way of monitoring possible suicide trends than compiling suicide statistics. Yang et al. [14] analyzed the relation between Internet search trends and suicide death. The conclusions revealed that suicide-related search terms were related to suicide death, and thus, keyword-driven Internet search results are essential knowledge for reducing suicide deaths. Frijters et al. [4] conducted a study of the relationship between macroeconomic conditions and an indicator of problem drinking derived from Google searches. The results showed that macroeconomic conditions are associated with health in some ways, and the real-time data provided by Google searches are crucial information for policy-makers. Smith [15] investigated the volatility in forecasting foreign currency exchange rates by using three Google search keywords and time-series models.
The results demonstrated that the information from Google searches is important in forecasting the market for foreign currency. Fondeur and Karamé [16] used Google search data to enhance the prediction accuracy of youth unemployment in France. The results indicated that Google search data did improve the prediction of unemployment. Li et al. [17] used both statistical data and Google search data to predict the consumer price index by a mixed-data sampling model. Numerical results revealed that the proposed approach was helpful in forecasting the consumer price index by using data from user-generated content. Takeda and Wakao [18] studied the relation between Google search intensity, stock trading volume, and stock prices. It was reported that the positive relationship between Google search intensity and trading volume is stronger than that between Google search intensity and stock prices. Araz et al. [2] used Google Flu Trends data to forecast influenza-like illness, and a strong positive relation between Google Flu Trends data and influenza-like illness was revealed. In addition, using Google Flu Trends data as independent variables can result in accurate forecasting results. Some studies have examined the relation between Internet searches and health-related topics, such as disease-related genes [19], kidney stones [20,21], epilepsy [3,22], allergy [23], and restless legs [24]. Most data on social networks are unstructured. Therefore, to find meaningful information from social networks, text mining has been one of the major tools employed. Mostafa [25] used tweet samples on some famous companies to analyze sentiments of users to forecast the Prosperity index of each company. This investigation concluded that text mining in social networks is a helpful way to capture consumers' views and preferences regarding products. Ikeda et al. [26] investigated Japanese Twitter users and developed a hybrid text-based and community-based method for predicting the demographic groups of Twitter users. The proposed method can analyze a Twitter user's hobby, occupation, marital status, age, gender, and area. The authors reported that the proposed hybrid method can increase the precision of the text-based method. He et al. [27] collected social media data from both companies' own sites and their competitors' sites in the pizza industry.
This study indicated that social media competitive analysis is essential and can help companies to form marketing strategies. Yu and Wang [28] gathered real-time tweets during 2014 World Cup games and employed text mining tools to distinguish positive and negative comments which may reflect the moods of soccer fans during matches. This study showed that opinions of sports fans can be learned from Twitter, and the results were fairly close to the predictions of the disposition theory. Chae [29] used a collection of Twitter hashtags related to the supply chain to gain some insight into supply chain management. The presented model consists of four approaches: descriptive analytics, content analytics, integrating text mining and sentiment analysis, and network analytics. Some interesting and valuable conclusions have been reached from the studies on the professional use of Twitter, organizational use of Twitter, and supply chain research, respectively.

Methodology
Proposed by Hall [30], the correlation-based feature selection (CFS) is a feature identification technique used for determining features with critical influence on prediction classes. The influence of features is related to the correlation between the feature and the prediction class labels. The correlation function is represented as follows:

$$\mathrm{DOF}_p = \frac{NV\,\bar{r}_{qi}}{\sqrt{NV + NV\,(NV-1)\,\bar{r}_{ii}}} \tag{1}$$

where $\mathrm{DOF}_p$ is the degree of importance of a feature subset $p$, $NV$ is the number of features in the subset $p$, $\bar{r}_{qi}$ is the average correlation between the features $i$ in the subset $p$ and the class $q$, and $\bar{r}_{ii}$ is the average intercorrelation between features. The best-first search algorithm [31] was employed to generate the appropriate feature subset, and the Weka [30,32] software was utilized to perform CFS in this investigation.

The support vector machines [33,34] model has been one of the most prevalent classification techniques in the past two decades. The support vector machines model was extended to cope with regression problems, and the support vector regression [35][36][37] has become popular in solving function approximation problems. Both support vector machines and support vector regression have to solve quadratic programming problems during the problem-solving process. This is a time-consuming task.
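To make the CFS merit introduced at the start of this section concrete, the following is a minimal Python sketch, not the Weka implementation used in the study. It assumes Pearson correlation as the correlation measure and replaces the best-first search with a simpler greedy forward search; the function names `cfs_merit` and `greedy_forward_cfs` are hypothetical.

```python
import numpy as np


def cfs_merit(X, y, subset):
    """CFS merit: NV * r_q / sqrt(NV + NV * (NV - 1) * r_ii).

    Assumption: absolute Pearson correlation is used for both the
    feature-target and the feature-feature correlations.
    """
    subset = list(subset)
    nv = len(subset)
    # Average absolute correlation between each selected feature and the target.
    r_q = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if nv == 1:
        return float(r_q)
    # Average absolute pairwise correlation among the selected features
    # (diagonal entries, which equal 1, are excluded).
    corr = np.abs(np.corrcoef(X[:, subset], rowvar=False))
    r_ii = (corr.sum() - nv) / (nv * (nv - 1))
    return float(nv * r_q / np.sqrt(nv + nv * (nv - 1) * r_ii))


def greedy_forward_cfs(X, y):
    """Greedy forward search (a simplification of best-first search)."""
    remaining = set(range(X.shape[1]))
    selected, best = [], 0.0
    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break  # no candidate improves the merit
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best
```

The merit rewards features that correlate with the target and penalizes features that correlate with each other, so a redundant or irrelevant column lowers the score of a subset rather than raising it.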
The restriction of solving a quadratic programming problem has been overcome by transferring the quadratic programming problem into a set of linear equations. The least squares support vector regression (LSSVR) [38] model can be represented as follows:

$$\min_{w,\,p,\,\xi}\ \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^{2} \quad \text{subject to} \quad y_i = w^{T} \phi(x_i) + p + \xi_i,\quad i = 1, \ldots, N \tag{2}$$

where $w$ is the weight vector or the normal of the hyperplane, $\gamma$ is the penalty parameter that controls the balance between the minimization of estimation error and the smoothness of the estimated function, $\xi_i$ is the error of the $i$th sample point, $\phi(x_i)$ is the nonlinear function mapping $x_i$ from the original space into a high-dimensional feature space, $p$ is the bias parameter, and $x_i$ and $y_i$ are the input data and the output value, respectively. Because the optimization problem is difficult to solve directly, the Lagrange function is developed, and the dual problem can be represented as follows:

$$L(w, p, \xi, v) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} \xi_i^{2} - \sum_{i=1}^{N} v_i \left[ w^{T} \phi(x_i) + p + \xi_i - y_i \right] \tag{3}$$

where $v_i$ are the Lagrange multipliers.
Based on the Karush-Kuhn-Tucker conditions [39][40][41], the solution of the problem is achieved when all derivatives of the above function are equal to zero.
The optimal conditions are shown as follows:

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} v_i \phi(x_i), \qquad \frac{\partial L}{\partial p} = 0 \Rightarrow \sum_{i=1}^{N} v_i = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow v_i = \gamma \xi_i, \qquad \frac{\partial L}{\partial v_i} = 0 \Rightarrow w^{T} \phi(x_i) + p + \xi_i - y_i = 0 \tag{4}$$

By removing $w$ and $\xi_i$ from (4), the following linear equation can be obtained:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & K + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} p \\ v \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \tag{5}$$

where $K$ is a kernel matrix determined by

$$K_{ij} = \phi(x_i)^{T} \phi(x_j) = K(x_i, x_j) \tag{6}$$

where $K(x_i, x_j)$ indicates the kernel function satisfying Mercer's condition [42]. In this study, the radial basis function represented by (7) was employed as a kernel function:

$$K(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right) \tag{7}$$

where $\sigma$ is the kernel width. By solving (5), $v_i$ and $p$ can be obtained, and the LSSVR function is represented as follows:

$$f(x) = \sum_{i=1}^{N} v_i K(x_i, x) + p \tag{8}$$

The Proposed Stock Market-Forecasting Framework and Numerical Examples
Figure 1 shows the framework of this study. Three major types of data, namely, data from Google Trends, historical trading data, and hybrid data, were gathered in this study. When Google Trends data are used as independent attributes for making a forecast, the choice of related search keywords strongly influences the forecasting results. Thus, in this study, keywords of Google Trends were collected in three ways: users' definitions (GTU), trending searches of Google Trends (GTTS), and tweets (GTT). Firstly, for collecting GTU data, users specified keywords subjectively with some domain knowledge or intuition. Secondly, keywords of Google Trends were gathered by the GTTS approach. Google Trends has a way to calculate keywords' activity levels, namely, trending searches of Google Trends. When a specific term is considered, the results show other related keywords from the highest activity level to the lowest one. Then, the keywords of trending searches are ranked, and users can select keywords in terms of the ranking. The third way of generating keywords for Google Trends is the GTT method, which collects texts on Twitter. When keywords for Google Trends were obtained from Twitter, the "word clusters" tool provided by KH Coder [43] was employed in this study to select the first 100 terms according to the scores calculated.
For all three methods of generating keywords for Google Trends, only keywords with Google Trends scores were used as independent variables to forecast stock markets in this study. Some keywords have no scores due to their low search frequencies. Three hybrid data sets shown in Table 1 were generated by combining the historical data set with the three data sets of Google Trends. Hybrid data I, hybrid data II, and hybrid data III represent historical data with data of GTU, GTTS, and GTT, correspondingly. Then, the correlation-based feature selection technique was performed to determine essential independent variables to predict stock markets. Since GTU data and historical trading data have a small number of features, all data sets except the GTU data and historical trading data were processed by the feature selection procedure. Therefore, a total of 12 types of independent variables were used in this study to forecast stock markets. A one-step-ahead policy was employed to predict values of stock markets for all data sets. All 12 types of data were divided into three parts, namely, training data, validation data, and testing data, for LSSVR models to predict five stock markets. The training and validation data were used to select the LSSVR models, and the testing data were utilized to evaluate the forecasting performance of LSSVR models. In addition, genetic algorithms [44] were employed to determine parameters of LSSVR models [45]. The mean absolute percentage error (MAPE) and mean absolute error (MAE) were used to measure the performance of LSSVR models. The MAPE can be represented as follows:

$$\mathrm{MAPE}(\%) = \frac{100}{N} \sum_{t=1}^{N} \left| \frac{A_t - F_t}{A_t} \right| \tag{9}$$

where $N$ is the number of forecasting periods, $A_t$ is the actual value at period $t$, and $F_t$ is the forecasting value at period $t$. The point-to-point comparisons of actual and predicted values by using various data to forecast values of stock markets and corporations are presented in Figures 2-9. The experimental results revealed that using hybrid data with LSSVR models does improve forecasting performance on closing values of five stock markets and three corporations.
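As an illustration of how the LSSVR model and the error measures used in this framework fit together, the following minimal Python sketch solves the LSSVR linear system with a radial basis function kernel and computes MAPE and MAE. It is a sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the penalty parameter `gamma` and kernel width `sigma` are passed in as fixed values instead of being tuned by genetic algorithms, and the MAE is taken in its standard form as the mean of absolute errors.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    # Squared Euclidean distances between every pair of rows.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def lssvr_fit(X, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [p; v] = [0; y] for p and v."""
    n = len(y)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]  # bias p, Lagrange multipliers v


def lssvr_predict(X_train, v, p, X_new, sigma):
    """Evaluate f(x) = sum_i v_i K(x_i, x) + p for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ v + p


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))


def mae(actual, forecast):
    """Mean absolute error (standard definition, assumed here)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(actual - forecast)))
```

For a one-step-ahead forecast, each row of `X` would hold the independent variables observed at period t and the matching entry of `y` the closing value at period t + 1. Candidate pairs of `gamma` and `sigma` could then be compared by their MAPE on the validation data before the testing data are scored.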

Conclusions
Many forecasting models have been proposed for stock market forecasting in the past decades. Due to the rise of social networking and Internet search tools, types of data employed for predicting stock markets became diversified.
This study proposed a framework to explore the influence of Internet search trends, historical trading data, and hybrid data on the prediction of stock markets by the least squares support vector regression models. Numerical experiments indicate that using hybrid data can provide satisfactory forecasting results. The superior performance and success of the proposed framework are most likely owing to combining the unique advantages of data from Internet searches and historical trading data. Empirically, the Google data may capture a part of the nonlinear data patterns [47], and therefore, the variety of the data has a chance to improve the forecasting performance. The promising results achieved in this study reveal the potential of the proposed framework for forecasting stock markets. Keywords of Google Trends significantly affect the forecasting accuracy: Naccarato et al. [48] pointed out that the selection of keywords results in different data sets for analysis and thus generates different numerical results.
This study provided three ways, namely, users' definitions, trending searches of Google Trends, and tweets, to determine keywords for Google Trends. The three ways can be easily and systematically reproduced for future use. Developing more advanced techniques for determining appropriate keywords for Google Trends could be an essential direction for future study. In addition, numerical examples from developed markets were employed to depict the proposed framework. For emerging markets, owing to the restriction of languages used for Twitter and Google Trends, some hurdles have to be overcome before the performance of the proposed framework can be analyzed.

Data Availability
The data used to support the findings of this study are included within the article through website links.

Conflicts of Interest
The authors declare that there are no conflicts of interest.