Incorporating Research Reports and Market Sentiment for Stock Excess Return Prediction: A Case of Mainland China

The prediction of stock excess returns is an important research topic in quantitative trading, and stock price prediction based on machine learning is receiving increasing attention. This article takes Chinese A-share data from July 2014 to September 2017 as the research object and proposes a method for forecasting stock excess returns that combines research reports and investor sentiment. The proposed method measures the research reports on individual stocks released by analysts, splitting them into two indicators, research report attention and rating sentiment; calculates investor sentiment based on external market factors; and uses an LSTM model to represent the time series characteristics of stocks. The results show that (1) the proposed algorithm outperforms the benchmark algorithms on both the accuracy and F1 evaluation indicators. (2) The deep learning LSTM algorithm performs better than the traditional machine learning algorithm SVM. (3) Using investor sentiment as the initial hidden state of the model improves the accuracy of the algorithm. (4) Using the attention and rating sentiment of the split research report, together with price, as the input of the model effectively improves the performance of the model.


Introduction
Stock price prediction is the task of predicting future stock prices based on stock price information at past or current moments. Traditional quantitative investment methods mostly forecast future stock prices based on experience. Such methods often have weak antirisk capabilities, poor long-term forecasting capabilities, and slow analysis speed, and they are not convenient for dissemination and promotion. Subsequently, statistical and finance-based stock analysis methods appeared, which belong to the traditional machine learning category. Most of them use autoregressive models, random fluctuation models, and Markov models to make predictions. Compared with empirical methods, these methods are faster and more accurate, but their disadvantage is that they can process only limited information and cannot fully account for the many external market factors that cause stock price fluctuations.
Thanks to the massive financial data made available by the continuous development of big data technology, artificial intelligence methods can now be applied to financial analysis. Therefore, more and more researchers have begun to use machine learning or deep learning methods to analyze and predict stock prices. Artificial intelligence methods have demonstrated excellent performance on large-scale datasets, as verified in areas such as images [1,2] and text [3][4][5]. It is foreseeable that such methods can solve many problems in current stock price prediction models. Because of differences in policies, internal environment, and investor attributes, stocks follow different rules in different markets. The stock market in mainland China is an emerging capital market. Its regulatory policies are imperfect and most of its investors are retail investors, so media reports can affect the trend of stock prices to a great extent. Ding and Sun's [6] research shows that in the Chinese A-share market, the behavior of ordinary investors in buying and selling stocks is largely affected by research reports issued by financial institutions. Compared with news media reports, which focus on the occurrence of events and describe the original events, research reports focus more on the financial and market-related attributes related to stock prices, with the purpose of predicting stock prices. At the same time, as information publishers, securities analysts have a more professional industry background and richer information channels than ordinary financial news reporters, so for ordinary investors, direct and professional research reports are important references for investment decisions.
In order to better predict stock prices in the Chinese mainland stock market, we propose a method that combines research reports and market sentiment to predict abnormal stock returns. The proposed method measures the research reports on individual stocks released by analysts and splits them into attention and rating sentiment, calculates investor sentiment based on external market factors, and uses the LSTM model to represent the time series characteristics of the stock. We selected A-share data from July 1, 2014, to September 30, 2017, for experiments and compared different algorithms and different inputs. Based on the experimental results, we found that (1) the proposed algorithm is better than the benchmark algorithms in both accuracy and F1; (2) the deep learning LSTM algorithm performs better than the traditional machine learning algorithm SVM; (3) using investor sentiment as the initial hidden state of the model improves the accuracy of the algorithm; and (4) using the attention and rating sentiment of the split research report, together with price, as the input of the model effectively improves the performance of the model.
The rest of this article is organized as follows. Section 2 reviews the literature on machine learning-based stock price prediction and on the impact of research reports on stock prices. Section 3 introduces our proposed method. Section 4 presents the experimental design and details. Section 5 presents the experimental results and discussion. Section 6 gives our conclusions and directions for future work.

Machine Learning-Based Stock Price Prediction.
In the traditional machine learning field, Xiang [7] used an improved gradient boosting decision tree (GBDT) to predict stock prices. This model can mine the relevant features of the current stock series, but the GBDT model structure itself is not suited to sequential data such as stocks.
Du et al. [8] used a Bayesian learning (BL) model to predict stock prices. This model is similar to the autoregressive integrated moving average (ARIMA) model: based on statistical knowledge, it learns the characteristics of the stock sequence. However, the BL model itself is not well suited to sequence data. In the field of deep learning, Tsantekidis et al. [9] proposed a stock price prediction model based on a CNN encoder. CNN is a very effective model for image input. To adapt it to sequence data, the sequence data are first encoded with the encoder, and the CNN is then trained on the encoded data.
This method is very similar to filtering in signals and systems theory: the sequence data can be regarded as a time signal, and the CNN can be regarded as a filter applied by convolution. Bao et al. [10] combined an autoencoder (AE) with long short-term memory (LSTM), a special architecture based on the recurrent neural network (RNN) whose neural unit structure makes it very suitable for processing sequence data such as stocks. This method first trains an autoencoder to encode the stock sequence and then uses the LSTM network for training. Building on these basic deep learning models, more studies have considered the basic characteristics of the stock market and incorporated them into the method. Zhang and Tan [11] used historical price data to predict the future return ranking of stocks through a new stock selection model based on deep neural networks. Li et al. [12] established a system that uses a deep learning architecture to improve feature representation and uses extreme learning machines to predict market impact. They concluded that the feature representation of deep learning together with extreme learning machines can provide better accuracy of market impact predictions. Li et al. [13] obtained sentiment vectors through sentiment analysis of news articles and added them to the LSTM model to predict stock prices. Their experiments on the Hong Kong stock market showed good performance.

Stock Price Prediction Based on Research Reports.
Lee et al. [14] believe that after being affected by media sentiment, investors will form a combined subjective and objective judgment on future capital flows and investment risks, which is called "investor sentiment." When investor sentiment is extremely optimistic or pessimistic, stock price volatility increases. At the same time, for purposes such as promotion, commission income, and winning customers or business, the research reports written by securities companies are not always neutral; they convey information with serious selective bias to the market to meet the needs of investors, and such bias is often optimistic. Using data from the "Abreast of the Market" column on the Wall Street Journal's website as a sample, Tetlock [15] constructed a media pessimism index and found that abnormally high or abnormally low media pessimism can cause temporary activity in market trading behavior. Hribar and McInnis [16] found that when investor sentiment is high, analysts' optimism tends to be more obvious. The existence of optimism tends to distort stock prices and seriously affects investor decisions. Zhao et al. [17] proposed that in companies with high stock price synchronization, analysts' optimism tends to have a weaker impact on the accuracy of their subsequent earnings forecasts. Xu et al. [18] believe that optimism tends to lead to high transaction volumes, but it also easily leads to negative news of listed companies not being disclosed in a timely manner and to the risk of future stock price crashes. Lu and Chen [19] found that the impact of extreme optimism and extreme pessimism on the stock price index is asymmetric, and short-term extreme pessimism has a negative relationship with the stock price index.

Task.
The rise and fall of a stock determine its rate of return, and most stock prices fluctuate with changes in the stock market environment. When the stock market is in a "bull market," most stocks will rise, and when the stock market is in a "bear market," most stocks will follow the trend and fall. Simply predicting the rise or fall of a stock on the next trading day therefore cannot objectively reflect the stock's return. For this reason, this article uses stock excess returns as the research object and explores whether the excess return obtained by an individual stock over a certain future time interval is positive, negative, or par. The excess return of an individual stock is calculated as follows:

$$AR_{k,t} = R_{k,t} - ER_{k,t}$$

where $AR_{k,t}$ is the abnormal rate of return of stock k on day t, $R_{k,t}$ is the actual rate of return of stock k on day t, and $ER_{k,t}$ is its expected yield (or expected normal return). There are many methods to calculate the expected normal rate of return. In order to exclude the part of the return that is related to market returns, this article uses the market model of Malkiel and Fama [20]. The actual yield of an individual stock can therefore be expressed as

$$R_{k,t} = \alpha_k + \beta_k R_{m,t} + \mu_{k,t}$$

where $R_{m,t}$ is the market return rate on day t, $\mu_{k,t}$ is a random error term, and estimating $\alpha_k$ and $\beta_k$ yields the values $\hat{\alpha}_k$ and $\hat{\beta}_k$. The model that measures the expected normal return is

$$ER_{k,t} = \hat{\alpha}_k + \hat{\beta}_k R_{m,t}$$

Finally, the cumulative abnormal return of stock k during the event window $[t_1, t_2]$ is calculated as

$$CAR_k(t_1, t_2) = \sum_{t=t_1}^{t_2} AR_{k,t}$$

Because ordinary investors in China's A-share market have a high turnover rate and an average holding time of about one month, this article selects the cumulative excess returns from the first trading day through the 5th, 15th, and 30th trading days as the forecast targets.
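The market-model calculation above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `market_model_car`, the use of a leading estimation period of `est_len` days to fit the coefficients, and the synthetic return series are all assumptions made for illustration, since the article does not state how the estimation window is chosen.

```python
import numpy as np

def market_model_car(stock_ret, market_ret, est_len, window):
    """Cumulative abnormal return of one stock under the market model.

    stock_ret, market_ret: daily returns aligned by trading day.
    est_len: number of leading days used to estimate alpha and beta
             (an assumption; the article does not specify this period).
    window: (start, end) indices of the event window, inclusive.
    """
    stock_ret = np.asarray(stock_ret, dtype=float)
    market_ret = np.asarray(market_ret, dtype=float)

    # OLS fit of R_k = alpha + beta * R_m on the estimation period.
    # np.polyfit returns the slope first, then the intercept.
    beta, alpha = np.polyfit(market_ret[:est_len], stock_ret[:est_len], deg=1)

    expected = alpha + beta * market_ret   # ER_{k,t}
    abnormal = stock_ret - expected        # AR_{k,t}
    start, end = window
    return abnormal[start:end + 1].sum(), alpha, beta

# Synthetic data: the stock follows the market exactly, except for an
# extra 0.2% daily return on the five days of the event window.
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=150)
stock = 0.001 + 1.2 * market
stock[120:125] += 0.002

car, alpha, beta = market_model_car(stock, market, est_len=120, window=(120, 124))
```

In this synthetic case the fitted coefficients recover the generating values and the cumulative abnormal return equals the injected excess. The sign of `car` (positive, negative, or close to zero) would then give the class label, with the boundary for "par" left to the experimenter.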

Overview.
In order to better capture the time series relationship between stock prices, our method uses LSTM networks [21] as the basic unit of the model and, on this basis, incorporates quantified research reports and market sentiment into the stock price prediction process. Figure 1 shows the structure of our model.
First of all, in order to better indicate the status of a stock in the current stock market, we concatenate the stock price (Price), research report attention ($RRA_{k,t}$), and research report rating sentiment ($RRRS_{k,t}$) to get $x_t$, as shown in the following formula:

$$x_t = [\mathrm{Price}_{k,t};\ RRA_{k,t};\ RRRS_{k,t}]$$

Then, $x_t$ is used as the input of the LSTM, and the initial hidden state of the LSTM is the investor sentiment $Sentiment_m$. The reason for this is that, relative to the volatile stock price, $Sentiment_m$ is stable for a period of time and can be regarded as an indicator of market sentiment in the short term. Finally, the output of the LSTM is passed through the softmax function to obtain the final output of the model. The calculation methods of LSTM, $Sentiment_m$, $RRA_{k,t}$, and $RRRS_{k,t}$ are described in detail in the following sections.

Long Short-Term Memory Network.
The LSTM proposed by Hochreiter and Schmidhuber [21] in 1997 can effectively deal with the long-term dependencies in the sequence, and its structure is shown in Figure 2.
The core of LSTM lies in its memory unit, through which related information is transmitted backward. Theoretically, the memory unit can transfer information during the entire sequence propagation process so that the information at a previous time can be used to predict the output at a later time, which solves the short-term memory problem of the traditional recurrent neural network. In addition, during the backward transfer of information in the memory unit, the LSTM adds or deletes information in the memory unit through three gates. These gates can be seen as different neural networks, which can be trained to automatically learn what information to keep or forget. The process of LSTM processing information is as follows. First, the LSTM uses the "forget gate" to determine which information should be removed. The input is mapped between 0 and 1 by the Sigmoid function. A value close to 1 means the information is retained; otherwise, it is forgotten. The "input gate" is used to determine which information needs to be updated. The Sigmoid function determines whether the information needs to be retained, and the tanh function maps the input value to [−1, 1], thereby generating a new candidate memory unit state. Next, to update the value of the memory unit, the memory unit is first multiplied by the forget gate to discard the information that needs to be forgotten, and the input information obtained from the input gate is then added to obtain the new memory unit value. Finally, the "output gate" decides which memory unit information to output, that is, the hidden state. The calculation formulas of LSTM, in their standard form, are as follows:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where $f_t$, $i_t$, $o_t$, and $C_t$ represent the forget gate, input gate, output gate, and memory unit, respectively. For time step t, $h_{t-1}$ represents the hidden state at the previous time step, W represents a weight matrix, b represents a bias vector, σ represents the Sigmoid function, and ⊙ represents the element-wise multiplication operation.
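The gate computations described above, together with the idea of seeding the hidden state with investor sentiment, can be sketched in plain NumPy. This is an illustrative forward pass under stated assumptions, not the authors' model: the names `lstm_step` and `forward`, the stacked weight layout, the bias terms, the hidden size, and in particular the projection `W_h0` that turns the scalar sentiment into an initial hidden vector are all hypothetical choices, because the article does not describe how the sentiment value is mapped to the hidden dimension.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenation [h_prev, x_t] to the four stacked
    pre-activations (forget, input, output, candidate).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:n])             # forget gate
    i_t = sigmoid(z[n:2 * n])         # input gate
    o_t = sigmoid(z[2 * n:3 * n])     # output gate
    c_tilde = np.tanh(z[3 * n:4 * n]) # candidate memory state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def forward(xs, sentiment, params):
    """xs: (T, 3) array of [price, RRA, RRRS]; sentiment: scalar index."""
    W, b, W_h0, W_out, b_out = params
    n = b.shape[0] // 4
    # Assumed mapping from the scalar sentiment to the initial hidden state.
    h = np.tanh(W_h0 * sentiment)
    c = np.zeros(n)
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
    # Probabilities over the three classes: positive, negative, par.
    return softmax(W_out @ h + b_out)

rng = np.random.default_rng(0)
hidden, n_in, n_cls = 8, 3, 3
params = (
    rng.normal(0, 0.1, (4 * hidden, hidden + n_in)),  # W
    np.zeros(4 * hidden),                             # b
    rng.normal(0, 0.1, hidden),                       # W_h0 (hypothetical)
    rng.normal(0, 0.1, (n_cls, hidden)),              # W_out
    np.zeros(n_cls),                                  # b_out
)
probs = forward(rng.normal(size=(15, n_in)), sentiment=0.3, params=params)
```

The weights here are random, so `probs` carries no predictive meaning; the sketch only shows how the input vector, the sentiment-based initial state, and the softmax output fit together.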

Measure of Research Report.
We measure the research report along two dimensions: attention and rating sentiment. The attention of the research report measures how much attention analysts across the entire market pay to the stock. The rating sentiment indicates an analyst's judgment on the future trend of the target stock.

Attention of Research Report.
Different stocks in the market receive different degrees of attention. We use the ratio of the number of research reports on a stock on a given day to the total number of research reports on all stocks in the A-share market in the current month as a measure of the stock's attention, $RRA_{k,t}$. The larger the value, the higher the analysts' attention to the stock. $RRA_{k,t}$ is calculated as shown in the following formula:

$$RRA_{k,t} = \frac{N_{k,t}}{N_{A,m}}$$

where $N_{k,t}$ is the total number of research reports on stock k on the t-th day and $N_{A,m}$ is the total number of research reports on all stocks in the A-share market in that month (m).
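As a small worked example of this ratio, the following Python sketch counts reports per stock and day and divides by the monthly market total. The function name `research_report_attention`, the input layout of (stock code, publication date) pairs, and the sample records are made up for illustration.

```python
from collections import Counter
from datetime import date

def research_report_attention(reports):
    """Compute RRA for every (stock, day) that has at least one report.

    reports: iterable of (stock_code, publication_date), one pair per report.
    Returns a dict mapping (stock_code, publication_date) to RRA.
    """
    reports = list(reports)
    per_stock_day = Counter(reports)                             # N_{k,t}
    per_month = Counter((d.year, d.month) for _, d in reports)   # N_{A,m}
    return {
        (code, d): n / per_month[(d.year, d.month)]
        for (code, d), n in per_stock_day.items()
    }

# Illustrative records only: four reports in March 2015, one in April 2015.
reports = [
    ("600000", date(2015, 3, 2)),
    ("600000", date(2015, 3, 2)),
    ("000001", date(2015, 3, 2)),
    ("000001", date(2015, 3, 20)),
    ("600000", date(2015, 4, 1)),
]
rra = research_report_attention(reports)
```

Here stock 600000 has two of the four March reports on 2 March, so its attention on that day is 0.5, while the single April report accounts for the whole month and gets 1.0.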

Rating Sentiment Measures in Research Reports.
The text of the research report released by the analyst conveys an optimistic or pessimistic attitude toward the individual company's operating status, future prospects, earnings expectations, investment recommendations, and risk warnings. The report contains two key pieces of information: first, the current rating, and second, the rating change. The current rating provides investment advice on buying, selling, or holding an individual stock; the rating change indicates how the current rating differs from the previously reported rating. In the previous literature, when discussing the impact of stock investment ratings on abnormal returns, the basic ratings and rating changes were always analyzed separately, making it difficult for investors to choose when the two were inconsistent. For example, when an analyst gives a "Hold" rating to an individual stock and the rating change is "Down," it is difficult for investors to determine whether to buy or sell. This article therefore proposes a "research report rating sentiment" index, $RRRS_{k,t}$, which takes into account the two major factors of basic rating and rating change. $RRRS_{k,t}$ is calculated from $R_{k,t}$, the base rating, and $C_{k,t}$, the rating change.
When there are multiple rating results for a stock within a day, this article will average the ratings of these research reports to find a comprehensive rating sentiment.

Measure of Investor Sentiment.
In the actual trading process, investor sentiment affects investors' subjective judgments on future returns. When investor sentiment becomes extremely optimistic or pessimistic, it triggers "irrational" reactions to information and causes market anomalies. To quantify investor sentiment, this article draws on the ideas of Hai-Yuan [22] and selects five indicators for principal component factor analysis: the Shanghai and Shenzhen A-share transaction volume (VOLUME), the number of new investor accounts (NEWIN), the consumer confidence index (CCI), the closed-end fund discount rate (FUND), and the broad market turnover rate (HS_TVR). The initial investor sentiment index for each month is calculated using the respective variance contribution rates as weights. Then, macroeconomic controls are introduced, regression analysis is performed on the variables, and the calculated residual is used as the investor sentiment index. Finally, the simple average of the investor sentiment index over the i periods preceding the research report date (where i = 3) is used to obtain the final investor sentiment index.
In detail, the five indicators are first standardized, and principal component analysis (PCA) is performed on the standardized variables. The three principal components with the highest explained variance are selected, and their eigenvalues are used as weights to obtain the weighted average of the factor loadings, which serve as the coefficients of the preliminary sentiment index. The preliminary sentiment index, $Sentiment_m$, is thus a weighted linear combination of the five standardized indicators. Then, the impact of macroeconomic variables is controlled.
Take the abovementioned preliminary sentiment index as the explained variable and the consumer price index (CPI), the amount of new credit (IC), the rate of economic growth (GDP), and the money supply (M2) as the explanatory variables (the data are standardized in advance to eliminate dimensional impact), and perform a regression analysis of $Sentiment_m$. The residual sequence can be used as an indicator of investor sentiment, the CSI (China Sentiment Index):

$$Sentiment_m = \alpha_0 + \alpha_1 CPI_m + \alpha_2 IC_m + \alpha_3 GDP_m + \alpha_4 M2_m + \varepsilon_m, \qquad CSI_m = \varepsilon_m$$

where $Sentiment_m$ is the preliminary investor sentiment index for month m, $\alpha_0$ is a constant, $\alpha_1$ to $\alpha_4$ are the regression coefficients to be estimated, and $\varepsilon_m$ is the residual.
Finally, considering the lag of investor sentiment, the sentiment indices of the three months before the month in which the research report was published are averaged to obtain the final investor sentiment index.
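The three stages of this construction (PCA-weighted preliminary index, macroeconomic regression residual, and three-month lagged average) can be sketched as follows. This is an approximation under stated assumptions rather than the authors' procedure: the function names are hypothetical, eigenvectors of the correlation matrix stand in for the factor loadings, the sign of each eigenvector is left as returned by the solver (in practice it would need to be fixed so that the index rises with optimism), and the input data are random placeholders.

```python
import numpy as np

def standardize(a):
    """Column-wise z-score."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def preliminary_sentiment(indicators, n_components=3):
    """indicators: (months, 5) array of VOLUME, NEWIN, CCI, FUND, HS_TVR."""
    z = standardize(np.asarray(indicators, dtype=float))
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    weights = eigvals[order] / eigvals[order].sum()
    # Eigenvalue-weighted average of the component loadings (simplified).
    coeffs = eigvecs[:, order] @ weights
    return z @ coeffs

def residual_sentiment(prelim, macro):
    """Regress the preliminary index on standardized CPI, IC, GDP, M2.

    Returns the OLS residuals, used as the sentiment index (CSI).
    """
    X = np.column_stack([np.ones(len(prelim)),
                         standardize(np.asarray(macro, dtype=float))])
    coef, *_ = np.linalg.lstsq(X, prelim, rcond=None)
    return prelim - X @ coef

def lagged_sentiment(csi, lag=3):
    """Mean of the `lag` months before each month; NaN if history is short."""
    out = np.full(len(csi), np.nan)
    for m in range(lag, len(csi)):
        out[m] = csi[m - lag:m].mean()
    return out

# Random placeholder data for 39 months (July 2014 to September 2017).
rng = np.random.default_rng(0)
indicators = rng.normal(size=(39, 5))
macro = rng.normal(size=(39, 4))

prelim = preliminary_sentiment(indicators)
csi = residual_sentiment(prelim, macro)
final = lagged_sentiment(csi, lag=3)
```

Because the regression includes an intercept, the residual index has zero mean over the sample. The first three months have no complete history and are left as NaN in this sketch.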

Data Collection.
This article selects the research reports on 2,225 Chinese A-share companies issued by 66 securities institutions between July 1, 2014, and September 30, 2017, as the research object. The data on the number of research reports published, the date of publication, the title, the basic rating, and the rating changes are from the Oriental Fortune website. Economic data such as stock returns, market value of stocks in circulation, and stock turnover rate are taken from the Wind database. For the selected time period, the trend of the Chinese A-share market can be roughly divided into two phases: July 2014 to June 2015 is the rising period of the stock market, which belongs to the "bull market," and July 2015 to September 2017 is the declining period of the stock market, which belongs to the "bear market." The sample time spans a bull-bear cycle, which can more effectively verify the robustness of the algorithm.

Data Culling.
This article deletes some anomalous data: first, unrated or ambiguous research reports; second, new stock data, because during the consecutive daily price limits that follow a new stock offering, stock price fluctuations do not truly reflect market fluctuations, and during this period few investors can successfully buy new stocks, so the new stocks issued from July to September 2017 are uniformly eliminated; third, research reports on individual stocks issued during a long-term suspension, because a stock under long-term suspension cannot be traded and its unchanging price cannot be compared with the average market price (temporarily suspended stocks are not excluded).

Rating Consolidation.
The reports collected contained a total of 27 different basic ratings, which we reduced to 14 different ratings by merging synonyms. We also identified four different rating changes. Referring to the research by You et al. [23], we assign discrete values to ratings and rating changes. This article assigns the "Neutral" rating the value 1.0 and increases or decreases it by 0.1 according to the change in intensity to obtain the basic rating G, G ∈ [0.6, 1.9]. The specific values of the ratings are shown in Table 1. For the four rating changes of "Up," "First," "Maintenance," and "Down," we assign "Maintenance" the value 1.0 and increase or decrease it by 0.1 according to the rating change to get the rating change C, C ∈ [0.9, 1.2]; the specific values of the rating changes are shown in Table 2.

Detail.
All data are divided into training, validation, and test sets in the ratio 80%, 10%, and 10%. We use categorical cross-entropy as the loss function to optimize the target parameters during the model's backpropagation, which is defined as

$$L = -\sum_{t} y_t \log \hat{y}_t$$

where $y_t$ is the ground truth in one-hot form and $\hat{y}_t$ is the vector of the model's predicted probabilities that the excess return is "positive," "negative," or "par." During model training, the Adam [24] optimizer was used, with the initial learning rate set to 1e−4 and the mini-batch size set to 32.
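The loss and the data split described above can be sketched in a few lines of Python. Two points are assumptions rather than details from the article: the loss is averaged over the samples in a batch, and the split is taken in the existing (for example, chronological) order of the samples without shuffling. The function names are hypothetical.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_true: (batch, 3) one-hot labels for positive / negative / par.
    y_pred: (batch, 3) predicted class probabilities.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a predicted probability is exactly zero.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())

def split_80_10_10(n_samples):
    """Index arrays for train / validation / test, kept in original order."""
    n_train = n_samples * 8 // 10
    n_val = n_samples // 10
    idx = np.arange(n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(y_true, y_pred)

train_idx, val_idx, test_idx = split_80_10_10(1000)
```

For the two samples above, only the probabilities assigned to the true classes (0.7 and 0.8) contribute to the loss, which is the mean of their negative logarithms.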

Results and Discussion
In order to measure the performance of the model from different angles, this paper chooses the classic classification algorithm SVM and the vanilla LSTM model as the benchmark methods to compare with our work. Table 3 shows the results of the experiment. The results show that for the 5-, 15-, and 30-trading-day excess return forecasts, our proposed method achieves the best performance in both accuracy and F1. Across the three horizons, all methods achieve their highest accuracy on the 15-day excess return prediction. This is related to the release cycle of the research reports. According to statistics, the average cycle of all stock research reports in the dataset is 18.9 days. When the period exceeds 20 days, multiple reports overlap, and the latest report affects stock price fluctuations and thus excess returns. Our selected trading cycle includes a "bull market" and a "bear market." Different market states present different trading sentiments, so we further trained on the "bull market" and "bear market" data separately and tested on the test set. Table 4 shows the comparison results. Compared with training on the pooled "bull market" and "bear market" data, the accuracy of all methods improves after training separately according to market conditions. Similarly, our proposed method achieves the best performance in the 5-, 15-, and 30-day excess return forecasts, with the highest accuracy on the 15-day excess return forecast in the "bull market." The table shows that the overall accuracy in the "bull market" is higher than that in the "bear market," and this conclusion is consistent with the research results of Hai-Yuan [22].
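For reference, the two evaluation measures used in these comparisons can be computed for a three-class problem as in the following sketch. The article does not state how F1 is aggregated across the three classes; macro-averaging is assumed here, and the function name and the toy labels are illustrative only.

```python
def accuracy_and_macro_f1(y_true, y_pred, classes=("positive", "negative", "par")):
    """Accuracy and macro-averaged F1 for multi-class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)

# Toy labels, not taken from the experiments.
y_true = ["positive", "positive", "negative", "negative", "par", "par"]
y_pred = ["positive", "negative", "negative", "negative", "par", "positive"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
```

In the toy example four of six predictions are correct, and the per-class F1 values are 0.5, 0.8, and 2/3, which average to roughly 0.656.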
In order to verify the contribution of the added research report metrics and investor sentiment to the model, we delete the corresponding inputs and compare the results. Table 5 shows the results. Among them, LSTM + Sentiment m indicates that the original hidden

Conclusion
Regarding the prediction of excess returns in the mainland Chinese stock market, this article first measures the research reports released by analysts and splits them into two indicators: the attention of the research report and the rating sentiment. Second, we quantify the external environment that may affect stock price changes as investor sentiment. Then, we use the split research report indicators and investor sentiment as the input and the initial hidden state of the LSTM, respectively. Finally, in the experimental comparison, our proposed method achieved the best performance.

Data Availability
All data in this article are from public websites (https://www.eastmoney.com/).

Conflicts of Interest
The authors declare no conflicts of interest.