A PLS Approach to Measuring Investor Sentiment in Chinese Stock Market

We select five objective sentiment indicators and one subjective sentiment indicator to build a composite investor sentiment index for the Chinese stock market using partial least squares (PLS). The aim is to overcome the shortcomings of principal component analysis, which the pioneering research adopted to build composite investor sentiment indexes. Moreover, because individual investors account for a large proportion of the Chinese stock market and investor sentiment changes rapidly, we use weekly data, which have finer information granularity and higher frequency. Empirical tests of the index's reasonableness and of its ability to predict the market show that it fits the data better and improves prediction.


Introduction
Recently, investor sentiment measurement has become one of the more widely examined areas in behavioral finance.
The key to measuring investor sentiment is to find proxy indicators that express sentiment accurately. Ideally, these proxies are observable and quantifiable and reflect investors' views of the market objectively and comprehensively. Investor sentiment proxies are usually divided into three types: single objective sentiment indicators, single subjective sentiment indicators, and composite sentiment indexes. Single indicators are the basic components of a composite index and are used flexibly in different studies. The composite index has theoretical advantages: if it is constructed properly, it yields a more accurate measure of sentiment. Since the pioneering literature, composite indexes have become the mainstream approach to constructing sentiment indexes. Baker and Wurgler [1] used the first principal component of the proxies as their measure of investor sentiment, and this approach has been widely adopted in subsequent research, for example, by Stambaugh et al. [2], Ben-Rephael et al. [3], Chen et al. (2014), Chong et al. [4], Zhigao and Ning [5], and Ma and Zhang [6].
However, the first principal component is the combination of the six proxies that maximally represents their total variation. Since every proxy may approximate the true but unobservable investor sentiment with error, and these errors are part of the proxies' variation, the first principal component can contain a substantial amount of common approximation error that is irrelevant for forecasting returns. Partial least squares (PLS) addresses this problem effectively. Its principal advantage is that it extracts from the sentiment proxies, as far as possible, only the part that reflects investor sentiment, which keeps the extracted component close to the real investor sentiment. For example, Huang et al. [7] use the same six individual sentiment proxies for the American market as Baker and Wurgler [1], namely, the closed-end fund discount rate, share turnover, the number of IPOs, first-day returns of IPOs, the dividend premium, and the equity share in new issues, to propose a new sentiment index based on the PLS method. They call the index extracted in this way the aligned investor sentiment index, and they find that it has greater power in predicting the aggregate stock market than the Baker and Wurgler [1] index.
Discrete Dynamics in Nature and Society
Huang et al. [7] have shown that the PLS method is suitable for constructing an investor sentiment index in the American stock market. In this paper, in order to predict the Chinese aggregate stock market better, we develop a Chinese market sentiment index using the PLS method. The rest of the paper is organized as follows. Section 2 introduces the principle of the partial least squares (PLS) method for constructing indexes. Section 3 constructs the composite index of investor sentiment and then tests its robustness and its power to predict the stock market.
Finally, Section 4 concludes the paper.

Principle of Partial Least Squares (PLS)
Partial least squares (PLS) was first proposed by Wold and Albano in 1983. It supports multivariate regression modeling in small samples. After the improvement by Kelly and Pruitt [8], it can also be used to extract information from variables. Unlike principal component analysis (PCA), PLS decomposes the predictor variables $X$ and the response variable $Y$ simultaneously, extracts components (usually called factors) from both, and then orders the factors from largest to smallest according to the correlation between them. In other words, PLS not only explains the information in the predictor variables well but also summarizes the response variable well and filters out noise in the system. It therefore mitigates the problem that PCA extracts only the information hidden in the predictor variables $X$, which reduces the accuracy of the regression model.

We assume that the one-period-ahead expected closing price of the market index, as explained by investor sentiment, follows the standard linear relation

$$E_t(P_{t+1}) = \alpha + \beta\,\mathrm{SENT}_t, \tag{1}$$

where $\mathrm{SENT}_t$ is the composite investor sentiment index in period $t$ and $P_t$ is the closing price of the China Securities Free Float Index (CSI Free Float) in period $t$. (The CSI Free Float index is composed of the fully tradable shares of the Shanghai and Shenzhen stock markets; its base date is December 30, 2005, its base point is 1000, and it adjusts the market capitalization of the stocks based on all samples.) Equation (1) states that the expected closing price of CSI Free Float in period $t+1$ is related to investor sentiment in period $t$. The realized closing price of CSI Free Float in period $t+1$ is therefore

$$P_{t+1} = E_t(P_{t+1}) + \varepsilon_{t+1} = \alpha + \beta\,\mathrm{SENT}_t + \varepsilon_{t+1}, \tag{2}$$

where $\varepsilon_{t+1}$ is a residual term that is unpredictable and unrelated to investor sentiment $\mathrm{SENT}_t$.

Let $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{N,t})'$ denote the $N \times 1$ vector of individual investor sentiment proxies in period $t$, and assume that each original proxy has the structure

$$x_{i,t} = \eta_{i,0} + \eta_{i,1}\,\mathrm{SENT}_{i,t} + \eta_{i,2}\,E_t + e_{i,t}, \quad i = 1, 2, \ldots, N. \tag{3}$$

We further assume that $\mathrm{SENT}_t$ is a linear combination of the $\mathrm{SENT}_{i,t}$:

$$\mathrm{SENT}_t = \sum_{i=1}^{N} w_i\,\mathrm{SENT}_{i,t}. \tag{4}$$

Here $\mathrm{SENT}_{i,t}$ is the investor sentiment information contained in the original proxy $x_{i,t}$; $E_t$ is the deviation information, which is unrelated to investor sentiment but is related to the closing price of CSI Free Float; $e_{i,t}$ is the idiosyncratic noise contained in proxy $x_{i,t}$; $\eta_{i,0}$ is a constant; $\eta_{i,1}$ and $\eta_{i,2}$ are the sensitivities of proxy $x_{i,t}$ to $\mathrm{SENT}_{i,t}$ and $E_t$, respectively; and $w_i$ is the weight, in the composite index, of the sentiment information contained in proxy $x_{i,t}$. The core of the problem is therefore how to separate the sentiment information $\mathrm{SENT}_{i,t}$ from the structure of each original proxy $x_{i,t}$. PLS is better suited to this task than PCA because it effectively removes the interference of the deviation information and the idiosyncratic noise, so the resulting composite index reflects real investor sentiment.

Combining (2), (3), and (4), we obtain the following relationship between the individual sentiment proxies $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{N,t})'$ and the closing price of CSI Free Float $P_{t+1}$:

$$x_{i,t} = \pi_{i,0} + \pi_i\,P_{t+1} + u_{i,t}, \quad i = 1, 2, \ldots, N, \tag{5}$$

where $\pi_i$ measures the explanatory power of the original proxy $x_{i,t}$ for the closing price of CSI Free Float, $\pi_{i,0}$ is a constant, and $u_{i,t}$ is the remaining error. Together with (2), (3), and (4), this shows that each sentiment proxy can be expressed as a linear function of the closing price of CSI Free Float that has nothing to do with the unpredictable deviation. We therefore take $\pi_i$ in (5) to reflect the contribution of proxy $x_{i,t}$ to the composite investor sentiment index $\mathrm{SENT}_t$. The contribution of each proxy to investor sentiment is determined by the covariance between the proxy $x_{i,t}$ and the closing price of CSI Free Float $P_{t+1}$. Based on the PLS method, the composite investor sentiment index can then be expressed as

$$\mathrm{SENT}^{\mathrm{PLS}}_t = \sum_{i=1}^{N} \omega_i\,x_{i,t}, \tag{6}$$

where $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{N,t})'$ is the vector of original sentiment proxies and $\omega = (\omega_1, \omega_2, \ldots, \omega_N)$ is the vector of weights of the proxy indicators in the composite investor sentiment index.
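The covariance-based weighting described above can be illustrated with a minimal Python sketch. It assumes, as in the first component of a standard single-response PLS, that the weight of each proxy is proportional to its sample covariance with the next-period closing price and that the weight vector is scaled to unit length. The function name `pls_first_component`, the proxy names, and the toy numbers are hypothetical and are not taken from the paper's data.

```python
import math

def covariance(a, b):
    """Sample covariance of two equally long series."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def pls_first_component(proxies, price_next):
    """Weights proportional to cov(proxy, next-period price), scaled to unit length.

    Assumption: this is only the first PLS component; the unit-length scaling
    is a convention chosen for the sketch, not something stated in the paper.
    """
    covs = {name: covariance(series, price_next) for name, series in proxies.items()}
    norm = math.sqrt(sum(c * c for c in covs.values()))
    weights = {name: c / norm for name, c in covs.items()}
    # Composite index: weighted sum of the proxies in each period.
    index = [sum(weights[name] * proxies[name][t] for name in proxies)
             for t in range(len(price_next))]
    return weights, index

# Made-up, already standardized proxies and next-period closing prices.
toy_proxies = {
    "proxy_a": [0.1, 0.3, 0.5, 0.7, 0.9],
    "proxy_b": [0.9, 0.6, 0.5, 0.2, 0.1],
}
toy_price_next = [1000.0, 1050.0, 1100.0, 1180.0, 1250.0]
weights, sent_index = pls_first_component(toy_proxies, toy_price_next)
```

In this toy example the proxy that rises with the price receives a positive weight and the proxy that falls receives a negative weight, which mirrors how the sign of each proxy's contribution follows the sign of its covariance with the closing price.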

Data.
In collecting the indicator data, we take into account that individual investors account for a large proportion of the Chinese stock market; they are easily influenced by short-term market volatility, which leads to irrational speculation. To track changes in investor sentiment more accurately, we adopt weekly data, which have finer information granularity and higher frequency, to capture immediate investor sentiment, rather than the annual or monthly data used in most of the previous literature. The weekly data from January 4, 2008, to May 30, 2014, serve as the training set for constructing the sentiment index. To test the validity and robustness of the construction method, we take the weekly data from June 6, 2014, to May 29, 2015, as the test set, and we use CSI Free Float over the corresponding period to represent the overall performance of Chinese A shares.
After screening candidate proxy indicators, we select five objective indicators, namely, the SWS Low Profit Margin Stock Index (LPM(0)), the SWS High-P/E-Ratio Index (HPEI(0)), the SWS High-P/B-Ratio Index (HPBI(0)), one-period-lag Newly Additional Fund Accounts (NAFA(+1)), and the six-period-lag number of new IPOs (NIPO(+6)), together with one subjective indicator over the same period, the New Fortune Analyst Index (CAI(0)). Following Baker and Wurgler [1], we believe that investor sentiment drives investors' decisions; at the same time, investor sentiment itself is affected by changes in macroeconomic factors. For example, the number of IPOs changes with the macroeconomic cycle. This component reflects an objective analysis of actual macroeconomic conditions; it is the rational part of investors' psychology and falls outside the scope of this study.
Therefore, we separate the rational component of investor sentiment with a multivariate regression model, remove it, and retain only the irrational component:

$$x_{i,t} = \gamma_{i,0} + \sum_{j} \gamma_{i,j}\,\mathrm{Macro}_{j,t} + \xi_{i,t}, \tag{7}$$

where $x_{i,t}$ is the value of original proxy $i$ in period $t$, that is, LPM(0), CAI(0), NAFA(+1), HPBI(0), HPEI(0), or NIPO(+6); $\mathrm{Macro}_{j,t}$ is a series of indicators reflecting macroeconomic fundamentals; $\gamma_{i,j}$ are the parameters to be estimated; and $\gamma_{i,0}$ is a constant. $\xi_{i,t}$ is the regression residual, which represents irrational sentiment net of macroeconomic fundamentals. Taking into account how well the macroeconomic cycle variables represent the economy and the availability of weekly data, we use China's commodity price index (CCPI) and the Central Bank weekly monetary net supply (MNS) as proxy variables for macroeconomic fundamentals.
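As a rough illustration of this orthogonalization step, the following Python sketch regresses one proxy on a constant and two macroeconomic series by ordinary least squares and keeps the residual. It is a sketch under stated assumptions: the helper `ols_residuals` is hypothetical, the normal equations are solved by plain Gaussian elimination, and all numbers are made up rather than actual CCPI, MNS, or proxy observations.

```python
def ols_residuals(y, regressors):
    """Regress y on a constant plus the regressor columns; return (coefficients, residuals)."""
    n = len(y)
    X = [[1.0] + [col[t] for col in regressors] for t in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y.
    A = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(X[t][i] * y[t] for t in range(n)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in X]
    return coef, [obs - fit for obs, fit in zip(y, fitted)]

# Made-up weekly values standing in for CCPI, MNS, and one sentiment proxy.
toy_ccpi = [1.2, 0.8, 1.5, 1.1, 0.9, 1.4]
toy_mns = [0.5, -0.2, 0.3, 0.1, -0.4, 0.2]
toy_proxy = [3.1, 2.4, 3.9, 2.8, 2.2, 3.6]
coef, irrational = ols_residuals(toy_proxy, [toy_ccpi, toy_mns])
```

By construction the residual series sums to zero and is uncorrelated with both macro series, which is the sense in which it excludes macroeconomic fundamentals.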
The residual sequences obtained from the regressions, $\xi_{1,t}$, $\xi_{2,t}$, $\xi_{3,t}$, $\xi_{4,t}$, and $\xi_{5,t+6}$, are denoted by ELPM(0), ECAI(0), EHPBI(0), EHPEI(0), and ENIPO(+6), respectively. They are the proxy variables of irrational investor sentiment after the macroeconomic fundamentals have been removed. Because the selected original proxy variables are not normally distributed, we standardize the indicators with the 0-1 (min-max) method, which subtracts the variable's minimum from each observed value and divides by the variable's range:

$$x^{*}_{t} = \frac{x_t - \min_{s} x_s}{\max_{s} x_s - \min_{s} x_s}. \tag{8}$$

After standardization, ELPM(0), ECAI(0), EHPBI(0), EHPEI(0), and ENIPO(+6) are denoted by sLPM(0), sCAI(0), sHPBI(0), sHPEI(0), and sNIPO(+6). The standardized values of each variable fall between 0 and 1; they are pure numbers without units and can be used directly in the index construction that follows. Descriptive statistics of the selected investor sentiment proxy indicators after this pretreatment are shown in Table 1.
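The 0-1 standardization can be sketched in a few lines of Python. The function name `min_max_scale` and the sample values are hypothetical; the only assumption beyond the text is that a constant series is rejected, since its range is zero.

```python
def min_max_scale(series):
    """Map a series onto [0, 1] by subtracting its minimum and dividing by its range."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series has zero range and cannot be rescaled")
    return [(value - lo) / (hi - lo) for value in series]

# Made-up residual values for illustration.
scaled = min_max_scale([-2.0, 0.0, 6.0])
```

The smallest observation maps to 0 and the largest to 1, so series with different units become directly comparable.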

Investor Sentiment Composite Index Construction.
We use the pretreated investor sentiment proxy series sLPM(0), sCAI(0), sHPBI(0), sHPEI(0), and sNIPO(+6), five indicators in all. First, we must determine the number of components in the model. If too many components are selected, the model is likely to overfit; conversely, if too few are selected, important information is likely to be lost. To find the optimal number, we follow the result of leave-one-out cross-validation and choose the number of components at which the sum of squared errors reaches its minimum or no longer changes appreciably. Table 2 reports the model fit for different numbers of components. From the sums of squared errors in Table 2 obtained by leave-one-out cross-validation, together with Figure 1, we can see that the squared error is almost unchanged once the number of components reaches two. And the cumulative contribution [...]
The correlation coefficients between the composite investor sentiment index $\mathrm{SENT}^{\mathrm{PLS}}$ and each sentiment proxy variable are reported in Table 3.
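The component-selection procedure can be sketched as follows. This is a minimal pure-Python illustration, assuming a single-response PLS fitted by the NIPALS deflation scheme and scored by the leave-one-out prediction error sum of squares (PRESS); the function names and the toy data are hypothetical, and the paper does not state which software or PLS variant was used.

```python
import math

def pls1_fit(X, y, n_comp):
    """Fit single-response PLS by NIPALS on centered copies of X (rows) and y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        # Weight vector: direction of maximal covariance between X and y.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in y to explain
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this component.
        for i in range(n):
            for j in range(p):
                Xc[i][j] -= t[i] * load[j]
            yc[i] -= q * t[i]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict y for one observation x with a fitted model."""
    x_mean, y_mean, comps = model
    xc = [value - mean for value, mean in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in comps:
        score = sum(a * b for a, b in zip(xc, w))
        y_hat += q * score
        xc = [a - score * b for a, b in zip(xc, load)]
    return y_hat

def loo_press(X, y, max_comp):
    """Leave-one-out sum of squared prediction errors for 1..max_comp components."""
    press = []
    for k in range(1, max_comp + 1):
        total = 0.0
        for i in range(len(X)):
            model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], k)
            total += (y[i] - pls1_predict(model, X[i])) ** 2
        press.append(total)
    return press

# Made-up data: three predictors and a response that is exactly linear in them.
toy_X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 0.0], [4.0, 3.0, 2.5],
         [5.0, 6.0, 1.0], [6.0, 5.0, 3.0], [7.0, 8.0, 0.5], [8.0, 7.0, 2.0]]
toy_y = [1.0 + 2.0 * a - b + 0.5 * c for a, b, c in toy_X]
press = loo_press(toy_X, toy_y, 3)
```

One would then pick the smallest number of components beyond which the PRESS values stop falling appreciably, which is the rule described in the text.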
The correlation statistics show that sLPM(0), sCAI(0), sHPBI(0), and sHPEI(0) are the most highly correlated with the composite investor sentiment index $\mathrm{SENT}^{\mathrm{PLS}}$, with correlation coefficients of 0.9350, 0.9439, 0.9626, and 0.9704, respectively. The correlation coefficient between sNIPO(+6) and $\mathrm{SENT}^{\mathrm{PLS}}$ is 0.4913. The signs of the factor composition coefficients show that all variables other than sNIPO(+6) enter with positive coefficients. That is, sLPM(0), sCAI(0), sHPBI(0), and sHPEI(0) are positive indicators in the composite index built with the PLS method, which is broadly consistent with theoretical expectations. In contrast, sNIPO(+6) is a negative indicator.

Robustness Test.
To check the stability of each proxy indicator in the investor sentiment composite index, we divide the whole study period into two "bull market" periods (2008.11.7-2010.11.5 and 2012.12.7-2014.5.30) and two "bear market" periods (2008.1.4-2008.11.7 and 2010.11.5-2012.12.7). We then construct the investor sentiment index separately in the two market states and observe whether the coefficients and signs of the proxy indicators change significantly relative to the previous section. Note that although the sample period is divided into "bull market" and "bear market" periods, over the whole sample period, from January 1, 2008, to May 30, 2014, the market never exceeded its previous highs, so the entire sample period is still regarded as a bear market. The conditions for the robustness test are therefore relaxed: as long as the factor structure of the sentiment composite index in the "bear market" period does not differ significantly from that of the full-sample index, the investor sentiment composite index constructed by this method can be considered robust. Otherwise, it is not robust: it changes with market conditions, which affects the validity and accuracy of the empirical results.
In the bull market and bear market periods, the partial least squares method is used separately to extract the investor sentiment information from the original proxy indicators, and the information is then synthesized into the investor sentiment composite index. We again use the five indicators sLPM(0), sCAI(0), sHPBI(0), sHPEI(0), and sNIPO(+6) and the cross-validation results to determine the number of components in the model. We select the first two components in the "bull market" period (the cumulative contribution rate for the investor sentiment proxy variables is 93.90%, and that for the closing price of CSI Free Float is 97.34%) and the first two components in the "bear market" period (the cumulative contribution rate for the investor sentiment proxy variables is 96.73%, and that for the closing price of CSI Free Float is 98.86%), and then construct the investor sentiment composite index as follows:

$$\mathrm{SENT}^{\mathrm{PLS}}_{\mathrm{bull}} = 0.3320 \times \mathrm{sLPM} + 0.2319 \times \mathrm{sCAI} + \cdots \tag{10}$$

Combining the statistical results of Table 4, we compare the composite investor sentiment indexes for three periods: the "bull market" period (10), the "bear market" period (11), and the whole sample period (9). We find that (10) and (11) differ little from (9) in the size and sign of the factor composition of the composite index. This indicates that changes in market conditions do not materially affect how the original proxy variables enter the investor sentiment index. In other words, the composite investor sentiment indexes constructed in the "bull market" and "bear market" periods are robust and differ little from the full-sample factor composition.

Table 3: Correlation coefficients among the sentiment proxy indicators.

            sLPM(0)     sCAI(0)     sHPBI(0)    sHPEI(0)    sNIPO(+6)
sLPM(0)     1.0000
sCAI(0)     0.8320***   1.0000
sHPBI(0)    0.8927***   0.8694***   1.0000
sHPEI(0)    0.8786***   0.8922**    0.9144***   1.0000
sNIPO(+6)   0.4894***   0.4351***   0.4133***   0.5790***   1.0000

Notes. In the full table, the first row gives the factor composition of the five proxy indicators in the sentiment composite index, the second row gives the correlation coefficient between the composite sentiment index and each proxy indicator, and rows 3-7 give the correlation coefficients among the proxy indicators. ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively.
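The subsample comparison can be sketched in Python as follows. The regime dates are the ones given in the text; everything else is an assumption made for illustration: the helper names `regime_of` and `signs_agree` are hypothetical, a boundary week that belongs to two adjacent periods is assigned to the earlier one, and the weight dictionaries are made-up numbers, not the paper's estimates.

```python
from datetime import date

# Market regimes as listed in the text (start and end dates inclusive).
REGIMES = [
    ("bear", date(2008, 1, 4), date(2008, 11, 7)),
    ("bull", date(2008, 11, 7), date(2010, 11, 5)),
    ("bear", date(2010, 11, 5), date(2012, 12, 7)),
    ("bull", date(2012, 12, 7), date(2014, 5, 30)),
]

def regime_of(day):
    """Return 'bull' or 'bear' for a date in the sample, or None outside it.

    Assumption: a boundary date shared by two periods goes to the earlier one.
    """
    for label, start, end in REGIMES:
        if start <= day <= end:
            return label
    return None

def signs_agree(full_sample, subsample):
    """True if every proxy weight has the same sign in both estimates."""
    return all((full_sample[name] > 0) == (subsample[name] > 0) for name in full_sample)

# Hypothetical weights, for illustration only.
full_weights = {"sLPM": 0.30, "sCAI": 0.25, "sNIPO": -0.05}
bear_weights = {"sLPM": 0.28, "sCAI": 0.27, "sNIPO": -0.03}
robust = signs_agree(full_weights, bear_weights)
largest_gap = max(abs(full_weights[k] - bear_weights[k]) for k in full_weights)
```

A sign check of this kind, together with the largest absolute gap between coefficients, gives a simple way to summarize whether the factor composition is stable across market states.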

Explanatory Power for the Closing Price of CSI Free Float.
In general, the more optimistic investor sentiment is, the higher the closing price of CSI Free Float will be, and vice versa. In other words, in theory the level of investor sentiment moves in line with market fluctuations. We take the test-set sample (June 6, 2014, to May 29, 2015), apply the same pretreatment as for the training set, and examine how well the PLS-based composite investor sentiment index explains the closing price of CSI Free Float.
First, we plot the time series of the investor sentiment composite index against the closing price of CSI Free Float (Figure 2). Judging from this comparison chart, the PLS-based index explains the closing price of CSI Free Float relatively well. To make the conclusion more convincing, we treat the investor sentiment composite index as the predictor variable and the closing price of CSI Free Float as the response variable, fit a linear regression, and use the $R^2$ of the regression to represent the explanatory power of the index for the closing price of CSI Free Float; we also use the Akaike information criterion (AIC) to select the optimal sentiment index. The result is as follows: the $R^2$ of the regression of the closing price of CSI Free Float on $\mathrm{SENT}^{\mathrm{PLS}}$ is 0.9964, and the AIC is −274.74. The fit is thus very good, and the PLS-based investor sentiment index has a strong ability to explain the stock market index.
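The two goodness-of-fit measures can be computed as in the following Python sketch. It assumes a simple regression with one predictor and uses the Gaussian form of the AIC up to an additive constant, n ln(RSS/n) + 2k with k = 2; the paper does not state which AIC convention it uses, so values from this sketch need not be comparable with the reported −274.74. The function name and the data are hypothetical.

```python
import math

def linear_fit_stats(x, y):
    """Simple OLS of y on x: return (slope, intercept, R^2, AIC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    tss = sum((b - mean_y) ** 2 for b in y)
    r2 = 1.0 - rss / tss
    # Assumed convention: Gaussian AIC up to a constant, two estimated parameters.
    aic = n * math.log(rss / n) + 2 * 2
    return slope, intercept, r2, aic

# Made-up sentiment index values and closing prices (arbitrary units).
toy_sent = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_close = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r2, aic = linear_fit_stats(toy_sent, toy_close)
```

When several candidate indexes are fitted to the same response, a higher $R^2$ and a lower AIC under the same convention would favor one index over another.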

Conclusion
Investor sentiment measurement has long been one of the challenging problems in behavioral finance. Although principal component analysis (PCA) extracts as much nonredundant information about the variables as possible, it has drawbacks: the principal component synthesized from the proxy indicators may still contain a large amount of bias information unrelated to real investor sentiment, which reduces the accuracy of the model. To address this defect, this paper uses partial least squares (PLS) to rebuild the investor sentiment composite index for the Chinese stock market and analyzes its robustness and its explanatory power for the closing price of CSI Free Float. The results show that the PLS-based investor sentiment composite index agrees better with actual conditions and has strong predictive power for the stock market.