Multifactor Stock Selection Strategy Based on Machine Learning: Evidence from China

Machine learning methods have been used in multifactor stock selection strategies for years. This paper compares three machine learning methods with linear regression to find the most appropriate approach. First, a framework is established, and 10 style factors and 30 industry factors are chosen. Second, the four methods are used to forecast portfolio returns and are compared by predicted return, success rate, and Sharpe ratio. Finally, conclusions are drawn. The main findings are as follows: support vector regression has the most stable prediction success rate, while ridge regression and linear regression have the most unstable success rates, with more extreme cases; support vector regression, which can fit higher-degree polynomial relationships, improves on traditional linear regression in the Chinese A-share market in terms of both stock return and retracement control; and the support vector regression portfolio significantly outperforms the CSI 500 index, which further supports these results.


Introduction
Quantitative trading in securities markets usually adopts a CTA (commodity trading advisor) strategy, an intraday high-frequency strategy, or a multifactor quantitative strategy. Multifactor models are widely used in the stock market, including the Fama-French three-factor asset pricing model [1], the Carhart four-factor model [2], and the further improved five-factor and six-factor models. Scholars have found hundreds of market anomalies that might provide excess returns and have created a "factor zoo". Bridgewater Associates, Renaissance Technologies, and AQR Capital Management, the top hedge funds by assets in the world, trade in global financial markets and achieve exceptional returns for their investors by strictly adhering to quantitative strategies. The vast majority of nonquantitative stock funds also use multifactor models, to a certain extent, to analyze and allocate their securities positions. Traditional factor strategies usually forecast stock returns by scoring factor exposures, and the linear regression methods commonly used are time series regression, cross-sectional regression, Fama and MacBeth [3] regression, and Hansen GMM regression [4]. However, the relationship between factor values and the returns of individual stocks in the actual stock market is often nonlinear, so linear regression fits poorly in many cases. In addition, Green et al. [5] and Hou et al. [6] show in out-of-sample tests that most factors cannot consistently provide excess returns. One reason for the disappearance of excess returns is the increasing convergence of prediction and trading models based on traditional methods in the securities market, which causes those models to fail. With the development of artificial intelligence technology, data mining, machine learning, and other technical methods have been applied to economics and management research, as Mullainathan and Spiess [7] and Kleinberg et al. [8] show.
Major financial institutions have also adopted new technologies and methods to improve quantitative trading strategies in securities market transactions. Introducing machine learning methods improves financial analysis by expanding the empirical research paradigm from linear to nonlinear models and from a focus on parameter significance to a focus on model structure and dynamic features. Appropriate and robust models are built to capture the effective characteristics of financial data and to interpret their economic meaning, with the aim of improving prediction accuracy.
Liu et al. [9] use a support vector machine (SVM) to forecast a stock price index and find that the support vector method can accurately reflect the variation trend and improve prediction accuracy. On the basis of the multifactor stock selection model, Wang et al. [10] verify the predictive performance of the random forest algorithm in China's stock market by using it to predict the rise and fall of stocks and by analyzing the returns of the selected stocks. Xie et al. [11] use LASSO regression and elastic net to screen factors and determine their weights and find that the factors screened by this method can obtain excess returns. Gu et al. [12] test the performance of machine learning algorithms in the US market and find that machine learning models can effectively outperform traditional linear regression models. Wang and Li [13] use the gcForest algorithm to classify individual stocks and predict the probability of their rise and fall. They build an investment portfolio, and a back test shows that the portfolio could achieve significant excess returns. They also compare the back-test results with those of SVM and the random forest algorithm and find, from a comprehensive analysis of various technical indicators, that the gcForest algorithm has obvious advantages over the other algorithms in both stable and rising periods of the stock market.
Although machine learning methods have been used for return forecasts in securities markets in recent years, it remains an open question which method is the best or most appropriate for an emerging stock market. Securities markets in developing countries are more volatile and have their own features. Based on the Chinese stock market, this study establishes a forecasting framework to predict the relationship between anomaly factors and excess returns with different methods, conducts a systematic test, and evaluates which method is best. Therefore, this study puts forward three research questions. (1) Is the machine learning model superior to the traditional predictive model? To address the first question, a traditional linear regression model and three machine learning models are selected in this study. Rapach et al. [14] show that traditional linear regression has been used in financial forecasting and has achieved good results. (2) Does a nonlinear form of the prediction model f(·) perform better than a linear form? To address the second question, the two linear models, traditional linear regression and ridge regression, are compared with the random forest and support vector machine models. Ridge regression is chosen because it can handle the sparse-model problem, as discussed by Hastie et al. [15], and the random forest and support vector machine algorithms are chosen because both are core algorithms in machine learning theory and have achieved good results in many tasks, as shown by Fernández-Delgado et al. [16]. (3) If the predictive model f(·) adopts a machine learning method, which of the three machine learning models performs best, and why?

Factor Variable Selection and Model Selection
The task of multifactor model forecasting is a standard supervised learning regression task, that is, to explore the following functional form:

$$y_{tj} = f(X_{tj}) + \varepsilon_{tj}, \qquad (1)$$

where $y_{tj}$ is the return rate of stock j at time t and $X_{tj}$ is the factor, that is, the explanatory variable selected by the researcher in advance that has an influence on that return rate. The function f(·) can take any form and represents all the possible ways in which x can act on y that the researcher can imagine. The residual term ε represents other possible influences beyond the researcher's control. Compared with the traditional factor regression model, formula (1) does not require the number of variables in X to be smaller than the number of samples and allows x to act on y in almost any form. Under specific sample conditions within the traditional econometric framework, researchers can estimate f(·) with a reduced-form regression model or with nonparametric methods. In the reduced-form linear regression model, however, the explanatory variable $x_j$ is easily correlated with the residual term $\varepsilon_j$. In addition, when the sample size is limited and x contains many variables, traditional nonparametric estimation struggles to overcome technical obstacles, and the problem of how to select variables has not been solved, as shown by Henderson et al. [17].

Factors Selection.
The essence of the multifactor model is to build an optimal asset portfolio through factor selection. Therefore, as many factors as possible should be selected to explain stock returns, so as to minimize the residual of the regression model, which represents the part of the return that the factors cannot explain. Individual stocks are selected on the basis of the returns predicted by the forecasting model. The characteristics of the samples, that is, the independent variables in the model, are determined by researchers on the basis of their market experience. At present, the common practice in the investment industry is to divide factors into industry factors and style factors. An industry factor is a dummy variable: if a stock belongs to a certain industry, the corresponding factor value is 1, and its factor values for all other industries are 0. Style factors are selected according to investors' study and understanding of the market. The number of style factors discovered by a quantitative institution and its ability to explain the alpha of stocks reflect the institution's research competence, and different institutions may choose different style factors.
Based on the situation of China's A-share stock market, twelve primary factors from four categories are selected: the valuation factors, which include the price-to-earnings ratio, the price-to-book ratio, and total market capitalization; the financial factors, which include the price-to-cash-flow ratio and the price-to-sales ratio; the momentum factors, which include turnover rate, turnover, yield, the length of the candlestick body (the "cylinder"), and the closing price; and the technical factors, which include the length of the upper wick and the length of the lower wick. The technical factors are an addition to the usual set: few studies have paid attention to wick length, yet it reflects trading sentiment, which has a great effect on stock prices, especially in emerging securities markets.
The primary factors are back-tested with the stratified method, and the results, sorted in descending order, are listed in Table 1. In each weekly cross-sectional period, group a buys the 20% of stocks with the largest factor values, while group e buys the 20% of stocks with the smallest factor values. Positions are adjusted weekly. In addition, stocks with an absolute weekly return of more than 15%, which account for less than 2% of all stocks, are removed in order to eliminate the impact of stocks that hit the daily upper or lower price limit on consecutive days.
The long-short portfolios of the stratified back test show that the net value of four factors, namely, the length of the upper wick, the length of the lower wick, the price-to-cash-flow ratio, and the price-to-sales ratio, is low, indicating that these factors are not strongly correlated with stock returns. Considering both the net value curve of the long-short portfolio and the stratified back test, the two financial factors, the price-to-cash-flow ratio and the price-to-sales ratio, are filtered out, and ten style factors are retained. Combined with 30 industry factors, 40 factors in total are selected for the return forecasting model, as shown in Table 2.
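The industry dummy encoding described above can be sketched in a few lines of Python. The tickers and industry names below are hypothetical placeholders, and the helper `industry_dummies` is an illustrative name rather than anything taken from the paper.

```python
def industry_dummies(stock_industry, industries):
    """Return {stock: [0/1, ...]} with one column per industry."""
    index = {name: k for k, name in enumerate(industries)}
    rows = {}
    for stock, industry in stock_industry.items():
        row = [0] * len(industries)
        row[index[industry]] = 1  # 1 for the stock's own industry, 0 elsewhere
        rows[stock] = row
    return rows


# Hypothetical tickers and industry labels, for illustration only.
industries = ["bank", "steel", "media"]
stock_industry = {"AAA": "steel", "BBB": "bank", "CCC": "media"}
dummies = industry_dummies(stock_industry, industries)
```

With 30 industries, each stock would carry 30 such columns, exactly one of which equals 1.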
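As a rough illustration of the stratified back test and the long-short comparison for a single cross-sectional period, the sketch below sorts stocks by factor value, splits them into five groups (a to e), and compares group a with group e. The function names, the synthetic data, and the equal weighting within each group are assumptions made for illustration; the 15% return filter follows the description in the text.

```python
import numpy as np


def stratified_returns(factor, next_return, n_groups=5, max_abs_return=0.15):
    """Mean next-period return per factor-sorted group (index 0 = largest factor values)."""
    factor = np.asarray(factor, dtype=float)
    next_return = np.asarray(next_return, dtype=float)
    keep = np.abs(next_return) <= max_abs_return   # drop extreme weekly returns
    factor, next_return = factor[keep], next_return[keep]
    order = np.argsort(-factor)                    # descending factor value
    groups = np.array_split(order, n_groups)       # groups a, b, c, d, e
    return np.array([next_return[g].mean() for g in groups])


def long_short(group_means):
    """Return of holding group a long and group e short for one period."""
    return group_means[0] - group_means[-1]


# Synthetic cross-section: returns rise with the factor by construction.
factor = np.arange(10, dtype=float)
next_return = 0.01 * factor
means = stratified_returns(factor, next_return)
spread = long_short(means)
```

A factor whose long-short spread stays close to zero across periods produces a flat net value curve, which is the pattern used above to filter out weak factors.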

Model Selection.
Machine learning is a collection of many forms of the predictive function f(·) and of the algorithms used to estimate them. Because stock return prediction is a supervised regression task, in theory any machine learning algorithm suited to regression can be used to build a stock return prediction model. In this study, three machine learning regression algorithms (ridge regression, random forest regression, and support vector regression) are used to predict the returns of individual stocks. Based on the predicted returns, an investment portfolio is constructed and back-tested, and the efficiency of each machine learning algorithm is analyzed.

Ridge Regression Model.
In the traditional linear regression model, the parameter estimates are generally obtained by minimizing the loss function

$$\mathrm{LOSS}(\beta) = (Y - X\beta)'(Y - X\beta), \qquad (2)$$

where LOSS is the loss function, X and Y are the data matrix and the outcome variable, respectively, and β is the regression coefficient vector. In contrast to the OLS estimate, Hoerl and Kennard [18] propose adding a constant λ to the principal diagonal of the matrix $X'X$ to ensure that $(X'X + \lambda I)$ is invertible and to alleviate the multicollinearity problem. In order to obtain a unique solution for the parameter vector β, its range is restricted by regularization: a penalty term is introduced into the loss function, giving the penalized regression

$$\mathrm{LOSS}(\beta) = (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta, \qquad (3)$$

where the first term of (3) is the sum of squared residuals. The second term is the penalty term, and λ is the tuning parameter that controls the penalty intensity. Minimizing this loss function gives the optimal solution of the ridge regression model, $\hat{\beta}^{\mathrm{ridge}}(\lambda) = (X'X + \lambda I)^{-1} X'Y$. The choice of λ determines the degree to which the regression coefficients are compressed, and different values of λ generate different results. A common method for choosing λ in machine learning, discussed by McNeish [19], is K-fold cross-validation. The cross-validation error is

$$\mathrm{CV}(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k(\lambda),$$

where $\mathrm{MSE}_k(\lambda)$ is the mean squared prediction error on the k-th held-out fold. The cross-validation error CV(λ) is a function of λ, and λ is optimal when CV(λ) is smallest.
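The ridge estimate and the choice of λ by K-fold cross-validation can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the λ grid, the number of folds, the unshuffled contiguous folds, and the function names `ridge_fit` and `cv_error` are all assumptions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam * I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_error(X, y, lam, k=5):
    """Average held-out mean squared error over k contiguous folds."""
    all_idx = np.arange(len(y))
    errors = []
    for test_idx in np.array_split(all_idx, k):
        train_idx = np.setdiff1d(all_idx, test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))


# Synthetic data: four features, one of which has no effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=60)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]          # illustrative candidate values
best_lam = min(grid, key=lambda lam: cv_error(X, y, lam))
```

Setting `lam=0` recovers the OLS estimate when X'X is invertible, and larger values of `lam` shrink the coefficient vector toward zero.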

Random Forest.
Random forest is a combined prediction model, a variant of the Bagging algorithm in the family of ensemble algorithms. It is a tree-based ensemble learning model proposed by Breiman [20] and is widely used for classification and regression. This paper uses the Bagging-based random forest algorithm, which uses bootstrapping to draw n random training samples, with replacement, from an initial dataset of n samples. The probability that a given sample is selected in one draw is 1/n, so the probability that it is not selected in any of the n draws is $(1 - 1/n)^n$, and $\lim_{n \to \infty} (1 - 1/n)^n = 1/e \approx 0.368$. The roughly 36.8% of the dataset that does not participate in training a given learner forms its out-of-bag sample, which can be used to evaluate the out-of-bag error. Let $D_t$ denote the training sample set actually used by base learner $h_t$, and let $H_{oob}(x)$ denote the out-of-bag prediction for sample x, which combines only those base learners whose training sets $D_t$ do not contain x. The out-of-bag estimate of the generalization error of the Bagging algorithm is the average error of $H_{oob}$ over the samples of the initial dataset. Cawley et al. [21] use such estimates as criteria for model selection and pruning in order to reduce the risk of overfitting. The random forest method is similar to the Bagging method in that both rely on the initial data and use the bootstrap method to build the training sets. Random forest additionally introduces random attribute selection into the training of each decision tree. In other words, at each node of each decision tree, a subset containing j attributes is randomly selected from the attribute set of that node, and the optimal attribute for partitioning is then chosen from this subset. The more base learners there are, the better the random forest learns. In this study, the weighted mean is adopted as the combination strategy of the random forest for regression, which can be expressed as

$$H(x) = \sum_{t=1}^{T} w_t h_t(x), \qquad w_t \ge 0, \quad \sum_{t=1}^{T} w_t = 1,$$

where T is the number of base learners and $w_t$ is the weight of base learner $h_t$.
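The 36.8% out-of-bag share can be checked numerically. The sketch below compares the probability that a given sample is never drawn in n draws with replacement, (1 − 1/n)^n, with the share of samples actually left out of one simulated bootstrap sample; the sample size and random seed are arbitrary choices for illustration.

```python
import math
import random


def oob_fraction_theory(n):
    """Probability that a given sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def oob_fraction_simulated(n, seed=0):
    """Share of samples left out of one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}   # distinct indices that were drawn
    return 1.0 - len(drawn) / n


theory = oob_fraction_theory(10_000)
simulated = oob_fraction_simulated(10_000)
limit = math.exp(-1)                               # 1/e, about 0.368
```

Both values approach 1/e as n grows, which is why roughly a third of the data is available as an out-of-bag validation set for each tree.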

Support Vector Machines.
The support vector machine algorithm, first proposed by Vapnik [22], maximizes the margin between training samples of different categories in the sample space so as to achieve optimal classification. For the nonlinear samples used in this study, the feature space can be mapped into a higher-dimensional space by a mapping function, in which all samples can be correctly classified. The partition of the sample space by a hyperplane can be expressed by the following linear equation:

$$\omega^{T} x + b = 0,$$

where ω is the normal vector that determines the direction of the hyperplane, and b is the displacement term that determines the distance between the hyperplane and the origin. Thus, the distance from any point x in the sample space to the hyperplane (ω, b) is

$$r = \frac{|\omega^{T} x + b|}{\lVert \omega \rVert}.$$

Assuming that the hyperplane classifies every point $(x_i, y_i) \in D$ correctly, the sample space points are classified as

$$\omega^{T} x_i + b \ge +1 \ \text{ if } y_i = +1, \qquad \omega^{T} x_i + b \le -1 \ \text{ if } y_i = -1. \qquad (10)$$

It can be seen from the above formula that, in order to find the partition hyperplane with the maximum margin, it is necessary to find the parameters ω and b that satisfy the constraints in (10) and maximize the sum of the distances from the two heterogeneous support vectors to the hyperplane. The resulting constrained problem is

$$\max_{\omega, b} \ \frac{2}{\lVert \omega \rVert} \quad \text{s.t.} \quad y_i(\omega^{T} x_i + b) \ge 1, \quad i = 1, \ldots, m,$$

where m is the number of training samples. For nonlinear classifiers, the support vector machine offers several kernel functions to realize the hyperplane partition, including the polynomial kernel, the Gaussian radial basis kernel, the Laplacian kernel, and the sigmoid kernel. These nonlinear kernel functions transform the original feature space into a higher-dimensional feature space in which the samples are separated by a hyperplane. In this paper, the Gaussian radial basis kernel is used.
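Two of the quantities in this subsection are easy to compute directly: the distance |ω'x + b| / ‖ω‖ from a point to the hyperplane ω'x + b = 0, and the Gaussian radial basis kernel. The sketch below writes the kernel as exp(−γ‖x₁ − x₂‖²); this γ parameterization is an assumption that follows common library conventions, and the default of 0.5 mirrors the gamma value reported in the model training section.

```python
import numpy as np


def distance_to_hyperplane(x, w, b):
    """Distance |w'x + b| / ||w|| from point x to the hyperplane w'x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)


def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian radial basis kernel exp(-gamma * ||x1 - x2||^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


w = np.array([3.0, 4.0])
d = distance_to_hyperplane(np.array([1.0, 1.0]), w, b=-2.0)   # |3 + 4 - 2| / 5

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # points 5 units apart
```

The kernel equals 1 for identical points and decays toward 0 as points move apart, so γ controls how local the influence of each training sample is.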

Data Preprocessing and Training Model
Predicting the returns of individual stocks is the most important part of the multifactor stock selection strategy, and the alpha of the strategy usually comes from the stocks selected. The factor values of individual stocks are taken as the features of the data (independent variables), and the return rates of individual stocks in the next period are taken as the labels (dependent variable). The data from periods t-24 to t-1 are used as the model training set, and the factor data of individual stocks in period t are then used to predict the return rate of period t+1. The portfolio is rebalanced weekly in this paper, so the corresponding training set covers the 24 weeks preceding the forecast day.
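The rolling scheme can be sketched as a generator of training and prediction periods. The period indices are plain integers here, and the function name `rolling_windows` is an illustrative assumption; fitting and predicting with an actual model is left out.

```python
def rolling_windows(n_periods, window=24):
    """Yield (train_periods, predict_period) pairs for a rolling forecast."""
    for t in range(window, n_periods):
        train = list(range(t - window, t))   # periods t-24 ... t-1
        yield train, t                       # factors at t predict the return of t+1


windows = list(rolling_windows(n_periods=30))
first_train, first_t = windows[0]
```

Each step drops the oldest week and adds the newest one, so the model is refitted every week on the latest 24 weeks only.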

Data Preprocessing.
Because the factors are measured on different scales, they must be standardized before they can be compared and used in regression. Before standardization, the extreme data are excluded first, so that a few extreme factor values do not distort the estimated relationship between the factors and the rate of return. Figure 1 compares the probability density of the market value factor before and after the de-extreme operation. It can be seen that the de-extreme method effectively reduces the impact of extreme values on the prediction results.
After the market value factor data are de-extremed, their distributions before and after standardization are compared in Figure 2. Figure 2(a) shows the data distribution before standardization, and Figure 2(b) shows the data distribution after standardization. It can be seen that standardization adjusts the scale of the data.
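A minimal sketch of the two preprocessing steps is given below. The paper does not state which de-extreme rule it uses, so clipping at the median plus or minus three scaled median absolute deviations (MAD) is purely an assumption for illustration, as are the function names; standardization is the usual z-score.

```python
import numpy as np


def de_extreme(values, n_mad=3.0):
    """Clip values to median +/- n_mad * scaled MAD (assumed rule, not from the paper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - median))   # scaled to match a normal std
    return np.clip(values, median - n_mad * mad, median + n_mad * mad)


def standardize(values):
    """Z-score: subtract the cross-sectional mean and divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])   # one extreme market value
trimmed = de_extreme(raw)
clean = standardize(trimmed)
```

Running the de-extreme step first matters because a single outlier such as the last value would otherwise dominate both the mean and the standard deviation used in the z-score.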

Model Training.
After the extremes are removed and all factor loading data are standardized, four algorithms, namely, linear regression, ridge regression, random forest regression, and support vector regression, are each used to predict the returns of individual stocks. In the ridge regression algorithm, the penalty parameter alpha is set to 90. In the random forest regression, 500 trees are used, with the regression tree as the base learner. In the support vector machine algorithm, the radial basis function kernel is used, the kernel parameter gamma is set to 0.5, and the penalty parameter is set to 100.
There are 40 sample features in the model, comprising the 10 style factors and 30 industry factors. The label of each sample is the return rate of the individual stock in the next cross-sectional period. The sample features and labels of the 24 weeks before each prediction are selected as the training set, and rolling prediction is carried out. Finally, the forecast values of the weekly returns of all stocks in the securities market from July 9, 2010, to November [...] (12) Figure 3 shows the MSE statistics of the four algorithms in each back-test cross-sectional period.
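One way to set up the four models with the hyperparameters listed above is sketched below using scikit-learn. The choice of library is an assumption, as is the mapping of the "penalty parameter" of the support vector model to scikit-learn's `C`; all other settings are left at library defaults, which the source does not specify. The data are synthetic and stand in for standardized factor values and labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR

# Hyperparameters as reported in the text; everything else is a library default.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=90),
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "svr": SVR(kernel="rbf", gamma=0.5, C=100),
}

# Synthetic stand-in for one training window and one prediction cross-section.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)
X_next = rng.normal(size=(10, 5))

predictions = {
    name: model.fit(X_train, y_train).predict(X_next)
    for name, model in models.items()
}
```

Note that the default `epsilon` of `SVR` is 0.1, so labels on the scale of raw weekly returns would mostly fall inside the insensitive tube; whether and how the labels were scaled is not stated in the source.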

Prediction Results and Analysis
It can be seen that the MSE values of the four algorithms are very close, indicating that, across all stocks in the securities market, the four algorithms have similar generalization ability. Compared with the turnover trend of the Shanghai and Shenzhen stock exchanges in Figure 4, the deviation of the model forecasts grows as market turnover increases. This characteristic is in line with the actual situation of the Chinese A-share market: when transaction volume expands, market sentiment is often heated and retail investors enter the market in a concentrated way, which corresponds to the two peaks in transaction volume in Figure 4, in 2015 and early 2019. At these times, the number of irrational investors in the market increases and market efficiency decreases, which is reflected in the price deviation of individual stocks. In this case, historical data usually cannot accurately predict the future, so the prediction deviation of the corresponding model increases.

Success Rate of Forecast.
In the multifactor stock selection model, the deviation between the forecast return and the actual return often cannot fully determine the merits of a strategy. The deviation can be of two kinds: the actual return of the selected stock is higher than the forecast return, or the actual return of the selected stock is lower than the forecast return. Obviously, the first kind of deviation is favorable for investors, while the second is an adverse result that should be avoided as far as possible. Therefore, other indicators, such as the prediction success rate, are used to help evaluate the model. The success rate of a forecast is the probability that the actual return of a stock that the model predicts will rise is positive, that is, the accuracy of the model in predicting the rise of stocks. In many cases, the exact magnitude of the prediction matters less than its direction. For example, if the model predicts a 3% return on a stock, an actual rise of 2% or 4% is acceptable, but if the actual return turns out to be negative, the forecast causes a loss on the investment. Therefore, the success rate is also an important index for model evaluation. Figure 5 shows the probability density distributions of the prediction success rate of the four algorithms during the back test.
Except for SVR, the distributions of the other three algorithms all have two peaks. Linear regression and ridge regression have the most unstable success rates, with the two peaks close to 0 and 1, respectively, and the peak near 0 is higher. Random forest is slightly better than these two algorithms, but its success rate still takes many extreme values. Therefore, from the perspective of prediction success rate, SVR is the most stable, followed by random forest regression, while ridge regression and traditional linear regression are very unstable and have many extreme values, as shown in Table 3. Figure 6 shows the net value curves of the investment portfolios constructed from the return forecasts of the four algorithms. In the legend, "Liner regression" denotes linear regression, "random forest" denotes random forest regression, "ridge" denotes ridge regression, "SVR" denotes support vector regression, and "benchmark" denotes the CSI 500 index. The back-test results show the following.
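The success rate for one cross-sectional period can be sketched as below: among the stocks whose predicted return is positive, it is the share whose actual return is also positive. The function name and the example numbers are illustrative only.

```python
def success_rate(predicted, actual):
    """Share of stocks with a positive predicted return whose actual return is also positive."""
    picked = [a for p, a in zip(predicted, actual) if p > 0]
    if not picked:
        return None   # the model predicted no rising stocks in this period
    return sum(1 for a in picked if a > 0) / len(picked)


# Illustrative numbers: three stocks are predicted to rise, two of them do.
predicted = [0.03, 0.01, -0.02, 0.04]
actual = [0.02, -0.01, 0.05, 0.04]
rate = success_rate(predicted, actual)
```

Computing this rate for every back-test period gives the sample whose distribution is then examined across algorithms.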

SVR Is Superior to the Traditional Linear Regression Algorithm.
Compared with traditional linear regression, the portfolio constructed by SVR is significantly improved in terms of both return rate and retracement control. Traditional linear regression is not suitable for high-dimensional data: when the number of independent variables is greater than the sample size, the rank of the matrix X is less than the number of its columns, so X'X is not of full rank and a unique solution cannot be obtained. In addition, even when the data are not high-dimensional, approximate (incomplete) multicollinearity, that is, high correlation between feature variables, often appears in the traditional linear regression model. Under multicollinearity, the matrix X'X becomes almost noninvertible, which inflates the variance of the OLS estimates and understates their significance. The back-test results show that SVR predicts stock returns better than linear regression; the algorithm can fit relationships with the characteristics of higher-degree polynomials and is more suitable for the stock market. For example, the stratified back test in the primary factor selection part shows that the style factors that work best in the model, such as market value and turnover rate, have fluctuated to a certain degree since 2018. In 2017, China's stock market saw a record number of IPOs, the regulator cracked down on speculation around large increases in the number of common shares and other themes, and market style shifted substantially; for example, the small-cap effect changed significantly. Machine learning algorithms adapt relatively well to this kind of nonlinear behavior. Measured against the CSI 500 small-cap index of the Chinese stock market, the calculated Sharpe ratio [23] is 0.27, which means that, at the same level of risk, the portfolio earns more than the CSI 500.
However, the SVR portfolio has suffered a large retracement since 2018, which may be caused by the increasing use of machine learning algorithms by quantitative institutions in China's A-share market. That is, the growth of funds in the market that predict returns with the SVR algorithm reduces the alpha of the algorithm itself.
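The two evaluation quantities used in this comparison, retracement and the Sharpe ratio, can be sketched from a net value curve and a return series. The source does not state how its Sharpe ratio was computed, so measuring excess return against a benchmark series, annualizing with 52 weekly periods, and using the sample standard deviation are all assumptions; the numbers are illustrative.

```python
import numpy as np


def max_drawdown(net_value):
    """Largest peak-to-trough decline of a net value curve, as a positive fraction."""
    net_value = np.asarray(net_value, dtype=float)
    running_peak = np.maximum.accumulate(net_value)
    return float(np.max(1.0 - net_value / running_peak))


def sharpe_ratio(returns, benchmark_returns, periods_per_year=52):
    """Annualized mean excess return over the benchmark divided by its volatility."""
    excess = np.asarray(returns, dtype=float) - np.asarray(benchmark_returns, dtype=float)
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))


net_value = [1.0, 1.2, 0.9, 1.1, 1.5]
mdd = max_drawdown(net_value)                        # fall from 1.2 to 0.9

weekly = [0.02, 0.01, 0.03]                          # illustrative portfolio returns
bench = [0.01, 0.01, 0.01]                           # illustrative benchmark returns
sr = sharpe_ratio(weekly, bench)
```

A strategy with a high return but a deep maximum drawdown can still be hard to hold in practice, which is why return and retracement are assessed together.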

The Results of Ridge Regression and Linear Regression Are Similar.
The back-test results show that the trend of ridge regression is very close to that of linear regression, which is determined by the two algorithms themselves. Ridge regression only adds a penalty term to linear regression; in this study, the penalty term is rather big, which leads to similar results. Although the ridge regression and linear regression algorithms had relatively high returns before 2018, they began to plunge after 2018, which reflects the unstable distribution of the prediction success rate of linear regression. Analysis of the forecast results shows that the success rate distributions of ridge regression and linear regression are extreme, and Xu et al. [24] find that stocks with high return asymmetry exhibit low expected returns. In the market, an unstable forecast success rate has a negative impact on net value. The advantage of ridge regression over the classical linear regression model lies in its tradeoff between bias and variance. As λ increases, the flexibility of the ridge regression fit decreases: the variance decreases, but the bias increases. In general, when the relationship between the response variable and the predictors is approximately linear, the least squares estimate has a low bias but may have a large variance, which means that small changes in the training data may lead to large changes in the least squares regression coefficients. When the number of variables is close to the number of observations, the variance of the least squares estimate becomes larger, and when the number of variables is greater than the number of observations, least squares has no unique solution. Ridge regression can still obtain a large decrease in variance for a small increase in bias, and a better fit can be obtained by using this tradeoff.
The back-test results show that the fitted trend of ridge regression is very close to that of ordinary linear regression, which also indicates, from another angle, that the variance of the least squares estimate is not large.

Random Forest Regression Is Insignificant in the Sample.
Random forest normally uses only a subset of the variables at each node of a decision tree. Because different nodes are forced to split on different variables, the correlation between different decision trees is reduced, which reduces the variance. Therefore, in the tradeoff between variance and bias, random forest sacrifices a small amount of bias for a smaller variance, so as to reduce the mean squared error. Since all feature variables are used for splitting in this study, the bias is small, but the correlation between different decision trees is strong, resulting in a large variance. As a result, the portfolio constructed by the random forest regression algorithm does not generate significant excess returns, and this algorithm has no obvious advantage over the traditional linear regression algorithm in constructing the multifactor stock selection model.

Conclusion.
This study examines the application of machine learning algorithms in a multifactor stock selection strategy.
First, according to the results of the stratified back test and the long-short portfolios, the price-to-sales ratio and price-to-cash-flow ratio factors, which are not strongly correlated with stock returns, are removed. The remaining 10 style factors and 30 industry factors constitute the independent variables of the return forecast model.
Next, four algorithms, namely, linear regression and three machine learning regression algorithms, are each used to predict stock returns. The deviation of the forecasts increases when stock market turnover expands, indicating that the predictive power of the models weakens when market sentiment is high and irrational pricing by investors increases. Among the four algorithms, support vector regression has the most stable success rate in predicting stock returns, while the ridge regression and linear regression algorithms have the most unstable success rates, with more extreme cases.
Finally, portfolios are built in which the position weight of each stock is determined by the weighted average of expected returns, and the forecasts of the four algorithms are back-tested. Support vector regression is found to improve significantly on traditional linear regression in terms of both return and retracement control. Its portfolio also significantly outperforms the CSI 500 index, indicating that the support vector regression algorithm is more effective at predicting returns in the multifactor stock selection strategy.
In conclusion, support vector regression can be used to fit higher-degree polynomials in the Chinese A-share market, and its applicability is strong.

Prospect.
A multifactor stock selection model can help investors not only make more efficient and accurate investment decisions but also gain a clearer understanding of the huge and intricate securities market and of price fluctuations, so as to seize trading opportunities. With the advent of the Internet of things, the accelerating development of big data and cloud computing, and the continuous innovation of data mining algorithms, more and more suitable algorithms will be explored and used in the Chinese equity market to gain excess returns.
From the perspective of the data, the nonlinear relationship between the predicted excess return and the anomaly factors may not be very strong, which is why some of the machine learning algorithms used in this study do not predict better than the traditional linear regression model. The results are also limited by the algorithms themselves. Although the ridge regression model can solve the problem of a noninvertible X′X in the linear regression model, the price paid is that the regression coefficients are "compressed" in order to make the model more stable and reliable. Since the penalty term is a quadratic function of the regression coefficient β, its partial derivative always retains the coefficient itself when the objective function is minimized, so the coefficients are shrunk toward zero but are not set exactly to zero. Ridge regression therefore cannot select variables in a true sense. Although the SVM model sometimes performs excellently, its biggest disadvantage is that its computational cost is relatively high when the data scale is large. Therefore, future work can proceed in the following two directions: if the real data do have sparsity problems, LASSO regression can be considered to achieve better results; and if the correlation between decision trees is strong, the prediction accuracy may be further improved by using the AdaBoost algorithm.
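The difference between ridge and LASSO with respect to variable selection can be illustrated in the special case of an orthonormal design (X'X = I), where both estimators have closed forms in terms of b = X'y. This setting and the two function names are assumptions chosen to keep the sketch short; they are not part of the study.

```python
import numpy as np


def ridge_orthonormal(b, lam):
    """Minimizer of ||y - X beta||^2 + lam * ||beta||_2^2 when X'X = I and b = X'y."""
    return np.asarray(b, dtype=float) / (1.0 + lam)


def lasso_orthonormal(b, lam):
    """Minimizer of ||y - X beta||^2 + lam * ||beta||_1 when X'X = I and b = X'y."""
    b = np.asarray(b, dtype=float)
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)   # soft thresholding


b = np.array([3.0, -0.4, 0.1])           # illustrative least squares coefficients
ridge_beta = ridge_orthonormal(b, lam=1.0)
lasso_beta = lasso_orthonormal(b, lam=1.0)
```

Ridge rescales every coefficient by the same factor, so none becomes exactly zero, whereas the soft-thresholding step of LASSO sets small coefficients exactly to zero and thereby drops the corresponding variables.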