A Model for Evolution of Investors Behavior in Stock Market Based on Reinforcement Learning in Network

This paper builds an evolution model of investor behavior based on reinforcement learning in multiplex networks. Because boundedly rational investors differ in how they learn when making investment decisions, we model the evolution mechanisms of individual investors and institutional investors separately, drawing on complex network theory and reinforcement learning theory. We use mathematical analysis and simulation to explain the evolution characteristics of investor behavior. The conclusions are as follows. First, for both institutional and individual investors, the intensity of returns competition among institutional investors and the forgetting effect both affect the equilibrium of behavioral evolution. Second, the network topology affects the behavioral evolution of individual investors significantly more than that of institutional investors.


Introduction
There is significant uncertainty and unpredictability in the stock market due to complicated factors, including investors' bounded rationality and policy correlation effects. Investors tend to trade frequently and blindly follow stock market volatility, which feeds back into the market and exacerbates its instability [1][2][3][4]. Therefore, it is of great significance to study the evolution of investor behavior in the stock market, since doing so helps in grasping the rules by which stock market behavior evolves.
In recent years, reinforcement learning has become a cutting-edge tool in the financial field, driven by the development of big data and the increasing demand for financial data analysis. Many studies have applied reinforcement learning to high-frequency trading, portfolio investment, and other areas. Jangmin et al. [5] proposed a new stock trading method combining dynamic asset allocation and reinforcement learning. Shimokawa et al. [6] introduced nonlinear effects into the ordered TD-type learning model and constructed an augmented TD-type learning model to predict investors' behavior. Bekiros [7] proposed an adaptive fuzzy Actor-Critic reinforcement learning method and built a financial trading system based on it. Tan et al. [8] constructed an arbitrage-free algorithmic trading system with the adaptive network fuzzy inference system (ANFIS) based on reinforcement learning. Bertoluzzo and Corazza [9] applied Q-learning and Kernel-based RL algorithms to automated financial trading. Feuerriegel and Prendinger [10] developed an automatic decision-making system using news-based sentiment data and price momentum, based on supervised and reinforcement learning. Pendharkar and Cusatis [11] constructed a pension investment portfolio based on reinforcement learning. Yang et al. [12] argued that investor sentiment is an important factor affecting market returns and therefore designed a trading system based on investor sentiment rewards using Gaussian inverse reinforcement learning.
Network science theory is important to analyze investor behavior and it is a supplement to traditional statistics and experimental economics [13][14][15]. In recent years, many scholars have studied the investment behavior in the stock market from the perspective of network science. Existing research is divided into three aspects: analysis on investor network topology, empirical description of investment behavior, and research on investor behavior under the interaction mechanism.
Some scholars have analyzed the topological structure of financial networks using actual financial data and found that these networks exhibit disassortative architectures [16], small-world properties [17], power-law degree distributions [18][19][20], and community structure [21,22]. Through the study of network structure, some scholars have further analyzed investment behavior in financial networks. Long et al. [23] and Baumohl et al. [24] studied the volatility spillover effect of the stock market based on the stock network. Pareek [25] constructed a network model of mutual fund interconnection in which holding the same stock serves as the "link" and revealed the inherent formation mechanism of fund herding behavior. Chung et al. [26] constructed an empirical investor network from the account transaction data of investors in the Taiwan Stock Exchange from 2005 to 2014 and analyzed the impact of information dissemination on stock returns.
As the actual financial data is not completely available, methods such as theoretical analysis and simulation have become important choices for many scholars to study the investor behavior in financial network. Wang et al. [27] divided the investors into three types including experts, speculators, and followers, to study how the information quality of experts affects investor behavior in the social network. Bian et al. [28] constructed an evolutionary model of investor behavior in the stock market based on network coordination game and studied the impact of investor behavior on financial market. Bakker et al. [29] and Stefan et al. [30] established an evolution model of investors' investment behavior based on the investor's social network to analyze investor behavior and its impact on the stock market. Xu et al. [31] proposed a weighted game model in investor network and found that the investor behavior evolution in WS small-world network is more stable than that in BA scale-free network. Khashanah and Alsulaiman [32] analyzed the performance of their investment behavior on the scale-free network by taking four types of investors, basic strategy traders, momentum traders, adaptive strategy traders, and zero-intelligence traders, into consideration. Cohen et al. [33] studied the information transmission in the financial market by applying a social network. Krichene and El-Aroui [34] divided investors into informed and uninformed investors, constructed a scale-free social network model, and analyzed the information asymmetry and herd behavior in different stock markets. Wang and Pan [35] built an artificial stock market model based on the heterogeneous information strategies in a dynamic scale-free network and studied the evolution of investors' information strategies.
In recent years, with the development of network science theory, many scholars have moved from a single-layer network perspective to a multi-layer network perspective to study investor behavior in the stock market. Li et al. [36] analyzed the volatility of the stock market and the herd behavior of investors by using a bipartite network model of stocks and traders. Gao [37] explored the impact of margin trading on stock price volatility based on a double-layered network of stocks and investors. D'arcangelis et al. [38] analyzed the investment style of Italian pension funds by using a bipartite network model. Paulin et al. [39] constructed a double-layered network of funds and assets and analyzed the reasons for the stock market crash. Lu et al. [40] studied the "herd effect" caused by the pursuit of portfolio diversity based on a bipartite network of funds and stocks. Biondo et al. [41] built a double-layered financial network model to simulate information dissemination and investor transactions in the market. Wang and Chen [42] constructed a transaction model with heterogeneous information in a double-layered social network by introducing a double-factor interaction function. Souza and Aste [43] established a multi-layer network model by using social media data and financial data to predict the future stock market structure.
In conclusion, many existing studies examine investor behavior in the stock market with reinforcement learning theory, but they are mainly based on a single-layer network model and a single reinforcement learning mechanism, ignoring the heterogeneity of investor network structure and investor learning. Therefore, this paper divides investors in the stock market into individual and institutional investors. Then, we analyze the decision-making mechanisms of individual and institutional investors using complex network theory and reinforcement learning theory. On this basis, we construct an evolution model of investor behavior in a multi-layer network. Finally, we investigate the evolution characteristics of investor behavior through mean-field analysis and simulation.

Evolutionary Mechanism of Investor Behavior.
Due to factors such as information asymmetry, the professionalism of investors, and differences in investment goals [44,45], boundedly rational investors are unlikely to make optimal decisions only by observing current stock market conditions. Experience is the easiest way to judge the feasibility of investment decisions. Before making the next investment decision, an investor usually refers to the returns obtained in the past. Even if the environment has changed, the investor will tend to repeat the strategy that earned better returns in the previous period [46,47]. Therefore, experience-based reinforcement learning directly affects investors' investment strategies. However, there are significant differences between individual and institutional investors. For institutional investors, unlike individual investors, better performance brings more capital inflows, which helps them achieve their goals, so there is intense performance competition among institutional investors [48]. Therefore, this paper divides investors into institutional investors and individual investors and constructs an evolution model of investment behavior under different reinforcement learning strategies in the network, as shown in Figure 1.
Based on the above analyses, the decision-making mechanism plays a crucial role in the evolution of investor behavior. In view of the differences between individual investors and institutional investors in investment decision-making, the learning strategies governing the evolution of investment behavior are constructed separately for institutional investors and individual investors.

Evolutionary Model of Investor Behavior.
Let $G = (N, E)$ denote the stock market, where $N = \{1, 2, \ldots, i, \ldots, n\}$ is the set of investors and $(i, j) \in E$ means that investor $i$ and investor $j$ are connected. $k_i$ is the degree of node $i$, and $p(k)$ is the degree distribution. Investors in the stock market are divided into institutional investors and individual investors; the proportion of institutional investors among all investors is $\eta$ ($0 < \eta < 1$), so the proportion of individual investors is $1 - \eta$. In addition, it is assumed that there are three investment behavior states in the stock market, $a_i \in \{-1, 0, 1\}$, where $a_i = -1$ means "sell," $a_i = 0$ means "hold," and $a_i = 1$ means "buy." The state of all investors in the stock market at any time $t$ can then be expressed as $A_t = (a_{1,t}, a_{2,t}, \ldots, a_{i,t}, \ldots, a_{N,t})$. Let $m_i$ ($0 \le m_i \le k_i$) denote the number of individual investors connected to investor $i$, and let $n_i$ ($0 \le n_i \le k_i$) denote the number of institutional investors connected to investor $i$; then $m_i + n_i = k_i$. Among the institutional investors connected to investor $i$, there are $n_{i,1}$ buyers, $n_{i,-1}$ sellers, and $n_{i,0}$ holders, with $n_{i,1} + n_{i,-1} + n_{i,0} = n_i$. Similarly, among the individual investors connected to investor $i$, there are $m_{i,1}$ buyers, $m_{i,-1}$ sellers, and $m_{i,0}$ holders, with $m_{i,1} + m_{i,-1} + m_{i,0} = m_i$.
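The bookkeeping in this setup can be illustrated with a minimal sketch. The adjacency list, the `is_institutional` flags, and the helper `neighbor_counts` are hypothetical names introduced here for illustration only; they are not part of the model's specification. The sketch counts, for one investor, how many institutional and individual neighbors are in each state and checks that the counts add up to the degree.

```python
from collections import Counter

SELL, HOLD, BUY = -1, 0, 1  # the three behavior states a_i

def neighbor_counts(i, adjacency, is_institutional, action):
    """Return (n, m): per-state counts of institutional and individual neighbors of i."""
    n, m = Counter(), Counter()
    for j in adjacency[i]:
        target = n if is_institutional[j] else m
        target[action[j]] += 1
    return n, m

# Toy market with 5 investors; investors 0 and 1 are institutional (illustrative data).
adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0, 3], 3: [0, 2], 4: [1]}
is_institutional = [True, True, False, False, False]
action = [BUY, SELL, BUY, HOLD, HOLD]

n, m = neighbor_counts(0, adjacency, is_institutional, action)
k_0 = len(adjacency[0])
n_0, m_0 = sum(n.values()), sum(m.values())
print(f"k_0={k_0}, n_0={n_0}, m_0={m_0}")
```

A `Counter` returns 0 for states that no neighbor occupies, so `n[BUY]`, `n[SELL]`, and `n[HOLD]` can be read directly as $n_{i,1}$, $n_{i,-1}$, and $n_{i,0}$.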

Evolution Rules of Individual Investors Behavior
① Individual investors' reinforcement learning strategies: Strahilevitz et al. [49] found that investors tend to repurchase the stocks they previously sold for a gain while ignoring the stocks they previously sold for a loss. Linnainmaa [50] found that household trading intensity depends on past performance. In the stock market, the investment performance of individual investors is largely affected by personal experience. In particular, investors tend to pay more attention to successful experiences, even if such experience cannot be replicated in the future [51,52]. These findings can be explained by reinforcement learning models. Individual investors adjust their behavior according to changes in the stock market at any time. On perceiving changes in the market, they take corresponding actions, such as buying, selling, or holding. The market gives feedback on these actions. If an action is successful, the probability of it being selected again increases [53]. Kaustia and Knüpfer [46] found that, compared with Bayesian learning theory, reinforcement learning theory can better explain the positive correlation between investors' past IPO returns and future subscriptions. Choi et al. [54] used reinforcement learning theory to explain the savings behavior of individual investors. Erev and Roth [55] established a reinforcement learning model that describes well the evolution of individual behavior in economic experiments.
Therefore, this paper follows Erev and Roth [55] in constructing the evolution rules of the investment behavior of individual investors. Let $q_{i,a}(t)$ denote the attractiveness of strategy $a$ to investor $i$ at time $t$. The forgetting parameter $\lambda \in [0, 1]$ gradually weakens the influence of past experience. $\xi_i(t)$ denotes the return of investor $i$ at time $t$: when the decision of investor $i$ is the same as the optimal decision $\max_{k_j \in S_i} k_j a_{j,1}$, the decision is rewarded; otherwise, $\xi_i = -1$. The investor's strategy update rules are as follows:

② Individual investment decisions: based on the above analyses, the probability that individual investor $i$ chooses to "buy" is
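As an illustration of how a forgetting parameter interacts with returns, the following is a minimal sketch of a generic Roth-Erev-style update. It is an assumption made for illustration, not the update rule of this paper: every attraction is discounted by $(1 - \lambda)$, the chosen strategy is credited with the return $\xi$, and choice probabilities are obtained with a softmax (chosen here only because attractions can become negative when $\xi = -1$). The function names are hypothetical.

```python
import math

ACTIONS = (-1, 0, 1)  # sell, hold, buy

def update_attractions(q, chosen, reward, lam):
    """Assumed rule: discount all attractions by (1 - lam), then credit the chosen action."""
    return {a: (1.0 - lam) * q[a] + (reward if a == chosen else 0.0) for a in ACTIONS}

def choice_probabilities(q):
    """Assumed choice rule: softmax over attractions (well defined for negative values)."""
    z = max(q.values())  # subtract the maximum for numerical stability
    w = {a: math.exp(q[a] - z) for a in ACTIONS}
    total = sum(w.values())
    return {a: w[a] / total for a in ACTIONS}

q = {a: 0.0 for a in ACTIONS}
q = update_attractions(q, chosen=1, reward=1.0, lam=0.5)    # a rewarded "buy"
q = update_attractions(q, chosen=-1, reward=-1.0, lam=0.5)  # a penalized "sell"
p = choice_probabilities(q)
print(q, p)
```

Under this assumed rule, a return received $s$ steps ago carries weight $(1 - \lambda)^s$, so a larger $\lambda$ makes recent experience dominate.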

Evolution Rules of Institutional Investors Behavior.
Greenwood and Nagel [56] analyzed the performance of experienced and inexperienced fund managers during the bubble. They found that inexperienced managers bought technology stocks in large quantities during the run-up and sold them during the downturn. Hence, like individual investors, institutional investors also rely on reinforcement learning in trading. The historical performance of institutional investors influences capital flows and the investment choices of individual investors. Because each seeks to maximize profits, institutional investors compete with one another [48]. Compared with individual investors, institutional investors therefore exhibit not only reinforcement learning in decision-making but also game behavior among themselves, namely, belief learning (game learning). The EWA learning model [57] contains both reinforcement learning and belief learning. Therefore, we construct an evolution model of institutional investor behavior based on the EWA learning model.

① Institutional investors' reinforcement learning strategies: based on the EWA learning model, $f_i^a(t)$ is defined as the attractiveness of strategy $a$ to investor $i$ at time $t$ without considering the influence of other investors' decisions, with $f_i^a(t) \in [-1, 1]$. Let $Q_i(t)$ denote the actual return if institutional investor $i$ chooses strategy $a$ at time $t$. $\lambda \in [0, 1]$ is the forgetting effect coefficient of investors: the larger $\lambda$ is, the weaker the influence of past experience. The rule by which an institutional investor updates its strategy based on reinforcement learning is

② Institutional investment decision: let $S_i(t)$ denote the strategy chosen by investor $i$ at time $t$, and let $S_i^a(t)$ denote strategy $a$ chosen by investor $i$ at time $t$. The decision rule also involves a function $I(S_i^a, \ldots)$ and a vector that represents the set of strategies adopted by all the other adjacent investors of investor $i$ at time $t$.
The return obtained by investor $i$ from choosing strategy $a$ depends on the set of strategies adopted by the other investors. $\alpha \in [0, 1]$ represents the intensity of returns competition among institutional investors. Based on this, the decision function of the institutional investor is

According to the above analyses, the probability that institutional investor $i$ chooses to "buy" is

The probability that an institutional investor chooses to "hold" is

The probability of choosing to "sell" then follows as $p(a_i(t) = -1) = 1 - p(a_i(t) = 1) - p(a_i(t) = 0)$.
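For orientation, the standard experience-weighted attraction (EWA) update on which this construction builds can be sketched as follows. This is the textbook form of EWA, with its usual parameters $\phi$ (decay of past attractions), $\rho$ (decay of the experience weight), and $\delta$ (weight on foregone payoffs); how the paper's $\lambda$ and $\alpha$ map onto these parameters is not shown here, so the sketch should not be read as the decision function used in this paper. The function name and the numbers are illustrative.

```python
def ewa_update(attractions, n_prev, chosen, payoffs, phi, delta, rho):
    """One step of the standard EWA attraction update.

    attractions: dict strategy -> attraction A(t-1)
    n_prev:      experience weight N(t-1)
    payoffs:     dict strategy -> payoff the strategy would have earned this round
    """
    n_new = rho * n_prev + 1.0
    updated = {}
    for s, a_prev in attractions.items():
        # The chosen strategy is reinforced fully; unchosen ones only with weight delta.
        weight = 1.0 if s == chosen else delta
        updated[s] = (phi * n_prev * a_prev + weight * payoffs[s]) / n_new
    return updated, n_new

attractions = {-1: 0.0, 0: 0.0, 1: 0.0}   # sell, hold, buy
payoffs = {-1: -0.2, 0: 0.0, 1: 0.6}      # illustrative payoffs
updated, n_new = ewa_update(attractions, n_prev=1.0, chosen=1,
                            payoffs=payoffs, phi=0.5, delta=0.1, rho=0.5)
print(updated, n_new)
```

Setting `delta=0` reduces the update to pure reinforcement learning (only the chosen strategy is updated), while `delta=1` weights foregone and realized payoffs equally, which corresponds to belief learning.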

Mathematical Analysis
Assume that, at time $t$, the degree of institutional investors is $k$, the proportion whose investment decision is to buy is $O_1(t)$, the proportion whose decision is to sell is $O_{-1}(t)$, and the proportion whose decision is to hold is $O_0(t)$, with $O_1(t) + O_{-1}(t) + O_0(t) = \eta$. Similarly, assume that, at time $t$, the degree of individual investors is $k$, the proportion whose investment decision is to buy is $I_1(t)$, the proportion whose decision is to sell is $I_{-1}(t)$, and the proportion whose decision is to hold is $I_0(t)$, with $I_1(t) + I_{-1}(t) + I_0(t) = 1 - \eta$.

Mathematical Analysis on the Evolution of Institutional Investor Behavior.
Let $\theta_{O,1}(t)$ denote the probability of connecting to an institutional investor whose investment decision is buying; then $\theta_{O,1}(t) = \sum_{k \ge 1} k p(k) O_1(t) / \langle k \rangle$. Let $\theta_{O,-1}(t)$ denote the probability of connecting to an institutional investor whose investment decision is selling; then $\theta_{O,-1}(t) = \sum_{k \ge 1} k p(k) O_{-1}(t) / \langle k \rangle$. Let $\theta_{O,0}(t)$ denote the probability of connecting to an institutional investor whose investment decision is holding; then $\theta_{O,0}(t) = \sum_{k \ge 1} k p(k) O_0(t) / \langle k \rangle$. Thus, the probability that there are exactly $n_{i,1}$ buyers and $n_{i,-1}$ sellers among the institutional investors connected to investor $i$ is

So, the probability of institutional investor $i$ choosing to buy is

The probability of choosing to sell is

The probability of choosing to hold is

The rate of change of the proportion of institutional investors who buy stocks is

According to the mean-field equation, this rate of change equals the probability that a non-buyer becomes a buyer minus the probability that a buyer becomes a non-buyer. When the above expression equals 0, the network reaches a steady state. We can infer

According to equation (10), combined with $\theta_{O,1}(t) = \sum_{k \ge 1} k p(k) O_1(t) / \langle k \rangle$, the values of $\theta_{O,1}$ and $O_1$ in the equilibrium state of institutional investor behavior can be determined:

However, the above analysis does not make clear how key elements such as network topology and learning strategies affect the equilibrium state of the evolution of institutional investor behavior, so further discussion is needed.
It can be seen that investors will choose to buy when

In equation (12), one term is not affected by $\alpha$, so the analysis mainly considers the influence of $\alpha$ on the other. Based on this, it is assumed that $\alpha_1 = n_i f_i^1(t)$ and $\alpha_2 = (1/2) n_i f_i^1(t)$; investors' decision-making is not affected by $\alpha$ when $n_i = 0$ and $f_i^1(t) < 0$, so that case requires no analysis. Therefore, as $\alpha$ increases, $h_1(t)$ is not strictly decreasing; that is, $E(\theta_{O,1})$ is not strictly decreasing. However, it is difficult to obtain the influence of the forgetting effect $\lambda$ on the stable state $\theta_{O,1}$ of institutional investor behavior through mathematical analysis, so this paper analyzes it by simulation.
Thus, the probability that there are exactly $n_{i,1}$ buyers and $n_{i,-1}$ sellers among the institutional investors connected to investor $i$ is

Similarly, the probability that there are exactly $m_{i,1}$ buyers and $m_{i,-1}$ sellers among the individual investors connected to investor $i$ is

The probability of the above two events occurring at the same time is

So, the probability of individual investor $i$ choosing to buy is

The probability of choosing to sell is

The probability of choosing to hold is

The rate of change of the proportion of individual investors who buy stocks is

According to the mean-field equation, this rate of change equals the probability that a non-buyer becomes a buyer minus the probability that a buyer becomes a non-buyer. When the above expression equals 0, the network reaches a steady state. We can infer

According to equation (19), combined with $\theta_{I,1}(t) = \sum_{k \ge 1} k p(k) I_1(t) / \langle k \rangle$, the values of $\theta_{I,1}$ and $I_1$ in the equilibrium state of individual investor behavior can be determined:

However, the influence of learning strategy, forgetting effect, initial state, and other factors on the equilibrium state of individual investor behavior cannot be obtained through mathematical analysis. Therefore, this paper analyzes it by simulation.
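The degree-weighted link probability and the neighbor-configuration probability used in this kind of mean-field argument can be sketched as follows. Two assumptions are made for illustration and are not taken from the paper: the fraction of buyers is allowed to vary with degree so that the degree weighting is visible, and the three link probabilities are assumed to sum to one within the neighbor group, so that the configuration probability is an ordinary multinomial. The function names and the degree distribution are hypothetical.

```python
from math import comb

def link_probability(degree_dist, fraction_by_degree):
    """theta = sum_k k p(k) x_k / <k>: chance that a random edge ends at a node in the state."""
    mean_k = sum(k * p for k, p in degree_dist.items())
    weighted = sum(k * p * fraction_by_degree[k] for k, p in degree_dist.items())
    return weighted / mean_k

def neighbor_config_probability(n_buy, n_sell, n_hold, th_buy, th_sell, th_hold):
    """Multinomial probability of a (buy, sell, hold) split among n neighbors (assumed form)."""
    n = n_buy + n_sell + n_hold
    ways = comb(n, n_buy) * comb(n - n_buy, n_sell)
    return ways * th_buy**n_buy * th_sell**n_sell * th_hold**n_hold

degree_dist = {2: 0.5, 4: 0.3, 8: 0.2}   # illustrative p(k)
buy_frac = {2: 0.2, 4: 0.5, 8: 0.8}      # illustrative degree-specific buyer fractions
theta_buy = link_probability(degree_dist, buy_frac)

# Probabilities of all (buy, sell, hold) splits among 3 neighbors.
total = sum(neighbor_config_probability(b, s, 3 - b - s, 0.5, 0.3, 0.2)
            for b in range(4) for s in range(4 - b))
print(theta_buy, total)
```

When the buyer fraction is the same for every degree, `link_probability` returns that fraction itself, which matches the degree-independent proportions used in the text.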
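The steady-state condition described here can be made concrete with a small numerical sketch. If a group has total share $c$ (that is, $\eta$ or $1 - \eta$) and buyer share $x$, the verbal rule "non-buyers who become buyers minus buyers who become non-buyers" gives $dx/dt = (c - x)P - x(1 - P) = cP - x$, so the steady state satisfies $x = cP$. In the sketch below, `buy_probability` is a hypothetical logistic stand-in for the buy probability, not the paper's decision rule, and the link probability is taken equal to $x$, as it is when the buyer share does not depend on degree.

```python
import math

def buy_probability(theta, beta=4.0, bias=0.1):
    """Hypothetical stand-in for P(buy | theta); NOT the decision rule of the paper."""
    return 1.0 / (1.0 + math.exp(-beta * (theta - 0.5 + bias)))

def steady_state(share, x0=0.0, dt=0.1, steps=5000, tol=1e-12):
    """Euler integration of dx/dt = (share - x) * P - x * (1 - P) until it stalls."""
    x = x0
    for _ in range(steps):
        p_buy = buy_probability(x)  # degree-independent shares: theta equals x
        dx = (share - x) * p_buy - x * (1.0 - p_buy)
        x += dt * dx
        if abs(dx) < tol:
            break
    return x

x_star = steady_state(share=0.8)  # e.g. individual investors with eta = 0.2
print(x_star, 0.8 * buy_probability(x_star))
```

The two printed numbers agree at the fixed point, which is the self-consistency condition that the equilibrium equations express.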

Simulation Results
The previous part theoretically analyzed the evolution mechanism of investor behavior in the stock market under reinforcement learning. The analysis shows that this evolution is mainly affected by the network structure, the intensity of returns competition among institutional investors $\alpha$, and the forgetting parameter $\lambda$. On this basis, this section uses simulation to further explain the evolution process of investor behavior. According to the statistical yearbook data of the Shanghai Stock Exchange, the transaction proportions of the various investor types over the past five years are as follows: natural person investors account for 80% to 85% of transactions, and general legal persons, Shanghai Stock Connect, and professional institutions together account for 15% to 20%. Based on this, this paper sets the proportion of institutional investors in the network to $\eta = 0.2$. With the total number of investors set to 500, there are 400 individual investors and 100 institutional investors. Existing research shows that the network of stock investors based on financial correlation exhibits power-law distribution characteristics [18][19][20]. Based on this, a two-layer scale-free network is constructed, with a different average degree in each layer. Edges between the two types of investors are established according to either degree-dependent connection or random connection. In the initial state, all investors are in the "holding" state. The evolution time is 300. The default parameter values are $\lambda = 0.5$, $\alpha = 0.1$, and $Q_i(t) \in [-1, 1]$, and the average degree of the entire network is $\langle k \rangle = 12$. This paper assumes that the market is generally in a good state.
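A network of this shape can be sketched in a few lines. The construction below is an assumption made for illustration, since the text does not give the exact procedure: each layer is grown by Barabási-Albert-style preferential attachment, and every individual investor receives one inter-layer edge to an institutional investor chosen either uniformly (random connection) or with probability proportional to its intra-layer degree (degree-dependent connection). The attachment parameters 3 and 6 are illustrative and are not tuned to reproduce $\langle k \rangle = 12$.

```python
import random

def barabasi_albert(n, m, rng, offset=0):
    """Edges of a BA-style graph on nodes offset..offset+n-1; each new node adds m edges."""
    edges, repeated = [], []            # `repeated` lists each node once per incident edge
    targets = list(range(offset, offset + m))
    for new in range(offset + m, offset + n):
        edges.extend((new, t) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:          # draw m distinct nodes, proportional to degree
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges

def degrees(edges):
    """Map node -> number of incident edges."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def interlayer_edges(individuals, institutions, inst_degree, rng, degree_dependent):
    """One inter-layer edge per individual investor (assumed rule, for illustration)."""
    weights = [inst_degree[j] for j in institutions] if degree_dependent else None
    return [(i, rng.choices(institutions, weights=weights, k=1)[0]) for i in individuals]

rng = random.Random(0)
n_inst, n_ind = 100, 400                                  # eta = 0.2 with 500 investors
inst_edges = barabasi_albert(n_inst, 3, rng)
ind_edges = barabasi_albert(n_ind, 6, rng, offset=n_inst)
institutions = list(range(n_inst))
individuals = list(range(n_inst, n_inst + n_ind))
cross = interlayer_edges(individuals, institutions, degrees(inst_edges), rng,
                         degree_dependent=True)
print(len(inst_edges), len(ind_edges), len(cross))
```

Passing `degree_dependent=False` gives the random-connection variant on the same two layers, which keeps the comparison between the two inter-layer rules controlled.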

Impact of Network Structure on the Evolution of Investor Behavior in Stock Market.
Figure 2 compares the dynamic evolution of the two types of investor behavior under two network structures, random connection and degree-dependent connection, when the network average degree is $\langle k \rangle = 8$ and $\langle k \rangle = 12$, respectively. Figure 2 describes the impact of network connectivity and network heterogeneity on the evolution of investor behavior. For individual investors, Figures 2(a) and 2(b) show that, in BA-BA randomly connected multi-layer networks, individual investor behavior reaches a steady state at $t \approx 40$ when $\langle k \rangle = 8$ and at $t \approx 10$ when $\langle k \rangle = 12$. This indicates that the time for investment behavior to reach a steady state shortens as the average degree increases. Figures 2(c) and 2(d) show that, in a BA-BA degree-dependent connection multi-layer network, individual investor behavior reaches a steady state at $t \approx 70$ when $\langle k \rangle = 8$ and at $t \approx 10$ when $\langle k \rangle = 12$. This shows that network connectivity and network heterogeneity have an impact on the evolution of individual investor behavior.
For institutional investors, in all cases, the dynamic evolution of their behavior quickly reaches a steady state, and the time to reach it and the state proportions are basically the same. It can be seen that the network topology significantly affects the behavioral evolution of individual investors, but it has a weaker impact on institutional investors.

[Figure 2, panels (a)-(d). Curves: individual "buy", "sell", "hold"; institution "buy", "sell", "hold".]

With the other parameters held the same, Figure 3 analyzes the evolution of the investment behavior of institutional investors and individual investors under different intensities of returns competition among institutional investors $\alpha$. As shown in Figure 3, under the two network structures, $\alpha$ has an impact on the investment behavior of both institutional and individual investors. For institutional investors, as $\alpha$ increases, the proportion "buying" decreases steadily and the proportion "holding" increases steadily, while the proportion "selling" does not change significantly. For individual investors, as $\alpha$ increases, the proportions "holding" and "selling" increase slightly, and the proportion "buying" decreases steadily.

[Figure 3. Curves: individual "buy", "sell", "hold"; institution "buy", "sell", "hold".]
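Steady-state times such as $t \approx 40$ are read off the evolution curves. One simple way to make such a reading reproducible is sketched below; the criterion (the first time after which the series stays within a tolerance of its final value) and the tolerance of 0.01 are assumptions introduced here, and the exponential series is synthetic rather than simulation output.

```python
def time_to_steady_state(series, tol=0.01):
    """First index from which every later value stays within tol of the final value."""
    final = series[-1]
    t_star = len(series) - 1
    for t in range(len(series) - 1, -1, -1):
        if abs(series[t] - final) > tol:
            break
        t_star = t
    return t_star

# Synthetic "buy" share relaxing towards 0.6 over an evolution time of 300.
series = [0.6 * (1 - 0.9 ** t) for t in range(300)]
t_star = time_to_steady_state(series)
print(t_star)
```

Scanning backwards from the end avoids declaring convergence during a temporary plateau, since a value that later drifts outside the tolerance band resets the estimate.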

Impact of Forgetting Effect on the Evolution of Investor Behavior in Stock Market.
This section analyzes the impact of the forgetting effect coefficient on the dynamic evolution of investor behavior in each of the two network structures. Other experimental parameters are set to the above default values. The simulation results are shown in Figure 4.
As shown in Figure 4, under both network structures, as the forgetting coefficient $\lambda$ increases, the proportion of individual investors "buying" decreases steadily, and the proportions "holding" and "selling" increase slightly. Compared with the inter-layer random connection network, the proportion of individual investors "buying" is higher in the inter-layer degree-dependent connection network. Unlike the behavior of individual investors, which is clearly affected by the forgetting effect, institutional investors are relatively less affected by the forgetting coefficient: as $\lambda$ increases, the proportion of institutional investors "buying" decreases to a lesser extent. In summary, as the investors' forgetting effect coefficient increases, the proportion of investors who "buy" at equilibrium falls, and it falls more for individual investors than for institutional investors. That is, the forgetting effect has a greater impact on individual investor behavior.

[Figure 4. Curves: individual "buy", "sell", "hold"; institution "buy", "sell", "hold".]

Impact of Initial Investment Behavior State on the Evolution of Investor Behavior in Stock Market.
This section discusses the effect of investors' initial investment behavior on the dynamic evolution of their behavior. The simulation parameters are set to the above default values.
Figures 5(a)-5(d) describe the effect of the initial behavioral status of investors on the dynamic evolution of their behavior in the two types of network structures with average degrees of $\langle k \rangle = 8$ and $\langle k \rangle = 12$, respectively. The figure shows that, as the proportion of investors whose initial state is "buy" gradually increases, the ultimate equilibrium states of institutional and individual investor behavior remain basically the same. This shows that, for both institutional and individual investors, the initial investment behavior status has little effect on behavioral evolution.

Conclusion
In this paper, we analyze the decision-making mechanisms of individual investors and institutional investors using the theories of complex networks and reinforcement learning and construct an evolution model of investor behavior based on reinforcement learning in networks. Then, the evolution characteristics of investor behavior are investigated by mean-field analysis and simulation. The conclusions are as follows: (1) The topology of the network has a greater influence on the evolution of individual investor behavior than on that of institutional investor behavior. (2) The equilibrium state of institutional investor behavior is nonlinearly related to the intensity of returns competition among institutional investors. Moderate competition for returns among institutional investors is conducive to the stability of the stock market. (3) For both institutional and individual investors, the forgetting effect has a significant influence on the equilibrium of their evolution. Through reinforcement learning, investors can obtain useful investment advice from a large amount of market information, choose which information is more worthy of attention, and to a certain extent eliminate the influence of behavioral bias on investment. At the same time, however, successful experiences also strengthen irrational characteristics of investors, such as overconfidence. In a bull market, investors often attribute their success to their investment ability, which makes them more overconfident and leads to further irrational behavior. Therefore, investors should view their own experience as rationally as possible when making investment decisions and adopt a long-term investment outlook. (4) For both institutional and individual investors, the initial state of investor behavior has only a small influence on the equilibrium of their evolution.
This paper studies the evolution of investor behavior on a static network structure. However, the network structure is bound to evolve as the behavior of investors changes. Therefore, analyzing the evolution of investor behavior under a dynamic network structure is left for the authors' future work.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.