Stock Trading Strategies Based on Deep Reinforcement Learning

The purpose of stock market investment is to obtain more profits. In recent years, an increasing number of researchers have tried to implement stock trading based on machine learning. In the complex stock market, it is difficult to obtain effective information from multisource data and to implement dynamic trading strategies. To solve these problems, this study proposes a new deep reinforcement learning model for stock trading that analyzes the stock market through stock data, technical indicators, and candlestick charts and learns dynamic trading strategies. The features of the different data sources, extracted by deep neural networks, are fused to form the state of the stock market, and the reinforcement learning agent makes trading decisions on this basis. Experiments on the Chinese stock market dataset and the S&P 500 stock market dataset show that our trading strategy obtains higher profits than other trading strategies.


Introduction
Stock trading is the process of buying and selling stocks to obtain investment profit. The key to stock trading is to make the right trading decisions at the right times, that is, to develop a suitable trading strategy [1]. In recent years, many studies have used machine learning methods to predict stock trends or prices and trade on those predictions. However, long-horizon prediction of the price or trend of a stock is not reliable. Besides, a trading strategy based on stock price prediction is static [2,3]. The stock market is affected by many factors [4][5][6], such as changes in investor psychology and company policies, natural disasters, and emergencies, so stock prices fluctuate greatly. Compared with a static trading strategy, a dynamic trading strategy can make trading decisions according to the changes of the stock market, which gives it a greater advantage.
At present, an increasing number of studies implement dynamic trading strategies based on deep reinforcement learning. Reinforcement learning has gained increasing attention since AlphaZero defeated humans [7]; it has the ability of independent learning and decision-making and has been successfully applied in game playing [8,9], unmanned driving [10,11], and helicopter control [12].
Reinforcement learning solves sequential decision-making problems and can therefore be applied to stock trading to learn dynamic trading strategies. Nevertheless, reinforcement learning lacks the ability to perceive the environment. The combination of deep learning and reinforcement learning (i.e., deep reinforcement learning) solves this problem: it has both the decision-making ability of reinforcement learning and the perception ability of deep learning.
One of the challenges when implementing stock trading based on deep reinforcement learning is the correct analysis of the state of the stock market. Financial data is nonlinear and unstable. Most of the existing studies on stock trading based on deep reinforcement learning analyze the stock market through stock data [13][14][15]. However, there is noise in stock data, which affects the final analysis results. Technical indicators can reflect the changes in the stock market from different perspectives and reduce the influence of noise [16,17]. There are also studies that convert financial data into two-dimensional images for analyzing the stock market [18][19][20][21][22]. Different data sources reflect the changes in the stock market from different perspectives. Compared with analysis based on a single data source, multisource data can integrate the characteristics of the different data sources, which is more conducive to the analysis of the stock market. However, the fusion of multisource data is difficult.
To analyze the stock market more deeply and learn the optimal dynamic trading strategy, this study proposes a deep reinforcement learning model that integrates multisource data to implement stock trading. Through the analysis of stock data, technical indicators, and candlestick charts, we obtain a deeper feature representation of the stock market, which is conducive to learning the optimal trading strategy. Besides, the setting of the reward function in reinforcement learning cannot be ignored. In stock trading, attention should be paid to investment risk as well as returns, and the two should be reasonably balanced. The Sharpe ratio (SR) represents the profit that can be obtained under a given level of risk [23]. In this study, the reward function takes investment risk into consideration and combines the SR and the profit rate (PR) to promote the learning of optimal trading strategies.
To verify the effectiveness of the trading strategy learned by our proposed model, we compare it with other trading strategies on practical trading data. For stocks with different trends, our trading strategy obtains higher PR and SR, which indicates better robustness. In addition, we conduct ablation experiments, and the experimental results show that the trading strategy learned from analyzing the stock market based on multisource data is better than those learned from analyzing the stock market based on a single data source. The main contributions of this paper are as follows:
(i) A new deep reinforcement learning model is proposed to implement stock trading; it integrates the stock data and candlestick charts to analyze the stock market, which is more helpful for learning the optimal dynamic trading strategy.
(ii) A new reward function is proposed. In this study, investment risk is taken into account, and the sum of SR and PR is taken as the reward function.
(iii) The experimental results show that the trading strategy learned by the deep reinforcement learning model proposed in this paper can obtain better profits for stocks with different trends.

Related Work
In recent years, a large number of machine learning methods have been applied to stock trading. Investors make trading decisions based on their judgment of the stock market. However, due to the influence of many factors, they cannot always make correct trading decisions in time as the stock market changes. Compared with traditional trading strategies, machine learning methods have the advantage that they can learn trading strategies by analyzing information related to the stock market and can discover profit patterns that people do not know about, without requiring professional financial knowledge. Some studies implement stock trading based on deep learning methods. Deep learning methods usually implement stock trading by predicting the future trend or price of a stock [24][25][26][27]. In the financial field, deep learning methods are used for stock price prediction because they can capture temporal characteristics of financial data [28,29]; Chen et al. [30] analyzed 2D images transformed from financial data through a convolutional neural network (CNN) to classify the future price trend of stocks. When implementing stock trading based on deep learning, the higher the accuracy of the prediction, the more helpful it is for the trading decision. Conversely, when the prediction deviates greatly from the actual situation, it leads to faulty trading decisions. In addition, the trading strategy implemented by such methods is static and cannot be adjusted in time according to the changes in the stock market.
Reinforcement learning can be used to implement stock trading through self-learning and autonomous decision-making. Chakole et al. [31] used the Q-learning algorithm [32] to find the optimal trading strategy, in which the unsupervised learning method K-means and the candlestick chart were, respectively, used to represent the state of the stock market. Deng et al. [33] proposed the Deep Direct Reinforcement Learning model and added fuzzy learning, which was the first attempt to combine deep learning and reinforcement learning in the field of financial transactions. Wu et al. [34] proposed a long short-term memory based (LSTM-based) agent that could perceive stock market conditions and trade automatically by analyzing stock data and technical indicators. Lei et al. [35] proposed a time-driven feature-aware jointly deep reinforcement learning model (TFJ-DRL), which combines a gated recurrent unit (GRU) and the policy gradient algorithm to implement stock trading. Lee et al. [36] proposed the HW_LSTM_RL structure, which first uses wavelet transforms to remove noise from stock data and then analyzes the stock data with deep reinforcement learning to make trading decisions.
Existing studies on stock trading based on deep reinforcement learning mostly analyze the stock market through a single data source. In this study, we propose a new deep reinforcement learning model to implement stock trading and analyze the state of the stock market through stock data, technical indicators, and candlestick charts. In our proposed model, first, different deep neural networks are used to extract the features of the different data sources. Second, the features of the different data sources are fused. Finally, the reinforcement learning module makes trading decisions according to the fused features and continuously optimizes the trading strategy according to the profits. The setting of the reward function in reinforcement learning cannot be ignored. In this study, the SR is added to the reward function so that investment risk is taken into account together with profits.

Methods
We propose a new deep reinforcement learning model and implement stock trading by analyzing the stock market with multisource data. In this section, we first introduce the overall deep reinforcement learning model and then describe the feature extraction process of the different data sources in detail. Finally, the specific application of reinforcement learning in stock trading is introduced.

The Overall Structure of the Model.
When implementing stock trading based on deep reinforcement learning, correctly analyzing the state of the stock market is more conducive to learning the optimal dynamic trading strategy. To obtain a deeper feature representation of the stock market state and learn the optimal dynamic trading strategy, we fuse the features of stock data, technical indicators, and candlestick charts. Figure 1 shows the overall structure of the model. The deep reinforcement learning model we propose can be divided into two modules: the deep learning module, which extracts features from the different data sources, and the reinforcement learning module, which makes trading decisions. Candlestick chart features are extracted by a CNN and a bidirectional long short-term memory (BiLSTM) network; stock data and technical indicators are the input of an LSTM network for feature extraction. After the features of the different data sources are extracted, they are concatenated to implement feature fusion. The fused features are regarded as the state of the stock market, and the reinforcement learning module makes trading decisions on this basis. In the reinforcement learning module, the algorithms used are Dueling DQN [37] and Double DQN [38].

Deep Learning Module.
The purpose of this study is to obtain a deeper feature representation of the stock market environmental state through the fusion of multisource data in order to learn the optimal dynamic trading strategy. Although raw stock data can reflect changes in the stock market, they contain considerable noise. To reduce the impact of noise and perceive the changes of the stock market more objectively and accurately, relevant technical indicators are used as one of the data sources for analyzing the stock market in this study. Candlestick charts reflect the changes in the stock market from another perspective, so this paper also fuses the features of the candlestick charts.

Stock Data and Technical Indicator Feature Extraction.
Due to the noise in stock data, we use relevant technical indicators to reduce the impact of noise. The technical indicators reflect the changes in the stock market from different perspectives. In this paper, stock data and technical indicators are used as inputs to the LSTM network to better capture the main trends of stocks. The raw stock data we use include the opening price, closing price, high price, low price, and trading volume. The technical indicators used in this paper are the MACD, EMA, DIFF, DEA, KDJ, BIAS, RSI, and WILLR. The indicators are calculated by mathematical formulas based on stock prices and trading volumes [39], as reported in Table 1.
To facilitate subsequent calculations, we perform missing value processing on the stock data. First, the stock data are cleaned, and the missing data are filled with 0. In addition, the input of the neural network must be real-valued, so we replace the NaNs in the stock data and technical indicators with 0. Data with different value ranges may cause gradient explosion during neural network training [42]. To prevent this problem, we normalize the stock data and technical indicators, which transforms the data to a fixed interval. In this work, each dimension of the stock data and technical indicators is normalized to the range [0, 1]. The normalization formula is as follows:

$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}},$$

where $X$ represents the original data, $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values of the original data, respectively, and $X_{\mathrm{norm}}$ represents the normalized data.

The neural network structure for extracting features from stock data and technical indicators is shown in Figure 2. The LSTM network is a variant of the recurrent neural network (RNN), and its unit structure is shown in Figure 3. LSTM alleviates the problems of vanishing and exploding gradients when training on long sequences. In the LSTM network, $f$, $i$, and $o$ represent the forget gate, the input gate, and the output gate, respectively. The forget gate is responsible for removing information from the cell state. The input gate is responsible for adding information to the cell state. The output gate decides which part of the cell state is passed on as the next hidden state. $C_t$ is the state of the memory cell at time $t$; $\tilde{C}_t$ is the value of the candidate state of the memory cell at time $t$; $\sigma$ and $\tanh$ are the sigmoid and tanh activation functions, respectively; $W$ and $b$ represent the weight matrices and bias vectors, respectively; $x_t$ is the input vector; and $h_t$ is the output vector. In this paper, $x_t$ is the concatenation of the stock data and the technical indicators. The input $x_t$ and the standard LSTM update equations are as follows:

$$x_t = (\mathrm{open}, \mathrm{low}, \mathrm{close}, \ldots, \mathrm{MACD}, \mathrm{RSI}, \mathrm{WILLR}),$$
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$h_t = o_t \odot \tanh(C_t).$$

In the entire feature extraction process, the stock data and technical indicators are first cleaned and normalized. Then, the normalized data are used as the input of the LSTM network. Finally, the final feature is obtained through a two-layer LSTM network.
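The cleaning and normalization steps described in this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `fill_missing` and `min_max_normalize` are hypothetical, and mapping a constant column to 0.0 is an assumption that the text does not address.

```python
import math


def fill_missing(column):
    """Replace missing entries (None or NaN) with 0.0, as the text describes."""
    cleaned = []
    for value in column:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned.append(0.0)
        else:
            cleaned.append(float(value))
    return cleaned


def min_max_normalize(column):
    """Scale one feature dimension to [0, 1] using its own minimum and maximum."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


# Hypothetical closing prices with one missing value.
close_prices = [10.0, float("nan"), 12.5, 15.0]
normalized_close = min_max_normalize(fill_missing(close_prices))
```

Note that filling a missing price with 0 makes 0 the column minimum, so the remaining values are scaled relative to it; each feature dimension is normalized independently.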

Candlestick Chart Feature Extraction.
To extract more informative features, in this study, historical stock data are transformed into candlestick charts. A candlestick chart contains not only the candlesticks but also other information, and it can be divided into two parts: the upper part shows the candlesticks and the moving average of the closing price, and the lower part shows the trading volume histogram and its moving average. Generally, a candlestick consists of a body, an upper shadow, and a lower shadow. The body is the difference between the closing price and the opening price of the stock during the trading session, as shown in Figure 4. If the opening price is lower than the closing price, the price has risen; this kind of candlestick is called a bullish candlestick, and its color is red. If the opening price is higher than the closing price, the price has fallen; this kind of candlestick is called a bearish candlestick, and its color is green. For a bullish candlestick, the upper shadow is the difference between the high price and the closing price, and the lower shadow is the difference between the opening price and the low price. For a bearish candlestick, the upper shadow is the difference between the high price and the opening price, and the lower shadow is the difference between the closing price and the low price. The period covered by one candlestick can range from one minute to one month; in this study, the candlestick charts are daily. The network structure for extracting candlestick chart features is shown in Figure 5. We first obtain the features of the candlestick chart through three layers of convolution and pooling, then reshape the resulting feature vector and feed it into the BiLSTM network, and finally obtain the final features.
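The candlestick geometry described above can be written down directly. The sketch below is illustrative only: `candlestick_parts` is a hypothetical helper, and treating a session whose opening and closing prices are equal as non-bullish is an assumption, since the text does not cover that case.

```python
def candlestick_parts(open_price, high_price, low_price, close_price):
    """Return the body, shadows, and color of one candlestick.

    Uses the red-for-rising, green-for-falling color convention from the text.
    """
    bullish = close_price > open_price  # assumption: open == close counts as not bullish
    body_top = max(open_price, close_price)
    body_bottom = min(open_price, close_price)
    return {
        "bullish": bullish,
        "color": "red" if bullish else "green",
        "body": body_top - body_bottom,
        # Bullish: high - close; bearish: high - open.
        "upper_shadow": high_price - body_top,
        # Bullish: open - low; bearish: close - low.
        "lower_shadow": body_bottom - low_price,
    }


rising_day = candlestick_parts(10.0, 12.0, 9.0, 11.0)
falling_day = candlestick_parts(11.0, 12.0, 9.0, 10.0)
```

Using the top and bottom of the body covers the bullish and bearish cases with a single expression for each shadow.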

Reinforcement Learning Module.
Reinforcement learning involves an agent, an environment, states, actions, and rewards. The agent chooses an action according to the environmental state and receives an immediate reward every time it chooses an action. The agent constantly adjusts its strategy according to the reward value to obtain the largest cumulative reward. For example, in the process of stock trading, if the trading action selected by the agent yields a profit, the agent receives a positive reward. In contrast, if there is a loss after a trading action, the agent receives a negative reward. The reward encourages the agent to choose the correct action in the future. Most previous works used trading profits as the immediate reward of reinforcement learning to optimize the trading strategy. However, this only considers the change in profits after each trading action and does not consider the investment risk. In this paper, we take investment risk into consideration. The SR is an indicator for evaluating transactions and is used to optimize the trade-off between profitability and risk. The SR is the expected return minus the risk-free rate, divided by the standard deviation of the return. A reward function that considers both investment risk and return change is more advantageous than one based on return change alone, which is confirmed by our experiments, so the immediate reward is set to the sum of the PR and the SR:

$$r_t = \mathrm{PR}_t + \mathrm{SR}_t,$$
$$\mathrm{SR} = \frac{E(R_P) - R_f}{\sigma_P},$$
$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k},$$

where $\mathrm{PR}_t$ is the profit rate computed from $P_t$, the sum of the assets owned by the investor at time $t$; $E(R_P)$ is the expected portfolio return; $R_f$ is the risk-free rate; $\sigma_P$ is the standard deviation of the return; $R_t$ is the cumulative reward; and $\gamma \in (0, 1)$ is the discount factor. In this paper, we combine the Double DQN algorithm and the Dueling DQN algorithm, both of which are improved algorithms based on the DQN algorithm.
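As a rough illustration of the reward described above (the PR plus the SR, accumulated with a discount factor), the following sketch uses only the Python standard library. Several details are assumptions rather than the paper's definitions: the PR is taken as the one-step relative change in total assets, the SR is computed over all step returns observed so far with the population standard deviation, the risk-free rate defaults to 0, and the function names are hypothetical.

```python
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0):
    """(mean return - risk-free rate) / standard deviation of the returns."""
    if len(returns) < 2:
        return 0.0  # assumption: not enough history to estimate risk
    sigma = statistics.pstdev(returns)
    if sigma == 0:
        return 0.0  # assumption: zero volatility contributes no SR term
    return (statistics.mean(returns) - risk_free_rate) / sigma


def immediate_reward(asset_history, risk_free_rate=0.0):
    """Reward at the latest step: profit rate plus Sharpe ratio.

    asset_history holds the total assets P_0, ..., P_t (at least two values).
    """
    returns = [(after - before) / before
               for before, after in zip(asset_history, asset_history[1:])]
    profit_rate = returns[-1]  # assumption: one-step relative change in assets
    return profit_rate + sharpe_ratio(returns, risk_free_rate)


def discounted_return(rewards, gamma=0.9):
    """Cumulative reward: the sum over k of gamma**k * r_{t+k}."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


reward_flat_risk = immediate_reward([100.0, 110.0, 121.0])
reward_volatile = immediate_reward([100.0, 110.0, 99.0])
```

With a steady 10% gain per step the volatility is zero, so under these assumptions the reward reduces to the profit rate; in the second example, the loss on the last step dominates.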
In value-based deep reinforcement learning algorithms, actions are selected according to the Q value. In the DQN algorithm, there are two networks with the same structure, the main network and the target network. Initially, the parameters of the two networks are the same. During training, the target network and the main network update their parameters in different ways. In the DQN algorithm, under the ε-greedy strategy, the agent chooses, with the greater probability, the action corresponding to the maximum Q value, which causes the Q value to be overestimated.
Compared with the DQN algorithm, the Dueling DQN algorithm changes the calculation of the Q value by adding the state value function V(s) and the advantage function A(s, a). The value function V(s) is used to evaluate the quality of the state, and the advantage function A(s, a) is used to evaluate the quality of choosing action a in state s. The calculation formula of the Q value is as follows:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \alpha) + A(s, a; \theta, \beta),$$

where $\alpha$ and $\beta$, respectively, represent the parameters of the value function V(s) and the advantage function A(s, a), and $\theta$ represents the other parameters of the deep reinforcement learning model. Double DQN changes the calculation of the Q value of the target network and addresses the Q value overestimation problem of the DQN algorithm; it can be combined with the Dueling DQN algorithm to improve the overall performance of the model. The formula for calculating the Q value of the target network in the Double DQN algorithm is as follows:

$$y_t = r_t + \gamma \, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta); \theta'\big),$$

where $\theta$ and $\theta'$ represent the parameters of the main network and the target network, respectively. The loss function is the mean square error between the Q value of the main network and the target value:

$$L(\theta) = E\big[(y_t - Q(s_t, a_t; \theta))^2\big].$$

In this study, we analyze the stock market through stock data, technical indicators, and candlestick charts and fuse the features of the different data sources to obtain a feature representation of the stock market state that helps the agent learn the optimal dynamic trading strategy. The trading action is $a_t \in \{\text{long}, \text{neutral}, \text{short}\} = \{1, 0, -1\}$, where long, neutral, and short represent buy, hold, and sell, respectively. When the trading action is long, as much cash as possible is converted into stock, and when the trading action is short, all shares are sold for cash. In addition, transaction costs in stock trading cannot be ignored. High-frequency trading results in higher costs; the transaction cost in this paper is 0.1% of the stock value [40]. The trading process is shown in Algorithm 1.
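The two Q-value calculations described in this subsection can be illustrated on plain lists of numbers, without a neural network. The sketch below is not the paper's implementation: the function names are hypothetical, the Q values are made up, and the optional mean-subtraction in `dueling_q` is the aggregation commonly used in the original Dueling DQN work, offered here as an assumption alongside the plain sum.

```python
def dueling_q(state_value, advantages, subtract_mean=False):
    """Combine V(s) and A(s, a) into one Q value per action."""
    if subtract_mean:
        # Common variant: center the advantages so V and A are identifiable.
        mean_advantage = sum(advantages) / len(advantages)
        return [state_value + adv - mean_advantage for adv in advantages]
    return [state_value + adv for adv in advantages]


def double_dqn_target(reward, next_q_main, next_q_target, gamma=0.9, done=False):
    """Double DQN target: the main network picks the action, the target network scores it."""
    if done:
        return reward  # assumption: no bootstrapping after a terminal step
    best_action = max(range(len(next_q_main)), key=lambda a: next_q_main[a])
    return reward + gamma * next_q_target[best_action]


def squared_error(target, q_main_value):
    """Per-sample loss term; averaging it over a minibatch gives the MSE loss."""
    return (target - q_main_value) ** 2


# Three actions in the order (long, neutral, short); all numbers are illustrative.
q_values = dueling_q(1.0, [0.0, 1.0, 2.0])
target = double_dqn_target(1.0, [0.2, 0.9, 0.1], [5.0, 2.0, 7.0], gamma=0.5)
```

In the example the target network rates the third action highest (7.0), but the main network selects the second action, so the target uses 2.0. This decoupling of action selection from action evaluation is what reduces overestimation.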
Algorithm 1: The trading process.
Input: stock data, technical indicators, candlestick chart
Initialize the experience replay memory D to capacity C
Initialize the main Q network with random weights θ
Initialize the target Q network with θ' = θ
for episode = 1 to N do
    for t = 1 to T do
        The fused features extracted by the deep learning module represent the environment state s_t
        With probability ε choose a random action a_t
        Otherwise select a_t = argmax_a Q(s_t, a; θ)
        Get the reward r_t and the next state s_{t+1}
        Store the transition (s_t, a_t, r_t, s_{t+1}) in D
        if t % n = 0 then
            Sample a minibatch (s_t, a_t, r_t, s_{t+1}) randomly from D
            Set y_t = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ')
            Perform a gradient descent step on (y_t − Q(s_t, a_t; θ))^2 with respect to θ
            Update the target network parameters θ' = θ every N steps
        end if
    end for
end for
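The effect of a trading action on the account, as described for the long, neutral, and short actions with a 0.1% transaction cost, can be sketched as follows. This is a hedged illustration: `execute_action` and `total_assets` are hypothetical names, and buying only whole shares and charging the cost on both purchases and sales are assumptions not stated in the text.

```python
COST_RATE = 0.001  # 0.1% of the traded stock value, as stated in the text


def execute_action(action, cash, shares, price, cost_rate=COST_RATE):
    """Apply action 1 (long), 0 (neutral), or -1 (short) to the account."""
    if action == 1 and cash > 0:
        # Long: convert as much cash as possible into whole shares.
        unit_cost = price * (1 + cost_rate)
        bought = int(cash // unit_cost)
        cash -= bought * unit_cost
        shares += bought
    elif action == -1 and shares > 0:
        # Short: sell all shares for cash, net of the transaction cost.
        cash += shares * price * (1 - cost_rate)
        shares = 0
    # Neutral (0): hold the current position.
    return cash, shares


def total_assets(cash, shares, price):
    """Total assets P_t: cash plus the market value of the shares held."""
    return cash + shares * price


cash_after_buy, shares_after_buy = execute_action(1, 100000.0, 0, 10.0)
cash_after_sell, shares_after_sell = execute_action(-1, 0.0, 100, 10.0)
```

The total assets returned by `total_assets` before and after a step are the quantities from which a profit-based reward can be computed.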

Experiment and Results
This section mainly introduces the dataset, evaluation metrics, comparison methods, implementation details, and experimental result analysis.

Datasets.
In this study, we verify the dynamic trading strategy learned by the proposed model on datasets of Chinese stocks and S&P 500 stocks and compare it with other trading strategies. The dataset covers the period from January 2012 to January 2021. The training period ranges from January 2012 to December 2018; the testing period ranges from January 2019 to January 2021. The stock data include the daily open price, high price, low price, close price, and trading volume of each stock, as shown in Table 2.

Metrics.
The evaluation indicators used in this paper are PR, the annualized rate of return (AR), SR, and max drawdown duration (MDD). The details are as follows:
(i) PR refers to the difference between the assets owned at the end of the stock transaction and the original assets, divided by the original assets.
(ii) AR is the ratio of the profits to the principal, scaled to an investment period of one year. The formula is defined as follows:

$$\mathrm{AR} = \frac{\text{total profits}}{\text{principal}} \times \frac{365}{\text{trading days}} \times 100.$$

(iii) SR is a standardized comprehensive evaluation index that considers both profits and risks at the same time, which eliminates the adverse impact of risk factors on performance evaluation.
(iv) MDD refers to the maximum loss that can be incurred during trading; the lower the value, the better the performance of the trading strategy.
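The PR, AR, and MDD metrics listed above can be computed from a series of total asset values, as in the sketch below. The function names are hypothetical, and measuring MDD as the largest peak-to-trough decline relative to the running peak is an assumption: the text only describes it as the maximum loss during trading.

```python
def profit_rate(assets):
    """PR: (final assets - original assets) / original assets."""
    return (assets[-1] - assets[0]) / assets[0]


def annualized_return(assets, trading_days):
    """AR in percent: (total profits / principal) * (365 / trading days) * 100."""
    return profit_rate(assets) * 365 / trading_days * 100


def max_drawdown(assets):
    """Largest relative decline from a running peak (assumed definition of MDD)."""
    peak = assets[0]
    worst = 0.0
    for value in assets:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical total assets observed over a 73-day test period.
asset_curve = [100.0, 120.0, 90.0, 110.0]
pr = profit_rate(asset_curve)
ar = annualized_return(asset_curve, trading_days=73)
mdd = max_drawdown(asset_curve)
```

Here the curve gains 10% overall, which annualizes to about 50%, while the drop from 120 to 90 gives a drawdown of 25%.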

Baselines
(i) Buy and Hold (B&H) [41] refers to constructing a portfolio according to a determined asset allocation ratio and maintaining this portfolio during the holding period without changing the asset allocation. B&H is a passive investment strategy.
(ii) Two models based on the Q-learning algorithm have been proposed to implement stock trading [31]. The two models perceive the stock market environment in different ways: model 1 analyzes the stock market through the K-means method, and model 2 analyzes the stock market through a candlestick chart. The experimental results in [31] show that model 1 performs better than model 2, so we only compare with model 1. In model 1, the number of clusters n is set to 3, 6, and 9, and we compare with each setting.
(iii) An LSTM-based agent has been proposed to learn the temporal relationships in the data and implement stock trading [34].

Implementation Details.
This study implements stock trading based on deep reinforcement learning and fuses the features of stock data, technical indicators, and candlestick charts as the state of the stock market. The LSTM network extracts the features of stock data and technical indicators; the size of its hidden layer is 128. The size of the candlestick chart is 390 × 290. In the process of learning the optimal trading strategy, one episode covers the trading period from January 2012 to December 2018, and the number of training episodes is 200. In the ε-greedy strategy of reinforcement learning, ε = 0.8. The length of the sliding time window is set to 30 days, and the learning rate is 0.0001.
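Two of the settings above, the 30-day sliding time window and the ε-greedy action selection, can be sketched with the standard library alone. The helper names are hypothetical, and following Algorithm 1 in choosing a random action with probability ε is an assumption about how the stated value ε = 0.8 is applied.

```python
import random


def sliding_windows(rows, window=30):
    """Split a daily series into overlapping windows of `window` consecutive days."""
    return [rows[start:start + window] for start in range(len(rows) - window + 1)]


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# 35 hypothetical trading days yield 6 windows of 30 days each.
windows = sliding_windows(list(range(35)), window=30)

rng = random.Random(0)  # seeded generator so runs are repeatable
greedy_action = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0, rng=rng)
explored_action = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.8, rng=rng)
```

Each window would correspond to one state observation, covering the most recent 30 trading days.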

Comparative Experiment on the Chinese Stock Dataset.
We select 10 Chinese stocks with different trends for comparative experiments, and the initial capital is 100,000 Chinese Yuan (CNY). The results of the experiment are shown in Tables 3-6. We select three stocks with different trends to further demonstrate the PR changes, as shown in Figure 6. The traditional trading strategy B&H is a passive trading strategy, which has an advantage for stocks with rising prices. However, it does not perform well for stocks with large price fluctuations or downward trends. It can be seen from Figure 6 that for stock 002460, which has an upward trend, the B&H trading strategy obtains a higher PR, while for the other two stocks, 601101 and 600746, which have different trends, the PR obtained is not as good as that of the other trading strategies. In contrast to the traditional trading strategy B&H, trading strategies learned with the Q-learning algorithm are dynamic. In most cases, the trading strategies learned by model 1 obtain higher PR, AR, and SR and lower MDD for stocks with different trends in different fields. Nonetheless, reinforcement learning lacks the ability to perceive the environment. Compared with the trading strategies learned by deep reinforcement learning models, the trading strategies learned by model 1 do not have obvious advantages. LSTM_based, TFJ-DRL, and HW_LSTM_RL are all methods based on deep reinforcement learning, although each of them analyzes a relatively limited set of data sources. Compared with the traditional trading strategy B&H and the trading strategies learned by model 1, the trading strategies learned by these methods obtain more profits for stocks with different trends in different fields. From the experimental results, we can see that the dynamic trading strategies learned by our proposed model perform better. It can be seen from Tables 3-6 that, for stocks with different trends, the trading strategy learned by the model proposed in this paper has better performance.
Compared with other trading strategies, the evaluation indicators of our trading strategy are the highest in most cases. On the whole, the average PR of our trading strategy is [...] Tables 7-10. In this section, we select ten stocks with different trends and show their price changes and PR changes in detail, as shown in Figure 7.

Scientific Programming
For stocks with different trends in the S&P 500, our trading strategy also performs better. It can be seen from Tables 7-10 that, compared with other trading strategies, our trading strategy obtains higher yields, SR, and AR and a better MDD. On the whole, our PR reached 212.96 and our SR reached 1.16, clearly higher than those of the other trading strategies.
To further verify the performance of our proposed model in different stock markets, we conducted a Mann-Whitney U test on the profit rates of the 20 stocks selected from the Chinese stock market and the S&P 500 stock market. The result was p = 0.677 > 0.05, indicating that there is no significant difference between the returns obtained by our model in the Chinese stock market and in the S&P 500 stock market, which suggests that the model has good generalization ability.

Reward Function Comparison Experiment.
In this section, we compare the reward function with SR and the reward function without SR, selecting two stocks from the Chinese stock market and two from the S&P 500 stock market for the comparison experiments. The stocks selected from the Chinese stock market are 600746 and 601101, and the stocks selected from the S&P 500 are GOOGL and IBM. The training period ranges from January 2012 to December 2018, and the testing period ranges from January 2019 to January 2021. The experimental results are shown in Table 11. It can be seen from Table 11 that, compared with the reward function without the SR, the trading strategies learned with the reward function that contains the SR perform better overall. Unlike most existing algorithmic trading methods based on deep reinforcement learning, which take the profit rate as the reward function, this study takes investment risk into account and uses the sum of the SR and the profit rate as the reward function, and the learned trading strategy obtains higher PR, AR, and SR and a better MDD.

Ablation Experiments.
In this section, to verify the effectiveness of multisource data fusion, we conduct an ablation experiment. Three groups of comparative experiments are carried out, all of which implement stock trading based on deep reinforcement learning. The first group analyzes the stock market through stock data and technical indicators; the second group analyzes the stock market through candlestick charts; and the third group analyzes the stock market through stock data, technical indicators, and candlestick charts. We select the trading data of GOOGL stock from January 2012 to January 2021 as the dataset for this section, in which January 2012 to December 2018 is the training data and January 2019 to January 2021 is the test data. The comparison results are shown in Figure 8 and Table 12. The experimental results show that, compared with the trading strategies learned in the first two groups, the trading strategy learned in the third group obtains higher PR, SR, and AR and lower MDD. This indicates that the analysis of multisource data yields a deeper feature representation of the stock market, which is more conducive to learning the optimal trading strategy.

Conclusion
Correct analysis of the stock market state is one of the challenges when implementing stock trading based on deep reinforcement learning. In this research, we analyze multisource data based on deep reinforcement learning to implement stock trading. Stock data, technical indicators, and candlestick charts reflect the changes in the stock market from different perspectives. We use different deep neural networks to extract the features of each data source and then fuse them; the fused features are more helpful for learning the optimal dynamic trading strategy.
It can be concluded from the experimental results that trading strategies learned with deep reinforcement learning can be dynamically adjusted according to stock market changes, which gives them an advantage. Compared with other trading strategies, our trading strategy performs better for stocks with different trends, and its average SR is the highest, which means that under the same risk, our trading strategy obtains more profits. However, textual information such as investor comments and news events also affects stock price fluctuations and cannot be ignored. It is important to obtain informative data from relevant texts for stock trading. In future research, we will consider different kinds of text information and train more stable trading strategies.
Data Availability
The experimental data in this article can be downloaded from Yahoo Finance (https://finance.yahoo.com/).