Hybrid Online and Offline Reinforcement Learning for Tibetan Jiu Chess

In this study, hybrid state-action-reward-state-action (SARSA(λ)) and Q-learning algorithms are applied to different stages of an upper confidence bound applied to trees (UCT) search for Tibetan Jiu chess. Q-learning is also used to update all the nodes on the search path when each game ends. A learning strategy is proposed that uses the SARSA(λ) and Q-learning algorithms together with a domain-knowledge-based feedback function for the layout and battle stages. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network can improve the game program's learning efficiency and understanding ability for Tibetan Jiu chess.


Introduction
Compared with Go, chess, Shogi, and other games in which programs based on deep neural networks and reinforcement learning have reached the level of top human players, research on Tibetan Jiu chess is still at an early stage. Current Jiu chess programs are weak and have not yet defeated beginner-level human players. Standard Jiu chess game data are also scarce: at present, only about 300 complete game records have been obtained. A Jiu chess game has two sequential stages, layout and battle, and the game can only enter the battle stage after all the blank intersections have been filled alternately in the layout stage, which leads to a lengthy search path for an upper confidence bound applied to trees (UCT) search. Considering the useless steps that the algorithm may take during exploration, the search path will be much longer than that of chess and Go. The motivation of this study is therefore to improve the efficiency of a Jiu chess program in self-play learning under these special rules and the limitations of laboratory hardware.
In this study, the state-action-reward-state-action (SARSA(λ)) and Q-learning algorithms are used in different stages of the UCT search for Jiu chess. In the selection, expansion, and simulation stages of the UCT search, the SARSA(λ) algorithm is used to update the quality of each action. In the backpropagation stage, the Q-learning algorithm is used to update the quality of each action. The quality of each action is denoted by its Q-value. The Q-learning algorithm is also applied globally after the end of each game. A probability distribution function based on a two-dimensional (2D) normal distribution matrix is used as feedback to SARSA(λ) and Q-learning for the layout stage. The feedback function for the temporal difference (TD) algorithm in the battle stage is constructed from important shapes [1,2]. The ResNet18 structure, chosen for its low error rate and good performance in image classification [3], is adapted for training the deep neural network. The contributions of this study are as follows:
(1) A hybrid deep reinforcement learning algorithm is applied to the Jiu chess game for the first time. The proposed strategy can prevent the many worthless or low-value nodes generated by deep learning from wasting computing resources. Using the SARSA(λ) and Q-learning algorithms in different stages of the UCT search helps the program learn good moves faster and improves the algorithm's learning efficiency; in addition, the final result can be fed back to all steps to improve the accuracy of the estimated return.
(2) A probability distribution function based on a 2D normal distribution matrix gives the reinforcement learning model the ability to prioritize the center of the board when learning layout tactics.
(3) The proposed improved ResNet18-based network, with a very small parameter scale, achieves a good learning effect, reduces the use of computing power, and significantly shortens playing time, and it can be trained in a short time on a common workstation or graphics processing unit (GPU) server. The experimental results support these claims.
Section 2 of the paper introduces the rules and difficulties of Tibetan Jiu chess, and Section 3 introduces related work on Tibetan Jiu chess. Section 4 discusses the hybrid SARSA(λ) and Q-learning algorithm applied to different stages of the UCT search and the improved ResNet18-based deep neural network structure. Section 5 details the experiments and data analysis. Finally, we outline the conclusions and propose future work.

Rules and Difficulties in Tibetan Jiu Chess
There are two kinds of Tibetan chess, mi mang and Jiu, which are widely played in Sichuan, Gansu, Tibet, Qinghai, Yunnan, and other Tibetan areas of China [4]. The process of playing can be divided into two successive stages: layout and fighting. In the layout stage, the central area is occupied first in order to construct a dominant formation (square, standard, etc.) for the fighting stage; hence, the quality of the layout has an important impact on the final victory. In the fighting stage, actions include moving chess pieces, jumping capture, and square capture. The key to victory is to be the first to construct a girdling formation, and a necessary condition for constructing this formation is to construct a square.

Tibetan Jiu Chess Rules.
The public Jiu board is 14 × 14 points. The Jiu game process is divided into two sequential stages: layout and battle. Jiu chess is a two-player game. The white side plays with white stones and the black side with black stones. Players alternate turns. Stones must be added or moved to empty points on the game board. A Jiu game starts with an empty board. The task in the layout stage is to place one stone on a point at each move until there are no empty points on the board. The goal in the battle stage is to move or capture stones until one player wins the game. The movements and capturing methods are similar to those of international checkers [1,2].

Layout Stage.
In the layout stage, white plays first, followed by black. The first and second moves must be placed on one of the points of the diagonal line of the central grid. Then, each side alternates in placing one stone on one point until no empty points remain on the board. After the board is filled, the game enters the battle stage.

Battle Stage.
In the battle stage, there are three actions to select in each turn.
(1) Move. Usually, a player moves a stone to the adjacent empty point above, below, to the left, or to the right (see Figure 1(a)). There is one exception: if a player has no more than 14 stones left, he can move a stone to any empty point on the board. This exception is shown in Figure 1(c). Since the black side has no more than 14 stones left, the diamond-marked black stone moves to the point indicated by the arrow. The move begins at the point of the diamond-marked black stone and ends at the point of the black stone indicated by the arrow. After this move, a square is constructed, so the black side can capture any one of its opponent's stones. The dim diamond-marked white stone is taken away by the black side after this move.
(2) Jumping Capture. When an opponent's stone is adjacent to the player's stone and there is an empty point directly behind it, the player can perform a jumping capture. This action can be continued until the player can capture no more stones or the player's turn ends. As shown in Figure 1(b), the diamond-marked white stone reaches the final empty point indicated by the arrows after four continuous jumps. This continuous jump begins at the point of the diamond-marked white stone and ends at the point of the white stone indicated by the arrow. The white side captures all four black stones on the path indicated by the arrows.
(3) Square Capture. In one turn, if a player constructs a square with four adjacent stones, he captures one of his opponent's stones located at any point on the board. This is called square capture.
There are three important shapes associated with square capture: gate, square, and dalian (or chain). The gate is the basic shape, shown in (A) in Figure 1(d). The square, one of the important shapes, is shown in (B) in Figure 1(d). The dalian, or chain, is the most important Jiu shape and is shown in (C) in Figure 1(d). This shape is critical to winning the game. It is composed of seven stones of the same color and one empty point. The stone adjacent to the empty point is called the vital stone, which is marked by the circle. By moving this vital stone to either of the two empty points, a player can construct a square and then capture one of his opponent's stones anywhere on the board in one turn. A player can capture his opponent's stones by repeatedly moving the vital stone in different turns.
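The three battle-stage actions described above can be sketched in code. The following is a minimal illustration only, assuming a 14 × 14 board stored as a list of lists with hypothetical cell codes (0 for empty, 1 for white, 2 for black) and assuming that a square is a 2 × 2 block of stones of one color. The function names are illustrative and are not taken from any existing Jiu program.

```python
SIZE = 14                       # the board has 14 x 14 points
EMPTY, WHITE, BLACK = 0, 1, 2   # hypothetical cell codes
DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def on_board(r, c):
    return 0 <= r < SIZE and 0 <= c < SIZE


def count_stones(board, color):
    return sum(row.count(color) for row in board)


def move_targets(board, r, c):
    """Empty points the stone at (r, c) may move to without capturing."""
    color = board[r][c]
    if count_stones(board, color) <= 14:
        # Exception: with no more than 14 stones, any empty point is allowed.
        return [(i, j) for i in range(SIZE) for j in range(SIZE)
                if board[i][j] == EMPTY]
    return [(r + dr, c + dc) for dr, dc in DIRS
            if on_board(r + dr, c + dc) and board[r + dr][c + dc] == EMPTY]


def jump_targets(board, r, c):
    """Single jumps as (landing point, captured point) pairs."""
    color = board[r][c]
    opponent = BLACK if color == WHITE else WHITE
    jumps = []
    for dr, dc in DIRS:
        mid = (r + dr, c + dc)
        land = (r + 2 * dr, c + 2 * dc)
        # If the landing point is on the board, so is the jumped point.
        if (on_board(*land) and board[mid[0]][mid[1]] == opponent
                and board[land[0]][land[1]] == EMPTY):
            jumps.append((land, mid))
    return jumps


def forms_square(board, r, c):
    """True if the stone at (r, c) belongs to a 2 x 2 block of its own color."""
    color = board[r][c]
    if color == EMPTY:
        return False
    for top in (r - 1, r):
        for left in (c - 1, c):
            cells = [(top, left), (top, left + 1),
                     (top + 1, left), (top + 1, left + 1)]
            if all(on_board(i, j) and board[i][j] == color for i, j in cells):
                return True
    return False
```

A continuous jumping capture could be built by calling `jump_targets` repeatedly from each landing point, and a square capture could be triggered whenever `forms_square` becomes true for the stone that has just moved.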

Winning the Game.
To win the game, a player must make sure that one of the following conditions is met: (1) He has at least one special shape such as a chain, while his opponent has no gates, before his opponent has fewer than 14 stones. (2) He takes away all of his opponent's stones when neither side has any special shapes or gates.

Difficulties in Tibetan Jiu Chess
(1) Deep and wide tree search space. The layout stage is closely connected with the battle stage. When the layout is finished, the battle begins. The search space of the game tree is huge, and the search path is far longer than that of general chess games.
(2) Special rules produce many more low-value states than high-value states. The Jiu chess layout first covers the central area and then gradually expands outward. The importance of a position gradually decreases from the center to the outside. It takes a long time for ordinary deep neural networks to learn to choose high-value states from among the far more numerous low-value states.
(3) Extremely limited research and expert knowledge. Jiu chess players are mostly distributed in Tibetan areas, which creates huge difficulties in collecting and processing Jiu chess data. At present, we have collected and analyzed only 300 complete game records, stored as SGF files.

Related Work
Deep reinforcement learning algorithms used in the Atari series of games, including the Deep Q Network (DQN) algorithm [5], the 51-atom agent (C51) algorithm [6], and those suitable for continuous fields with low search depth and narrow decision tree width [7][8][9][10][11][12][13][14][15], have achieved or exceeded the level of human experts. In the field of computer games, pattern recognition [6,16], reinforcement learning, deep learning, and deep reinforcement learning algorithms are used in Go, including the Monte Carlo algorithm and the upper confidence bound applied to trees (UCT) algorithm [17][18][19][20], the temporal difference algorithm [21], the deep learning model combined with the UCT search algorithm [22], and the DQN algorithm [23,24]. They have achieved quite good results in computer Go, which shows that deep reinforcement learning can adapt to the computer game environment. In addition, the application of deep reinforcement learning algorithms to Backgammon [25][26][27], Shogi [28][29][30], chess [31][32][33][34], Texas poker [35,36], Mahjong [37], and StarCraft II [38] has reached and even exceeded the level of top human players. Deep reinforcement learning is therefore expected to make good progress when it is applied to Jiu chess.
At present, Jiu chess is little known because it is mainly played in areas where Tibetan people live. Because of its special rules and its small number of players compared with Go or other popular games, research on Jiu chess is very limited. Jiu chess is a complete-information game, like Go and chess. However, unlike Go or chess, Jiu has two consecutive stages, which makes the game tree search path very long. Three Jiu chess programs are reported in the current literature [1,2,39]. All of them play weakly, although their playing levels differ. In [1], several important shapes of Jiu were first recognized, and about 300 playing records were collected and processed. Strategies based on chess shapes were designed to defend against the opponent. It was the first Jiu program based on expert knowledge, but it was very weak because of the limited human knowledge available. In [39], a Bayesian network model, designed to solve the problem of small sample data for Jiu playing records, evaluated the chess board rapidly. This method alleviated the extreme lack of expert knowledge to a certain extent, but the model was only useful in the layout stage. In [2], a temporal difference algorithm was used to realize the probability statistics and prediction of chess type, which was the first time reinforcement learning was applied to Jiu chess. It was verified that applying the SARSA(λ) or Q-learning algorithm alone, with different parameters, helped improve the efficiency of recognizing and evaluating the chess board. However, it did not contribute significantly to playing strength because it did not apply a deep neural network or UCT search. The Q-learning algorithm is used not only in games but also widely in other fields, such as wireless networks, where it reduces the cost of resources [40]. It is a very promising approach.
Therefore, in this study, we combine the SARSA(λ) and Q-learning algorithms in different stages of the UCT search for the first time, with the aim of achieving better performance at low hardware cost.

Learning Methodology
UCT search combined with deep learning and reinforcement learning achieves high performance for Go, chess, and other games. However, it requires considerable hardware resources to support exploration of the huge state space and training of the deep neural network. To take advantage of reinforcement learning and deep learning while reducing hardware requirements as much as possible, this study proposes an improved deep neural network model combined with UCT search and hybrid online and offline temporal difference algorithms for Tibetan Jiu chess, which is lightweight and efficient in self-play learning.
Hybrid TD algorithms are applied to different stages of the UCT search, which minimizes searching in low-value state spaces. Thereby, the computational power consumption and training time needed to explore the state space are reduced. According to the unique characteristics of Jiu chess, a TD reward function based on a 2D normal distribution matrix is proposed for the layout stage, enabling the Jiu chess reinforcement learning model to acquire awareness of layout priorities more quickly. The reward function for the battle stage is designed based on Jiu chess shapes. The improved ResNet18-based deep neural network is used for self-play and training.

UCT Search.
The reinforcement learning algorithm used in this study is based on a UCT [41,42] search algorithm (see Figure 2), which combines Q-learning, SARSA(λ), and expert domain knowledge. When the model performs self-play learning, it searches from the root node, uses SARSA(λ) combined with immediate feedback from domain knowledge in the first three stages (selection, expansion, and default policy simulation), and uses the Q-learning update in the backpropagation stage.
In the selection, expansion, and default policy simulation stages, the board situation is evaluated through expert knowledge, and this evaluated value is returned to each node of the path using the SARSA(λ) update (see the bold parts of the routes in the selection, expansion, and default policy simulation stages of Figure 2). In the backpropagation stage, the board situation is evaluated through the Q-learning update (see the bold part of the route in the backpropagation step of Figure 2).
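The division of labor between the two update rules can be illustrated with textbook tabular forms. The sketch below is an illustration under stated assumptions, not the exact update used in this study: it uses the standard SARSA(λ) update with accumulating eligibility traces for the path visited so far, and the standard one-step Q-learning update for backpropagation. The names `alpha`, `gamma`, `lam`, `actions_of`, and the dictionary-based Q table are hypothetical.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, trace, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One on-policy SARSA(lambda) step with accumulating traces (textbook form)."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    trace[(s, a)] += 1.0                                  # mark the current pair
    for key in list(trace):
        Q[key] += alpha * delta * trace[key]              # credit earlier pairs
        trace[key] *= gamma * lam                         # decay the trace
    return delta


def q_learning_backup(Q, path, actions_of, alpha=0.1, gamma=0.9):
    """Off-policy Q-learning backup over a path of (s, a, r, s_next), last step first."""
    for s, a, r, s_next in reversed(path):
        # Value of the best known action in the next state (0 if none exist).
        best_next = max((Q[(s_next, b)] for b in actions_of(s_next)), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

In this sketch `Q` and `trace` are `defaultdict(float)` tables keyed by (state, action) pairs. The on-policy step would be called while the search descends the tree, and the backup would be called once per search iteration or once per finished game.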

Hybrid SARSA(λ) and Q-Learning Algorithm.
In this study, the node of the UCT search tree is expressed by equation (1), where $S$ represents the state of the node in the search tree; $W$ represents the value of each action of the node; $N$ represents the number of times each action of the node has been selected, and for each action $a$ selected, $N(s, a) \leftarrow N(s, a) + 1$; $P$ represents the probability of selecting each action under the state (calculated by the neural network); $V$ is the winning rate estimated by the neural network; $Q$ is the winning rate estimated by the search tree; and $E$ is an auxiliary variable for updating $Q$. In the selection, expansion, and simulation stages, $E$ is updated according to the SARSA(λ) algorithm, as expressed in equation (2), where $R(s, a)$ is the reward value of the current situation, which can usually be calculated by evaluating the board situation, $c$ is the learning rate, $s_{i+1}$ is the next state, reached by taking action $a$, and $a_{i+1}$ is the action selected in state $s_{i+1}$. The nodes $(s_i, a_i)$ previously visited on the search path are updated in turn by equation (3), where $c_1$ is the learning rate of SARSA(λ), $i$ is the state index satisfying $0 \le i \le c$, and $c$ is the number of steps from the state of the root node to the state at which the search ends at the value-return stage in each turn; $i = 0$ indicates that the search is initialized by taking the current situation as the root node of the search tree, and $i = c$ indicates that the search has ended. In particular, at the end of the game, $c$ is simply the total number of steps from the beginning to the end of the game.
In the backpropagation stage of the UCT search, or at the end of each game, equation (4) is used to update the Q-values of all nodes on the path, where $c_2$ is the learning rate of Q-learning; if $s_{i+1}$ is a leaf node, $Q(s_{i+1}, a_{i+1})$ is calculated by equation (5). In the selection, expansion, and simulation stages of the UCT search algorithm, this hybrid algorithm selects the action $a$ with the maximum value from state $s$ through equation (6), by calculating the score of each alternative action:
$$a = \arg\max_{a} \left( Q(s, \cdot) + U(s, \cdot) \right). \quad (6)$$
For each alternative action $a$ at state $s$, the comprehensive evaluation function is given by equation (7), where $Q(s, a)$ is obtained by equation (3) and $c_{puct}$ is the parameter that balances exploitation and exploration in the UCT algorithm.
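The selection rule, choosing the action that maximizes Q + U, can be sketched as follows. Because the expression for U is not reproduced in this text, the sketch assumes the AlphaZero-style exploration term U(s, a) = c_puct · P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)). This form, the `Node` class, and its field layout are assumptions made for illustration only; the default of 0.1 for `c_puct` follows the value mentioned in the Results section.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Per-state statistics, indexed by action (hypothetical layout)."""
    P: dict                                  # prior probability of each action
    N: dict = field(default_factory=dict)    # visit count of each action
    Q: dict = field(default_factory=dict)    # search-tree value of each action


def select_action(node, c_puct=0.1):
    """Return the action maximizing Q(s, a) + U(s, a)."""
    total_visits = sum(node.N.get(a, 0) for a in node.P)

    def score(a):
        n = node.N.get(a, 0)
        # Assumed exploration bonus: large for likely but rarely tried actions.
        u = c_puct * node.P[a] * math.sqrt(total_visits) / (1 + n)
        return node.Q.get(a, 0.0) + u

    return max(node.P, key=score)
```

With a larger `c_puct` the bonus term dominates and rarely visited actions are tried more often; with `c_puct = 0` the rule reduces to greedy selection on Q.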

Feedback Function Based on Domain Knowledge.
The Q-value is updated by a combination of the SARSA(λ) and Q-learning algorithms [43,44]. SARSA(λ) is used to update the nodes on the game path in the selection, expansion, and simulation stages of the UCT search [18]. In the backpropagation stage, Q-learning is used to update the values of all nodes on the search path. The hybrid strategy enables the algorithm to learn to prune the huge state space effectively, improving computing speed and reducing the consumption of hardware resources. The layout stage of Jiu chess plays a key role in the game and its outcome [1]. The value of a board position decays from the center of the board to the outside, which is similar to the probability distribution of a 2D normal distribution matrix. Therefore, the importance of each intersection in the layout stage can be approximated by the 2D discrete normal distribution (see Figure 3).
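For reference, the density of a bivariate normal distribution with means $\mu_1, \mu_2$, standard deviations $\sigma_1, \sigma_2$, and correlation $\rho$ has the standard textbook form below. It is given here as the general definition and is assumed, not confirmed, to be the form intended for the position importance in the layout stage.

```latex
f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\left\{ -\frac{1}{2(1-\rho^2)} \left[
  \frac{(x-\mu_1)^2}{\sigma_1^2}
  - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}
  + \frac{(y-\mu_2)^2}{\sigma_2^2}
\right] \right\}

% Special case \rho = 0 and \sigma_1 = \sigma_2 = \sigma:
f(x, y) = \frac{1}{2\pi\sigma^2}
\exp\left\{ -\frac{(x-\mu_1)^2 + (y-\mu_2)^2}{2\sigma^2} \right\}
```

In the uncorrelated, equal-variance case the density depends only on the distance from the center $(\mu_1, \mu_2)$, which matches the description of a value that decays from the center of the board outward.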
In the battle stage, constructing the chess shapes discussed in [1,2] is very important in gaining a victory. The feedback function used by the SARSA(λ) and Q-learning algorithms is shown in equation (9), where $f(x, y)$ is the joint probability density of $X$ and $Y$ [39,45], $X$ and $Y$ represent chessboard coordinates, and $\mu$ represents the mean value of the range of the chessboard ($x, y \in [0, 13]$, $\sigma_1 = \sigma_2 = 2$, $\mu_1 = \mu_2 = 6.5$, $\rho = 0$). $V_{chain}$, $V_{gate}$, and $V_{square}$ are approximations summarized from the Jiu chess rules [1,2] and experience: $V_{chain}(s) = 7c_{chain}$, where $c_{chain}$ is the number of chains on the chessboard in the current state; $V_{gate}(s) = 3c_{gate}$, where $c_{gate}$ is the number of gates on the chessboard in the current state; and $V_{square}(s) = 4c_{square}$, where $c_{square}$ is the number of squares on the chessboard in the current state. Using the 2D normal distribution approximate matrix, based on expert knowledge, as the feedback function of the TD algorithm in the layout stage produces better search simulations and self-play performance under deep search depths and large branching factors, and lets the program learn a reasonable layout strategy better and faster [46].

Figure 2: UCT search (panels: Selection, Expansion, Default policy simulation, Backpropagation; labels: Repeat, Q + U).
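The layout-stage importance matrix and the shape-based values can be computed directly from the parameters stated above. The sketch below assumes the standard bivariate normal density with ρ = 0, evaluated at integer board coordinates. How these terms are combined in equation (9) is not shown here, so the functions are kept separate. The function names are hypothetical.

```python
import math

SIZE = 14              # x, y in [0, 13]
MU, SIGMA = 6.5, 2.0   # mu_1 = mu_2 = 6.5, sigma_1 = sigma_2 = 2, rho = 0


def normal_density(x, y, mu=MU, sigma=SIGMA):
    """Bivariate normal density with zero correlation and equal variances."""
    d2 = (x - mu) ** 2 + (y - mu) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def layout_matrix():
    """14 x 14 matrix of position importance for the layout stage."""
    return [[normal_density(x, y) for y in range(SIZE)] for x in range(SIZE)]


def shape_values(c_chain, c_gate, c_square):
    """V_chain = 7 c_chain, V_gate = 3 c_gate, V_square = 4 c_square."""
    return 7 * c_chain, 3 * c_gate, 4 * c_square
```

Because the mean 6.5 lies between grid indices 6 and 7, the four central points share the largest value and the corners receive the smallest, so a reward read from `layout_matrix()` favors stones placed near the center.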

Improved Deep Neural Network Based on ResNet18.
Strategy prediction and board situation evaluation are realized using a deep neural network, which is expressed as a function with parameter $\theta$, as shown in equation (11), where $s$ is the current state, $p$ is the output strategy prediction, and $v$ is the value of the board situation evaluation:
$$(p, v) = f_{\theta}(s). \quad (11)$$
The ResNet series of deep convolutional neural networks shows good robustness in image classification and target detection [3]. In this study, a ResNet18-based network with a modified fully connected layer is used. It can simultaneously output the strategy prediction $p$ and the board situation evaluation $v$ (see Figure 4). Unlike ResNet18 [3], a shared fully connected layer with 4096 neurons is used as the hidden layer. The original fully connected layer is replaced by a fully connected layer with a 196-dimensional output, which outputs the move strategy $p$, and a fully connected layer with a one-dimensional output, which outputs the estimate $v$ of the current board status. The neural network is used to predict and evaluate the state of nodes that have not previously been evaluated. The neural network outputs $p$ and $v$ to the UCT search tree node. The number of input feature maps is two; that is, there are two channels. The first channel is the position and color of the pieces in the chessboard state. At the end of the game, we replay the experience and output the dataset $\{(s, \pi, z)_i \mid 0 \le i \le c\}$ to adjust the parameters of the deep neural network. The loss function used is shown in equation (12). In the $i$-th state of the path, $\pi_i = N_i / \lVert N_i \rVert$ and $z_i = W_i / N_i$. The training of the neural network is carried out

Results
The experiment was performed with the following parameters: $c_{puct}$, step, and learning rate (see Table 1). We compare the efficiency of three methods of updating the Q-value: hybrid Q-learning with SARSA(λ), pure Q-learning, and pure SARSA(λ). The server and hardware configuration used in the experiment is shown in Table 1.

The Improvement of Learning and Understanding Ability.
An important measure of the learning ability of a Jiu chess agent is how quickly it learns to form squares in the layout stage. Therefore, we compare the hybrid update strategy with the pure Q-learning and SARSA(λ) algorithms in terms of the total number of squares formed over 200 steps of self-play training, as shown in Table 2. With the hybrid update strategy, the total number of squares in 200 steps is almost the sum of the numbers of squares obtained with pure Q-learning and with pure SARSA(λ), which indicates that the hybrid update strategy can effectively train a Jiu chess agent to understand the game of Jiu chess and learn it well.
We also counted the number of squares per 20 steps in the 200-step training process. Figure 5 shows that the number of squares occupied when using the hybrid update strategy is significantly greater than that when using the pure Q-learning or SARSA(λ) algorithms. Especially in the first 80 steps, the hybrid update strategy is significantly more effective than the pure Q-learning or SARSA(λ) algorithms. From step 140 to step 200, the hybrid update strategy drops slightly compared with the first 120 steps. This is because the parameter $c_{puct}$, set at 0.1, must be decreased in later training to avoid useless exploration.
This result indicates that the playing strength obtained with the hybrid update strategy is greater than that of pure Q-learning or SARSA(λ). In addition, this strategy produces stronger learning and understanding as well as better stability.

Feedback Function Experiment.
To improve the reinforcement learning efficiency of Jiu chess, a feedback function based on a 2D normal distribution matrix is added to the reinforcement learning game model as an auxiliary measure for updating values. To test the effect of this measure, we conducted 7 days and 110 instances of self-play training and obtained data with and without the 2D-normal-distribution-assisted feedback mechanism for comparison. Table 3 and Figure 6 show that the total number of squares occupied when using the 2D normal distribution assistant in the self-play training process is about three times that of programs that do not; the average value is close to ... The total number of squares occupied when using the 2D normal distribution assistant during training is significantly greater than that without the assistant. The 2D normal distribution auxiliary matrix is, in effect, an application of expert knowledge. The ability to learn the layout improved about threefold with the 2D normal distribution auxiliary matrix.
We also compare the learning efficiency of the program using the 2D normal distribution assistant from the perspective of the quality of specific layouts. As shown in Figure 7(a), in 10 days of training, the program without the aid of the 2D normal distribution has little knowledge of the layout. After using the 2D normal distribution matrix to assist in training, the layout process can learn specific chess patterns faster. The above experiments show that the feedback function based on the 2D normal distribution matrix can reduce the amount of computation of the Jiu chess program and improve its self-learning efficiency, because in the layout stage the importance of a chessboard position is close to the 2D normal distribution matrix. Because a Jiu chess game can have more than 1000 moves and the search iteration process is slow, the evaluation of the layout stage is optimized by using the 2D normal distribution matrix, which in effect models expert knowledge. The feedback value of the model indirectly reduces blind exploration of the state-action transition values of the chessboard, so it takes less time to get better results.

Conclusion
In this study, the deep reinforcement learning model, combined with expert knowledge, learns the rules of the game faster. The combination of the Q-learning and SARSA(λ) algorithms enables the neural network to learn more valuable chess strategies in a short time, which provides a reference for improving the learning efficiency of deep reinforcement learning models. The deep reinforcement learning model produced good layout results using the 2D normal distribution matrix that models expert knowledge, which also shows that deep reinforcement learning can shorten the learning time by combining expert knowledge reasonably. The experimental results verify that the ResNet18-based network enables the deep reinforcement learning model to be trained more effectively at low resource cost.
Inspired by Wu et al. [47,48], in future work we will consider improving Jiu chess playing strength not only from the perspective of state-of-the-art technologies but also from a holistic social good perspective. We will collect and process much more Jiu chess game data to establish a big data resource, which can provide great value for using a light reinforcement learning model to reduce the cost of computing resources. We will also explore the possibilities of combining multiagent theory [49] and mechanism [50] to reduce the network delay of the proposed reinforcement model, in addition to using the strategies proposed in [39,41].

Data Availability
Data are available via the e-mail xiaer_li@163.com.

Conflicts of Interest
The authors declare that they have no conflicts of interest.