An Algorithm of Reinforcement Learning for Maneuvering Parameter Self-Tuning Applying in Satellite Cluster

Satellite cluster is a type of artificial cluster, which is attracting wide attention at present. Although the traditional empirical parameter method (TEPM) has the potential to deal with the mission of satellite flocking, it is difficult to select the proper parameters. In order to improve the flight effect in the problem of satellite cluster, as well as to make the selection of flight parameters more reasonable, the traditional sensing zones are improved. A 3σ position error ellipsoid and an induction ellipsoid are applied for substituting the traditional repulsing zone and attracting zone, respectively. Besides, we propose an algorithm of reinforcement learning for parameter self-tuning (RLPST), which is based on the actor-critic framework, to automatically learn the suitable flight parameters. To obtain the parameters in the repulsing zone, orientating zone, and attracting zone of each member in the cluster, a three-channel learning framework is designed. The learning process makes the framework finally find the suitable parameters. Numerical experimental results have shown the superiorities compared to the traditional method, which include trajectory deviation and sensing rate or terminal matching rate, as well as the improvement of the flight paths under the learning framework.


Introduction
Satellite cluster is a new space architecture that has emerged in recent years after the satellite constellation and the satellite formation [1][2][3]. Unlike a satellite formation, which generally requires a designed geometric configuration and control towards a desired formation, a satellite cluster places more emphasis on the coordination and cooperation among the members of a system. Through the specific technology of satellite cluster, multiple spacecraft with the same or different functions can be connected into an organic whole by a self-organizing network. Thus, the system is expected to have the flexibility to realize one or more tasks [4,5].
At present, the research of satellite cluster can be generally divided into two types: one is a long-distance loose cluster, which mainly considers the drift and periodic configuration design under long-term condition, and the other is a short-distance cluster, which involves the techniques about cooperative control and collision avoidance.
For the long-distance cluster, the boundedness of the system is the main concern, which means that the influence of various perturbations on the boundary of the cluster is analysed [6]. From the perspective of fuel optimization, maneuver sequences are solved to maintain the loose flying of the cluster for a long time. Mazal and Gurfil developed a fuel-efficient cluster flight control algorithm for distance-keeping of a spacecraft cluster [7]. Based on the relative elements, Wang and Nakasuka applied nonlinear programming to the orbit design of fractionated spacecraft [8]. For a long-distance cluster, Dang et al. found analytic distance bounds for coplanar relative motion [9]. For the short-distance cluster, in contrast, techniques of multiagent control, such as graph theory and consistency algorithms, are being continuously studied. In the field of formation control modelled by second-order dynamics, Ren et al. studied consensus-based formation control in the absence of centralised leadership [10,11]. Considering fixed and switching topologies, Olfati-Saber and Murray solved the consensus problems for networks of dynamic agents [12]. For the leader-follower consensus problem, Song et al. proposed a pinning control algorithm based on graph theory to handle the condition without a strongly connected interaction graph [13]. Aiming at multivehicle systems with double-integrator dynamics, Qin et al. investigated consensus strategies to deal with a time-varying reference velocity [14]. However, complex information links and massive computation remain obstacles for the algorithms based on graph or consistency theory.
Inspired by biological clusters, humans have constructed a variety of artificial clusters, such as robot clusters and unmanned aerial vehicle clusters. The traditional empirical parameter method (TEPM) has proved effective in many multiagent fields. To describe the motion of flocking particles, Reynolds created a distributed behavioural model [15]. Vicsek et al. suggested a biologically motivated model in which the particles were driven with constant velocity [16]. To reveal the relationship between the individual and the group based on behavioural transitions, Couzin et al. presented a self-organizing model of group formation [17]. Based on these classic models, algorithms for realizing specific missions of the cluster were proposed [18], and some attraction/repulsion functions, which are used for achieving swarm aggregation, were designed [19]. With deeper discussion of the interactions between the particles in a swarm, rule-based or behaviour-based control gradually came to be applied in dynamic multiagent systems [20,21]. Specifically, in the field of aeronautics and aerospace, behaviour-based path-planning techniques for configuring the cluster structure [22], avoiding collision [23], and dealing with the needs of aviation swarm convoy [24] were studied, respectively. Due to the wide application prospect of behaviour control methods in biological clusters, more and more space agencies are expecting to introduce the concept of biological clusters into space systems. In this way, it will be possible to make the satellite cluster similar to a biological cluster and to complete complex space tasks with simple and cheap spacecraft. Nevertheless, the selection of the behaviour parameters, which is generally decided from the experience of scholars, has not yet been discussed in depth. In order to train the behaviour parameters, one can apply supervised learning if prior experience can be obtained. However, such experience is usually hard to obtain. Therefore, it is a promising direction to find a way to optimize the parameters without experience data.
In recent years, reinforcement learning has received more and more attention in the field of intelligent clusters. Through interacting with the environment, agents in the cluster can optimize their maneuvering strategies under the model-free condition [25,26]. Therefore, the traditional maneuvering strategies, which are based on man-made rules, can be improved. Instead of using fixed rules, Morihiro et al. proposed a self-organized framework based on reinforcement learning for flocking agents to conduct group missions [27]. In cooperative multirobot systems, Gu and Yang applied fuzzy policies with a policy gradient approach to solve leader-follower problems [28]. Under self-organizing principles derived from natural interactions, Chen et al. solved a swarm pursuit game through a multiagent reinforcement learning framework [29], and Hung and Givigi presented a Q-learning algorithm, applicable to a stochastic environment, for flocking fixed-wing unmanned aerial vehicles [30]. Therefore, reinforcement learning is a promising method for dealing with the cluster problem of multiple agents. As the most active branch of present reinforcement learning, the actor-critic method is suitable for the motion problem of continuous agent systems [31].
In this paper, we propose an innovative self-organizing algorithm of reinforcement learning for parameter self-tuning (RLPST), which is based on the actor-critic framework, for the path planning of a short-distance satellite cluster. The proposed learning algorithm is composed of three channels, which automatically adjust the flight parameters of an agent in the zone of repulsion (ZOR), zone of orientation (ZOO), and zone of attraction (ZOA), respectively. Through iterative learning, the maneuvering strategies of the cluster can be optimized. In this way, the disadvantages of traditional control strategies based on man-made experience parameters are overcome, and the proposed algorithm can be expected to apply to a variety of clustering tasks. The main contributions of this paper are as follows: (1) This is the first application of the actor-critic framework to the path planning of a satellite cluster. Aiming at two kinds of classical space cluster scenarios, we introduce reinforcement learning to deal with the relative distance between the members of a short-distance satellite cluster, which fills the gap of parameter self-tuning based on the heuristic method in the field of short-distance satellite clusters. Under the same three flocking principles of Reynolds, we have compared the results under the proposed RLPST with those under TEPM and shown the superiority of the proposed algorithm. (2) This is the first application of the actor-critic framework to optimize the flight parameters of ZOR, ZOO, and ZOA in a cluster through three channels, respectively, instead of directly optimizing the maneuver of the agent. In this way, full use is made of the known model information. Besides, the learning difficulty of applying reinforcement learning to the satellite cluster problem, which has large-scale continuous state and action spaces, can be reduced.
The structure of this paper is as follows: Section 2 presents the model of satellite cluster and related sensing areas; Section 3 discusses reinforcement learning algorithms for continuous systems; Section 4 applies reinforcement learning to the motion of satellite cluster; Section 5 simulates the proposed algorithm under two classic scenarios, respectively; and Section 6 discusses the simulation results. Finally, Section 7 draws the conclusions.

Problem Statement
It is supposed that the subject of the research is a satellite cluster with N members and a virtual host point, O, which is near the center of the cluster in space. Figure 1 shows the flocking satellites and the virtual host point. Figure 2 shows that each member in the cluster is an agent with the ability of induction, which interacts with the environment. By inputting the current states and the maneuvering strategy, the member is able to obtain the reward used to correct the strategy. Denoting θ as the true anomaly, a as the semimajor axis, and e as the eccentricity of the orbit of the virtual host satellite, the following equations are obtained [32]:

$$\dot{\theta} = \frac{n(1 + e\cos\theta)^2}{(1 - e^2)^{3/2}}, \qquad \ddot{\theta} = \frac{-2n^2 e\sin\theta\,(1 + e\cos\theta)^3}{(1 - e^2)^3}, \qquad (1)$$

where $n = \sqrt{\mu/a^3}$ is the mean angular velocity of the virtual host satellite and μ is the gravitational constant of the Earth. To facilitate the description of the problem, the following coordinate systems are established: (a) the Earth centered inertial frame ($Ox_iy_iz_i$); (b) the orbital coordinate system of the member satellite ($Ox_oy_oz_o$); and (c) the orbital coordinate system of the virtual host satellite ($Ox_ry_rz_r$). In $Ox_ry_rz_r$, the position vector of the ith member is denoted as $\mathbf{x}_i = [x_i, y_i, z_i]^T$. Therefore, according to the two-body motion rule of spacecraft and ignoring the second-order small quantities, the dynamic equation of the ith member in the cluster is given by equation (2) [32], where $r_f$ represents the distance from the mass point of the satellite to the origin and $f_j^i$ ($j = x, y, z$) represents the force in the corresponding channel.
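Equation (1) can be evaluated numerically as a quick check. The following is a minimal sketch, assuming the standard two-body expressions for the rate and acceleration of the true anomaly; the function name, the SI units, and the value of the gravitational parameter are illustrative assumptions, not taken from the paper.

```python
import math

# Assumed value: standard gravitational parameter of the Earth in m^3/s^2.
MU_EARTH = 3.986004418e14


def true_anomaly_rates(a, e, theta, mu=MU_EARTH):
    """Return (theta_dot, theta_ddot) of the virtual host orbit.

    a     : semimajor axis in metres
    e     : eccentricity (0 <= e < 1)
    theta : true anomaly in radians
    """
    n = math.sqrt(mu / a**3)            # mean angular velocity
    k = 1.0 + e * math.cos(theta)
    theta_dot = n * k**2 / (1.0 - e**2) ** 1.5
    theta_ddot = -2.0 * n**2 * e * math.sin(theta) * k**3 / (1.0 - e**2) ** 3
    return theta_dot, theta_ddot
```

For a circular orbit (e = 0) the sketch returns the mean angular velocity and a zero angular acceleration, which is the special case used later for the circular reference orbit.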

Position Error Model of a Cluster Member.
The position of a satellite is objective; however, it cannot be known exactly. In different types of missions, there will always be measurement, navigation, control, and other deviations, as well as the influence of perturbations. These factors cause the real orbit to diverge from the nominal orbit of the spacecraft, resulting in trajectory deviation. Taking the Gaussian distribution as an example, the covariance matrix of the spacecraft state distribution is denoted as P; the real state vector $x$ then lies in a hyperellipsoid centered on the nominal state vector $\bar{x}$ [33,34]. The surface of the hyperellipsoid is expressed by equation (3). The associated probability density function of the relative state error distribution is the Gaussian density in n dimensions, where n represents the dimension of the state space, and l in equation (3) represents the Mahalanobis distance constant. Specifically, when $l = 3$, equation (3) represents the 3σ error ellipsoid, which is six-dimensional. The matrix A is introduced as the quadratic matrix of the ellipsoid; it is a real symmetric positive definite matrix. Denoting R as the position component of $x$, the position error ellipsoid of the spacecraft can be expressed as follows:
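The membership test implied by the 3σ error ellipsoid can be sketched as follows. This is a minimal illustration, assuming the usual Gaussian form in which a state lies inside the ellipsoid when its squared Mahalanobis distance from the nominal state does not exceed $l^2$; the function name and the example covariance are hypothetical.

```python
import numpy as np


def inside_error_ellipsoid(x, x_nominal, P, l=3.0):
    """Return True if x lies within the l-sigma error ellipsoid.

    Assumes the ellipsoid (x - x_nominal)^T P^{-1} (x - x_nominal) <= l^2,
    where P is the covariance matrix of the state error.
    """
    d = np.asarray(x, dtype=float) - np.asarray(x_nominal, dtype=float)
    # Squared Mahalanobis distance, computed without forming P^{-1} explicitly.
    m2 = d @ np.linalg.solve(np.asarray(P, dtype=float), d)
    return bool(m2 <= l**2)
```

With a hypothetical position covariance `P = np.diag([100.0**2, 50.0**2, 20.0**2])` (in m²), a point displaced 250 m along the first axis is inside the 3σ ellipsoid, whereas a point displaced 70 m along the third axis is outside.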

Sensing Area Division of a Cluster Member.
Traditionally, for dealing with problems of cluster, the sensing areas, which are generally known as zone of repulsion (ZOR), zone of orientation (ZOO), and zone of attraction (ZOA), are defined from the inside to the outside as the spherical regions [17]. Figure 3(a) shows these traditional ZOR, ZOO, and ZOA. Under such uniform sensing areas, it will be certainly convenient for describing the problem and designing the cluster control strategy. However, considering the location deviations of spacecraft cluster members and the capability of the attaching sensors, it is necessary to improve the way for dividing sensing areas.
Here, we redefine the sensing areas of a satellite member in the cluster, as illustrated in Figure 3(b). According to Figures 3(a) and 3(b), there exist three sensing areas for each member satellite in the cluster. Compared with the traditional sensing areas, the redefined ones replace the ZOR and the ZOA with the 3σ error ellipsoid and the induction ellipsoid, respectively. The details of the redefined sensing areas are given below.

ZOR.
The 3σ error ellipsoid area of each member is defined as the zone of repulsion. It is assumed that the position deviation obeys the Gaussian distribution. As a result, each member in the cluster has its own position error ellipsoid. If the ellipsoid of a specific member intersects with that of another individual, a collision between the two members may occur. Therefore, such an ellipsoid is the ZOR, in which a repulsive force is generated to avoid the probable collision. The members in the ZOR exert a repulsive force on the center member. In this way, the individuals in the cluster are prevented from getting too close to each other.

ZOO.
The orientation area is defined as a standard sphere with a specific radius. For a specific member, its ZOO is an ideal zone: neighbours located in this area keep a suitable distance from the member. This member receives the orientation force from the neighbours in its ZOO, which makes the member gradually align its velocity with that of its neighbours. In this way, the flight process will be smooth.

ZOA.
The attracting area is defined as the induction ellipsoid. Traditionally, the attracting area is uniform. However, in the spacecraft cluster problem, the sensing ability may be strong or weak in different directions due to the capability of the sensing elements. Therefore, we use an ellipsoid model to approximately describe the induction area of a member in the cluster.

Location Criterion of Sensing Zones.
For the members located in the sensing areas of the ith satellite, it is important to determine which region these members belong to. In traditional ways, the belonging sensing area is usually determined by the location of the mass center of the member. Different from the traditional method, this paper applies the idea of the Box method [35], which takes the location relation of error ellipsoids as the criterion to judge whether two members in the cluster are repulsive or not. If the error ellipsoids intersect with each other, the repulsion force will be generated between the two members.
In order to detect the position relation between the two error ellipsoids, an algebraic criterion is needed. During this process, an affine transformation has to be carried out on the two ellipsoids. The process of affine transformation is shown in Figure 4.
Suppose that the $S_1$ frame, which is centered at the nominal mass point of the ith member, is parallel to $Ox_ry_rz_r$. Then, the position error ellipsoid of the ith member is expressed by equation (7). Denoting $X = [x, y, z, 1]^T$, the error ellipsoids of the ith member and the jth member can be written in homogeneous quadratic form and transformed step by step until the frame comes to $S_4$, where the condition $a \le b \le c$ is satisfied. Therefore, we can obtain the standard discriminants as equations (9) and (10). The characteristic polynomial is defined in equation (11), and the relevant characteristic equation of this polynomial is equation (12). According to the location judging algebraic criterion [36], the position relation between the ellipsoid of the ith member and that of the jth member can be detected.
If the characteristic equation has two different real roots, the two ellipsoids are separated. Thus, it will be easy to judge whether the jth member is located in the ZOO or ZOA of the ith member. Otherwise, the two ellipsoids are not separated, which means that the jth member is located in the ZOR of the ith member, and the repulsive force is generated.

Analysis of the Force Acting in Sensing Areas.
For the ith member, once the sensing sets $X_r$, $X_o$, and $X_a$, which relate to the ZOR, ZOO, and ZOA, respectively, are obtained, we can calculate the force directions of the ith member. Note that each member in a cluster has its own ZOR, ZOO, and ZOA. Therefore, we only discuss the condition of an arbitrary member in a cluster here. When this arbitrary member is mentioned, it is called the "center member" to distinguish it from the neighbours in its three sensing areas.

Force Direction in ZOR.
The left part of Figure 5 shows that the individual m is closer to the center member than the individual n in the $S_1$ frame. According to the traditional repulsive rule, the center member will be repulsed more strongly by the individual m than by the individual n. However, for the ellipsoidal ZOR of the center member, the intensity of the repulsive force should be related to how close the individual m or n is to the boundary of the ZOR. Because the traditional repulsing area is defined as a standard sphere, its boundary is uniform in all directions, which makes the intensity of the repulsive force convenient to calculate. For the ellipsoidal area, a special treatment is therefore needed to deal with the nonuniform boundary, which is shown in the right part of Figure 5. After the affine transformation, the distance from the center member to the individual m and that to the individual n are approximately equal, because both individuals are originally near the boundary of the repulsing area. Therefore, through the matrix $T_{tr}$, which represents the transformation matrix from the $S_1$ frame to the $S_2$ frame, the direction of the repulsive force of the ith member in the cluster is expressed by equation (13), where $r_{ij}$ is the relative position vector in the $S_2$ frame and $d_r$ is the boundary of the repulsing area, which is expressed by equation (14). Note that the constant coefficient 1.2 is used to avoid the ambiguity caused by a zero denominator.

Force Direction in ZOO.
Due to the specific definition of the ZOO, which is a standard sphere with radius $d_m$, the direction of the orientation force is expressed by equation (15), where $v_i$ and $v_j$ represent the velocity of the ith member and the jth member in $Ox_ry_rz_r$, respectively.

Force Direction in ZOA.
Compared with the traditional ZOA, the proposed ZOA is set as the induction area of the center member, which means that the sensing ability is not uniform for the center member. In addition, we need to guarantee that the intensity of the attractive force is zero at the boundary of the ZOO. As a result, the corresponding boundary of the attracting area of each member in the cluster needs to be calculated. The direction of the attractive force is expressed by equation (16), where $d_a$ is the boundary of the attracting area. In Figure 6, the individual m and the individual n are located at the boundary of the ZOA. The center member is expected to judge which is the farthest neighbour in its ZOA. Then, the relative position vector from the center to that neighbour is used to generate $d_a$.
When the virtual host satellite is in a circular or near-circular orbit, the conditions $\dot{\theta} = n$ and $\ddot{\theta} = 0$ are satisfied. Denoting $X = [x, y, z, \dot{x}, \dot{y}, \dot{z}]^T$, the dynamic model expressed in equation (2) can be rewritten in state-space form. Denote $c_{ri}$, $c_{oi}$, and $c_{ai} \in [-c_{ai}^{\min}, c_{ai}^{\max}]$ as the flight parameters related to ZOR, ZOO, and ZOA, respectively; the motion controller of the ith member in the spacecraft cluster is then designed as equation (19). Note that the term involving $-BX$ in equation (19) is added to make the agent move stably during the control gap.
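For a circular reference orbit, the relative dynamics reduce to a linear time-invariant system. The following is a minimal sketch, assuming the familiar Clohessy-Wiltshire form with x radial, y along-track, and z cross-track; the axis convention and the exact matrices of the paper are not reproduced here, so both the convention and the function name are assumptions.

```python
def cw_derivative(state, force, n):
    """Time derivative of X = [x, y, z, vx, vy, vz] near a circular orbit.

    state : relative position (m) and velocity (m/s) in the host orbital frame
    force : control acceleration [fx, fy, fz] in m/s^2
    n     : mean angular velocity of the circular reference orbit in rad/s
    Assumed axes: x radial, y along-track, z cross-track.
    """
    x, y, z, vx, vy, vz = state
    fx, fy, fz = force
    ax = 3.0 * n**2 * x + 2.0 * n * vy + fx
    ay = -2.0 * n * vx + fy
    az = -(n**2) * z + fz
    return [vx, vy, vz, ax, ay, az]
```

Under these assumptions, a member that sits at rest at a pure along-track offset has a zero derivative, which is the well-known stationary solution of the unforced equations.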
Based on equations (13) to (16), the force directions of the ith member, $p_r^i$, $p_o^i$, and $p_a^i$, can be calculated. Therefore, for the controller shown in equation (19), the key is to find the corresponding parameters $c_{ri}$, $c_{oi}$, and $c_{ai}$. The effect of cluster flight will be largely determined by these parameters.
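To make the role of the three parameters concrete, the sketch below computes force directions in the classic three-zone manner (repulsion away from ZOR neighbours, alignment with ZOO neighbours, attraction towards ZOA neighbours) and blends them with three gains. It is an illustration only: it does not include the boundary-dependent terms $d_r$ and $d_a$ of equations (13) to (16), nor the stabilizing term of equation (19), and all function names are hypothetical.

```python
import math


def unit(v):
    """Normalize a 3-vector; return the zero vector if its norm is zero."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0.0 else [0.0] * len(v)


def summed_direction(vectors):
    """Unit vector along the sum of the normalized input vectors."""
    total = [0.0, 0.0, 0.0]
    for v in vectors:
        total = [t + c for t, c in zip(total, unit(v))]
    return unit(total)


def classic_directions(rel_pos_zor, vel_zoo, rel_pos_zoa):
    """Classic (not the paper's modified) directions for one center member."""
    p_r = [-c for c in summed_direction(rel_pos_zor)]  # away from ZOR neighbours
    p_o = summed_direction(vel_zoo)                    # align with ZOO neighbours
    p_a = summed_direction(rel_pos_zoa)                # towards ZOA neighbours
    return p_r, p_o, p_a


def blended_command(p_r, p_o, p_a, c_r, c_o, c_a):
    """Assumed plain weighted sum of the three directions."""
    return [c_r * a + c_o * b + c_a * c for a, b, c in zip(p_r, p_o, p_a)]
```

In TEPM the three gains would be fixed by hand, whereas RLPST is meant to supply them from the three learning channels at every step.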
Traditionally, the parameters are selected according to experimental results or expert experience, which is known as TEPM. Nevertheless, considering the intelligent development of spacecraft and the rising labour cost, the satellite cluster will need a certain automatic capability to adjust the parameters in the future. To achieve this goal, an innovative algorithm of RLPST is proposed, which applies the reinforcement learning framework and is expected to make the flight parameters self-tuned over multiple learning times.

The Fuzzy Inference System.
In order to apply reinforcement learning to the space cluster, which is a continuous dynamic system, it is reasonable to find a way that not only avoids the curse of dimensionality but also has a clear physical meaning. Therefore, a zero-order Takagi-Sugeno (T-S) fuzzy system is employed as the approximator. It is assumed that the fuzzy system has L rules and n input variables. The fuzzy inference rule is

Rule $l$: IF $s_1$ is $F_1^l$, ..., and $s_n$ is $F_n^l$, THEN $z_l = \phi_l$,

where $s_i$ ($i = 1, \ldots, n$) represents the ith input of the fuzzy system, $F_i^l$ represents the fuzzy set of the ith input variable, $z_l$ represents the output of the lth rule, and $\phi_l$ represents the output parameter.
With the h membership functions of each $s_i$, the output of the fuzzy system is expressed in terms of the output parameters $\phi_l$ and the functions $\Psi_l(s)$, where $s = [s_1, \ldots, s_n]^T$ is the state vector and $\mu_{F_i^l}$ is the membership function of $s_i$ under the lth rule. In addition, the expression of $\Psi_l(s)$ is as follows:
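A zero-order T-S system of this kind can be sketched in a few lines. The example assumes the usual normalized form, in which the output is the sum of the rule outputs $\phi_l$ weighted by normalized firing strengths $\Psi_l(s)$, and it further assumes a single input on [0, 1] with h evenly spaced triangular membership functions; the membership shape and the function names are illustrative choices, not taken from the paper.

```python
def firing_strengths(s, h):
    """Normalized firing strengths Psi_l(s) for a single input s in [0, 1].

    Assumes h >= 2 triangular membership functions centered at l / (h - 1).
    """
    width = 1.0 / (h - 1)
    mu = [max(0.0, 1.0 - abs(s - l * width) / width) for l in range(h)]
    total = sum(mu)
    return [m / total for m in mu]


def ts_output(s, phi):
    """Zero-order T-S output: sum over rules of Psi_l(s) * phi_l."""
    psi = firing_strengths(s, len(phi))
    return sum(p * w for p, w in zip(psi, phi))
```

With three rules and consequents `[0.0, 1.0, 2.0]`, the input 0.25 activates the first two rules equally, so the output is the average of their consequents.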

The Actor-Critic Learning Algorithm.
Reinforcement learning is a type of algorithm that learns by interacting with the environment. The agent optimizes its behaviour through the rewards obtained from the environment in order to maximize the total benefit. In the Markov process, the value function of reinforcement learning is the expected discounted sum of rewards, where c ∈ [0, 1) is the discount factor and $R_i$ is the immediate reward obtained from the environment. In order to solve the Markov decision problem in a continuous action space, a type of reinforcement learning algorithm called adaptive heuristic critic (AHC) has been widely studied and applied. In the AHC algorithm, the value function and the policy function are approximated separately, and the resulting learning structure is called the actor-critic framework. In such a learning algorithm, the critic part is used to estimate the value function, while the actor part is used to generate the action. To generalize over the state space and the action space, the critic part and the actor part are both composed of T-S systems. To apply the temporal difference (TD) learning method, two critic parts are needed for estimating the current value function $V_t(s_t)$ and the next value function $V_t(s_{t+1})$. The temporal difference is expressed by equation (24). E denotes the variance of the difference signal, and the adaptive update rule of the parameters in the critic is expressed by equation (26), where α is the learning rate of the critic. Furthermore, the gradient of E with respect to the critic parameters is obtained according to the gradient descent method, and summing up gives equation (28). Combining with equation (22), equation (28) can be solved.
The adaptive update rule for the critic part is shown above. As for the actor part, the adaptive update rule of the output parameter, $\phi^A$, is expressed by equation (30), where β is the learning rate of the actor. The partial derivative of $u_t$ is expressed as follows:
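One learning step of such an actor-critic scheme can be sketched as below. The temporal difference follows the standard TD(0) form. The critic step is a semi-gradient descent on half the squared temporal difference, which treats the next value as a constant. The actor step, which simply reinforces the active rules in proportion to the temporal difference, is one common AHC-style choice and is an assumption here, not necessarily the update rule of the paper. All function names are hypothetical.

```python
def td_error(reward, v_now, v_next, gamma):
    """Temporal difference: R_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_now


def update_critic(phi_c, psi_now, delta, alpha):
    """Semi-gradient critic step: phi_l <- phi_l + alpha * delta * Psi_l(s_t)."""
    return [p + alpha * delta * w for p, w in zip(phi_c, psi_now)]


def update_actor(phi_a, psi_now, delta, beta):
    """Assumed actor step: phi_l <- phi_l + beta * delta * Psi_l(s_t)."""
    return [p + beta * delta * w for p, w in zip(phi_a, psi_now)]
```

Here `gamma` plays the role of the discount factor (written c in the text), and `psi_now` holds the normalized firing strengths of the fuzzy rules at the current state.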

Algorithm of Reinforcement Learning for Parameter Self-Tuning in Satellite Cluster
The learning framework proposed in this paper is single-looped and can be divided into three channels for the repulsing area (r), orientating area (o), and attracting area (a), respectively. Each fuzzy system has a single input, which is defined as the proportion of the total number of sensing members that lie in the corresponding sensing area.
The ith member is taken as an example, and its inputs for the fuzzy systems are expressed as $s_r = n_r/N$, $s_o = n_o/N$, and $s_a = n_a/N$,

Mathematical Problems in Engineering
where $n_r$, $n_o$, and $n_a$ represent the number of sensing neighbours in the repulsing area, orientating area, and attracting area, respectively. Besides, the symbol N represents the total number of all sensing neighbours of the ith agent. Therefore, $s_r$, $s_o$, and $s_a$ represent the proportions of sensing members in the corresponding areas. The critic part and the actor part are composed of fuzzy systems. In the inference rule, $\{\cdot\}|_q$ represents the variable in the q channel ($q \in \{a, o, r\}$). As mentioned in Section 3.1, it is supposed that the input has h membership functions and the fuzzy system has L rules in total. Because the input is single, $L = h$. Therefore, according to the membership degree of the input, the output of the critic can be calculated through equation (35). The fuzzy inference process of the actor part is similar to that of the critic part; the difference lies in the consequent parameter attached to each membership degree, and the output of the actor is given by equation (36). The three sensing areas have different proportions of sensing members; therefore, a reward function, $R_t|_q$, is designed for each channel. From the structure of the reward function, if the proportion of sensing members in the repulsing area, $s_r$, is positive, the system will receive a positive reward, which will stimulate the system to enhance the coefficient of force in the repulsing sensing area. When $s_r$ is not positive, the rewards of the other sensing areas are decided according to the state $s_a$: when $s_a$ is larger than ε, which is a positive separator, the reward for the ZOA is positive; otherwise, the reward for the ZOO is positive. Note that the rewards of the three areas cannot be calculated simultaneously; otherwise, the parameters may be enhanced simultaneously, which may invalidate the learning. Therefore, the calculation of the reward needs to be prioritized according to the specific task. The whole diagram of the learning logic is illustrated in Figure 7.
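The prioritized reward logic described above can be sketched as follows. The sketch assumes that N is the sum of the neighbours counted in the three areas and that the rewarded channel receives a fixed positive value while the other two receive zero; the reward magnitudes and the function names are hypothetical, since only the sign structure is described in the text.

```python
def sensing_proportions(n_r, n_o, n_a):
    """Return (s_r, s_o, s_a), the share of sensed neighbours in each area."""
    total = n_r + n_o + n_a  # assumed definition of N
    if total == 0:
        return 0.0, 0.0, 0.0
    return n_r / total, n_o / total, n_a / total


def rewarded_channel(s_r, s_a, eps):
    """Pick the single channel that receives a positive reward.

    Repulsion has priority; otherwise attraction if s_a > eps, else orientation.
    """
    if s_r > 0.0:
        return "r"
    return "a" if s_a > eps else "o"


def channel_rewards(s_r, s_a, eps, magnitude=1.0):
    """Assumed reward values: `magnitude` for the chosen channel, 0 elsewhere."""
    chosen = rewarded_channel(s_r, s_a, eps)
    return {q: (magnitude if q == chosen else 0.0) for q in ("r", "o", "a")}
```

Because only one channel is rewarded at a time, the three coefficients are not all pushed upward in the same step, which matches the prioritization requirement stated above.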
In Figure 7, there exist two critic parts and one actor part in each channel. The two critic parts are applied to estimate the value at the current time, V(t), and the value at the next time, V(t + 1). According to $s_a(t)$, $s_o(t)$, and $s_r(t)$, the parameters $c_{ai}$, $c_{oi}$, and $c_{ri}$ are calculated. Substituting these parameters into the motion controller designed in equation (19), $s_r(t + 1)$, $s_o(t + 1)$, and $s_a(t + 1)$ are obtained. Besides, the immediate reward, $R_t$, is also acquired from the environment. According to $R_t$, V(t), and V(t + 1), the temporal difference, $\Delta_t$, is calculated. The output parameters of the critic part and the actor part can then be adjusted according to $\Delta_t$.
To sum up, the learning algorithm of RLPST is shown as Algorithm 1.

Simulation
A cluster with four satellite members, which are numbered from No. 1 to No. 4, is selected as the numerical experimental object. It is supposed that the reference orbit is a circular orbit with a radius of $10^4$ km. The symbols $x_{10}$, $x_{20}$, $x_{30}$, and $x_{40}$ denote the initial states of the satellites in the cluster from No. 1 to No. 4. The first three items of these vectors represent the relative position in m, while the last three items represent the relative velocity in m/s. In this numerical experiment, for each cluster member, the quadratic matrix of the position error ellipsoid is set as A and that of the induction ellipsoid is set as M. The values of A and M are set based on the references [33,34,37]. Two classic scenarios, namely the scenario of adding members into the cluster and the scenario of members following a flight path, are considered, respectively. It is noticed that when we talk about adding members,  Table 1.
To represent the smoothness of the flight paths, the signal σ is defined to express the degree of deviation of the whole flight path from the center baseline, where $\rho_i$ and $\rho_{ref}$ represent the position vector of the ith member and the corresponding position vector on the center baseline, respectively.
For the mission of adding members into an original cluster, it is appropriate to judge the terminal matching degree of the newly added members. Therefore, the signal $\eta_m$, which is called the terminal matching rate, is defined to represent the degree to which the terminal status lies in the ZOO, where $\mathrm{Num}_i^m$ represents the number of neighbours in the ZOO of the ith member.
In order to express the effectiveness of improving the flight paths under the proposed RLPST, the signal Cost is defined by equation (41) to represent the quality of the distances among the newly added members, where $m_r$ and $m_a$ represent the corresponding coefficients of the sets $X_r$ and $X_a$, respectively. Denoting k as the empirical parameter that substitutes for the values of $c_r$, $c_o$, and $c_a$ in TEPM, the experimental results for a simulation time T of 1000 s are shown below.
Figures 8(a) and 8(b) illustrate the trajectories of TEPM with $k = 3$ and $k = 6.5$, respectively. In Figure 8(a), the terminal positions of the four members are relatively far away from each other, which does not satisfy the requirement of the mission. This is because the selected empirical parameter is too small. In contrast, Figure 8(b) shows nonsmooth trajectories for satellites No. 3 and No. 4 because the empirical parameter is too large. Although the requirement on the terminal positions of the four satellites is met, the nonsmooth flight paths waste fuel unnecessarily. Selecting the empirical parameter under the TEPM is therefore difficult: whether the parameter is too large or too small, the flight cannot meet the goal of the mission.

(1) for all cluster members do
(2)   for all channels do
(3)     Initialize the membership functions
(4)     Initialize $V_q = 0$, $\phi^{C}_{l|q} = 0$, $\phi^{A}_{l|q} = 0$, for $l = 1, \ldots, L$
(5)   end for
(6) end for
(7) for each episode do
(8)   for all cluster members do
(9)     Initialize states of the cluster member
(10)    for all time steps do
(11)      Calculate the 3σ position error ellipsoid according to equation (7)
(12)      Maintain all sensing neighbours of the cluster member
(13)      Obtain the sensing sets $X_r$, $X_o$, and $X_a$ based on the results of equations (11) and (12)
(14)      Calculate the force directions $p^{r}_{i}$, $p^{o}_{i}$, and $p^{a}_{i}$ according to equations (13), (15), and (16), respectively
(15)      for all channels do
(16)        Calculate the output of the actor $c_q$ through equation (36)
(17)        Calculate the output of the critic $V_q(t)$ from equation (35)
(18)        Interact with the environment
(19)        Obtain the reward $R_t$ and the output of the critic $V_q(t + 1)$
(20)        Calculate the time difference $\Delta_t$ from equation (24)
(21)        Update $\phi^{C}_{l|q}$ and $\phi^{A}_{l|q}$ according to equations (26) and (30), respectively
(22)      end for
(23)    end for
(24)  end for
(25) end for
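Steps (16) to (21) of the learning procedure follow the usual actor-critic pattern, which can be sketched as below. This is a minimal illustration, not the authors' implementation: equations (24), (26), (30), (35), and (36) are not reproduced, so the sketch assumes a standard temporal-difference error and plain gradient-style updates on weights that are linear in the rule firing strengths. The class `Channel`, the function `td_update`, and the `firing` vectors are hypothetical names.

```python
from typing import List


class Channel:
    """One learning channel (repulsing, orientating, or attracting).

    critic_w and actor_w stand in for the consequent parameters
    phi^C and phi^A of the fuzzy critic and actor (an assumption).
    """

    def __init__(self, n_rules: int) -> None:
        # Step (4): all parameters start at zero.
        self.critic_w: List[float] = [0.0] * n_rules
        self.actor_w: List[float] = [0.0] * n_rules

    def value(self, firing: List[float]) -> float:
        # Critic output V_q as a firing-strength-weighted sum (assumed form).
        return sum(w * f for w, f in zip(self.critic_w, firing))

    def action(self, firing: List[float]) -> float:
        # Actor output c_q as a firing-strength-weighted sum (assumed form).
        return sum(w * f for w, f in zip(self.actor_w, firing))


def td_update(ch: Channel, firing: List[float], firing_next: List[float],
              reward: float, gamma: float, alpha: float, beta: float) -> float:
    """Apply one assumed TD update to a channel and return the TD error."""
    # Standard temporal-difference error: R_t + gamma * V(t+1) - V(t).
    delta = reward + gamma * ch.value(firing_next) - ch.value(firing)
    for l, f in enumerate(firing):
        ch.critic_w[l] += alpha * delta * f  # critic learning rate alpha
        ch.actor_w[l] += beta * delta * f    # actor learning rate beta
    return delta
```

In a three-channel setup, one `Channel` instance per sensing zone would be updated independently at every time step, each with its own reward.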
To compare with the results of TEPM, we set the discount factor $c$ to 0.8, the reward separator $\varepsilon$ to 0.5, the learning rate of the critic $\alpha$ to $10^{-7}$, the learning rate of the actor $\beta$ to $10^{-8}$, and the coefficients in equation (41) to $m_r = 1000$ and $m_a = 10^{-3}$. The results under the proposed RLPST are shown in Figures 9 and 10, which illustrate the members adding after different numbers of learning times. From Figure 9(b), it can be seen that, as the number of learning times increases, the flight direction settles, and satellites No. 3 and No. 4 have basically adopted the same flight direction as the original cluster. However, they have not yet integrated with each other, and they are still continuously attracted by the original cluster. Figure 10 shows the final training results after 55 learning times. As the flight progresses, satellites No. 3 and No. 4 finally merge into the original cluster to form a new cluster, and the mission of adding members is completed.

Figure 11 shows the trajectory deviations under the TEPM and the proposed RLPST, respectively. The deviation under the TEPM is relatively large when the empirical parameter is set too large or too small. This result is reasonable: when the empirical parameter is too small, the flight cannot meet the terminal requirement of the mission, and when the empirical parameter is too large, the flight paths are nonsmooth, which also causes a large trajectory deviation. When the empirical parameter is set to an acceptable value, the trajectory deviation reaches the low point in the figure. Even so, the deviation under the proposed RLPST is clearly lower than that under the TEPM, which means that the RLPST produces a smoother flight path, which helps save fuel and avoid complex maneuvering strategies.
The terminal matching rate represents the final state of the cluster. Its ideal value is one, which means that each added member keeps a moderate distance not only from the original cluster members but also from the other added members. Figure 12 shows that the rate varies with the empirical parameter, which means that an optimal matching rate cannot be guaranteed under the TEPM. In contrast, the solid line represents the rate under the RLPST, and it is clear that the rate is equal to one when the learning process is finished.
Figure 13 illustrates the variation in the cost with the learning times. The cost generally decreases as the number of learning times increases. The figure indicates that the learning process reduces the cost effectively, which means that the overall flight condition improves gradually during the process.  Table 2.

Scenario of Members following a Flight
Similar to Section 5.1, to express the degree of deviation of the whole flight path from the center baseline, the following definition is used: where $\rho_i$ and $\rho_{ref}$ represent the position vector of the $i$th member and the corresponding position vector on the center baseline, respectively. The signal Cost is also defined to represent the quality of the distances among the newly added members, which is shown as follows: In addition, in the scenario of members following a leader, it is reasonable to consider how many neighbours each member can sense. The more neighbours a member can sense, the more information it can obtain, which benefits the planning of the flight paths. Therefore, the symbol $\eta_s$ is defined as the sensing rate, representing the degree of sensing ability of the members in the cluster: where $Num^{s}_{i}$ represents the number of neighbours that the $i$th member can sense.
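As a rough illustration of the two quantities described above, the following sketch computes a trajectory deviation and a sensing rate. Both formulas are assumptions chosen only to match the verbal description (a mean Euclidean distance between sampled positions and the baseline, and the sensed-neighbour count normalized by the number of possible neighbours); they are not the paper's definitions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def mean_deviation(path: List[Vec], baseline: List[Vec]) -> float:
    """Average distance between sampled positions rho_i and rho_ref.

    Assumed form: the mean of |rho_i - rho_ref| over all samples.
    """
    return sum(math.dist(p, r) for p, r in zip(path, baseline)) / len(path)


def sensing_rate(sensed_counts: List[int]) -> float:
    """Fraction of possible neighbours that the members actually sense.

    Assumed form: sum of Num_s_i divided by N * (N - 1), so that the rate
    equals one only when every member senses all N - 1 others.
    """
    n = len(sensed_counts)
    return sum(sensed_counts) / (n * (n - 1))
```

With four members that each sense the three others, `sensing_rate([3, 3, 3, 3])` evaluates to one, consistent with the reading that a rate of one means every member senses all other members.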
We set the discount factor $c$ to 0.8, the reward separator $\varepsilon$ to 0.2, the learning rate of the critic $\alpha$ to $5 \times 10^{-4}$, the learning rate of the actor $\beta$ to $10^{-4}$, and the coefficients to $m_r = 1000$ and $m_a = 10^{-3}$. The experimental results for a simulation time $T$ of 1000 s are shown below.
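For reference, the learning settings reported for the two scenarios can be collected in one place, as in the sketch below. The values are those stated in Sections 5.1 and 5.2; the container `RLPSTConfig` and its field names are hypothetical and only serve to make the differences between the scenarios easy to compare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLPSTConfig:
    """Learning settings for one simulated scenario (hypothetical container)."""
    discount: float           # discount factor
    reward_separator: float   # reward separator epsilon
    critic_lr: float          # learning rate of the critic, alpha
    actor_lr: float           # learning rate of the actor, beta
    m_r: float = 1000.0       # Cost coefficient for the set X_r
    m_a: float = 1e-3         # Cost coefficient for the set X_a
    sim_time_s: float = 1000.0  # simulation time T in seconds


# Section 5.1: scenario of members adding.
MEMBERS_ADDING = RLPSTConfig(discount=0.8, reward_separator=0.5,
                             critic_lr=1e-7, actor_lr=1e-8)

# Section 5.2: scenario of members following a flight path.
PATH_FOLLOWING = RLPSTConfig(discount=0.8, reward_separator=0.2,
                             critic_lr=5e-4, actor_lr=1e-4)
```

The two scenarios share the discount factor and the Cost coefficients, while the learning rates differ by several orders of magnitude.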
Figures 14(a) and 14(b) illustrate the trajectories under TEPM with $k = 6$ and $k = 22$, respectively. In Figure 14(a), satellites No. 2 to No. 4 fly away from satellite No. 1 in the latter half of the flight. Their flight paths deviate from the leader because the empirical parameter is so small that the members cannot sense the leader. This indicates that, in TEPM, an empirical parameter that is too small can lead to unexpected situations that may result in the failure of the mission. In Figure 14(b), the flight paths of the members are nonsmooth, which is not a good flight condition for following the leader. This is caused by the large value of the selected empirical parameter.

From Figure 16, it is seen that when the empirical parameter is set between 12 and 16, the sensing rate is equal to one. In this condition, all members in the cluster can sense the other members during the whole flight. However, when the empirical parameter is too small or too large, the sensing rate is not guaranteed to be one, which means that the flight may be badly affected. In contrast, the solid line represents the sensing rate under the proposed RLPST, which is equal to one when the learning process is finished. The figure shows the superiority of the proposed RLPST in terms of the sensing rate.

Figure 17 shows the trajectory deviations under the TEPM and the proposed RLPST, respectively. Similar to the condition shown in Figure 11, when the empirical parameter is too small or too large, the deviation is clearly high because of the poor flight results. In the figure, the deviation is lowest when the parameter is chosen to be about 12. However, the value under RLPST is still smaller than this lowest value under TEPM. Therefore, the RLPST method can achieve a lower deviation within the safe flight range, which makes the flight path smoother.

Figure 18 illustrates the variation in the cost with the learning times. Similar to the curves drawn in Figure 13, the cost generally decreases as the number of learning times increases. When the learning process is finished, a near-lowest cost for the mission is found. The figure indicates that the flight condition is gradually improved during the learning process, and the cost of the flight can be effectively reduced through the proposed RLPST.

Discussion
In Section 5.1, we simulate the scenario of members adding under the TEPM and the proposed RLPST, respectively. The results show that it is difficult to select proper empirical parameters under TEPM. In addition, the trajectory deviations and terminal matching rates under TEPM and RLPST are compared. The trajectory deviation under RLPST is lower than that under TEPM. Moreover, the terminal matching rate under RLPST reaches one when the learning process is finished, while that under TEPM is not guaranteed to. These results support the superiority of the proposed RLPST. The variation in the cost with the learning times shows that the flight paths can be gradually improved through the learning framework.
In Section 5.2, the scenario of members following a flight path is simulated under the TEPM and the proposed RLPST, respectively. Apart from the advantages of RLPST in terms of low trajectory deviation and decreasing cost, the sensing rate under RLPST is better than that under the TEPM, which means that the RLPST method enables the cluster members to obtain more information from their neighbours for completing the mission. The steep downward trend in Figure 18 is due to the selection of the learning rates. In the final part of the learning process, the performance of the cluster becomes sensitive to the learning rates, which is still a challenging problem. The time cost and iteration times for the two simulated scenarios are listed in Table 3.
Compared with the similar studies in [29,30], the time costs and iteration times of the two simulated scenarios are acceptable, which means that the proposed RLPST can improve the flight path at a reasonable computational cost.

Conclusion
Because parameter selection under TEPM is difficult for satellite cluster flying, a parameter self-tuning method based on the actor-critic algorithm is proposed to handle the problem. Considering the specific conditions of a satellite cluster, the three sensing zones are redefined, and a method is presented for determining which zone each sensed member of a cluster member belongs to. To tune the flight parameter in each sensing zone, fuzzy inference systems are employed to compose the actor and critic parts. With a proper design of the reward function, a three-channel learning framework of parameter self-tuning for satellite clusters is designed. The simulation results indicate that, compared with the TEPM, the proposed RLPST achieves a lower trajectory deviation, a better terminal matching rate in the scenario of members adding, and a better sensing rate in the scenario of members following a flight path. In addition, the numerical results show that the cost decreases with the learning times in both scenarios, which indicates that the proposed RLPST can gradually improve the flight paths of the satellite cluster under the learning framework.
Denote the symbol $X = [x, y, z, 1]^{T}$; therefore, the error ellipsoid can be transformed as follows, where $A_i$ is the ellipsoidal quadratic matrix. It is assumed that the $j$th unit is in the sensing area of the $i$th unit with the relative distance $r_{ij} = r_j - r_i = [x_{ij}, y_{ij}, z_{ij}]^{T}$, and the position error ellipsoid of the $j$th unit in the $S_1$ frame can be expressed as

Thus, the transformation matrix $T_2$, which is applied to align the axes, can be obtained: Besides, the transformation matrix $T_3$ is defined as follows: The $S_2$ frame is defined as the coordinate system for axis alignment of the $i$th unit. Therefore, the error ellipsoids of the $i$th unit and the $j$th unit in the $S_2$ frame can be expressed as (A.10)

After the transformation process, the distance vector from the $j$th unit to the $i$th unit is expressed as where $T_{tr}$ is the matrix for distance transformation. To satisfy the condition required by the algebraic criterion for location judging, the origin of the frame needs to be translated to the nominal mass point of the $j$th unit. The translation matrix is denoted as $T_4$, which is expressed as follows: After the translation process, the frame becomes $S_3$, where the error ellipsoids of the $i$th unit and the $j$th unit are expressed as follows: For aligning the axes, the rotation matrix is denoted as $T_5$. Thus, we have where the frame becomes $S_4$.
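The homogeneous form used in this appendix can be illustrated with a short sketch. It builds the 4 × 4 quadratic matrix of an axis-aligned ellipsoid so that the surface is the set of points with $X^{T} Q X = 0$ for $X = [x, y, z, 1]^{T}$. The sign convention (negative inside, positive outside), the restriction to axis-aligned ellipsoids, and the function names are assumptions made for illustration; the appendix's own matrices and frame transformations are not reproduced here.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def ellipsoid_quadric(center: Sequence[float],
                      semi_axes: Sequence[float]) -> Matrix:
    """Return Q such that X^T Q X = sum((x_k - c_k)^2 / a_k^2) - 1."""
    cx, cy, cz = center
    ia, ib, ic = (1.0 / (s * s) for s in semi_axes)
    return [
        [ia, 0.0, 0.0, -cx * ia],
        [0.0, ib, 0.0, -cy * ib],
        [0.0, 0.0, ic, -cz * ic],
        [-cx * ia, -cy * ib, -cz * ic,
         cx * cx * ia + cy * cy * ib + cz * cz * ic - 1.0],
    ]


def quadric_value(q: Matrix, point: Sequence[float]) -> float:
    """Evaluate X^T Q X for the homogeneous vector X = [x, y, z, 1]."""
    x = [point[0], point[1], point[2], 1.0]
    return sum(x[i] * q[i][j] * x[j] for i in range(4) for j in range(4))
```

Under this convention, the sign of `quadric_value` tells whether a point lies inside, on, or outside the ellipsoid, which is the kind of test a location-judging criterion relies on.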

Data Availability
The data used to support the findings of this study are included within the article. These data comprise the number of cluster members $N$; the simulation time $T$; the initial states of satellites No. 1 to No. 4, $x_{10}$, $x_{20}$, $x_{30}$, and $x_{40}$; the discount factor $c$; the separate factor $\varepsilon$; the learning rates $\alpha$ and $\beta$; the factors for calculating Cost, $m_r$ and $m_a$; the reference orbit; the matrix of the error ellipsoid $A$; the matrix of the induction area $M$; the deviation degree $\sigma$; the matching rate $\eta_m$; the sensing rate $\eta_s$; the degree of flight quality Cost; the time cost; the iteration times; the trajectories under TEPM; and the trajectories under RLPST.

Mathematical Problems in Engineering
Conflicts of Interest
The authors declare that they have no conflicts of interest.