Control of Magnetic Manipulator Using Reinforcement Learning Based on Incrementally Adapted Local Linear Models

,


Introduction
A reinforcement learning (RL) agent interacts with the system to be controlled by measuring its states and applying actions according to a policy so that a given goal state is attained. e policy is iteratively adapted in such a way that the agent receives the highest possible cumulative reward, which is a scalar value accumulated over trajectories in the system's state space. e reward associated with each transition in the state space is described by a predefined value function.
Existing RL algorithms can be divided into critic-only, actor-only, and actor-critic variants. e critic-only variants optimize the value function (V-function) that is then used to derive the policy; the actor-only variants work directly on the policy optimization without any need for a value function; and actor-critic variants optimize both functions simultaneously. An example of the actor-only RL variant, often called Q-learning, can be found in [1] and that of the actor-critic variant in [2].
From a different point of view, RL algorithms can be also divided into model-based and model-free variants. Examples of both approaches can be found in [3,4]. e modelbased variants include a model representation of the system to be controlled and can be pretrained in simulation (offline) and then updated when controlling the actual system (online). Model-free methods learn online exclusively through trial and error. Both variants have their specific advantages and disadvantages. We can often find remarks about the model-free approach requiring much more data, especially in high-dimensional cases [1]. In this paper, we employ the model-based, critic-only variant without any online training so that we can compare different modelling approaches.
We focus on two promising categories of approximation algorithms: genetic programming and local linear regression. Our aim is to contribute to the methodology of choosing the optimal out of dozens of existing modelling algorithms when presented with a specific RL task. is problem arises not only in connection with the RL framework (see [5]) but in modelling of a dynamical system in general [6,7].
Genetic algorithms (GA) and their many variations are well established as a tool for modelling or parameter estimation of dynamical systems [8,9]. However, genetic programming as a modelling approach used within RL is relatively new and promises good results with high-dimensional systems where other approaches fail. It creates a continuous-time, globally nonlinear model described by an analytical equation built of combinations of predefined functions [10]. As it is common with genetic optimization algorithms, these methods tend to be computationally demanding. On the other hand, local regression is a wellestablished modelling approach for model-based RF agents where the model is composed of local linear models, offering fast and computationally cheap approximation. ere are several variants of local modelling methods; comprehensive examples of grid-based local linear model structure and data-based local linear regression (LLR) are described in [11,12], respectively. Even though the use of local regression techniques within RL has been researched in the past, it was mainly based on simple, memory-based approximation methods such as the LLR, which is thoroughly described and examined in [13,14], and more complex incremental methods such as the receptive field weighted regression (RFWR) [15,16] or locally weighted projection regression (LWPR) [17] were omitted, with the exception of [18], where the RFWR algorithm was used as a critic approximator. e RFWR and LWPR methods provide significant benefits in lower memory use and higher stability by employing optimization-based (RFWR) or statistical (LWPR) methods to discover the optimal distribution of the local models' areas of validity, that is, the receptive fields.
It is important to benchmark the modelling methods because of the large number of existing approaches, which aim at similar tasks, while there are no simple guidelines on the method choice. Also, the presented algorithms are not yet well established within the RL domain. Finally, studying control algorithms for magnetic manipulation systems has importance on its own because of its application in many industrial fields (medical applications, magnetic levitation systems, etc.), thus leading to the two separate aims of this paper: exploring control algorithms suitable for control of precise magnetic manipulator systems and benchmarking different modelling approaches.
When dealing with real magnetic manipulator systems, we also need to address practical issues that are often neglected in simulations, that is, nonlinearities such as actuator dead zones, saturations, Coulomb friction, signal delays, and so on. ese present significant obstacles then implementing the control algorithm on a real system. In some cases, the dead zone and saturation problem can be addressed by nonlinear or adaptive control laws. For example, [19] shows an approach using fuzzy control with Gaussian membership functions, which is in practice similar to the RFWR method, and [20] describes a gain-scheduling adaptive approach to deal with internal system bounds. Using RL to find a control law for a nonlinear system also has the advantage that it can often deal with such disturbances on its own through the optimization process; for example, only a limited range of the actor outputs may be limited, which is the approach utilized in this paper.
In this paper, we also present minor adjustments to the RFWR algorithm in Section 3, proposed to lower the computational complexity while preserving stability when working with low-dimensional problems.

Magnetic Manipulator.
Genetic programming was already applied to nonlinear systems like an inverted pendulum or a collaborative robot [2,10,21]. To further investigate the approximation capabilities of these methods, we use a different system-a magnetic manipulator (Magman). is system consists of four coils that are independently operated by separate current controllers and a steel ball that can move freely over the coils; see Figure 1. To ensure that the ball moves only in the measured direction with limits on the edges, it is placed in a groove with 10 mm in size. In this case, we decided to limit the system to the first two coils only, as a system with four inputs is much more complex in terms of the RL computational complexity, while it does not enrich the system with different nonlinearities as it only spatially repeats the same kind of nonlinear behaviour. e steel ball can be positioned by properly controlling the current and thereby the magnetic force of the coils. e magnetic force a coil exerts on the ball is highly dependent on the distance of the ball from the coil's centre, which introduces a significant nonmonotonic nonlinearity [22][23][24].
All experiments and simulation were scripted in MATLAB.
e coil currents are controlled by stabilized current source modules, which communicate that MATLAB through a USB/RS232 transceiver using the virtual COM port (VCP) protocol on Windows OS. As the ball position is measured with a laser sensor with analog (voltage) output, the Humusoft MF634 IO card was used to measure the signal in real time from the MATLAB environment with a sampling period of 5 ms. Even though the Window OS is not an RTOS, with this sampling frequency, the period jitter is negligible (below 0.1%), and thus, the system can be considered real time. Table 1 lists the parameters of the magnetic manipulator we use in our experiments. With the task being a precise positioning of an object in a magnetic field, similar concepts can be found in many real-world applications, for example, maglev, microrobots, contactless stirring of chemicals, and so on.

Complexity
Approximate equations of motion inferred using the first principle method can be found in [24].
e system parameters were either measured directly or estimated using MATLAB parameter estimation toolbox based on measured data. Generally, the system can be described by a continuoustime, nonlinear state-space model as follows: where x � [x, _ x] ⊤ is the state vector composed of the position x and velocity _ x of the ball, forming the continuous system state space x ∈ X ⊂ R 2 ; x . in _ X ⊂ R 2 is the state vector derivative; and u � [u 1 , u 2 ] ⊤ is the input (action) vector composed of the coil currents. u 1 , u 2 form the system input space u ∈ U ⊂ R 2 .
e nonlinear vector function f: X × U ⟶ _ X thus describes the system dynamics. In this paper, by modelling the system, we mean approximating the underlying real function f using various methods, which all build upon experimentally measured input-output data. Each data point is formed by corresponding assumed inputs and outputs of the function f -(x . k , x k , u k ). In practice, these data points are corrupted by noise and other disturbances that are assumed to be with zero means.

SNGP.
Single-node genetic programming (SNGP) is a graph-based genetic programming algorithm evolving a population organized as an ordered linear array of interlinked individuals, each representing a single program node [2,10,21]. Generally, symbolic regression algorithms try to find a model in the form of an analytical expression for a given data set by forming and evolving the expression out of elemental functions and operations. In our case, the algorithm is based on the assumption that the nonlinear function f in (1) can be efficiently approximated by the following equation: where the nonlinear function f i , called the feature, is developed by means of genetic programming with n f being the maximum number of features, n s number of states, and the coefficients β i estimated by the least-squares method. e features are constructed from a list of elementary functions that are assumed to be able to produce the required fitting approximation of the presented data. e features can be combined by common operators or nested, but the maximal depth of the expression is limited to avoid overfitting. e symbolic model is evolved so that the mean-squared error over the training data is minimized.

MGGP.
e second GP algorithm we used is called multigene genetic programming (MGGP). As opposed to SNGP, it combines the features defined also by 2 into treelike structured expressions called genes. e final expression is formed by a linear combination of these genes, which act as the individual features in equation (2). e parameters of this top-level linear combination are again estimated through least squares. Further details about the algorithm can be found in [25]. e actual MGGP implementation we used is extended with linear combinations of features [26] that enable the algorithm to find affine transformations of the feature space via a backpropagation-like technique, thus making it easier for the driving genetic programming algorithm to approximate the nonlinearities.

Receptive Field Weighted Regression.
Receptive field weighted regression (RFWR) is an incremental approximation method that creates a set of local linear models and the corresponding Gaussian basis functions called the receptive fields and gradually updates them to fit the inputoutput data. e set of local linear models is updated with new data points (called the query points) using a weighted variant of the recursive least squares (RLS) method and the basis functions are updated through a gradient search with the help of heuristic decision rules. It can continually improve the set of models while still providing the best estimation of the approximated function at each query point based on the previously provided data. e original algorithm, first presented in [15,16], which is the basis we build upon, can be best described by the following pseudocode: For each existing local model (3) Calculate model weight w according to (4) (4) If w > activation limit w act (5) Update model parameters using RLS according to (6) and (7)  (6) Update the corresponding receptive field using (12) and (14)  (7) End (8) End   (15)  (11) Else if two or more models were activated with weight w > pruning limit w prun (12) Prune the model with the smaller receptive field (13) End (14) Calculate the model output as a weighted average of the activated local models (15) End Usually, the receptive field activation limit is set as w act � 0.001. is parameter represents the weight limit for a local model to be updated according to the new data and to be included in the output estimation through a weighted average with another activated model. e pruning limit is usually set as w prun � 0.7, which represents the highest acceptable overlap of neighbouring receptive fields.
e RFWR variant described in this paper follows the main outline of the original algorithm [15] with several adjustments and improvements for the sake of stability and computational complexity for low-dimensional problems. is mainly concerns the rules for adding new local models, adjusting their receptive fields, and generalizing the algorithm in a way that the receptive fields are placed and optimized in a lower number of dimensions than the order of the models. is is especially useful in cases when the nonlinearities are significant mainly in one or two dimensions of the state space of the system. is algorithm, in its original implementation, is successfully being used to approximate inverse models of nonlinear systems to be used as a feedforward compensator [27,28]. Figure 2 shows an example approximation of a complex univariate nonlinear function by the RFWR algorithm.
Each of the local models is represented by a parameter vector b � [b 1 , b 2 , . . . , b n ] ⊤ . With the input vector (a query point) X q � [x 1 , x 2 , . . . , x n ] ⊤ , the output y q is calculated by e weight w of a local model at a query point X q is determined by its Gaussian receptive field as follows: with c � [c 1 , c 2 , . . . , c n ] the vector of model centre coordinates and C − 1 the distance inducing matrix of the basis function (receptive field). e overall output is then calculated as a weighted average of the outputs of the activated local models. e output estimate of the set of local models and their receptive fields is calculated by the following equation: We modified the original RFWR algorithm described in [15] to be used for low-dimensional problems. ese modifications consist of the following:

Complexity
where P is the covariance matrix of the estimate, λ is a forgetting parameter, and y q is the acquired output for the actual system state X q called the query point. e covariance matrix P needs is usually initialized as a diagonal matrix.

Updating dimensions of Basis Functions.
To avoid calculating the matrix inversion in (4) for every local model, an upper triangular matrix M is used instead of C. Because of symmetry and positive definiteness, these matrices relate according to To update the receptive field, we update M using a gradient-descent optimization of the cost function J as follows: where w i is the activated receptive field weight, y i is the estimated output of the respective model at the query point (y q ; X q ), and n is the number of local models. e parameter α is the gradient optimization step size. As the calculation of the cost function J according to (9) is computationally very complex, we simplified the optimization algorithm through a set of heuristic decision rules and implemented the optimization as follows: is implementation introduces a parameter p, which is an expression of a simple heuristic to decide whether the value of a basis function (weight) at the actual query point should be increased or decreased.
is enables to stop updating the distance inducing matrix when a precision criterion is met and to limit the maximal number of local models to avoid overfitting.
Parameter p can be determined by various decision rules. A simple yet effective set, which was used in this research, can be created by using a long-term (cumulated over time) MSE of a particular model according to the data points, which can be described by

Adding New Local Models.
During the optimization process, it is possible that no model exceeds the activation limit w act . In such a case, a new local model with a receptive field is added to the approximation set. e centre of the receptive field is automatically placed at the actual query point, and the model parameters are initialized to fit the measured output of the approximated system. What needs to be determined is the area in the state space that should be covered by the newly created receptive field. e original algorithm uses a default diagonal distance inducing matrix for every local model. However, an optimal distance inducing matrix can be determined. Intuitively, the new receptive field should cover the gap between the already existing models. e distance inducing matrix should be initialized as a diagonal matrix with parameters that ensure that the new receptive field does not overlap with any existing one more than a preset limit. In our case, the limit was set to 0.5w prun . Since this would be a complex optimization task not suitable for realtime calculation, we simplified the criterion so that the maximal overlapping weight of two models is analyzed only over the line segment connecting their centres. In that case, the distance parameter for initializing the distance inducing matrix can be determined by the following equation.where v i � c n − c i is a vector between the new centre c n and the centre of a neighbouring receptive field c i . A two-dimensional example is shown in Figure 3. is method yields a better estimate of the distance inducing matrix of the new receptive field than the fixed initial dimension matrix in the original algorithm as it requires fewer iterations to stabilize and to cover the gap between neighbouring receptive fields.
e distance parameter d i has to be calculated for every existing local model, and the minimal distance d min is used to initialize the distance inducing matrix according to where I is a unity matrix of the corresponding order.
In the specific case of the magnetic manipulator, the inputs of the local models would correspond to (x, u) and the output to x . .

Reinforcement
Learning. Consider the following discrete deterministic state-space model of a system to be controlled: where k ∈ Z denotes discrete time instants, x k , x k+1 ∈ X ⊂ R n is the state vector, and u k ∈ U ⊂ R m is the input vector. An RL agent learns to control the system so that it achieves the maximal cumulated reward on a trajectory from the initial state to the desired state [10]. At each state transition, as described by (14), the agent receives a scalar reward according to Complexity r � ρ x k , u k , x k−1 . (15) e reward function ρ is usually based on the distance of the current state to the goal state. e optimal control law, called the policy, π: X ⟶ U is determined as follows such that it maximizes the cumulative reward, called the return: where c ∈ (0, 1) is the discount factor and the initial state x 0 is selected from the state space domain X. e return for any permissible initial state x is captured by the value function V: X ⟶ R defined as follows: An approximation of the optimal V-function V(x) can be found by solving the Bellman equation as follows: f(x, u)) + cV (f(x, u))]. (18) e optimal action can be found as the action that steers the system to a state with maximal value [21]. is corresponds to maximization of the right-hand side of (18):

Experimental Results
We prepared training and validation I/O data sets measured on the magnetic manipulator with random input signals as a list of data points in the form (x . k , x k , u k ). e random input signals (coil currents) were generated in the way that only one coil was active at a time that eliminated possible electromagnetic interactions between them (the coil current was controlled by an HW-based current feedback controller module rendering the transient times negligible). Figure 4 shows an example of a training data set.
Even though the ball's position measurement is very precise, it still contains significant noise. For that reason, the time-domain derivatives of the position (velocity and acceleration) needed for the dynamic model approximation were determined using the Savitzky-Golay filter, which is an FIR filter based on least-squares polynomial approximation able to perform numerical differentiation while filtering the noise simultaneously [29,30].
Especially for the RFWR implementation, it is important to note that the system's nonlinearity is mainly significant along the position of the ball and the system can be seen as linear in parameters along the other dimensions (acceleration and velocity). In this case, the general model (1) can be rearranged as follows: where the function f(x, u) represents the significant nonlinearity suitable for local approximation, the term b 0 sign( _ x) represents a simple model of dry friction, the term b 1 _ x represents the viscous friction, and the last term b 2 _ x 2 models nonlinear damping caused by electromagnetic induction influencing the steel ball while moving rapidly through a magnetic field. Despite being nonlinear, all of the terms are linear in their parameters and can be modelled globally, which means that the local models share param- (20) is quite important in practical situations where Coulomb friction is not negligible. e sign function is often being used to approximate the effects of Coulomb friction wherever there is no significant stiction (difference between static and dynamic friction effects).
ere are better approximations for simulation purposes, for example, the sigmoid function; however, most of them are not linear in parameters and thus not applicable for RLS parameter estimation. e same data set was presented to all of the approximation methods (RFWR, SNGP, and MGGP). Due to the stochastic nature of the two algorithms based on genetic programming, the same process was repeated with different pseudorandom seeds. Overall, 30 runs for SNGP and MGGP and 1 run for RFWR were made. Table 2 shows the summary of the MSE results.
Based on the 5-SAP validation results, 21 best models (10x SNGP, 10x MGGP, and 1x RFWR) were chosen to be used for RL control of the magnetic manipulator. Based on these models, we calculated the approximations of the optimal V-functions using the fuzzy value iteration [10,21]. Furthermore, we used equation (19) to calculate the corresponding policies. Figure 6 shows examples of the value functions that resulted in the policies shown in Figure 7.
All the policies were tested on the actual magnetic manipulator. A sequence of five consecutive goal states was chosen as a goal state trajectory, and a corresponding V-function and policy were calculated for each one. During the actual control process, a new policy is used every time the goal state changes. Figure 8 shows an example of the control experiments with an RL controller based on the model SNGP 10. Furthermore, Figure 9 shows the ball's trajectory in the state space plotted over the V-function and the policy.
To conduct all of the experiments, we measured the ball's position using a laser distance sensor, and a PCIe I/O card Humusoft MF634 was used to capture the sensor's analog output signal. e coil currents were driven by a custom dual-channel current controller. Both of the devices are operated from MATLAB.

Results.
We compared the performance of various models of a complex nonlinear system created with three different approximation methods, two of which were based on genetic programming and the third was based on a modified local linear approximation algorithm (RFWR). Based on these models, we used a model-based critic-only RL agent to control the system and validate the results. Table 2 shows the resulting MSE values from the estimation, validation, and control processes.
Most of the models selected for the actual control experiments were successful in achieving stable control, although they differ in precision. e histogram in Figure 10 shows the number of models for the two GP-based algorithms (SNGP and MGGP), which fall into several MSE categories. e MSE describes the control precision as the mean-squared error between the goal and the actual trajectory of the closed-loop controlled system.

Conclusions
First of all, the results show that it is possible to construct even such a complex nonlinear system using the RL framework. e results are not significant in terms of control precision, which depends highly on the specific system, amount of experimental data, and many other factors. An important achievement is a fact that all of the modelling algorithms demonstrated in this paper provide a viable alternative to the commonly used methods while being much less computationally expensive (in the case of RFWR) or much more user-friendly (in the cases of SNGP and MGGP).
Also, it was proven that the commonly used RL framework may be built even on tom of imperfect models.
It is clear from the results that the methods that at least partially depend on random number generation (SNGP and MGGP) need to be run repeatedly in search for the best solution, which clearly outperforms the result of the local approximation method (RFWR). On the other hand, the RFWR method requires significantly less computational power than the GP-based methods. Also, it seems that both the SNGP and the MGGP are able to find similarly precise models with the MGGP having a higher probability of converging to the best solution. It is interesting to note that models with better training or validation fit are not always better for control, as can be seen in Table 2. All of the methods provide a useful tool to be used within the reinforcement learning framework with the main advantage of the GP-based approximation method being a form of output (analytical expression) that is understandable and readable and whose complexity is controllable through intuitive parameters. Considering the simplifications made during the simulations and experiments and a relative imprecision during the control processes, there is space for future research in modifying these methods to be suitable for higherdimensional systems, implementing GPU, handling specific nonlinearities (friction, hysteresis, etc.), and using them also for an approximation of the V-functions and policies.

Data Availability
Research data in the form of simulation and experimental results and MATLAB files are available from the corresponding (first) author upon request (martin.brablc@ vutbr.cz).

Conflicts of Interest
e authors declare that they have no conflicts of interest.