This paper presents stochastic search algorithms (SSA) suitable for the effective identification, optimization, and training of artificial neural networks (ANN). The modified algorithm of nonlinear stochastic search (MN-SDS) is introduced by the author. Its basic objective is to improve the convergence of the original nonlinear stochastic direct search (N-SDS) method as defined by Professor Rastrigin. Given the vast range of possible algorithms and procedures, the so-called method of stochastic direct search (SDS) has been used (in the literature it is also called stochastic local search, SLS). The convergence of MN-SDS is a considerable advance over N-SDS; indeed, it is even better than that of a range of gradient optimization procedures. SDS, that is, SLS, has not been used widely in the identification, optimization, and training of ANN. Its efficiency in some cases of purely nonlinear systems makes it suitable for the optimization and training of ANN. The presented examples illustrate only part of the operation and efficiency of SDS, that is, MN-SDS. The backpropagation error (BPE) method was used for comparison.
The main goal of this paper is the presentation of a specific variant of direct SS and its application to the identification and optimization of linear and nonlinear objects or processes. The method of stochastic search was introduced by Ashby [
Stochastic direct search (SDS) was not recognized as an advanced, competitive option for quite a long time. The research and development work of Professor Rastrigin and his associates promoted SS into a competitive method for solving various problems of identification and optimization of complex systems [
It has been shown that SDS algorithms are not only competitive but can even outperform well-known methods. The parameter used for comparison is the convergence achieved while solving the given task. For comparison purposes, gradient methods were used in reference [
During the last 20 years, considerable interest has been shown in advanced SDS, especially in cases where classical deterministic techniques do not apply. Direct SS algorithms are one part of the SSA family. Important results on random search have been obtained: theorems of global optimization, convergence theorems, and applications to complex control systems [
The author has used SDS algorithms in several of his published papers on the identification of complex control systems [
Experience in applying SDS in its basic definition motivated the author to introduce the so-called modified nonlinear SDS (MN-SDS), applicable as a numerical method for the identification and optimization of substantially nonlinear systems. The main reason is the rather slow convergence of N-SDS in its basic definition, a deficiency that has now been overcome.
The application of SDS is efficient for both deterministic and stochastic descriptions of systems.
An SDS algorithm is characterized by the introduction of random variables. A practical option is a generator of random numbers [
All of the above is supported by well-developed modern computer hardware and software, which provide a suitable environment for the creation and implementation of SDS methods and procedures.
Solving theoretical and/or practical problems usually requires an identification task first, followed by the final stage, that is, system optimization. The analysis and synthesis of systems always follow this order [
SDS methods are among the competitive numerical procedures for the identification and optimization of complex control systems [
Parameter identification of the above system requires certain measurements of the system variables while observing the criteria function:
The criteria function
When
Further, for simplicity of presentation, a function
Optimization methods start from an iterative form in which, from the current state
So,
In the case of SDS, relation (
A random variable
If we introduce the terms
Consider
Consider
Some of the more complex forms of SDS are as follows.
Consider
The iterative procedure for the SDS fastest descent is
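To fix ideas, below is a minimal sketch of one basic nonlinear random-search step of this family (a trial along a random unit vector, with return to the previous point on failure); the function names and the step size `a` are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def n_sds_step(f, x, a, rng):
    """One trial of a basic nonlinear random-search step:
    move along a random unit vector; keep the step only if the
    criterion function decreases, otherwise return to the old point."""
    xi = rng.standard_normal(x.shape)
    xi /= np.linalg.norm(xi)            # random unit direction
    x_trial = x + a * xi                # trial (test) step
    if f(x_trial) < f(x):               # successful test: effective step
        return x_trial
    return x                            # failed test: return to x
```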
The graphical presentation of SDS behavior in the space of the gradient
SDS algorithms in the gradient field.
The gradient starts from point
The random vector
The SSA properties used for ranking against other competitive methods (gradient, fastest descent, Gauss-Seidel, scanning, and others) are
The main property is dissipation, that is, the loss in one step of search and displacement on the hypersurface of the function
The convergence
The reciprocal value of relation (
The local SDS properties include the probability of a failed search step:
Thus, the choice of an optimal algorithm from the local-property point of view leads to a compromise among three quantities:
Besides the local properties, the relevant ones are the properties describing the search as a whole: the total number of steps, that is, the number of effective steps; and the accuracy, that is, the allowed error at which the procedure ends.
The choice of an SDS option is guided by the local characteristics indicated above.
It is also necessary to observe that the dispersion
Finally, it is worth mentioning that the choice of algorithm is subject to the requirement that the procedures (identification and optimization) be effective, bearing in mind that the system model or criteria function is nonlinear. Substantially nonlinear process models are thus described by (
Good properties of SDS as per (
As the system dimension increases, the probability of acceptable tests decreases, indicating that the target is reached only after rather numerous tests and effective steps.
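This slowdown with dimension is easy to see numerically. The following quick Monte Carlo check (ours, not from the paper) shows that the average useful projection of a random unit direction onto a fixed descent direction shrinks as the dimension grows:

```python
import numpy as np

# Monte Carlo check: the average |projection| of a random unit
# direction onto a fixed axis decreases with dimension n,
# illustrating why pure random search slows down as n grows.
rng = np.random.default_rng(0)
for n in (2, 10, 100):
    xi = rng.standard_normal((100000, n))
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)
    proj = np.abs(xi[:, 0])            # |projection| on the first axis
    print(n, proj.mean())              # ~ sqrt(2 / (pi * n)) for large n
```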
The idea of converting failed tests into useful accumulated information leads toward the so-called modified nonlinear SDS. This is shown in Figure
It is now possible to form the following form of the SSA search:
A modification defined by (
Since MN-SDS exploits accumulated information, a sort of self-learning feature is obtained (specifically when all data are memorized), leading to the conclusion that the probability of a test depends on the accumulated information:
This means that the search is guided within the vicinity of the best probe or test [
The guidance vector toward the target is now
In practice, the optimization starts without the self-learning option. In that sense, MN-SDS, even when the information on failed tests is not fully memorized, enables sampling of the most useful part of it.
The main result achieved by MN-SDS is improved convergence compared to that of the nonlinear SSA of the basic definition (Section
For SSA purposes, a uniform distribution of random numbers
As the system dimension increases, the probability
The idea of using failed SDS tests to increase the efficiency of nonlinear SDS initiated the creation of the so-called modified nonlinear SDS (MN-SDS). In fact, failed tests provide accumulated information for improving N-SDS convergence. The rearranged iterative procedure for minimization of the criteria function
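Since the paper's exact update relations are not reproduced above, the following Python sketch should be read as one plausible reading of the idea only: directions of failed trials are memorized, and their mean is subtracted from new random trial directions, steering the search away from regions that have already failed. All names, the memory length, and the combination rule are our assumptions.

```python
import numpy as np

def mn_sds_minimize(f, x0, a=0.1, n_iter=2000, memory=10, seed=0):
    """Sketch of a modified nonlinear random search: directions of
    failed trials are memorized and their mean is subtracted from new
    trial directions (one reading of the MN-SDS idea)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    failed = []                              # memory of failed directions
    for _ in range(n_iter):
        xi = rng.standard_normal(x.shape)
        if failed:                           # exploit accumulated information
            xi = xi - np.mean(failed, axis=0)
        norm = np.linalg.norm(xi)
        if norm < 1e-12:
            continue
        xi /= norm
        x_trial = x + a * xi
        f_trial = f(x_trial)
        if f_trial < fx:                     # effective (successful) step
            x, fx = x_trial, f_trial
            failed.clear()
        else:                                # failed test: memorize it
            failed.append(xi)
            failed = failed[-memory:]
    return x, fx
```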
The MN-SDS procedure provides an acceptable choice with respect to the set of three characteristics
The convergence
This is shown as curve ABC in Figure
Comparison of the convergence of MN-SDS and N-SDS with the gradient method.
The nonlinear SSA (N-SDS) of the basic definition (curve
It is worthwhile to recognize when to use SDS rather than some other optimization method. In that sense, in Figure
The applicability areas of gradient and SDS methods.
It should be mentioned that the random number generator must pass stricter tests whenever the system dimension is large [
Diagram of the procedure for computer processing of MN-SDS.
Hereinafter, an implementation of MN-SDS on a multilayer ANN with feedforward information flow through the network (FANN) will be considered.
The FANN properties (well adopted by researchers and designers) enable a wide range of ANN to be transformed into FANN.
SDS, that is, MN-SDS, can be applied to both FANN model forms: the oriented graph and the matrix form.
For MN-SDS, the first form is more practical, given the available heuristic options that make MN-SDS more efficient.
In this part of the paper and onward, a multilayer FANN will be considered, as shown in Figure
The FANN of general type (a) and the neuron (perceptron) model used (b), where
After adjustment of symbols (in expression (
The vector of parameters
The vector
Stochastic methods (including MN-SDS), instead of
An application of MN-SDS to the FANN (Figure
The application of the MN-SDS algorithm to FANN training involves the introduction of a random vector
If the FANN has more than one output, the error must first be formed from the error of each individual output:
The increment of
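Concretely, the forward phase and the cost evaluation used by such training can be sketched as follows; sigmoid hidden layers, linear outputs, and a single flattened weight vector are our representational assumptions, not the paper's exact formulation.

```python
import numpy as np

def forward(weights, shapes, x):
    """Forward phase of a small FANN: sigmoid hidden layers and a
    linear output layer; all weights live in one flat vector."""
    i, a = 0, x
    for k, (n_in, n_out) in enumerate(shapes):
        W = weights[i:i + n_in * n_out].reshape(n_in, n_out)
        b = weights[i + n_in * n_out:i + (n_in + 1) * n_out]
        i += (n_in + 1) * n_out
        z = a @ W + b
        a = z if k == len(shapes) - 1 else 1.0 / (1.0 + np.exp(-z))
    return a

def cost(weights, shapes, X, Y):
    """Summed squared error over all training pairs and all outputs."""
    return sum(float(np.sum((forward(weights, shapes, x) - y) ** 2))
               for x, y in zip(X, Y))
```

Training then amounts to calling, for example, `mn_sds_minimize(lambda w: cost(w, shapes, X, Y), w0)` from the sketch above: only this forward phase is ever evaluated, with no backpropagation of errors.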
The synthesis of an ANN is an engineering design task. An ANN design starts with the setup of an initial architecture based on the experience and intuition of the designer. In this Section
An ANN architecture is determined by the number of inputs and outputs, the neurons, the interconnections between them, and the biases. A perceptron is a neuron with adjustable inputs and an activation function that need not be of the step (threshold) type.
Experience clearly shows that the solution of a complex problem requires a complex ANN architecture.
Various experiences in working with SDS have confirmed this observation. SDS methods are more efficient than numerous known ones, specifically for complex optimization problems, where complexity is measured by the system dimension. Given the significance of the multilayer FANN, hereinafter the structure shown in Figure
It is worth mentioning that a successful optimization of an FANN does not mean that it has the necessary features: first of all, the required capacity and generalization properties [
Capacity features (
The generalization property of an FANN is the basic paradigm of its performance. In fact, it is the essential property confirming that the “network has learnt”: the network should provide valid answers to inputs not contained in the training pairs. In other words, the testing data should have the same distribution as the set of training pairs.
The synthesis issue is an open problem. In [
The name universal approximator is linked to a three-layer network having perceptrons with a nonlinear (sigmoid-type) activation function in the hidden layer and perceptrons with a linear activation function at the outputs (Figure
Basic FANN architecture of a universal approximator.
By introducing the following designations, for hidden neurons
Once a starting network structure has been set up, its dimension is
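For a network with n inputs, m hidden neurons, and p linear outputs, counting weights plus biases gives m(n+1) + p(m+1) adjustable parameters; the helper below states this common counting rule as our assumption, since the paper's own expression is not reproduced here.

```python
def fann_dimension(n_in, n_hidden, n_out):
    """Number of adjustable parameters of a three-layer FANN:
    hidden weights + biases plus output weights + biases."""
    return n_hidden * (n_in + 1) + n_out * (n_hidden + 1)

# e.g., a 2-2-1 XOR network has 9 adjustable parameters
assert fann_dimension(2, 2, 1) == 9
```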
The range of training-pair samples for
The FANN's ability to memorize and its capacity
In the case of a conflict between the ANN dimension and the required number of training samples
The previous consideration, with a configuration fixed during the training procedure, is characterized as static.
Some methods approach the optimization dynamically; the network structure is changed during the training procedure, as in the cascade correlation algorithm [
Training with MN-SDS proceeds through the forward phase only. The numerical procedure is simple and provides enough information; SDS thus lends itself to a dynamic approach to the training, optimization, and synthesis of artificial neural networks.
This example is concerned with systems and control theory [
The linearized multivariable system described in the internal representation (see relation (
If the statics of the real system are described in matrix form,
In the expression (
The relation (
Linear forms type (
Bearing in mind the above, a numerical experiment for the coefficients
For identification of
The current values of the quantities
For the iteration step, we used
The final results after the random procedure with N-SDS are
The accuracy of
Implementation of MN-SDS is possible after transforming the equation system (
By applying the MN-SDS method, some
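As a schematic illustration of such an identification run (with synthetic data and made-up dimensions, since the example's exact system matrices are not reproduced here), the `mn_sds_minimize` sketch from above can be reused as the optimizer:

```python
import numpy as np

# Schematic identification of a linear static model y = C z by random
# search; data and dimensions below are illustrative only.
rng = np.random.default_rng(2)
Z = rng.standard_normal((20, 6))          # 20 measurements of 6 variables
C_true = rng.standard_normal((2, 6))      # "unknown" 2x6 coefficient matrix
Ymeas = Z @ C_true.T                      # noise-free measured outputs

def criterion(c_flat):
    """Summed squared residual of a candidate coefficient matrix."""
    C = c_flat.reshape(2, 6)
    return float(np.sum((Z @ C.T - Ymeas) ** 2))

c_hat, j_min = mn_sds_minimize(criterion, np.zeros(12), a=0.05, n_iter=20000)
```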
Collected data of the variables.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| | 1.3 | 2.1 | 2.9 | 2.0 | 2.3 | −0.8 | −0.9 |
| | 1.1 | 3.9 | 6.0 | 2.8 | 2.1 | −0.6 | −1.0 |
| | 7.0 | 7.0 | 2.7 | 2.6 | 7.0 | +0.8 | −3.0 |
| | 7.0 | 1.9 | 1.1 | 3.0 | 3.0 | +2.0 | −4.0 |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| | 1.0 | −1.0 | 2.0 | 1.0 | −1.0 | 1.0 | 0.0 |
| | 1.0 | −1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| | 0.0 | −1.0 | 2.0 | 1.0 | 2.0 | −1.0 | 0.0 |
| | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | −1.0 |
| | −1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | −1.0 |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| | −1.0 | +1.0 | −1.0 | +1.0 | 0.0 | −1.0 | 0.0 |
| | +1.0 | 0.0 | +1.0 | 0.0 | +1.0 | −1.0 | 0.0 |
This example treats the training of a perceptron on the “XOR” logic circuit using SDS procedures. The example played an important role in R&D in the field of artificial neural networks. In fact, Minsky and Papert (1969) showed that a perceptron cannot “learn” to simulate the XOR logic circuit and concluded that not much should be expected of ANN [
If training pairs are formed from the XOR truth table,
where
It is obvious that relations (2), (3), and (4) exclude each other. A perceptron cannot be trained to simulate the XOR logic circuit.
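For completeness, here is a standard reconstruction of the argument (the paper's numbered relations are not reproduced above). A single perceptron with weights $w_1, w_2$ and threshold $\theta$, firing when $w_1 x_1 + w_2 x_2 > \theta$, would have to satisfy

$$0 \le \theta, \qquad w_1 > \theta, \qquad w_2 > \theta, \qquad w_1 + w_2 \le \theta.$$

Adding the two middle inequalities gives $w_1 + w_2 > 2\theta$, while the first and last give $w_1 + w_2 \le \theta \le 2\theta$: a contradiction.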
The aforesaid is understood as obstacle “
These findings interrupted the further development of ANN for more than 20 years, up to the works of Hopfield [
The problem was overcome by studying the properties of multilayer and recurrent ANN, as well as by creating advanced numerical procedures for their training [
This example presents the results of training a network with at least one hidden layer of neurons, which can be trained to simulate the XOR logic circuit.
Figure
For the training pairs, two options can be used (Figures
Further on, the results obtained with training pairs
During training, it was shown that
Values of all other
The criteria function for both cases is
The SSA iterative procedures were used to minimize
In Figure
The results of training are shown first for a step (threshold) activation function:
The BPE method cannot be applied here, since
Finally results after
The random vector of the forward propagation has dimension
Let us now refer to the case when the activation function
The BPE method implementation followed known procedures [
In this paper, both MN-SDS and BPE use ECR (error-correction-rule) procedures, which are more convenient for “online” ANN training [
Whenever the training coefficient (
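To make the setup concrete, the sketch below trains a 2-2-1 network on the XOR pairs by reusing the `forward`, `cost`, `fann_dimension`, and `mn_sds_minimize` sketches introduced earlier; the architecture, step size, and iteration budget are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

# XOR training pairs (inputs and targets)
X = [np.array([0., 0.]), np.array([0., 1.]),
     np.array([1., 0.]), np.array([1., 1.])]
Y = [np.array([0.]), np.array([1.]), np.array([1.]), np.array([0.])]

shapes = [(2, 2), (2, 1)]      # 2 inputs, 2 hidden neurons, 1 output
w0 = np.random.default_rng(1).standard_normal(fann_dimension(2, 2, 1))

w_opt, j_min = mn_sds_minimize(lambda w: cost(w, shapes, X, Y), w0,
                               a=0.3, n_iter=5000)
print(j_min, [float(forward(w_opt, shapes, x)[0]) for x in X])
```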
Optimization, that is, training of a multilayer ANN: the XOR problem.
Cost function; (a) for
This example uses some of the theoretical R&D results of [
The real problem is an approximated model, built by training an ANN, of the program for a reactor process in which the concentrate is heated to a powder so that it behaves like a fluid [
When the said FSR is to be tempered, either for the first time or after servicing, there is a program of tempering to temperature
The mapping in real conditions does not provide enough data to build a FANN (Figure
In Figure
Application of the relations ((
Based on the aforesaid, the available training pairs outnumber those required
The basic steps in building the FSR approximation model through the use of MN-SDS in FANN training are as follows: assignment of the training random vector; the feedforward phase through the layers for each training pair; obtaining the numerical value of the cost function for the sequence of training pairs; and the optimization procedure, that is, the training of the FANN.
Due to the large volume of numerical data, we will not list all the details here.
Of two ANN trained with the same training samples, the network with fewer neurons has the stronger generalization.
Data defining the FSR warming program (
| t (time) | T (temperature) |
|---|---|
| 0 | 20 |
| 10 | 20 |
| 20 | 50 |
| 30 | 50 |
| 40 | 100 |
| 50 | 100 |
| 70 | 150 |
| 80 | 250 |
| 90 | 250 |
| 100 | 350 |
| 105 | 350 |
| 110 | 450 |
| 115 | 450 |
| 120 | 550 |
| 125 | 600 |
| 130 | 600 |
| 140 | 600 |
| 160 | 600 |
| 170 | 600 |
| 175 | 425 |
| 185 | 325 |
| 205 | 150 |
| 215 | 110 |
| 225 | 75 |
| 235 | 60 |
| 240 | 50 |
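As an illustration of how these tabulated values become training pairs, the sketch below scales time and temperature to [0, 1] and prepares them for the same random-search training loop as in the XOR example; the scaling and the single-input/single-output layout are our assumptions.

```python
import numpy as np

# FSR tempering program as training pairs (values from the table above):
# input is time t, target is temperature T; both scaled to [0, 1].
t = np.array([0, 10, 20, 30, 40, 50, 70, 80, 90, 100, 105, 110, 115,
              120, 125, 130, 140, 160, 170, 175, 185, 205, 215, 225,
              235, 240.])
T = np.array([20, 20, 50, 50, 100, 100, 150, 250, 250, 350, 350, 450,
              450, 550, 600, 600, 600, 600, 600, 425, 325, 150, 110,
              75, 60, 50.])
X = [np.array([ti]) for ti in t / t.max()]
Y = [np.array([Ti]) for Ti in T / T.max()]
# Training then proceeds exactly as in the XOR sketch, with shapes such
# as [(1, m), (m, 1)] for m hidden neurons.
```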
Plan for introducing the FSR into operative mode (AB), work (BC), and possible downtime (CD).
The real mapping data of the FSR in the operative environment.
The initial FANN structure for the approximation model of FSR tempering.
Trend of the cost functions of the multilayer FANNs
That is the reason to consider less complex network structures:
Structure
A more acceptable solution is
It was possible to continue toward a minimized architecture
The closest solution is the FANN
Oriented graph of
Since determining an ANN (i.e., FANN) architecture is always an open problem, an adopted structure is best evaluated after the final numerical result, including a validity test of the network training.
Theoretical results for the universal approximator are derived for a nonlinear activation function of the hidden neurons of type
Since
Applying the relation of
Numerical data and the results of the network training will be presented here for
At the beginning of the numerical procedure, for practical reasons,
The symbolic presentation of the vector of unknown parameters
Random vector
After training, the network
The previous data characterize the approximation model of the FSR tempering-temperature process (
Some responses to the test inputs used for checking the validity of the obtained model deviate with a large error. Counting these against the total number of tests gives a relatively rough estimate of the generalization capability of the corresponding network. Based on Figure
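The rough estimate just described can be written down directly; in the helper below, the error list and the tolerance parameter are our own illustrative names.

```python
def generalization_percent(errors, tolerance):
    """Rough generalization estimate: the share of test responses
    whose deviation stays within the allowed tolerance."""
    ok = sum(1 for e in errors if abs(e) <= tolerance)
    return 100.0 * ok / len(errors)
```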
Application of the BPE method gave the following generalization values: about 70% for network
The previously presented FSR model could be used in a more sophisticated setting, such as intelligent process monitoring.
Why should one apply stochastic search methods (SSMs) to the problems that arise in the optimization and training of ANN? Our answer to this question is based on the demonstration that SSMs, including SDS (stochastic direct search), have proved very productive in solving problems of complex systems of different natures.
In particular, previous experience with ANN of relatively simple architecture suggests that they can exhibit quite complex behavior, which can be traced to (i) a large number of neuron-perceptron elements in the ANN system, (ii) the complexity of the nonlinearity in the activation function of neurons within the ANN, (iii) the complexity of the neuron activation function model (i.e., higher-level models), (iv) complexity in the optimization procedure due to the large volume of data in the training set, and (v) the complexity of the specification of the internal architecture of particular types of ANN.
The features listed above require competitive methods which can deal efficiently with such complexity. SDS represents a combinatorial approach offering great potential via the heuristics and algorithms it provides for numerical procedures.
For example, various methods based on the notion of the gradient, which are considered competitive when applied to complex systems, cannot avoid linear scaling of convergence in numerical implementations. In SDS, the trend is, roughly speaking, proportional to
The author has previously used some algorithms belonging to the SDS methodology, such as nonlinear SDS (N-SDS) and statistical gradient (SG-SDS). Both of these methods have exhibited poor convergence. Starting from N-SDS, we have designed MN-SDS, which already for
In this paper, Section
The synthesis elements presented here are simplified theoretical results of recent studies. This is illustrated by Example
Let us finally mention that there is increasing interest in using SSM, both from academia and industry. This is due to the fact that SSM, and in particular SDS, find increasing applications in economics, bioinformatics, and artificial intelligence, the last of which is intrinsically linked to ANN.
The central goal of this study is the presentation of the stochastic search approach applied to the identification, optimization, and training of artificial neural networks. Based on the author's extensive experience in using the SDS approach for problems of identification and optimization of complex automatic control systems, a new algorithm based on nonlinear SDS (N-SDS), termed MN-SDS, is proposed here. MN-SDS offers a significant improvement in convergence compared to N-SDS and some other SSA.
MN-SDS maintains all the other good features of the existing SDS: a relatively easy adaptation to problem solving; simple mathematical construction of algorithmic steps; low sensitivity to noise.
The convergence properties of MN-SDS make it superior to the majority of standard algorithms based on a gradient scheme. Note that convergence is the most suitable characteristic for comparing the efficiency of algorithms on systems with the same number of optimization parameters. For example, already for more than three parameters, MN-SDS exhibits better convergence than most other algorithms, including those based on the gradient. This means that, in certain optimization and training procedures, MN-SDS is superior to the widely used BPE method for ANN and in the development of artificial intelligence.
In the optimization and training of ANN, MN-SDS employs only the feedforward phase of information flow in the FANN. The parameters used in the optimization within MN-SDS are changed using a random number generator. The efficiency of MN-SDS in numerical experiments suggests that it can be applied to very complex ANN. This study has shown its application to feedforward ANN (FANN). The obtained results were compared with those obtained with the BPE method when applied to the same problems.
The numerical experiments performed here can be implemented even on a simple multicore PC using the MATLAB package.
The author declares that there is no conflict of interests regarding the publication of this paper.
The author has used the achievements of Professor Rastrigin's courses and expresses his gratitude. Unfortunately, there is no possibility of directly acknowledging the use of some elements of the figures from the Professor's books.