Convergence of an Online Split-Complex Gradient Algorithm for Complex-Valued Neural Networks

The online gradient method has been widely used in training neural networks. In this paper we consider an online split-complex gradient algorithm for complex-valued neural networks, with an adaptive learning rate used during the training procedure. Under certain conditions, by first establishing the monotonicity of the error function, we prove that the gradient of the error function tends to zero and that the weight sequence tends to a fixed point. A numerical example is given to support the theoretical findings.


Introduction
In recent years, neural networks have been widely used because of their outstanding capability of approximating nonlinear models. As an important search method in optimization theory, the gradient algorithm has been applied in various engineering fields, such as adaptive control and recursive parameter estimation [1-3]. The gradient algorithm is also a popular training method for neural networks; when used to train networks with hidden layers, it is usually called the backpropagation (BP) algorithm and can be implemented in either online or batch mode [4]. In online training, the weights are updated after the presentation of each training example, while in batch training, the weights are not updated until all of the examples have been presented to the network. As a result, the batch gradient training algorithm is usually adopted when the number of training samples is relatively small, whereas the online gradient training algorithm is preferred when a very large number of training samples are available.
The parameters of conventional neural networks are usually real numbers, suited to dealing with real-valued signals [5, 6]. In many applications, however, the inputs and outputs of a system are best described as complex-valued signals, and processing is done in the complex domain. To handle such problems, complex-valued neural networks (CVNNs) have been proposed in recent years [7-9]; they extend the usual real-valued neural networks to complex numbers. Accordingly, there are two types of generalized gradient training algorithms for complex-valued neural networks: the fully complex gradient algorithm [10-12] and the split-complex gradient algorithm [13, 14], both of which can be run in online or batch mode. It has been pointed out that the split-complex gradient algorithm can avoid the problems resulting from singular points [14].
Convergence is of primary importance for a training algorithm to be successfully used. There have been extensive research results concerning the convergence of gradient algorithms for real-valued neural networks (see, e.g., [15, 16] and the references cited therein), covering both the online and the batch mode. In comparison, the convergence properties of complex gradient algorithms are seldom investigated. We refer the reader to [11, 12] for some convergence results of fully complex gradient algorithms and to [17] for those of the batch split-complex gradient algorithm. However, to the best of our knowledge, a convergence analysis of the online split-complex gradient (OSCG) algorithm for complex-valued neural networks has not yet been established in the literature, and this is the primary concern of this paper. Under certain conditions, by first establishing the monotonicity of the error function, we prove that the gradient of the error function tends to zero and that the weight sequence tends to a fixed point. A numerical example is also given to support the theoretical findings.
The remainder of this paper is organized as follows. The CVNN model and the OSCG algorithm are described in the next section. Section 3 presents the main results. The proofs of these results are postponed to Section 4. In Section 5 we give a numerical example to support our theoretical findings. The paper ends with some conclusions in Section 6.

Network Structure and Learning Method
It has been shown that a two-layered CVNN can solve many problems that cannot be solved by real-valued neural networks with fewer than three layers [13]. Thus, without loss of generality, this paper considers a two-layered CVNN consisting of $L$ input neurons and $1$ output neuron. For any positive integer $d$, the set of all $d$-dimensional complex vectors is denoted by $\mathbb{C}^d$ and the set of all $d$-dimensional real vectors by $\mathbb{R}^d$. Let $w = w_R + i\,w_I \in \mathbb{C}^L$ denote the weight vector, where $w_R, w_I \in \mathbb{R}^L$. For an input $z = x + iy \in \mathbb{C}^L$ with $x, y \in \mathbb{R}^L$, the input to the output neuron is

$$U = U_R + iU_I = w \cdot z = \left(w_R \cdot x - w_I \cdot y\right) + i\left(w_I \cdot x + w_R \cdot y\right). \tag{2.1}$$

Here "$\cdot$" denotes the inner product of two vectors.
For the convenience of using the OSCG algorithm to train the network, we consider the following popular real-imaginary-type activation function [13]:

$$f_C(U) = f_R(U_R) + i\,f_R(U_I), \tag{2.2}$$

where $f_R$ is a real function (e.g., a sigmoid function). Simply denoting $f_R$ as $f$, the network output $O$ is given by

$$O = O_R + iO_I = f(U_R) + i\,f(U_I). \tag{2.3}$$

Let the network be supplied with a given set of training examples $\{z^q, d^q\}_{q=1}^{Q} \subset \mathbb{C}^L \times \mathbb{C}^1$. For each input $z^q = x^q + iy^q$ ($1 \le q \le Q$) from the training set, we write $U^q = U_{q,R} + iU_{q,I}$ for the input to the output neuron and $O^q = O_{q,R} + iO_{q,I}$ for the actual output. The square error function can be represented as

$$E(w) = \frac{1}{2}\sum_{q=1}^{Q}\left(O^q - d^q\right)\left(O^q - d^q\right)^{*} = \sum_{q=1}^{Q}\left[\mu_{qR}\!\left(U_{q,R}\right) + \mu_{qI}\!\left(U_{q,I}\right)\right], \tag{2.4}$$

where "$*$" signifies the complex conjugate, $d^q = d_{q,R} + i\,d_{q,I}$, and

$$\mu_{qR}(t) = \frac{1}{2}\left(f(t) - d_{q,R}\right)^2, \qquad \mu_{qI}(t) = \frac{1}{2}\left(f(t) - d_{q,I}\right)^2, \qquad t \in \mathbb{R}. \tag{2.5}$$

The neural network training problem is to look for an optimal choice $w^{*}$ of the weights that minimizes the approximation error, and the gradient method is often used to solve this minimization problem. Differentiating $E(w)$ with respect to the real and imaginary parts of the weight vector, respectively, gives

$$\frac{\partial E(w)}{\partial w_R} = \sum_{q=1}^{Q}\left[\mu'_{qR}\!\left(U_{q,R}\right) x^q + \mu'_{qI}\!\left(U_{q,I}\right) y^q\right], \tag{2.6}$$

$$\frac{\partial E(w)}{\partial w_I} = \sum_{q=1}^{Q}\left[-\mu'_{qR}\!\left(U_{q,R}\right) y^q + \mu'_{qI}\!\left(U_{q,I}\right) x^q\right]. \tag{2.7}$$

Now we describe the OSCG algorithm. Given initial weights $w^0 = w^0_R + i\,w^0_I$ at time $0$, the OSCG algorithm updates the weight vector $w$ by dealing with the real part $w_R$ and the imaginary part $w_I$ separately:

$$w^{mQ+q}_R = w^{mQ+q-1}_R - \eta_m\left[\mu'_{qR}\!\left(w^{mQ+q-1}_R \cdot x^q - w^{mQ+q-1}_I \cdot y^q\right) x^q + \mu'_{qI}\!\left(w^{mQ+q-1}_I \cdot x^q + w^{mQ+q-1}_R \cdot y^q\right) y^q\right],$$

$$w^{mQ+q}_I = w^{mQ+q-1}_I - \eta_m\left[-\mu'_{qR}\!\left(w^{mQ+q-1}_R \cdot x^q - w^{mQ+q-1}_I \cdot y^q\right) y^q + \mu'_{qI}\!\left(w^{mQ+q-1}_I \cdot x^q + w^{mQ+q-1}_R \cdot y^q\right) x^q\right], \tag{2.8}$$

for $m = 0, 1, \ldots$ and $q = 1, 2, \ldots, Q$.
For $k, q = 1, 2, \ldots, Q$ and $m = 0, 1, \ldots$, denote

$$U^{mQ+k}_{q,R} = w^{mQ+k}_R \cdot x^q - w^{mQ+k}_I \cdot y^q, \qquad U^{mQ+k}_{q,I} = w^{mQ+k}_I \cdot x^q + w^{mQ+k}_R \cdot y^q,$$

$$p^{m,q,k}_R = \mu'_{qR}\!\left(U^{mQ+k-1}_{q,R}\right) x^q + \mu'_{qI}\!\left(U^{mQ+k-1}_{q,I}\right) y^q, \qquad p^{m,q,k}_I = -\mu'_{qR}\!\left(U^{mQ+k-1}_{q,R}\right) y^q + \mu'_{qI}\!\left(U^{mQ+k-1}_{q,I}\right) x^q. \tag{2.9}$$

Then (2.8) can be rewritten as

$$\Delta w^{mQ+q}_R = -\eta_m\, p^{m,q,q}_R, \qquad \Delta w^{mQ+q}_I = -\eta_m\, p^{m,q,q}_I. \tag{2.10}$$
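To make the per-sample update concrete, the following is a minimal NumPy sketch of one OSCG step: it computes the split net input from (2.1), the directions $p_R$, $p_I$ from (2.9), and the update (2.10). It assumes the quadratic error terms in (2.5) and a tanh activation; the function and variable names are ours, not the paper's.

```python
import numpy as np

def oscg_step(wR, wI, x, y, dR, dI, eta, f=np.tanh,
              fprime=lambda t: 1.0 - np.tanh(t) ** 2):
    """One online split-complex update for a single training sample.

    wR, wI : real and imaginary parts of the weight vector (real arrays of length L)
    x, y   : real and imaginary parts of the input z = x + iy
    dR, dI : real and imaginary parts of the desired output d
    eta    : learning rate eta_m for the current training cycle
    """
    # Net input U = w . z split into real and imaginary parts, cf. (2.1).
    UR = wR @ x - wI @ y
    UI = wI @ x + wR @ y

    # mu'_{qR}(U_R) = (f(U_R) - d_R) f'(U_R) and mu'_{qI}(U_I) = (f(U_I) - d_I) f'(U_I),
    # assuming the quadratic mu's of (2.5).
    muR_prime = (f(UR) - dR) * fprime(UR)
    muI_prime = (f(UI) - dI) * fprime(UI)

    # Split-complex search directions p_R and p_I from (2.9).
    pR = muR_prime * x + muI_prime * y
    pI = -muR_prime * y + muI_prime * x

    # Update (2.10): Delta w_R = -eta * p_R, Delta w_I = -eta * p_I.
    return wR - eta * pR, wI - eta * pI
```

A full training cycle applies this step once for each sample $q = 1, \ldots, Q$ with the cycle's learning rate $\eta_m$.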
Given $0 < \eta_0 \le 1$ and a positive constant $N$, we choose the learning rate $\eta_m$ as in (2.11). Equation (2.11) can be rewritten as (2.12), which in turn implies (2.13). This type of learning rate is often used in neural network training [16].
For the convergence analysis of the OSCG algorithm, similarly to the batch version of the split-complex gradient algorithm [17], we shall need the following assumptions.
(A1) There exists a constant $c_1 > 0$ such that $\max_{t \in \mathbb{R}}\left\{|f(t)|,\; |f'(t)|,\; |f''(t)|\right\} \le c_1$. (2.14)
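As a concrete check of Assumption (A1) as stated above (boundedness of the activation and its first two derivatives), the hyperbolic tangent used in the numerical example of Section 5 satisfies it. A short verification, assuming $f = \tanh$:

```latex
% Verification that f(t) = tanh(t) satisfies Assumption (A1).
% f'(t) = 1 - tanh^2(t) and f''(t) = -2 tanh(t) (1 - tanh^2(t)).
\[
  |f(t)| \le 1, \qquad 0 < f'(t) \le 1, \qquad
  |f''(t)| = 2\,|u|\left(1 - u^{2}\right)\Big|_{u = \tanh t} \le \frac{4}{3\sqrt{3}} < 1,
\]
% so (A1) holds with c_1 = 1.
```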

Main Results
In this section, we give several lemmas and the main convergence theorems. The proofs of these results are postponed to the next section.
In order to derive the convergence theorem, we need to estimate the values of the error function (2.4) at two successive cycles of the training iteration. Denote $\rho_{m,q,R}$ and $\rho_{m,q,I}$ as in (3.2) below. The first lemma breaks the change of the error function (2.4) at two successive cycles of the training iteration into several terms.

Lemma 3.1. Suppose Assumption (A1) is valid. Then one has

where

$$\rho_{m,q,R} = \frac{1}{2}\,\mu''_{qR}\!\left(t^{m,q}_1\right)\left(U^{(m+1)Q}_{q,R} - U^{mQ}_{q,R}\right)^2, \qquad \rho_{m,q,I} = \frac{1}{2}\,\mu''_{qI}\!\left(t^{m,q}_2\right)\left(U^{(m+1)Q}_{q,I} - U^{mQ}_{q,I}\right)^2, \tag{3.2}$$

each $t^{m,q}_1 \in \mathbb{R}$ lies on the segment between $U^{(m+1)Q}_{q,R}$ and $U^{mQ}_{q,R}$, and each $t^{m,q}_2 \in \mathbb{R}$ lies on the segment between $U^{(m+1)Q}_{q,I}$ and $U^{mQ}_{q,I}$.
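As a reading aid, the $\rho$-terms in (3.2) are exactly the Lagrange remainders of the second-order Taylor expansions that the proof of Lemma 3.1 in Section 4 invokes. A sketch of the real-part expansion (a reconstruction consistent with (3.2), not one of the paper's numbered formulas):

```latex
% Second-order Taylor expansion of mu_{qR} between two successive training cycles;
% the underbraced last term is rho_{m,q,R} from (3.2); the imaginary part is analogous.
\[
  \mu_{qR}\!\left(U^{(m+1)Q}_{q,R}\right)
  = \mu_{qR}\!\left(U^{mQ}_{q,R}\right)
  + \mu'_{qR}\!\left(U^{mQ}_{q,R}\right)\left(U^{(m+1)Q}_{q,R} - U^{mQ}_{q,R}\right)
  + \underbrace{\tfrac{1}{2}\,\mu''_{qR}\!\left(t^{m,q}_{1}\right)\left(U^{(m+1)Q}_{q,R} - U^{mQ}_{q,R}\right)^{2}}_{\rho_{m,q,R}},
\]
% with t^{m,q}_1 lying between U^{(m+1)Q}_{q,R} and U^{mQ}_{q,R}.
```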
With the above Lemmas 3.1-3.3, we can prove the following monotonicity result for the OSCG algorithm.

Theorem 3.4. Let $\{\eta_m\}$ be given by (2.11) and let the weight sequence $\{w^{mQ}\}$ be generated by (2.8). Then under Assumption (A1), there are positive numbers $N^*$ and $\eta^*$ such that, for any $N > N^*$ and $0 < \eta_0 < \min\{1, \eta^*\}$, one has

$$E\!\left(w^{(m+1)Q}\right) \le E\!\left(w^{mQ}\right), \qquad m = 0, 1, 2, \ldots$$

To give the convergence theorem, we also need the following estimation.
Lemma 3.5. Let $\{\eta_m\}$ be given by (2.11). Then under Assumption (A1), with the same positive numbers $N^*$ and $\eta^*$ as in Theorem 3.4, for any $N > N^*$ and $0 < \eta_0 < \min\{1, \eta^*\}$ one has

The following lemma gives an estimate of a series, which is essential for the proof of the convergence theorem.

Lemma 3.6 (see [16]). Suppose that a series $\sum_{n=1}^{\infty} a_n$ is convergent and $a_n \ge 0$. If there exists a constant $c_6 > 0$ such that

The following lemma will be used to prove the convergence of the weight sequence.
Lemma 3.7. Suppose that the function and the sequence $\{\theta^n\}$ satisfy the conditions of Theorem 14.1.5 in [18]; then there exists a point $\theta^* \in \Phi_1$ such that $\lim_{n \to \infty} \theta^n = \theta^*$.
Now we are ready to give the main convergence theorem.
Theorem 3.8. Let $\{\eta_m\}$ be given by (2.11) and let the weight sequence $\{w^n\}$ be generated by (2.8). Then under Assumption (A1), there are positive numbers $N^*$ and $\eta^*$ such that, for any $N > N^*$ and $0 < \eta_0 < \min\{1, \eta^*\}$, one has

$$\lim_{n \to \infty}\left\|\frac{\partial E(w^n)}{\partial w_R}\right\| = 0, \qquad \lim_{n \to \infty}\left\|\frac{\partial E(w^n)}{\partial w_I}\right\| = 0. \tag{3.13}$$

Furthermore, if Assumption (A2) also holds, then there exists a point $w^* \in \Phi_0$ such that $\lim_{n \to \infty} w^n = w^*$.

Proofs
Proof of Lemma 3.1. Using Taylor's formula, we have

where $t^{m,q}_1$ lies on the segment between $U^{(m+1)Q}_{q,R}$ and $U^{mQ}_{q,R}$. Similarly, we also have a point $t^{m,q}_2$ between $U^{(m+1)Q}_{q,I}$ and $U^{mQ}_{q,I}$ such that

4.6
By (2.9), (2.10), (3.1), and the Mean-Value Theorem, for $2 \le k \le Q$ and $m = 0, 1, \ldots$, we have

where $c_7 = c_1 \max_{1 \le q \le Q}\left(\|x^q\|^2 + \|y^q\|^2\right)$. Similarly we have

In particular, as

where

where the $c$'s are nonnegative constants. Recalling $\eta_m \le 1$, we then have

where $c_s = 1 + 2Q c_L$ and $c_L = \max\{c_1, c_2, \ldots, c_{s-1}\}$. Similarly, we also have

4.15
This together with 2.9 and 4.6 leads to
Proof of Lemma 3.3. Recalling Lemmas 3.1 and 3.2, we conclude that

(4.17)

Then (3.6) is obtained by letting $c_5 = Q(c_3 + c_4)$.

Proof of Theorem 3.4. In virtue of (3.6), the core of the proof is to verify that

4.18
In the following we will prove (4.18) by induction. First we take $\eta_0$ such that

4.19
For m ≥ 0 suppose that

4.20
Next we will prove that

4.21
Notice that

4.22
where $t^{m,k}_5$ lies on the segment between $U^{(m+1)Q}_{k,R}$ and $U^{mQ}_{k,R}$, and $t^{m,k}_6$ lies on the segment between $U^{(m+1)Q}_{k,I}$ and $U^{mQ}_{k,I}$. Similar to (4.14), we also have the following estimation:

where $c_{11} = c_2 Q$. By (4.6) and (4.22)-(4.23) we know that there are positive constants $c_{12}$ and $c_{13}$ such that

where $c_{14} = c_{12} + c_{13}$ and $c_{15} = c_{11} + c_{12} + c_{13}$. Taking squares of both sides of the above inequality gives

4.25
Now we sum up the above inequality over $k = 1, \ldots, Q$ and obtain

4.28
On the other hand, from (4.22) we have

4.29
Similar to the deduction of (4.24), from (4.29) we have

4.30
It can be easily verified that, for any positive numbers $a$, $b$, and $c$,

Applying (4.31) to (4.30) implies that

4.32
Similarly, we can obtain the counterpart of (4.28) as

and the counterpart of (4.32) as

4.34
From (4.28) and (4.33) we have

4.35
From (4.32) and (4.34) we have

4.36
Using (2.11) and (4.36), we can get

4.38
Using (4.20) and (4.35), we obtain

4.39
Combining (4.38) and (4.39), we have

4.40
Thus, to validate (4.21), we only need to prove the following inequality:

Each term of (4.41) can be assured for $N \ge N^*$ and $0 < \eta_0 \le \min\{1, \eta^*\}$ by setting $N^*$ and $\eta^*$ appropriately.
Proof of Lemma 3.5. From Lemma 3.3 we have

4.45
Using (2.9) and (4.6), we can find a constant $c_{17}$ such that

This, together with (2.13), leads to

4.47
Thus, from (4.45) and (4.47), it holds that

4.49
Proof of Lemma 3.6. This lemma is the same as Lemma 2.1 of [16].
Proof of Lemma 3.7. This result is almost the same as Theorem 14.1.5 in [18], and the details of the proof are omitted.
Proof of Theorem 3.8. Using (2.9), (4.6), (4.14), and (4.15), we can find a constant $c_{18}$ such that

4.50
Thus, from (4.6), (4.50), and the Cauchy-Schwarz inequality, there exists a constant $c_{19}$ such that, for any vector $e$,

From (2.6) and (4.22) we have

4.52
Using (2.6), (2.9), and Lemma 3.5, we have

Therefore, when $q = 0$, we complete the proof of $\lim_{m \to \infty} \partial E(w^{mQ+q})/\partial w_R = 0$, and we can similarly show that $\lim_{m \to \infty} \partial E(w^{mQ+q})/\partial w_R = 0$ for $q = 1, \ldots, Q$. Thus, we have shown that

Numerical Example
In this section we illustrate the convergence behavior of the OSCG algorithm by a simple numerical example. The well-known XOR problem is a benchmark in the neural network literature. As in [13], the training samples of the encoded XOR problem for the CVNN are presented as follows:

$$d^1 = 1, \qquad z^2 = -1 + i, \quad d^2 = 0, \qquad z^3 = 1 - i, \quad d^3 = 1 + i, \qquad z^4 = 1 + i, \quad d^4 = i. \tag{5.1}$$
This example uses a network with two input nodes (including a bias node) and one output node. The transfer function is tansig(·) in MATLAB, which is a commonly used sigmoid function. The parameter $\eta_0$ is set to $0.1$ and $N$ is set to $1$. We carry out the test with the initial components of the weights chosen stochastically in $[-0.5, 0.5]$. Figure 1 shows that the gradients tend to zero and that the square error decreases monotonically as the number of iterations increases, finally tending to a constant. This supports our theoretical analysis.

Figure 1: Convergence behavior of the OSCG algorithm for solving the XOR problem: sum of gradient norms $\|\partial E(w^n)/\partial w_R\| + \|\partial E(w^n)/\partial w_I\|$.
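For readers who want to reproduce this behavior, the following is a self-contained NumPy sketch of the test described above, with tanh standing in for MATLAB's tansig, $\eta_0 = 0.1$, $N = 1$, and initial weights drawn from $[-0.5, 0.5]$. The complex XOR encoding below is only an illustrative stand-in (not necessarily the exact pairs of (5.1)), the number of training cycles is our choice, and the decaying learning rate is an assumed form since (2.11) is not reproduced here. The script records the square error (2.4) and the gradient-norm sum plotted in Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(t):                     # tanh, standing in for MATLAB's tansig
    return np.tanh(t)

def fprime(t):
    return 1.0 - np.tanh(t) ** 2

# Illustrative complex-encoded XOR pairs (z, d); a hypothetical stand-in, not
# necessarily the exact encoding (5.1) of [13].
samples = [
    (np.array([1.0, -1.0 - 1.0j]), -1.0 - 1.0j),
    (np.array([1.0, -1.0 + 1.0j]),  1.0 + 1.0j),
    (np.array([1.0,  1.0 - 1.0j]),  1.0 + 1.0j),
    (np.array([1.0,  1.0 + 1.0j]), -1.0 - 1.0j),
]
# The first component of each input plays the role of the bias node.

eta0, N = 0.1, 1.0            # parameters as in the paper's test
cycles = 2000                 # number of training cycles (our choice)
L = samples[0][0].size
wR = rng.uniform(-0.5, 0.5, size=L)   # initial weights in [-0.5, 0.5]
wI = rng.uniform(-0.5, 0.5, size=L)

errors, grad_norms = [], []
for m in range(cycles):
    eta = eta0 / (1.0 + m / N)        # assumed decaying rate; (2.11) not reproduced here
    for z, d in samples:              # one pass over the samples = one training cycle
        x, y, dR, dI = z.real, z.imag, d.real, d.imag
        UR, UI = wR @ x - wI @ y, wI @ x + wR @ y
        muR = (f(UR) - dR) * fprime(UR)
        muI = (f(UI) - dI) * fprime(UI)
        wR = wR - eta * (muR * x + muI * y)     # real-part update, cf. (2.10)
        wI = wI - eta * (-muR * y + muI * x)    # imaginary-part update, cf. (2.10)
    # Record the square error (2.4) and the gradient-norm sum shown in Figure 1.
    E, gR, gI = 0.0, np.zeros(L), np.zeros(L)
    for z, d in samples:
        x, y, dR, dI = z.real, z.imag, d.real, d.imag
        UR, UI = wR @ x - wI @ y, wI @ x + wR @ y
        E += 0.5 * ((f(UR) - dR) ** 2 + (f(UI) - dI) ** 2)
        gR += (f(UR) - dR) * fprime(UR) * x + (f(UI) - dI) * fprime(UI) * y
        gI += -(f(UR) - dR) * fprime(UR) * y + (f(UI) - dI) * fprime(UI) * x
    errors.append(E)
    grad_norms.append(np.linalg.norm(gR) + np.linalg.norm(gI))

print(f"E: {errors[0]:.4f} -> {errors[-1]:.4f}, "
      f"final gradient norm: {grad_norms[-1]:.2e}")
```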

Conclusion
In this paper we investigate some convergence properties of an OSCG training algorithm for two-layered CVNNs, using an adaptive learning rate in the algorithm. Under the condition that the activation function and its derivatives up to second order are bounded, it is proved that the error function is monotonically decreasing during the training process.
With this result, we further prove that the gradient of the error function tends to zero and that the weight sequence tends to a fixed point. A numerical example is given to support our theoretical analysis. We note that these results closely parallel the convergence results of the batch split-complex gradient training algorithm for CVNNs given in [17]; thus our results also offer a theoretical explanation of the relationship between the OSCG algorithm and the batch split-complex algorithm. The convergence results in this paper can be generalized to a more general case, namely multilayer CVNNs.
