The Construction and Approximation of ReLU Neural Network Operators

In the present paper, we construct a new type of two-hidden-layer feedforward neural network operators with ReLU activation function. We estimate the rate of approximation by the new operators by using the modulus of continuity of the target function. Furthermore, we analyze features such as parameter sharing and local connectivity in this kind of network structure.


Introduction
Artificial neural network is a fundamental method in machine learning, which has been applied in many fields such as pattern recognition, automatic control, signal processing, auxiliary decision-making, and artificial intelligence. In particular, the successful applications of deep (multihidden layer) neural networks in image recognition, natural language processing, computer vision, etc. developed in recent years have made neural networks attract great attention. In fact, ever the function of XOR gate was implemented by adding one layer from the simplest perceptron, which leaded to the single-hidden-layer feedforward neural network.
A single-hidden-layer feedforward neural network has the expression form where c i , θ i ði = 1, 2,⋯,nÞ are called as output weights and thresholds, the dimension of input weights ω i ði = 1, 2,⋯,nÞ corresponds to that of the input x, ϕ, is called the activation function of this network, and n 1 is the number of neurons in the hidden layer. If x ∈ ℝ d , A 1 = ½ω 1 ω 2 ⋯ ω n 1 T (T denotes the transpose) is the input weight matrix of size n 1 × d, Θ 1 = ½θ 1 , θ 2 ,⋯,θ n T and C 1 = ½c 1 , c 2 ,⋯,c n T are vectors of thresholds and output weights, respectively, then ð1Þ can be written as where ϕðA 1 x + Θ 1 Þ means that ϕ acts on each component of A 1 x + Θ 1 . Now, the architecture of the neural network with two hidden layers is really not difficult to understand. If the second hidden layer contains n 2 neurons, the input weight matrix A 2 is the size of n 2 × n 1 , the vector of thresholds is Θ 2 , and the output weight vector is O; then, the two-hidden-layer feedforward neural network can be mathematically expressed as We call w = max fn 1 , n 2 g as the width of the network N ðxÞ, and its depth is naturally 2.
The theoretical research and applications of the singlehidden-layer neural network model had been greatly developed in the 80's and 90's of last century; particularly, there were also some research results on the neural networks with multihidden layers at that time. So, in [1], Pinkus pointed out that "Nonetheless there seems to be reasonable to conjecture that the two-hidden-layer model may be significantly more promising than the single layer model, at least from a purely approximation-theoretical point of view. This problem certainly warrants further study." However, whether it is a single-hidden-layer or multihidden-layer neural network, three fundamental issues are always involved: density, complexity, and algorithms.
The so-called density or universal approximation of a neural network structure means that for any given error accuracy and the target function in a function space with some metrics, there is a specific neural network model (except for the input x, other parameters are determined) such that the error between the output and target is less than the preaccuracy. In the 1980s and 1990s, the research on the density of feedforward neural network has achieved many satisfactory results [2][3][4][5][6][7][8][9]. Since the single-hidden-layer neural network is an extreme case of the multilayer neural networks, the current focus of neural network research is still on complexity and algorithms. So-called the complexity of a neural network means that to guarantee a prescribed degree of approximation, a neural network model requires the numbers of structural parameters, including the number of layers (or depth), the number of neuron in each layer (sometimes use width), and the number of link weights and the number of thresholds. In particular, it is of interest to have more equal weights and thresholds, which is called as the parameter sharing, as this reduces computational complexity. The representation ability that has attracted much attention in deep neural networks is actually the study of complexity problem, which needs to be investigated extensively and urgently.
The constructive method is an important approach to the study of complexity, which is applicable to single-and multiple-hidden-layer neural network. In fact, there are two cases here: one is that the depth, width, and approximation degree are given, while the weights and thresholds are uncertain; the other is that all these are given; that is, the neural network model is completely determined. In order to determine the weights and thresholds in the first kind of neural network, we simply use samples to learn or train. Theoretically, the second kind of neural network can be applied directly, but the parameters are often fine-tuned with a small number of samples before use. There have been many results about the constructions of network operators [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]. It can be seen that these research results have an important guiding role in the construction and design of neural networks. Therefore, the purpose of this paper is to construct a kind of two-hidden-layer feedforward neural network operators with the ReLU activation function, and the upper bound estimate of approximation (or regression) ability of this neural network for two variable continuous function defined on ½−1, 1 2 is given.
The rest of the paper is organized as follows: in Section 2, we introduce new two-hidden-layer neural network operators with ReLU activation function and give the approximation rate of approximation by the new operators. In Section 3, we give the proof of the main result. Finally, in Section 4, we give some numerical experiments and discussions.

Construction of ReLU Neural Network Operators and Its Approximation Properties
Let r : ℝ ⟶ ℝ denote the rectified linear unit (ReLU), i.e., rðxÞ = max f0, xg: For any ðx 1 , Obviously, σ is a continuous function of two variables supported on ½−1, 1 2 : By using the fact that jxj = rðxÞ + r ð−xÞ, σ can be rewritten as follows: From the above representation, we see that σðx 1 , x 2 Þ can be explained as the output of a two-hidden-layer feedforward neural network. It is obvious that σ possesses the following some important properties: For any continuous function f ðx 1 , x 2 Þ on ½−1, 1 2 , we define the following neural network operator: where bxc is the largest integer not greater than x, and dxe denotes the smallest integer not less than x. We prove that the rate of approximation by N n ðf Þ can be estimated by using the modulus of smoothness of the target function. In fact, we have Remark 2. For 0 < α < 1, we define the following neural network operators: Using a similar process of the proof in Theorem 1, we can getÑ ð10Þ for any ðx 1 , Remark 4. Now, we describe the structure of N n ðf Þ by using the form (3).
The input matrix of the first hidden layer is and its size is 4ð½n + ffiffiffi n p − d−n − ffiffiffi n p e + 1Þ 2 × 2. The bias vector of the first hidden layer is and the dimension is 4ð½n + ffiffiffi n p − d−n − ffiffiffi n p e + 1Þ 2 . The input matrix of the second hidden layer is Its general term and dimension are f ððk 1 /n + ffiffiffi n p Þ, ðk 2 /n + ffiffiffi n p ÞÞ and ð½n + ffiffiffi n p − d−n − ffiffiffi n p e + 1Þ 2 , respectively.
We can see that there are two different numbers in weight matrices A 1 and A 2 , respectively. That is, neural network operators N n ð f Þ have a strong weight sharing feature. There are some results about the constructions of this kind of neural networks [14,[27][28][29]. Moreover, A 2 shows that this neural network is locally connected. Finally, the simplicity of bias vector Θ 2 also greatly reduces the complexity of the neural network.

Proof of the Main Result
To prove Theorem 1, we need the following auxiliary lemma.

Numerical Experiments and Some Discussions
In this section, we give some numerical experiments to illustrate the theoretical results. We take f ðx 1 , x 2 Þ = x 2 1 + x 2 2 , ðx 1 , x 2 Þ ∈ ½−1, 1 2 as the target function. Set ð52Þ Figures 1-3 show the results of e 100 ðx 1 , x 2 Þ, e 1000 ðx 1 , x 2 Þ and e 10000 ðx 1 , x 2 Þ, respectively. When n equals to 10 6 , the amount of calculation of N n ðf ; x 1 , x 2 Þ is large. Therefore, we choose 6 specific points and the corresponding values of e n ðx 1 , x 2 Þ, which are shown in Table 1.
From the results of experiments we see that as the parameter n of neural network operators increases, the approximation effect increases; we only need to notice M f = 2, ωðf ; ð1/ ffiffiffi n p ÞÞ = 4/ ffiffiffi n p , and after the simple calculation, we can demonstrate the validity of the obtained result.
If we investigate network operators (6) carefully, we cannot help but ask why we use f ððk 1 /n + ffiffiffi n p Þ, ðk 2 /n + ffiffiffi n p ÞÞ