Error Bounds for Approximations Using Multichannel Deep Convolutional Neural Networks with Downsampling

Deep learning with specific network topologies has been successfully applied in many fields. However, it is often criticized for its lack of theoretical foundations, especially for structured neural networks. This paper theoretically studies multichannel deep convolutional neural networks equipped with the downsampling operator, a structure frequently used in applications. The results show that the proposed networks have outstanding ability to approximate and generalize functions from the ridge class and from Sobolev space. This not only answers an open and crucial question in learning theory of why multichannel deep convolutional neural networks are universal, but also reveals the convergence rates.


Introduction
Deep learning [1] has made remarkable achievements in many fields. Essentially, it uses structured neural networks, loosely modeled on the biological nervous system, to extract data features for specific learning goals. Among these structured networks, deep convolutional neural networks (DCNNs) are particularly important and have achieved state-of-the-art performance in many domains [2-4]. In practice, multichannel convolution is normally used, and the resulting multichannel deep convolutional neural networks (MDCNNs) have also achieved excellent performance in classification [5, 6], natural language processing [7], biology [8-10], and many other domains [11-13].
However, compared with the successful applications of MDCNNs, their theoretical basis is incomplete, which is the main reason they are widely criticized. In this paper, we present approximation theorems for downsampled MDCNNs, in which the downsampling operator plays the role of pooling and reduces the width of deep neural networks. Before giving the main results on downsampled MDCNNs, we briefly review the basic concepts of fully connected neural networks (FNNs) and DCNNs.
An FNN with input vector x ∈ ℝ^d and L hidden layers of neurons {h^(j) : ℝ^d → ℝ^{d_j}} with widths d_j ∈ ℕ_+ is defined iteratively by

h^(j)(x) = σ(W^(j) h^(j−1)(x) + b^(j)),  j = 1, …, L,   (1)

where σ : ℝ → ℝ is a univariate activation function acting componentwise on vectors, W^(j) ∈ ℝ^{d_j×d_{j−1}} is a weight matrix, b^(j) ∈ ℝ^{d_j} is a bias vector in layer j, and h^(0)(x) = x with width d_0 = d. The form of (1) used to approximate functions is a linear combination of the last-layer neurons,

f^{W,b}(x) = c · h^(L)(x),  c ∈ ℝ^{d_L}.   (2)

Note that if L = 1, the FNNs defined by (1) degenerate into the well-known classical shallow neural networks. The most important ingredients of (2) for learning functions are the free parameters, namely the weights and biases. The form (2) involves ∑_{i=1}^{L} d_i d_{i−1} weight parameters and ∑_{i=1}^{L} d_i bias parameters to be trained, leading to huge computational complexity when d_i is large.
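As a concrete illustration of (1), (2), and the parameter count above, the following minimal NumPy sketch (the network sizes and the use of ReLU as σ are illustrative assumptions, not choices made in this paper) evaluates an FNN forward pass and counts the trainable weights and biases.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def fnn_forward(x, weights, biases, sigma=relu):
    """Forward pass of (1): h^(j)(x) = sigma(W^(j) h^(j-1)(x) + b^(j))."""
    h = x
    for W, b in zip(weights, biases):
        h = sigma(W @ h + b)
    return h

# Parameter count behind (2): sum_i d_i*d_{i-1} weights and sum_i d_i biases.
d = [4, 8, 8, 3]                      # d_0 = input dim, d_1..d_L hidden widths (arbitrary example)
weights = [np.random.randn(d[j], d[j - 1]) for j in range(1, len(d))]
biases  = [np.random.randn(d[j]) for j in range(1, len(d))]
n_weights = sum(d[j] * d[j - 1] for j in range(1, len(d)))
n_biases  = sum(d[j] for j in range(1, len(d)))
x = np.random.randn(d[0])
print(fnn_forward(x, weights, biases).shape, n_weights, n_biases)
```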
For DCNNs, we use the definition from [14]. Let d, d_j, and L be positive integers. The convolution of a filter w ∈ ℝ^K and x ∈ ℝ^d is defined as w ∗ x ∈ ℝ^{d+K−1}, where

(w ∗ x)_i = ∑_{j=1}^{d} w_{i−j+1} x_j,  i ∈ [d + K − 1]

([d + K − 1] denotes the set {1, 2, …, d + K − 1}), which can be equivalently rewritten as

w ∗ x = T^w x,   (3)

where T^w ∈ ℝ^{(d+K−1)×d} is the Toeplitz-type matrix given by

T^w = (w_{i−j+1})_{i∈[d+K−1], j∈[d]},  with w_k := 0 for k ∉ [K].   (4)

With the above notations, a DCNN with input vector x ∈ ℝ^d and L hidden layers of neurons {h^(j)} is defined iteratively by

h^(j)(x) = σ(T^{w^(j)} h^(j−1)(x) + b^(j)),  j = 1, …, L,   (5)

where σ : ℝ → ℝ is a univariate activation function as before, w^(j) ∈ ℝ^K denotes a filter supported on [K], b^(j) ∈ ℝ^{d_j} is a bias vector in layer j, and h^(0)(x) = x. The form of (5) used to learn functions is again a linear combination of the last-layer neurons,

f^{w,b}(x) = c · h^(L)(x),  c ∈ ℝ^{d_L}.   (6)

Compared with FNNs, the DCNNs defined by (5) involve a sparse matrix T^{w^(j)} in the j-th layer, each row of which has no more than K nonzero entries. The numbers of weights and biases are KL and ∑_{i=1}^{L} d_i, respectively, a large reduction in parameters.
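The equivalence w ∗ x = T^w x in (3) and (4) can be checked numerically; the sketch below (assuming NumPy, with 0-based indices standing in for the 1-based indices of (3)) builds T^w explicitly and compares it with a full discrete convolution.

```python
import numpy as np

def toeplitz_conv_matrix(w, d):
    """Toeplitz-type matrix of (4): entry (i, j) equals w_{i-j+1} (1-based), zero outside the filter support."""
    K = len(w)
    T = np.zeros((d + K - 1, d))
    for i in range(d + K - 1):          # 0-based row index
        for j in range(d):              # 0-based column index
            k = i - j                   # 0-based analogue of i - j + 1
            if 0 <= k < K:
                T[i, j] = w[k]
    return T

w = np.array([1.0, -2.0, 0.5])           # filter of size K = 3
x = np.random.randn(5)                    # input of size d = 5
T = toeplitz_conv_matrix(w, len(x))
# w * x equals T^w x and equals the "full" discrete convolution (length d + K - 1 = 7)
assert np.allclose(T @ x, np.convolve(w, x, mode="full"))
```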
However, this kind of DCNN results in increasing width: for an input signal x ∈ ℝ^d, we have T^w x ∈ ℝ^{d+K−1}, a structure that is rarely used in practice. To improve this unusual structure, downsampling, also known as a pooling operator, is applied in DCNNs to formally reduce the width [15, 16]. The key role of downsampling is to reduce the dimension of the features while retaining the effective information. To describe it mathematically, we adopt the general version given below.
Definition 1 (downsampling [15]). Let x ∈ ℝ^d and let the downsampling set S ⊂ [d] be an index set. D_S is called the downsampling operator indexed by S if D_S(x) = x_S, where x_S denotes the subvector of x indexed by S.
In fact, besides downsampling, real applications usually utilize multiple filters in each layer of a DCNN to obtain multichannel outputs. Each output is made up of channel combinations that provide the flexibility needed to avoid variance issues and loss of information [17], and different channels play the role of extracting multiple features of the input data [16]. Specifically, as pointed out in [18], convolution from the current layer to the next in the multichannel case is often organized as follows: the inputs of each input channel are first convolved with all related filters to form the convolved inputs; the convolved outputs are then composed of linear combinations of the convolved inputs; and finally, an activation function (usually ReLU) is applied componentwise to each convolved output. With this in mind, the key to the MDCNNs considered in this paper is multichannel convolution, which is defined mathematically as follows.
Definition 2 (multichannel convolution). Let C, C′, and K ∈ ℕ_+ be the input channel size, output channel size, and filter size, respectively. The filters W = (W_{n,j,i})_{n∈[K], j∈[C′], i∈[C]} form a third-order tensor. Let X ∈ ℝ^{d×C} be the input data with C channels. The output of channel j ∈ [C′], denoted Y_{:,j}, without bias and activation function, is defined as the sum of the convolved input channels, i.e.,

Y_{:,j} = ∑_{i=1}^{C} T^{W_{:,j,i}} X_{:,i},  j ∈ [C′],

where the Toeplitz-type matrix T^{W_{:,j,i}} is defined as in (3).
Let B ∈ ℝ^{(d+K−1)×C′} be a bias matrix and σ be the activation function; the multichannel convolution with bias and activation is then

Conv^σ_{W,B}(X)_{:,j} = σ(∑_{i=1}^{C} T^{W_{:,j,i}} X_{:,i} + B_{:,j}),  j ∈ [C′].

The whole multichannel convolution structure is shown in Figure 1.
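A hedged NumPy sketch of Definition 2 follows: the function name multichannel_conv and the tensor shapes are ours, chosen to mirror the (K, C′, C) indexing of W; each output channel is the sum of the per-input-channel convolutions, here computed with np.convolve instead of an explicit Toeplitz matrix.

```python
import numpy as np

def multichannel_conv(X, W, B=None, sigma=lambda u: np.maximum(0.0, u)):
    """
    Definition 2 sketch: X is (d, C_in), W is (K, C_out, C_in),
    Y[:, j] = sigma( sum_i T^{W[:, j, i]} X[:, i] + B[:, j] ), with shape (d + K - 1, C_out).
    """
    d, C_in = X.shape
    K, C_out, _ = W.shape
    Y = np.zeros((d + K - 1, C_out))
    for j in range(C_out):
        for i in range(C_in):
            Y[:, j] += np.convolve(W[:, j, i], X[:, i], mode="full")  # T^{W[:,j,i]} X[:, i]
    if B is not None:
        Y += B
    return sigma(Y)

X = np.random.randn(5, 2)            # d = 5, C = 2 input channels
W = np.random.randn(3, 4, 2)         # K = 3, C' = 4 output channels
B = np.random.randn(5 + 3 - 1, 4)    # bias matrix in R^{(d+K-1) x C'}
print(multichannel_conv(X, W, B).shape)   # (7, 4)
```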
Remark 3. We remark that Definition 2 implies that there are C × C′ filters in total, each of the same size K. For convenience, we assume that the input data have the same size d for all channels, so that the corresponding outputs also have the same size; when C = C′ = 1, the multichannel convolution degenerates into equation (3). The multichannel convolution from the current layer to the next provides the main ingredient of MDCNNs. Combined with the downsampling operator given by Definition 1, MDCNNs with downsampling are defined below.
Definition 4 (MDCNNs with downsampling). Let C^(l), K^(l) ∈ ℕ_+ be the channel size and filter size in layer l (1 ≤ l ≤ L), let the set A = {l_1, l_2, …, l_n} ⊂ [L] with 1 ≤ l_1 < l_2 < ⋯ < l_n ≤ L index the layers at which downsampling is applied, and let A_j ⊂ [d_j] (1 ≤ j ≤ n) be the downsampling sets. An MDCNN with downsampling operators D_{A_j} and input data X ∈ ℝ^{d×C}, having widths {d_l}_{l=0}^{L}, is defined iteratively by d_0 = d, h^(0)(X) = X, and

h^(l)(X) = Conv^σ_{W^(l),B^(l)}(h^(l−1)(X)),  l ∉ A,
h^(l_j)(X) = D_{A_j}(Conv^σ_{W^(l_j),B^(l_j)}(h^(l_j−1)(X))),  j = 1, 2, …, n.

Here, all channels in the same layer have equal size, the downsampling operator D_{A_j} acts on each channel of layer l_j, card(A_j) denotes the cardinality of A_j, and the tensor W^(l) denotes the filters between layers l − 1 and l. Finally, the form of MDCNNs used to approximate functions is a linear combination of the entries of the last layer,

∑_{j,i} c_{j,i} (h^(L)(X))_{j,i},   (11)

where c_{j,i} ∈ ℝ are coefficients. The structure of MDCNNs is shown in Figure 2.
Remark 5. The form (11) indicates that the objective function has three important ingredients {W^(i), B^(i), C^(i)}, corresponding to the filters, biases, and channel sizes. From another perspective, it belongs to the hypothesis space generated by these ingredients. If all layers have only one channel, MDCNNs degenerate into DCNNs. We say that an MDCNN with downsampling has uniform filter lengths if all channels in every layer have the same size; under this circumstance, we call the MDCNN with downsampling uniform. All MDCNNs with downsampling considered in our main results are uniform.
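To make Definition 4 concrete, here is a hypothetical forward-pass sketch (NumPy; the helper name mdcnn_forward, the random parameters, and the particular downsampling sets are illustrative assumptions): multichannel convolution is applied layer by layer, and whenever a layer index lies in A, the corresponding downsampling operator is applied to every channel, which keeps the width equal to d.

```python
import numpy as np

def mdcnn_forward(x, filters, biases, downsample, sigma=lambda u: np.maximum(0.0, u)):
    """
    Hypothetical sketch of Definition 4: filters[l] has shape (K_l, C_l, C_{l-1}),
    biases[l] matches the pre-downsampling output, and downsample maps a layer
    index l in A to the index set A_l applied to every channel of that layer.
    """
    H = x[:, None] if x.ndim == 1 else x              # (d, C_0)
    for l, (W, B) in enumerate(zip(filters, biases), start=1):
        K, C_out, C_in = W.shape
        d_in = H.shape[0]
        Y = np.zeros((d_in + K - 1, C_out))
        for j in range(C_out):
            for i in range(C_in):
                Y[:, j] += np.convolve(W[:, j, i], H[:, i], mode="full")
        Y = sigma(Y + B)
        if l in downsample:                           # D_{A_l} acts on each channel
            Y = Y[downsample[l], :]
        H = Y
    return H

d, K = 5, 3
x = np.random.randn(d)
filters = [np.random.randn(K, 3, 1), np.random.randn(K, 3, 3)]
biases  = [np.random.randn(d + K - 1, 3), np.random.randn(d + K - 1, 3)]
downsample = {1: np.arange(K, K + d), 2: np.arange(d)}   # e.g. A_1 = [K+1 : K+d], A_2 = [1 : d]
h = mdcnn_forward(x, filters, biases, downsample)
print(h.shape)   # (5, 3): the width stays d thanks to downsampling
```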
However, the existing theoretical studies cannot be applied to MDCNNs. For example, Zhou [14, 15, 19] only considers single-channel DCNNs whose widths increase with depth. Multichannel convolution was also used in the recent Butterfly-Net [20, 21], which is based on the butterfly algorithm; however, multichannel convolution is only part of its network structure, and the structure of our MDCNNs, which relies solely on multichannel convolution, is different from that of Butterfly-Net. Moreover, Butterfly-Net studies the approximation of the Fourier representation of the input data, which also differs from our setting. To investigate the approximation ability of MDCNNs, we study their behavior on ridge functions and on functions from the Sobolev space H^r(ℝ^d). The MDCNNs considered in this paper have finite width d, finite filter size K + 1 (< d), and a finite number of channels in each layer. In addition, the activation function is the popular rectified linear unit (ReLU), a univariate function given by σ(u) = (u)_+ = max{0, u}, u ∈ ℝ, which is often used to guarantee the nonlinearity of the network. As pointed out by [19, 22], linear combinations of ReLU units can express the objective functions with arbitrary accuracy. Hence, the main proof technique of our theorems is to construct structured MDCNNs that realize the ReLU approximations of the objective functions. In addition, we emphasize the benefit of multiple channels: different channels of a given layer can extract transformed data features from the previous layer. Concretely, we utilize channels to store the ReLU units, to obtain new ReLU units, and to deposit the initial data. In this way, our proposed MDCNNs achieve better results in approximating functions than the structures of DCNNs and FNNs. In summary, we make the following contributions to the approximation theory of MDCNNs:

(i) We construct MDCNNs by introducing multichannel convolution, so that different channels extract different data features, and we introduce the downsampling operator into the MDCNNs, so that the width-increasing nature from layer to layer is avoided.

(ii) We present a theorem for approximating ridge functions of the form g(ξ · x), with ξ ∈ ℝ^d and g : ℝ → ℝ, by MDCNNs, demonstrating that for this widely used, simple but important function family, MDCNNs have better approximation abilities than FNNs and DCNNs.

(iii) We prove a theorem for approximating functions in the Sobolev space H^r(ℝ^d), which shows the universality of MDCNNs and the benefit of depth; it also reveals better approximation performance than FNNs and DCNNs.

The structure of this article is organized as follows: in Section 2, we present the main results for approximating functions from the ridge class and from Sobolev space, and we compare them with related work. Proofs of our main results are given in Section 3. Finally, we summarize the research of this paper in Section 4.

Main Results
Complicated functions can often be approximated by simple families [23], such as polynomials, splines, wavelets, radial basis functions, and ridge functions. Specifically, many approximation results are based on combinations of ridge functions [24, 25]. Our first main result for downsampled MDCNNs shows their good approximation ability for the ridge class. After that, we further provide the approximation ability of MDCNNs for functions from Sobolev space. These two approximations constitute our main results. The main technique of our proofs is to first construct approximations of the objective functions by linear combinations of ReLU units and then specify the networks' parameters such that the constructed MDCNNs' outputs match these linear ReLU approximations.

2.1. Approximation on Ridge Functions. Mathematically, a ridge function is any multivariate real-valued function of the form f(x) = g(ξ · x), induced by an unknown direction vector ξ ∈ ℝ^d and an unknown univariate external function g : ℝ → ℝ. Our first result shows the approximation ability of MDCNNs for ridge functions with external function g ∈ K_α (the Lipschitz-α class with constant C_α) and x ∈ B^d, the unit ball of ℝ^d. Throughout this paper, we use {U, P} to denote the numbers of computation units (widths or hidden units [15]) and free parameters, respectively.

Theorem 6. Let g ∈ K_α, ξ ∈ B^d, L_0 = ⌈d/K⌉, and T ∈ ℕ_+. Then there exists a uniform downsampled MDCNN of depth L = T + L_0 + 2 with at most 3 channels in each layer whose output ∑_{j,i} c_{j,i}(h^(L)(x))_{j,i}, with c_{j,i} ∈ ℝ, approximates f(x) = g(ξ · x) on B^d at the convergence rate given in (15). The number of computation units is U = 3L + 2(L_0 − 1)d − 3L_0, and the number of free parameters is P = 3T + d + 6.

Remark 7. The constructed MDCNNs have finite channels, finite width, and finite filter sizes, and the convergence rate (15) is not only dimension-free but also reveals the benefit of depth. Given an arbitrary approximation accuracy ε ∈ (0, 1), Theorem 6 shows that we need at least T ≥ 2(2C_α/ε)^{1/α}. Taking T = ⌈2(2C_α/ε)^{1/α}⌉, it takes L = ⌈2(2C_α/ε)^{1/α}⌉ + ⌈d/K⌉ + 2 layers, U = 3⌈2(2C_α/ε)^{1/α}⌉ + 2(⌈d/K⌉ − 1)d + 6 computation units, and P = 3⌈2(2C_α/ε)^{1/α}⌉ + d + 6 free parameters to achieve (15).
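The following small calculation (a plug-in of the formulas in Remark 7; the value C_α = 1 and the accuracy ε = 0.1 are assumed for illustration) shows how the depth, computation units, and free parameters scale for a Lipschitz-1 external function with d = 5 and K = 3, matching the setting of Remark 8 below.

```python
import math

def mdcnn_budget(eps, alpha, C_alpha, d, K):
    """Depth/unit/parameter counts from Remark 7, evaluated for concrete inputs."""
    T  = math.ceil(2 * (2 * C_alpha / eps) ** (1 / alpha))
    L0 = math.ceil(d / K)
    L  = T + L0 + 2                       # depth
    U  = 3 * T + 2 * (L0 - 1) * d + 6     # computation units
    P  = 3 * T + d + 6                    # free parameters
    return T, L, U, P

# Assumed setting: eps = 0.1, alpha = 1, C_alpha = 1, d = 5, K = 3
print(mdcnn_budget(eps=0.1, alpha=1.0, C_alpha=1.0, d=5, K=3))   # (40, 44, 136, 131)
```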
Remark 8. A concrete example is as follows: let ξ ∈ B^5, d = 5, K = 3, L_0 = 2, and g(x) = sin(x), so that g belongs to the Lipschitz-1 class. By Theorem 6, we can construct an MDCNN with at most 3 channels, each of width no more than 5, and L = T + 4 layers such that the approximation error bound of (15) holds.

2.2. Approximation of Functions from Sobolev Space. How do MDCNNs behave for smooth functions? Our second theorem shows that functions in the Sobolev space of order r can be well approximated by a downsampled MDCNN with at most 4 channels.
Theorem 9. Let Ω ⊂ ℝ^d be a compact subset and let G ∈ H^r(ℝ^d). Then, for any f = G|_Ω and any integer r > 2 + d/2, there exists a downsampled MDCNN with finite width and at most 4 channels whose approximation error is bounded by C‖G‖ times a factor that decays with the depth L, where C > 0 is a universal constant and ‖G‖ denotes the Sobolev norm of G ∈ H^r(ℝ^d), given by ‖G‖ = ‖(1 + |w|²)^{r/2} F(G)(w)‖_{L²} with F(G) the Fourier transform of G. The number of computation units is U ≤ 4Ld, and the number of free parameters is P ≤ ((d + 2)/(L_0 + 1))L + 5.
Remark 10. In fact, Theorem 9 demonstrates the universality of MDCNNs; that is, for any compact subset Ω ⊂ ℝ^d, any function in C(Ω) can be approximated by MDCNNs to arbitrary accuracy when the depth L is large enough. The reason is that H^r(Ω) is dense in C(Ω) when we consider Sobolev spaces that embed into the space of continuous functions on Ω. Moreover, the proof of this theorem shows that our constructed MDCNNs have at most 4 channels in each layer and that the width of each layer equals d.
Both main results reveal the benefit of depth for approximating functions from the ridge class and from Sobolev space, indicating that MDCNNs can approximate these two types of functions to arbitrary accuracy as the depth L → ∞. Moreover, the constructed MDCNNs have finite channels, finite width, and finite filter sizes, which is closer to real-world settings than [14, 15, 19].

Comparison and Discussion
Most studies on the approximation theory of neural networks focus on two aspects. The first, obtained in the late 1980s, concerns universality [26-28], meaning that any continuous function can be approximated by (2) to arbitrary accuracy; in other words, the space F{W^(j), b^(j), 1 ≤ j ≤ L} is dense in the objective function space. The second concerns convergence rates [24, 25, 29-31] in terms of neurons, parameters, or depth. For fairness, in this part, we compare our main results with other theoretical investigations of networks in the literature under approximation accuracy ε ∈ (0, 1). Specifically, we carry out our comparisons in terms of the width d_L, filter size K, depth L, number of computation units U, and number of free parameters P.
Let R_n denote the set of combinations of ridge functions with cardinality no larger than n; it was proven in [24] that the approximation error of functions from the Sobolev space W^{r,d}_p in L^q with 2 ≤ q ≤ p ≤ ∞ behaves asymptotically of the order n^{−r/(d−1)} for FNNs. The superiority of Theorem 6 over [24] is the dimension-free property of the convergence rate (15), which demonstrates the good performance of MDCNNs in approximating ridge functions. Besides, let D_A(x) = D_m(x) = (x_{im})_{i=1}^{⌊d/m⌋} (⌊·⌋ is the floor function), where m ≤ d is a scaling parameter. Paper [15] constructed a DCNN with filter size 4N + 6 in the last layer and finite depth ⌈(d − 1)/(K − 2) + 1⌉. It obtained a convergence rate of O(1/N^α) for ridge functions with external function g ∈ K_α, where one needs at most 3⌈(d − 1)/(K − 1)⌉ + 2(2C_α/ε)^{1/α} + 8 computation units and at most 2(2C_α/ε)^{1/α} + 8d free parameters. However, in practice the filter size is usually no larger than the input dimension, meaning that this structure is not frequently used. By comparison, even though Remark 7 indicates that the computation units and free parameters of the MDCNNs constructed from Theorem 6 have the same order as those of [15], our constructed network may be closer to real-world applications, and the convergence rate (15) reveals the benefit of depth.

Proofs of Main Results
There are two kinds of downsampling operators, D_{A_1} and D_{A_2}, acting on each layer in our constructed MDCNNs to ensure the finite-width property, where A_1 = [K + 1 : K + d] and A_2 = [1 : d] ([a : b] denotes all integers belonging to [a, b]). Thereby, for any x ∈ ℝ^d and w ∈ ℝ^{K+1}, we can write the downsampled convolutions as D_{A_1}(w ∗ x) = T^w_1 x and D_{A_2}(w ∗ x) = T^w_2 x, where T^w_1, T^w_2 ∈ ℝ^{d×d} are square matrices consisting of rows of T^w. That is, convolution of w and x followed by the downsampling operators D_{A_1} and D_{A_2} is equivalent to a special matrix-vector multiplication. Furthermore, the two kinds of convolution have the property that for an input signal x ∈ ℝ^d, we have T^w_1 x ∈ ℝ^d and T^w_2 x ∈ ℝ^d; i.e., the input and output have equal width.
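A short NumPy check of this equal-width property (the helper name downsampled_conv is ours; it simply selects the rows of the full convolution indexed by A_1 or A_2):

```python
import numpy as np

def downsampled_conv(w, x, which):
    """D_{A_1}(w * x) or D_{A_2}(w * x) for a filter w of length K + 1: both return a vector of length d."""
    d = len(x)
    K = len(w) - 1
    full = np.convolve(w, x, mode="full")     # length d + K
    if which == 1:                            # A_1 = [K+1 : K+d]
        return full[K:K + d]
    return full[:d]                           # A_2 = [1 : d]

w = np.random.randn(4)                        # K + 1 = 4
x = np.random.randn(6)                        # d = 6
print(downsampled_conv(w, x, 1).shape, downsampled_conv(w, x, 2).shape)   # both (6,)
```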
We first introduce some lemmas in Subsection 3.1 that will be used later to prove our main results. Detailed proofs of our main results are given in Subsection 3.2.

Auxiliary Lemmas
Lemma 11. Let α ∈ ℝ^d and t ∈ ℝ, write A_α = ∑_{i=1}^{d} |α_i|, let the constant B ≥ 1 be an arbitrary upper bound of max_{1≤i≤d} |x_i|, and let L_0 = ⌈d/K⌉. Then there exists an MDCNN with 3 channels, with downsampling applied in the layers of [L_0], whose output after L_0 layers has the form of (18).

Remark 12. The proof of this lemma suggests that by changing the bias in layer L_0 to b^(L_0) = [(2A_α B + t + ∑_{i=K+1}^{d} α_i)1_{d+K}, 0 ⊗ 1_{K+d}] (⊗ denotes the Kronecker product), we obtain (h^(L_0)(x))_{1,1} = σ(α · x − t). Besides, it is not difficult to see that if we abandon the last channel of each layer, i.e., delete the channel used to store the input data x in the proof, then after L_0 layers of convolution we have h^(L_0)(x) = α · x + B (here, in layer L_0, we choose the downsampling set [K + 1 : K + d]), with max_{1≤l≤L_0} ‖w^(l)‖_∞ = max{1, ‖α‖_∞}, 2L_0 d − d computation units, and P = d + 3 free parameters.
Remark 13. The proof of this lemma tells us that our constructed MDCNN has three characteristics: first, only 3 channels are used in all layers; second, all channels have the same width d; finally, it has a finite number of layers L_0. It can be seen from these characteristics that MDCNNs have great superiority in terms of computation units and free parameters.
Proof. Our MDCNN contains 3 channels in layers l (1 ≤ l ≤ L_0 − 1): the first channel is used to build the target output, the second channel shifts the input data by K units, and the third channel stores the input data; the last layer contains 2 channels. Convolving through L_0 layers gives the desired result. We first construct the filters and biases in the first layer: choosing W^(1)_{:,3,1} = [0, …, 0, 1]^T ∈ ℝ^{K+1} and b^(1) = [−A_α B, −B, −B] ⊗ 1_{d+K}, where 1_{d+K} denotes the vector in ℝ^{d+K} whose entries all equal 1, and taking the appropriate downsampling set, the first-layer output has the form of (18). For 2 ≤ l ≤ L_0 − 1, the filter tensor W^(l)_{:,j,:} ∈ ℝ^{(K+1)×3} is chosen so that the layer output again has the form of (18). For l = L_0, by choosing the filters, the bias, and the downsampling set A_1, we obtain h^(L_0)(x) ∈ ℝ^{d×2}. In the same way, for j ∈ [2], (h^(L_0)(x))_{:,j} = σ(∑_{i=1}^{3} T^{W^(L_0)_{:,j,i}}(h^(L_0−1)(x))_{:,i} + b^(L_0)_{:,j}) has the form of (18). From the representation (18), we can easily see that the outputs of the whole constructed convolutional network have equal width d in each channel. Thus, the number of computation units of the network is (3L_0 − 1)d, and the number of free parameters is P = d + 3.

Our next goal is to give convergence rates for functions from the Lipschitz-α (0 < α ≤ 1) class. Before that, we introduce one more lemma inspired by [33, 34].

Lemma 14. Let g ∈ K_α and m, T ∈ ℕ_+; then there exists a piecewise linear function g̃_1(x) with breakpoints {−1 + 2r/m}_{r=0}^{m}, expressible as a linear combination of ReLU units as in (27), which satisfies the error bound (26).

Remark 15. There are two differences between Lemma 14 and [33, 34]. On the one hand, Lemma 7.3 of [34] gives a result similar to this lemma, but it does not give a concrete expression for the function used in the approximation; on the other hand, [33] gives a concrete ReLU expression of the approximant, but only for functions from the Lipschitz-1 class. Besides, both [33, 34] are obtained under the assumption that the input data satisfy x ∈ [0, 1].
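A hedged sketch of the construction behind Lemma 14 (NumPy; the helper name relu_interpolant and the choice g = sin are ours for illustration): the piecewise linear interpolant with breakpoints −1 + 2r/m is written as g(−1) plus a weighted sum of ReLU units, which is exactly the kind of expression that Lemma 16 embeds into an MDCNN.

```python
import numpy as np

def relu_interpolant(g, m):
    """
    Piecewise linear interpolant of g on [-1, 1] with breakpoints x_r = -1 + 2r/m,
    written as g(-1) + sum_t w_t * relu(x - x_t), where the w_t are slope differences.
    """
    xs = -1.0 + 2.0 * np.arange(m + 1) / m                 # breakpoints
    slopes = (g(xs[1:]) - g(xs[:-1])) / (xs[1:] - xs[:-1])
    w = np.diff(np.concatenate(([0.0], slopes)))           # w_0 = slope_0, w_t = slope_t - slope_{t-1}
    def g_tilde(x):
        x = np.asarray(x, dtype=float)
        return g(-1.0) + np.sum(w * np.maximum(0.0, x[..., None] - xs[:-1]), axis=-1)
    return g_tilde

g = np.sin                                                  # Lipschitz-1 on [-1, 1]
g_m = relu_interpolant(g, m=40)
grid = np.linspace(-1, 1, 1001)
print(np.max(np.abs(g_m(grid) - g(grid))))                  # uniform error decreases as m grows
```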
Equation (27) indicates that any function g ∈ K_α can be well approximated by linear combinations of ReLU units. The form of such linear combinations of ReLU units inspires us to construct MDCNNs with downsampling to approximate functions. Our next lemma provides the specific technique for embedding (27) into downsampled MDCNNs.

Lemma 16. Let g ∈ K_α, x ∈ [−1, 1], and m, T ∈ ℕ_+; then there exists a downsampled MDCNN with L = T + 2 (L ≥ 3) layers, all of which have only 3 channels, whose output ∑_{j=1}^{3} c_j (h^(L)(x))_j with c_j ∈ ℝ (j ∈ [3]) satisfies the same error bound as in (26). The computation units are U = 3T + 6, and the free parameters are weights ‖W‖_0 = 3T + 4, biases ‖b‖_0 = 2T + 3, and P = 3T + 3.
Proof. The main technique is to embed the ReLU expression from Lemma 14 into a specific MDCNN. Different channels play the roles of storing the input data, shifting the input data, and storing the σ units. We choose the filters W^(l) and biases B^(l) with B_1 = 2|w_0|, B_l = 2|w_{l−1}|, and M_l = ∑_{i=1}^{l} B_i. In the filter tensor, columns represent filters for different output channels, and filters in different rows correspond to the respective input channels, with row and column indices matching the channel indices. Then

Conv^σ_{W^(l),B^(l)}(x) = [x + 1, σ(x − 2(l − 1)/T + 1), ∑_{t=0}^{l−2} w_t σ(x − 2t/T + 1) + M_{l−1}].

By induction, this representation holds for every layer. Here, when l = T + 1, we change the bias of the second output channel to 1 so that it has zero output. The third entry contains the linear combination of ReLU units from g̃_1(x).
For l = L = T + 2, we take W^(L) ∈ ℝ^{1×3×3} with all entries equal to zero except the top-left and bottom-right ones. Thus, by (26), we obtain the stated approximation bound with 3T + 6 computation units and P = 3T + 3 free parameters.

Proofs of Main Theorems
Proof of Theorem 6.
By Remark 12, if we take the upper bound B = 1, then there exists an MDCNN with at most 2 channels such that h^(L_0)(x) = ξ · x + 1. By changing B^(1) to zero in the proof of Lemma 16, we have Conv^σ_{W^(1),B^(1)}(x) = [ξ · x + 1, σ(ξ · x + 1), 0]. In the sequel, with L replaced by L − L_0 in Lemma 16, we get the desired result.
Proof of Theorem 9. Let m ∈ ℕ_+; the approximation of f is based on the function f_m(x) from [22], which is a linear combination of m ReLU ridge units, and for which [22] provides an approximation bound involving a universal constant c > 0. We embed f_m(x) into a downsampled MDCNN with at most 4 channels to obtain the target approximation. The core of our method is to use different channels to store different data features. We prove the theorem by induction. For the first L_0 layers, by Lemma 11, there exists a downsampled MDCNN with 3 channels and L_0 layers such that h^(L_0)(x) ∈ ℝ^{d×2} has the form of (18). Next, the MDCNN we construct contains at most 4 channels: the first channel stores linear combinations of ReLU units, the second channel stores the next ReLU unit, the third channel shifts the input data by K steps, and the fourth channel stores the raw input information. Suppose that for l = k(L_0 + 1) we have h^(l)(x) ∈ ℝ^{d×2} with (h^(l)(x))_{1,1} = (v/m) ∑_{i=1}^{k} b_i σ(α^(i) · x − t_i) + B_k and (h^(l)(x))_{:,2} = (x_i + B)_{i=1}^{d}; then for l = k(L_0 + 1) + 1 we choose a suitable W^(l) ∈ ℝ^{(K+1)×4×2}. Next, for m(L_0 + 1) < l ≤ m(L_0 + 1) + L_0, we use the first channel to store the linear combination of ReLU units, and there is no need to keep storing the input data. By Lemma 11, for l = m(L_0 + 1) + L_0 we have h^(l)(x) ∈ ℝ^{d×2} with (h^(l)(x))_{1,1} = (v/m) ∑_{i=1}^{m} b_i σ(α^(i) · x − t_i) + B_m and (h^(l)(x))_{1,2} = α^(0) · x + B_0, where B_0 = ∑_{i=1}^{d} |α^(0)_i|. For l = (m + 1)(L_0 + 1), we take a suitable W^(l) ∈ ℝ^{(K+1)×2}. Thus, by choosing c_{1,1} = 1, c_{1,2} = −(B_m + A_0 B) + b_0, and c_{j,i} = 0 otherwise, the network output equals f_m(x). At last, choosing m = ⌊L/(L_0 + 1)⌋ − 1 gives (L_0 + 1)(m + 1) ≤ L < (L_0 + 1)(m + 2), so it may happen that L > (L_0 + 1)(m + 1); this causes no trouble since, by using an identity map similar to that in Lemma 16, the remaining layers keep the output unchanged. In addition, from the concrete form of m we have ln m ≤ ln L, and since K ∈ [d − 1] we further obtain, using L ≥ 3(L_0 + 1) in step (a), that m^{−1/2−1/d} ≤ 6(d/L)^{1/2+1/d}. Putting these estimates into (38) yields the stated convergence rate. Finally, in a similar way to [14], using the Cauchy-Buniakowsky-Schwarz inequality, we obtain the stated bounds on the computation units and free parameters.

Conclusion and Future Work
This paper studies the approximation properties of structured MDCNNs with downsampling. The results show that for functions from the ridge class and from the Sobolev space H^r(ℝ^d), our proposed MDCNNs achieve better function approximation performance than other relevant constructions, which explains to some extent why MDCNNs are successful in applications. Note, however, that our MDCNNs only consider one-dimensional signal input and convolution of vectors; how does matrix or higher-order convolution work? Is there some relationship between MDCNNs and FNNs? How do MDCNNs behave in terms of generalization and expressivity? These are interesting questions that we leave as future work.

Figure 2: Illustration of MDCNNs with downsampling. The input data has C channels; the multichannel convolution is applied to the input data with C^(1) × C filters, and the output of the first layer contains C^(1) channels; next, the multichannel convolution is applied to the outputs of the first layer with C^(2) × C^(1) filters, followed by a downsampling operator, and the output of the second layer contains C^(2) channels; and so on.