A Frobenius Norm Regularization Method for Convolutional Kernel Tensors in Neural Networks

Convolutional neural networks are an important class of deep learning models. Keeping the singular values of the Jacobian of each layer bounded around 1 during training helps avoid the exploding/vanishing gradient problem and improves the generalizability of the network. We propose a new Frobenius norm penalty function for a convolutional kernel tensor that keeps the singular values of the corresponding transformation matrix bounded around 1, and we show how to carry out gradient-type methods with it. This provides a potentially useful regularization method for the weights of convolutional layers.


Introduction
In recent years, deep convolutional neural networks have been applied successfully in many fields, such as face recognition, self-driving cars, natural language understanding, and speech recognition [1]. In the field of deep learning, convolution without the flip is a very important arithmetic operation [2]. From the viewpoint of linear algebra, each convolutional kernel corresponds to a linear transformation matrix. Given an input X and a kernel K, the output of the convolution Y = K * X can be reshaped from the matrix-vector multiplication of a corresponding transformation matrix M with the reshaped X. We use vec(X) to denote the vectorization of X. If X is a matrix, vec(X) is the column vector obtained by stacking the columns of X on top of one another. If X is a tensor, vec(X) is the column vector obtained by stacking the columns of the flattening of X along the first index (see [3] for more on the flattening of a tensor). Thus, given a kernel K, if M is the linear transformation matrix corresponding to K, we have

vec(Y) = M vec(X).    (1)

Training a neural network can be seen as an optimization problem: seeking the optimal weights (parameters) that minimize the loss function on the training data. Exploding and vanishing gradients are fundamental obstacles to effectively training a deep neural network [4]. The singular values of the Jacobian of a layer bound the factor by which it changes the norm of the backpropagated signal. On the other hand, the generalizability of a model can be improved by reducing the sensitivity of the loss function to perturbations of the input data [5][6][7]. Therefore, when training deep convolutional neural networks, keeping the singular values of the Jacobian of each layer close to 1 helps avoid the exploding/vanishing gradient problem [4,[8][9][10] and improves the generalizability of the network [7,11,12]. This amounts to letting the singular values of each linear transformation matrix M be bounded around 1. Our contribution in this paper is a Frobenius norm penalty function that penalizes the kernel K so that the singular values of the corresponding transformation matrix M are bounded around 1, and hence ‖vec(Y)‖ ≈ ‖vec(X)‖, where ‖ · ‖ denotes a vector norm.
First, we briefly introduce the convolution arithmetic in deep learning; please see reference [2] for details. Depending on strides and padding patterns, there are many different forms of convolution arithmetic [2]. Without loss of generality, in this paper we consider the same convolution with unit strides. The notation * denotes the convolution arithmetic in deep learning and ⌈·⌉ rounds a number to the nearest integer greater than or equal to that number. If a convolutional kernel is a matrix K ∈ R^{k×k} and the input is a matrix X ∈ R^{N×N}, each entry of the output Y ∈ R^{N×N} is produced by

Y_{i,j} = Σ_{p=1}^{k} Σ_{q=1}^{k} K_{p,q} X_{i+p−m, j+q−m},    (2)

where m = ⌈k/2⌉ and X_{i,j} = 0 whenever i or j lies outside {1, ..., N} (zero padding). In convolutional neural networks there are usually multiple channels and a convolutional kernel is represented by a 4-dimensional tensor. If a convolutional kernel is a 4-dimensional tensor K ∈ R^{k×k×g×h} and the input is a 3-dimensional tensor X ∈ R^{N×N×g}, each entry of the output Y ∈ R^{N×N×h} is produced by

Y_{i,j,c} = Σ_{d=1}^{g} Σ_{p=1}^{k} Σ_{q=1}^{k} K_{p,q,d,c} X_{i+p−m, j+q−m, d},    (3)

where m = ⌈k/2⌉ and out-of-range entries of X are again taken to be zero. When we refer to the convolution arithmetic in deep learning, only element-wise multiplication and addition are performed and the convolutional kernel is not reversed.
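For concreteness, the following MATLAB sketch (our own illustration rather than code from the paper; the function name conv_same is ours) implements the same convolution with unit strides and zero padding exactly as in (2) and (3); the one-channel case (2) is recovered with g = h = 1.

% A minimal sketch of the "same" multichannel convolution with unit strides
% and zero padding, as in equation (3). K is k-by-k-by-g-by-h (third index =
% input channel d, fourth index = output channel c), X is N-by-N-by-g.
function Y = conv_same(K, X)
    [k, ~, g, h] = size(K);
    N = size(X, 1);
    m = ceil(k/2);
    Y = zeros(N, N, h);
    for c = 1:h
        for j = 1:N
            for i = 1:N
                s = 0;
                for d = 1:g
                    for q = 1:k
                        for p = 1:k
                            ii = i + p - m;   % row index into the input
                            jj = j + q - m;   % column index into the input
                            if ii >= 1 && ii <= N && jj >= 1 && jj <= N
                                s = s + K(p, q, d, c) * X(ii, jj, d);
                            end
                        end
                    end
                end
                Y(i, j, c) = s;
            end
        end
    end
end

For example, conv_same(randn(3, 3, 2, 4), randn(10, 10, 2)) returns a 10 × 10 × 4 output.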
Given a general kernel K of size k × k × g × h and an input of size N × N × g, let M be the linear transformation matrix corresponding to K. We would like a regularization term that forces the singular values of M to be bounded around 1, but the singular values are hard to control directly. In this paper, we use ‖M^T M − I‖²_F as the penalty function, where ‖ · ‖_F denotes the Frobenius norm of a matrix and I is the identity matrix, to let the singular values of M be bounded around 1. Indeed, ‖M^T M − I‖²_F = Σ_i (σ_i(M)² − 1)², where the σ_i(M) are the singular values of M (padded with zeros if M is rectangular), so driving this penalty toward zero drives the nonzero singular values of M toward 1.
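As a quick sanity check of this identity, the following stand-alone MATLAB snippet (not part of the method itself; any matrix can stand in for M here) compares the penalty with the sum over singular values.

% The penalty equals the sum of (sigma_i^2 - 1)^2 over the singular values,
% padded with zeros up to the number of columns for a rectangular matrix.
M = randn(5, 8);                                    % any matrix stands in for M
s = [svd(M); zeros(size(M, 2) - min(size(M)), 1)];  % singular values, zero-padded
penalty  = norm(M' * M - eye(size(M, 2)), 'fro')^2;
identity = sum((s.^2 - 1).^2);
fprintf('penalty = %.6f, sum of (sigma^2 - 1)^2 = %.6f\n', penalty, identity);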
In this paper, we show how to modify the entries of the kernel K to minimize ‖M^T M − I‖²_F from the viewpoint of linear algebra. Knowledge from the field of matrix analysis plays a key role in this paper. As we know, given a matrix, the singular values/eigenvalues are continuous functions of the entries of the matrix [13]. We can calculate the partial derivatives of a singular value with respect to the entries and change the singular values of a matrix by modifying its entries. Here the transformation matrix M corresponding to a convolutional kernel is structured, i.e., M has a special matrix structure. Our goal is to regularize the singular values of M by modifying the entries of the kernel K. The modification of K actually carries out a modification of M on a special matrix manifold. The contribution is that we derive a mathematical formula for the gradient of ‖M^T M − I‖²_F with respect to the kernel K, i.e., ∂‖M^T M − I‖²_F / ∂k_{p,q,z,y}. Then gradient-based algorithms can be applied to effectively keep the singular values of convolutional layers bounded. Compared with the 2-norm, the Frobenius norm of a matrix is less sensitive to perturbations of the matrix entries, so this Frobenius norm-based formula is numerically stable.
There are many techniques from different perspectives to improve the performance of a neural network model. In [14], a semisupervised deep model, which is robust over imbalanced and small training data sets, is proposed for human activity recognition from multimodal wearable sensory data. A semisupervised feature selection method, which shows superiority in video semantic recognition-related tasks, is proposed from the perspective that instances with similar labels should have a larger probability of being neighbors [15]. Two deep learning-based frameworks are proposed, which make sense of spatio-temporal preserving representations for electroencephalography-based human intention recognition [16]. A modeling method based on neural networks describes hysteresis nonlinearity better by adding a nonlinear function in the input layer [17].
As we know, batch normalization and dropout are two popular regularization methods [18][19][20][21]. Recently, many papers have been devoted to enforcing orthogonality or spectral norm regularization on the weights of a neural network [8,9,12,22]. The difference between our paper and papers such as [8,9,12,22] and the references therein lies in how convolutions are handled: they enforce the constraint directly on the h × (gkk) matrix reshaped from the kernel K ∈ R^{k×k×g×h}, while we enforce the constraint on the transformation matrix M corresponding to the convolution kernel K. In [10], a convolutional layer is projected onto the set of layers obeying a bound on the operator norm of the layer, and this is shown to be an effective regularizer. A drawback of the method in [10] is that the projection can prevent the singular values of the transformation matrix from being large but cannot prevent them from becoming too small. In [23], a 2-norm regularization method is proposed for convolutional kernels, but it is not a stable algorithm because the largest singular value may be overtaken by the second or third largest singular value after one update. In this paper, we propose a Frobenius norm regularization method for convolutional kernels.
As we have mentioned, the numbers of input channels and output channels may be more than one, so the kernel is usually represented by a tensor K ∈ R^{k×k×g×h}. The rest of the paper is organized as follows: In Section 2, we first consider the case where the numbers of input channels and output channels are both 1. We propose the penalty function, calculate the partial derivatives, and propose the gradient descent algorithm for this case. In Section 3, we propose the penalty function and calculate the partial derivatives for the case of multichannel convolution. In Section 4, we present numerical results to show the method is feasible and effective. In Section 5, we give some conclusions and point out some interesting work that could be done in the future.

Penalty Function for One-Channel Convolution
As a warm-up, we first focus on the case where the numbers of input channels and output channels are both 1. In this case, the weights of the kernel form a k × k matrix. Without loss of generality, assuming the data matrix is N × N, we use a 3 × 3 matrix as the convolution kernel to illustrate the associated transformation matrix. Let K ∈ R^{3×3} be the convolution kernel. The transformation matrix M corresponding to the convolution arithmetic (2) is then an N² × N² doubly block banded Toeplitz matrix, i.e., a block banded Toeplitz matrix whose blocks are banded Toeplitz matrices: the entry K_{p,q} fills the (p − m)-th diagonal within the blocks lying on the (q − m)-th block diagonal of M. For details about Toeplitz matrices, please see references [24,25]. We let n = N² and use T to denote the set of all matrices with the same structure as M, i.e., doubly block banded Toeplitz matrices with the fixed band.
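The structure of M is easiest to see constructively. The following MATLAB sketch (again our own illustration, not the authors' code; the helper name conv_transform_matrix is ours) builds the N² × N² matrix M from a k × k kernel K so that vec(Y) = M vec(X) for the same convolution (2); in MATLAB, vec(X) is simply X(:).

% Build the transformation matrix M with vec(Y) = M*vec(X) for "same"
% one-channel convolution with unit strides and zero padding, equation (2).
function M = conv_transform_matrix(K, N)
    k = size(K, 1);          % kernel is k-by-k
    m = ceil(k/2);
    M = zeros(N^2, N^2);
    for j = 1:N
        for i = 1:N
            row = (j-1)*N + i;            % index of Y(i,j) in vec(Y)
            for q = 1:k
                for p = 1:k
                    ii = i + p - m;       % input row index
                    jj = j + q - m;       % input column index
                    if ii >= 1 && ii <= N && jj >= 1 && jj <= N
                        col = (jj-1)*N + ii;   % index of X(ii,jj) in vec(X)
                        M(row, col) = K(p, q); % K(p,q) fills the set Omega_{p,q}
                    end
                end
            end
        end
    end
end

One can check, for instance, that for K = randn(3, 3) and X = randn(5, 5) the vector conv_transform_matrix(K, 5)*X(:) should agree with reshape(conv_same(K, X), [], 1) up to rounding error.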
We use ‖M^T M − I‖²_F as the penalty function to regularize the convolutional kernel K and calculate ∂‖M^T M − I‖²_F / ∂K_{p,q}, i.e., the partial derivative of the squared Frobenius norm of M^T M − I with respect to each entry K_{p,q} of the convolution kernel. Our method provides a new way to calculate the gradient of a penalty function of the transformation matrix with respect to the convolution kernel; one can construct other penalty functions of M and obtain the corresponding gradient descent method when training convolutional networks. The following lemma is easy but useful in the derivation below.

Lemma 1. The partial derivative of the square of the Frobenius norm of A ∈ R^{n×n} with respect to each entry a_{ij} is ∂‖A‖²_F / ∂a_{ij} = 2a_{ij}.

If an entry a_{ij} of the matrix A ∈ R^{n×n} changes, only the entries belonging to the j-th row or the j-th column of the matrix A^T A are affected. Actually, we have the following lemma.

Lemma 2. If we use (A^T A)_{s,t} to denote the (s, t) entry of the matrix A^T A, then ∂(A^T A)_{s,t} / ∂a_{ij} is the (s, t) entry of the matrix D = A^T (e_i e_j^T) + (e_j e_i^T) A, where e_i denotes the i-th column of the identity matrix.
Combining Lemma 1 and Lemma 2, we obtain, for any A ∈ R^{n×n},

∂‖A^T A − I‖²_F / ∂a_{ij} = 2 Σ_{s,t} (A^T A − I)_{s,t} D_{s,t},

with D as in Lemma 2. For a matrix M ∈ T, the value of K_{p,q} appears in many different (i, j) positions. We use Ω_{p,q} to denote this index set, i.e., for each (i, j) ∈ Ω_{p,q} we have m_{ij} = K_{p,q}. The chain rule then tells us that, to calculate ∂‖M^T M − I‖²_F / ∂K_{p,q}, we should calculate ∂‖M^T M − I‖²_F / ∂m_{ij} for all (i, j) ∈ Ω_{p,q} and take the sum. We summarize the above results as the following theorem.

Theorem 1. Assume M ∈ T is the transformation matrix corresponding to the one-channel kernel K ∈ R^{k×k}, and recall n = N². Given (p, q), if Ω_{p,q} is the set of all indexes (i, j) such that m_{ij} = K_{p,q}, then

∂‖M^T M − I‖²_F / ∂K_{p,q} = 2 Σ_{(i,j)∈Ω_{p,q}} Σ_{t=1}^{n} [ (M^T M − I)_{j,t} + (M^T M − I)_{t,j} ] m_{i,t}.    (11)

Theorem 1 provides new insight into how to regularize a convolutional kernel K such that the singular values of the corresponding transformation matrix lie in a bounded interval. We can use formula (11) to carry out gradient-type methods. In the future, one can construct other penalty functions that give the transformation matrix corresponding to a convolutional kernel some prescribed property and calculate the gradient of the penalty function with respect to the kernel as we have done in this paper.
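To make formula (11) concrete, the following MATLAB sketch (our own re-implementation with an assumed helper name, building on the conv_transform_matrix sketch above) evaluates ∂‖M^T M − I‖²_F / ∂K_{p,q} for a one-channel kernel. It uses the fact, which follows from Lemmas 1 and 2, that the matrix whose (i, j) entry is ∂‖M^T M − I‖²_F / ∂m_{ij} equals 4M(M^T M − I), and then sums these entrywise derivatives over the index set Ω_{p,q} for each kernel entry.

% Gradient of ||M'M - I||_F^2 with respect to the entries of a k-by-k kernel,
% in the spirit of formula (11). Uses conv_transform_matrix from above.
function G = penalty_gradient_1ch(K, N)
    k = size(K, 1);
    m = ceil(k/2);
    M = conv_transform_matrix(K, N);
    P = 4 * M * (M' * M - eye(N^2));   % P(i,j) = d||M'M - I||_F^2 / dm_ij
    G = zeros(k, k);
    for j = 1:N
        for i = 1:N
            row = (j-1)*N + i;
            for q = 1:k
                for p = 1:k
                    ii = i + p - m;
                    jj = j + q - m;
                    if ii >= 1 && ii <= N && jj >= 1 && jj <= N
                        col = (jj-1)*N + ii;
                        G(p, q) = G(p, q) + P(row, col);  % sum over Omega_{p,q}
                    end
                end
            end
        end
    end
end

The result can be checked against a finite-difference approximation: perturb a single entry K(p, q) by a small ε, rebuild M, and compare the change in ‖M^T M − I‖²_F with ε·G(p, q).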

The Penalty Function and the Gradient for Multichannel Convolution
In this section, we consider the case of multichannel convolution. First, we show the transformation matrix corresponding to multichannel convolution. At each convolutional layer, we have a convolution kernel K ∈ R^{k×k×g×h} and the input X ∈ R^{N×N×g}; element X_{i,j,d} is the value of the input unit within channel d at row i and column j. Each entry of the output Y ∈ R^{N×N×h} is produced by

Y_{i,j,c} = Σ_{d=1}^{g} Σ_{p=1}^{k} Σ_{q=1}^{k} K_{p,q,d,c} X_{i+p−m, j+q−m, d},    (12)

where m = ⌈k/2⌉ and out-of-range entries of X are taken to be zero, exactly as in (3). Stacking the channels, we again have vec(Y) = M vec(X), where M is the hN² × gN² block matrix

M = [ B^{(1)(1)}  B^{(1)(2)}  ···  B^{(1)(g)}
      B^{(2)(1)}  B^{(2)(2)}  ···  B^{(2)(g)}
        ···
      B^{(h)(1)}  B^{(h)(2)}  ···  B^{(h)(g)} ],    (13)

and each B^{(c)(d)} ∈ T, i.e., B^{(c)(d)} is an N² × N² doubly block banded Toeplitz matrix corresponding to the portion K_{:,:,d,c} of K that concerns the effect of the d-th input channel on the c-th output channel (a constructive sketch of this block structure is given after Theorem 2). Similar to the proof in Section 2, we have the following theorem.

Theorem 2.
Assume M is the structured matrix corresponding to the multichannel convolution kernel K ∈ R^{k×k×g×h} as defined in (13). Given (p, q, z, y), if Ω_{p,q,z,y} is the set of all indexes (i, j) such that m_{ij} = k_{p,q,z,y}, we have

∂‖M^T M − I‖²_F / ∂k_{p,q,z,y} = 2 Σ_{(i,j)∈Ω_{p,q,z,y}} Σ_{t=1}^{gN²} [ (M^T M − I)_{j,t} + (M^T M − I)_{t,j} ] m_{i,t}.    (14)

Then the gradient descent algorithm for the penalty function ‖M^T M − I‖²_F can be devised, where the number of channels may be more than one. We present the detailed gradient descent algorithm for the penalty function ‖M^T M − I‖²_F as Algorithm 1:

Algorithm 1.
(1) Input: an initial kernel K ∈ R^{k×k×g×h}, input size N × N × g, and learning rate λ.
(2) While not converged: compute the gradient tensor G ∈ R^{k×k×g×h}, whose entries are given by (14), and update K ← K − λG.
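The following MATLAB sketches (our own illustrations with assumed helper names, not the authors' code) make the construction concrete. The first assembles the block matrix M in (13) from the one-channel builder of Section 2.

% Assemble the h*N^2-by-g*N^2 transformation matrix of a multichannel kernel:
% block (c,d) is the one-channel matrix B^{(c)(d)} of the slice K(:,:,d,c).
% Uses conv_transform_matrix from Section 2.
function M = conv_transform_matrix_mc(K, N)
    [~, ~, g, h] = size(K);
    M = zeros(h * N^2, g * N^2);
    for c = 1:h              % output channel -> block row
        for d = 1:g          % input channel  -> block column
            B = conv_transform_matrix(K(:, :, d, c), N);
            M((c-1)*N^2+1 : c*N^2, (d-1)*N^2+1 : d*N^2) = B;
        end
    end
end

The second evaluates the gradient tensor G of ‖M^T M − I‖²_F with respect to the kernel in the spirit of (14), again via the entrywise derivative matrix 4M(M^T M − I), and performs plain gradient descent steps as in Algorithm 1; the fixed step size λ here is a simplification of the dynamic step size used in the experiments below.

% Gradient tensor of ||M'M - I||_F^2 with respect to the kernel entries:
% for each entry K(p,q,d,c), sum the entrywise derivatives 4*M*(M'*M - I)
% over all positions of M that hold that entry (the set Omega_{p,q,d,c}).
function G = kernel_gradient(K, N)
    [k, ~, g, h] = size(K);
    m = ceil(k/2);
    M = conv_transform_matrix_mc(K, N);
    P = 4 * M * (M' * M - eye(g * N^2));   % P(i,j) = d||M'M - I||_F^2 / dm_ij
    G = zeros(k, k, g, h);
    for c = 1:h
        for d = 1:g
            Pblk = P((c-1)*N^2+1 : c*N^2, (d-1)*N^2+1 : d*N^2);
            for j = 1:N
                for i = 1:N
                    row = (j-1)*N + i;
                    for q = 1:k
                        for p = 1:k
                            ii = i + p - m;  jj = j + q - m;
                            if ii >= 1 && ii <= N && jj >= 1 && jj <= N
                                col = (jj-1)*N + ii;
                                G(p, q, d, c) = G(p, q, d, c) + Pblk(row, col);
                            end
                        end
                    end
                end
            end
        end
    end
end

% Plain gradient descent on ||M'M - I||_F^2, as in Algorithm 1 but with a
% fixed step size lambda instead of a dynamic one.
function K = regularize_kernel(K, N, lambda, maxiter)
    for it = 1:maxiter
        K = K - lambda * kernel_gradient(K, N);
    end
end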

Numerical Experiments
The numerical tests were performed on a laptop (3.0 GHz CPU and 16 GB memory) with MATLAB R2016b.
We use M to denote the transformation matrix corresponding to the convolutional kernel. The largest and smallest singular values of M (denoted σ_max(M) and σ_min(M)) and the number of iteration steps (denoted "iter") are reported to show the effectiveness of our method. The efficiency is related to the step size λ. According to our experience, the norm of the matrix reshaped from the gradient tensor G ∈ R^{k×k×g×h} in Algorithm 1 decreases as the number of iteration steps grows. Therefore, we let the step size λ be a small number at first and gradually increase it; in our numerical experiments, Algorithm 1 uses such a dynamic adjustment of the step size. Numerical experiments are implemented on extensive test problems. Our method is effective in letting σ_max(M) and σ_min(M) approach 1. We present the numerical results for some randomly generated multichannel convolution kernels.
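As an illustration only (the paper's exact step-size rule is not reproduced here), the following MATLAB snippet sketches such a schedule: it starts from a small assumed λ, enlarges it after a fixed number of iterations, and monitors σ_max(M) and σ_min(M) using the helper sketches from the previous sections; the kernel K and input size N are assumed to be in the workspace.

% Illustrative monitoring loop with an assumed step size schedule.
lambda = 1e-4;                                   % assumed initial step size
for it = 1:200
    M = conv_transform_matrix_mc(K, N);
    s = svd(M);
    fprintf('iter %3d: sigma_max = %.4f, sigma_min = %.4f\n', it, s(1), s(end));
    K = K - lambda * kernel_gradient(K, N);      % one gradient step
    if mod(it, 50) == 0
        lambda = 2 * lambda;                     % assumed gradual increase
    end
end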
We start from a random kernel with each entry drawn from the standard normal distribution (mean 0, variance 1), i.e., in MATLAB, K is generated by the following command.
rng(1); K = randn(3, 3, g, h);

We consider kernels of different sizes with 3 × 3 filters, namely K ∈ R^{3×3×g×h} for various values of g and h. For each kernel, we use an input data matrix of size 20 × 20 × g. We then minimize R_1(K) = ‖M^T M − I‖²_F using Algorithm 1 and demonstrate the beneficial effect of decreasing σ_max(M) while increasing σ_min(M); the convergence histories of σ_max(M) and σ_min(M) are presented in Figure 1.

About the running time of Algorithm 1, we have the following remarks. Given a k × k × g × h kernel and an input matrix of size N × N × g, the gradient tensor G ∈ R^{k×k×g×h} is computed by (14). From extensive numerical experiments, it is observed that, for a given kernel, when the value of N changes, the values of the gradient tensor G computed by (14) remain almost unchanged. For example, for the kernel generated by the command "rng(1); K = randn(3, 3, 6, 3)", let G^(1) denote the gradient tensor computed with N = 6 and let G^(2) denote the gradient tensor computed with N = 64; then each entry G^(1)_{p,q,z,y} agrees with the corresponding entry G^(2)_{p,q,z,y} to at least three significant digits. So in the practical training of neural networks, where N could be 64 or even larger for a convolutional layer, we can always compute the gradient tensor G with a smaller value of N, for example, N = 6. From extensive numerical experiments on different kernels, when N = 64, N = 128, and N = 256, using the gradient tensor G computed with N = 6 to carry out Algorithm 1, the convergence figures of σ_max(M) and σ_min(M) are similar to the subfigures in Figure 1. For a given 3 × 3 × 64 × 64 kernel, the time for computing the gradient tensor G once with N = 6 is 12.10 seconds on our laptop. For a given 3 × 3 × 5 × 1 kernel, the time for computing the gradient tensor G once with N = 6 is 0.02 seconds on our laptop. The time cost of our regularization method is affordable.

Conclusions
In this paper, from the viewpoint of linear algebra, we provide a Frobenius norm method to make the convolution approximately norm preserving, i.e., orthogonal. We let the singular values of the structured transformation matrix corresponding to a convolutional kernel tensor be close to 1. We give the penalty function and propose the gradient descent algorithm for the convolutional kernel tensor. Numerical experiments confirm that this method is effective in modifying the singular values of the convolution operations. This mathematical method may cast new light on the training of very deep convolutional neural networks. However, so far we have only shown that the method is mathematically feasible and that it may be of potential use in the field of deep learning. In the future, further numerical experiments are needed to show the effectiveness of this method during the training of neural networks.
In the future, we will evaluate this method on state-of-the-art architectures such as ResNet and DenseNet [26,27], considering all aspects including recognition rate and computational efficiency. As we know, training a neural network involves many details, and it is not easy to obtain performance improvements in experiments. This will be left as our future work [28].

Data Availability
Data involved in the paper will be shared upon request. Anyone interested in the data may contact the corresponding author at peichang@cugb.edu.cn, and the data will be sent by email.

Disclosure
An earlier draft version of this paper was uploaded to arXiv and is available at https://arxiv.org/abs/1907.11235.

Conflicts of Interest
There are no conflicts of interest to declare.