A new type of information-theoretic method is proposed to improve prediction performance in supervised learning. The method has two main technical features. First, the complicated procedures used to increase information content are replaced by the direct use of hidden neuron outputs: information is controlled by directly changing the outputs of the hidden neurons. Second, to simultaneously increase information content and decrease errors between targets and outputs, the information acquisition and information use phases are separated. In the information acquisition phase, an autoencoder tries to acquire as much information content on input patterns as possible. In the information use phase, the information obtained in the acquisition phase is used to train the supervised network. The method is thus a simplified version of actual information maximization that deals directly with the outputs from neurons. The method was applied to three data sets, namely, the Iris, bankruptcy, and rebel participation data sets. Experimental results showed that the proposed simplified information acquisition method was effective in increasing the real information content and that, by using this information content, generalization performance was greatly improved.
First, several cases were observed in which information-theoretic methods did not necessarily succeed in increasing information content. For example, when the number of neurons increases, the adjustment among neurons becomes difficult, preventing the neural networks from increasing information content. Second, there is the problem of computational complexity. As expected, information or entropy functions require complex learning formulas, which suggests that information-theoretic methods can be effective only for relatively small neural networks. Third, there is the problem of compromising between information maximization and error minimization. From an information-theoretic point of view, information on input patterns should be increased as much as possible; on the other hand, neural networks should minimize errors between targets and outputs. Because information maximization and error minimization are sometimes contradictory, it can be difficult to reconcile the two within one framework.
First, the process of information maximization can be realized by simulating its actual outcome. In the information-theoretic method, when information increases, only a small number of hidden neurons are activated, and this number should be decreased as much as possible in the course of learning in order to increase the information. For this purpose, hidden neurons are ranked according to the magnitude of their variance: hidden neurons with larger variance are considered more important and are more strongly activated. This direct use of outputs facilitates the process of information maximization and reduces computational complexity.
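As an illustration, the ranking and modulation just described can be sketched in Python/NumPy as follows. The sigmoid activations, the exponential rank schedule, and all names here are illustrative assumptions; the paper's exact scaling is not reproduced.

```python
import numpy as np

def rank_by_variance(hidden_outputs):
    """Order hidden neurons by the variance of their outputs.

    hidden_outputs: array of shape (n_patterns, n_hidden),
    e.g., sigmoid activations collected over all input patterns.
    Returns neuron indices from highest to lowest variance.
    """
    variances = hidden_outputs.var(axis=0)
    return np.argsort(variances)[::-1]

def modulate_by_rank(hidden_outputs, beta):
    """Scale each neuron's outputs by an importance factor that
    decays with its variance rank; beta controls how sharply
    low-variance neurons are suppressed (a hypothetical schedule
    standing in for the paper's formulation)."""
    n_hidden = hidden_outputs.shape[1]
    order = rank_by_variance(hidden_outputs)
    ranks = np.empty(n_hidden)
    ranks[order] = np.arange(n_hidden)   # rank 0 = first winner
    importance = np.exp(-beta * ranks)   # decays with rank
    return hidden_outputs * importance

# Example: 150 patterns, 10 hidden neurons
outputs = 1.0 / (1.0 + np.exp(-np.random.randn(150, 10)))
modulated = modulate_by_rank(outputs, beta=0.5)
```

As beta grows, the effective number of active neurons shrinks toward a single winner, which mirrors the information increase described above.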
Second, the information acquisition and use phases are separated, because it has proven difficult to achieve information maximization and error minimization at the same time. First, information content in input patterns is acquired; this information content is then used to train supervised neural networks. This eliminates the contradiction between information maximization and error minimization within the same learning process. The effectiveness of such separation has been demonstrated in the field of deep learning [
Finally, the relations between the present method and sparse coding should be noted as well. In deep learning, sparse representations, in which only a small number of components are nonzero and the majority are forced to be zero, play an important role. Sparse coding is said to be related to improved separability and interpretability and is biologically motivated [
Information-theoretic methods were originally developed to increase the information content that hidden neurons carry on input patterns. Various methods have been successfully applied to increase information content up to a certain quantity [
Though multilayered neural networks are assumed, the learning procedures are explained using a simple layered network, because the same procedures are repeated in the multilayered networks. Let
Network architecture for supervised learning.
The information or entropy function in (
Thus, this information increase is realized by using the actual outputs from hidden neurons. When the information becomes larger and entropy becomes smaller, a small number of hidden neurons tend to fire, while all the others become inactive. To realize this situation, winners among the hidden neurons are considered: the first winner is defined as the hidden neuron with the highest variance, the second winner has the second highest variance, and so on. Let
Information maximization is sometimes contradictory to error minimization. This means that when maximizing information, the errors between targets and outputs cannot easily be decreased. Recently, it has been shown that unsupervised learning is effective in training multilayered networks [
Network architecture for supervised learning with an information acquisition phase (a) and an information use phase (b).
Targets
The computational procedures for the information acquisition phase are explained here. The output from the output neuron of the autoencoder in Figure
The equation to be minimized is
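A natural candidate for this quantity, stated here as an assumption rather than the paper's exact formula, is the squared reconstruction error between inputs and autoencoder outputs:

```latex
E = \frac{1}{2} \sum_{s=1}^{S} \sum_{k=1}^{L} \left( x_k^{(s)} - \hat{x}_k^{(s)} \right)^2
```

where \(x^{(s)}\) is the \(s\)th input pattern, \(\hat{x}^{(s)}\) is its reconstruction by the autoencoder, \(S\) is the number of patterns, and \(L\) is the number of input neurons.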
In the information use phase, the connection weights obtained in the information acquisition phase are used as the initial weights. Let
First, the method was applied to the well-known Iris data set [
Network architecture with information acquisition phase (a) and information use phase (b) for the Iris problem.
First, an experiment was conducted to see whether our simplified method was effective in increasing information content:
Information for six hidden layers as a function of the parameter (panels for the 1st through 6th layers).
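The measured quantity can be computed directly from the hidden outputs. The following Python/NumPy sketch shows one standard formulation, offered as an assumption standing in for the paper's exact definition: normalized firing rates are treated as a probability distribution, and information is the gap between maximum and actual entropy.

```python
import numpy as np

def information(hidden_outputs, eps=1e-12):
    """Entropy-based information of hidden neurons.

    The mean activation of each neuron is normalized into a
    probability distribution p; information is measured as the
    maximum entropy log(M) minus the actual entropy, so it is
    largest when only a few neurons fire. This is one standard
    formulation, assumed here in place of the paper's equation.
    """
    mean_act = hidden_outputs.mean(axis=0)         # firing rate per neuron
    p = mean_act / (mean_act.sum() + eps)          # normalize to sum to 1
    entropy = -np.sum(p * np.log(p + eps))
    max_entropy = np.log(hidden_outputs.shape[1])  # log M
    return max_entropy - entropy

# Information rises as activations concentrate on a single neuron.
uniform = np.full((100, 6), 0.5)
peaked = np.hstack([np.full((100, 1), 0.9), np.full((100, 5), 0.01)])
print(information(uniform))  # ~0: all neurons equally active
print(information(peaked))   # well above 0, approaching log(6)
```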
As mentioned above, the information could be increased when the parameter
Generalization errors for six hidden layers as a function of the parameter (panels for the 1st through 6th layers).
Table
Summary of experimental results by the simplified information maximization for the first to sixth competitive layers for the Iris data. The simplified information maximization and standard multilayered neural networks are represented by “SIM” and “STD,” respectively.
| Methods | Layers | Average | Std. dev. | Max | Min |
|---|---|---|---|---|---|
| SIM | 1 | 0.029 | 0.019 | 0.057 | 0.000 |
|  | 2 | 0.031 | 0.025 | 0.086 | 0.000 |
|  | 3 | 0.029 | 0.030 | 0.086 | 0.000 |
|  | 4 | 0.029 | 0.019 | 0.057 | 0.000 |
|  | 5 | 0.026 | 0.025 | 0.057 | 0.000 |
|  | 6 | 0.026 | 0.021 | 0.057 | 0.000 |
| STD | 1 | 0.049 | 0.036 | 0.114 | 0.000 |
|  | 2 | 0.029 | 0.027 | 0.086 | 0.000 |
|  | 3 | 0.043 | 0.028 | 0.086 | 0.000 |
|  | 4 | 0.046 | 0.039 | 0.114 | 0.000 |
|  | 5 | 0.034 | 0.030 | 0.086 | 0.000 |
|  | 6 | 0.049 | 0.027 | 0.114 | 0.029 |
The second type of data was the bankruptcy data (
Network architecture with information acquisition phase (a) and information use phase (b) for the bankruptcy problem.
Figure
Information for four competitive layers as a function of the parameter (panels for the 1st through 4th layers).
Figure
Generalization errors for four hidden layers as a function of the parameter (panels for the 1st through 4th layers).
Table
Summary of experimental results by the simplified information maximization for the first to fourth hidden layers and by the method without the unsupervised information acquisition phase for the bankruptcy data. The simplified information maximization and standard multilayered neural networks are represented by “SIM” and “STD,” respectively.
| Methods | Layers | Average | Std. dev. | Max | Min |
|---|---|---|---|---|---|
| SIM | 1 | 0.247 | 0.107 | 0.400 | 0.100 |
|  | 2 | 0.207 | 0.068 | 0.300 | 0.100 |
|  | 3 | 0.217 | 0.065 | 0.300 | 0.100 |
|  | 4 | 0.233 | 0.093 | 0.333 | 0.100 |
| STD | 1 | 0.270 | 0.090 | 0.400 | 0.167 |
|  | 2 | 0.233 | 0.065 | 0.333 | 0.167 |
|  | 3 | 0.260 | 0.080 | 0.367 | 0.133 |
|  | 4 | 0.320 | 0.115 | 0.533 | 0.133 |
Finally, the method was applied to the rebel participation data set [
Network architecture with information acquisition phase (a) and information use phase (b) with 19 input, 100 hidden, and 2 output neurons for the rebel participation problem.
The experimental results then showed to what extent information content could be increased for the higher hidden layers. Figure
Information for eight hidden layers as a function of the parameter (panels for the 1st through 8th layers).
In the information use phase, connection weights obtained in the information acquisition phase were used to train multilayered neural networks. Figure
Generalization errors for eight competitive layers as a function of the parameter (panels for the 1st through 8th layers).
Here we present a summary of the results on generalization performance. As shown in Table
Summary of the experimental results for the rebel participation data by information maximization and the method without the unsupervised information acquisition phase for the first to eighth competitive layers. The simplified information maximization and standard multilayered neural networks are represented by “SIM” and “STD,” respectively.
| Method | Layer | Average | Std. dev. | Max | Min |
|---|---|---|---|---|---|
| SIM | 1 | 0.181 | 0.017 | 0.207 | 0.155 |
|  | 2 | 0.175 | 0.016 | 0.200 | 0.155 |
|  | 3 | 0.180 | 0.016 | 0.205 | 0.150 |
|  | 4 | 0.176 | 0.016 | 0.196 | 0.141 |
|  | 5 | 0.178 | 0.019 | 0.209 | 0.148 |
|  | 6 | 0.172 | 0.020 | 0.200 | 0.134 |
|  | 7 | 0.179 | 0.012 | 0.200 | 0.155 |
|  | 8 | 0.173 | 0.012 | 0.193 | 0.152 |
| STD | 1 | 0.189 | 0.016 | 0.218 | 0.166 |
|  | 2 | 0.184 | 0.020 | 0.209 | 0.157 |
|  | 3 | 0.187 | 0.017 | 0.211 | 0.164 |
|  | 4 | 0.190 | 0.017 | 0.216 | 0.166 |
|  | 5 | 0.214 | 0.025 | 0.264 | 0.180 |
|  | 6 | 0.216 | 0.023 | 0.252 | 0.184 |
|  | 7 | 0.221 | 0.016 | 0.255 | 0.202 |
|  | 8 | 0.225 | 0.021 | 0.271 | 0.198 |
The results presented in this study demonstrate that the simplified information acquisition method was effective in increasing information content, accompanied by improved generalization performance. The method is simpler than those that directly differentiate the information or entropy function [
One of the main findings of the analysis is that generalization performance did not degrade as the number of competitive layers grew. In all three experimental results in Tables
First, the close relation between the information increase and the number of hidden neurons should be pointed out. Figures
From an initial state to a maximum information state.
Second, the separation of the information acquisition and use phases was shown to be effective in improving generalization. As information maximization methods were originally developed for neural networks [
From this consideration, it can be concluded that the improved generalization, in particular for higher layers, is due to the flexible control of the number of hidden neurons and to the separation of the information-increasing process from error minimization.
There are at least three problems with the present method, namely, winner determination, how to determine the quantity of information, and drastic changes in information. First, there is the problem of determining winners. As explained, the winners can be determined by using the variance of outputs from hidden neurons: when the variance becomes larger, the degree of winning becomes larger. The variance was adopted heuristically; thus, other options are possible. For example, the simplest way to choose winners is based on the magnitude of hidden outputs: when the magnitude of a hidden neuron's outputs becomes larger, its degree of winning becomes larger as well. This hypothesis seems natural for determining winners. However, even if the magnitude of a hidden neuron's output is large, the neuron may not be important; for example, if all the output values of a neuron are the same, it is of no use. Thus, a method should be developed to determine the degree of winning more naturally, and further comparison of these candidate criteria, illustrated in the sketch below, is needed to justify the choice of winners.
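The contrast between the two candidate criteria can be made concrete with the following Python/NumPy sketch on a deliberately constructed case; it illustrates the argument above and is not part of the paper's method.

```python
import numpy as np

def degree_by_magnitude(hidden_outputs):
    """Mean output magnitude: large for strongly firing neurons."""
    return hidden_outputs.mean(axis=0)

def degree_by_variance(hidden_outputs):
    """Output variance: large only for neurons whose responses
    actually differ across input patterns."""
    return hidden_outputs.var(axis=0)

# Neuron 0 fires near 0.9 for every pattern (constant, hence
# uninformative); neuron 1 varies with the input pattern.
rng = np.random.default_rng(0)
constant = np.full((100, 1), 0.9)
varying = rng.uniform(0.1, 0.9, size=(100, 1))
outputs = np.hstack([constant, varying])

print(degree_by_magnitude(outputs))  # neuron 0 ranked first
print(degree_by_variance(outputs))   # neuron 1 ranked first
```

The magnitude criterion ranks the constant neuron first even though its output conveys nothing about the input, while the variance criterion discounts it, which is precisely the weakness of magnitude-based winners noted above.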
In addition, no explicit criteria exist to define the appropriate quantity of information content or the number of hidden neurons. As explained in the experimental results, the information is forced to increase as much as possible, and the relation between information and generalization is then examined. It would be very convenient to have some criterion to stop this information increase; this is related to determining the appropriate number of hidden neurons.
Third, the drastic changes in information during learning should be explained, in particular for the rebel data set in Figure
One of the main possibilities of the present method is that a number of different types of network architectures can be created simply by changing the information content. As mentioned, information maximization corresponds, in the end, to the firing of a single neuron. This means that the number of activated neurons can be changed by modifying the information content. Figure
A variety of architectures by the information-theoretic method.
This property is related to the production of appropriate network architectures for different problems. As is well known, the learning of multilayered networks can be facilitated by unsupervised learning [
In this paper, a new type of information-theoretic method to improve generalization performance was proposed. In the method, the complex procedures of information maximization are replaced by simpler ones: the method deals directly with the outputs from hidden neurons. In the process of information maximization, a small number of neurons actually become activated. This process is realized by activating neurons in accordance with the magnitude of their variance: the more important a neuron is (the larger its variance), the more strongly it is activated.
In addition, the information acquisition and use phases are separated. In the information acquisition phase, information content in hidden neurons is increased by producing a small number of active hidden outputs. In the information use phase, the information obtained in the acquisition phase is used to train the supervised network. As is well known, information maximization has sometimes been contradictory with error minimization in supervised learning. Separating the two phases showed that it is possible to reconcile error minimization and information maximization, because in each phase the network can focus solely on either information maximization or error minimization, as summarized in the sketch below.
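The two-phase procedure can be sketched end to end in Python/NumPy under simplifying assumptions: one hidden layer, sigmoid units, squared-error losses, and plain batch gradient descent. Layer sizes, learning rates, and epoch counts are placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.1, epochs=500):
    """Phase 1 (information acquisition): train an autoencoder
    to reconstruct the input patterns."""
    n_in = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))
    for _ in range(epochs):
        H = sigmoid(X @ W1)             # hidden outputs
        R = sigmoid(H @ W2)             # reconstruction
        dR = (R - X) * R * (1 - R)      # output-layer delta
        dH = (dR @ W2.T) * H * (1 - H)  # hidden-layer delta
        W2 -= lr * H.T @ dR / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1

def train_classifier(X, T, W1, lr=0.1, epochs=500):
    """Phase 2 (information use): keep the acquired hidden weights
    as the initial state and train the supervised output layer,
    fine-tuning W1 along the way."""
    n_hidden, n_out = W1.shape[1], T.shape[1]
    W3 = rng.normal(0, 0.1, (n_hidden, n_out))
    for _ in range(epochs):
        H = sigmoid(X @ W1)
        Y = sigmoid(H @ W3)
        dY = (Y - T) * Y * (1 - Y)
        dH = (dY @ W3.T) * H * (1 - H)
        W3 -= lr * H.T @ dY / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1, W3

# Toy usage: 4-dimensional inputs, 2 classes (one-hot targets)
X = rng.uniform(size=(100, 4))
T = np.eye(2)[(X.sum(axis=1) > 2).astype(int)]
W1 = train_autoencoder(X, n_hidden=8)
W1, W3 = train_classifier(X, T, W1)
```

Each phase optimizes a single objective: reconstruction error in phase 1 and classification error in phase 2, which is how the separation avoids the conflict between the two goals.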
The method was applied to the Iris, bankruptcy, and rebel participation data sets. Experimental results showed that an information increase and improved prediction performance were possible with the present method. In particular, for higher layers, the information increase was directly related to generalization performance, though some abrupt changes in information occurred during learning.
Though information-theoretic methods have been used extensively in neural networks, in particular to examine how neural networks acquire information content on input patterns, their learning rules have been too complicated for actual applications. The proposed method is simple enough to be applied to many problems, in particular to large-sized data sets. This opens up new possibilities for information-theoretic methods.
The author declares that there are no competing interests regarding the publication of this paper.