
The Generalization Complexity Measure for Continuous Input Data

Research Article

The Scientific World Journal (TSWJ), Hindawi Publishing Corporation. ISSN 1537-744X, 2356-6140. Article ID 815156. DOI: 10.1155/2014/815156

Iván Gómez¹ (http://orcid.org/0000-0001-7400-7860), Sergio A. Cannas², Omar Osenda², José M. Jerez, Leonardo Franco¹ (http://orcid.org/0000-0003-0012-5914)

Academic Editor: Zhao Liu

¹ Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain (uma.es)

² Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, 5000 Córdoba, Argentina (unc.edu.ar)

Received 18 December 2013; Accepted 5 March 2014; Published 10 April 2014

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We introduce in this work an extension of the generalization complexity measure to continuous input data. The measure, originally defined in Boolean space, quantifies the complexity of data in relation to the prediction accuracy that can be expected when using a supervised classifier such as a neural network or an SVM. We first extend the original measure to continuous functions. Then, using an approach based on the set of Walsh functions, we consider the case of a finite number of data points (input/output pairs), which is the usual practical case. Using a set of trigonometric functions, we construct a model that relates the size of the hidden layer of a neural network to the complexity. Finally, we demonstrate the application of the introduced complexity measure, through the generated model, to the problem of estimating an adequate neural network architecture for real-world data sets.

1. Introduction

Feed-forward neural networks trained by back-propagation have become a standard technique for classification and prediction tasks given their good generalization properties. However, the process of selecting an adequate neural network architecture for a given problem is still a controversial issue. Several important contributions regarding the number of hidden neurons needed to implement a given function in a neural architecture have been made using different methods. Baum and Haussler [1] obtained bounds on the number of neurons in an architecture, related to the number of training examples that can be learnt, for networks composed of linear threshold units. Barron [2] made an important contribution about the approximation capabilities of feed-forward networks, computing an estimate of the number of hidden nodes necessary to optimize the approximation error. Camargo and Yoneyama [3] obtained a result for estimating the number of nodes needed to implement a function using Chebyshev polynomials and previous results from Scarselli and Chung Tsoi [4] about the number of nodes needed for approximating a given function by polynomials. Hunter et al. [5] focused on the importance of selecting the learning algorithm in order to train architectures that are closer to optimal. Methods based on the geometry of output classes [6–8], singular value decomposition [9], information entropy [10], and the signal-to-noise ratio [11] have been used to obtain an approximation to the size of the hidden layer in a neural architecture.

Some of the previous studies tried to determine the adequate architecture depending on the complexity of the data set available for a given problem but, as expected, measuring the complexity of data is a difficult task. Firstly, it has to be clearly defined what exactly the measure tries to quantify, as complexity can be related to several aspects of the data. Although different complexity measures related to the size of the architectures needed to implement the data or to the complexity of learning have been proposed in the past [12–14], they have not been applied to the neural network architecture selection problem, presumably because they were not proposed with this focus.

Moreover, several approaches have been proposed within the learning theory area to analyze the relationship between generalization and complexity. Ho et al. [15, 16] studied the complexity that characterizes the difficulty of a classification problem and suggested using this value to guide the selection of a classifier. Sánchez et al. [17] tried to characterize the behavior of the k-NN rule under certain conditions. More specifically, their analysis focused on the use of data complexity measures to describe class overlapping, feature space dimensionality, and class density and to discover their relation with the practical accuracy of this classifier. Duch et al. [18] suggested that the identification of data sets with high complexity is important for testing new methods in computational intelligence.

However, most of these analyses focused on the complexity of the architectures and on the error obtained at the end of the training process rather than on the intrinsic complexity of the data. Recently, Franco and colleagues [19, 20] proposed a complexity measure named “generalization complexity” (GC) that aims to quantify the level of generalization ability that can be expected when Boolean data are used in a classification algorithm. The measure has also been used for architecture selection in the implementation of neural networks, as it is expected that larger architectures are more adequate for more complex data [21]. Nevertheless, the proposed measure can only be applied to Boolean input data. In this work, the Boolean generalization complexity is therefore first extended to the continuous input case, and a series of tests is then performed to validate the proposal using a set of continuous functions with parametrized complexity. Also, by using the set of orthonormal Walsh functions, we extend the proposal to data given as a set of patterns. Finally, a model is built from which it is possible to estimate an adequate feed-forward neural network architecture for real-world benchmark data sets by choosing the number of neurons to include in the hidden layer, since the sizes of the input and output layers are determined by the problem.

2. The Generalization Complexity Measure and Its Extension to Real Input Values

Our main goal in this work is to extend the GC measure, defined for Boolean functions $f:\{0,1\}^D\to\{0,1\}$, to functions with real input and real output, $f:[0,1]^D\to[-1,1]$. The choice of the intervals $[0,1]$ for the input and $[-1,1]$ for the output is arbitrary; it is made for simplicity and does not restrict the general case. We analyze the more general case of a continuous output because it can later be easily particularized to the Boolean output case, which is more closely related to classification problems.

The original definition of the GC measure [19, 20] comprises two terms accounting for the first and second nearest neighbor pairs of input data points ($\{e_i\}$), where the neighborhood is defined in terms of their Hamming distance. Let $N_{\text{ex}}$ be the total number of examples (or, equivalently, patterns) considered and $N_{\text{neigh}}$ the number of first nearest neighbors that every example $(e_i,f(e_i))$ has, that is, the examples at the closest Hamming distance. The first term of the GC measure, $C_1$, known to be the more influential, is defined in Boolean space as

$$C_1[f]=\frac{1}{N_{\text{ex}}N_{\text{neigh}}}\sum_{j=1}^{N_{\text{ex}}}\Biggl(\sum_{i:\,\text{Hamming}(e_i,e_j)=1}\lvert f(e_i)-f(e_j)\rvert\Biggr), \tag{1}$$

where the first factor is a normalization that takes into account the number of pairs considered. Essentially, (1) measures the proportion of neighboring pairs that have different outputs, that is, that belong to different output classes.
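As an illustration of (1), the following minimal sketch evaluates the first term for a Boolean function given by its complete truth table. It assumes that all $2^D$ inputs are available, so that every example has exactly $D$ first nearest neighbors; the function and variable names are chosen here for illustration only.

```python
from itertools import product

def boolean_c1(f, dim):
    """First GC term (1): fraction of Hamming-distance-1 pairs with different outputs."""
    inputs = list(product((0, 1), repeat=dim))
    total = 0
    for e in inputs:
        for k in range(dim):                      # flipping one bit gives a first nearest neighbor
            neighbor = e[:k] + (1 - e[k],) + e[k + 1:]
            total += abs(f(e) - f(neighbor))
    n_ex, n_neigh = len(inputs), dim              # assumption: full hypercube, so N_neigh = D
    return total / (n_ex * n_neigh)

parity = lambda e: sum(e) % 2
constant = lambda e: 0
majority = lambda e: int(sum(e) > len(e) / 2)

c1_parity = boolean_c1(parity, 3)      # every neighboring pair differs
c1_constant = boolean_c1(constant, 3)  # no neighboring pair differs
c1_majority = boolean_c1(majority, 3)  # an intermediate case
print(c1_parity, c1_constant, c1_majority)
```

The parity function reaches the maximum value of the term and a constant function the minimum, in line with the interpretation of (1) as a proportion of disagreeing neighbor pairs.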

In the previous equation, the distance between pairs of inputs is measured by the Hamming distance, but this measure is not applicable to real-valued input data. Instead, we opt for a straightforward choice and use the Euclidean distance. We consider first the 1-dimensional (1D) case corresponding to a single continuous input variable, starting by discretizing the input interval $[0,1]$ into $N$ subintervals of length $h=1/N$. In this way, a data point $e_i$ is identified by the subinterval $(x_{i-1},x_i]$ that contains its coordinate, where $x_i=ih$ ($i=1,2,\dots,N$), with $x_0=0$ and $x_N=1$. The total number of examples in the 1D case is equal to $N$, while, for an arbitrary dimension $D$, discretizing every variable in the same way leads to $N^D$ examples.

Let us define $f_i$ for 1D as the value of the function at the center of subinterval $i$, $f_i\equiv f((x_{i-1}+x_i)/2)$, and let us also assume that $d(e_i,e_j)\equiv\lvert x_i-x_j\rvert$ and $d_{\min}=\min\{d(e_i,e_j)\}=h$. For fixed $h$, we say that two input data points are first nearest neighbors if they are at distance $d_{\min}$ (the equivalent of Hamming distance 1 in Boolean space).

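In this way, (1) can be generalized as

$$\mathcal{C}_1[f]=\frac{1}{N_{\text{ex}}N_{\text{neigh}}\,\Delta f}\sum_{j=1}^{N_{\text{ex}}}\Biggl(\sum_{i:\,d(e_i,e_j)=d_{\min}}\lvert f(e_i)-f(e_j)\rvert\Biggr), \tag{2}$$

where $\Delta f=f_{\max}-f_{\min}$. For $D=1$ we can obtain the first term of the complexity measure, $\mathcal{C}_1[f]$, for continuous input data using a grid with $N$ subintervals:

$$\mathcal{C}_1[f]=\frac{1}{2N}\sum_{i=1}^{N}\lvert f_i-f_{i-1}\rvert, \tag{3}$$

where we used $\Delta f=2$, $N_{\text{ex}}=N$, and $N_{\text{neigh}}=2$, and substituted the sum over the two neighboring pairs by a forward sum over the sites. Defining the complexity measure density $\mathcal{C}_1'[f]\equiv\mathcal{C}_1[f]/d_{\min}$, we can write

$$\mathcal{C}_1'[f]=\frac{1}{2}\sum_{i=1}^{N}\left\lvert\frac{f_i-f_{i-1}}{h}\right\rvert h, \tag{4}$$

which in the limit $h\to 0$ ($N\to\infty$) converges to

$$\mathcal{C}_1'[f]\longrightarrow\frac{1}{2}\int_0^1\left\lvert\frac{df(x)}{dx}\right\rvert dx. \tag{5}$$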
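The convergence of the discrete sum (3), divided by $d_{\min}=h$, towards the integral (5) can be checked numerically. The sketch below is only an illustration under stated assumptions: the test function $\sin(2\pi x)$ is chosen here, for which (5) evaluates to 2, and the sum runs over the differences between adjacent cell centers that are actually available on the grid.

```python
import math

def discrete_c1_1d(f, n_cells):
    """Sum (3) using f at the cell centers and differences between adjacent cells."""
    h = 1.0 / n_cells
    values = [f((i + 0.5) * h) for i in range(n_cells)]
    variation = sum(abs(values[i] - values[i - 1]) for i in range(1, n_cells))
    return variation / (2 * n_cells)

def density_1d(f, n_cells):
    """Complexity density C1' = C1 / d_min, with d_min = h = 1 / n_cells."""
    return discrete_c1_1d(f, n_cells) * n_cells

f = lambda x: math.sin(2 * math.pi * x)     # illustrative test function, range [-1, 1]
estimates = {n: density_1d(f, n) for n in (10, 100, 1000)}
for n, value in estimates.items():
    print(n, round(value, 4))               # approaches (1/2) * integral of |f'| = 2
```

As the grid is refined, the estimate increases towards the value of the integral, since a coarse grid misses part of the variation of the function.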

In terms of notation, we use $C_1$ for the first term of the original Boolean GC measure, $\mathcal{C}_1$ for the discretized version for continuous functions, and $\mathcal{C}_1'$ for the continuous generalization complexity density (CGC).

Equation (5) is our proposal for the first term of the GC for continuous-valued input data for $D=1$. Clearly, this quantity is larger for more strongly fluctuating functions, as expected. For $D=2$, we have

$$\mathcal{C}_1[f]=\frac{h^2}{8}\sum_{i=1}^{N}\sum_{j=1}^{N}\Bigl[\lvert f_{i,j}-f_{i-1,j}\rvert+\lvert f_{i,j}-f_{i+1,j}\rvert+\lvert f_{i,j}-f_{i,j+1}\rvert+\lvert f_{i,j}-f_{i,j-1}\rvert\Bigr], \tag{6}$$

where $f_{i,j}$ is the value of the function within the square with coordinates $x=ih$, $y=jh$. The previous expression can be written more compactly as

$$\mathcal{C}_1[f]=\frac{h^2}{4}\Biggl(\sum_{i=1}^{N}\sum_{j=1}^{N-1}\lvert f_{i,j+1}-f_{i,j}\rvert+\sum_{j=1}^{N}\sum_{i=1}^{N-1}\lvert f_{i+1,j}-f_{i,j}\rvert\Biggr). \tag{7}$$

If $f$ takes alternately the maximum and minimum values ($\pm 1$) on neighboring sites, then $\mathcal{C}_1[f]=1$, taking care to count the difference between neighboring sites only once. Defining the complexity measure density $\mathcal{C}_1'[f]\equiv\mathcal{C}_1[f]/d_{\min}$ as before, and following the same steps, we get

$$\mathcal{C}_1'[f]=\frac{1}{4}\int_0^1 dx\int_0^1 dy\left[\left\lvert\frac{\partial f(x,y)}{\partial x}\right\rvert+\left\lvert\frac{\partial f(x,y)}{\partial y}\right\rvert\right]. \tag{8}$$

The above procedure can be straightforwardly generalized to arbitrary dimension $D$, obtaining

$$\mathcal{C}_1'[f]=\frac{1}{2D}\int_0^1 dx_1\int_0^1 dx_2\cdots\int_0^1 dx_D\sum_{i=1}^{D}\left\lvert\frac{\partial f(\vec{x})}{\partial x_i}\right\rvert. \tag{9}$$

We observe that (9) is not bounded; that is, there is no function with maximum complexity. This seems to be an intrinsic difficulty, as for a real function the number of maxima and minima can grow indefinitely. In any case, (8) can be useful because it can measure complexities relative to a given function.
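The two-dimensional expressions can be checked in the same spirit. The following sketch evaluates the compact sum (7) on a grid of cell centers and divides by $d_{\min}=h$ to approximate (8). The test function $\sin(2\pi x)\sin(2\pi y)$ and the grid size are assumptions made here for illustration; for this function, evaluating (8) by hand gives $4/\pi$.

```python
import math

def discrete_c1_2d(f, n_cells):
    """Compact sum (7): forward differences along both axes on an n_cells x n_cells grid."""
    h = 1.0 / n_cells
    grid = [[f((i + 0.5) * h, (j + 0.5) * h) for j in range(n_cells)]
            for i in range(n_cells)]
    along_y = sum(abs(grid[i][j + 1] - grid[i][j])
                  for i in range(n_cells) for j in range(n_cells - 1))
    along_x = sum(abs(grid[i + 1][j] - grid[i][j])
                  for j in range(n_cells) for i in range(n_cells - 1))
    return h * h / 4 * (along_y + along_x)

f = lambda x, y: math.sin(2 * math.pi * x) * math.sin(2 * math.pi * y)
n_cells = 200
density = discrete_c1_2d(f, n_cells) * n_cells   # divide by d_min = h
analytic = 4 / math.pi                           # value of (8) for this particular f
print(round(density, 4), round(analytic, 4))
```

The discrete value lies slightly below the analytic one, and the gap shrinks as the grid is refined.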

2.1. Testing the Generalization Complexity on a Set of Continuous Functions

Having introduced an extension of the complexity measure for a set of continuously distributed data, (9) and (14), we now test the proposal using a set of trigonometric functions with parametrized complexity. The set in dimension $D$ is defined by

$$f_n^D(\vec{x})=\prod_{j=1}^{D}\sin(2\pi n x_j), \tag{15}$$

with $n$ taking integer values $n=1,2,\dots$, although real values can also be considered (e.g., $n=1/\lambda$). Dividing the $D$-dimensional hypercube with a grid of spacing $1/(2n)$ leads to a function that vanishes at the borders of the hypercubes of side $h=1/(2n)$ and takes alternately the values $\pm 1$ on nearest-neighbour cells. This function is precisely the well-known parity Boolean function, which has a very high complexity among the set of Boolean functions [19]. Measured by the first term of the GC measure, the parity function achieves the maximum complexity of 1, and thus, given a discretization spacing $h=1/N$, it makes sense to consider only values of $n$ up to a maximum value $n_{\max}=1/(2h)=N/2$.

From the definition of the first term ($\mathcal{C}_1'$) of the continuous GC measure (CGC), (9), the complexity of the set of trigonometric functions defined by (15) can be obtained:

$$\mathcal{C}_1'[f_n^D]=\frac{2^D n}{\pi^{D-1}}. \tag{16}$$

We observe that the complexity of the set of functions grows linearly with $n$, which is proportional to the density of points where the function vanishes, a sensible measure of the variation of the function.
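The step from (9) to (16) can be made explicit. The following is a sketch of the calculation, assuming integer $n$, so that both $\lvert\cos(2\pi n x)\rvert$ and $\lvert\sin(2\pi n x)\rvert$ integrate to $2/\pi$ over $[0,1]$:

```latex
\begin{aligned}
\mathcal{C}_1'[f_n^D]
 &= \frac{1}{2D}\sum_{i=1}^{D}\int_{[0,1]^D} 2\pi n\,\lvert\cos(2\pi n x_i)\rvert
    \prod_{j\neq i}\lvert\sin(2\pi n x_j)\rvert \, d\vec{x} \\
 &= \frac{1}{2D}\cdot D\cdot 2\pi n\cdot\frac{2}{\pi}\cdot\left(\frac{2}{\pi}\right)^{D-1}
  = \frac{2^{D}\,n}{\pi^{D-1}} .
\end{aligned}
```

For $D=1$ this gives $2n$ and for $D=2$ it gives $4n/\pi$.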

The family of functions (15) can be generalized to consider different variation indexes according to the spatial direction, namely,

$$f_{\mathbf{n}}^D(\vec{x})=\prod_{j=1}^{D}\sin(2\pi n_j x_j), \tag{17}$$

where $\mathbf{n}=(n_1,n_2,\dots,n_D)$. The complexity $\mathcal{C}_1'$ can also be easily computed and leads to

$$\mathcal{C}_1'[f_{\mathbf{n}}^D]=\frac{2^D}{\pi^{D-1}}\,\frac{1}{D}\sum_{j=1}^{D}n_j. \tag{18}$$

We use the family of functions (15) to compare the behavior of the discrete and continuous complexity measures introduced in the previous section. To do so, we computed numerically the discrete complexities $\mathcal{C}_1$ and $\mathcal{C}_2$ as a function of $n/n_{\max}$ for $D=1$ and $D=2$, for a fixed value of the discretization $h$. Figure 1 shows the complexity values obtained for the continuous and discrete first terms ($h\mathcal{C}_1'$ and $\mathcal{C}_1$, resp.) for one and two dimensions (Figures 1(a) and 1(b)). For relatively low values of $n/n_{\max}$, that is, when $h\ll 1/(2n)$, the agreement is quite good, while for larger values the discrete version underestimates the true complexity. A similar behaviour is observed for both plotted dimensions, noting that as the dimension increases the maximum complexity decreases by a factor $2^D/\pi^{D-1}$ (cf. (18)).

The evaluation of the second term of the continuous complexity measure ($\mathcal{C}_2'$) is more cumbersome, but it can be obtained with the aid of numerical integration software. In particular, for $D=2$, the calculations lead to

$$\mathcal{C}_2'[f_n^2]=2\left(1+\frac{2}{\pi}\right)n. \tag{19}$$

Figure 2 shows the results for the second term of the complexity measure for the 2D set of functions. In the figure, $h\mathcal{C}_2'[f_n]$ and $\mathcal{C}_2[f_n]$ are shown as a function of $n/n_{\max}$. The continuous complexity $\mathcal{C}_2'$ grows linearly, in accordance with (19), and thus behaves differently from its discrete counterpart, which follows a nonmonotonic curve. The quadratic-like shape of $\mathcal{C}_2$ (in Boolean space) has been previously analyzed [19], and its independence from $\mathcal{C}_1$ does not carry over to the continuous case. The fact that the value of $\mathcal{C}_2'$ is proportional to $\mathcal{C}_1'$ (for the set of sinusoidal benchmark functions, cf. (15)) implies that the second term does not contain information independent of what is provided by the first term.
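The comparison between the discrete and continuous first terms in one dimension can be sketched numerically as follows. This is not the code used to produce Figure 1; it is a minimal illustration that assumes $N=100$ cells, samples $\sin(2\pi n x)$ at the cell centers, and uses the $D=1$ value $\mathcal{C}_1'=2n$ from (16) for the continuous term.

```python
import math

def discrete_c1(n, n_cells):
    """Discrete first term for sin(2*pi*n*x) sampled at the centers of n_cells cells."""
    h = 1.0 / n_cells
    values = [math.sin(2 * math.pi * n * (i + 0.5) * h) for i in range(n_cells)]
    return sum(abs(values[i] - values[i - 1]) for i in range(1, n_cells)) / (2 * n_cells)

n_cells = 100                 # N = 100 subintervals, h = 1/N
n_max = n_cells // 2          # n_max = N / 2
rows = []
for n in (1, 5, 10, 25, 40, 50):
    scaled_continuous = 2 * n / n_cells        # h * C1' with C1' = 2n for D = 1
    rows.append((n / n_max, discrete_c1(n, n_cells), scaled_continuous))

for ratio, discrete, continuous in rows:
    print(f"n/n_max={ratio:.2f}  discrete={discrete:.3f}  h*C1'={continuous:.3f}")
```

For small $n/n_{\max}$ the two columns nearly coincide, while for intermediate values the discrete term falls below the continuous one, because sampling on a grid can only miss variation of the function, never add to it.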

Figure 1

A comparison of the continuous and discrete versions of the first-order term of the generalization complexity for the $D=1$ (a) and $D=2$ (b) sets of functions from (15) using $N=100$. The discrete GC $\mathcal{C}_1$ is computed over a grid with spacing $h$, and so the continuous input complexity $\mathcal{C}_1'$ is plotted multiplied by $h$.

Figure 2

Comparison of the second terms of the complexities in their continuous and discrete versions, $h\mathcal{C}_2'$ and $\mathcal{C}_2$, for the two-dimensional set of trigonometric functions from (15) as a function of $n/n_{\max}$.

3. Use of Walsh Functions for Testing and Estimation of GC

The set of Walsh functions, introduced by Walsh in 1923 [22], is a set of orthonormal binary functions with continuous input. Walsh functions have been widely applied in signal processing [23, 24] and are also well known because of their relationship to the Hadamard transform [25]. The approach developed in the previous section cannot be applied to a set of patterns (the standard case for practical problems), as it requires knowing the analytic expression of the underlying function. In this section, we first compute the complexity of the set of Walsh functions, showing that it leads to sensible results for the estimation of the GC. After this test, we use the set of Walsh functions to approximate the GC for a set of patterns. The choice of the set of Walsh functions is motivated by the fact that the original GC defined in Boolean space can be computed almost straightforwardly for this set, given its discrete output. Also, the intrinsic discretization of the input space as the order of the Walsh functions is increased favors their application to continuous input problems.

3.1. The GC of the Set of Walsh Functions

The proposed complexity measure (9) can be applied to the set of Walsh functions by introducing an appropriate limiting procedure. Let us consider first the one-dimensional case, namely, the set of Walsh functions $W_n(x)$ defined on the real interval $[0,1]$, where the index $n=0,1,2,\dots$ is chosen so that it coincides with the number of nodes (sign changes) of the function. For instance, $W_0(x)=1$ for all $x$; $W_1(x)=1$ if $0\le x<1/2$ and $W_1(x)=-1$ if $1/2\le x<1$; and so forth.

We introduce a set of continuous parametric functions $G_n(x,\beta)$ to approximate the Walsh functions. $G_n(x,\beta)$ can be constructed in such a way that it has the same nodes as $W_n(x)$, is differentiable in the neighborhood of all the nodes of $W_n(x)$, and satisfies $\lim_{\beta\to\infty}G_n(x,\beta)=W_n(x)$. The functions $G_n(x,\beta)$ can be constructed by combining sigmoidal functions centered at the nodes of $W_n(x)$ and constant functions taking values $\pm 1$ between them, joined smoothly by any interpolation procedure, such as a spline or polynomial method. Figure 3 shows two Walsh functions approximated by using hyperbolic tangent functions combined with constant ones.
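One simple way to realize such a family can be sketched as follows. This is an illustrative construction chosen here, not necessarily the one used for Figure 3: instead of joining sigmoids and constants by interpolation, it multiplies one hyperbolic tangent step per node, which also has the right nodes and tends to $\pm 1$ between them as $\beta$ grows. The node positions assume that $W_2$ changes sign at $x=1/4$ and $x=3/4$.

```python
import math

def smooth_walsh(x, nodes, beta):
    """Hypothetical G(x, beta): product of -tanh steps, one per node; +1 left of the first node."""
    value = 1.0
    for node in nodes:
        value *= -math.tanh(beta * (x - node))
    return value

def complexity_density(nodes, beta, n_grid=4000):
    """(1/2) * integral of |dG/dx| over [0, 1], approximated by summed increments of G."""
    gs = [smooth_walsh(i / n_grid, nodes, beta) for i in range(n_grid + 1)]
    return 0.5 * sum(abs(gs[i] - gs[i - 1]) for i in range(1, len(gs)))

w1_nodes = [0.5]            # W1 changes sign once, at x = 1/2
w2_nodes = [0.25, 0.75]     # assumed nodes of W2: two sign changes
for beta in (5, 20, 200):
    print(beta,
          round(complexity_density(w1_nodes, beta), 4),
          round(complexity_density(w2_nodes, beta), 4))
```

As $\beta$ increases, the computed density approaches the number of nodes of the corresponding Walsh function.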

Figure 3

Approximation of two Walsh functions ($W_1(x)$ and $W_2(x)$) using hyperbolic tangent functions combined with constant ones ($G_1(x)$ and $G_2(x)$). Panels (a) and (b).

Let us consider, for simplicity, a finite set of Walsh functions up to order $N=2^m$ (for some fixed integer $m$). Then, the locations of the nodes of every one of these functions belong to the set of values $x_i^*=i/N$, $i=1,2,\dots,N-1$. Let $[a,b]$ be an arbitrary interval enclosing only one particular node $x_i^*$. Then the following properties hold:

$$\lim_{\beta\to\infty}\int_a^b\frac{\partial G_n(x,\beta)}{\partial x}\,dx=\lim_{\beta\to\infty}\bigl(G_n(b,\beta)-G_n(a,\beta)\bigr)=\pm 2,\qquad \lim_{\beta\to\infty}\frac{\partial G_n(x,\beta)}{\partial x}=0\ \text{ if } x\neq x_i^*. \tag{20}$$

Hence, we can write

$$\frac{\partial G_n(x,\beta)}{\partial x}=2\sum_{i=1}^{N-1}g_i^n\,\phi(x-x_i^*,\beta), \tag{21}$$

where the coefficients $g_i^n$ take the value 0 if $W_n$ has no node at $x_i^*$ and $\pm 1$ otherwise, and $\phi(x,\beta)$ is a real function sharply peaked around $x=0$ which satisfies $\lim_{\beta\to\infty}\phi(x,\beta)=\delta(x)$, $\delta(x)$ being the Dirac delta function [26]. Then, we can define the complexity of the Walsh functions as

$$\mathcal{C}_1'[W_n(x)]\equiv\lim_{\beta\to\infty}\mathcal{C}_1'[G_n(x,\beta)]. \tag{22}$$

From (5), (21), and (22), it follows that $\mathcal{C}_1'[W_n(x)]=n$. The extension to higher dimensions is straightforward. Let $W_{\mathbf{n}}(\vec{x})=\prod_{j=1}^{D}W_{n_j}(x_j)$ be a $D$-dimensional Walsh function, where $\mathbf{n}=(n_1,\dots,n_D)$ is a set of one-dimensional Walsh indexes, defined as before. From (9) we obtain

$$\mathcal{C}_1'[W_{\mathbf{n}}(\vec{x})]=\frac{1}{D}\sum_{j=1}^{D}n_j. \tag{23}$$
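The indexing by number of nodes can be illustrated with a small table of discrete Walsh functions. The sketch below assumes one common construction, chosen here for illustration: the rows of a Sylvester-type Hadamard matrix, sorted by their number of sign changes. Row $n$ then has exactly $n$ nodes, which is the value of the one-dimensional complexity stated above.

```python
def hadamard(order):
    """Sylvester construction of a +1/-1 Hadamard matrix of size 2**order."""
    matrix = [[1]]
    for _ in range(order):
        matrix = ([row + row for row in matrix] +
                  [row + [-v for v in row] for row in matrix])
    return matrix

def sign_changes(row):
    """Number of nodes: positions where consecutive cell values differ."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)

def walsh_sequency(order):
    """Rows sorted by number of sign changes, so that row n has n nodes."""
    return sorted(hadamard(order), key=sign_changes)

walsh = walsh_sequency(3)                 # N = 8 functions on 8 cells of width 1/8
node_counts = [sign_changes(row) for row in walsh]

def walsh_value(n, x, table=walsh):
    """W_n(x) for x in [0, 1): the value on the grid cell that contains x."""
    cells = len(table)
    return table[n][min(int(x * cells), cells - 1)]

print(node_counts)
print(walsh[1], walsh[2])
```

With this ordering, `walsh[1]` is $+1$ on the first half of the interval and $-1$ on the second half, matching the definition of $W_1(x)$ given earlier.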

3.2. GC Estimation for a Set of Data Points Using the Base of Walsh Functions

Suppose that we want to compute the coefficients $C_n$ of the expansion of a given function $F$ on a set of Walsh functions $W_n(\vec{x})$ defined on $[0,1]^D$,

$$F(\vec{x})\simeq\sum_{n=0}^{N-1}C_nW_n(\vec{x}), \tag{24}$$

given a limited set of sampled data points $(f_j,\vec{x}_j)$, $j=1,\dots,M$. We estimate the coefficients by minimizing the squared error $S$:

$$S(\vec{C})=\sum_{j=1}^{M}\bigl[f_j-F(\vec{x}_j)\bigr]^2=\sum_{j=1}^{M}\Bigl[f_j-\sum_{n'=0}^{N-1}C_{n'}W_{n'}(\vec{x}_j)\Bigr]^2, \tag{25}$$

where $\vec{C}\equiv(C_0,C_1,\dots,C_{N-1})$.

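To find the minimum of the error function $S$, we compute its first derivative with respect to each coefficient and set it equal to 0:

$$\frac{\partial S}{\partial C_n}=-2\sum_{j=1}^{M}\Bigl[f_j-\sum_{n'=0}^{N-1}C_{n'}W_{n'}(\vec{x}_j)\Bigr]W_n(\vec{x}_j)=0, \tag{26}$$

from which

$$\sum_{j=1}^{M}f_jW_n(\vec{x}_j)=\sum_{n'=0}^{N-1}C_{n'}\sum_{j=1}^{M}W_{n'}(\vec{x}_j)W_n(\vec{x}_j). \tag{27}$$

Define the vector $\vec{A}\equiv(a_0,a_1,\dots,a_{N-1})$ as

$$a_n\equiv\sum_{j=1}^{M}f_jW_n(\vec{x}_j) \tag{28}$$

and the matrix $B=\{b_{n,n'}\}$ with

$$b_{n,n'}\equiv\sum_{j=1}^{M}W_{n'}(\vec{x}_j)W_n(\vec{x}_j). \tag{29}$$

Equation (27) then takes the linear form $\vec{A}=B\vec{C}$, whose solution is given by

$$\vec{C}=B^{-1}\vec{A}. \tag{30}$$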
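The estimation of the coefficients can be sketched in one dimension as follows. This is a minimal illustration under several assumptions made here: the Walsh functions are taken as the sorted rows of a Sylvester-type Hadamard matrix, the samples are synthetic and noise-free, every grid cell contains at least one sample (otherwise $B$ is singular), and the linear system $\vec{A}=B\vec{C}$ is solved by elimination rather than by forming $B^{-1}$ explicitly.

```python
import random

def walsh_table(order):
    """Walsh functions on 2**order cells: Sylvester-Hadamard rows sorted by sign changes."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return sorted(rows, key=lambda r: sum(a != b for a, b in zip(r, r[1:])))

def solve_linear(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting; assumes a nonsingular matrix."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_walsh(xs, fs, order):
    """Build A (28) and B (29) from samples (x_j, f_j), x_j in [0, 1), and solve A = B C."""
    table = walsh_table(order)
    size = len(table)
    basis = [[table[n][min(int(x * size), size - 1)] for n in range(size)] for x in xs]
    a_vec = [sum(f * w[n] for f, w in zip(fs, basis)) for n in range(size)]
    b_mat = [[sum(w[n] * w[m] for w in basis) for m in range(size)] for n in range(size)]
    return solve_linear(b_mat, a_vec)

random.seed(0)
table = walsh_table(3)
xs = [random.random() for _ in range(200)]              # synthetic sample locations
true_c = [0.0, 0.5, 0.0, -0.25, 0.0, 0.0, 0.0, 0.0]     # illustrative coefficients
fs = [sum(c * table[n][int(x * 8)] for n, c in enumerate(true_c)) for x in xs]
coeffs = fit_walsh(xs, fs, 3)
print([round(c, 6) for c in coeffs])
```

Because the synthetic data lie exactly in the span of the basis, the fitted coefficients reproduce the ones used to generate the samples; with noisy or undersampled data the solution would only be a least-squares approximation.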

A practical issue of the previous procedure is the computational cost involved: for $D$-dimensional input data, a matrix of size $N_h^D\times N_h^D$ has to be inverted (cf. (30)), where $N_h=1/h$ is set by the spacing $h$ used for the construction of the 1D set of Walsh functions. Nevertheless, this computation has to be done only once for given values of $D$ and $h$, being independent of the data.

Once the Walsh coefficients of a function (or data set) have been obtained, the CGC can be approximated by the same limiting procedure of the previous section. For instance, in one dimension we have

$$\text{CGC}_w[F]=\lim_{\beta\to\infty}\mathcal{C}_1'\Bigl[\sum_{n=0}^{N-1}C_nG_n(x,\beta)\Bigr]=\lim_{\beta\to\infty}\int_0^1\Bigl\lvert\sum_{n=0}^{N-1}C_n\sum_{i=1}^{N-1}g_i^n\,\phi(x-x_i^*,\beta)\Bigr\rvert dx=\sum_{i=1}^{N-1}\Bigl\lvert\sum_{n=0}^{N-1}C_n g_i^n\Bigr\rvert, \tag{31}$$

where we have used (21). For an expansion of a $D$-dimensional function on a finite set of $N$ Walsh functions $W_{\mathbf{n}}$ with $n_j=0,1,\dots,N'$ ($j=1,\dots,D$, $N=(N')^D$), we obtain similarly

$$\text{CGC}_w[F]=\frac{1}{D}\sum_{i=1}^{N'-1}\sum_{j=1}^{D}\Bigl\lvert\sum_{\mathbf{n}}C_{\mathbf{n}}\,g_i^{n_j}\Bigr\rvert, \tag{32}$$

where $\text{CGC}_w$ indicates the approximation of the CGC using the set of Walsh basis functions. We carried out an experiment to analyze the accuracy of the proposed approximation and obtained a graph similar to the one shown in Figure 1(a), indicating that the approximation works correctly. The fact that the graph obtained is almost identical to the one in Figure 1(a) is consistent with what can be expected, as both are discrete approximations of the continuous value of the complexity.
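The one-dimensional expression (31) can be evaluated directly from a vector of Walsh coefficients. The sketch below is an illustration under assumptions made here: the Walsh functions are the sorted rows of a Sylvester-type Hadamard matrix, and $g_i^n$ is taken as half the jump of $W_n$ across the node $x_i^*$, which is the reading suggested by (20) and (21).

```python
def walsh_table(order):
    """Walsh functions on 2**order cells: Sylvester-Hadamard rows sorted by sign changes."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return sorted(rows, key=lambda r: sum(a != b for a, b in zip(r, r[1:])))

def jump_coefficients(table):
    """g[i][n]: half the jump of W_n across the i-th interior node, a value in {-1, 0, +1}."""
    size = len(table)
    return [[(table[n][i + 1] - table[n][i]) // 2 for n in range(size)]
            for i in range(size - 1)]

def cgc_walsh(coeffs, order):
    """Right-hand side of (31): sum over nodes of |sum_n C_n g_i^n|."""
    g = jump_coefficients(walsh_table(order))
    return sum(abs(sum(c * g_in for c, g_in in zip(coeffs, row))) for row in g)

order = 3
size = 2 ** order
pure = [cgc_walsh([1.0 if m == n else 0.0 for m in range(size)], order)
        for n in range(size)]                      # single Walsh functions W_0 ... W_7
mixed = cgc_walsh([0.0, 0.5, 0.0, -0.25, 0.0, 0.0, 0.0, 0.0], order)
print(pure, mixed)
```

For a single Walsh function the result equals its index, consistent with the complexity of $W_n$ obtained in Section 3.1; for a mixture, jumps of different basis functions at the same node can partially cancel.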

4. Application to Real-World Input Data

In order to test the developed procedures in practice, we first construct a model based on the extension of the complexity measure proposed above and then apply this model to estimate an adequate neural network architecture for real-world problems. The model was estimated using the set of trigonometric functions defined by (15) for $D=4$. For each of the analyzed data sets, we calculated the complexity with the above method, finding values in the range between 0 and 0.5. The generalization ability was then computed for a set of single-hidden-layer neural architectures with between 2 and 50 neurons in the hidden layer, trained with the standard back-propagation algorithm. We chose the architecture that led to the lowest validation error, computed in a cross-validation procedure to avoid overfitting (early stopping). From the number of neurons obtained for each of the analyzed cases, a quadratic fit was applied to obtain the final model, shown in Figure 4 by the solid line.

Figure 4

The model constructed for $D=4$ input dimensions and its application to estimate an adequately sized neural network for three test benchmark functions. The continuous line represents the model estimated from a set of trigonometric functions of variable complexity, the blue dashed line indicates the size estimated by the model (using the Y-axis values), and the red dashed line is the best size obtained from exhaustive numerical simulations.

Figure 4 shows the application of the developed method, described in Section 3.2, to obtain the value of the CGC for a given data set. Using the constructed model (the solid line in Figure 4), it is then possible to use the obtained CGC value to estimate an adequate neural architecture to implement the function. The figure also shows the best architecture found by intensive numerical simulations (see Table 1 for the numerical values).

Table 1

Results of the application of the model constructed by approximating the CGC of 10 benchmark data sets from the UCI repository.

ID	Data set	CGCw	Nhest	NhBest
f1	Balance Scale^2,3,4,5	0.001	6.08	4
f2	Ecoli^2,4,6,7	0.03	7.1	10
f3	Blood^1,2,3,4	0.06	8.47	4
f4	TicTacToe^1,2,5,8	0.1	9.84	4
f5	Liver Disorders^1,2,5,6	0.11	10.8	4
f6	Mammographic^2,3,4,5	0.18	14.5	13
f7	Hayes-Roth^2,3,4,5	0.22	16.9	4
f8	Spectf^2,3,5,7	0.26	17.8	23
f9	Vertebral Column^1,2,3,4	0.36	26.6	24
f10	Haberman^1,2,3,4	0.43	31.8	26

The table shows the identifier of the function, the name of the data set with superscripts indicating the 4 input variables used, the estimated CGCw, the estimated size of an adequate neural network according to the model (Nhest), and the best architecture found from intensive simulations (NhBest).

Table 1 shows the results obtained by applying the developed method to 10 four-dimensional benchmark data sets. The data sets are taken from the UCI repository, and for each problem 4 input variables were selected. The columns show the identifier of the function, the name of the benchmark data set with the 4 input variables used (indicated as a superscript), the estimated generalization complexity obtained from (32), the number of neurons in the hidden layer estimated by the model (Nhest), and the best number of neurons found from exhaustive simulations (NhBest). The results show a good correlation between the estimated and best found values (ρ=0.84, P value = 0.002), supporting the validity of the approach, even though there are some cases, such as the function indicated in the table by f7, for which the estimate is not accurate. Nevertheless, some discrepancies are always to be expected, as choosing an adequate neural architecture is a complex problem with no exact solution: it depends on the particular set of patterns presented and on the training process used, and thus it is an intrinsically noisy process.
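The reported correlation can be reproduced from the values listed in Table 1. The sketch below assumes that ρ denotes the Pearson correlation coefficient between the Nhest and NhBest columns, which the text does not state explicitly; the t statistic shown is the usual one for testing a correlation with $n-2$ degrees of freedom.

```python
import math

# Values copied from Table 1 (columns Nhest and NhBest, rows f1 to f10).
nh_est = [6.08, 7.1, 8.47, 9.84, 10.8, 14.5, 16.9, 17.8, 26.6, 31.8]
nh_best = [4, 10, 4, 4, 4, 13, 4, 23, 24, 26]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rho = pearson(nh_est, nh_best)
# t statistic for the null hypothesis of zero correlation, with n - 2 = 8 degrees of freedom
t_stat = rho * math.sqrt((len(nh_est) - 2) / (1 - rho ** 2))
print(round(rho, 2), round(t_stat, 2))
```

Under this assumption the coefficient computed from the table rounds to the value quoted in the text.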

5. Discussion and Conclusions

We have introduced in this work an extension of the generalization complexity (GC) measure to continuous input data. The analysis of the new measure on a set of trigonometric functions with parametrized complexity shows that the new proposal is consistent with the expected results and with the spirit of the original measure, as the GC essentially measures, for a set of data, the output variations as the inputs are modified. Nevertheless, a difference between the continuous and discrete cases exists regarding the role of the second term of the GC: in the continuous case this term is no longer independent of the first term (at least for the set of trigonometric functions), and thus it does not add extra information about the complexity of the data. We have also introduced an approach based on the set of Walsh functions for computing the CGC measure for data expressed as a set of patterns, the typical case in most practical applications. A model that relates architecture size to function complexity was fitted and then applied to the problem of selecting an adequate neural network architecture in ten real-world benchmark problems. The application of the method to the benchmark data shows that the estimated neural architectures are reasonably close to the best values found (ρ=0.84), indicating the suitability of the developed approach for the architecture selection problem. The method is more efficient than the trial-and-error alternative for choosing a proper neural network architecture, as the computationally heavy part of the procedure is a matrix inversion that has to be done only once for a given dimension and, once computed, can be reused with different data sets.

The GC measure provides an estimate of the complexity of the data and, as such, could be used not only for choosing an adequate architecture for neural networks but also with other predictive models (such as SVMs or decision trees), for example, for choosing the magnitude of the penalization term of the model complexity (regularization).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors acknowledge support from CICYT (Spain) through Grants TIN2008-04985 and TIN2010-16556 (including FEDER funds), from Junta de Andalucía through Grants P08-TIC-04026 and P10-TIC-5770, and from CONICET (Argentina) and SECyT Universidad Nacional de Córdoba (Argentina).

References

[1] Baum E. B., Haussler. What size net gives valid generalization? Neural Computation, 1990, 1(1), 151–160.

[2] Barron A. R. Approximation and estimation bounds for artificial neural networks. Machine Learning, 1994, 14(1), 115–133. doi:10.1007/BF00993164

[3] Camargo L. S., Yoneyama. Specification of training sets and the number of hidden neurons for multilayer perceptrons. Neural Computation, 2001, 13(12), 2673–2680. doi:10.1162/089976601317098484

[4] Scarselli, Chung Tsoi. Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Networks, 1998, 11(1), 15–37. doi:10.1016/S0893-6080(97)00097-X

[5] Hunter, Pukish M. S. III, Kolbusz, Wilamowski B. M. Selection of proper neural network sizes and architectures - a comparative study. IEEE Transactions on Industrial Informatics, 2012, 8(2), 228–240. doi:10.1109/TII.2012.2187914

[6] Mirchandani, Cao. On hidden nodes for neural nets. IEEE Transactions on Circuits and Systems, 1989, 36(5), 661–664.

[7] Arai. Bounds on the number of hidden units in binary-valued three-layer neural networks. Neural Networks, 1993, 6(6), 855–860.

[8] Zhang, Yang. Bounds on the number of hidden neurons in three-layer binary neural networks. Neural Networks, 2003, 16(7), 995–1002. doi:10.1016/S0893-6080(03)00006-6

[9] Bacauskiene, Cibulskis, Verikas. Selecting variables for neural network committees. In: Wang, Zurada J. M., Lu B.-L., Yin (eds.), Advances in Neural Networks - ISNN 2006, Lecture Notes in Computer Science, vol. 3971, pp. 837–842. Springer. doi:10.1007/11759966_123

[10] Yuan H. C., Xiong F. L., Huai X. Y. A method for estimating the number of hidden neurons in feed-forward neural networks based on information entropy. Computers and Electronics in Agriculture, 2003, 40(1–3), 57–64. doi:10.1016/S0168-1699(03)00011-5

[11] Liu, Starzyk J. A., Zhu. Optimizing number of hidden neurons in neural networks. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA '07), February 2007, Anaheim, Calif, USA, ACTA Press, pp. 121–126.

[12] Wegener. The Complexity of Boolean Functions. John Wiley & Sons, 1987.

[13] Hastad. Almost optimal lower bounds for small depth circuits. Advanced Computer Research, 1989, 5, 143–170.

[14] Parberry. Circuit Complexity and Neural Networks. MIT Press, 1994.

[15] Ho T. K., Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(3), 289–300. doi:10.1109/34.990132

[16] Basu, Ho T. K. Data Complexity in Pattern Recognition (Advanced Information and Knowledge Processing). Springer, New York, NY, USA, 2006.

[17] Sánchez J. S., Mollineda R. A., Sotoca J. M. An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Analysis and Applications, 2007, 10(3), 189–201. doi:10.1007/s10044-007-0061-2

[18] Duch, Jankowski, Maszczyk. Make it cheap: learning with o(nd) complexity. Proceedings of the International Joint Conference on Neural Networks (IJCNN '12), 2012, pp. 1–4.

[19] Franco. Generalization ability of Boolean functions implemented in feedforward neural networks. Neurocomputing, 2006, 70(1–3), 351–361. doi:10.1016/j.neucom.2006.01.025

[20] Franco, Anthony. The influence of oppositely classified examples on the generalization complexity of Boolean functions. IEEE Transactions on Neural Networks, 2006, 17(3), 578–590. doi:10.1109/TNN.2006.872352

[21] Gómez, Franco, Jerez J. M. Neural network architecture selection: can function complexity help? Neural Processing Letters, 2009, 30(2), 71–87. doi:10.1007/s11063-009-9108-2

[22] Walsh J. L. A closed set of normal orthogonal functions. The American Journal of Mathematics, 1923, 45, 5–24.

[23] Beauchamp K. G. Walsh Functions and Their Applications. Academic Press, 1975.

[24] Evans W. A. Sine-wave synthesis using Walsh functions. IEE Proceedings G, 1987, 134(1), 1–6.

[25] Pratt W. K., Kane, Andrews H. C. Hadamard transform image coding. Proceedings of the IEEE, 1969, 57, 58–68.

[26] Mickens R. E. Mathematical Methods for the Natural and Engineering Sciences. Series on Advances in Mathematics for Applied Sciences, vol. 65. World Scientific, 2004.