Neural Discriminant Models, Bootstrapping, and Simulation



Introduction
The neural network model is considered for the multiclass classification problem of assigning each observation to one of several classes; this model is referred to as a multiple-group neural discriminant model. As two-class problems are much easier to solve, we focus on neural networks for multiclass classification and use statistical techniques to derive the maximum likelihood estimators (MLE) [1][2][3][4][5][6][7]. These techniques are formulated in terms of the likelihood of the neural discriminant model, in which the connection weights of the network are treated as unknown parameters.
Besides the theoretical and empirical properties of bootstrapping [8,9] in the multiple-group neural discriminant model, there are at least two other reasons to use a bootstrap procedure. First, the criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. A number of model selection procedures (i.e., methods for the selection of the optimum number of hidden units), such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and cross-validation [10][11][12][13], have been proposed. The bootstrap method, however, also provides the percentile for the deviance, allowing evaluation of the overall goodness-of-fit and estimation of the bias of the excess error in prediction based on the selected model. Therefore, there is no extra cost for subsequent inference, because the bootstrap samples generated for model selection can be reused. If a model is selected by a cross-validation method and the bootstrap is used for the subsequent inference, the resampling for cross-validation incurs an extra computational cost. Second, the bootstrap procedures developed for the multiple-group neural discriminant model can be extended, without any theoretical derivation, to more complicated problems such as generalized additive models (GAM) [14,15], support vector machines (SVM) [16][17][18][19], and vector generalized additive models (VGAM) [20].
The remainder of this paper is organized as follows. In Section 2 we focus on the selection of the optimum number of hidden units and on the evaluation of the overall goodness-of-fit with the optimum number of hidden units. A neural network can approximate any reasonable function with arbitrary precision if the number of hidden units tends to infinity [21]. However, if the number of hidden units is increased, the output of the network fits the training sample too closely, and the noise is modeled in addition to the desired underlying function. The bootstrapping is also adapted in order to provide estimates of the bias of the excess error in a prediction rule constructed with training samples [22,23]. Simulated data from known (true) models are used in Section 3 to demonstrate the approximate realization of continuous mapping by neural networks. In Section 4 the methods are illustrated using a thyroid disease database in order to show that overfitting leads to poor generalization. Apartment house data for metropolitan area stations, with a four-class classification, are also analyzed in order to assess the bootstrapping by comparison with leaving-one-out CV. Finally, in Section 5 we discuss the relative merits and limitations of the methods.

Multiple-Group Neural Discriminant Model
2.1.1. Statistical Inference. The functional representation of the neural network model is considered, as shown in Figure 1. The connection weight between the ith unit in the input layer (i = 0, ..., I) and the jth unit in the hidden layer (j = 1, ..., H) is $\alpha_{ij}$. Similarly, the weight between the jth unit in the hidden layer (j = 0, ..., H) and the kth unit in the output layer (k = 1, ..., K) is $\beta_{jk}$. The input to the jth hidden unit is a linear projection of the input vector $x = (x_1, \ldots, x_I)$, that is,
$$u_j = \alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij} x_i,$$
where $\alpha_{0j}$ is a bias. This is the same idea as incorporating the constant term in the design matrix of a regression by including a column of 1's [1]. The output of the jth hidden unit is
$$y_j = f(u_j),$$
where $f(\cdot)$ is a nonlinear activation function. The most commonly used activation function is the logistic (sigmoid) function:
$$f(u) = \frac{1}{1 + \exp(-u)}.$$
The input to the kth output unit is
$$v_k = \beta_{0k} + \sum_{j=1}^{H} \beta_{jk} y_j,$$
where $\beta_{0k}$ is a bias. The activation function of the network outputs for the mutually exclusive groups can be achieved using the softmax activation (normalized exponential) function:
$$o_k = \frac{\exp(v_k)}{\sum_{k'=1}^{K} \exp(v_{k'})},$$
which can be regarded as a multiclass generalization of the logistic function.
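From (1)-(6), $o_k$ can be written in the form
$$o_k = \frac{\exp\left\{\beta_{0k} + \sum_{j=1}^{H} \beta_{jk}\, f\!\left(\alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij} x_i\right)\right\}}{\sum_{k'=1}^{K} \exp\left\{\beta_{0k'} + \sum_{j=1}^{H} \beta_{jk'}\, f\!\left(\alpha_{0j} + \sum_{i=1}^{I} \alpha_{ij} x_i\right)\right\}}. \tag{7}$$
The output $o_K$ for the Kth group can be calculated as
$$o_K = 1 - \sum_{k=1}^{K-1} o_k.$$
Because the outputs sum to one, only K − 1 of them are free. Thus, for K = 3, the number of units for the output layer is 2 (= K − 1). By setting the teach value
$$t_k^d = \begin{cases} 1 & \text{if the } d\text{th observation belongs to the } k\text{th group}, \\ 0 & \text{otherwise}, \end{cases}$$
the log likelihood function for the total sample size D is
$$\ln L(\theta; x, t) = \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln o_k^d, \tag{10}$$
where $t^d = (t_1^d, \ldots, t_K^d)$ and $o^d = (o_1^d, \ldots, o_K^d)$ are the teach and output vectors, respectively, for the dth observation, $t = (t^1, t^2, \ldots, t^D)$, $x = (x^1, x^2, \ldots, x^D)$, $\alpha = \{\alpha_{ij}\}$, and $\beta = \{\beta_{jk}\}$. As usual, the negative log likelihood gives the cross-entropy error function. The unknown parameters $\theta = \{\alpha, \beta\}$ can be estimated by maximizing the log likelihood ((10) with output (7)) by use of batch backpropagation including momentum, in which the initial values for the unknown parameters are chosen at random. With I inputs, H hidden units, and K − 1 output units, each hidden and output unit having a bias, the number of parameters included in the multiple-group neural discriminant model is
$$p = H(I + 1) + (K - 1)(H + 1).$$

Figure 1: Single hidden layer neural network model (input layer, logit transformation, hidden layer, softmax transformation, output layer).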
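The forward computation and the log likelihood described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (`logistic`, `forward`, `log_likelihood`), the weight layout (bias stored first in each row), and the toy weights are assumptions made here, not part of the original model description. The reference group is represented by a row of zero output weights, which corresponds to K − 1 free output units.

```python
import math

def logistic(u):
    """Logistic (sigmoid) activation f(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta):
    """Return the softmax outputs o_1..o_K for one input vector x.

    alpha[j] = [alpha_0j, alpha_1j, ..., alpha_Ij]  (bias first), one row per hidden unit
    beta[k]  = [beta_0k,  beta_1k,  ..., beta_Hk]   (bias first), one row per group
    """
    y = [logistic(a[0] + sum(w * xi for w, xi in zip(a[1:], x))) for a in alpha]
    v = [b[0] + sum(w * yj for w, yj in zip(b[1:], y)) for b in beta]
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(vk - m) for vk in v]
    s = sum(e)
    return [ek / s for ek in e]

def log_likelihood(X, T, alpha, beta):
    """ln L = sum_d sum_k t_k^d * ln(o_k^d), with one-hot teach vectors T."""
    total = 0.0
    for x, t in zip(X, T):
        o = forward(x, alpha, beta)
        total += sum(tk * math.log(ok) for tk, ok in zip(t, o) if tk > 0)
    return total

# Toy example with made-up weights: I = 2 inputs, H = 2 hidden units, K = 3 groups.
alpha = [[0.1, 0.5, -0.3], [-0.2, 0.4, 0.8]]
# The last row is fixed at zero (reference group), leaving K - 1 = 2 free output units.
beta = [[0.0, 1.0, -1.0], [0.2, -0.5, 0.5], [0.0, 0.0, 0.0]]
X = [[0.2, 0.7], [0.9, 0.1]]
T = [[1, 0, 0], [0, 0, 1]]
outputs = forward(X[0], alpha, beta)
ll = log_likelihood(X, T, alpha, beta)
```

The negative of `ll` is the cross-entropy error that a training procedure would minimize; the training step itself (batch backpropagation with momentum) is not shown.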

Determination of the Optimum Number of Hidden Units.
The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units. In conventional statistics, various criteria have been developed for assessing the generalization performance. AIC provides a decision as to which of several competing network architectures is best for a given problem. However, the usage of AIC may not be justified theoretically when considering a neural network as an approximation to an underlying model [7,24]. A bootstrap-type nonparametric resampling estimator of the Kullback-Leibler information, studied by Ishiguro and Sakamoto [25], Konishi and Kitagawa [26], Ishiguro et al. [27], Kullback and Leibler [28], Shibata [29], and Shao [30], can provide an alternative to AIC computed from a skewed discrete distribution. Let the training samples $X = \{X_1, X_2, \ldots, X_d, \ldots, X_D\}$, with $X_d = \{x^d, t^d\}$ and $x^d = \{x_1^d, x_2^d, \ldots, x_I^d\}$ for d = 1, 2, ..., D, be independently distributed according to an unknown distribution F. Let $\hat F$ be the empirical distribution function that places a mass equal to 1/D at each point $X_1, X_2, \ldots, X_d, \ldots, X_D$. We propose the bootstrap sampling algorithm given as follows.
Step 1. Generate B samples $X^*$, each of size D, drawn with replacement from the training sample X.
Step 2. For each bootstrap sample $X^*_b$, b = 1, 2, ..., B, fit a model to obtain the estimator $\hat\theta(X^*_b)$.
Step 3. The bootstrap estimator of bias $C^*$ is given as
$$C^* = \frac{1}{B} \sum_{b=1}^{B} \left[ \ln L\{X^*_b; \hat\theta(X^*_b)\} - \ln L\{X; \hat\theta(X^*_b)\} \right],$$
where $C^*$ is the average of the differences between the log likelihood on the bootstrap sample, $\ln L\{X^*_b; \hat\theta(X^*_b)\}$, and that on the training sample, $\ln L\{X; \hat\theta(X^*_b)\}$, given $\hat\theta(X^*_b)$. Thus, the Extended Information Criterion (EIC) proposed by Ishiguro et al. [27] is defined as
$$\mathrm{EIC} = -2 \ln L\{X; \hat\theta(X)\} + 2 C^*. \tag{12}$$
The EIC approach selects the number of hidden units with the minimum value of (12). Shibata [29] and Shao [30] point out that this method is asymptotically equivalent to leaving-one-out CV and AIC. Note that the bootstrap algorithm requires refitting of the model (retraining the network) B times [31]. The number of replications B is typically chosen in the range 20 ≤ B ≤ 200; here B = 200 bootstrap replications are used. The competing networks share the same architecture, the only exception being the number of hidden units.
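The three steps above can be sketched generically. In the sketch below, `fit` and `loglik` are hypothetical callables standing in for network training and for the log likelihood (10); to keep the example self-contained and fast, a smoothed class-proportion model is used in place of a neural network, so the numbers it produces are purely illustrative. The EIC is assumed to take the usual form of minus twice the maximized log likelihood plus twice the bootstrap bias estimate.

```python
import math
import random

def eic(sample, fit, loglik, B=200, seed=0):
    """Return (EIC, C*) using bootstrap resampling of whole observations."""
    rng = random.Random(seed)
    D = len(sample)
    theta_hat = fit(sample)
    bias_sum = 0.0
    for _ in range(B):
        # Step 1: draw D observations with replacement from the training sample.
        boot = [sample[rng.randrange(D)] for _ in range(D)]
        # Step 2: refit the model on the bootstrap sample.
        theta_b = fit(boot)
        # Step 3: log likelihood on the bootstrap sample minus that on the
        # training sample, both evaluated at the bootstrap estimate.
        bias_sum += loglik(boot, theta_b) - loglik(sample, theta_b)
    c_star = bias_sum / B
    return -2.0 * loglik(sample, theta_hat) + 2.0 * c_star, c_star

# Stand-in model (an assumption for illustration): smoothed class proportions.
LABELS = (0, 1, 2)

def fit_props(sample):
    n = len(sample)
    return {k: (sum(1 for t in sample if t == k) + 1.0) / (n + len(LABELS))
            for k in LABELS}

def loglik_props(sample, theta):
    return sum(math.log(theta[t]) for t in sample)

labels = [0] * 30 + [1] * 15 + [2] * 5
eic_value, c_star = eic(labels, fit_props, loglik_props, B=200)
```

To select the number of hidden units, the same routine would be run once per candidate architecture and the architecture with the smallest returned value kept; each run costs B refits.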

Bootstrapping the Deviance.
No standard procedure by which to assess the overall goodness-of-fit of the multiple-group neural discriminant model has been proposed. By introducing the maximum likelihood principle, the deviance allows us to test the overall goodness-of-fit of the model:
$$\mathrm{Dev} = 2\left[\ln L_f - \ln L(X; \hat\theta)\right],$$
where $\ln L(X; \hat\theta)$ denotes the maximized log likelihood under the current neural discriminant model and $\ln L_f$ denotes the log likelihood for the full model. Since $\ln L_f$ is zero by using the definition 0 ln 0 = 0, we have
$$\mathrm{Dev} = -2 \ln L(X; \hat\theta). \tag{14}$$
Note that the deviance is minus two times the maximized log likelihood (10). The greater the deviance, the poorer the fit of the model. However, the deviance given in (14) is not even approximately distributed as χ² for the case in which binary (Bernoulli) responses are available [32][33][34][35]. We therefore provide the bootstrap estimator of the percentile (i.e., the critical point) for the deviance given in (14) according to the following algorithm.
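The step from the general deviance to (14) can be spelled out as a short worked derivation. It assumes, as is standard for a saturated (full) model, that the fitted outputs reproduce the teach values exactly, so each term of the log likelihood is either 1 ln 1 or 0 ln 0; the symbol $L_f$ for the full-model likelihood is notation introduced here.

```latex
\ln L_f = \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln t_k^d = 0
\quad \text{since } 1 \ln 1 = 0 \text{ and } 0 \ln 0 := 0,
\qquad
\mathrm{Dev} = 2\left[\ln L_f - \ln L(X; \hat\theta)\right]
             = -2 \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln \hat o_k^d \;\ge\; 0 .
```

The deviance is therefore twice the cross-entropy error at the fitted weights, and it equals zero only if every observation is assigned probability one for its own group.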
Step 1. Generate B (= 200) bootstrap samples $X^*$ drawn with replacement from the training sample X, using the optimum number of hidden units determined as described in Section 2.1.2.
Step 2. For each bootstrap sample $X^*_b$, b = 1, 2, ..., B, the deviance given in (14) is computed as
$$\mathrm{Dev}^*(b) = -2 \ln L\{X^*_b; \hat\theta(X^*_b)\}.$$
This process is independently repeated B times, and the computed values are arranged in ascending order.
Step 3. The value of the jth order statistic of the B replications of Dev* can be taken as an estimator of the quantile of order j/(B + 1).
Step 4. The estimator of the 100(1 − α)th percentile (i.e., the 100α% critical point) of Dev* is used to test the goodness-of-fit of the model at a specified significance level α = 1 − j/(B + 1). If the value of the deviance given in (14) is greater than the estimate of the percentile, then the model fits poorly.
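Steps 3 and 4 amount to reading off an order statistic of the bootstrapped deviances. The sketch below assumes the B values of Dev* have already been computed; the function names are hypothetical, and rounding j up to the next integer when (1 − α)(B + 1) is not a whole number is a choice made here, not something the algorithm above prescribes.

```python
import math

def bootstrap_critical_point(dev_star, alpha=0.05):
    """Estimate the 100(1 - alpha)th percentile of the bootstrapped deviances.

    The j-th order statistic estimates the quantile of order j / (B + 1),
    so alpha = 1 - j / (B + 1) gives j = (1 - alpha) * (B + 1).
    """
    B = len(dev_star)
    ordered = sorted(dev_star)
    j = math.ceil((1.0 - alpha) * (B + 1))  # round up (assumption)
    j = min(max(j, 1), B)                   # keep j within 1..B
    return ordered[j - 1]

def fits_poorly(dev, dev_star, alpha=0.05):
    """True if the observed deviance exceeds the bootstrap critical point."""
    return dev > bootstrap_critical_point(dev_star, alpha)

# Artificial replications 1.0, 2.0, ..., 200.0, used only to show the indexing.
dev_star = [float(i) for i in range(1, 201)]
critical = bootstrap_critical_point(dev_star, alpha=0.05)
```

With B = 200 and α = 0.05, j = (0.95)(201) = 190.95, which is rounded up to the 191st order statistic in this sketch.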

Excess Error Estimation.
Let the error rate e(F; X) be the probability of incorrectly predicting the outcome of a new observation drawn from an unknown distribution F, given a prediction rule constructed on a training sample X. This error rate is defined as the actual error rate, which is of interest in the performance assessment of prediction rules.
Let $\hat F$ be the empirical distribution function that places a mass equal to 1/D at each point $X_1, X_2, \ldots, X_d, \ldots, X_D$. We apply a prediction rule η to this training sample X and form the realized prediction rule $\eta_{\hat F}(x_0)$ for a new observation $X_0 = \{x_0, t_0\}$. Let $Q(t_0, \eta_{\hat F}(x_0))$ indicate the discrepancy between an observed value $t_0$ and its predicted value $\eta_{\hat F}(x_0)$. Let the error rate $e(\hat F; X)$, referred to as the apparent error rate, be the probability of incorrectly predicting the outcome for a sample drawn from the empirical distribution $\hat F$ of the training sample. Because the training sample is used for both forming and assessing the prediction rule, this proportion (i.e., the apparent error rate) underestimates the actual error rate. The difference $e(\hat F; X) - e(F; X)$ is the excess error. The expected excess error (i.e., bias) of a given prediction rule [22,23,36,37] is
$$R = E_F\left\{ e(\hat F; X) - e(F; X) \right\}. \tag{16}$$
When the prediction rule given by the multiple-group neural discriminant model is allowed to be complicated, overfitting becomes a real danger, and excess error estimation becomes important. Thus we consider bootstrapping to estimate the expected excess error when fitting a multiple-group neural discriminant model to the data. The algorithm can be summarized as follows.
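Step 1. Generate bootstrap samples $X^*$ from $\hat F$ as described in Section 2.1.2. Let $\hat F^*$ be the empirical distribution of $X^*$.
Step 2. For each bootstrap sample $X^*$, fit a model to obtain the estimator $\hat\theta(X^*)$ and construct the realized prediction rule $\eta_{\hat F^*}$ based on $X^*$.
Step 3. The bootstrap estimator of the expected excess error in (16) is given by
$$R^* = e(\hat F^*; X^*) - e(\hat F; X^*),$$
where
$$e(\hat F^*; X^*) = \frac{1}{D} \sum_{d=1}^{D} Q\left(t^*_d, \eta_{\hat F^*}(x^*_d)\right), \qquad e(\hat F; X^*) = \frac{1}{D} \sum_{d=1}^{D} Q\left(t_d, \eta_{\hat F^*}(x_d)\right)$$
are the error rates of the rule $\eta_{\hat F^*}$ on the bootstrap sample and on the original training sample, respectively.
Step 4. Repeat Steps 1-3 for the bootstrap samples $X^*_b$, b = 1, 2, ..., B (= 200), to get $R^*_b$. The bootstrap estimator of the expected excess error can be obtained as
$$\hat R^* = \frac{1}{B} \sum_{b=1}^{B} R^*_b.$$
Step 5. The actual error rate with bootstrap bias correction is
$$\hat e_{\mathrm{boot}} = e(\hat F; X) - \hat R^*. \tag{20}$$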
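The excess error algorithm can be sketched with any classifier plugged in. Below, `fit_rule` is a hypothetical callable that returns a prediction rule; a one-dimensional nearest-class-mean rule and synthetic Gaussian data are used purely as stand-ins for the neural discriminant model and real data. The sign convention follows the text: each replication is the error on the bootstrap sample minus the error on the original sample, so the estimate is typically negative and the corrected rate is the apparent rate minus that estimate.

```python
import random

def error_rate(rule, sample):
    """Proportion of observations (x, t) for which rule(x) differs from t."""
    return sum(1 for x, t in sample if rule(x) != t) / len(sample)

def bootstrap_excess_error(sample, fit_rule, B=200, seed=0):
    """Return (apparent error, bootstrap excess error, bias-corrected error)."""
    rng = random.Random(seed)
    D = len(sample)
    r_star = []
    for _ in range(B):
        boot = [sample[rng.randrange(D)] for _ in range(D)]  # Step 1
        rule_b = fit_rule(boot)                              # Step 2
        # Step 3: error on the bootstrap sample minus error on the original sample.
        r_star.append(error_rate(rule_b, boot) - error_rate(rule_b, sample))
    r_hat = sum(r_star) / B                                  # Step 4
    apparent = error_rate(fit_rule(sample), sample)
    return apparent, r_hat, apparent - r_hat                 # Step 5

def fit_nearest_mean(sample):
    """Stand-in classifier (assumption): assign x to the nearest class mean."""
    sums = {}
    for x, t in sample:
        s, n = sums.get(t, (0.0, 0))
        sums[t] = (s + x, n + 1)
    means = {t: s / n for t, (s, n) in sums.items()}
    return lambda x: min(means, key=lambda t: abs(x - means[t]))

# Synthetic two-group data, for illustration only.
gen = random.Random(1)
data = [(gen.gauss(0.0, 1.0), 0) for _ in range(40)] + \
       [(gen.gauss(2.0, 1.0), 1) for _ in range(40)]
apparent, r_hat, e_boot = bootstrap_excess_error(data, fit_nearest_mean, B=200)
```

The same B bootstrap samples used for model selection could be reused here, which is the cost saving noted in the Introduction.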

Simulation Study
Since the model generally does not encompass unknown functions, but rather only approximations thereof, the model is inherently misspecified. Therefore, we present results from some Monte Carlo simulations to evaluate the performance. The criterion based on bootstrapping is demonstrated to be favorable when selecting the optimum number of hidden units [38][39][40][41]. Vach et al. [41] investigated how well specific regression functions from a given class can be approximated and pointed out that a comparison using members of this class is somewhat unfair. We thus show the superiority of the neural network model by using a function with several local extrema. The influence can be illustrated through a simple simulation using a neural network model with two inputs $x_1$ and $x_2$, because the contour plot of the unknown population can then be visualized.

Two-Class Classification.
The influence can be illustrated through a simple simulation using a neural network model with two inputs, one output, and a varying number of hidden units. For two independent continuous covariates $x_1$ and $x_2$, we simulated a known (true) model $f(x_1, x_2)$, referred to as (22). Training and test samples of size 1000 were considered in the present study. Input data $(x_1, x_2)$ are drawn from the uniform distribution over [0, 1] × [0, 1], and the binary response y is labeled 1 if $f(x_1, x_2) > 1/2$ and 0 otherwise. Figure 2 shows the distribution of the covariates $(x_1, x_2)$ and the class membership indicator in the training sample.
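The data-generating procedure described above can be sketched as follows. The true model (22) is not reproduced in this text, so `f_true` below is a hypothetical stand-in with several local extrema, chosen only so that the sketch runs; the sampling and labeling steps follow the description (uniform inputs on the unit square, label 1 when the function exceeds 1/2).

```python
import math
import random

def f_true(x1, x2):
    # Hypothetical stand-in with several local extrema; NOT the paper's model (22).
    return 0.5 + 0.5 * math.sin(2 * math.pi * x1) * math.sin(2 * math.pi * x2)

def simulate(n, seed):
    """Draw n inputs uniformly on [0, 1] x [0, 1] and label by f_true > 1/2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        y = 1 if f_true(x1, x2) > 0.5 else 0
        data.append(((x1, x2), y))
    return data

train = simulate(1000, seed=1)  # training sample of size 1000
test = simulate(1000, seed=2)   # independent test sample of size 1000
```

Using different seeds for the two samples keeps the test sample independent of the training sample, which is what makes the test error an honest estimate of the actual error rate.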
EIC values with B = 200 replications based on bootstrapping pairs for the training sample are shown in Figure 3 after fitting the neural discriminant models having one to five hidden units. For the purpose of comparison, the values of AIC and BIC are also provided. In the case of the simulation study, the known (true) model given in (22) is included in the population. Thus, the differences between the EIC and AIC values are slight.
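For reference, the two comparison criteria can be computed directly from the maximized log likelihood. The sketch below uses the standard definitions AIC = −2 ln L + 2p and BIC = −2 ln L + p ln D; the parameter count in `n_params` assumes a bias on every hidden and output unit and K − 1 output units, which is an assumption made here about the architecture of Figure 1.

```python
import math

def n_params(I, H, K):
    """Number of weights: H hidden units with I + 1 weights each,
    and K - 1 output units with H + 1 weights each (assumed layout)."""
    return H * (I + 1) + (K - 1) * (H + 1)

def aic(loglik, p):
    """Akaike information criterion from a maximized log likelihood."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, D):
    """Bayesian information criterion for a sample of size D."""
    return -2.0 * loglik + p * math.log(D)

# Example: 21 inputs, 2 hidden units, 3 groups.
p = n_params(I=21, H=2, K=3)
```

Because ln D exceeds 2 once D is larger than about 7.4, BIC penalizes each additional hidden unit more heavily than AIC for the sample sizes considered here.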
Using the simulated training sample, the feed-forward neural networks were fit to the known (true) model given in (22). The tendency of the mapping performed by neural networks with hidden units h = 1, 2, 3 to fit the function given in (22) implausibly can also be illustrated.
The bootstrap estimate of the 95th percentile of Dev* (i.e., the 5% critical point) for the training sample with four hidden units is Dev* = 203.10. Comparison with the deviance given in (14) (Dev = 39.40) suggests that the multiple-group neural discriminant model fits the data fairly well, because Dev = 39.40 is far below the 5% critical point Dev* = 203.10.
The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with four hidden units is calculated as e boot = 0.009. Figure 4 illustrates the apparent error rates observed in the training sample and the test sample-based error rates. The error rates for both samples decreased as the number of hidden units increased from h = 1 to h = 4 and then remained constant.

ISRN Artificial Intelligence
Figures 3 and 4 are based on only one simulated data set. However, the efficacy of the bootstrap procedures would be more convincingly illustrated in a simulation study based on multiple samples. Figure 5 shows the average values of EIC, AIC, and BIC based on multiple samples with 100 replications after fitting the neural discriminant models having one to five hidden units. Figure 6 shows the box-and-whisker plots for EIC in order to evaluate the standard errors and other statistics. Figure 7 illustrates the mean apparent error rates observed in multiple test samples with 100 replicates. Figure 8 also shows the box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates. For the purpose of comparison, the estimates of the actual error rates with bootstrap bias correction for the training sample [42] are also shown in Figure 7.
It is concluded that EIC identifies the optimal number of hidden units (i.e., 4) more often than AIC.In addition, the differences between the average values of EIC and AIC are somewhat similar to Figure 3, and the average values of the bootstrap-corrected estimate of the prediction error rate vary around the average apparent error rates for the multiple test samples.
In this paper, training and test samples of size 400 were considered. The apparent error rates for the training and test samples of several models are given in Table 1. From Table 1, it is found that the apparent error rates of the training and test samples for the multiple-group neural discriminant model are the smallest.

Results and Discussion
Prediction accuracy (error rate) is the most important consideration in the development of a prediction model. The assessment of goodness-of-fit is a useful exercise. In particular, the goodness-of-fit and the error rate from the training data are meaningful because of the overfitting issue. The main purpose is to predict future samples accurately. In other words, in real applications, the test sample population may be different from the training samples. Table 2 is a list of the first five observations for the 21 attributes and the group with respect to the training sample. The training sample is used to determine the neural network model structure. Table 3 is a list of the first five observations for the 21 attributes and the group with respect to the test sample. The goal of discrimination is to assign new observations to one of the mutually exclusive groups. The data in Tables 2 and 3 include six continuous and 15 binary attributes. Fisher's discriminant model assumes that the inputs are normally distributed. However, it is worth noting that the posterior class probabilities for the neural discriminant model can be obtained by maximizing the log likelihood (10) without the assumption of normally distributed inputs.
A thyroid disease database has been used as a benchmark test for the neural network model shown in Figure 1 with I = 21 and K = 3. EIC values are shown in Figure 9 after fitting the multiple-group neural discriminant models having one to four hidden units. In this case, the true model is not included in the population. For the purpose of comparison, AIC and BIC values are also provided.
Figure 9 indicates that the minimum EIC value is obtained for the model having two hidden units, which has an apparent error rate $e(\hat F; X)$ of 0.0090. Figure 10 shows a histogram of the bootstrap replications $R^*_b$ that are used to estimate the expected excess error. The mean and standard deviation of $R^*_b$ are −0.0033 and 0.0022, respectively. The actual error rate with the bootstrap bias correction given in (20) for the multiple-group neural discriminant model with two hidden units is calculated as e boot = 0.012.
Alternatively, suppose that the deviance (14) asymptotically follows the χ² distribution with D − p = 3772 degrees of freedom under the null hypothesis that the model is correct; the probability density function of the χ² distribution with 3772 degrees of freedom is shown in Figure 13. However, because of the large sample size D = 3772, the distribution is extremely skewed. By comparing Figure 13 with Figure 11, it is found that the mean and variance of the distribution of the deviance (14) are not close to those of the χ² distribution with 3772 d.f., that is, E[χ²] = d.f. = 3772 and Var[χ²] = 2 × d.f. = 7544. It should be noted that the deviance asymptotically follows the χ² distribution for a grouped binary (i.e., binomial) response and a set of predictor variables, as described in Tsujitani and Aoki [44].
The apparent error rates after fitting the multiple-group neural discriminant models having one to four hidden units are shown in Figure 14. Figure 14 indicates that (i) the multilayer feedforward neural network can approximate virtually any function up to some desired level of approximation for the training sample as the number of hidden units is increased ad libitum, (ii) the actual error rate for the test sample is the smallest when the number of hidden units is two, and (iii) a neural network with a large number of hidden units has a higher error rate for the test sample, because the noise is modeled in addition to the underlying function.
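The reported figures for the thyroid data can be checked with a short worked calculation. It assumes that the bias-corrected rate in (20) is the apparent error rate minus the mean of the bootstrap replications $R^*_b$, which is consistent with the numbers quoted above; the square root is added here only to give a sense of scale for the χ² reference distribution.

```latex
\hat e_{\mathrm{boot}} = e(\hat F; X) - \hat R^{*}
  = 0.0090 - (-0.0033) = 0.0123 \approx 0.012,
\qquad
E[\chi^2_{3772}] = 3772, \quad
\mathrm{Var}[\chi^2_{3772}] = 2 \times 3772 = 7544, \quad
\sqrt{7544} \approx 86.9 .
```

The correction raises the error estimate by about a third of the apparent rate, which illustrates how much the training sample understates the prediction error even for a model with only two hidden units.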
Although the model can be made to fit the training sample as closely as desired by increasing the number of hidden units, the model then does not generalize well to the test sample, and generalization is the goal. The apparent error rates for the training and test samples of several models are given in Table 4: (i) the multigroup logistic discriminant model with linear effects [6,45] by use of library{VGAM} in the free software R [15], (ii) multiple-group logistic discrimination models with linear + quadratic effects, (iii) the tree-based model with mincut = 5, minsize = 10, mindev = 0.01 as tuning parameters [46] by use of library{rpart} in R, (iv) the nearest neighbor smoother using a nonparametric method to derive the classification criterion [6,47] by use of library{knn} in R, (v) the kernel smoother [47] using a normal distribution and a radius r = 1.1 to specify a kernel density by use of library{ks} in R, (vi) the support vector machine using the "one-against-one" approach [48,49] by use of library{e1071} in R, (vii) the proportional odds model [14], and (viii) VGAM based on the proportional odds model with optimum smoothing parameters selected by leaving-one-out cross-validation [20] by use of library{VGAM} in R. From Table 4, it is found that the multiple-group neural discriminant model (h = 2) has the smallest error rate for the test sample while preserving a relatively small error rate for the training sample. In order to overcome the stringent assumption of additive and purely linear effects of the covariates, multiple-group logistic discrimination models with linear and quadratic effects were included. The improvement obtained by the inclusion of the quadratic effects is slight. It should be noted that the apparent error rate of VGAM for the training sample is the smallest, but that for the test sample is large. This overfitting leads to poor generalization. For example, the estimated smooth function of the covariate "age" for VGAM in Figure 15 shows the overfitting.
Table 5 gives apartment house data for the assessment of land value by metropolitan area station, with a four-class classification [50]. By using four covariates (average price of houses built for sale, average house rent, yield, and assessment of station value by the number of passengers getting on and off), the assessment of land value by metropolitan area station may be grouped into four categories: (i) the most comfortable, (ii) very comfortable, (iii) a little comfortable, and (iv) not comfortable.
Figure 16 indicates the values of EIC, AIC, and leaving-one-out CV (see the Appendix). The leaving-one-out CV is included in order to assess the bootstrapping. The minimum EIC and leaving-one-out CV values are obtained for the model having two hidden units. However, the number of hidden units with the minimum AIC value is three. The actual error rates in the case of using EIC and leaving-one-out CV with two hidden units are 0.276 and 0.273, respectively. The bootstrapping is thus assessed from the viewpoint of leaving-one-out CV. The apparent error rates for the training samples of several models are given in Table 6. From Table 6, it is found that the multiple-group neural discriminant model (h = 2) has the smallest error rate.

Conclusions
We discussed the learning algorithm based on maximizing the log likelihood function. Statistical inference based on the likelihood approach for the multiple-group neural discriminant model was discussed, and a method for estimating the bias of the expected log likelihood in order to determine the optimum number of hidden units was suggested. The key idea behind bootstrapping is to focus on the optimum tradeoff between the unbiased approximation of the underlying model and the loss in accuracy caused by increasing the number of hidden units. In the context of applying bootstrap methods to a multiple-group neural discriminant model, this paper considered three methods and performed experiments using two data sets to evaluate them. The three methods are the bootstrap pairs sampling algorithm, the goodness-of-fit statistical test, and the excess error estimation algorithm.
There are two broad limitations to our approach. First, the batch backpropagation algorithm including momentum may leave the maximum likelihood estimates trapped in a local minimum rather than the global minimum. So far, our discussion of neural networks has focused on maximum likelihood to determine the network parameters (weights and biases). However, a Bayesian neural network approach [51] might provide a more formal framework in which to incorporate a prior parameter distribution. Second, our neural network models assumed the independence of the predictor variables $x = (x_1, \ldots, x_I)$. More generally, it may be preferable to visualize interactions between predictor variables. Smoothing spline ANOVA models can provide an excellent means for handling data of mutually exclusive groups and a set of predictor variables [43,52]. We expect that flexible methods for discriminant models using machine learning theory [47][53][54][55], such as penalized smoothing splines and support vector machines [17][18][19], will be very useful in these real-world contexts.

Appendix Leaving-One-Out CV
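An alternative model selection strategy for the bias correction (14) of the log likelihood is leaving-one-out CV for a multiple-group neural discriminant model, which is asymptotically equivalent to TIC [29]. Let the training sample $X = \{X_1, X_2, \ldots, X_d, \ldots, X_D\}$ be independently distributed according to an unknown distribution. We then obtain the leaving-one-out CV algorithm.
Step 1. Generate the training samples $X_{[d]} = \{X_1, X_2, \ldots, X_{d-1}, X_{d+1}, \ldots, X_D\}$, d = 1, 2, ..., D. The subscript [d] of a quantity indicates the deletion of the dth data point $X_d$ from the training sample X.
Step 2. Using each training sample, fit a model. Then, estimate the unknown parameters, denoted by $\hat\theta(X_{[d]})$, and predict the output $o^d_{k[d]}$ for the deleted sample point $X_d$.
Step 3. The average predictive log likelihood of the deleted samples is
$$\frac{1}{D} \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln o^d_{k[d]}.$$
As a matter of convention, the cross-validation criterion is often stated as that of minimizing
$$\mathrm{CV} = -2 \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln o^d_{k[d]}.$$
The leaving-one-out CV criterion finds an appropriate degree of complexity by comparing the predictive probability for different model specifications. Anders and Korn [24] have shown that the CV criterion does not rely on any probabilistic assumption based on the properties of maximum likelihood estimators for misspecified models and is not affected by identification problems.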
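The leaving-one-out procedure can be sketched generically. As in the other sketches, `fit` and `pred_loglik` are hypothetical callables; a smoothed class-proportion model replaces the neural network so that the example is self-contained, and the criterion is assumed to be minus twice the summed predictive log likelihood of the deleted points.

```python
import math

def loo_cv(sample, fit, pred_loglik):
    """Return CV = -2 * sum of predictive log likelihoods of deleted points."""
    total = 0.0
    for d in range(len(sample)):
        held_out = sample[d]
        rest = sample[:d] + sample[d + 1:]        # Step 1: delete the d-th point
        theta_d = fit(rest)                       # Step 2: refit without it
        total += pred_loglik(held_out, theta_d)   # Step 3: score the deleted point
    return -2.0 * total

# Stand-in model (an assumption for illustration): smoothed class proportions.
LABELS = (0, 1, 2, 3)

def fit_props(sample):
    n = len(sample)
    return {k: (sum(1 for t in sample if t == k) + 1.0) / (n + len(LABELS))
            for k in LABELS}

def pred_loglik_props(t, theta):
    return math.log(theta[t])

labels = [0] * 10 + [1] * 8 + [2] * 6 + [3] * 4
cv_value = loo_cv(labels, fit_props, pred_loglik_props)
```

The procedure requires D refits, compared with B refits for the bootstrap, so for small D (as in the apartment house data) leaving-one-out CV is inexpensive, whereas for D in the thousands the bootstrap with B = 200 is cheaper.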

Step 4. Repeat Step 1 to Step 3 for the bootstrap samples $X^*_b$, b = 1, 2, ..., B (= 200), to get $R^*_b$. The bootstrap estimator of the expected excess error can be obtained as $\hat R^* = (1/B) \sum_{b=1}^{B} R^*_b$.

Figure 2: Contour plots: (a) darker grey scale levels represent lower probabilities of y = 0 and (b) • and • show the class membership indicators y = 0 and y = 1, respectively, for the covariates (x 1 , x 2 ).

Figure 3: EIC, AIC, and BIC values for the simulation using only one training sample (note that the series of EIC and AIC are indistinguishable).

Figure 4: Apparent error rates for simulated data after fitting neural networks with one to five hidden units.

Figure 8: Box-and-whisker plots for the mean apparent error rates in multiple test samples with 100 replicates.

Figure 9: EIC, AIC, and BIC values for the training sample of a thyroid disease database.

Figure 15: Estimated smooth function of the covariate "age" for VGAM.

Figure 16: EIC, AIC, and (leaving-one-out) CV values for the training sample of apartment house data.

Step 1 (continued). $X_{[d]} = \{X_1, X_2, \ldots, X_{d-1}, X_{d+1}, \ldots, X_D\}$, d = 1, 2, ..., D. The subscript [d] of a quantity indicates the deletion of the dth data point $X_d$ from the training sample X.
Step 2. Using each training sample, fit a model. Then, estimate the unknown parameters, denoted by $\hat\theta(X_{[d]})$, and predict the output $o^d_{k[d]}$ for the deleted sample point $X_d$.
Step 3. Compute the average predictive log likelihood of the deleted samples. As a matter of convention, the cross-validation criterion is often stated as that of minimizing $\mathrm{CV} = -2 \sum_{d=1}^{D} \sum_{k=1}^{K} t_k^d \ln o^d_{k[d]}$.

Table 2: List of the first five observations for 21 attributes and the group with respect to the training sample.

Table 4: Comparison of various discriminant methods for a thyroid disease database.

Table 5: Apartment house data for assessment of land value by the metropolitan area stations.

Table 6: Comparison of various discriminant methods for apartment house data.