Adaptive Linear and Normalized Combination of Radial Basis Function Networks for Function Approximation and Regression

This paper presents a novel adaptive linear and normalized combination (ALNC) method that combines component radial basis function networks (RBFNs) to improve function approximation and regression. The optimal fusion weights are obtained by solving a constrained quadratic programming problem. According to the instantaneous errors generated by the component RBFNs, the ALNC performs a selective ensemble of multiple learners by adaptively adjusting the fusion weights from one instance to another. Experiments on eight synthetic function approximation data sets and six benchmark regression data sets show that the ALNC method helps the ensemble system achieve higher accuracy (measured in terms of mean-squared error) and better fidelity (characterized by the normalized correlation coefficient) of approximation than the popular simple average, weighted average, and Bagging methods.


Introduction
Function approximation has been used in a variety of disciplines such as data mining, system identification, and forecasting [1]. Given a finite data set, the essential task of a function approximation problem is to interpret the appropriate relationship between multidimensional explanatory variables and the corresponding responses. Function approximation problems can be categorized into two major types [2]. First, for known target functions, approximation theory investigates how to estimate the parameters of certain functions or how to closely match a target function via a particular class of rational functions with some desirable properties (inexpensive computation, continuity, integral or differential, limit values, etc.) [3]. Second, when the specific expression of the target function is unknown, only a series of observations in the form of input-response pairs is available. To perform such an approximation, several numerical analysis techniques, for example, interpolation, extrapolation, regression analysis, and curve fitting, can be considered.
For two decades, artificial neural networks with the inner neurons activated by radial basis functions have been extensively applied in numerous practical applications [4-6]. The radial basis function network (RBFN) works by performing a nonlinear transformation from the inputs to a high-dimensional hidden space and produces the response through a linear output layer [7]. It has been shown that any continuous function on a compact interval can be approximated to an arbitrary accuracy by a well-devised RBFN with a sufficiently large number of hidden neurons [8, 9]. However, there is still no rigorous theoretical framework that specifies a routine for determining an optimal RBFN structure with the stated approximation properties. Furthermore, when dealing with high-dimensional training data, the RBFN approximation is subject to the risks of overfitting and the "curse of dimensionality" [10].
Recently, ensemble methods have been recommended by the machine learning community [11-16]. Based on the principle of "divide and conquer" [10], an ensemble system combines a finite number of component neural networks (CNNs) to provide a consensus decision, with the aim of achieving some favorable performance (a lower generalization error or a higher accuracy than any single learning machine acting alone) [17]. The schemes of ensemble systems commonly follow the generative and nongenerative styles [18]. The generative ensembles employ resampling or filtering techniques to produce training data sets with different underlying distributions. On the other hand, the nongenerative ensembles combine the CNNs trained from the same data set by using appropriate decision fusion strategies [15, 19] or combination rules [12, 14, 20].
The pioneering generative ensemble algorithms are Boosting [21] and Bagging (the acronym for "bootstrap aggregating") [22]. Boosting family algorithms repeatedly train a particular weak-learning machine with differently distributed training data sets and then combine the local decisions. Freund and Schapire [21] proposed the AdaBoost algorithm in order to find a typical mapping function or hypothesis with a low error rate in relation to a given probability distribution of the training data. Despite their effectiveness, the Boosting algorithms still have some drawbacks in regression tasks. First, the regression data have to be divided into many classification sets, such that the number of classification instances becomes considerably larger in the boosted iterations. Second, the cost function has to be modified from one iteration to another in order to adapt to the boosted data sets. Therefore, the Boosting algorithms are very sensitive to noisy data and outliers [23]. The Bagging algorithm, on the other hand, introduces the bootstrap resampling procedure [24] into the neural network aggregation, with the purpose of increasing the diversity [22]. By averaging the CNNs, the variance of the Bagging ensemble would become much smaller than that of any single CNN. However, according to the remarks in Breiman's work [22], Bagging stable component learners in an ensemble system can only slightly improve accuracy but would lead to greater computation expense.
Over the last decade, nongenerative ensemble methods have also received broad attention [20, 25-32]. Linear combinations are most frequently used in real-world applications, by virtue of their simplicity and modest computation expense. Opitz and Maclin [33] suggested using the simple average (SA) combination rule for regression and the majority vote (MV) rule for classification. Ueda [20] used optimal linear weights to combine neural network classifiers to improve classification performance. Fumera and Roli [25] provided a theoretical analysis of linear combinations and proposed a weighted average (WA) rule for multiple classifier systems. However, the effectiveness of the aforementioned linear combination methods may be more or less affected by their inherent flaws in design [32]. For example, the SA can only work well when the component learners have similar error rates, because it treats all the component learners equally. The linear weights derived by the WA method are based on the assumption that the component learners produce independent and identically distributed errors [34]. The theoretical superiority of the WA method is not yet guaranteed in practice, because the weight estimation may become rapidly skewed with small-size or noisy data sets [25]. Tresp and Taniguchi [35] suggested using the inverse of the variance, depending on the input variables, to generate nonconstant weights in the linear combination. The variance-based weighting method is able to dynamically adjust the fusion weights with respect to different distributions of input variables. However, Ueda [20] pointed out that such a method has some serious pitfalls in the minimization of classification errors.
On the other hand, the effectiveness of linear combinations may also be affected by some CNNs with poor performance. Zhou et al. [36] suggested that it may be better to select some CNNs with reliable performance over the training data to constitute an ensemble system. It is true that the CNNs with poor performance may mislead the ensemble system toward a lower accuracy, but this does not imply that these CNNs are entirely useless. For example, a CNN may provide an excellent generalization for some parts of the data set but fail for the other parts. Such a phenomenon usually occurs due to overfitting; that is, the neural network is overtrained with poor generalization. Now there arises a question: can we retain a CNN in the ensemble system only when it produces a good approximation for some parts of a function domain and discard it if it fails to attain the desired generalization accuracy? In the present study, we propose an adaptive linear and normalized combination (ALNC) method that can effectively combine the CNNs with dynamic and normalized fusion weights, which can be obtained by solving a constrained quadratic programming problem.

Adaptive Linear and Normalized Combination (ALNC)
Suppose that an ALNC ensemble system contains $K$ component RBFN approximators in total (see Figure 1). The $k$th component RBFN produces the approximation output, $f_k(\mathbf{x}_n)$, $k = 1, \ldots, K$, in response to the $n$th input instance vector $\mathbf{x}_n$, $n = 1, \ldots, N$. The ALNC system provides the ensemble approximation, $f_{\mathrm{ALNC}}(\mathbf{x}_n)$, by linearly combining the component RBFNs with the adaptive (instance-varying) normalized weights $w_k(\mathbf{x}_n)$, which can be formulated as

$$f_{\mathrm{ALNC}}(\mathbf{x}_n) = \sum_{k=1}^{K} w_k(\mathbf{x}_n)\, f_k(\mathbf{x}_n). \tag{1}$$

The fusion weights are subject to the nonnegativity and normalization constraints [32, 37-40], which can be written as

$$\sum_{k=1}^{K} w_k(\mathbf{x}_n) = 1, \qquad w_k(\mathbf{x}_n) \ge 0. \tag{2}$$

Similar to most of the linear combination methods reported in the literature, the aim of the ALNC ensemble system is to provide the optimal solution of the nonnegative and normalized weights. The difference between the approximation of the $k$th component RBFN, $f_k(\mathbf{x}_n)$, and the target function response, $f(\mathbf{x}_n)$, is measured with the squared error as

$$e_k^2(\mathbf{x}_n) = \left[ f_k(\mathbf{x}_n) - f(\mathbf{x}_n) \right]^2. \tag{3}$$

Then, the squared error of the ALNC ensemble approximation, $f_{\mathrm{ALNC}}(\mathbf{x}_n)$, is estimated in a similar way. Since the nonnegative weights are normalized in the ALNC ensemble, the target function response $f(\mathbf{x}_n)$ can be split and combined with the weighting multipliers, $w_k(\mathbf{x}_n)$; that is, $f(\mathbf{x}_n) = \sum_{k=1}^{K} w_k(\mathbf{x}_n) f(\mathbf{x}_n)$. Thus, the approximation error term of the ALNC ensemble system, $e_{\mathrm{ALNC}}(\mathbf{x}_n)$, is derived as follows:

$$e_{\mathrm{ALNC}}(\mathbf{x}_n) = f_{\mathrm{ALNC}}(\mathbf{x}_n) - f(\mathbf{x}_n) = \sum_{k=1}^{K} w_k(\mathbf{x}_n) f_k(\mathbf{x}_n) - \sum_{k=1}^{K} w_k(\mathbf{x}_n) f(\mathbf{x}_n) = \sum_{k=1}^{K} w_k(\mathbf{x}_n) \left[ f_k(\mathbf{x}_n) - f(\mathbf{x}_n) \right]. \tag{4}$$

By virtue of (2) and (4), the minimization of the ALNC ensemble error on the $n$th input data is equivalent to a constrained quadratic programming (CQP) problem (5), in which the weights $w_k(\mathbf{x}_n)$ are optimized subject to the constraints in (2).

We may use the Lagrange multiplier method [41] to solve this CQP problem by defining the cost function (6), in which the nonnegative coefficient $\lambda(\mathbf{x}_n)$ denotes the Lagrange multiplier, the value of which varies from one input data vector to another. According to the weak Lagrangian principle [41], the optimum solution, $\{\mathbf{w}^*(\mathbf{x}_n), \lambda^*(\mathbf{x}_n)\}$, is the stationary point of the cost function given in (6) and satisfies a unique set of equations (7) [40]. Then, the optimal solution of the ALNC weights, $w_k^*(\mathbf{x}_n)$, given in (8), can be derived by solving the CQP problem. Note that the optimal fusion weights are determined by the errors of the component RBFNs, when the RBFN parameters are specified and the target function response is given.

Concerning the error term of the ALNC ensemble system, substituting (8) into (4) yields the expression (9) in terms of the component errors. It is clear that $e_k^2(\mathbf{x}_n)$ and $e_{\mathrm{ALNC}}^2(\mathbf{x}_n)$ are both nonnegative; that is, $e_k^2(\mathbf{x}_n) \ge 0$ and $e_{\mathrm{ALNC}}^2(\mathbf{x}_n) \ge 0$. To compare the ALNC ensemble error with a component RBFN error, we may compute the ratio of the two (10). According to (10), it can be inferred that the ALNC ensemble system, with the optimal fusion weights, is more likely to outperform any of its component RBFNs.
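The following Python sketch illustrates one way in which nonnegative, normalized, instance-varying fusion weights could be computed from the instantaneous component errors. It rests on an assumption made here purely for illustration: that the quadratic objective is the weighted sum of squared component errors, with each term being the squared weight times the squared error of one component, under the normalization constraint. With that objective the Lagrange stationary point gives weights proportional to the inverse squared errors. This is not claimed to be the exact closed form used by the ALNC method. The names `alnc_weights`, `alnc_output`, and the tolerance `eps` are hypothetical.

```python
def alnc_weights(component_outputs, target, eps=1e-12):
    """Return nonnegative, normalized fusion weights for one input instance.

    Assumption (illustrative only): the objective is sum_k w_k**2 * e_k**2
    subject to sum_k w_k = 1, whose stationary point is w_k ~ 1 / e_k**2.
    """
    sq_errors = [(f_k - target) ** 2 for f_k in component_outputs]
    # If some components are (numerically) exact, share all weight among them.
    exact = [e2 <= eps for e2 in sq_errors]
    if any(exact):
        n_exact = sum(exact)
        return [1.0 / n_exact if is_exact else 0.0 for is_exact in exact]
    inverse = [1.0 / e2 for e2 in sq_errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def alnc_output(component_outputs, weights):
    """Weighted linear combination of the component outputs."""
    return sum(w * f_k for w, f_k in zip(weights, component_outputs))


if __name__ == "__main__":
    outputs = [1.2, 0.9, 2.0]  # hypothetical component outputs for one instance
    target = 1.0               # target function response for that instance
    w = alnc_weights(outputs, target)
    print(w, alnc_output(outputs, w))
```

Under this assumption a component with a large instantaneous error receives a weight close to zero, and a component that matches the target exactly receives all of the weight, which mirrors the selective behavior described in the text. Note that the weights depend on the target response, so they can only be computed where that response is known.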

Experiments
3.1. Data Description. The data sets tested in our experiments fall into two groups. The first eight data sets are synthetic functions without noise, among which the Zigzag, Rhythm, SinCos, and ExpSin are two-dimensional functions and the 3-D Mexican Hat, Gabor, SwingCos, and Exponential are multivariate functions with two independent variables. The details of the particular function expressions, domains, and sample sizes of these eight synthetic sets are specified in Table 1, in which U[a, b] indicates a uniform distribution over the interval from a to b.
The other six benchmark multivariate regression sets listed in Table 2 were obtained from the University of California at Irvine (UCI) machine learning repository [42] and the Carnegie Mellon University (CMU) StatLib library (available online at http://lib.stat.cmu.edu/datasets/), respectively.
Abalone. The task is to predict the age of abalone from seven physical measurements (the nominal attribute "sex" in the original UCI data set was not included in the set of input attributes in our experiments).
Housing. The Housing data set, which contains 2 integer and 11 continuous attributes, describes the housing values in suburbs of Boston.
Auto-MPG. This data set concerns city-cycle fuel consumption in miles per gallon. From the original data set, the string attribute "car name" (unique for each instance) and six instances with missing values were removed in our experiments.
Stock. The data set describes daily stock prices of 10 aerospace companies from January 1988 through October 1991. The task is to predict the stock price of the first company from the other nine.
Bolt. This is a relatively small data set retrieved from a trial on the effects of machine adjustments on the time to count bolts (a type of automotive accessory). The task is to predict

Table 1: Description of the synthetic data sets for function approximation. Columns: function name, target function expression, distribution of independent variables, and size of samples.

3.2. Settings of Component Radial Basis Function Networks. The details of the component RBFNs involved in the ALNC ensemble system are presented as follows. The number of sensory neurons in the input layer is equal to the dimension of the independent variables. Each hidden neuron applies a radial basis function kernel, where $\mathbf{c}_j$ denotes the center vector for the $j$th hidden neuron and $\sigma$ is the spread parameter that determines the width of the area in the input space to which each hidden neuron responds. The output layer is linear, and the responses of the component RBFNs are sent to the succeeding linear combination. We employed a total of 30 component RBFNs to approximate the synthetic functions. The first 10 component RBFNs had the same spread parameter of 1.0, and the number of hidden neurons increased from 1 to 10 across these networks. For the second 10 component RBFNs (with a spread of 2.0) and the remaining 10 component RBFNs (with a spread of 3.0), the numbers of hidden neurons increased from 11 to 20 and from 21 to 30, respectively. In the regression experiments, the ensemble system combined three component RBFNs. The number of hidden neurons was equal to the dimension of the input independent variables, and the spread parameter varied from 1.0 to 3.0 across the networks. Each component RBFN was trained with the orthogonal least-squares algorithm [43], which offers a systematic method for center selection and significantly reduces the complexity of the RBFN.
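The structure described above can be sketched in Python as follows. The sketch assumes a Gaussian kernel of the form exp(-||x - c_j||^2 / (2 sigma^2)); the exact kernel normalization is an assumption, and the function names `gaussian_rbf` and `rbfn_output`, the `bias` term, and the `configs` list are hypothetical. Training with the orthogonal least-squares algorithm is not shown: the centers and output weights are taken as given.

```python
import math


def gaussian_rbf(x, center, spread):
    """Gaussian kernel exp(-||x - c||^2 / (2 * spread^2)) (assumed form)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * spread ** 2))


def rbfn_output(x, centers, spread, out_weights, bias=0.0):
    """Hidden layer of radial basis units followed by a linear output layer."""
    hidden = [gaussian_rbf(x, c, spread) for c in centers]
    return bias + sum(w * h for w, h in zip(out_weights, hidden))


# (spread, number of hidden neurons) for the 30 components used on the
# synthetic functions: spread 1.0 with 1..10 neurons, spread 2.0 with
# 11..20 neurons, and spread 3.0 with 21..30 neurons.
configs = ([(1.0, n) for n in range(1, 11)]
           + [(2.0, n) for n in range(11, 21)]
           + [(3.0, n) for n in range(21, 31)])

if __name__ == "__main__":
    # Two hypothetical hidden neurons on a one-dimensional input.
    print(rbfn_output([0.5], centers=[[0.0], [1.0]], spread=1.0,
                      out_weights=[1.0, -1.0]))
    print(len(configs))
```

A larger spread makes each hidden neuron respond to a wider region of the input space, which is why the three groups of components differ in how locally they fit the target function.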

3.3. Other Experiment Settings. For the purpose of approximation performance comparison, we also implemented the popular SA and WA combination rules on the synthetic data sets and the Bagging algorithm on the benchmark regression data sets. In the regression experiments, the ALNC method used the same bootstrap resampling procedure as the Bagging for a fair comparison. The ALNC ensemble with such a data preprocessing procedure is referred to as Bootstrap-ALNC hereafter.
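As a minimal sketch of the bootstrap resampling step shared by the Bagging and the Bootstrap-ALNC, the Python code below draws resampled training sets with replacement and averages the component outputs as Bagging does for regression. The base learner `fit_mean` is a deliberately trivial stand-in for an RBFN, and all function names are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly with replacement."""
    return [data[rng.randrange(len(data))] for _ in range(len(data))]


def fit_mean(sample):
    """Toy stand-in for an RBFN: predicts the mean response of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


def bagging_predict(models, x):
    """Bagging for regression: plain average of the component outputs."""
    return sum(m(x) for m in models) / len(models)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(x / 10.0, (x / 10.0) ** 2) for x in range(20)]
    # Three components, each trained on its own bootstrap replicate.
    models = [fit_mean(bootstrap_sample(data, rng)) for _ in range(3)]
    print(bagging_predict(models, 0.5))
```

In a Bootstrap-ALNC style ensemble only the final step would differ: the plain average would be replaced by a combination with instance-varying weights.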
The computer programs were run on a laptop with a 1.86 GHz CPU and 1.5 GB of RAM. Each experiment was repeated 50 times to provide statistically meaningful results. In order to compare the computational efficiency, we also recorded the computation time consumption (CTC), in milliseconds (ms), of each fusion method.
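One simple way to record an elapsed-time measure of this kind over repeated runs is sketched below in Python. This is an illustration under stated assumptions, not the timing procedure used in the experiments: the helper name `time_ms` is hypothetical, and wall-clock time from `time.perf_counter` is used as a stand-in for CPU time.

```python
import statistics
import time


def time_ms(func, repeats=50):
    """Run func `repeats` times; return (mean, std) of elapsed time in ms."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed), statistics.pstdev(elapsed)


if __name__ == "__main__":
    mean_ms, std_ms = time_ms(lambda: sum(i * i for i in range(10000)))
    print(f"{mean_ms:.3f} ms (std {std_ms:.3f} ms)")
```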

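3.4. Quantitative Performance Evaluation Criteria. The approximation accuracy in our experiments was measured in terms of the mean-squared error (MSE):

$$\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left[ f(\mathbf{x}_n) - \hat{f}(\mathbf{x}_n) \right]^2,$$

where $f(\mathbf{x}_n)$ is the target response for the $n$th of the $N$ input instances and $\hat{f}(\mathbf{x}_n)$ denotes the approximation of the component RBFNs, the Bagging, or Bootstrap-ALNC. The similarity between the approximator output and the target function, referred to as the approximation fidelity, was computed with the normalized correlation coefficient (NCC).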
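The two criteria can be sketched in Python as follows. The MSE follows its standard definition. For the NCC, the sketch assumes the uncentered form, that is, the inner product of the target and approximated responses divided by the product of their Euclidean norms; this specific form is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def mse(targets, approximations):
    """Mean-squared error between target and approximated responses."""
    n = len(targets)
    return sum((t - a) ** 2 for t, a in zip(targets, approximations)) / n


def ncc(targets, approximations):
    """Uncentered normalized correlation (assumed form), in [-1, 1]."""
    num = sum(t * a for t, a in zip(targets, approximations))
    den = math.sqrt(sum(t * t for t in targets)
                    * sum(a * a for a in approximations))
    return num / den


if __name__ == "__main__":
    target = [0.0, 1.0, 2.0, 3.0]   # hypothetical target responses
    approx = [0.1, 0.9, 2.2, 2.8]   # hypothetical approximator outputs
    print(mse(target, approx), 100.0 * ncc(target, approx))
```

Multiplying the NCC by 100 expresses the fidelity as a percentage, so that an output proportional to the target yields 100%.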

Results
Figure 2 illustrates the approximation curves provided by the SA, WA, and ALNC methods for the synthetic two-dimensional functions. Compared with the output curves of the SA, the curves produced by the WA and ALNC methods are closer to the target functions in the input range from 1.8 to 2.5 for the Zigzag function, from 7 to 13 for the Rhythm function, from 1.8 to 3.2 for the SinCos function, and from 1.5 to 2 for the ExpSin function, respectively. The three ensemble methods provide different 3-D surface approximation results for the 3-D Mexican Hat (see Figure 3), Gabor (see Figure 4), SwingCos (see Figure 5), and Exponential (see Figure 6) functions, respectively.
Concerning the 3-D Mexican Hat, the peak area approximated by the SA method, as depicted in Figure 3(b), is slightly skewed versus the target function in Figure 3(a), whereas such skewness does not appear in the results of the WA and ALNC methods. For the Gabor, the output surface shape of the SA fusion is severely distorted, as shown in Figure 4(b). The surfaces produced by the WA fusion, plotted in Figure 4(c), and the ALNC method, plotted in Figure 4(d), are much better than that of the SA method. In addition, the WA fusion surface is even smoother than the ALNC output surface. From Figures 5(b) and 5(c), we may observe that the output surface of the SA or WA fusion fails to match the SwingCos function, such that several local crests and troughs in the central region of the approximated surface are missing. On the contrary, the ALNC method is able to predict the locations of these local crests and troughs, although its numerical estimate is not very precise. According to Figure 6, the surface regions around the crest and trough predicted by the SA and WA fusion methods are fluctuating, whereas the ALNC provides an approximation surface relatively closer to the Exponential target function in Figure 6(d).
The quantitative results on the synthetic data sets listed in Table 3 indicate that the MSE and NCC values obtained with all three ensemble methods are remarkably better than those of the component RBFNs on average, especially for the Zigzag, SinCos, ExpSin, and Gabor data sets. Notably, when approximating the SinCos function, both the WA and the ALNC achieve almost zero error and 100% output fidelity. In addition, the ALNC consistently outperforms the SA fusion and is also superior to the WA method in most experiments (see the best performance results highlighted in Table 3).
Considering the benchmark regression data sets, Figure 7 and Table 4 show that the MSEs produced by the Bootstrap-ALNC in all six experiments are consistently lower, on average, than those of the prevailing Bagging or a single RBFN. In particular, relative to the Bagging, the Bootstrap-ALNC reduces the MSE by 8.59% (2.8209/32.8325), 6.7% (28.5683/426.5834), and 17.54% (3.8174/21.7593) on the Stock, Bolt, and CPS-85-Wages data sets, respectively. The NCC improvements of the Bootstrap-ALNC over the Bagging are also noticeable in Figure 8 and Table 4. Such results indicate that the proposed ALNC method, along with the bootstrap resampling procedure, is competent to solve practical regression problems.
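The relative reductions quoted above can be checked with a few lines of Python. The sketch only reuses numbers that appear in the text: each pair is the MSE reduction and the Bagging MSE, and the Bolt reduction is additionally recomputed from the Bagging and Bootstrap-ALNC MSE values reported for that data set (426.5834 and 398.0151). The variable and function names are hypothetical.

```python
# (MSE reduction, Bagging MSE) pairs as quoted in the text.
quoted = {
    "Stock": (2.8209, 32.8325),
    "Bolt": (28.5683, 426.5834),
    "CPS-85-Wages": (3.8174, 21.7593),
}


def relative_reduction_percent(reduction, baseline):
    """Reduction expressed as a percentage of the baseline (Bagging) MSE."""
    return 100.0 * reduction / baseline


percentages = {name: round(relative_reduction_percent(r, b), 2)
               for name, (r, b) in quoted.items()}

# The Bolt reduction is the difference between the two reported MSE values.
bolt_reduction = round(426.5834 - 398.0151, 4)

if __name__ == "__main__":
    print(percentages)
    print(bolt_reduction)
```

The computed percentages round to 8.59, 6.7, and 17.54, in agreement with the figures stated in the text.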
The CTC parameter indicates how much CPU time a fusion method consumes, which involves the total elapsed time of the training of the component RBFNs and the optimization of the fusion weights. Thus, we only list the CTC values of the fusion methods in Tables 3 and 4. By comparing the CTC results on different data sets, we may find that the ALNC method consumed the least CPU execution time for almost all the function approximation and regression data sets, whereas the WA method occupied the most CPU resources. The WA fusion method requires some additional CPU execution time because it has to estimate the error distributions of the component RBFNs. The ALNC method can directly compute the fusion weights from the instantaneous errors of the component RBFNs, which improves the efficiency. From Table 3, the CTC values of the SA method on the ExpSin and SwingCos data sets are smaller than those of the ALNC method. However, it is worth noting that the ALNC fusion produces a much better ExpSin approximation curve (see Figure 2) and SwingCos surface (see Figure 5) than the SA method.

Discussion
The performance evaluation results measured by the MSE and NCC parameters have demonstrated the effectiveness of the proposed ALNC method for function approximation and regression. It is worth noting that the MSE values can be influenced by the scales of the dependent variables; for example, the MSE results on the Bolt data set (Bagging: 426.5834; Bootstrap-ALNC: 398.0151) are much larger than those on the Abalone data set (Bagging: 4.6753; Bootstrap-ALNC: 4.4646). Nevertheless, the performance improvement trends (the ALNC versus the SA or WA, the Bootstrap-ALNC versus the Bagging) are evident. Such improvements in function approximation and regression are reflected qualitatively in the geometric patterns of the two-dimensional function curves in Figure 2 and the three-dimensional surface plots in Figures 3-6. On the other hand, although some criticisms of the MSE criterion exist in the literature [44], this metric still has several excellent properties, such as simplicity, a valid Euclidean distance measure, and a representation of the energy of the data errors, which make the MSE widely used as a favorable metric in optimization, statistics, and data analysis.
From Table 4, it can be observed that the Bagging ensemble only slightly improves the performance of the component RBFNs. The robustness of the RBFN is considered the primary cause of this phenomenon. As remarked in Breiman's work [22], "Bagging stable learners is not a good idea," because the robustness of stable learners leads to more computational complexity with little performance improvement in return.
The simple average (SA) and weighted average (WA) fusion strategies combine the component learners with fixed or predefined fusion weights. In doing so, they neglect that in some situations individual learners may be able to produce good approximations for some particular portions of the data set but be incompetent for the rest of the data. The ALNC method is an instance-varying technique that concentrates its ensemble capability more on the local data points. This method can adaptively adjust the fusion weights, which exploits the highest potential of the component learners toward precise approximations from one input instance to another. Thus, the function approximation or regression task over the entire data set can be split into several subtasks, in which the fusion strategy can seek the most competent local learners. Nevertheless, despite its effectiveness, the ALNC method has some limitations.
Because the ALNC method is still a supervised learning technique, it is suited only to function approximation and regression applications, rather than to prediction or forecasting problems, in which the desired references are not available to optimize the fusion parameters.

Conclusion
The adaptive linear and normalized combination (ALNC) is able to adaptively combine the component learners with the optimized fusion weights by solving the constrained quadratic programming problem. Depending on the performance of the component learners on a specific input instance, the ALNC can exclusively select the best component learner or discard the worse one in the ensemble so as to provide the best approximation result. The property of instance-varying fusion weights allows the ALNC method to focus more on local approximation, although it sometimes leads to a slightly wrinkled approximated surface output. In general, the experimental results of low error rate and high fidelity percentage on the synthetic and benchmark data sets demonstrated the effectiveness and prominent advantages of the proposed ALNC method. Furthermore, the ALNC method is also promising for practical applications in the field of pattern recognition, although it is known that the mean-squared error is not a very suitable performance measure for classification problems [45]. Future work could be directed toward an extended ensemble learning algorithm in accordance with other performance evaluation criteria, for the design of multiple classifier systems.

Figure 1: Illustration of adaptive linear and normalized combination (ALNC) of radial basis function networks for function approximation.

Figure 8: Bar graphics of the regression normalized correlation coefficients (in percentage) produced by the component radial basis function networks (RBFNs), the Bagging, and the Bootstrap-ALNC on the data set: (a) Abalone, (b) Housing, (c) Auto-MPG, (d) Stock, (e) Bolt, and (f) CPS-85-Wages.

Table 2: Description of the regression data sets.
CPS-85-Wages. The CPS-85-Wages data were obtained from the Current Population Survey (CPS) of 534 people. Such a survey provides information on wages and other aspects of the workers, including years of education, region of residence, gender, years of work experience, union membership, age, race, occupational status, sector, and marital status.

Table 3: Results of function approximation experiments in terms of mean-squared error (MSE), normalized correlation coefficient (NCC), and computation time consumption (CTC in ms).

Table 4: Regression results in terms of mean-squared error (MSE), normalized correlation coefficient (NCC), and computation time consumption (CTC in ms).