Artificial neural networks have been used extensively as trainable models for solving pattern recognition tasks. However, training a complex neural network on a very large training data set requires excessively long training times. In this correspondence, a new fast Linear Adaptive Skipping Training (LAST) algorithm for training artificial neural networks (ANN) is proposed. The core idea of this paper is to improve the training speed of an ANN by presenting only those input samples that were not classified correctly in the previous epoch, thereby dynamically reducing the number of input samples presented to the network at each epoch without affecting the network's accuracy. Shrinking the effective training set in this way reduces the training time and hence improves the training speed. The LAST algorithm also determines how many epochs a particular input sample has to skip, depending on the successful classification of that input sample. LAST can be incorporated into any supervised training algorithm. Experimental results show that the training speed attained by the LAST algorithm is considerably higher than that of other conventional training algorithms.
An artificial neural network (ANN) is a nonlinear information processing model that has been successfully applied to supervised pattern recognition tasks [
Speeding up ANN training remains an active focus of neural network research. Many studies have explored different improvements, such as estimating optimal initial weights [
All of the previously mentioned efforts focus on speeding up training by reducing the total number of epochs or by converging more quickly. However, every such technique still presents all the input samples in the training data set to the network at every single epoch. When a large amount of high-dimensional training data must be classified, this slows training down considerably. In fact, correctly classified input samples do not contribute to the weight update, since the error is driven by the misclassification rate. The intention of this research is therefore to provide a simple new algorithm for training ANNs quickly. The core idea of LAST is that when an input pattern is classified correctly by the network, that particular pattern is not presented again for a number of subsequent epochs.
The remainder of this paper is organized as follows. Section
The LAST algorithm, incorporated into a prototypical multilayer feedforward neural network architecture, is sketched in Figure
LAST incorporated in neural network architecture.
The network is furnished with
The network parameter symbols employed in this algorithm are defined here. Let
Since the network is fully connected, each node in a layer is linked to every node in the next layer. Let
In the BPN algorithm, each output unit compares its computed activation
The core idea of LAST is that when an input pattern is classified correctly by the network, that particular pattern is not presented again for a number of subsequent epochs, as determined by the algorithm.
The working principle of the LAST algorithm, as incorporated into the BPN algorithm, is documented as follows.
Initialize the connection weights (and biases) to uniformly distributed values within the specified range, and set the learning rate,
While the stopping criterion is not attained, perform Steps
For each training input sample, iterate through Steps
Activate the network by presenting the training input vector to the nodes of the input layer.
Propagate the training input vector forward from the input layer through the subsequent layers.
Hidden layer activation: each hidden node $z_j$ sums its weighted input signals, $\mathrm{net}_j = v_{0j} + \sum_i x_i\,v_{ij}$ (where $x_i$ is the $i$th input, $v_{ij}$ the input-to-hidden weight, and $v_{0j}$ the bias), and applies the nonlinear logistic sigmoid activation function to estimate its actual output, $z_j = f(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}}$.
Differentiating the aforementioned activation function yields $f'(\mathrm{net}_j) = f(\mathrm{net}_j)\,[1 - f(\mathrm{net}_j)]$.
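This compact form follows directly from differentiating the logistic function:

$$f'(x) = \frac{d}{dx}\!\left(\frac{1}{1+e^{-x}}\right) = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}} = \frac{1}{1+e^{-x}}\cdot\left(1-\frac{1}{1+e^{-x}}\right) = f(x)\,\bigl(1-f(x)\bigr),$$

which is why the derivative can be computed from the node's output alone, without evaluating any additional exponentials.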
Output layer activation: each output node $y_k$ sums its weighted input signals, $\mathrm{net}_k = w_{0k} + \sum_j z_j\,w_{jk}$ (with hidden-to-output weights $w_{jk}$ and bias $w_{0k}$), and applies the nonlinear logistic sigmoid activation function to estimate its actual output, $y_k = f(\mathrm{net}_k) = \frac{1}{1 + e^{-\mathrm{net}_k}}$.
Differentiating the aforementioned activation function yields $f'(\mathrm{net}_k) = f(\mathrm{net}_k)\,[1 - f(\mathrm{net}_k)]$.
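As a rough illustration of the two forward-propagation steps above, here is a minimal Python sketch (the names sigmoid, forward_pass, x, V, b_v, W, and b_w are ours, standing in for the paper's input, weight, and bias symbols):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid activation, f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, V, b_v, W, b_w):
    """Forward pass through a 3-layer feedforward network.

    x   : input vector, shape (n_in,)
    V   : input-to-hidden weights v_ij, shape (n_in, n_hidden)
    b_v : hidden biases v_0j, shape (n_hidden,)
    W   : hidden-to-output weights w_jk, shape (n_hidden, n_out)
    b_w : output biases w_0k, shape (n_out,)
    """
    z = sigmoid(b_v + x @ V)  # hidden net input net_j, then activation z_j
    y = sigmoid(b_w + z @ W)  # output net input net_k, then actual output y_k
    return z, y
```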
For each output unit $y_k$, compute the error term $\delta_k = (t_k - y_k)\,f'(\mathrm{net}_k)$, where $t_k$ is the corresponding target output.
For each hidden unit $z_j$, compute the error term $\delta_j = f'(\mathrm{net}_j)\sum_k \delta_k\,w_{jk}$.
For each output unit, the weights and biases are updated as follows.
The weight correction is given by the following update rule: $w_{jk}(\text{new}) = w_{jk}(\text{old}) + \alpha\,\delta_k\,z_j$, where $\alpha$ is the learning rate.
The bias correction is given by the following update rule: $w_{0k}(\text{new}) = w_{0k}(\text{old}) + \alpha\,\delta_k$.
For each hidden unit, the weights and biases are updated as follows.
The weight correction is given by the following update rule: $v_{ij}(\text{new}) = v_{ij}(\text{old}) + \alpha\,\delta_j\,x_i$.
The bias correction is given by the following update rule: $v_{0j}(\text{new}) = v_{0j}(\text{old}) + \alpha\,\delta_j$.
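Continuing the sketch, the error terms and update rules of the preceding steps look as follows in Python (backprop_update builds on forward_pass from the earlier sketch; the momentum term used later in the experiments is omitted here for brevity):

```python
def backprop_update(x, t, V, b_v, W, b_w, alpha=0.3):
    """One BPN update for a single training pair (x, t); arrays are updated in place."""
    z, y = forward_pass(x, V, b_v, W, b_w)
    delta_out = (t - y) * y * (1.0 - y)          # delta_k = (t_k - y_k) f'(net_k)
    delta_hid = z * (1.0 - z) * (W @ delta_out)  # delta_j = f'(net_j) sum_k delta_k w_jk
    W += alpha * np.outer(z, delta_out)          # w_jk += alpha * delta_k * z_j
    b_w += alpha * delta_out                     # w_0k += alpha * delta_k
    V += alpha * np.outer(x, delta_hid)          # v_ij += alpha * delta_j * x_i
    b_v += alpha * delta_hid                     # v_0j += alpha * delta_j
    return np.abs(t - y)                         # absolute error, used by LAST below
```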
Measure the difference between the target output and the actual output of each input sample.
Compare the absolute error value,
Compute the probability value for presenting the input sample in the next epoch:
Calculate how many epochs the input sample has to skip and initialize the value of the skip count. If
Construct the new probability-based training dataset to be presented in the next epoch.
Test for the stopping condition, such as reaching an acceptable mean squared error (MSE), a maximum number of elapsed epochs, or the desired accuracy.
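The skipping rule itself (the steps above on the probability value and the skip count) did not survive extraction, so the following Python sketch substitutes an assumed linear schedule consistent with the algorithm's name: a sample classified correctly k consecutive times skips the next k epochs, and a misclassification resets it to being presented every epoch. The err_threshold parameter and the schedule are our assumptions, not the paper's exact rule.

```python
def train_last(X, T, V, b_v, W, b_w, n_epochs=675, alpha=0.3, err_threshold=0.5):
    """LAST-style training loop (sketch; uses backprop_update from above)."""
    n = len(X)
    skip_until = np.zeros(n, dtype=int)  # first epoch at which each sample reappears
    streak = np.zeros(n, dtype=int)      # consecutive correct classifications
    for epoch in range(n_epochs):
        for i in range(n):
            if epoch < skip_until[i]:
                continue                 # sample is skipped this epoch
            err = backprop_update(X[i], T[i], V, b_v, W, b_w, alpha)
            if np.all(err < err_threshold):            # classified correctly
                streak[i] += 1
                skip_until[i] = epoch + 1 + streak[i]  # skip next `streak` epochs
            else:
                streak[i] = 0            # misclassified: present every epoch again
    return V, b_v, W, b_w
```

Because correctly classified samples sit out a growing number of epochs, the number of presented samples per epoch shrinks whenever the network is doing well, which is the effect visible in the experiment figures below.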
The proposed LAST algorithm has been evaluated on both two-class and multiclass classification problems. The real-world benchmark data sets used for training and testing the algorithms are the Iris, Waveform, Heart, and Breast Cancer data sets, obtained from the UCI (University of California at Irvine) Machine Learning Repository [
Summary of the data sets used in this research.
Datasets | No. of attributes | No. of classes | No. of instances |
---|---|---|---|
Iris | 4 | 3 | 150 |
Waveform | 21 | 3 | 5000 |
Heart | 13 | 2 | 270 |
Breast cancer | 31 | 2 | 569 |
The initial weights are set to uniform random values in the range −0.5 to
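A one-function sketch of this initialization (the upper bound of the range was truncated above; the symmetric value +0.5 used here is our assumption):

```python
import numpy as np

rng = np.random.default_rng()

def init_network(n_in, n_hidden, n_out, w_range=0.5):
    """Draw all weights and biases uniformly from [-w_range, +w_range].
    w_range=0.5 is assumed; only the lower bound -0.5 is stated in the text."""
    V = rng.uniform(-w_range, w_range, size=(n_in, n_hidden))
    b_v = rng.uniform(-w_range, w_range, size=n_hidden)
    W = rng.uniform(-w_range, w_range, size=(n_hidden, n_out))
    b_w = rng.uniform(-w_range, w_range, size=n_out)
    return V, b_v, W, b_w
```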
The Iris database consists of measurements of 150 flower samples. For each flower, four features are measured: sepal length and width, and petal length and width. These four features are used to classify each sample into the appropriate Iris species: Iris Setosa, Iris Versicolour, and Iris Virginica. The 150 samples are evenly distributed among the three classes. Iris Setosa is linearly separable from the other two species, whereas Iris Virginica and Iris Versicolour are not linearly separable from each other. Of the 150 flower samples, 90 are used for training and 60 for testing.
The neural network is structured with 4, 5, and 1 neurons in the input, hidden, and output layers, respectively, for training on the Iris database, with a step size of 0.3 and a momentum constant of 0.8. This network is trained for 675 epochs using both the BPN and LAST algorithms.
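With the sketches above, this configuration would be run roughly as follows (X_train and T_train are hypothetical arrays holding the 90 Iris training samples and their targets):

```python
# 4-5-1 network, step size 0.3, 675 epochs, as stated in this section.
V, b_v, W, b_w = init_network(n_in=4, n_hidden=5, n_out=1)
train_last(X_train, T_train, V, b_v, W, b_w, n_epochs=675, alpha=0.3)
```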
The number of trained input samples and the training time taken by the BPN and LAST algorithms at each epoch are plotted in Figures
IRIS dataset: epoch-wise input samples taken by BPN and LAST algorithms.
IRIS dataset: epoch-wise training time taken by BPN and LAST algorithms.
The empirical results of the BPN and LAST algorithms are presented in Table
Result comparison of BPN and LAST algorithms for the IRIS dataset.
Neural network algorithm | Network topology | Number of epochs | Total number of input samples | Training time (in sec.) | Accuracy (%) |
---|---|---|---|---|---|
BPN | 4-5-1 | 675 | 60750 | 0.036189 | 91.67 |
LAST | 4-5-1 | 675 | 24148 | 0.016641 | 91.67 |
The Waveform database generator data set holds measurements of 5000 wave samples, distributed roughly equally (about 33% each) among three wave families [
A 3-layer feedforward neural network, containing 21, 10, and 1 units in the input, hidden, and output layers, respectively, is trained with the learning rate parameters
The number of trained input samples and the training time taken by the BPN and LAST algorithms at each epoch are plotted in Figures
Waveform dataset: epoch-wise input samples taken by BPN and LAST algorithms.
Waveform dataset: epoch-wise training time taken by BPN and LAST algorithms.
The empirical results of the BPN and LAST algorithms are presented in Table
Result comparison of BPN and LAST algorithms for the waveform dataset.
Neural network algorithm | Network topology | Number of epochs | Total number of input samples | Training time (in sec.) | Accuracy (%) |
---|---|---|---|---|---|
BPN | 21-10-1 | 815 | 3912000 | 0.005806 | 97.00 |
LAST | 21-10-1 | 815 | 2035031 | 0.000328 | 97.50 |
The Statlog Heart disease database consists of 270 patient samples. Each patient is described by 13 attributes, which are used to detect the presence or absence of heart disease.
The neural network is structured with 13, 5, and 1 neurons in the input, hidden, and output layers, respectively, for training on the Heart disease database, with a step size of 0.3 and a momentum constant of 0.9. This network is trained for 964 epochs using both the BPN and LAST algorithms.
The number of trained input samples and the training time taken by the BPN and LAST algorithms at each epoch are plotted in Figures
Heart dataset: epoch-wise input samples taken by LAST and BPN algorithms.
Heart dataset: epoch-wise training time taken by BPN and LAST algorithms.
The empirical results of the BPN and LAST algorithms are presented in Table
Result comparison of BPN and LAST algorithms for the Heart dataset.
Neural network algorithm | Network topology | Number of epochs | Total number of input samples | Training time (in sec.) | Accuracy (%) |
---|---|---|---|---|---|
BPN | 13-5-1 | 964 | 212080 | 0.024903 | 90.00 |
LAST | 13-5-1 | 964 | 98976 | 0.004097 | 92.00 |
The Wisconsin Breast Cancer Diagnosis data set contains 569 breast-tissue samples, of which 357 are diagnosed as benign and 212 as malignant. Each patient's characteristics are recorded using 32 numerical features.
The neural network is structured with 31, 15, and 1 neurons in the input, hidden, and output layers, respectively, for training on the Breast Cancer database, with a step size of 0.3 and a momentum constant of 0.9. This network is trained for 619 epochs using both the BPN and LAST algorithms.
The number of trained input samples and the training time taken by the BPN and LAST algorithms at each epoch are plotted in Figures
Breast Cancer dataset: epoch-wise input samples taken by LAST and BPN algorithms.
Breast Cancer dataset: epoch-wise training time taken by LAST and BPN algorithms.
The empirical results of the BPN and LAST algorithms are presented in Table
Result comparison of BPN and LAST algorithms for the Breast Cancer dataset.
Neural network algorithm | Network topology | Number of epochs | Total number of input samples | Training time (in sec.) | Accuracy (%) |
---|---|---|---|---|---|
BPN | 31-15-1 | 619 | 247600 | 0.023736 | 95.27 |
LAST | 31-15-1 | 619 | 39388 | 0.013930 | 95.27 |
From Tables
Total training time taken by the BPN and LAST algorithms.
A simple new Linear Adaptive Skipping Training algorithm for training multilayer feedforward neural networks (MFNN) has been systematically investigated in order to speed up the training phase. The empirical results demonstrate that the LAST algorithm dynamically reduces the total number of training input samples presented to the MFNN at every cycle. Decreasing the size of the training set in this way reduces the training time, thereby increasing the training speed. The proposed LAST algorithm proved faster than the standard BPN algorithm in training MFNN, and it can be incorporated into any supervised training algorithm for real-world problems.
Both algorithms were implemented on a machine with the following configuration: an Intel Core i5-3210M processor with a CPU speed of 2.50 GHz and 4 GB of RAM. The MATLAB version used for the implementation is R2010b.