Fast Linear Adaptive Skipping Training Algorithm for Training Artificial Neural Network

Artificial neural networks have been extensively used as training models for solving pattern recognition tasks. However, training a complex neural network on a very large training data set requires an excessively long training time. In this correspondence, a new fast Linear Adaptive Skipping Training (LAST) algorithm for training artificial neural networks (ANN) is introduced. The core idea of this paper is to improve the training speed of an ANN by presenting only the input samples that were not classified correctly in the previous epoch, which dynamically reduces the number of input samples presented to the network at every epoch without affecting the network's accuracy. Decreasing the size of the training set in this way reduces the training time and thereby improves the training speed. The LAST algorithm also determines how many epochs a particular input sample has to skip, depending on the successful classification of that input sample. The LAST algorithm can be incorporated into any supervised training algorithm. Experimental results show that the training speed attained by the LAST algorithm is higher than that of other conventional training algorithms.


Introduction
An artificial neural network (ANN) is a nonlinear knowledge processing model that has been successfully used as a training model for solving supervised pattern recognition tasks [1,2] owing to its ability to generalize to real-world problems. The two phases of ANN operation are the training (or learning) phase and the testing phase. Of these two phases, the training phase is the more time-consuming. Training speed is one of the most significant factors to examine when training a neural network, since the training phase generally consumes an excessively long time. The training speed of the neural network is determined by the rate at which the ANN trains. The factors that influence the training rate of an ANN are the size of the neural network, the size of the training dataset, the initial weight values, the problem category, and the training algorithm. Among these factors, the size of the training dataset is examined here, and the way in which it affects the training rate is discussed. For the neural network to generalize well, both underfitting and overfitting of the training dataset should be avoided. An ample amount of training data is required to avoid overfitting. However, a large amount of training data normally requires a very long training time [3], which affects the training rate.
Speeding up ANN training is still a focus of research attention in the neural network field. Many research works have explored different amendments, such as estimating optimal initial weights [4,5], adapting the learning rate and momentum [6], and using second order algorithms [7][8][9], with the aim of improving the training speed while maintaining generalization. First, proper initialization of the neural network weights reduces the number of iterations in the training process, thereby increasing the training speed. Many weight initialization methods have been proposed. Nguyen and Widrow allocate the nodes' initial weights within a specified range, which results in a reduction of the number of epochs [4]. Varnava and Meade Jr. formulated a new initialization method by approximating the network's parameters using polynomial basis functions [5]. Second, the learning rate is used to control the step size for adjusting the network weights. A constant learning rate secures convergence but considerably slows down the training process. Hence, several methods based on heuristic factors have been proposed for changing the learning rate dynamically. Behera et al. applied a convergence theorem based on Lyapunov stability theory for attaining an adaptive learning rate [6]. Last, second order training algorithms employ the second order partial derivative information of the error function to perform network pruning. Such algorithms are very apt for training neural networks that converge quickly. Second order methods such as conjugate gradient (CG) methods, quasi-Newton (secant) methods, and the Levenberg-Marquardt (LM) method are popular choices for training neural networks. Nevertheless, these methods are computationally expensive and require large memory, particularly for large networks. Ampazis and Perantonis presented Levenberg-Marquardt with adaptive momentum (LMAM) and optimized Levenberg-Marquardt with adaptive momentum (OLMAM), second order algorithms that integrate the advantages of the LM and CG methods [7]. Wilamowski and Yu incorporated some modifications into the LM method by avoiding Jacobian matrix storage and by replacing Jacobian matrix multiplication with vector multiplication [8,9], which reduces the memory cost of training on very large training datasets.
All of the previously mentioned efforts focus on speeding up the training process by reducing the total number of epochs or by converging quickly. But every one of these techniques presents all the input samples in the training dataset to the network for classification at every epoch. If a large amount of high-dimensional training data is presented for classification, these techniques slow down classification. In fact, the correctly classified input samples are not involved in the weight update, since the error is calculated from the misclassified samples. So, the intention of this research is to present a simple new algorithm for training the ANN quickly. The core idea of LAST is that when an input pattern is classified correctly by the network, that particular pattern is not presented again for a number of subsequent epochs (an epoch is one complete cycle of presenting the entire set of training samples to the network once). Only the patterns that are not classified correctly are presented again in the next epoch, which reduces the total number of input samples the network trains on. Reducing the amount of training data presented in this way reduces the training time and thereby improves the training rate.
The remainder of this paper is organized as follows. Section 2 describes how the new LAST algorithm is incorporated in the standard BPN algorithm. Section 3 presents the experimental results. Finally, in Section 4, the overall conclusions of this research are drawn.

The Modified BPN with LAST Algorithm
LAST Neural Network Architecture.
The LAST algorithm incorporated in the prototypical multilayer feedforward neural network architecture is sketched in Figure 1.
Such a network is furnished with $n$ input nodes, $p$ hidden nodes, and $m$ output nodes, which are arranged in layers. Let $P$ symbolize the number of training patterns. The input presented to the network is given in the form of a matrix $X$ with $P$ rows and $n$ columns. The number of the network's input nodes is equal to $n$, the number of columns of the input matrix $X$. Each row of the matrix $X$ is considered to be a real-valued vector $x \in \mathbb{R}^{n+1}$ symbolized by $\{x_0, x_1, x_2, \ldots, x_n\}$, where $x_0$ is a bias signal. The summed real-valued vector $z \in \mathbb{R}^{p+1}$ generated by the hidden layer is symbolized by $\{z_0, z_1, z_2, \ldots, z_p\}$, with $z_0$ being the bias signal. The estimated real-valued output vector $y \in \mathbb{R}^{m}$ of the output layer is symbolized by $\{y_1, y_2, \ldots, y_m\}$, and the corresponding target vector $t \in \mathbb{R}^{m}$ is symbolized by $\{t_1, t_2, \ldots, t_m\}$. Let (it) signify the (it)th iteration number.
The network parameters employed in this algorithm are addressed here. The hidden layer uses a nonlinear logistic activation function, and the output layer uses a linear activation function.
Since the network is fully interconnected, each node in a layer is connected to all the nodes in the next layer. Let $v_{ij}$ be the entry of the $n \times p$ input-to-hidden weight matrix for the link from input node $i$ to hidden node $j$, and $v_{0j}$ the bias weight to hidden node $j$. Let $w_{jk}$ be the entry of the $p \times m$ hidden-to-output weight matrix for the link from hidden node $j$ to output node $k$, and $w_{0k}$ the bias weight to output node $k$.
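The architecture described above can be illustrated with a minimal forward-pass sketch in plain Python. The function and argument names (`forward`, `v`, `v0`, `w`, `w0`, `output_activation`) are hypothetical and chosen only for illustration. The output activation is left as a parameter because the text names a linear output activation here and a logistic sigmoid in Step 7; the default below follows the linear choice.

```python
import math


def logistic(a):
    """Logistic sigmoid, used here for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-a))


def identity(a):
    """Linear activation, as the text names for the output layer."""
    return a


def forward(x, v, v0, w, w0, output_activation=identity):
    """Propagate one pattern through a fully connected three-layer network.

    x  : list of input values (one per input node)
    v  : input-to-hidden weights, v[i][j] links input i to hidden node j
    v0 : hidden bias weights (the bias signal itself is 1)
    w  : hidden-to-output weights, w[j][k] links hidden j to output node k
    w0 : output bias weights
    """
    n, p, m = len(x), len(v0), len(w0)
    # Hidden layer: weighted sum plus bias, then logistic sigmoid.
    z = [logistic(v0[j] + sum(x[i] * v[i][j] for i in range(n)))
         for j in range(p)]
    # Output layer: weighted sum plus bias, then the chosen activation.
    y = [output_activation(w0[k] + sum(z[j] * w[j][k] for j in range(p)))
         for k in range(m)]
    return z, y


# Tiny 2-2-1 example with hand-picked (hypothetical) weights.
z, y = forward([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0],
               [[1.0], [1.0]], [0.5])
print(z, y)  # [0.5, 0.5] [1.5]
```

Passing `output_activation=logistic` gives the variant with a sigmoid output node.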

LAST Essence.
In the BPN algorithm, each output unit compares its computed activation $y_k$ with its target value $t_k$ to assess the error associated with that pattern at that unit. The factor $\delta_k$ is computed for the patterns that do not exhibit the correct value of $y_k$. The weights and biases are updated according to the $\delta$ factor.
The core idea of LAST is that when an input pattern is classified correctly by the network, that particular pattern is not presented again for a number of subsequent epochs (an epoch is one complete cycle of presenting the entire set of training samples to the network once). Only the patterns that are not classified correctly are presented again in the next epoch, which greatly decreases the training time.

Working Principle of LAST.
The working principle of the LAST algorithm that is incorporated in the BPN algorithm is documented as follows.

Weight Initialization
Step 1. Initialize the connection weights (and biases) to values distributed within the specified range, and set the learning rate, $\eta$.
Step 2. While the terminating criterion is not attained, perform Steps 3 through 17.
Step 3. Iterate through Steps 4 to 15 for each input training vector to be classified whose probability value, prob, is 1.

Furnish the Training Pattern
Step 4. Activate the network by presenting the training input matrix to the nodes of the network's input layer.

Feedforward Propagation
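Step 5. Propagate the input training vector from the input layer towards the subsequent layers.
Step 6. The hidden layer activation net value is computed as follows.
(a) The net input of each hidden node ($z_j$, $j = 1, 2, \ldots, p$) is aggregated by multiplying the input values by the corresponding synaptic weights:
$$z_{\mathrm{in},j} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}.$$
(b) Apply the nonlinear logistic sigmoid activation function to estimate the actual output of each hidden node $j$, $1 \le j \le p$. Consider
$$z_j = f(z_{\mathrm{in},j}) = \frac{1}{1 + e^{-z_{\mathrm{in},j}}}.$$
The derivative of this activation function is
$$f'(z_{\mathrm{in},j}) = f(z_{\mathrm{in},j})\,\bigl(1 - f(z_{\mathrm{in},j})\bigr) = z_j (1 - z_j).$$
Step 7. The output layer activation net value is computed next.
(a) The net input of each output node ($y_k$, $k = 1, 2, \ldots, m$) is aggregated by multiplying the input values by the corresponding synaptic weights:
$$y_{\mathrm{in},k} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}.$$
(b) Apply the nonlinear logistic sigmoid activation function to estimate the actual output of each output node $k$, $1 \le k \le m$. Consider
$$y_k = f(y_{\mathrm{in},k}) = \frac{1}{1 + e^{-y_{\mathrm{in},k}}}.$$
The derivative of this activation function is
$$f'(y_{\mathrm{in},k}) = f(y_{\mathrm{in},k})\,\bigl(1 - f(y_{\mathrm{in},k})\bigr) = y_k (1 - y_k).$$

Accumulate the Gradient Components (Back Propagation)
Step 8. For each output unit $k$, $1 \le k \le m$, the error gradient for the output layer is formulated as
$$\delta_k = (t_k - y_k)\, f'(y_{\mathrm{in},k}).$$
Step 9. For each hidden unit $j$, $1 \le j \le p$, the error gradient for the hidden layer is formulated as
$$\delta_j = f'(z_{\mathrm{in},j}) \sum_{k=1}^{m} \delta_k w_{jk}.$$

Weight Amendment Using Delta-Learning Rule
Step 10. For each output unit $k$, consider the following.
The weight amendment is yielded by the following updating rule:
$$w_{jk}(it + 1) = w_{jk}(it) + \Delta w_{jk}(it),$$
where $\Delta w_{jk}(it) = \eta\, \delta_k z_j$ and $\eta$ is the learning rate.
The bias amendment is yielded by the following updating rule:
$$w_{0k}(it + 1) = w_{0k}(it) + \Delta w_{0k}(it),$$
where $\Delta w_{0k}(it) = \eta\, \delta_k$.
Step 11. For each hidden unit $j$, consider the following.
The weight amendment is yielded by the following updating rule:
$$v_{ij}(it + 1) = v_{ij}(it) + \Delta v_{ij}(it),$$
where $\Delta v_{ij}(it) = \eta\, \delta_j x_i$.
The bias amendment is yielded by the following updating rule:
$$v_{0j}(it + 1) = v_{0j}(it) + \Delta v_{0j}(it),$$
where $\Delta v_{0j}(it) = \eta\, \delta_j$.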
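Steps 8 through 11 can be sketched as a single-pattern update in plain Python. This is only an illustrative sketch under stated assumptions: it assumes logistic sigmoid activations in both layers (as Steps 6 and 7 state), it leaves out the momentum term mentioned in the experiments, and the names `backprop_update`, `eta`, `delta_k`, and `delta_j` are hypothetical.

```python
def backprop_update(x, t, z, y, v, v0, w, w0, eta):
    """Apply one delta-rule update in place for a single training pattern.

    Assumes logistic activations in both layers, so f'(a) = f(a) * (1 - f(a)).
    z and y are the hidden and output activations from the forward pass.
    """
    n, p, m = len(x), len(z), len(y)

    # Step 8: output-layer error gradients.
    delta_k = [(t[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(m)]

    # Step 9: hidden-layer error gradients, computed with the weights
    # as they were before this update.
    delta_j = [z[j] * (1.0 - z[j]) * sum(delta_k[k] * w[j][k] for k in range(m))
               for j in range(p)]

    # Step 10: amend hidden-to-output weights and output biases.
    for k in range(m):
        for j in range(p):
            w[j][k] += eta * delta_k[k] * z[j]
        w0[k] += eta * delta_k[k]

    # Step 11: amend input-to-hidden weights and hidden biases.
    for j in range(p):
        for i in range(n):
            v[i][j] += eta * delta_j[j] * x[i]
        v0[j] += eta * delta_j[j]

    return delta_k, delta_j


# Tiny 1-1-1 example with hand-picked (hypothetical) values.
v, v0, w, w0 = [[0.0]], [0.0], [[0.0]], [0.0]
delta_k, delta_j = backprop_update([1.0], [1.0], [0.5], [0.5],
                                   v, v0, w, w0, eta=1.0)
print(delta_k, w, w0)  # [0.125] [[0.0625]] [0.125]
```

The hidden-layer gradients are computed before any weight is changed, so both layers are updated from the same forward pass.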

LAST Algorithm Steps
Step 12. Measure the dissimilarity between the target value and the actual output value for each input sample.
Step 14. Compute the probability value for presenting the input sample in the next iteration.
Step 15. Calculate the number of epochs to be skipped.
(i) Initialize the number of epochs to be skipped to zero for a particular sample.
(ii) If the sample is classified correctly, then increment the number of epochs to be skipped by the linear skipping factor.
Step 16. Construct the new probability-based training dataset to be presented in the next epoch.
Step 17. Check for the halting condition, such as an acceptable mean square error (MSE), the number of elapsed epochs, or the desired accuracy.
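The skipping bookkeeping in Steps 14 through 16 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `last_schedule`, `train_and_check`, `skip_factor`, `skip_count`, and `skip_remaining` are hypothetical names, the callback stands in for Steps 4 through 12, the halting condition is reduced to a fixed epoch budget, and resetting the skip count after a misclassification is an assumption that the text does not spell out.

```python
def last_schedule(num_samples, train_and_check, skip_factor, max_epochs):
    """Run the LAST presentation schedule and report samples shown per epoch.

    train_and_check(i) stands in for training on sample i and returning
    True when that sample is classified correctly.
    """
    skip_count = [0] * num_samples      # epochs to skip, grows linearly
    skip_remaining = [0] * num_samples  # epochs left before re-presentation
    presented_per_epoch = []

    for _ in range(max_epochs):
        # Samples with no skip pending are presented (prob = 1).
        active = [i for i in range(num_samples) if skip_remaining[i] == 0]
        # Every skipped sample counts down by one epoch.
        for i in range(num_samples):
            if skip_remaining[i] > 0:
                skip_remaining[i] -= 1
        presented_per_epoch.append(len(active))

        for i in active:
            if train_and_check(i):
                # Correct: lengthen the skip by the linear skipping factor.
                skip_count[i] += skip_factor
                skip_remaining[i] = skip_count[i]
            else:
                # Assumption: a misclassified sample starts over from zero.
                skip_count[i] = 0

    return presented_per_epoch


# Toy run: even-numbered samples are always correct, odd ones never are.
print(last_schedule(4, lambda i: i % 2 == 0, skip_factor=1, max_epochs=5))
# [4, 2, 4, 2, 2]
```

With `skip_factor=0` no sample is ever skipped, which corresponds to presenting the full training set at every epoch as in standard BPN.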

Experimental Results
The proposed LAST algorithm has been analyzed on two-class and multiclass classification problems. The real-world benchmark data sets used for training and testing the algorithms are the Iris, Waveform, Heart, and Breast Cancer data sets, which were obtained from the UCI (University of California at Irvine) Machine Learning Repository [10]. The details of the data sets used in this research are provided in Table 1.
The initial weights are set to uniform random values in the range −0.5 to +0.5 using the Nguyen-Widrow (NW) initialization method for faster learning. All the bias nodes are set to unity. The number of trained input samples and the training time taken by the BPN and LAST algorithms at every epoch are shown in Figures 2 and 3, respectively.
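As a rough illustration of the starting point described above, the sketch below draws a weight matrix uniformly from the range −0.5 to +0.5. It covers only the uniform draw and does not reproduce any further scaling that the Nguyen-Widrow method applies; the helper name `init_weights` and its `seed` parameter are hypothetical.

```python
import random


def init_weights(rows, cols, low=-0.5, high=0.5, seed=None):
    """Return a rows x cols matrix of uniform random weights in [low, high]."""
    rng = random.Random(seed)  # private generator so runs can be repeated
    return [[rng.uniform(low, high) for _ in range(cols)]
            for _ in range(rows)]


# Example: input-to-hidden weights for a network with 4 inputs and 5 hidden nodes.
v = init_weights(4, 5, seed=0)
print(len(v), len(v[0]))  # 4 5
```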

Multiclass Problems
The empirical outcomes of the BPN and LAST algorithms are given in Table 2. From this table, the LAST algorithm yields improved training speed in terms of both the total number of trained input samples and the total training time. The LAST algorithm also achieves identical recognition accuracy.

Waveform Data Set.
The Waveform database generator data set holds the measurements of 5000 wave samples. The 5000 samples are distributed almost equally, about 33% each, among three wave families [10]. Each wave is described by 21 numerical features. Of these 5000 samples, 4800 and 200 are randomly selected to form the training set and the testing set, respectively. A 3-layer feedforward neural network containing 21, 10, and 1 units in the input, hidden, and output layers, respectively, is trained with a learning rate parameter $\eta$ = 1e−2 and a momentum constant of 0.7. This structure is trained for 675 epochs with the BPN and LAST algorithms.
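The random selection of 4800 training and 200 testing samples can be sketched as a shuffle of the sample indices. The paper does not describe its splitting procedure, so this is only one plausible way to do it; `split_indices` and its `seed` parameter are hypothetical names.

```python
import random


def split_indices(total, train_size, seed=None):
    """Shuffle sample indices and split them into train and test lists."""
    rng = random.Random(seed)
    indices = list(range(total))
    rng.shuffle(indices)
    return indices[:train_size], indices[train_size:]


train_idx, test_idx = split_indices(5000, 4800, seed=1)
print(len(train_idx), len(test_idx))  # 4800 200
```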
The number of trained input samples and the training time taken by the BPN and LAST algorithms at every epoch are shown in Figures 4 and 5, respectively.
The empirical results of the BPN and LAST algorithms are given in Table 3. From this table, the LAST algorithm yields improved training speed in terms of both the total number of trained input samples and the total training time. The LAST algorithm also achieves high recognition accuracy.
The recorded features are involved in the detection of the presence or absence of heart disease in the patients. The neural network is structured with 13, 5, and 1 neurons in the input, hidden, and output layers, respectively, for training the Breast Cancer database with a step size of 0.3 and a momentum constant of 0.9. This structure is trained for 619 epochs with the BPN and LAST algorithms.

Two-Class Problems
The number of trained input samples and the training time taken by the BPN and LAST algorithms at every epoch are shown in Figures 6 and 7, respectively.
The empirical results of the BPN and LAST algorithms are given in Table 4. From this table, the LAST algorithm yields improved training speed in terms of both the total number of trained input samples and the total training time. The LAST algorithm also achieves high recognition accuracy.
The total training times for the previously mentioned datasets are consolidated in Figure 10. From this figure, the total training time of the LAST algorithm is reduced, on average, to nearly 50% of that of the BPN algorithm. Both algorithms were implemented on a machine with the following configuration: Intel Core i5-3210M processor with a CPU speed of 2.50 GHz and 4 GB of RAM. The MATLAB version used for the implementation is R2010b.

Figure 1: LAST incorporated in neural network architecture.

Iris Data Set.
The Iris flowering plant database consists of measurements of 150 flower samples. For each flower, four features are measured: sepal length, sepal width, petal length, and petal width. These four features are used to classify each flower into the appropriate Iris species: Iris Setosa, Iris Versicolour, or Iris Virginica. The 150 flower samples are equally distributed among the three Iris classes. Iris Setosa is linearly separable from the other two species, but Iris Virginica and Iris Versicolour are not linearly separable from each other. Out of these 150 flower samples, 90 are used for training and 60 for testing. The neural network is structured with 4, 5, and 1 neurons in the input, hidden, and output layers, respectively, for training on the Iris database with a step size of 0.3 and a momentum constant of 0.8. This structure is trained for 675 epochs with the BPN and LAST algorithms.

Figure 5: Waveform dataset: epoch-wise training time taken by BPN and LAST algorithms.

Figure 9: Breast Cancer dataset: epoch-wise training time taken by LAST and BPN algorithms.

Figure 10: Total training time taken by BPN and LAST algorithm.

Table 1: Concrete quantity of the data sets used in the research.

Table 2: Result comparison of BPN and LAST algorithms for the IRIS dataset.

Heart Data Set.
The Statlog Heart disease database consists of 270 patient samples. Each patient's characteristics are recorded using 13 attributes.

Figure 4: Waveform dataset: epoch-wise input samples taken by BPN and LAST algorithms.

Table 3: Result comparison of BPN and LAST algorithms for the waveform dataset.

Breast Cancer Data Set.
The Wisconsin Breast Cancer Diagnosis dataset contains 569 patient breast samples, of which 357 are diagnosed as benign and 212 as malignant. Each patient's characteristics are recorded using 32 numerical features. The neural network is structured with 31, 15, and 1 neurons in the input, hidden, and output layers, respectively.

Figure 6: Heart dataset: epoch-wise input samples taken by LAST and BPN algorithms.

Table 4: Result comparison of BPN and LAST algorithms for the Heart dataset.

From Table 5, the LAST algorithm yields improved training speed in terms of both the total number of trained input samples and the total training time. The LAST algorithm also achieves equal recognition accuracy.

Result Comparison.
From Tables 2, 3, 4, and 5, it is concluded that the proposed LAST algorithm attains higher training performance in terms of the number of trained input samples and the training time.

Figure 8: Breast Cancer dataset: epoch-wise input samples taken by LAST and BPN algorithms.

Table 5: Result comparison of BPN and LAST algorithms for the Breast Cancer dataset.