A kernel-based neural network (KNN) is proposed as a neuron applicable to online learning with adaptive parameters. This neuron, equipped with an adaptive kernel parameter, can classify data accurately without resorting to a multilayer error-backpropagation neural network. The proposed method, whose core is the kernel least-mean-square (KLMS) algorithm, reduces memory requirements through a sparsification technique, and its kernel width spreads adaptively. Our experiments reveal that this method is considerably faster and more accurate than previous online learning algorithms.
The adaptive filter is at the heart of most neural networks [...].
After the introduction of kernel methods, the kernel least-mean-square (KLMS) algorithm [...] extended the LMS filter to nonlinear problems.
Kernel-based methods have two main drawbacks: selecting proper values for the kernel parameters, and series expansions whose size equals the number of training data; both make them unsuitable for online applications.
This paper concentrates only on the Gaussian kernel (for reasons similar to those discussed in [...]).
Cross-validation is one of the simplest methods to tune this parameter, but it is costly and cannot be used for datasets with too many classes; hence, in [...] the parameters are chosen using a subset of the data with a small number of classes.
We propose adaptive kernel-width learning for the KNN method, which preserves its online nature, requires no preprocessing, and reaches convergence. We use the gradient property of KNN to estimate the best kernel width during the learning process. The KNN method with adaptive kernel width therefore remains online and improves its accuracy compared with versions that use a fixed kernel width.
On the other hand, the computational complexity of kernel-based methods must be decreased for them to be useful in online applications. Pruning [...] is usually used for this purpose; here we rely on a coherence-based sparsification criterion instead.
This paper is organized as follows. Section 2 gives a short review of the LMS and KLMS algorithms and introduces our notation. Section 3 presents the proposed kernel-based neural network with adaptive kernel width and sparsification. Section 4 reports the experimental results, and Section 5 concludes the paper.
In this section, a short review of the LMS and KLMS algorithms is presented. We first introduce our notation in the table below.
Notations.
| | Description | Examples |
|---|---|---|
| Scalars | Small letters | $e(t)$, $d(t)$ |
| Vectors | Small bold letters | $\mathbf{x}$, $\mathbf{w}$ |
| Matrices | Capital letters | $\mathbf{K}$ |
| Time or iteration | Indices in parentheses | $\mathbf{w}(t)$ |
| Components of a vector | Subscript indices | $x_i$ |
The main purpose of the LMS algorithm is to find a proper weight vector $\mathbf{w}$ that reduces the MSE of the system output, based on a set of examples $\{(\mathbf{x}(i), d(i))\}$ of inputs and desired outputs. At each step the weight vector is moved along the instantaneous gradient of the squared error, $\mathbf{w}(t+1) = \mathbf{w}(t) + \eta\, e(t)\, \mathbf{x}(t)$, where $e(t) = d(t) - \mathbf{w}(t)^{\top}\mathbf{x}(t)$ and $\eta$ is the learning step.
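To make the update concrete, here is a minimal sketch of the textbook LMS loop in Python (the variable names and the learning-step value are illustrative, not taken from the paper):

```python
import numpy as np

def lms_train(X, d, eta=0.1):
    """Textbook LMS: w(t+1) = w(t) + eta * e(t) * x(t)."""
    w = np.zeros(X.shape[1])
    for t in range(X.shape[0]):
        y = w @ X[t]             # linear prediction
        e = d[t] - y             # instantaneous error
        w = w + eta * e * X[t]   # stochastic-gradient step on e^2/2
    return w
```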
The LMS algorithm learns linear patterns very well, but it is poor at learning nonlinear patterns. To overcome this problem, Pokharel et al. derived the LMS algorithm directly in the kernel feature space [...]: the input is mapped into a reproducing kernel Hilbert space (RKHS), and the weight vector is represented implicitly as a kernel expansion over the past inputs, so the same stochastic-gradient update applies to nonlinear problems.
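For comparison, a minimal KLMS sketch, assuming a Gaussian kernel (whether the paper uses $2\sigma^2$ or $\sigma^2$ in the denominator is an assumption here); the weight vector is never formed explicitly, only the expansion centers and coefficients are stored:

```python
import numpy as np

def gaussian(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def klms_train(X, d, eta=0.5, sigma=1.0):
    """KLMS: store eta*e(t) as the coefficient of a new kernel unit."""
    centers, coeffs = [X[0]], [eta * d[0]]
    for t in range(1, len(X)):
        y = sum(a * gaussian(c, X[t], sigma) for a, c in zip(coeffs, centers))
        e = d[t] - y            # error of the current kernel expansion
        centers.append(X[t])    # the network grows with every input
        coeffs.append(eta * e)
    return centers, coeffs
```

Note how the expansion grows by one term per sample; this is exactly the memory drawback that the sparsification described later addresses.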
This section presents the proposed kernel-based neural network in five parts. First, the KLMS-based neuron is explained; then the kernel adaptation, the step size of the adaptation, and the termination condition are discussed; the final subsection covers sparsification.
The KLMS neural network (KNN) performs the classification task by adding a nonlinear logistic function to the KLMS structure; the figure below shows this structure.
Structure of the KLMS neural network (KNN).
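A sketch of one forward pass of this neuron, assuming the logistic function is the standard sigmoid (the paper's figure is the authoritative definition):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def knn_output(x, centers, coeffs, sigma):
    """KNN neuron: sigmoid applied on top of the KLMS kernel expansion."""
    u = sum(a * np.exp(-np.sum((c - x) ** 2) / (2 * sigma ** 2))
            for a, c in zip(coeffs, centers))
    return sigmoid(u)   # squashed output in (0, 1), usable as a class score
```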
Similar to the KLMS algorithm, we perform a gradient search in order to find the optimum weight; the only change is that the error is propagated through the derivative of the logistic function by the chain rule before it updates the kernel expansion.
According to what was said above, finding a proper kernel plays an important role in kernel-based learning. The best kernel function is different for each dataset. One solution for improving the kernel function is to find the best kernel parameters. We therefore try to determine the best kernel width $\sigma$ during the learning process itself.
The goal of the methods in the LMS family is to reduce the mean square error, and this aim is achieved by gradient search. It can be proved that the KNN, like the KLMS algorithm, converges when there is an infinite number of samples. If the kernel and the cost function are differentiable, then gradient methods can also be used for finding the kernel's parameters. Owing to the differentiability of the Gaussian kernel and of the MSE cost function, a gradient search can be used to update the kernel width toward the least mean square error. If the proposed modified KNN cost function is defined as the instantaneous squared error $J(t) = \frac{1}{2}e(t)^2$, the kernel width is updated as $\sigma(t+1) = \sigma(t) - \rho\, \partial J(t)/\partial \sigma$, where $\rho$ is the kernel-width learning step.
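The resulting update can be sketched as follows; the factor $y(1-y)$ comes from differentiating the sigmoid, and the $\lVert \mathbf{c}_i - \mathbf{x} \rVert^2 / \sigma^3$ factor from differentiating the Gaussian kernel (again assuming the $2\sigma^2$ form):

```python
import numpy as np

def adapt_sigma(x, d, centers, coeffs, sigma, rho=0.01):
    """One gradient step on sigma for J = e^2/2 with sigmoid output."""
    sq = np.array([np.sum((c - x) ** 2) for c in centers])
    k = np.exp(-sq / (2 * sigma ** 2))      # kernel vector
    y = 1.0 / (1.0 + np.exp(-np.dot(coeffs, k)))
    e = d - y
    # dJ/dsigma = -e * y*(1-y) * sum_i a_i * k_i * ||c_i - x||^2 / sigma^3
    grad = -e * y * (1 - y) * np.dot(coeffs, k * sq) / sigma ** 3
    return sigma - rho * grad               # descend toward lower MSE
```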
The complete AKNN procedure is summarized in the following listing.
Parameters: $\eta$, the learning step; $\rho$, the kernel-width learning step; $\sigma(0)$, the primal kernel width; $\mu_0$, the sparsity threshold.
For each training instance $t$:
(1) compute the distances of the dictionary instances to the $t$th instance;
(2) compute the kernel vector, the output, and the error for the $t$th instance;
(3) compute the coefficient;
(4) if the instance is coherent with the dictionary, save the coefficient and update the kernel width with a decreased step;
(5) otherwise, update the kernel width, add the instance to the dictionary, and update the coefficients;
(6) manipulate the step size;
(7) exit the training process when the termination condition is met.
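Putting the listing together, a compact sketch of the full training loop under the same assumptions (sigmoid output, Gaussian kernel, labels in {0, 1}); the step-size schedule, the coherence threshold $\mu_0 = 0.5$, and the choice to update only the closest unit when the input is not novel are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def aknn_train(X, d, eta=0.5, rho=0.01, sigma=1.0, mu0=0.5, eps=1e-3):
    """Adaptive-kernel-width KNN with coherence-based sparsification."""
    centers, coeffs = [X[0]], [eta * d[0]]
    for t in range(1, len(X)):
        sq = np.array([np.sum((c - X[t]) ** 2) for c in centers])
        k = np.exp(-sq / (2 * sigma ** 2))            # kernel vector
        y = 1.0 / (1.0 + np.exp(-np.dot(coeffs, k)))  # neuron output
        e = d[t] - y                                  # error
        # gradient step on the kernel width (see the previous sketch)
        sigma += rho * e * y * (1 - y) * np.dot(coeffs, k * sq) / sigma ** 3
        if np.max(np.abs(k)) <= mu0:                  # coherence test: novel input
            centers.append(X[t])                      # add to the dictionary
            coeffs.append(eta * e * y * (1 - y))      # chain rule through sigmoid
        else:
            coeffs[int(np.argmax(k))] += eta * e * y * (1 - y)  # save into closest unit
        if e ** 2 < eps:                              # MSE in the acceptable range
            rho *= 0.99                               # decrease the adaptation step
    return centers, np.array(coeffs), sigma
```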
There are some problems with choosing a fixed kernel width $\sigma$: the best value is not known in advance, and a poor choice degrades both convergence speed and accuracy. With the proposed update we can track $\sigma$ during training and observe that it converges toward a suitable value regardless of its initial setting. When the MSE reaches an acceptable range ($J(t)$ below a small threshold), the kernel-width adaptation step is decreased and, once the termination condition holds, the training process exits.
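A simple illustration of such a termination test (the window length and the threshold are hypothetical choices):

```python
def should_stop(mse_history, eps=1e-3, window=50):
    """Stop adapting when the recent average MSE is below eps."""
    if len(mse_history) < window:
        return False
    return sum(mse_history[-window:]) / window < eps
```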
A network that grows with every arriving training input is the other drawback of kernel-based online learning algorithms. In this section, we use a proper criterion to cope with this problem and to produce sparse approximations of functions in the RKHS [...].
In the online coherence-based sparsification algorithm, whenever a new data pair $(\mathbf{x}(t), d(t))$ arrives, the coherence of the new input with the current dictionary $\{\mathbf{c}_i\}$ is measured as $\max_i |\kappa(\mathbf{x}(t), \mathbf{c}_i)|$, and two cases are distinguished (a sketch of this test follows the list):
(a) If the coherence does not exceed the sparsity threshold $\mu_0$, the new input cannot be represented well by the existing dictionary, so it is added as a new dictionary element.
(b) If the coherence exceeds $\mu_0$, the dictionary is left unchanged and only the expansion coefficients are updated.
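A sketch of this test, again assuming the Gaussian kernel form used above:

```python
import numpy as np

def novel_enough_to_add(x, centers, sigma, mu0=0.5):
    """Coherence criterion: add x only if max_i |k(x, c_i)| <= mu0."""
    if not centers:
        return True
    k = [np.exp(-np.sum((c - x) ** 2) / (2 * sigma ** 2)) for c in centers]
    return max(abs(v) for v in k) <= mu0
```

Because the Gaussian kernel is bounded by 1, the threshold $\mu_0 \in (0, 1)$ directly controls how densely the dictionary covers the input space.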
Two experiments have been designed. The first demonstrates the effect of the kernel width parameter on the performance of the proposed KNN method with adaptive kernel width and of the KNN method with fixed kernel width. The second compares the proposed method with other online classification methods on several classification problems.
This experiment visualizes the effect of choosing the initial kernel width $\sigma(0)$, using the spiral dataset shown below.
Spiral dataset with 800 instances.
Evolution curves of the MSE of the KNN and AKNN methods for different initial kernel widths.
Evolution curves of the kernel width $\sigma$ of the AKNN method for different initial values.
The empirical feature space preserves the geometrical structure of the data as embedded in the kernel feature space, so projecting the data into it visualizes how the kernel width shapes the class separation.
Classification result of the KNN method for different values of the kernel width $\sigma$.
Two-dimensional projection in the empirical feature space for different initial values of $\sigma$.
The evolution curves show that, for every initial value, the adaptive kernel width of AKNN converges to a similar final value and drives the MSE down, whereas the accuracy of the fixed-width KNN depends strongly on the chosen $\sigma$.
The projections in the empirical feature space confirm this behavior: with the adapted width, the two classes become well separated regardless of the starting point.
The experiments have been carried out to evaluate the performance of the proposed method in comparison with a number of online classification methods: the kernel perceptron algorithm [...], ROMMA [...], ALMA [...], the passive-aggressive (PA) algorithm [...], and DOUL [...].
Datasets used in the experiments.
| Dataset | No. of instances | No. of features |
|---|---|---|
| Small | | |
| Sonar | 208 | 60 |
| Ionosphere | 351 | 34 |
| Pima | 768 | 8 |
| Medium | | |
| German | 1,000 | 24 |
| Splice | 1,000 | 60 |
| Cloud | 2,048 | 10 |
| Large | | |
| Football | 4,288 | 13 |
| Spambase | 4,601 | 58 |
| MITFace | 6,977 | 361 |
| Mushrooms | 8,124 | 112 |
To make a fair comparison, all algorithms adopt the same experimental setup; in particular, the penalty parameter is set to the same value for all algorithms in comparison.
We scale all training and testing data to a common range before learning.
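A sketch of this scaling step; the target range $[-1, 1]$ is an assumption here, and the scaler is fitted on the training data only:

```python
import numpy as np

def fit_minmax(X_train, lo=-1.0, hi=1.0):
    """Per-feature min-max scaler fitted on the training split."""
    mn, mx = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)   # guard constant features
    return lambda X: lo + (hi - lo) * (X - mn) / span

# usage: scale = fit_minmax(X_train); X_train, X_test = scale(X_train), scale(X_test)
```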
The two tables below report the results of these comparisons: the first covers the large datasets, and the second the small and medium datasets.
Evaluation of online learning algorithms with sparsification ability on the large datasets.
| Algorithm | Football: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) | Spambase: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) |
|---|---|---|---|---|---|---|---|---|
| Perceptron | 21.493 | 6.002 | 21.493 | 41.043 | 25.325 | 8.179 | 25.325 | 40.881 |
| ROMMA | 45.605 | 12.550 | 20.282 | 40.905 | 46.360 | 15.474 | 23.403 | 39.708 |
| ALMA | 23.807 | 6.646 | 20.430 | 40.416 | 28.203 | 9.440 | 23.306 | 39.361 |
| PA | 49.531 | 13.864 | 20.256 | 40.694 | 54.262 | 17.776 | 23.757 | 38.731 |
| DOUL | 45.457 | 19.883 | 21.775 | 40.041 | 49.744 | 27.720 | 23.422 | 39.382 |
| AKNN | | | | | | | | |

| Algorithm | MITFace: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) | Mushrooms: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) |
|---|---|---|---|---|---|---|---|---|
| Perceptron | | 15.333 | 20.764 | 50.366 | 41.085 | 40.077 | 41.085 | 47.587 |
| ROMMA | 32.428 | 24.932 | 19.347 | 50.724 | 61.679 | 60.266 | 41.437 | 49.569 |
| ALMA | 21.138 | 15.822 | 18.888 | 50.365 | 43.330 | 42.153 | 40.501 | 45.335 |
| PA | 43.635 | 32.526 | 19.828 | 48.674 | 62.218 | 61.115 | 41.234 | 47.378 |
| DOUL | 39.419 | 49.529 | 19.279 | 50.193 | 63.467 | 192.678 | 41.075 | 47.082 |
| AKNN | 37.366 | | | | | | | |
Evaluation of online learning algorithms on the small and medium datasets.
| Algorithm | Sonar: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) | Ionosphere: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) |
|---|---|---|---|---|---|---|---|---|
| Perceptron | 35.722 | | 35.722 | 39.952 | | | 20.675 | 48.117 |
| ROMMA | 69.946 | 0.066 | 33.048 | 39.405 | 68.829 | 0.084 | 22.829 | 42.260 |
| ALMA | 42.139 | 0.042 | 33.961 | 34.643 | 45.949 | 0.057 | 23.601 | 46.907 |
| PA | 77.112 | 0.069 | 33.636 | 37.571 | 67.785 | 0.094 | | 45.470 |
| DOUL | 71.551 | 0.078 | 31.818 | 39.428 | 68.449 | 0.114 | 24.727 | 48.126 |
| KNN | 98.716 | 0.056 | 27.166 | 26.857 | 99.968 | 0.139 | 27.331 | 39.395 |
| AKNN | 100 | 0.059 | 28.343 | 25.357 | 95.032 | 0.132 | 32.894 | 39.361 |
| Sparse AKNN | | 0.038 | | | 38.418 | 0.077 | 45.338 | |

| Algorithm | Pima: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) | German: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) |
|---|---|---|---|---|---|---|---|---|
| Perceptron | 30.679 | 0.313 | 30.679 | 45.955 | 40.289 | 0.637 | 40.289 | 35.0 |
| ROMMA | 51.185 | 0.527 | 30.824 | 45.957 | 99.922 | 1.574 | 34.555 | 33.7 |
| ALMA | 32.442 | 0.345 | 29.003 | 43.481 | 99.722 | 1.574 | 34.600 | 34.4 |
| PA | 61.257 | 0.609 | 29.176 | 44.400 | 99.911 | 1.578 | 34.544 | 33.3 |
| DOUL | 55.390 | 0.635 | 31.228 | 49.086 | 99.911 | 2.539 | 34.544 | 31.4 |
| KNN | 97.471 | 0.489 | 25.462 | | 100 | 0.824 | 45.022 | 39.6 |
| AKNN | 94.957 | 0.494 | | 24.735 | 100 | 0.876 | 31.733 | |
| Sparse AKNN | | | 28.367 | 25.916 | | | | 29.9 |

| Algorithm | Splice: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) | Cloud: Density (%) | Training time (s) | Training mistake (%) | Test mistake (%) |
|---|---|---|---|---|---|---|---|---|
| Perceptron | | 0.715 | 46.033 | 49.3 | | | 0.222 | 25.243 |
| ROMMA | 99.611 | 1.538 | 43.389 | 46.4 | 14.248 | 1.191 | 0.108 | 25.441 |
| ALMA | 95.933 | 1.479 | 44.022 | 43.9 | 0.494 | 0.101 | 0.125 | 24.661 |
| PA | 98.455 | 1.507 | 43.378 | 45.5 | 3.076 | 0.332 | 0.108 | 25.292 |
| DOUL | 98.455 | 2.305 | 43.378 | 45.6 | 3.076 | 0.339 | 0.108 | 23.536 |
| KNN | 36.000 | | | 42.4 | 44.927 | 0.697 | 0.054 | |
| AKNN | 92.000 | 0.765 | 39.289 | 41.7 | 44.927 | 0.737 | 0.108 | |
| Sparse AKNN | 81.188 | 0.694 | 40.155 | | 0.488 | 0.144 | | |
According to the experimental results shown in the tables, although there is little difference in the training mistake rate among all methods, the proposed method achieves the best test mistake rate on most datasets together with a significantly smaller training time; it can also be seen that the perceptron, ALMA, and the sparse AKNN yield the lowest densities.
So, we can say that among all the online learning algorithms compared, the sparse AKNN offers the best overall trade-off between accuracy, training time, and model size.
The goal of the present paper was to present a novel adaptive kernel least-mean-square neural network. This method (AKNN) carries out kernel-width adaptation and sparsification simultaneously. We briefly touched on the history of learning algorithms based on least mean squares. We then proposed an adaptive kernel width that is updated by gradient descent so that the mean square error decreases iteratively, and we combined it with a coherence-based sparsification criterion to keep the network small.
Our future work deals with the following questions: can a more efficient function than the sigmoid be used in the KNN neuron? Can other kernel functions be adapted for use in the KNN neuron? And how can the performance of AKNN be improved on imbalanced and noisy data?