Online Sequential Projection Vector Machine with Adaptive Data Mean Update

We propose a simple online learning algorithm, especially suited for high-dimensional data, referred to as the online sequential projection vector machine (OSPVM). It derives from the projection vector machine (PVM) and can learn from data in one-by-one or chunk-by-chunk mode. In OSPVM, data centering, dimension reduction, and neural network training are integrated seamlessly. In particular, the model parameters, including (1) the projection vectors for dimension reduction, (2) the input weights, biases, and output weights, and (3) the number of hidden nodes, can be updated simultaneously. Moreover, only one parameter, the threshold of the accumulation ratio, needs to be determined manually, which makes OSPVM easy to use in real applications. Performance comparisons were made on various high-dimensional classification problems between OSPVM and other fast online algorithms, including the budgeted stochastic gradient descent (BSGD) approach, the adaptive multihyperplane machine (AMM), the primal estimated subgradient solver (Pegasos), the online sequential extreme learning machine (OSELM), and SVD + OSELM (feature selection based on SVD performed before OSELM). The results demonstrate the superior generalization performance and efficiency of OSPVM.


Introduction
In many real applications, such as text mining, visual tracking, and dynamic interest perception, two problems commonly arise: (1) new data arrives sequentially, and (2) the data lies in a high-dimensional space. For the first problem, many online sequential algorithms have been proposed [1][2][3][4][5][6][7][8][9][10][11][12][13]. SGBP [1] is one of the main variants of BP for sequential learning applications, in which the network parameters are learned iteratively on the basis of first-order information. Crammer and Lee [7] proposed a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. Hoi et al. [8] proposed an online multiple kernel classification algorithm which learns a kernel-based prediction function by selecting a subset of predefined kernel functions in an online fashion. Wang et al. [9] proposed a Fourier online gradient descent algorithm that applies random Fourier features to approximate kernel functions. Zhao et al. [14] proposed a fast bounded online gradient descent algorithm for scalable kernel-based applications that constrains the number of support vectors by a predefined budget. Zhang et al. [11] proposed an online kernel learning algorithm which measures the difficulty of correctly classifying a training example by the derivative of a smooth loss function and, via a sampling scheme, gives a difficult example more chance than an easy one to become a support vector. Shalev-Shwartz et al. [12] proposed a simple and effective stochastic subgradient descent algorithm, the primal estimated subgradient solver (Pegasos), for solving the optimization problem cast by Support Vector Machines (SVMs). Wang et al. [13] proposed an adaptive multihyperplane machine (AMM) model that consists of a set of linear hyperplanes (weights), each assigned to one of the multiple classes, and predicts based on the associated class of the weight that provides the largest prediction.
Wang et al. [10] proposed a budgeted stochastic gradient descent (BSGD) approach for training SVMs which keeps the number of support vectors bounded during training through several budget maintenance strategies. OSELM [15] is a very fast sequential algorithm derived from the batch extreme learning machine (ELM) [16], in which the input weights are randomly generated and the output weights are determined by incremental least squares. The aforementioned algorithms each have their own advantages in solving online learning problems for new data. However, they all treat data preprocessing as independent of online model learning. Different from these approaches, we propose an online learning algorithm, OSPVM (online sequential projection vector machine), based on batch-PVM, which combines data preprocessing (data centering and dimension reduction) and model learning as a whole. In our earlier work we proposed incremental PVM [17], which can learn PVM incrementally; however, it cannot update the data mean automatically. Data mean update is very important for improving the generalization performance of OSPVM: when new samples arrive, if the data mean is not updated, the components (features) obtained by SVD/PCA will shift and degrade the generalization performance. The proposed OSPVM algorithm enjoys three properties: (1) the mean of the data can be updated dynamically, (2) the projection vectors can be updated incrementally to capture more useful features from new data, and (3) the number of hidden nodes can be adjusted adaptively to ensure enough learning capability.
The paper is organized as follows. Section 2 gives a brief review of the batch-PVM. Section 3 presents the derivation of OSPVM. Performance evaluation of OSPVM is shown in Section 4 based on the benchmark problems in different areas. Conclusions based on the study and experiments are made in Section 5.

Review of Projection Vector Machine
This section briefly reviews the batch-PVM developed by Deng et al. [18] to provide the necessary background for the development of OSPVM in Section 3. To ease reading, some symbols are defined: (i) A = [x_1, x_2, ..., x_N] ∈ R^{n×N}: the data matrix with samples as columns; (ii) Ā = (∑_{i=1}^{N} x_i)/N: the mean vector of A; (iii) 1_{1×N} = [1, 1, ..., 1] ∈ R^{1×N}: the all-ones row vector.

Single Hidden Layer Feedforward Neural Network (SLFN).
Given N training samples {(x_i, t_i)}_{i=1}^{N}, where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T ∈ R^n and t_i = [t_{i1}, t_{i2}, ..., t_{iℓ}]^T ∈ R^ℓ, a standard SLFN with Ñ hidden nodes and activation function g(x) is mathematically modeled as

∑_{j=1}^{Ñ} β_j g(w_j · x_i + b_j) = o_i,  i = 1, ..., N,  (1)

where w_j = [w_{j1}, w_{j2}, ..., w_{jn}]^T ∈ R^n is the input weight vector connecting the jth hidden node with the input nodes, b_j ∈ R is the threshold of the jth hidden node, and β_j = [β_{j1}, β_{j2}, ..., β_{jℓ}]^T ∈ R^ℓ is the output weight vector connecting the jth hidden node with the output nodes.
w_j · x_i denotes the inner product of w_j and x_i. If the threshold b_j is treated as an input weight and denoted as w_{j(n+1)}, then w_j can be extended to w_j = [w_{j1}, w_{j2}, ..., w_{jn}, w_{j(n+1)}]^T ∈ R^{n+1} and the sample x_i is extended to [x_i; 1]. Equation (1) can then be transformed and written compactly as Hβ = T, where H is the hidden-layer output matrix, β the output weight matrix, and T the target matrix. To train an SLFN, one may wish to find specific W, β that minimize the cost function E(W, β) = ‖Hβ − T‖². Gradient-based learning algorithms [19] are generally used to search for (W, β) by minimizing E(W, β), but they are time-consuming and may stop at a local minimum. The extreme learning machine (ELM) [16, 20] randomly chooses the input weights W and analytically determines the output weights β = H†T by the Moore-Penrose generalized inverse. ELM can learn hundreds of times faster than gradient-based learning algorithms. But for high-dimensional, small-sample data, ELM becomes seriously unstable, especially when the data is sparse (many zero features). To tackle this problem, we proposed the batch projection vector machine (Batch-PVM) [18].
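As a concrete illustration of the ELM recipe just described (random input weights and biases, output weights from the Moore-Penrose pseudoinverse), here is a minimal NumPy sketch; the function names and the toy XOR-like data are ours, not from the paper:

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """Basic ELM: random input weights/biases, least-squares output weights.
    X: (N, n) inputs, T: (N, l) targets. Illustrative helper, not the
    authors' reference implementation."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, n_hidden))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # sigmoid hidden layer
    beta = np.linalg.pinv(H) @ T                     # Moore-Penrose solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# toy usage: fit XOR-like targets exactly (N=4 samples, 20 hidden nodes)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
W, b, beta = elm_train(X, T, n_hidden=20)
pred = elm_predict(X, W, b, beta)
```

With more hidden nodes than samples, the least-squares system is solved exactly, which is why ELM interpolates this tiny dataset.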

Batch Projection Vector Machine (Batch-PVM).
Batch-PVM combines the SLFN with SVD seamlessly: the input weights of the SLFN are calculated from the SVD. Given data X, through data centering and extension, the data is transformed as X̂ = [X − X̄1_{1×N}; 1], and its low-rank SVD is X̂ ≈ U_k Λ_k V_k^T, where k is the truncated rank and U_k holds the projection vectors by which X̂ is mapped into the low-dimensional space: U_k^T X̂ = Λ_k V_k^T. Since the role of the input weights of the SLFN can be treated as dimension reduction, they can be obtained directly as W = U_k. Naturally, the number of hidden nodes is determined by Ñ = k.
The problem then becomes a linear one, and the output weights can be obtained by least squares: β = (Λ_k V_k^T)†T. Experimental results on many classification and regression problems show that Batch-PVM is faster and more accurate than the familiar two-stage methods in which dimension reduction and SLFN training are performed independently. Batch-PVM assumes that all the training data (samples) are available, but, in real applications, some training data has been accumulated while new data arrives chunk-by-chunk or one-by-one (a special case of chunk). Batch-PVM has to be modified for this case so as to learn online sequentially [21, 22].
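The Batch-PVM pipeline (center and extend the data, take a truncated SVD, reuse the projection vectors as input weights, solve the output weights by least squares) can be sketched as follows. This is our reading of the method, with samples stored as columns; the helper names and synthetic data are ours:

```python
import numpy as np

def batch_pvm_train(A, T, k):
    """Batch-PVM sketch: input weights from the truncated SVD of the
    centered-and-extended data, output weights by least squares.
    A: (n, N) data with samples as columns, T: (N, l) targets,
    k: truncated rank (= number of hidden nodes)."""
    n, N = A.shape
    mean = A.mean(axis=1, keepdims=True)
    A_hat = np.vstack([A - mean, np.ones((1, N))])   # center and extend
    U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
    W = U[:, :k]                                     # input weights = projection vectors
    H = 1.0 / (1.0 + np.exp(-(W.T @ A_hat).T))       # sigmoid hidden outputs
    beta = np.linalg.pinv(H) @ T                     # output weights
    return mean, W, beta

def batch_pvm_predict(x, mean, W, beta):
    x_hat = np.vstack([x - mean, np.ones((1, x.shape[1]))])
    H = 1.0 / (1.0 + np.exp(-(W.T @ x_hat).T))
    return H @ beta

# toy usage with synthetic data (shapes only; not the paper's benchmarks)
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 30))          # n=10 features, N=30 samples
T = rng.normal(size=(30, 2))
mean, W, beta = batch_pvm_train(A, T, k=5)
pred = batch_pvm_predict(A, mean, W, beta)
```

Note how the number of hidden nodes coincides with the SVD truncation rank k, exactly as stated in the text.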

The Proposed Online Sequential Algorithm
The seamless combination of dimension reduction and SLFN training facilitates the design of sequential online learning. Once the SVD is updated for new samples, the dimension-reduction projection matrix and all the parameters {W, β, Ñ} of the SLFN can be updated conveniently.

Data Mean and Projection Vectors Update.
Assume that N training samples ℵ_0 = {(x_i, t_i)}_{i=1}^{N} have been available so far; the inputs and targets are denoted as A = [x_1, x_2, ..., x_N] and T_0 = [t_1, t_2, ..., t_N], respectively. By centering (subtracting the mean of the inputs) and extension, the data is transformed as Â = [A − Ā1_{1×N}; 1]. The SVD of Â with truncated rank k is Â = UΛV^T. Assume that the (j+1)th chunk of data ℵ_{j+1} = {(x_i, t_i)}_{i=1}^{M} is presented, where the new inputs and targets are denoted as B = [x_1, x_2, ..., x_M] and T_{j+1} = [t_1, t_2, ..., t_M], respectively; the horizontal concatenation of A and B is denoted as C = [A, B] ∈ R^{n×(N+M)}. The update task is to obtain the new mean C̄ and the SVD of [C − C̄1_{1×(N+M)}; 1]. Many sophisticated algorithms have been developed to update the SVD efficiently as more data arrives [23]. However, most approaches assume that the sample mean is fixed when updating the eigenbasis, or equivalently that the data is inherently zero-mean. This assumption does not hold in many applications: new samples change the data mean, so the mean must be recomputed before the SVD is updated. One approach, proposed by Hall et al. [24], considers the change of the mean while updating the SVD as each set of new data arrives; however, its high computational cost is a bottleneck in many applications. Here we extend the Sequential Karhunen-Loeve algorithm [25] to update the SVD efficiently with a simultaneous mean update. First we update the mean: the mean vectors of A and B are Ā and B̄, and it is not difficult to find that the combined mean is C̄ = (NĀ + MB̄)/(N + M). Since the SVD of Â is already known, the SVD of the concatenation [Â, B̂], where B̂ = [B − Ā1_{1×M}; 1], can be computed by an incremental algorithm [18]: [Â, B̂] = ÛΛ̂V̂^T. Denote V̇ = V̂ − v̄1_{1×(N+M)}, where v̄ is the row mean of V̂; then V̂ = V̇ + v̄1_{1×(N+M)}. Substituting this into (15) shows that the SVD of [C − C̄1_{1×(N+M)}; 1] can be calculated from the factors ÛΛ̂V̂^T. Perform the QR-decomposition V̇ = Q̇Ṙ.
Substituting (19) into (18) and performing the SVD ÛΛ̂Ṙ^T = ŨΛ̃W̃^T, we substitute back to obtain the SVD of the freshly centered data:

[C − C̄1_{1×(N+M)}; 1] = ŨΛ̃(Q̇W̃)^T.

It remains to compute the SVD of [Â, B̂]. Let B̃ be an orthonormal basis for the component of B̂ orthogonal to U, that is, B̃ = orth(B̂ − UU^T B̂). We then have the partitioned form

[Â, B̂] = [U, B̃] [Λ, U^T B̂; 0, B̃^T(B̂ − UU^T B̂)] [V^T, 0; 0, I] = [U, B̃] M̆ [V^T, 0; 0, I].

The SVD of the small middle matrix, M̆ = ŬΛ̆V̆^T, can be computed in time independent of N, so we get the SVD of [Â, B̂]:

[Â, B̂] = ([U, B̃]Ŭ) Λ̆ (V̆^T [V^T, 0; 0, I]).
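Under our reading of the derivation above, the update can be sketched as follows: the change of mean is folded into the incremental SVD step as one extra weighted column, after which a standard Sequential Karhunen-Loeve update applies. The helper below (names ours; the extension row of ones is omitted for brevity) reproduces the singular values of the freshly centered concatenation exactly when no truncation is applied:

```python
import numpy as np

def inc_svd_mean_update(U, s, mean, N, B):
    """One-chunk update of the left SVD factors of mean-centered data.
    U, s : left singular vectors/values of (A - mean) for N old samples
    B    : (n, M) chunk of new samples (columns)
    Returns updated U, s, mean, and sample count. Sketch of the
    mean-corrected Sequential Karhunen-Loeve idea, not the authors' code."""
    n, M = B.shape
    B_mean = B.mean(axis=1, keepdims=True)
    new_mean = (N * mean + M * B_mean) / (N + M)       # updated data mean
    # extra column accounting for the shift of the mean
    shift = np.sqrt(N * M / (N + M)) * (B_mean - mean)
    B_aug = np.hstack([B - B_mean, shift])
    # standard incremental SVD step on the augmented block
    proj = U.T @ B_aug                                 # in-subspace part
    resid = B_aug - U @ proj                           # orthogonal part
    Q, R = np.linalg.qr(resid)
    K = np.block([[np.diag(s), proj],
                  [np.zeros((R.shape[0], s.size)), R]])
    Uk, sk, _ = np.linalg.svd(K, full_matrices=False)
    return np.hstack([U, Q]) @ Uk, sk, new_mean, N + M

# verify against a batch SVD of the freshly centered concatenation
rng = np.random.default_rng(1)
A = rng.normal(size=(12, 6))
B = rng.normal(size=(12, 4)) + 2.0      # chunk with a different mean
mA = A.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(A - mA, full_matrices=False)
U2, s2, m2, cnt = inc_svd_mean_update(U, s, mA, 6, B)
C = np.hstack([A, B])
mC = C.mean(axis=1, keepdims=True)
s_ref = np.linalg.svd(C - mC, compute_uv=False)
```

The equivalence rests on the scatter identity: the scatter of [A, B] about the new mean equals the scatter of A about Ā plus that of B about B̄ plus (NM/(N+M))(B̄ − Ā)(B̄ − Ā)^T, which the single extra column encodes.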

Adaptive Hidden Nodes Update.
The number of hidden nodes is very important for an SLFN [6]. Too many hidden nodes lead to overfitting, while too few might leave the network with insufficient learning capability. When new training samples are presented, hidden nodes should be added to ensure the SLFN model possesses enough learning capability. OP-ELM [26] ranks the hidden nodes by multiresponse sparse regression (MRSR) and then makes the final decision on the appropriate number of nodes by Leave-One-Out (LOO) validation. I-ELM [27] increases random hidden nodes one-by-one until the residual error is smaller than a given threshold. EI-ELM [28] selects an optimized random hidden node from a random candidate set before increasing the hidden nodes one-by-one. C-ELM [29] associates each model term with a regularization parameter, so that insignificant terms are automatically penalized and left unselected. Since, in PVM, the number of hidden nodes Ñ is equal to the target low rank of the SVD, we adopt the accumulation ratio of principal components to determine the number of nodes. The accumulation ratio is defined by [30] as

A(Ñ) = ∑_{i=1}^{Ñ} σ_i² / ∑_{i=1}^{r} σ_i²,

where σ_i denotes the ith singular value of the singular value diagonal matrix Λ = Diag{σ_1, σ_2, ..., σ_r}, Ñ denotes the number of hidden nodes, and r is the number of nonzero singular values. By choosing the smallest value Ñ that makes A(Ñ) ≥ θ hold, where θ is a given threshold, we get the new number of hidden nodes. The new input weights are then updated as W = Ũ (the first Ñ updated projection vectors), and the output weights are updated as β = (Λ̃Ṽ^T)†T. The algorithm is summarized as Algorithm 1.
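The accumulation-ratio rule can be sketched as below; we use squared singular values (the usual variance-based criterion), and the helper name is ours:

```python
import numpy as np

def choose_hidden_nodes(singular_values, theta):
    """Pick the number of hidden nodes as the smallest rank whose
    accumulation ratio of squared singular values reaches theta."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    ratio = np.cumsum(energy) / energy.sum()
    # index of the first cumulative ratio >= theta, converted to a count
    return int(np.searchsorted(ratio, theta) + 1)

# example spectrum: energies 100, 25, 4, 1, 0.25 (total 130.25), so the
# cumulative ratios are about 0.768, 0.960, 0.990, 0.998, 1.0
s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
k95 = choose_hidden_nodes(s, 0.95)   # 2 nodes reach 95% of the energy
k99 = choose_hidden_nodes(s, 0.99)   # 3 nodes reach 99%
```

Raising θ toward 1 keeps more singular directions and thus more hidden nodes, which matches the paper's choice of θ in [0.95, 0.99] by cross-validation.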

Theoretical Analysis: OSPVM versus OSELM.
It is difficult to prove strictly that OSPVM is better than OSELM, so here we give a theoretical analysis from a feature-learning point of view.
As discussed in [31], minimizing reconstruction error is one important condition for learning useful features. The reconstruction error of OSELM can be written as

ε_OSELM = min_P ‖X − XW_rand P‖²_F,

where X ∈ R^{N×n} is the input matrix (N is the number of instances and n is the dimensionality of the data), W_rand ∈ R^{n×Ñ} (Ñ is the number of hidden nodes) is the random input weight matrix, P is the best linear reconstruction matrix, and ‖·‖_F is the Frobenius norm. The reconstruction error of OSPVM can be written as

ε_OSPVM = ‖X − Ũ_Ñ S̃_Ñ Ṽ_Ñ^T‖²_F,

where the input weights W_SVD are obtained by the singular value decomposition X = USV^T, and Ũ_Ñ S̃_Ñ Ṽ_Ñ^T is the truncated rank-Ñ SVD of X. By the Eckart-Young theorem, ‖X − Ũ_Ñ S̃_Ñ Ṽ_Ñ^T‖²_F is the error of the optimal rank-Ñ approximation of X; that is, ε_OSPVM is the minimum reconstruction error attainable with rank Ñ. Therefore the reconstruction error of OSELM must be larger than that of OSPVM: ε_OSELM > ε_OSPVM. In summary, when OSELM and OSPVM have the same number of hidden nodes Ñ ≪ n, ε_OSPVM is always smaller than ε_OSELM. Another condition for good generalization performance is to keep the number of hidden nodes Ñ as small as possible (Occam's razor). Considering these two conditions, we draw two inferences: (1) when OSPVM and OSELM have the same number of hidden nodes with Ñ ≪ n, the reconstruction error of OSPVM is smaller than that of OSELM (ε_OSPVM < ε_OSELM), which in general helps OSPVM obtain better generalization performance, and (2) for the same reconstruction error, OSPVM always needs fewer hidden nodes than OSELM; according to Occam's razor, OSPVM will thus produce better generalization performance than OSELM with fewer hidden nodes. Next, we briefly explain why OSPVM is better than SVD + OSELM in generalization performance in most cases. Similar to OSPVM, SVD + OSELM represents the data by SVD to obtain more useful features. However, SVD + OSELM discards the projection vectors obtained by the SVD and still uses random values as input weights. In contrast, OSPVM uses the resulting projection vectors as input weights and thus avoids the instability of random weights. Hence OSPVM produces better generalization performance than SVD + OSELM in most cases.

Algorithm 1 (OSPVM).
Initial phase: given the initial training data A and the accumulation ratio threshold θ:
(1) compute the data mean Ā and get Â = [A − Ā1_{1×N}; 1];
(2) compute the SVD of Â: Â = UΛV^T;
(3) get the number of hidden nodes Ñ by making A(Ñ) > θ;
(4) obtain the input weights W = Ũ;
(5) compute the output weights β = (Λ̃Ṽ^T)†T.
Online learning phase: given the jth chunk of data B_j, update the mean and the SVD as in Section 3.1, update Ñ by the accumulation ratio, and recompute W and β accordingly.
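The reconstruction-error comparison above can be checked numerically: by the Eckart-Young theorem the truncated SVD gives the smallest Frobenius reconstruction error of any rank-Ñ map, so a random projection of the same rank can never beat it. A small illustrative experiment on synthetic data (not the paper's benchmarks):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))      # N=100 instances, n=50 dimensions
k = 5                               # number of hidden nodes / target rank

# OSPVM-style: the optimal rank-k approximation via truncated SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :k] * s[:k]) @ Vt[:k]
err_svd = np.linalg.norm(X - X_svd) ** 2

# OSELM-style: project onto a random k-dimensional subspace, then
# reconstruct as well as linear least squares allows
W_rand = rng.normal(size=(50, k))
Z = X @ W_rand
err_rand = np.linalg.norm(X - Z @ np.linalg.lstsq(Z, X, rcond=None)[0]) ** 2
```

Since the random reconstruction also has rank at most k, `err_rand` is lower-bounded by `err_svd`, mirroring the inequality ε_OSELM ≥ ε_OSPVM in the text.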

Datasets and Experimental Settings.
We select OSELM, BSGD, AMM, and Pegasos to compare with OSPVM on various UCI benchmark problems, as shown in Table 1. For fair comparison, feature selection by SVD is first conducted before these algorithms. The number of reduced dimensions and the number of hidden nodes Ñ are both gradually increased in steps of 5 and the nearly optimal combinations are selected by cross-validation. OSELM code is downloaded from the ELM homepage (http://www.ntu.edu.sg/home/egbhuang/). BSGD, AMM, and Pegasos are downloaded from the BudgetedSVM website (http://www.dabi.temple.edu/budgetedsvm/). OSPVM and SVD + OSELM are implemented by ourselves. For OSELM and Batch-PVM, the number of hidden nodes is gradually increased in steps of 5 and the nearly optimal one is selected by cross-validation. For OSPVM, the accumulation ratio threshold θ is chosen in the range [0.95, 0.99] by cross-validation for each specific application. The activation functions for OSELM, OSPVM, SVD + OSELM, and Batch-PVM are all set to the sigmoid function g(x) = 1/(1 + e^{−x}). For BSGD we set the kernel to the Gaussian kernel K(x_i, x_j) = exp(−(1/σ)‖x_i − x_j‖²), the budget maintenance strategy to "merging," which is more accurate than the alternative "removing," and the number of budgeted support vectors is determined by cross-validation. For AMM, the limit on the number of weights per class is determined by cross-validation, and the learning rate is set to 0.0001. All simulations are run in MATLAB 7 on a Pentium i7 920 @ 2.67 GHz CPU with 6 GB RAM. Average results of 20 trials for each fixed size of SLFN are obtained, and the best performance, including training accuracy, testing accuracy, training time, testing time, and t-test, is reported. The t-test [32] is used to evaluate the performance differences between the algorithms.
Denoting the testing accuracies of the ith algorithm on the five datasets as a_i = [a_{i,1}, a_{i,2}, ..., a_{i,5}], the t value for algorithms i and j can be computed as

t = (ā_i − ā_j) / sqrt(v_i²/n_i + v_j²/n_j),

where ā_i and ā_j denote the mean values of a_i and a_j, v_i² and v_j² represent the variances of a_i and a_j, and n_i and n_j denote the numbers of datasets (here n_i = n_j = 5). By checking the t-table, we can obtain the significance level α. Notice that the smaller the α value, the more significant the difference.
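The t statistic above can be computed as in the sketch below; the accuracy vectors are hypothetical placeholders, not the values from the paper's tables:

```python
import math

def t_value(a, b):
    """Two-sample t statistic as in the text:
    t = (mean_a - mean_b) / sqrt(va/na + vb/nb), with sample variances."""
    na, nb = len(a), len(b)
    ma = sum(a) / na
    mb = sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # sample variance of b
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# hypothetical testing accuracies on five datasets (illustrative only)
acc_ospvm = [91.2, 88.5, 84.0, 93.1, 95.3]
acc_oselm = [89.0, 87.1, 82.5, 91.0, 94.2]
t = t_value(acc_ospvm, acc_oselm)
```

The resulting t is then compared against a t-table to read off the significance level α.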
OSPVM is first compared with Batch-PVM, BSGD, AMM, and Pegasos in this section. The number of hidden nodes, training time, testing time, training accuracy, and testing accuracy are reported in Table 2. The t-test results, including the t value and significance level, are summarized in Table 3. We can see from Table 2 that OSPVM achieves nearly the same generalization as Batch-PVM, while its training time is longer than that of Batch-PVM. The 16-by-16 mode is faster than one-by-one: taking the "Face" dataset as an example, the training time of OSPVM is about 1.5 seconds in 16-by-16 mode and 13.07 seconds in 1-by-1 mode. The reason lies in the fact that the bigger the chunk size, the lower the update frequency. Batch-PVM needs just 0.46 seconds for the "Face" dataset; in fact, Batch-PVM is the extreme case in which the initial data is the entire data and no update is needed. For new samples, however, OSPVM can learn incrementally while Batch-PVM has to be retrained from scratch. Taking the "Face" dataset as an example, the average updating time of OSPVM per sample is around 1.5/200 = 0.0075 seconds, while for Batch-PVM, since it has to be retrained from scratch, the updating time per sample is about 0.460 seconds. OSPVM is thus much faster than Batch-PVM in per-sample updating time.

One-by-One.
In this section we compare OSPVM, OSELM, and SVD + OSELM in the one-by-one case. Their training and testing accuracies are reported in Table 4, the t values are shown in Table 6, and the training and testing times are reported in Table 5. As observed from Tables 4 and 5, although OSELM learns at the fastest speed, OSPVM produces better generalization performance than OSELM (t = 0.950, α > 0.1). OSPVM also obtains improved performance in most cases compared to SVD + OSELM while saving training time. Taking the "Face" dataset as an example, SVD + OSELM takes 22.40 s to reach 91.0% accuracy while OSPVM takes 13.07 s to reach 91.2% accuracy. The reason lies in the fact that OSPVM can learn useful features similar to SVD + OSELM while removing the redundancy between dimension reduction and neural network training. For SVD + OSELM, two control parameters, the target dimension and the number of hidden nodes, need to be tuned, while for OSPVM only one parameter needs to be determined. This makes OSPVM simpler to configure and more convenient to use in real applications than SVD + OSELM. As shown in Table 7, where the hidden nodes and target dimensions are reported, OSPVM needs fewer hidden nodes than OSELM and SVD + OSELM, which means OSPVM yields a more compact network and faster response than the other algorithms.

Chunk-by-Chunk.
The performance of OSPVM, SVD + OSELM, and OSELM in chunk-by-chunk mode (here we take 16-by-16 as an example) is reported in Tables 8, 9, 10, and 11. The results are similar to the one-by-one mode. Table 9 shows that OSPVM needs a longer training time than OSELM but a shorter training time than SVD + OSELM. Tables 8, 10, and 11 show that OSPVM obtains better generalization performance and a more compact structure than OSELM and SVD + OSELM in most cases. This means that OSPVM improves the stability of OSELM in solving small-sample, high-dimensional problems while inheriting the learning efficiency of OSELM.
Note: since OSPVM is equivalent to PVM rather than an approximation, given the same experimental setting (same number of hidden nodes and same training and testing splits), OSPVM and PVM obtain the same performance (training accuracy and testing accuracy). Figure 1(a) shows the curve of the number of hidden nodes as the training samples increase. We can see that the hidden nodes of OSPVM grow adaptively as new samples (chunk size 40) are presented. Figure 1(b) shows the curves of training accuracy and testing accuracy as the samples increase. We can observe that the covering capability (training accuracy) and generalization performance (testing accuracy) of the model remain stable throughout.

Equivalence of OSPVM and PVM.
Updating the data mean together with the projection vectors ensures that the obtained OSPVM is an exact model equivalent to PVM rather than an approximation (without the data mean update, only an approximate model would be obtained). This means that, given the same parameter setting (same number of hidden nodes, same training and testing splits, etc.), OSPVM and PVM obtain the same performance (training accuracy and testing accuracy). To verify this equivalence, we run the two algorithms with the same settings on the benchmarks. From the results shown in Table 12, it can be seen that OSPVM obtains the same training accuracy and testing accuracy as PVM. This illustrates experimentally that OSPVM is equivalent to PVM instead of an approximation and thus attains the same generalization ability.

The Influence of Mean Update to Generalized Performance of OSPVM.
To show the influence of the mean update on the generalization performance of OSPVM, we run OSPVM with two different settings, "with mean update" and "no mean update," on the same datasets, including Face, Secom, Arcene, Dexter, and Multi.fea. In the "with mean update" setting, the data is centered to the mean, which is dynamically adjusted as each subsequent chunk of data arrives. The variation curves of the testing accuracy with respect to the chunks of training data under these two settings are illustrated in Figure 2 (labeled "with mean update" and "no mean update," resp.). It can be seen that, on each dataset, OSPVM with mean update always obtains better generalization performance than without. Take the Face dataset as an example: on the first 40 training samples, OSPVM with mean update attains 73.5% testing accuracy while "no mean update" attains 72.3%. As the subsequent training data arrives, OSPVM with mean update remains superior. By the time the last chunk of data arrives, the testing accuracy "with mean update" reaches 94% while "no mean update" reaches 90%. From the theoretical point of view, the performance improvement is likely due to two aspects: (i) from the principal component analysis perspective, the useful features are the directions of maximum variance [33]; to capture these directions, the data should first be centered, because without centering the first obtained direction, which runs from the origin to the data centre, is shifted, and the successive directions are consequently shifted as well; (ii) from the multivariate probability distribution perspective [34], a dataset is usually treated as a multivariate Gaussian distribution, represented as the mean plus the variation along the principal vectors.
By centering the data to the mean, the mean component of the data is cancelled out, so that the purely variational component of the data is captured.
These experimental and theoretical analyses show that the mean update has an important positive influence on the generalization performance of OSPVM. With the help of the mean update, OSPVM can process dynamic data more adaptively and effectively.

Conclusion and Future Work
In this paper, an effective online sequential learning algorithm (OSPVM) has been proposed for high-dimensional and nonstationary data. The data mean, the projection vectors, and the neural network model can be updated simultaneously in a single pass over the new samples. The algorithm can handle new data arriving one-by-one or chunk-by-chunk. Apart from setting the threshold value of the accumulation ratio, no other parameter needs to be determined. The performance of OSPVM, including training time and generalization performance, is compared with several typical online learning algorithms on real-world benchmark problems. The results show that OSPVM produces better generalization performance with a more compact network structure than the other algorithms in most cases. In future work, we will study how to further improve computational efficiency to make the method suitable for large-scale data analytics, and investigate smarter methods to determine the threshold of the accumulation ratio adaptively.