A Logical Framework of the Evidence Function Approximation Associated with Relevance Vector Machine

The relevance vector machine (RVM) is a Bayesian sparse kernel method for regression in statistical learning theory that avoids the principal limitations of the support vector machine (SVM) and yields faster performance on test data while maintaining comparable generalization error. In this paper, we develop a logical framework for the evidence function approximation associated with the RVM based on Taylor expansion, instead of the traditional technique called "completing the square." When completing the square, one must find the appropriate quadratic term by ad hoc manipulation, which in practice increases the difficulty of handling the evidence function approximation associated with the RVM. The logical framework in this paper, based on Taylor expansion, shows some advantages over the conventional method of completing the square: it is easier to apply because of its clear logical structure, and it avoids the difficulty of deliberately searching for the completing-the-square term. Using the symmetry of the covariance in a multivariate Gaussian distribution and elementary algebra, we derive the approximation and maximization of the evidence function associated with the RVM, which is consistent with the previous result. Finally, we derive the EM algorithm for the RVM, which is also consistent with the previous result except that we use the precision matrix rather than the covariance.


Introduction
When fitting a polynomial curve to given training data, we encounter the problem of choosing the order of the polynomial, which leads us to an important concept called model selection. Model selection is an intractable problem in machine learning, since the complexity of the model must match the complexity of the problem being solved, and we must look for suitable approaches to address it. For example, to avoid overfitting, we may add a regularization term [1, 2] to the sum of squared errors between the prediction at each data point and the corresponding target value, or we may partition the available data into a training set, which determines the coefficients, and a separate validation set (also called a hold-out set), used to optimize the model complexity. However, the latter is too wasteful of valuable training data, so more sophisticated methods are needed. Model selection from a Bayesian perspective is an alternative technique [3]. Its merits are that the overfitting associated with maximum likelihood can be avoided by marginalizing (summing in the discrete case or integrating in the continuous case) over the model parameters instead of making point estimates, and that models can be compared directly on the training data without a validation set. This allows all available data to be used for training, avoids the many training runs per model required by cross-validation, and lets multiple complexity parameters be determined simultaneously as part of the training process.
Bayesian model comparison simply uses probabilities to represent uncertainty in the choice of model, through a consistent application of the sum and product rules of probability [4]. The core of model selection is to evaluate the evidence function, assuming that all models are given equal prior probability. The relevance vector machine (RVM) [5, 6] is a Bayesian sparse kernel method for classification and regression that shares many characteristics of the support vector machine (SVM) while avoiding its main limitations. The RVM generally leads to much sparser models than the SVM, which results in faster performance on test data and has been applied to automatic emulation with good results [7]. In [5], the evidence function approximation associated with the RVM requires some skill to compute and relies on the logical framework called "completing the square" [8]. However, completing the square makes it difficult to identify the required quadratic term, especially in complex machine learning problems. This paper develops a novel logical framework based on Taylor expansion, different from the classic technique of completing the square over the model parameters, for evaluating the evidence function associated with the RVM. The method presented here makes the evidence function of the RVM convenient to evaluate.

Lemma 2. If A and B are matrices of size N × M, then |I_N + AB^T| = |I_M + B^T A|. (1) In particular, |I_N + ab^T| = 1 + a^T b, where a and b are N-dimensional column vectors.
Lemma 5. If x and y are n-dimensional column vectors, then Tr(xy^T) = x^T y, where Tr(·) denotes the trace of a matrix. The proofs of the abovementioned four lemmas can be found in [9–11].
These symbolic powers act as operators on a function of n variables, where ∇ is the gradient operator.
For example, we consider a two-dimensional case.
We also note that the proof of Lemma 6 can be found in [12].
Consider the quadratic function f(w) = w^T A w + b^T w + c, where A is an invertible and symmetric n × n matrix, b and w = (w_1, w_2, …, w_n)^T are n-dimensional column vectors, and c is a constant. Then, the Taylor expansion of f(w) at the stationary point w_0 = −(1/2)A^{−1} b is given by f(w) = f(w_0) + (w − w_0)^T A (w − w_0), where the elements of the Hessian matrix H = 2A are given by H_ij = ∂²f/∂w_i ∂w_j. Proof. The Taylor formula (9) implies f(w_0 + h) = f(w_0) + h^T ∇f(w_0) + (1/2) h^T H h, where h = w − w_0.
□ To obtain the stationary point of f(w), we set ∇f(w) = 2Aw + b = 0, which gives the stationary point w_0 = −(1/2)A^{−1} b. From the abovementioned operations, we obtain (11). The equality (11) plays a central role in our discussion because it is clearly different from the conventional technique called "completing the square." We will build up a novel logical framework and discuss its merits, compared with the traditional method, when evaluating the evidence function of the relevance vector machine (RVM).
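As a numerical sanity check, the identity f(w) = f(w_0) + (w − w_0)^T A (w − w_0) for a quadratic f(w) = w^T A w + b^T w + c can be verified directly. A minimal sketch in Python/NumPy; the particular A, b, and c below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric, invertible A; f(w) = w^T A w + b^T w + c
n = 4
R = rng.standard_normal((n, n))
A = R @ R.T + n * np.eye(n)   # symmetric positive definite, hence invertible
b = rng.standard_normal(n)
c = 1.5

def f(w):
    return w @ A @ w + b @ w + c

# Stationary point from grad f(w) = 2 A w + b = 0
w0 = -0.5 * np.linalg.solve(A, b)

# Taylor expansion at w0: f(w) = f(w0) + (w - w0)^T A (w - w0)
w = rng.standard_normal(n)
h = w - w0
taylor = f(w0) + h @ A @ h
assert np.isclose(f(w), taylor)
```

Because f is quadratic, the expansion is exact, so the check holds for any test point w, not just points near w_0.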

Linear Basis Function Models and Bayesian Linear Regression.
The simplest linear model for regression is one formed from a linear combination of the input variables. However, a linear combination of the input variables imposes significant limitations on the model. So, we extend the model by using linear combinations of fixed nonlinear functions of the input variables, which takes the form (15), where ϕ_i(x) is called a basis function and w_0 is called a bias parameter. For convenience, we define an additional dummy "basis function" ϕ_0(x) = 1 and rewrite the model with w = (w_0, …, w_{M−1})^T and ϕ(x) = (ϕ_0(x), …, ϕ_{M−1}(x))^T [8]. The target variable t is given by a deterministic function y(x, w) with additive Gaussian noise, where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. So, we obtain (17). Now, we consider a data set of inputs X = {x_1, …, x_N} with corresponding target values t = (t_1, …, t_N). From the assumption that these data points are drawn independently from the Gaussian distribution (17), we obtain the likelihood function (18). Since (18) is the exponential of a quadratic function of w, we can define a corresponding conjugate prior (19). From (18) and (19), we obtain the posterior distribution in the form (20), with mean m_N and covariance S_N given by (21) and (22), where Φ is an N × M matrix, called the design matrix, whose elements are Φ_nj = ϕ_j(x_n). Detailed proofs of (21) and (22) are given in Appendix A.
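The posterior computation in (20)–(22) can be sketched numerically. The snippet below assumes the common zero-mean isotropic prior p(w) = N(w | 0, α⁻¹I) as a particular instance of the conjugate prior (19), with polynomial basis functions as an illustrative choice; all data and hyperparameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy sine, polynomial basis functions phi_j(x) = x**j
N, M = 30, 6
x = rng.uniform(0.0, 1.0, N)
beta = 25.0                                  # noise precision (assumed known)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, beta ** -0.5, N)

Phi = np.vander(x, M, increasing=True)       # N x M design matrix, Phi[n, j] = x_n**j

# Zero-mean isotropic Gaussian prior p(w) = N(0, alpha^{-1} I),
# one common choice for the conjugate prior (19)
alpha = 2.0
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision
S_N = np.linalg.inv(S_N_inv)                       # posterior covariance, cf. (22)
m_N = beta * S_N @ Phi.T @ t                       # posterior mean, cf. (21)

# Posterior predictive mean at a new input
x_star = 0.3
phi_star = x_star ** np.arange(M)
y_star = m_N @ phi_star
```

The design choice of solving with the precision S_N⁻¹ first mirrors the paper's formulas; in practice one would use a Cholesky solve rather than an explicit inverse.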

Bayesian Model Comparison and Evidence Approximation.
Assume that we face a set of L models M_i, where i = 1, …, L. Our task is to compare them and choose an optimal model according to some criterion from the Bayesian perspective, which avoids the overfitting phenomenon associated with maximum likelihood. The uncertainty is expressed through a prior probability distribution p(M_i). Given a training set D, we suppose that all models have equal prior probability, which is reasonable because in practice there should be no preference among the L given models. We then wish to evaluate the posterior distribution p(M_i | D) ∝ p(M_i) p(D | M_i), where p(D | M_i) is called the model evidence; it is also called the marginal likelihood, since it can be viewed as a likelihood function over the space of models in which the parameters have been marginalized out. Now, we introduce prior distributions over the hyperparameters α and β to complete a fully Bayesian treatment of the linear basis function model, and we can obtain predictions by marginalizing with respect to the hyperparameters α and β and the parameters w. However, the complete marginalization over all of these variables is analytically intractable. We therefore discuss an approximation in which we determine the hyperparameters α and β by maximizing the marginal likelihood function obtained by first integrating over the parameters w. This framework is known in statistics as empirical Bayes [13, 14], type-2 maximum likelihood [15], or generalized maximum likelihood [16], and is also called the evidence approximation in the machine learning literature [8, 17, 18].
Here, if we introduce a hyperprior over α and β, then we obtain the predictive distribution by marginalizing over the hyperparameters α and β as well as the parameter w, as follows: where p(t | x, w, β) is defined by (17) and p(w | t, α, β) is given by (20), with m_N and S_N defined by (21) and (22).
If the posterior distribution p(α, β | t) is sharply peaked around values α̂ and β̂, we can obtain the predictive distribution simply by marginalizing over w with α and β set to the values α̂ and β̂ (26). The posterior distribution for α and β is given by (27). If the prior is relatively flat, we can obtain the values α̂ and β̂ in the evidence framework by maximizing the marginal likelihood function p(t | α, β). We proceed to evaluate the marginal likelihood for the linear basis function model and find its maxima, which allows these hyperparameters to be determined from the training data set alone.
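The marginal likelihood that the evidence framework maximizes can be evaluated in two equivalent ways: directly, as a zero-mean Gaussian in t, or through the posterior quantities m_N and S_N. The sketch below (Python/NumPy, with an assumed zero-mean isotropic prior and arbitrary toy values) checks that the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 20, 5
Phi = rng.standard_normal((N, M))   # arbitrary design matrix
t = rng.standard_normal(N)          # arbitrary targets
alpha, beta = 0.5, 4.0

# Direct form: t ~ N(0, C) with C = beta^{-1} I + alpha^{-1} Phi Phi^T
C = np.eye(N) / beta + Phi @ Phi.T / alpha
logdet_C = np.linalg.slogdet(C)[1]
log_ev_direct = -0.5 * (N * np.log(2 * np.pi) + logdet_C
                        + t @ np.linalg.solve(C, t))

# Equivalent form via the posterior mean m_N and precision S_N^{-1}
S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)
E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
log_ev_post = (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
               - 0.5 * np.linalg.slogdet(S_N_inv)[1]
               - 0.5 * N * np.log(2 * np.pi))

assert np.isclose(log_ev_direct, log_ev_post)
```

The agreement is exact (up to floating point), since the two expressions are algebraically identical.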

Evaluation of the Evidence Function with RVM.
The relevance vector machine (RVM) is a Bayesian sparse kernel method for regression. The relevance vector machine for regression is a linear model of the form given by (15). The likelihood function is given by (18). The weight prior takes the form (28), where α = (α_1, …, α_M)^T and each α_i is a hyperparameter representing the precision of the corresponding parameter w_i. By making use of (18), we obtain (29). Similarly, by making use of (28), we obtain (30). We obtain the posterior distribution for the weights, which is also Gaussian and takes the form (31), where the covariance and mean are given by (32) and (33), with A = diag(α_i). Detailed proofs of (32) and (33) are given in Appendix B. We can use the evidence approximation, also known as type-2 maximum likelihood, to determine the values of the hyperparameters α and β. We first obtain the marginal likelihood function by integrating out the weight parameters (34) and then maximize it. We substitute (32) and (33) into (34) and obtain (35), where we have defined (36). For the stationary point of E(w), we first derive the gradient ∇E(w) by using Corollary 1 (37). Let (38). From (39), we obtain (40). From (11), (38), and (40), it follows that (41), together with (42). We substitute (41) into (35) and then get (43). From (32) and (33), we obtain (44). We substitute (33) into (44) and obtain (45). We expand (45) and then obtain (46). We multiply both sides of (46) by m^T and derive (47). From (42), we obtain (48), where we make use of (32) and (47). According to Lemma 1, we obtain (49). By making use of (48) and (49), we obtain (50), where we have used the result (51), which follows from Lemma 2 and (33). From (50), we obtain (52), together with C = β^{−1} I_N + Φ A^{−1} Φ^T. From (43), we can then write the log of the marginal likelihood in the form (53).
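The resulting log marginal likelihood, ln p(t | α, β) = −(1/2)[N ln 2π + ln|C| + t^T C^{−1} t] with C = β^{−1} I_N + Φ A^{−1} Φ^T, can be checked against the posterior-based form of the derivation via (32) and (33). A minimal sketch (Python/NumPy, arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 15, 4
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)
alpha = rng.uniform(0.5, 2.0, M)    # one precision alpha_i per weight
beta = 3.0
A = np.diag(alpha)

# Posterior over the weights, cf. (32)-(33)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t

# Marginal likelihood t ~ N(0, C), C = beta^{-1} I + Phi A^{-1} Phi^T
C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
logdet_C = np.linalg.slogdet(C)[1]
log_ml = -0.5 * (N * np.log(2 * np.pi) + logdet_C + t @ np.linalg.solve(C, t))

# Same quantity expressed through the posterior (the "completed square")
E_m = 0.5 * beta * np.sum((t - Phi @ m) ** 2) + 0.5 * m @ A @ m
log_ml2 = (0.5 * np.sum(np.log(alpha)) + 0.5 * N * np.log(beta) - E_m
           + 0.5 * np.linalg.slogdet(Sigma)[1] - 0.5 * N * np.log(2 * np.pi))

assert np.isclose(log_ml, log_ml2)
```

The check exercises the two identities used in the derivation: the determinant relation |C| = β^{−N} |A|^{−1} |Σ|^{−1} and the quadratic-form relation t^T C^{−1} t = 2E(m).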

Maximization of the Evidence Function with RVM.
Firstly, we consider maximizing p(t | X, α, β) with respect to α_i. By making use of (33) and Lemma 3, we obtain (54). From (54) and Lemma 4, we obtain (55), where we have used (56) and (33). From (55), we obtain d ln|Σ|/dα_i = −Σ_ii, where Σ_ii is the i-th diagonal element of the posterior covariance Σ defined by (33). From (42), we obtain ∂E(m)/∂α_i = (1/2) m_i². From (57) and (58), we easily obtain the stationary point of (53) with respect to α_i, namely, 1/(2α_i) − (1/2) m_i² − (1/2) Σ_ii = 0 (59), where m_i is the i-th component of the posterior mean m defined by (32). After multiplying both sides of (59) by 2α_i and rearranging, we have 1 − α_i m_i² − α_i Σ_ii = 0 (60), and hence the update α_i = γ_i / m_i² (61), where γ_i = 1 − α_i Σ_ii. Similarly, we can maximize the log marginal likelihood (53) with respect to β. Firstly, we obtain the derivative of ln|Σ| with respect to β using (33) and Lemma 3 (62). By making use of (33) and Lemma 4, we obtain an expression for d ln|Σ|/dβ. Since (63) holds, we rearrange and get (64). By using the definition of the matrix A together with (63) and (64), we obtain d ln|Σ|/dβ = −(1/β) Σ_i γ_i (65). From (42), we obtain ∂E(m)/∂β = (1/2) ‖t − Φm‖² (66). From (65) and (66), we obtain the stationary point of (53) with respect to β (67). By multiplying both sides of (67) by β and rearranging, we obtain 1/β = ‖t − Φm‖² / (N − Σ_i γ_i) (68). The results (61) and (68) are consistent with [5, 8]. We note that (61) and (68) are implicit solutions for α and β, since both γ_i and 1/β depend on α and β. To address this, an iterative procedure is devised: choosing initial values of α and β, we use (32) and (33) to evaluate the mean and covariance of the posterior distribution, and then alternately re-estimate the hyperparameters using (61) and (68) and re-evaluate the posterior mean and covariance using (32) and (33), until a suitable convergence criterion is satisfied.
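The alternating re-estimation just described can be sketched as follows. Python/NumPy; the toy data, initial values, and the clipping threshold on α are illustrative assumptions, the clipping standing in for the pruning of weights whose α_i diverges:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 40, 8
x = rng.uniform(-1.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)          # polynomial "basis" as a toy choice
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)

alpha = np.ones(M)   # initial hyperparameters
beta = 1.0

for _ in range(100):
    # Posterior statistics, cf. (32)-(33)
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    # Well-determinedness quantities gamma_i = 1 - alpha_i * Sigma_ii
    gamma = 1.0 - alpha * np.diag(Sigma)
    # Re-estimation, cf. (61) and (68); clip to keep alpha finite as
    # weights are effectively pruned (an implementation assumption)
    alpha = np.clip(gamma / np.maximum(m ** 2, 1e-12), 0.0, 1e10)
    beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
```

After convergence, weights whose α_i has been driven to the clipping ceiling contribute essentially nothing to predictions, which is how the RVM's sparsity manifests in practice.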

EM Algorithm for RVM
A powerful and elegant technique for finding maximum likelihood solutions in probabilistic models with latent variables is the expectation-maximization (EM) algorithm, which also forms the basis of the variational framework [19, 20]. The integration over the weights used to obtain the marginal likelihood naturally suggests an EM algorithm for its maximization with the weights as latent variables. Since it is most common to use the EM algorithm for maximization, L will now denote the (positive) log evidence. To derive the EM algorithm for the RVM, we first obtain the log marginal likelihood by marginalizing the weights from the joint distribution over t and w (69), where the joint distribution over t and w equals both the marginal likelihood times the posterior on w and the likelihood times the prior on w (70). By defining a variational probability distribution q(w) over the weights and using Jensen's inequality, we obtain a lower bound F on L (71). The EM algorithm achieves the maximization of the log marginal likelihood L by iteratively maximizing this lower bound. In the expectation step (E-step), F is maximized with respect to the variational distribution q(w) for fixed hyperparameters α and β, and in the maximization step (M-step), F is maximized with respect to the hyperparameters α and β for fixed q(w). Insight into how to perform the E-step can be gained by rewriting the lower bound F as (72), where KL(q(w) ‖ p(w | t, X, α, β)) is the Kullback-Leibler (KL) divergence between the variational distribution q(w) and the posterior distribution of the weights. The KL divergence is always nonnegative and equals zero only if the two distributions are identical. The E-step thus corresponds to setting q(w) equal to the posterior on w, which implies F = L and q(w) = p(w | t, X, α, β). Since the posterior is Gaussian, the E-step reduces to computing its mean m and covariance Σ defined by (32) and (33).
To perform the M-step, we rewrite F in a different way (73), where H(q(w)) is the entropy of q(w), which is an irrelevant constant because it does not depend on α or β, so that ∂H(q(w))/∂α_i = 0 and ∂H(q(w))/∂β = 0. Therefore, the M-step maximizes the average of the log joint distribution of t and w over q(w) with respect to the hyperparameters α and β (74). From the first term of (74), we obtain (75), where we have used ∫ q(w) dw = 1.
From the third term of (75) and Lemma 5, we obtain (76), where E_q(·) denotes the expectation of a random variable over the distribution q(w) and we have used E_q(ww^T) = mm^T + Σ, with the mean m and covariance Σ defined by (32) and (33). From (75) and (76), we obtain

E_q[ln p(w | α)] = (1/2) Σ_i ln α_i − (1/2)(m^T A m + Tr(AΣ)) − (M/2) ln 2π. (77)

We now calculate the second term of (74) and obtain (78). From the third term of (78), we obtain (79), where we have used Lemma 5 and E_q(ww^T) = mm^T + Σ.
From (78) and (79), we obtain

∫ q(w) ln p(t | X, w, β) dw = (N/2) ln β − (β/2)(‖t − Φm‖² + Tr(ΦΣΦ^T)) − (N/2) ln 2π. (80)

From (77) and (80), we obtain (81). We easily obtain the stationary points of (81) with respect to α_i and β, (82) and (83), and hence the update rules

α_i^new = 1 / (m_i² + Σ_ii), (84)

1/β^new = (‖t − Φm‖² + Tr(ΦΣΦ^T)) / N, (85)

where m_i is the i-th component of the posterior mean m defined by (32) and Σ_ii is the i-th diagonal element of the posterior covariance Σ defined by (33). By making use of (64) and (65), we can further rewrite (85) as 1/β^new = (‖t − Φm‖² + Σ_i γ_i / β) / N, where γ_i = 1 − α_i Σ_ii. From the abovementioned discussion, we obtain the update rules by using the EM algorithm to maximize the log marginal likelihood, which is also consistent with Quiñonero-Candela's result, except that we use the precision matrix as the covariance [21].
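The resulting EM iteration alternates the E-step of (32)–(33) with the M-step updates (84)–(85). A minimal sketch (Python/NumPy; the toy data and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 40, 8
x = rng.uniform(-1.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)          # polynomial "basis" as a toy choice
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)

alpha = np.ones(M)
beta = 1.0

for _ in range(200):
    # E-step: posterior mean and covariance of the weights, cf. (32)-(33)
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    # M-step, cf. (84)-(85): E_q[w_i^2] = m_i^2 + Sigma_ii
    alpha = 1.0 / (m ** 2 + np.diag(Sigma))
    residual = np.sum((t - Phi @ m) ** 2)
    beta = N / (residual + np.trace(Phi @ Sigma @ Phi.T))
```

Compared with the direct re-estimation of (61) and (68), these EM updates change the hyperparameters more conservatively per iteration, so convergence is typically slower but each step is guaranteed not to decrease the lower bound F.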

Conclusions
From the abovementioned discussion, we conclude that the novel logical framework presented in this paper, based on Taylor expansion, shows some advantages over the conventional method of completing the square: it is easier to implement because of its clear logical structure, and it avoids the difficulty of deliberately seeking the completing-the-square term. The approximation and maximization of the evidence function associated with the RVM derived with the novel logical framework are consistent with Tipping's results (61) and (68). We also obtain the update rules (84) and (85) by using the EM algorithm to maximize the log marginal likelihood, which is likewise consistent with Quiñonero-Candela's result, except that we use the precision matrix as the covariance.

A. The Proofs of (21) and (22)
Suppose a quadratic form appearing in the exponent of a normal distribution is given, and we wish to determine the corresponding mean and covariance of the distribution. The problem can be solved simply by noting that the negative exponent of a normal distribution N(x | μ, Σ) can be written as f(x) = (1/2)(x − μ)^T Σ^{−1} (x − μ). By setting ∇f(x) = Σ^{−1}(x − μ) = 0, we obtain x = μ, which shows that the stationary point of f(x) equals the mean μ of the normal distribution N(x | μ, Σ). In addition, we obtain Σ^{−1} = ∇∇f(x). Next, we use these rules to find the mean and covariance of a normal distribution.
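These rules can be sanity-checked numerically: given a quadratic exponent −(1/2)x^T M x + b^T x (the particular M and b below are arbitrary illustrative values), the mean is μ = M^{−1}b and the precision is M, so the given exponent and the recognized Gaussian exponent differ only by a constant:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
R = rng.standard_normal((n, n))
M_mat = R @ R.T + n * np.eye(n)   # symmetric positive definite precision
b = rng.standard_normal(n)

# Read off mean and covariance: Sigma^{-1} = M, mu = M^{-1} b
mu = np.linalg.solve(M_mat, b)
Sigma = np.linalg.inv(M_mat)

def f(x):
    # the given quadratic exponent (up to an additive constant)
    return -0.5 * x @ M_mat @ x + b @ x

def g(x):
    # the recognized Gaussian exponent with mean mu and precision M
    return -0.5 * (x - mu) @ M_mat @ (x - mu)

x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
# f and g differ only by a constant, so differences of values agree
assert np.isclose(f(x1) - f(x2), g(x1) - g(x2))
```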

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.