This paper studies the least squares regression problem for α-mixing and ϕ-mixing processes. The standard boundedness assumption on the output data is dropped, and the learning algorithm is analyzed with samples drawn from a dependent sampling process under a more general condition on the output data. Capacity-independent error bounds and learning rates are derived by means of the integral operator technique.

1. Introduction and Main Results

The aim of this paper is to study the least squares regularized regression learning algorithm. The main novelty here is the unboundedness and dependence of the sampling process. Let $X$ be a compact metric space (usually a subset of $\mathbb{R}^n$) and $Y=\mathbb{R}$. Suppose that $\rho$ is a probability distribution defined on $Z=X\times Y$. In regression learning, one wants to learn or approximate the regression function $f_\rho:X\to Y$ given by
(1) f_\rho(x)=\mathbb{E}(y\mid x)=\int_Y y\,d\rho(y\mid x),\quad x\in X,
where $\rho(y\mid x)$ is the conditional distribution of $y$ given $x$. The function $f_\rho$ is not directly computable because $\rho$ is unknown. Instead, we learn a good approximation of $f_\rho$ from a set of observations $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^m\in Z^m$ drawn according to $\rho$.

The learning algorithm studied here is based on a Mercer kernel $K:X\times X\to\mathbb{R}$, which is a continuous, symmetric, and positive semidefinite function. The RKHS $\mathcal{H}_K$ associated with the Mercer kernel $K$ is the completion of $\mathrm{span}\{K_x=K(\cdot,x):x\in X\}$ with the inner product satisfying $\langle K(x,\cdot),K(x',\cdot)\rangle_K=K(x,x')$. The learning algorithm is a regularization scheme in $\mathcal{H}_K$ given by
(2) f_{\mathbf{z},\lambda}=\arg\min_{f\in\mathcal{H}_K}\Big\{\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2+\lambda\|f\|_K^2\Big\},
where $\lambda>0$ is a regularization parameter.
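As a minimal computational sketch of scheme (2): by the representer theorem the minimizer has the form $\sum_i c_iK_{x_i}$, and its coefficients solve the linear system $(G+m\lambda I)c=y$ with Gram matrix $G=(K(x_i,x_j))_{i,j}$. The Gaussian kernel, the synthetic data with heavy-tailed noise, and all function names below are assumptions made for illustration only; they are not part of the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian Mercer kernel K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) (illustrative choice)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_regularized_ls(X, y, lam, kernel=gaussian_kernel):
    """Coefficients c of f = sum_i c_i K(., x_i), obtained from (G + m*lam*I) c = y."""
    m = X.shape[0]
    G = kernel(X, X)                      # Gram matrix K(x_i, x_j)
    return np.linalg.solve(G + m * lam * np.eye(m), y)

def predict(X_train, c, X_new, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i c_i K(x, x_i) at the rows of X_new."""
    return kernel(X_new, X_train) @ c

def objective(X, y, lam, c, kernel=gaussian_kernel):
    """Value of the functional in (2) at f = sum_i c_i K(., x_i)."""
    G = kernel(X, X)
    return np.mean((G @ c - y) ** 2) + lam * c @ G @ c

# Hypothetical data: smooth target plus Student-t noise (unbounded outputs).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_t(df=5, size=40)
c = fit_regularized_ls(X, y, lam=1e-2)
```

Calling `predict(X, c, X_new)` evaluates the fitted function at new inputs; `objective` can be used to confirm that perturbing `c` does not decrease the regularized empirical risk.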

Error analysis for learning algorithm (2) has been studied extensively in the literature [1–4], mostly for independent samples. In recent years, several studies have relaxed the independence restriction and turned to learning with dependent sampling [5–8]. In [8], the learning performance of regularized least squares regression was studied for mixing sequences, and the result for this setting was refined by an operator monotone inequality in [7].

For a stationary real-valued sequence $\{z_i\}_{i\ge1}$, the σ-algebra generated by the random variables $z_a,z_{a+1},\dots,z_b$ is denoted by $\mathcal{M}_a^b$. The uniformly mixing condition (or ϕ-mixing condition) and the strongly mixing condition (or α-mixing condition) are defined as follows.

The $l$th ϕ-mixing coefficient for the sequence is defined as
(3) \phi_l=\sup_{k\ge1}\ \sup_{A\in\mathcal{M}_1^k,\ B\in\mathcal{M}_{k+l}^{\infty}}|P(A\mid B)-P(A)|.
The process $\{z_i\}_{i\ge1}$ is said to satisfy a uniformly mixing condition (or ϕ-mixing condition) if $\phi_l\to0$ as $l\to\infty$.

The $l$th α-mixing coefficient for the random sequence $\{z_i\}_{i\ge1}$ is defined as
(4) \alpha_l=\sup_{k\ge1}\ \sup_{A\in\mathcal{M}_1^k,\ B\in\mathcal{M}_{k+l}^{\infty}}|P(A\cap B)-P(A)P(B)|.
The random process $\{z_i\}_{i\ge1}$ is said to satisfy a strongly mixing condition (or α-mixing condition) if $\alpha_l\to0$ as $l\to\infty$.

By the identity $P(A\cap B)=P(A\mid B)P(B)$, the α-mixing condition is weaker than the ϕ-mixing condition. Many random processes satisfy the strongly mixing condition, for example, the stationary Markov process which is uniformly pure nondeterministic, the stationary Gaussian sequence with a continuous spectral density that is bounded away from 0, certain ARMA processes, and some aperiodic, Harris-recurrent Markov processes; see [5, 9] and the references therein.
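Spelled out, the comparison between the two coefficients is a one-line computation, valid for any admissible pair of events $A$, $B$ with $P(B)>0$ (the case $P(B)=0$ contributes nothing to the α-coefficient):

```latex
|P(A\cap B)-P(A)P(B)| \;=\; P(B)\,|P(A\mid B)-P(A)| \;\le\; |P(A\mid B)-P(A)| \;\le\; \phi_l ,
\qquad\text{hence}\qquad \alpha_l\le\phi_l .
```

In particular, every ϕ-mixing sequence is α-mixing, with coefficients decaying at least as fast.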

In this paper we follow [7, 8] to consider α-mixing and ϕ-mixing processes, estimate the error bounds, and derive the learning rates of algorithm (2), where the output data satisfy the following unbounded condition.

Unbounded Hypothesis. There exist two constants $M>0$ and $p\ge2$ such that
(5) \mathbb{E}|y|^p\le M.

The error analysis for algorithm (2) was usually presented under the standard assumption that $|y|\le M$ almost surely for some constant $M>0$. This standard assumption was abandoned in [10–14]. In [10] the authors introduced the condition
(6) \int_Y\Big(\exp\Big\{\frac{|y-f_{\mathcal{H}}(x)|}{M}\Big\}-\frac{|y-f_{\mathcal{H}}(x)|}{M}-1\Big)d\rho(y\mid x)\le\frac{\Sigma^2}{2M^2}
for almost every $x\in X$ and some constants $M,\Sigma>0$, where $f_{\mathcal{H}}$ is the orthogonal projection of $f_\rho$ onto the closure of $\mathcal{H}_K$ in $L^2_{\rho_X}(X)$. In [11–13] the error analysis was conducted under the following moment hypothesis: there exist constants $\tilde{M}>0$ and $\hat{C}>0$ such that $\int_Y|y|^l\,d\rho(y\mid x)\le\hat{C}\,l!\,\tilde{M}^l$ for all $l\in\mathbb{N}$, $x\in X$. Notice that, with different constants, the moment hypothesis and (6) are equivalent in the case $f_{\mathcal{H}}\in L^\infty(X)$ [13]. Our unbounded hypothesis is a natural generalization of the moment hypothesis. An example for which the unbounded hypothesis (5) is satisfied but the moment hypothesis fails is given in [15], which mainly studies half supervised coefficient regularization with indefinite kernels and unbounded sampling under the condition $\int_Z y^2\,d\rho\le\hat{M}^2$ for some constant $\hat{M}>0$.
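To see concretely how (5) can hold while the moment hypothesis fails, consider the following illustration. It is not the example of [15]; the density is an assumption chosen here for convenience. Let the conditional distribution of $y$ be independent of $x$ with a polynomially decaying density, where $c_p$ is the normalizing constant:

```latex
d\rho(y\mid x)=c_p\,(1+|y|)^{-(p+2)}\,dy,\qquad
\int_Y|y|^{l}\,d\rho(y\mid x)=2c_p\int_0^{\infty}\frac{y^{l}}{(1+y)^{p+2}}\,dy
\;\begin{cases}<\infty, & l<p+1,\\[2pt] =\infty, & l\ge p+1.\end{cases}
```

Thus $\mathbb{E}|y|^p<\infty$, so (5) holds for a suitable $M$, while the moments of integer order $l\ge p+1$ are infinite and no bound of the form $\hat{C}\,l!\,\tilde{M}^l$ can hold for all $l\in\mathbb{N}$.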

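Since $\mathcal{E}(f_\rho)=\min_f\mathcal{E}(f)$, where the generalization error is $\mathcal{E}(f)=\int_Z(f(x)-y)^2\,d\rho$, the quality of the approximation of $f_\rho$ by $f_{\mathbf{z},\lambda}$ is usually measured by the excess generalization error $\mathcal{E}(f_{\mathbf{z},\lambda})-\mathcal{E}(f_\rho)=\|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}^2$. Denoting
(7) \kappa:=\sup_{x\in X}\sqrt{K(x,x)}<\infty,
the reproducing property in the RKHS $\mathcal{H}_K$ yields $\|f\|_\infty\le\kappa\|f\|_K$ for any $f\in\mathcal{H}_K$. Thus, when $f_\rho\in\mathcal{H}_K$, the distance between $f_{\mathbf{z},\lambda}$ and $f_\rho$ in $\mathcal{H}_K$ can also be used to measure this approximation.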

The noise-free limit of algorithm (2) takes the form
(8) f_\lambda:=\arg\min_{f\in\mathcal{H}_K}\big\{\|f-f_\rho\|_{\rho_X}^2+\lambda\|f\|_K^2\big\},
so the error analysis can be divided into two parts. The difference between $f_{\mathbf{z},\lambda}$ and $f_\lambda$ is called the sample error, and the distance between $f_\lambda$ and $f_\rho$ is called the approximation error. We bound the error in $L^2_{\rho_X}(X)$ and in $\mathcal{H}_K$, respectively. Estimating the sample error is more difficult because $f_{\mathbf{z},\lambda}$ changes with the sample $\mathbf{z}$ and cannot be treated as a fixed function. The approximation error does not depend on the samples and has been studied in the literature [2, 3, 7, 16, 17].

We devote the next two sections to estimating the sample error for more general sampling processes. Our main results can be stated as follows.

Theorem 3.

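Suppose that the unbounded hypothesis holds, $L_K^{-r}f_\rho\in L^2_{\rho_X}(X)$ for some $r>0$, and the ϕ-mixing coefficients satisfy a polynomial decay, that is, $\phi_i\le a\,i^{-t}$ for some $a>0$ and $t>0$. Then, for any $0<\eta<1$, one has with confidence $1-\eta$,
(9) \|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}=O\big(m^{-\theta\min\{t/2,\,1\}}(\log m)^{3/4}\big),
where $\theta$ is given by
(10) \theta=\begin{cases}\dfrac{3r}{4(r+1)}, & 0<r<\tfrac12,\\[4pt] \dfrac{r}{2r+1}, & \tfrac12\le r<1,\\[4pt] \dfrac13, & r\ge1.\end{cases}

Moreover, when $r>1/2$, one has with confidence $1-\eta$,
(11) \|f_{\mathbf{z},\lambda}-f_\rho\|_K=O\big(m^{-\theta'\min\{t/2,\,1\}}(\log m)^{1/2}\big),
where $\theta'$ is given by
(12) \theta'=\begin{cases}\dfrac{2r-1}{2(2r+1)}, & \tfrac12<r<\tfrac32,\\[4pt] \dfrac14, & r\ge\tfrac32.\end{cases}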

Theorem 3 proves the asymptotic convergence of algorithm (2) with the samples satisfying a uniformly mixing condition. Our second main result considers this algorithm with α-mixing process.

Theorem 4.

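Suppose that the unbounded hypothesis with $p>2$ holds, $L_K^{-r}f_\rho\in L^2_{\rho_X}(X)$ for some $r>0$, and the α-mixing coefficients satisfy a polynomial decay, that is, $\alpha_l\le b\,l^{-t}$ for some $b>0$ and $t>0$. Then, for any $0<\eta<1$, one has with confidence $1-\eta$,
(13) \|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}=O\big(m^{-\vartheta\min\{(p-2)t/p,\,1\}}(\log m)^{1/2}\big),
where $\vartheta$ is given by
(14) \vartheta=\begin{cases}\dfrac{pr}{2(2r+p-1)}, & 0<r<\tfrac12,\ 0<t<\tfrac{p}{p-2},\\[4pt] \dfrac{3pr}{2(4r+3p-2)}, & 0<r<\tfrac12,\ t\ge\tfrac{p}{p-2},\\[4pt] \dfrac{r}{2r+1}, & \tfrac12\le r<1,\\[4pt] \dfrac13, & r\ge1.\end{cases}

Moreover, when $r>1/2$, with confidence $1-\eta$,
(15) \|f_{\mathbf{z},\lambda}-f_\rho\|_K=O\big(m^{-\vartheta'\min\{(p-2)t/p,\,1\}}(\log m)^{1/2}\big),
where $\vartheta'$ is given by
(16) \vartheta'=\begin{cases}\dfrac{2r-1}{4r+2}, & \tfrac12<r<\tfrac32,\\[4pt] \dfrac14, & r\ge\tfrac32.\end{cases}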

The proofs of these two theorems are given in Sections 2, 3, and 4. Notice that the log term can be dropped when $t\ne2$. Our error analysis reveals some interesting phenomena for learning with unbounded and dependent sampling.

Smoother target function fρ (i.e., r becomes larger) implies better learning rates. Stronger dependence between samples (i.e., t becomes smaller) implies that they contain less information and hence lead to worse rates.

The learning rates improve as the dependence between samples becomes weaker and as $r$ becomes larger, but they no longer improve beyond certain values of $t$ and $r$. This phenomenon is called the saturation effect, which was discussed in [18–20]. In our setting, saturation effects include saturation for the smoothness of $f_\rho$, mainly related to the approximation error, and saturation for the dependence between samples. An interesting phenomenon revealed here is that when the α-mixing coefficients satisfy $\alpha_l=O(l^{-t})$, $l\in\mathbb{N}$, for some $t>0$, the saturation for dependence between samples occurs at $t=p/(p-2)$ for $p>2$, which depends on the parameter $p$ of the unbounded hypothesis.

For ϕ-mixing processes, the learning rates do not depend on the parameter $p$ of the unbounded hypothesis, since $\mathbb{E}(y-f_\lambda(x))^2$ is bounded in terms of $\mathbb{E}y^2<\infty$. For α-mixing processes, however, deriving the learning rate requires an estimate of $\mathbb{E}|y-f_\lambda(x)|^p$ with $p>2$.

Under the α-mixing condition, when $t>p/(p-2)$ and $r\ge1/2$, the influence of the unbounded condition becomes weak. Recall that the learning rate derived in [8] is $O(m^{-r/(1+2r)})$ for $1/2\le r\le1$, $t\ge1$. This implies that when $t$ is large enough, our learning rate for unbounded samples is as sharp as that for uniformly bounded sampling.

2. Sample Error under the ϕ-Mixing Condition

In this section, we apply the integral operator technique of [7] to handle the sample error under the ϕ-mixing condition. Unlike in the uniformly bounded case, the learning performance under unbounded sampling is not measured directly. Instead, the expectations are estimated first, and the bound for the sample error then follows from the Markov inequality.
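For reference, the passage from an expectation bound to a confidence bound is the following standard step, stated here for the sample error in either norm and any $0<\eta<1$:

```latex
P\Big(\|f_{\mathbf{z},\lambda}-f_\lambda\|\ \ge\ \eta^{-1}\,\mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|\Big)\ \le\ \eta,
\qquad\text{so with confidence } 1-\eta:\qquad
\|f_{\mathbf{z},\lambda}-f_\lambda\|\ \le\ \eta^{-1}\,\mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\| .
```

This is why the confidence level enters the final bounds through the factor $\eta^{-1}$ rather than through a logarithmic term.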

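To this end, define the sampling operator $S_x:\mathcal{H}_K\to l^2(\mathbf{x})$ by $S_x(f)=(f(x_i))_{i=1}^m$, where $\mathbf{x}$ is the set of input data $\{x_1,\dots,x_m\}$. Its adjoint is $S_x^Tc=\sum_{i=1}^mc_iK_{x_i}$ for $c\in l^2(\mathbf{x})$. The analytic expressions of the minimizers $f_{\mathbf{z},\lambda}$ and $f_\lambda$ were given in [3]:
(17) f_{\mathbf{z},\lambda}=\Big(\frac{1}{m}S_x^TS_x+\lambda I\Big)^{-1}\frac{1}{m}S_x^T\mathbf{y},\qquad f_\lambda=(L_K+\lambda I)^{-1}L_Kf_\rho,
where $L_K:L^2_{\rho_X}(X)\to L^2_{\rho_X}(X)$ is the integral operator defined by
(18) L_Kf(x)=\int_XK(x,t)f(t)\,d\rho_X(t),\quad x\in X.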
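The operator formula in (17) can be checked numerically in a finite-dimensional special case. The sketch below assumes the linear kernel $K(x,x')=\langle x,x'\rangle$ on a bounded subset of $\mathbb{R}^d$, for which $\mathcal{H}_K$ can be identified with $\mathbb{R}^d$, $S_x$ with the design matrix, and $S_x^T$ with its transpose. The data and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, lam = 30, 4, 0.1
X = rng.uniform(-1.0, 1.0, size=(m, d))   # rows x_i; X acts as S_x for the linear kernel
y = rng.normal(size=m)                    # hypothetical outputs

# Operator form (17): w = (S^T S / m + lam I)^{-1} S^T y / m
w_operator = np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

# Representer form of (2): f = sum_i c_i K_{x_i} with (G + m lam I) c = y, so w = S^T c
c = np.linalg.solve(X @ X.T + m * lam * np.eye(m), y)
w_representer = X.T @ c

print(np.max(np.abs(w_operator - w_representer)))  # difference at rounding-error level
```

Both routes give the same function because $(S^TS+m\lambda I)^{-1}S^T=S^T(SS^T+m\lambda I)^{-1}$; the operator form is the one that compares directly with $f_\lambda=(L_K+\lambda I)^{-1}L_Kf_\rho$.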

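For a random variable $\xi$ with values in a Hilbert space $\mathcal{H}$ and $1\le u\le+\infty$, denote the $u$th moment by $\|\xi\|_u=(\mathbb{E}\|\xi\|_{\mathcal{H}}^u)^{1/u}$ if $1\le u<\infty$ and $\|\xi\|_\infty=\sup\|\xi\|_{\mathcal{H}}$. Lemma 5 is due to Billingsley [21].

Lemma 5.

Let $\xi$ and $\eta$ be random variables with values in a separable Hilbert space $\mathcal{H}$, measurable with respect to the σ-fields $\mathcal{J}$ and $\mathcal{D}$, respectively, and having finite $p$th and $q$th moments, respectively, where $p,q\ge1$ with $p^{-1}+q^{-1}=1$. Then
(19) |\mathbb{E}\langle\xi,\eta\rangle-\langle\mathbb{E}\xi,\mathbb{E}\eta\rangle|\le2\phi^{1/p}(\mathcal{J},\mathcal{D})\,\|\xi\|_p\|\eta\|_q.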

Lemma 6.

For a ϕ-mixing sequence $\{x_i\}$, one has
(20) \mathbb{E}\Big\|L_K-\frac{1}{m}S_x^TS_x\Big\|^2\le\frac{\kappa^4}{m}\Big(1+4\sum_{i=1}^{m-1}\phi_i^{1/2}\Big).

Proof.

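By the definition of the sampling operator, we have
(21) L_K-\frac{1}{m}S_x^TS_x=L_K-\frac{1}{m}\sum_{i=1}^{m}K_{x_i}\otimes K_{x_i}.
Let $\eta(x)=K_x\otimes K_x$; then $\eta(x)$ is an $HS(\mathcal{H}_K)$-valued random variable defined on $X$. Note that $\mathbb{E}\eta(x)=L_K\in HS(\mathcal{H}_K)$, $\|L_K\|_{HS}\le\kappa^2$, and $\|\eta(x)\|_{HS}\le\kappa^2$. We have
(22) \mathbb{E}\Big\|L_K-\frac{1}{m}S_x^TS_x\Big\|^2\le\mathbb{E}\Big\|\mathbb{E}\eta-\frac{1}{m}\sum_{i=1}^{m}\eta(x_i)\Big\|_{HS}^2=\frac{1}{m}\|\eta\|_2^2+\frac{1}{m^2}\sum_{i\ne j}\mathbb{E}\langle\eta(x_i),\eta(x_j)\rangle_{HS}-\|L_K\|_{HS}^2.

By Lemma 5 with $p=q=2$, for $i\ne j$,
(23) \mathbb{E}\langle\eta(x_i),\eta(x_j)\rangle_{HS}\le\langle\mathbb{E}\eta(x_i),\mathbb{E}\eta(x_j)\rangle_{HS}+2\phi_{|i-j|}^{1/2}\|\eta\|_2^2\le\|L_K\|_{HS}^2+2\kappa^4\phi_{|i-j|}^{1/2}.
Plugging (23) into (22) and using $\|\eta\|_2^2\le\kappa^4$ together with $\sum_{i\ne j}\phi_{|i-j|}^{1/2}=2\sum_{l=1}^{m-1}(m-l)\phi_l^{1/2}\le2m\sum_{l=1}^{m-1}\phi_l^{1/2}$ yields the desired estimate.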
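A quick Monte Carlo sanity check of Lemma 6 is possible in the simplest case of independent samples ($\phi_i=0$), where the bound reduces to $\kappa^4/m$. The sketch below assumes the linear kernel on the unit sphere of $\mathbb{R}^d$ with the uniform distribution, so that $\kappa=1$, $L_K=I/d$, and the Hilbert-Schmidt norm is the Frobenius norm. These modelling choices and the function names are assumptions for illustration only.

```python
import numpy as np

def sample_sphere(rng, m, d):
    """m i.i.d. points uniformly distributed on the unit sphere of R^d."""
    g = rng.normal(size=(m, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def mean_sq_hs_deviation(m, d, trials, seed=0):
    """Monte Carlo estimate of E || L_K - S^T S / m ||_HS^2 for K(x, x') = <x, x'>."""
    rng = np.random.default_rng(seed)
    L_K = np.eye(d) / d                       # E[x x^T] for the uniform sphere
    total = 0.0
    for _ in range(trials):
        X = sample_sphere(rng, m, d)
        total += np.linalg.norm(L_K - X.T @ X / m, "fro") ** 2
    return total / trials

m, d = 50, 3
estimate = mean_sq_hs_deviation(m, d, trials=2000)
bound = 1.0 / m                               # kappa^4 / m with kappa = 1 and phi_i = 0
print(estimate, bound)
```

In this independent case the expectation equals $(1-1/d)/m$ exactly, so the estimate should sit below the bound $1/m$; for dependent samples the extra factor $1+4\sum_i\phi_i^{1/2}$ accounts for the cross terms.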

Proposition 7.

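Suppose that the unbounded hypothesis holds with some $p\ge2$, that the sample sequence $\{(x_i,y_i)\}_{i=1}^m$ satisfies a ϕ-mixing condition, and that $L_K^{-r}f_\rho\in L^2_{\rho_X}(X)$ with $r>0$. Then one has
(24) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_{\rho_X}\le C\Big(\lambda^{-1/2}m^{-1/2}+\lambda^{-1}m^{-3/4}\Big(1+4\sum_{i=1}^{m-1}\phi_i^{1/2}\Big)^{1/4}\Big)\sqrt{1+4\sum_{i=1}^{m-1}\phi_i^{1/2}},
where $C$ is a constant depending only on $\kappa$ and $M$.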

Proof.

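By [7, Theorem 3.1], we have
(25) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_{\rho_X}\le\Big(\lambda^{-1/2}+\lambda^{-1}\Big(\mathbb{E}\Big\|L_K-\frac{1}{m}S_x^TS_x\Big\|^2\Big)^{1/4}\Big)\times\Big(\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\Big)^{1/2},
where $\xi(z)=(y-f_\lambda(x))K_x$ is a random variable with values in $\mathcal{H}_K$ and $\mathbb{E}\xi=L_K(f_\rho-f_\lambda)$. A computation similar to the proof of Lemma 6 leads to
(26) \mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\le\frac{1}{m}\Big(1+4\sum_{i=1}^{m-1}\phi_i^{1/2}\Big)\|\xi\|_2^2.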

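It suffices to estimate $\|\xi\|_2$. By the Hölder inequality,
(27) \mathbb{E}y^2\le(\mathbb{E}|y|^p)^{2/p}\le M^{2/p},\qquad \|f_\rho\|_{\rho_X}^2=\int_Xf_\rho^2(x)\,d\rho_X=\int_X\Big(\int_Yy\,d\rho(y\mid x)\Big)^2d\rho_X\le\int_Zy^2\,d\rho\le M^{2/p}.
Thus $\mathbb{E}(y-f_\rho(x))^2=\mathbb{E}y^2-\|f_\rho\|_{\rho_X}^2\le M^{2/p}$ and
(28) \mathbb{E}(f_\rho(x)-f_\lambda(x))^2=\|\lambda(\lambda I+L_K)^{-1}f_\rho\|_{\rho_X}^2\le\|f_\rho\|_{\rho_X}^2\le M^{2/p},
which implies
(29) \|\xi\|_2^2=\mathbb{E}\big((y-f_\lambda(x))^2K(x,x)\big)\le\kappa^2\,\mathbb{E}(y-f_\lambda(x))^2=\kappa^2\big(\mathbb{E}(y-f_\rho(x))^2+\mathbb{E}(f_\rho(x)-f_\lambda(x))^2\big)\le2\kappa^2M^{2/p}.

Plugging (29) into (26), we obtain
(30) \mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\le2M^{2/p}\kappa^2m^{-1}\Big(1+4\sum_{i=1}^{m-1}\phi_i^{1/2}\Big).

Combining (25), (20), and (30) and taking the constant $C=2\kappa(\kappa+1)M^{1/p}$, we complete the proof.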

The following proposition provides the bound of the difference between fz,λ and fλ in HK with ϕ-mixing process.

Proposition 8.

Under the assumption of Proposition 7, there holds
(31) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_K\le2M^{1/p}\kappa\,\lambda^{-1}m^{-1/2}\sqrt{1+4\sum_{i=1}^{m-1}\phi_i^{1/2}}.

Proof.

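The representations of $f_{\mathbf{z},\lambda}$ and $f_\lambda$ imply that
(32) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_K=\mathbb{E}\Big\|\Big(\frac{1}{m}S_x^TS_x+\lambda I\Big)^{-1}\Big(\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big)\Big\|_K\le\lambda^{-1}\Big(\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\Big)^{1/2}.
Then the desired bound follows from (30) and (32).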
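The inequality in (32) rests on two elementary facts, written out below. The symbol $\Delta$ is shorthand introduced here for readability and does not appear in the original argument.

```latex
\Big\|\Big(\tfrac{1}{m}S_x^{T}S_x+\lambda I\Big)^{-1}\Big\|\le\lambda^{-1}
\quad\big(\tfrac{1}{m}S_x^{T}S_x\ \text{is positive semidefinite}\big),
\qquad
\mathbb{E}\|\Delta\|_K\le\big(\mathbb{E}\|\Delta\|_K^{2}\big)^{1/2}
\le\sqrt{2}\,M^{1/p}\kappa\,m^{-1/2}\sqrt{1+4\sum_{i=1}^{m-1}\phi_i^{1/2}},
\qquad
\Delta:=\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda).
```

The first is a spectral bound, the second is the Cauchy-Schwarz (Jensen) inequality followed by (30). Multiplying the two gives the bound of Proposition 8, since $\sqrt{2}\le2$.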

3. Sample Error under the α-Mixing Condition

Now we turn to bounding the sample error when the sampling process satisfies the strongly mixing condition and the unbounded hypothesis holds. In Section 2, the key point was to estimate $\|\xi\|_2$ without uniform boundedness. For sampling satisfying the α-mixing condition, we have to deal with $\|\xi\|_p$ for some $p>2$.

Proposition 9.

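Suppose that the unbounded hypothesis holds with some $p>2$, that the sample sequence $\{(x_i,y_i)\}_{i=1}^m$ satisfies an α-mixing condition, and that $L_K^{-r}f_\rho\in L^2_{\rho_X}(X)$ with $r>0$. Then one gets
(33) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_{\rho_X}\le\tilde{C}\,\lambda^{\min\{(p-2)(2r-1)/2p,\,0\}}\sqrt{1+\sum_{l=1}^{m-1}\alpha_l^{(p-2)/p}}\times\Big(\lambda^{-1/2}m^{-1/2}+\lambda^{-1}m^{-3/4}\Big(1+\sum_{l=1}^{m-1}\alpha_l\Big)^{1/4}\Big),
where $\tilde{C}$ is a constant depending only on $\kappa$, $M$, and $\|L_K^{-\min\{r,1/2\}}f_\rho\|_{\rho_X}$.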

Proof.

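For the strongly mixing process, by [8, Lemma 5.1],
(34) \mathbb{E}\Big\|L_K-\frac{1}{m}S_x^TS_x\Big\|^2\le\frac{\kappa^4}{m}\Big(1+30\sum_{l=1}^{m-1}\alpha_l\Big).
Taking $\delta=p-2$ in [8, Lemma 4.2], we have
(35) \mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\le\frac{1}{m}\|\xi\|_2^2+\frac{30}{m}\sum_{l=1}^{m-1}\alpha_l^{(p-2)/p}\|\xi\|_p^2.

The estimate of $\|\xi\|_2$ was obtained in Section 2, so we now concentrate on estimating $\|\xi\|_p$. For this we need the following bound on $f_\lambda$ ([3, Lemma 3] or [8, Lemma 4.3]):
(36) |f_\lambda(x)|\le\kappa\|f_\lambda\|_K\le C_1\kappa\,\lambda^{\min\{(2r-1)/2,\,0\}},
where $C_1=\|L_K^{-\min\{r,1/2\}}f_\rho\|_{\rho_X}$. Observe that $\|f_\lambda\|_{\rho_X}^2\le\|f_\rho\|_{\rho_X}^2\le\mathbb{E}y^2\le M^{2/p}$. Hence,
(37) (\mathbb{E}|f_\lambda(x)|^p)^{2/p}\le\big(\|f_\lambda\|_{\rho_X}^2C_1^{p-2}\kappa^{p-2}\lambda^{(p-2)\min\{(2r-1)/2,\,0\}}\big)^{2/p}\le M^{4/p^2}(C_1^2\kappa^2+1)\,\lambda^{\min\{(p-2)(2r-1)/p,\,0\}}.
Now we can deduce that, for $0<\lambda\le1$,
(38) \|\xi\|_p^2=\big(\mathbb{E}\big[(y-f_\lambda(x))^2K(x,x)\big]^{p/2}\big)^{2/p}\le\kappa^2(\mathbb{E}|y-f_\lambda(x)|^p)^{2/p}\le4\kappa^2\big(\mathbb{E}\max\{|y|^p,|f_\lambda(x)|^p\}\big)^{2/p}\le4\kappa^2\big((\mathbb{E}|y|^p)^{2/p}+(\mathbb{E}|f_\lambda(x)|^p)^{2/p}\big)\le4\kappa^2\big(M^{2/p}+M^{4/p^2}(C_1^2\kappa^2+1)\big)\lambda^{\min\{(p-2)(2r-1)/p,\,0\}}.
Plugging this estimate into (35) yields
(39) \mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^{m}\xi(z_i)-L_K(f_\rho-f_\lambda)\Big\|_K^2\le C_2\,m^{-1}\lambda^{\min\{(p-2)(2r-1)/p,\,0\}}\Big(1+\sum_{l=1}^{m-1}\alpha_l^{(p-2)/p}\Big),
where $C_2$ is a constant depending only on $\kappa$, $M$, and $\|L_K^{-\min\{r,1/2\}}f_\rho\|_{\rho_X}$. Combining (34) and (39) with (25), we complete the proof.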

For α-mixing process we have the following proposition to get the bound of sample error in HK, and the proof can be directly obtained by the inequality (32).

Proposition 10.

Under the assumption of Proposition 9, one has
(40) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_K\le C_3\,m^{-1/2}\lambda^{-1}\sqrt{1+\sum_{l=1}^{m-1}\alpha_l^{(p-2)/p}},
where $C_3=\sqrt{C_2}$.

4. Error Bounds and Learning Rates

In this section we derive the learning rates, that is, the convergence rates of $\|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}$ and $\|f_{\mathbf{z},\lambda}-f_\rho\|_K$ as $m\to\infty$, by choosing the regularization parameter $\lambda$ according to $m$. The following approximation error bound is needed to get the convergence rates.

Proposition 11.

Suppose that $L_K^{-r}f_\rho\in L^2_{\rho_X}(X)$ for some $r>0$. Then
(41) \|f_\lambda-f_\rho\|_{\rho_X}\le\lambda^{\min\{r,\,1\}}\|L_K^{-\min\{r,\,1\}}f_\rho\|_{\rho_X}.
Moreover, when $r\ge1/2$, that is, $f_\rho\in\mathcal{H}_K$, there holds
(42) \|f_\lambda-f_\rho\|_K\le\lambda^{\min\{r-1/2,\,1\}}\|L_K^{-\min\{r,\,3/2\}}f_\rho\|_{\rho_X}.

The first conclusion in Proposition 11 has been proved in [20], and the second one can be proved in the same way. To derive the learning rates, we need to balance the approximation error and the sample error. For this purpose, the following elementary bounds, obtained by comparing the sum with $\int x^{-s}\,dx$, are needed:
(43) \sum_{l=1}^{m-1}l^{-s}\le\begin{cases}\dfrac{1}{1-s}\,m^{1-s}, & 0<s<1,\\[4pt] 1+\log m, & s=1,\\[4pt] \dfrac{s}{s-1}, & s>1.\end{cases}
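The three regimes for the partial sums $\sum_{l=1}^{m-1}l^{-s}$ (polynomial growth, logarithmic growth, boundedness) can be checked numerically. The sketch below compares the partial sums with the integral-comparison bounds $m^{1-s}/(1-s)$, $1+\log m$, and $s/(s-1)$; the function names are illustrative.

```python
import math

def partial_sum(s, m):
    """sum_{l=1}^{m-1} l^{-s}."""
    return sum(l ** (-s) for l in range(1, m))

def integral_bound(s, m):
    """Upper bound obtained by comparing the sum with the integral of x^{-s}."""
    if s < 1:
        return m ** (1 - s) / (1 - s)
    if s == 1:
        return 1 + math.log(m)
    return s / (s - 1)

for s in (0.5, 1, 2):
    for m in (10, 1000):
        print(s, m, partial_sum(s, m), integral_bound(s, m))
```

Only the growth order in $m$ matters for the learning rates: the case $s<1$ produces the polynomial loss in the rates, $s=1$ produces the logarithmic factor, and $s>1$ produces none.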

Proof of Theorem 3.

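The estimate of the learning rates in the $L^2_{\rho_X}(X)$ norm is divided into two cases.

Case 1. For $0<t<2$, by (43) and $\phi_i\le a\,i^{-t}$,
(44) 1+4\sum_{i=1}^{m-1}\phi_i^{1/2}\le1+4\sqrt{a}\sum_{i=1}^{m-1}i^{-t/2}\le\Big(1+\frac{8\sqrt{a}}{2-t}\Big)m^{1-t/2}.
Thus Proposition 7 yields that
(45) \mathbb{E}\|f_{\mathbf{z},\lambda}-f_\lambda\|_{\rho_X}\le2C\Big(1+\frac{16\sqrt{a}}{2-t}\Big)\big(\lambda^{-1/2}m^{-t/4}+\lambda^{-1}m^{-3t/8}\big).
By Proposition 11 and the Markov inequality, with confidence $1-\eta$, there holds
(46) \|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}=O\big(\lambda^{\min\{r,\,1\}}+\eta^{-1}(\lambda^{-1/2}m^{-t/4}+\lambda^{-1}m^{-3t/8})\big).

For $0<r<1/2$, taking $\lambda=m^{-3t/(8(r+1))}$, we deduce the learning rate $O(m^{-3tr/(8(r+1))})$. When $1/2\le r<1$, taking $\lambda=m^{-t/(2(2r+1))}$, the learning rate $O(m^{-rt/(2(2r+1))})$ can be derived. When $r\ge1$, the desired convergence rate is obtained by taking $\lambda=m^{-t/6}$.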
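The choices of $\lambda$ in Case 1 can be cross-checked numerically. Writing $\lambda=m^{-\beta}$, each term of the bound becomes a power of $m$, and the best $\beta$ maximizes the smallest decay exponent among the three terms. The sketch below does this by grid search and compares the result with $\theta\cdot t/2$ from Theorem 3 for $0<t<2$. The function names and grid resolution are illustrative assumptions.

```python
import numpy as np

def theta(r):
    """Exponent theta from Theorem 3, as a function of the smoothness index r."""
    if r < 0.5:
        return 3 * r / (4 * (r + 1))
    if r < 1:
        return r / (2 * r + 1)
    return 1 / 3

def best_exponent_by_grid(r, t, num=200001):
    """Max over beta (lambda = m^-beta) of the smallest decay exponent, for 0 < t < 2."""
    beta = np.linspace(0.0, 1.0, num)
    approx = beta * min(r, 1.0)          # lambda^{min(r,1)}        -> m^{-beta*min(r,1)}
    s1 = t / 4 - beta / 2                # lambda^{-1/2} m^{-t/4}   -> m^{-(t/4 - beta/2)}
    s2 = 3 * t / 8 - beta                # lambda^{-1} m^{-3t/8}    -> m^{-(3t/8 - beta)}
    return np.max(np.minimum(approx, np.minimum(s1, s2)))

for r in (0.25, 0.75, 2.0):
    print(r, best_exponent_by_grid(r, t=1.0), theta(r) * 1.0 / 2)
```

The two columns agree up to the grid resolution, which reflects that for $r<1/2$ the term $\lambda^{-1}m^{-3t/8}$ is the binding one, while for $r\ge1/2$ the term $\lambda^{-1/2}m^{-t/4}$ takes over.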

Case 2. $t\ge2$. With confidence $1-\eta$, there holds
(47) \|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}=O\big(\lambda^{\min\{r,\,1\}}+\eta^{-1}(\lambda^{-1/2}m^{-1/2}+\lambda^{-1}m^{-3/4})(\log m)^{3/4}\big).
For $0<r<1/2$, taking $\lambda=m^{-3/(4(r+1))}$, the learning rate $O(m^{-3r/(4(r+1))}(\log m)^{3/4})$ can be derived, and for $1/2\le r<1$, taking $\lambda=m^{-1/(2r+1)}$, we deduce the learning rate $O(m^{-r/(2r+1)}(\log m)^{3/4})$. When $r\ge1$, the desired convergence rate is obtained by taking $\lambda=m^{-1/3}$.

Next, to bound the error in $\mathcal{H}_K$, Proposition 8 together with Proposition 11 shows that with confidence $1-\eta$,
(48) \|f_{\mathbf{z},\lambda}-f_\rho\|_K=O\Big(\lambda^{\min\{r-1/2,\,1\}}+\eta^{-1}\lambda^{-1}m^{-1/2}\sqrt{1+4\sum_{i=1}^{m-1}\phi_i^{1/2}}\Big).
The rest of the proof is analogous to the estimate of $\|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}$ above.

Proof of Theorem 4.

For $0<t<1$, by (43) and $\alpha_l\le b\,l^{-t}$,
(49) 1+\sum_{l=1}^{m-1}\alpha_l^{(p-2)/p}\le1+b^{(p-2)/p}\sum_{l=1}^{m-1}l^{-((p-2)/p)t}\le\Big(1+\frac{p\,b^{(p-2)/p}}{p-(p-2)t}\Big)m^{1-((p-2)/p)t},\qquad 1+\sum_{l=1}^{m-1}\alpha_l\le1+b\sum_{l=1}^{m-1}l^{-t}\le\Big(1+\frac{b}{1-t}\Big)m^{1-t}.
By Propositions 9 and 11 and the Markov inequality, with confidence $1-\eta$, there holds
(50) \|f_{\mathbf{z},\lambda}-f_\rho\|_{\rho_X}=O\Big(\lambda^{\min\{r,\,1\}}+\eta^{-1}\lambda^{\min\{(p-2)(2r-1)/2p,\,0\}-1/2}\big(m^{-(p-2)t/2p}+\lambda^{-1/2}m^{-(p-2)t/2p-t/4}\big)\Big).

For $0<r<1/2$, taking $\lambda=m^{-(p-2)t/(2(2r+p-1))}$, we deduce the learning rate $O(m^{-(p-2)tr/(2(2r+p-1))})$. When $1/2\le r<1$, taking $\lambda=m^{-(p-2)t/(p(2r+1))}$, the learning rate $O(m^{-(p-2)rt/(p(2r+1))})$ can be derived. When $r\ge1$, the desired convergence rate is obtained by taking $\lambda=m^{-(p-2)t/(3p)}$.
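For the case $0<r<1/2$, the exponent arithmetic behind the choice of $\lambda$ can be written out as follows. This is a sketch that balances the approximation term against the first sample-error term only, with constants and the factor $\eta^{-1}$ omitted.

```latex
\min\Big\{\tfrac{(p-2)(2r-1)}{2p},0\Big\}-\tfrac12
=\frac{(p-2)(2r-1)-p}{2p}
=-\frac{p-pr+2r-1}{p},
\qquad
\lambda^{r}=\lambda^{-\frac{p-pr+2r-1}{p}}\,m^{-\frac{(p-2)t}{2p}}
\iff
\lambda^{\frac{p+2r-1}{p}}=m^{-\frac{(p-2)t}{2p}}
\iff
\lambda=m^{-\frac{(p-2)t}{2(2r+p-1)}} .
```

The resulting rate is $\lambda^{r}=m^{-(p-2)tr/(2(2r+p-1))}=m^{-\vartheta\,(p-2)t/p}$ with $\vartheta=pr/(2(2r+p-1))$, in agreement with the first case of Theorem 4.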

The rest of the analysis is similar; we omit it here.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (no. 11071276).

References

[1] Evgeniou T., Pontil M., Poggio T. Regularization networks and support vector machines.
[2] Smale S., Zhou D.-X. Shannon sampling. II. Connections to learning theory.
[3] Smale S., Zhou D.-X. Learning theory estimates via integral operators and their approximations.
[4] Wu Q., Ying Y., Zhou D.-X. Learning rates of least-square regularized regression.
[5] Modha D. S., Masry E. Minimum complexity regression estimation with weakly dependent observations.
[6] Smale S., Zhou D.-X. Online learning with Markov sampling.
[7] Sun H., Wu Q. A note on application of integral operator in learning theory.
[8] Sun H., Wu Q. Regularized least square regression with dependent samples.
[9] Athreya K. B., Pantula S. G. Mixing properties of Harris chains and autoregressive processes.
[10] Caponnetto A., De Vito E. Optimal rates for the regularized least-squares algorithm.
[11] Guo Z.-C., Zhou D.-X. Concentration estimates for learning with unbounded sampling.
[12] Lv S.-G., Feng Y.-L. Integral operator approach to learning theory with unbounded sampling.
[13] Wang C., Zhou D.-X. Optimal learning rates for least squares regularized regression with unbounded sampling.
[14] Wang C., Guo Z. C. ERM learning with unbounded sampling.
[15] Chu X. R., Sun H. W. Half supervised coefficient regularization for regression learning with unbounded sampling.
[16] Smale S., Zhou D.-X. Shannon sampling and function reconstruction from point values.
[17] Sun H., Wu Q. Application of integral operator for regularized least-square regression.
[18] Bauer F., Pereverzev S., Rosasco L. On regularization algorithms in learning theory.
[19] Lo Gerfo L., Rosasco L., Odone F., De Vito E., Verri A. Spectral algorithms for supervised learning.
[20] Sun H., Wu Q. Least square regression with indefinite kernels and coefficient regularization.
[21] Billingsley P.