3.1. Two Tail Inequalities
In this section, we present two tail inequalities and the definition of a well-conditioned basis in the $L_2$ norm. We then present the Fast Cauchy Transform in the $L_2$ norm, which is an analog of the fast Johnson–Lindenstrauss transform, and prove our main result, Theorem 2, in Section 3.2.
Lemma 5 (upper-tail inequality).
For $i\in[m]$, let $C_i$ be $m$ Cauchy random variables (not necessarily independent), and let $\gamma_i\ge0$ with $\gamma=\sum_{i\in[m]}\gamma_i$. Let $X=\sum_{i\in[m]}\gamma_iC_i^2$. Then, for any $t>0$,
(8)$\Pr[X>\gamma t]\le O\!\left(\sqrt{m/t}\right).$
Proof.
Fix $M>0$ and define $F_i=\{|C_i|\le M\}$ for $i\in[m]$ and $F=\bigcap_{i\in[m]}F_i$; note that $F\cap F_i=F$. Since $\tan^{-1}x\le x$, we have
(9)$\Pr[F_i]=\frac{2}{\pi}\tan^{-1}(M)=1-\frac{2}{\pi}\tan^{-1}\!\left(\frac{1}{M}\right)\ge1-\frac{2}{\pi M}.$
By a union bound, $\Pr[F]\ge1-2m/(\pi M)$. Further, $\Pr[F\mid F_i]\Pr[F_i]=\Pr[F\cap F_i]=\Pr[F]$; hence, $\Pr[F\mid F_i]=\Pr[F]/\Pr[F_i]$. We first bound $\mathbb{E}[C_i^2\mid F]$:
(10)$\mathbb{E}[C_i^2\mid F_i]=\mathbb{E}[C_i^2\mid F_i\cap F]\Pr[F\mid F_i]+\mathbb{E}[C_i^2\mid F_i\cap\bar F]\Pr[\bar F\mid F_i]\ge\mathbb{E}[C_i^2\mid F_i\cap F]\Pr[F\mid F_i].$
Then,
(11)$\mathbb{E}[C_i^2\mid F]\le\frac{\mathbb{E}[C_i^2\mid F_i]}{\Pr[F\mid F_i]}=\mathbb{E}[C_i^2\mid F_i]\cdot\frac{\Pr[F_i]}{\Pr[F]}.$
By using the pdf of a Cauchy variable,
(12)$\mathbb{E}[C_i^2\mid F_i]=\frac{2\big(M/\pi-(1/\pi)\tan^{-1}(M)\big)}{\Pr[F_i]},$
so
(13)$\mathbb{E}[C_i^2\mid F]\le\frac{2\big((M/\pi)-(1/\pi)\tan^{-1}(M)\big)}{\Pr[F]}\le\frac{2\big((M/\pi)-(1/\pi)\tan^{-1}(M)\big)}{1-2m/(\pi M)}.$
We conclude that
(14)$\mathbb{E}[X\mid F]=\sum_{i\in[m]}\gamma_i\mathbb{E}[C_i^2\mid F]\le\frac{2\gamma\big((M/\pi)-(1/\pi)\tan^{-1}(M)\big)}{1-2m/(\pi M)}.$
By Markov's inequality and since $\Pr[X\ge\gamma t\mid\bar F]\le1$, we have
(15)$\Pr[X\ge\gamma t]=\Pr[X\ge\gamma t\mid F]\Pr[F]+\Pr[X\ge\gamma t\mid\bar F](1-\Pr[F])\le\frac{(2/t)\big((M/\pi)-(1/\pi)\tan^{-1}(M)\big)}{1-2m/(\pi M)}+\frac{2m}{\pi M}.$
We set $M=\sqrt{mt}$; then both terms are $O\!\left(\sqrt{m/t}\right)$, so
(16)$\Pr[X\ge\gamma t]\le O\!\left(\sqrt{m/t}\right).$
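This bound is easy to probe numerically. The following sketch (our own illustration, not part of the proof) draws $X=\sum_i\gamma_iC_i^2$ many times and compares the empirical tail with $\sqrt{m/t}$; the weights and sample sizes are arbitrary choices.

```python
import numpy as np

# Monte Carlo sanity check of Lemma 5 (illustration only):
# X = sum_i gamma_i * C_i^2 should satisfy Pr[X > gamma*t] <= O(sqrt(m/t)).
rng = np.random.default_rng(0)
m, trials = 50, 20000
gamma_i = rng.random(m)                 # arbitrary nonnegative weights
gamma = gamma_i.sum()

C = rng.standard_cauchy((trials, m))    # i.i.d. Cauchy draws
X = (gamma_i * C**2).sum(axis=1)

for t in [1e2, 1e4, 1e6]:
    emp = (X > gamma * t).mean()
    print(f"t={t:.0e}  empirical={emp:.5f}  sqrt(m/t)={np.sqrt(m/t):.5f}")
```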
Lemma 6 (lower-tail inequality).
For $i\in[m]$, let $C_i$ be independent Cauchy random variables, with $\gamma=\sum_{i\in[m]}\gamma_i^2$ and $\sum_{i\in[m]}\gamma_i^4\le\gamma^2/\beta^2$. Let $X=\sum_{i\in[m]}\gamma_i^2C_i^2$. Then, for any $t>0$,
(17)$\Pr[X\le\gamma(1-t)]\le\exp\!\left(-\frac{\beta^2t^2}{3}\right).$
Proof.
To bound the lower tail, we use Lemma 3. By homogeneity, it suffices to prove the result for $\gamma=1$. Let $Z_i=\gamma_i^2\min(C_i^2,M)$. Clearly $Z_i\le\gamma_i^2C_i^2$, and defining $Z=\sum_iZ_i$, we have $Z\le X$ and $\Pr[X\le1-t]\le\Pr[Z\le1-t]$. Thus, we have that
(18)$\Pr[Z\le1-t]=\Pr\big[Z\le\mathbb{E}[Z]-(\mathbb{E}[Z]-1+t)\big]\le\exp\!\left(-\frac{(\mathbb{E}[Z]-1+t)^2}{2\sum_i\mathbb{E}[Z_i^2]}\right),$
where the last inequality holds by Lemma 3 when $1-t<\mathbb{E}[Z]$. Using the distribution of the half-Cauchy, one can verify that by choosing a fixed $M$ with $1<M<2$ such that $2\int_0^Mx^2\,\frac{1}{\pi}\,\frac{1}{1+x^2}\,dx=1$, we have $\mathbb{E}[Z_i]=\gamma_i^2$ and $\mathbb{E}[Z_i^2]\le\frac{3}{2}\gamma_i^4$; hence $\sum_i\mathbb{E}[Z_i]=1$ and $\sum_i\mathbb{E}[Z_i^2]\le\frac{3}{2}\sum_i\gamma_i^4\le\frac{3}{2\beta^2}$. It follows that $\Pr[Z\le1-t]\le\exp(-t^2\beta^2/3)$, and the result follows.
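As a quick numerical illustration (ours, with arbitrary parameters), take uniform weights $\gamma_i^2=1/m$: then $\gamma=1$ and $\sum_i\gamma_i^4=1/m$, so Lemma 6 applies with $\beta^2=m$ and predicts $\Pr[X\le1-t]\le\exp(-mt^2/3)$.

```python
import numpy as np

# Monte Carlo sanity check of Lemma 6 (illustration only), with
# gamma_i^2 = 1/m, so that gamma = 1 and beta^2 = m.
rng = np.random.default_rng(1)
m, trials = 200, 20000
C = rng.standard_cauchy((trials, m))
X = (C**2).mean(axis=1)                 # X = sum_i (1/m) * C_i^2

for t in [0.3, 0.5, 0.7]:
    emp = (X <= 1 - t).mean()
    bound = np.exp(-m * t**2 / 3)
    print(f"t={t}  empirical={emp:.1e}  bound={bound:.1e}")
```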
3.2. Definition of L2 Well-Conditioned Basis and Construction of FCT in L2 Norm
Clarkson et al. [6] constructed $\Pi_1$ as $\Pi_1\equiv4BC\tilde H$. Let $\delta\in(0,1]$ be a parameter governing the failure probability of our algorithm. We modify $B$ and obtain $\Pi_2$, where: $B\in\mathbb{R}^{r_1\times2n}$ has each column chosen independently and uniformly from the $r_1$ standard basis vectors for $\mathbb{R}^{r_1}$, and $B_{ij}$ is chosen independently from a $(0,1)$-distribution; for $\alpha$ sufficiently large, we set the parameter $r_1=\alpha d\log(d/\delta)$; $C\in\mathbb{R}^{2n\times2n}$ is a diagonal matrix with diagonal entries chosen independently from a Cauchy distribution; and $\tilde H\in\mathbb{R}^{2n\times n}$ is a block-diagonal matrix comprised of $n/s$ blocks along the diagonal, each block being the $2s\times s$ matrix $G_s=[H_s;I_s]$, where $I_s$ is the $s\times s$ identity matrix and $H_s$ is the normalized Hadamard matrix. We set $s=r_1^6$. The effect of $\tilde H$ is to spread the weight of a vector, so that most entries of $\tilde Hy$ are not too small. A small numerical sketch of this construction follows.
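The sketch below (our illustration) builds $\Pi_2=4BC\tilde H$ as a dense matrix. The sizes are toy values rather than $s=r_1^6$, the nonzero entries of $B$ are simply set to $1$ (substitute the distribution for $B_{ij}$ from the construction if desired), and a practical implementation would apply each factor as a fast operator rather than form matrices.

```python
import numpy as np
from scipy.linalg import hadamard

def fct_l2(n, r1, s, rng):
    """Dense sketch of Pi_2 = 4*B*C*H_tilde (illustration only).

    Requires n divisible by s, and s a power of two (Sylvester Hadamard).
    """
    assert n % s == 0 and (s & (s - 1)) == 0
    Hs = hadamard(s) / np.sqrt(s)             # normalized Hadamard H_s
    Gs = np.vstack([Hs, np.eye(s)])           # 2s x s block G_s = [H_s; I_s]
    H_tilde = np.kron(np.eye(n // s), Gs)     # 2n x n block-diagonal H_tilde
    C = np.diag(rng.standard_cauchy(2 * n))   # 2n x 2n Cauchy diagonal
    B = np.zeros((r1, 2 * n))                 # r1 x 2n, one nonzero per column
    B[rng.integers(0, r1, 2 * n), np.arange(2 * n)] = 1.0
    return 4 * B @ C @ H_tilde

rng = np.random.default_rng(2)
Pi2 = fct_l2(n=64, r1=16, s=8, rng=rng)
print(Pi2.shape)                              # (16, 64)
```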
We now describe the well-conditioned basis in the $L_2$ norm. Our main aim in constructing an $L_2$ well-conditioned basis is to ensure that we can tolerate the distortion of the subspace embedding. Forming an approximate basis for the range of $A$ also leads to faster algorithms for a range of related problems, including low-rank matrix approximation [8, 9].
Definition 7 (L2 well-conditioned basis, [3, 4]).
A basis $U$ for the range of $A$ is $L_2$-$(\alpha,\beta)$-conditioned if $\|U\|_2\le\alpha$ and, for all $x\in\mathbb{R}^d$, $\|x\|_2\le\beta\|Ux\|_2$.
Next we show the construction of the $L_2$ well-conditioned basis $U$, which consists of two steps: (a) let $\Pi_2$ be a matrix satisfying (4), compute $\Pi_2A$ and its QR-factorization $\Pi_2A=QR$, where $Q$ is an orthogonal matrix; (b) output $U=AR^{-1}=A(Q^T\Pi_2A)^{-1}$. This structure is similar to the algorithm of [10] for computing an $L_2$ well-conditioned basis; a sketch is given below.
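A minimal sketch of steps (a) and (b), reusing the fct_l2 helper defined above (our illustration; the triangular solve computes $AR^{-1}$ without forming $R^{-1}$ explicitly):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
n, d = 64, 4
A = rng.standard_normal((n, d))

# Step (a): sketch A and QR-factorize the small matrix Pi2 @ A.
Pi2 = fct_l2(n=n, r1=16, s=8, rng=rng)    # helper from the previous snippet
Q, R = np.linalg.qr(Pi2 @ A)              # Pi2 @ A = Q R, Q orthogonal

# Step (b): U = A @ R^{-1}, via a triangular solve of R^T X = A^T.
U = solve_triangular(R, A.T, trans='T').T

# Diagnostic: if Pi2 embeds range(A) well, U is roughly well-conditioned.
print(np.linalg.svd(U, compute_uv=False))
```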
To obtain Theorem 2, we divide its proof into two propositions. To prove the upper bound, we use the existence of a $(d,1)$-conditioned basis $U$ and apply this basis to show that $\|\Pi_2Ux\|_2$ cannot expand too much. To prove the lower bound, we show that the desired inequality holds with high probability for a particular $y$; then we use a suitable $\gamma$-net to obtain the result for all $y$.
Proposition 8.
With probability at least $1-\delta$, for all $x\in\mathbb{R}^d$, $\|\Pi_2Ax\|_2\le\kappa\|Ax\|_2$, where $\kappa=O\!\left(\sqrt{r_1d^3}/\delta\right)$.
Proof.
Let $U\in\mathbb{R}^{n\times d}$ be an $L_2$-$(d,1)$-conditioned basis for the range of $A$, which implies that $A=UZ$ for some $Z\in\mathbb{R}^{d\times d}$. By the construction of $U$, for any $x\in\mathbb{R}^d$, $\|x\|_2\le\|Ux\|_2$, and so
(19)$\|\Pi_2Ux\|_2\le\|\Pi_2U\|_2\|x\|_2\le\|\Pi_2U\|_2\|Ux\|_2.$
Thus our main aim is to show that $\|\Pi_2U\|_2\le\kappa$. We have
(20)$\|\Pi_2U\|_2^2=16\|BC\tilde HU\|_2^2=16\sum_{j\in[d]}\|BC\tilde HU^{(j)}\|_2^2=16\sum_{j\in[d]}\|BC\tilde U^{(j)}\|_2^2,$
where $\tilde U=\tilde HU$. We need to bound $\|\tilde U\|_2^2$. For any vector $y\in\mathbb{R}^n$, we represent $y$ by its $n/s$ blocks of size $s$, so $z_i\in\mathbb{R}^s$ and $y^T=[z_1^T,z_2^T,\dots,z_{n/s}^T]$. Recall that $G_s=[H_s;I_s]$ and $\|G_s\|_2=\sqrt{2}$. Then,
(21)$\|\tilde Hy\|_2^2=\sum_{i\in[n/s]}\|G_sz_i\|_2^2.$
It follows that
(22)$\|\tilde Hy\|_2^2\le\sum_{i\in[n/s]}\|G_s\|_2^2\|z_i\|_2^2=2\|y\|_2^2.$
Applying this to $y=U^{(j)}$ for $j\in[d]$,
(23)$\|\tilde U\|_2^2=\sum_{j\in[d]}\|\tilde U^{(j)}\|_2^2\le2\|U\|_2^2\le2d^2.$
The $(i,j)$ entry of $BC\tilde U$ is $\sum_kB_{ik}C_{kk}\tilde U_{kj}$; here we take $\gamma_{ij}=\sum_kB_{ik}\tilde U_{kj}$. So,
(24)$\|BC\tilde U\|_2^2=\sum_{i,j}\Big(\sum_kB_{ik}C_{kk}\tilde U_{kj}\Big)^2=\sum_{i,j}(\gamma_{ij}\tilde C_{ij})^2=\sum_{i,j}\gamma_{ij}^2\tilde C_{ij}^2.$
Here the $\tilde C_{ij}$ are dependent Cauchy random variables (recall that Lemma 5 does not require independence) and $\gamma=\sum_{i,j}\gamma_{ij}^2$. Since the columns of $B$ are standard basis vectors, we obtain
(25)$\gamma=\sum_{i,j}\gamma_{ij}^2=\sum_{i,j}\Big(\sum_kB_{ik}\tilde U_{kj}\Big)^2=\|\tilde U\|_2^2.$
Hence, we can apply Lemma 5 with $m=r_1d$ to obtain
(26)$\Pr\big[\|BC\tilde U\|_2^2\ge t\gamma\big]\le O\!\left(\sqrt{m/t}\right).$
Setting $\sqrt{m/t}$ to $\delta$, we get $t=O(r_1d/\delta^2)$. Thus, with probability at least $1-\delta$,
(27)$\|\Pi_2U\|_2=4\|BC\tilde HU\|_2=O\!\left(\sqrt{r_1d^3}/\delta\right),$
which gives the claimed $\kappa$.
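Proposition 8 can likewise be probed empirically (our illustration with toy sizes, for which the asymptotic guarantee need not hold): since $\sup_x\|\Pi_2Ax\|_2/\|Ax\|_2$ equals the spectral norm of $\Pi_2Q_A$ for an orthonormal basis $Q_A$ of the range of $A$, the worst-case expansion can be measured directly.

```python
import numpy as np

# Empirical expansion of Pi2 on range(A): sup ||Pi2 A x|| / ||A x||
# equals the spectral norm of Pi2 @ Qa, where Qa spans range(A).
rng = np.random.default_rng(4)
n, d = 64, 4
A = rng.standard_normal((n, d))
Pi2 = fct_l2(n=n, r1=16, s=8, rng=rng)    # helper from the earlier snippet
Qa, _ = np.linalg.qr(A)                   # orthonormal basis of range(A)
print("max expansion:", np.linalg.norm(Pi2 @ Qa, 2))
```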
To prove the lower bound, we first show a result for a fixed $y\in\mathbb{R}^n$, which we describe in the next lemma.
Lemma 9.
Let $y\in\mathbb{R}^n$ be fixed. Then $\Pr\big[\|\Pi_2y\|_2<2\|y\|_2\big]\le\exp\!\left(-\frac{r_1^3}{64}\right)+\exp\!\left(-\frac{s^{1/2}}{16r_1^2}+\log r_1\right)$.
Proof.
We represent any vector $y\in\mathbb{R}^n$ by its $n/s$ blocks of size $s$, so $z_i\in\mathbb{R}^s$ and $y^T=[z_1^T,z_2^T,\dots,z_{n/s}^T]$. Let $g=\tilde Hy=[G_sz_1;G_sz_2;\dots;G_sz_{n/s}]$. We have that $\|g\|_2^2=\sum_i\|G_sz_i\|_2^2=2\sum_i\|z_i\|_2^2=2\|y\|_2^2$ and
(28)$\|g\|_1=\sum_i\|G_sz_i\|_1\ge\frac{s^{1/4}}{\sqrt{2}}\sum_i\|z_i\|_2\ge\frac{s^{1/4}}{\sqrt{2}}\Big(\sum_i\|z_i\|_2^2\Big)^{1/2}=\frac{s^{1/4}}{\sqrt{2}}\|y\|_2.$
We can conclude that $\|g\|_1\ge\frac{1}{2\sqrt{2}}s^{1/4}\|g\|_2$, so $\|g\|_2^4\ge\frac{1}{8}s^{1/2}\|g\|_4^4$. To analyze $\|\Pi_2y\|_2=4\|BCg\|_2$ (recall $\Pi_2y=4BCg$ with $g=\tilde Hy$), note that $(BCg)_j=\sum_iB_{ji}C_{ii}g_i$. Here we take $\gamma_j=\sum_{i=1}^{2n}B_{ji}g_i$; then $(BCg)_j=\gamma_j\tilde C_j$, where the $\tilde C_j$ are independent Cauchy random variables, so
(29)$\|BCg\|_2^2=\sum_j\Big(\sum_iB_{ji}C_{ii}g_i\Big)^2=\sum_j\gamma_j^2\tilde C_j^2.$
To apply Lemma 6, we need to bound $\sum_j\gamma_j^2$ and $\sum_j\gamma_j^4$. First, because the columns of $B$ are standard basis vectors,
(30)$\sum_j\gamma_j^2=\sum_j\Big(\sum_iB_{ji}g_i\Big)^2=\sum_j\sum_iB_{ji}^2g_i^2=\|g\|_2^2.$
To bound $\sum_j\gamma_j^4$, we show that the $\gamma_j^2$ are nearly uniform. Because $B_{ji}$ and $B_{jk}$ are independent for $i\ne k$, we can use Lemma 4 with $p=1/r_1$, $\xi_i=g_i^2$, and $\sum_i\xi_i=\|g\|_2^2$; setting $t=1/r_1$ in Lemma 4,
(31)$\Pr\Big[\gamma_j^2\ge(p+t)\sum_i\xi_i\Big]\le\exp\!\left(-\frac{t^2}{2}\cdot\frac{\|g\|_2^4}{\|g\|_4^4}\right)\le\exp\!\left(-\frac{s^{1/2}}{16r_1^2}\right).$
By a union bound, the probability that some $\gamma_j^2$ exceeds $2\|g\|_2^2/r_1^2$ is at most $r_1\exp(-s^{1/2}/(16r_1^2))$. Conditioning on this high-probability event, we get $\sum_j\gamma_j^4\le4\|g\|_2^4/r_1^3$. We then apply Lemma 6 with $t=3/4$ and $\beta^2=r_1^3/4$ to obtain $\Pr\big[\|BCg\|_2\le\frac{1}{2}\|g\|_2\big]\le\exp(-r_1^3/64)$. By a union bound, with probability at least $1-\exp(-r_1^3/64)-\exp(-s^{1/2}/(16r_1^2)+\log r_1)$, we have $\|\Pi_2y\|_2=4\|BCg\|_2\ge2\|g\|_2=2\sqrt{2}\|y\|_2\ge2\|y\|_2$, which proves the lemma.
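As a sanity check of Lemma 9 (our illustration; at these toy sizes, far from $s=r_1^6$, the stated probabilities are not meaningful, so the snippet only exhibits the contraction ratio that the lemma bounds):

```python
import numpy as np

# Empirical contraction of Pi2 on a fixed y over fresh draws of Pi2;
# Lemma 9 says ||Pi2 y||_2 < 2 ||y||_2 should be a rare event.
rng = np.random.default_rng(5)
n = 64
y = rng.standard_normal(n)

ratios = [np.linalg.norm(fct_l2(n, 16, 8, rng) @ y) / np.linalg.norm(y)
          for _ in range(200)]          # fct_l2 from the Section 3.2 snippet
print("min ratio over 200 draws:", min(ratios))
```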
Proposition 10.
Assume Proposition 8 holds. Then, for all $x\in\mathbb{R}^d$, $\|\Pi_2Ax\|_2\ge\|Ax\|_2$ holds with probability at least
(32)$1-\exp\big(d\log(2d\kappa)\big)\Big(\exp\Big(-\frac{r_1^3}{64}\Big)+\exp\Big(-\frac{s^{1/2}}{16r_1^2}+\log r_1\Big)\Big).$
Proof.
The proposition follows by putting a $\gamma$-net $\Gamma$ on the range of $A$ (observe that the range of $A$ has dimension at most $d$). Specifically, let $L$ be any fixed $d$-dimensional subspace of $\mathbb{R}^n$ ($L$ is the range of $A$). Consider the $\gamma$-net on $L$ formed by cubes of side $\gamma/d$; $(2d/\gamma)^d$ such cubes are required to cover the hypercube $\{\|y\|_\infty\le1\}$, and for any two points $y_1,y_2$ inside the same cube, $\|y_1-y_2\|_2\le\gamma/\sqrt{d}$. From each of the $\gamma/d$-cubes, select a fixed representative point $y^*$ with $\|y^*\|_2=1$. Since there are at most $(2d/\gamma)^d$ such representatives, by a union bound and Lemma 9,
(33)$\Pr\Big[\min_{y^*}\frac{\|\Pi_2y^*\|_2}{\|y^*\|_2}<2\Big]\le\Big(\frac{2d}{\gamma}\Big)^d\Big(\exp\Big(-\frac{r_1^3}{64}\Big)+\exp\Big(-\frac{s^{1/2}}{16r_1^2}+\log r_1\Big)\Big).$
We will condition on the high-probability event that $\|\Pi_2y^*\|_2\ge2\|y^*\|_2$ for all $y^*$. For any $y\in L$ with $\|y\|_2=1$, let $y^*$ denote the representative point of the cube containing $y$ (so $\|y^*\|_2=1$); then $\|y-y^*\|_2\le\gamma/\sqrt{d}$. Considering
(34)$\|\Pi_2y\|_2=\|\Pi_2y^*+\Pi_2(y-y^*)\|_2\ge\|\Pi_2y^*\|_2-\|\Pi_2(y-y^*)\|_2\ge2\|y^*\|_2-\kappa\|y-y^*\|_2,$
and choosing $\gamma=1/\kappa$, we have $\|\Pi_2y\|_2\ge\|y\|_2$ with probability at least
(35)$1-\exp\big(d\log(2\kappa d)\big)\Big(\exp\Big(-\frac{r_1^3}{64}\Big)+\exp\Big(-\frac{s^{1/2}}{16r_1^2}+\log r_1\Big)\Big).$
Since $s^{1/2}=r_1^3$, by choosing $r_1=\alpha d\log(d/\delta)$ for large enough $\alpha$, the final probability of failure is at most $2\delta$. Also, since $\kappa=O\!\left(\sqrt{r_1d^3}/\delta\right)$ and $r_1=\alpha d\log(d/\delta)$, by inequality (1) we obtain distortion $O(d^{1+\eta})$ for an arbitrarily small constant $\eta>0$.