With the increasing prominence of big data in modern science, the data of interest are more complex and stochastic. To deal with complex matrix and vector data, this paper focuses on the mixed matrix regression model. We mainly establish the degrees of freedom of the underlying stochastic model, an important quantity for constructing adaptive selection criteria that efficiently select the optimal model fit. Under mild conditions, we prove that the degrees of freedom of the mixed matrix regression model are the sum of the degrees of freedom of the Lasso and of regularized matrix regression. Moreover, we establish the degrees of freedom of nuclear-norm regularized multivariate regression. Furthermore, we prove that the estimates of the degrees of freedom of the underlying models possess the consistency property.
1. Introduction
With the increasing prominence of large-scale data in modern science, the data of interest are increasingly complex and may take the form of a matrix rather than a vector. At the same time, the random noise is not always normal. Such complex stochastic data are frequently collected in a large variety of research areas such as information technology, engineering, medical imaging and diagnosis, and finance [1–7]. For instance, a well-known example is the study of an electroencephalography data set on alcoholism. The study consists of 122 subjects in two groups, an alcoholic group and a normal control group, and each subject was exposed to a stimulus. Voltage values were measured from 64 channels of electrodes placed on the subject's scalp at 256 time points, so each sampling unit is a 256×64 matrix. To address scientific questions arising from such data, sparsity or other forms of regularization are crucial owing to the ultrahigh dimensionality and complex structure of the matrix data. Many statistical models lead to the estimation of matrices under rank constraints, and the true signal can often be well approximated by a low-rank matrix. Recently, Zhou and Li [5] proposed the so-called regularized matrix regression model, based on spectral regularization, to deal with matrix-form data. This model includes the well-known Lasso as a special case; see [8] for more details. Moreover, one of the main results in [5] gave the degrees of freedom of the proposed model under an orthonormality assumption.
Degrees of freedom of the underlying stochastic model are an important topic. To evaluate the performance of a model on data, we need to choose the optimal tuning parameter within that model. Many methods have been proposed for this purpose; popular ones include Cp, AIC, and BIC [9–11], along with the computationally intensive method of cross-validation. Efron [11] showed that Cp is an unbiased estimate of the prediction error and that in most cases Cp selects a more accurate parameter than cross-validation; thus Cp and AIC can outperform cross-validation. The fundamental idea behind Cp, AIC, and BIC is connected with the concept of degrees of freedom.
Degrees of freedom are easily understood in the linear model: there, they equal the number of predictor variables. However, if there are constraints on the predictors, the degrees of freedom no longer correspond exactly to the number of variables; see, for example, [5, 12–18]. After Stein [12] derived Stein's unbiased estimate, analytical forms of the degrees of freedom of various models were studied in the vector case. For instance, Hastie and Tibshirani [13] showed that the degrees of freedom of a linear smoother equal the trace of the prediction matrix. In general, it is difficult to obtain the degrees of freedom of many models. Ye [15] in 1998 and Shen and Ye [16] in 2002 used computational methods to estimate the degrees of freedom, but these have the drawback that the computational cost grows with the amount of data. For the high-dimensional vector case, Zou et al. [14] gave the degrees of freedom of the Lasso, and Tibshirani and Taylor [17, 18] gave the degrees of freedom of the generalized Lasso.
For the matrix case, however, there are only a few results on the degrees of freedom of matrix regression. Obtaining the analytical form of the degrees of freedom of our model is essential both in theory and in practice, so it is important to study the degrees of freedom in the matrix case in the big data era. Besides Zhou and Li's work [5] on the degrees of freedom of regularized matrix regression, Yuan [19] obtained the degrees of freedom in low rank matrix estimation, covering both rank constraints and nuclear-norm regularization. Note that Yuan [19] considered only rank-constrained multivariate regression, and Zhou and Li [5] did not consider the mixed case, which combines matrix and vector predictors. If we use the nuclear norm as the penalty, what are the degrees of freedom of that model? If the predictors are mixed, what are the degrees of freedom of that model?
We answer the above questions affirmatively in this paper. First, we prove that the degrees of freedom of the mixed matrix regression model are the sum of the degrees of freedom of the Lasso and of regularized matrix regression; this result is useful for constructing adaptive selection criteria that efficiently select the optimal model fit. Then, following the same idea, we establish the degrees of freedom of nuclear-norm regularized multivariate regression. It is worth noting that Zou et al. [14] not only gave the unbiased estimate of the degrees of freedom of the Lasso model, but also proved the consistency of that estimate, an interesting and important result. Building on their work, we finally prove that the estimates of the degrees of freedom given in this paper are consistent.
Our paper is organized as follows. In Section 2, we introduce the primary model, basic concepts, and notations used in our paper. In Section 3, we show the process of computing the degrees of freedom of model (3). In Section 4, we give the degrees of freedom of multivariate regression with nuclear-norm regularization. In Section 5, we verify the consistent property of the estimates. We conclude the paper with a discussion of potential future research in Section 6.
2. Preliminaries
In this section, we mainly introduce our model and basic concepts. First we present mixed matrix regression model. Then for convenient discussion and understanding of our work, we give some basic knowledge and notations.
Suppose y ∈ ℝ is the response variable, γ ∈ ℝ^{p_0} is the prediction vector, and X ∈ ℝ^{p_1×p_2} is the prediction matrix; these are observed. Let z and B be the unknown coefficient vector and matrix. The statistical model of matrix regression is given as
(1) y = ⟨B, X⟩ + γ^T z + ε,
where ⟨B, X⟩ is the trace inner product, that is, the sum of the products of corresponding elements of B and X, and ε is the random error of the model. Suppose we take n samples:
(2) y_i = ⟨B, X_i⟩ + γ_i^T z + ε_i,  i = 1, …, n.
Note that real data often impose special structure on B and z: B has low rank and z is sparse. In this case, we define the mixed matrix regression model as
(3) min_{B,z} (1/2) ∑_{i=1}^n (y_i − γ_i^T z − ⟨B, X_i⟩)^2 + λ_1‖B‖_* + λ_2‖z‖_1,
where λ_1 ≥ 0 and λ_2 ≥ 0 are the regularization parameters and ‖B‖_* is the nuclear norm of B, that is, the sum of the singular values of B. If B has singular value decomposition U diag(b) V^T, where U ∈ ℝ^{p_1×p_1} and V ∈ ℝ^{p_2×p_2} are orthogonal matrices and diag(b) is the matrix whose main diagonal holds the singular values b = (b_1, b_2, …, b_p, 0, …, 0), then ‖B‖_* = b_1 + b_2 + ⋯ + b_p. The norm ‖z‖_1 is the sum of the absolute values of the components of z: if z = (z_1, z_2, …, z_{p_0}), then ‖z‖_1 = |z_1| + |z_2| + ⋯ + |z_{p_0}|. Clearly, if B = 0 in model (3), we recover the Lasso model, whose algorithms and degrees of freedom are by now well understood. In statistical parlance, the Lasso uses an l1 penalty, which forces some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large; thus the Lasso yields sparse models that involve only a subset of the variables, performing variable selection, and it has been widely used in statistics and machine learning. In model (3), if z = 0, we recover the regularized matrix regression model studied in [5].
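As a concrete illustration, the objective in (3) can be evaluated directly from its ingredients. The following is an illustrative sketch, not an algorithm from the paper; the function name and data layout are our own:

```python
import numpy as np

def mixed_objective(y, Xs, G, B, z, lam1, lam2):
    """Objective of the mixed matrix regression model (3):
    (1/2) * sum_i (y_i - gamma_i' z - <B, X_i>)^2 + lam1*||B||_* + lam2*||z||_1,
    where <B, X_i> is the sum of elementwise products (trace inner product).
    Xs: array of n prediction matrices; G: n x p0 matrix of prediction vectors."""
    resid = np.array([yi - gi @ z - np.sum(B * Xi)
                      for yi, gi, Xi in zip(y, G, Xs)])
    nuclear = np.linalg.svd(B, compute_uv=False).sum()  # sum of singular values
    l1 = np.abs(z).sum()
    return 0.5 * resid @ resid + lam1 * nuclear + lam2 * l1
```

With B = 0 and z = 0 the objective reduces to (1/2)‖y‖_2^2, a simple sanity check.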
Now we review some basic results on the degrees of freedom. Based on Stein's unbiased estimation, Efron et al. [20] showed that the effective degrees of freedom of any fitting procedure δ have a rigorous definition under a differentiability condition on the estimate ŷ of y based on δ, where y = (y_1, y_2, …, y_n) denotes the response vector. That is, given a method δ, let ŷ = δ(y) denote its fit. Then, under differentiability of ŷ, the degrees of freedom of δ are given by
(4) df(ŷ) = tr{D_ŷ(y)}.
This means that the degrees of freedom of δ are the trace of the Jacobian matrix, which is a special case of Definition 1. Once we have the degrees of freedom, we can form the three well-known information criteria Cp, AIC, and BIC under normal noise. That is,
(5) Cp = ‖y − ŷ‖_2^2/(nσ^2) + 2df/n,  AIC = ‖y − ŷ‖_2^2/σ^2 + 2df,  BIC = ‖y − ŷ‖_2^2/σ^2 + ln(n)·df.
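For instance, given a fit and a df estimate, the criteria in (5) can be computed as follows. This is a sketch assuming Gaussian noise with known variance σ²; the function name is ours:

```python
import numpy as np

def selection_criteria(y, y_hat, df, sigma2):
    """Compute C_p, AIC, and BIC as in (5), given the response y, the
    fitted values y_hat, the degrees of freedom df, and the (known)
    noise variance sigma2."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)       # ||y - y_hat||_2^2
    cp = rss / (n * sigma2) + 2.0 * df / n
    aic = rss / sigma2 + 2.0 * df
    bic = rss / sigma2 + np.log(n) * df
    return cp, aic, bic
```

In practice one evaluates these criteria over a grid of tuning parameters and selects the minimizer.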
In order to deal with the degrees of freedom of the mixed matrix regression model, we define an operator to simplify the expression of the optimization problem and the Jacobian matrix of a matrix function.
Definition 1.
Suppose there is a matrix function f:
(6) f: ℝ^{p×q} → ℝ^{s×t},  N ↦ M = f(N).
Then one defines the Jacobian matrix as D_M(N) = ∂vec(M)/∂vec^T(N).
Suppose M = (m_{ij}) ∈ ℝ^{s×t} and N = (n_{ij}) ∈ ℝ^{p×q}. We vectorize a matrix into a vector by stacking its columns; for example, vec(M) = (m_{11}, …, m_{s1}, …, m_{1t}, …, m_{st})^T. Then the Jacobian matrix of f can be written as the (st)×(pq) matrix
(7) D_M(N) = [ ∂m_{11}/∂n_{11}, ∂m_{11}/∂n_{21}, …, ∂m_{11}/∂n_{p1}, …, ∂m_{11}/∂n_{1q}, …, ∂m_{11}/∂n_{pq} ; ∂m_{21}/∂n_{11}, ∂m_{21}/∂n_{21}, …, ∂m_{21}/∂n_{p1}, …, ∂m_{21}/∂n_{1q}, …, ∂m_{21}/∂n_{pq} ; ⋮ ; ∂m_{st}/∂n_{11}, ∂m_{st}/∂n_{21}, …, ∂m_{st}/∂n_{p1}, …, ∂m_{st}/∂n_{1q}, …, ∂m_{st}/∂n_{pq} ],
whose rows are indexed by the entries of vec(M) and columns by the entries of vec(N).
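Definition 1 can be checked numerically by central differences. The sketch below is illustrative (names ours); for a linear map M = CN it should recover D_M(N) = I_q ⊗ C, since vec(CN) = (I_q ⊗ C)vec(N):

```python
import numpy as np

def jacobian(f, N, s, t, eps=1e-6):
    """Numerical Jacobian D_M(N) = d vec(M) / d vec(N)^T of Definition 1,
    with vec stacking columns (column-major / Fortran order).
    f maps a p x q matrix N to an s x t matrix M."""
    p, q = N.shape
    J = np.zeros((s * t, p * q))
    for k in range(p * q):
        E = np.zeros(p * q)
        E[k] = eps
        dN = E.reshape((p, q), order="F")   # perturb one entry of vec(N)
        J[:, k] = (f(N + dN) - f(N - dN)).flatten(order="F") / (2 * eps)
    return J
```

Such a finite-difference check is a convenient way to validate analytic Jacobian formulas like those used in the Appendix.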
Definition 2.
Let the operator ⋆ be defined from ℝ^{k×mn} × ℝ^{m×n} to ℝ^k by
(8) X ⋆ Y = X vec(Y).
It is easy to verify that the operator ⋆ is linear and A(X⋆Y)=(AX)⋆Y.
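These two properties are easy to check numerically. A minimal sketch (vec must be column-major to match the convention above; the function name is ours):

```python
import numpy as np

def star(X, Y):
    """The operator of Definition 2: X * Y = X @ vec(Y), where vec stacks
    the columns of Y (column-major order)."""
    return X @ Y.flatten(order="F")
```

Linearity in Y and the identity A(X ⋆ Y) = (AX) ⋆ Y both follow immediately from linearity of matrix multiplication.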
Let χ = (vec^T(X_1); …; vec^T(X_n)) ∈ ℝ^{n×p_1p_2}, γ = (γ_1^T; …; γ_n^T) ∈ ℝ^{n×p_0}, and y = (y_1, …, y_n)^T. Then we can rewrite the mixed matrix regression model (3) as
(9) min_{B,z} (1/2)‖y − χ⋆B − γ⋆z‖_2^2 + λ_1‖B‖_* + λ_2‖z‖_1.
Let B = (B, z) denote the unknown coefficients, and let A = (χ, γ) denote the prediction matrix. Our paper is based on the assumptions that A^TA has full rank and that the matrix data and vector data are independent, that is, χ^Tγ = 0.
3. The Unbiased Estimate of the Degrees of Freedom
We begin with the least squares estimate of our mixed matrix regression, which is the optimal solution of the problem
(10) min (1/2)‖y − χ⋆B − γ⋆z‖_2^2.
Setting the partial derivatives of this minimization problem to zero, we obtain
(11) −χ^T(y − χ⋆B̂ − γ⋆ẑ) = 0,  −γ^T(y − χ⋆B̂ − γ⋆ẑ) = 0.
From the definitions in Section 2, we can easily verify that D_{χ⋆B}(B) = χ and D_{γ⋆z}(z) = γ. From the relationship ŷ = χ⋆B̂ + γ⋆ẑ, we obtain
(12) (D_ŷ(B̂), D_ŷ(ẑ)) = (χ, γ).
Differentiating the implicit functions above with respect to y, we get
(13) −χ^T + χ^Tχ D_{B̂_LS}(y) + χ^Tγ D_{ẑ_LS}(y) = 0,  −γ^T + γ^Tχ D_{B̂_LS}(y) + γ^Tγ D_{ẑ_LS}(y) = 0.
Thus, writing B̂_LS for the combined least squares estimate (B̂_LS, ẑ_LS), we derive A^TA D_{B̂_LS}(y) = A^T. If A^TA is a full rank matrix, we get D_{B̂_LS}(y) = (A^TA)^{-1}A^T.
Based on the definition of the degrees of freedom, we know that if the fit ŷ is differentiable in y, then df̂ = tr{D_ŷ(y)} is an unbiased estimate of the degrees of freedom. Combining the chain rule with the Jacobian matrix of the fitted values with respect to the responses, we get
(14) df̂ = tr{D_ŷ(y)} = tr{D_ŷ(B̂) D_B̂(B̂_LS) D_{B̂_LS}(y)},
where, with a slight abuse of notation, B̂ here denotes the combined estimate (B̂, ẑ) and B̂_LS the combined least squares estimate. Together with the arguments above, this gives
(15) df̂ = tr{A D_B̂(B̂_LS)(A^TA)^{-1}A^T} = tr{D_B̂(B̂_LS)(A^TA)^{-1}A^TA} = tr{D_B̂(B̂_LS)}.
Because the Jacobian of the combined estimate with respect to the combined least squares estimate has the block form with blocks D_B̂(B̂_LS), D_B̂(ẑ_LS), D_ẑ(B̂_LS), and D_ẑ(ẑ_LS), it is easy to see that
(16) df̂ = tr{D_B̂(B̂_LS)} + tr{D_ẑ(ẑ_LS)},
where now B̂ and ẑ again denote the matrix and vector parts separately.
We are ready to present our main result in this section.
Theorem 3.
Let B̂_LS be the usual least squares estimate of B and assume that it has distinct positive singular values σ_1 > σ_2 > ⋯ > σ_p > 0, with the convention σ_j = 0 for j > p = min{p_1, p_2}. Then the unbiased estimate of the degrees of freedom of model (9) is
(17) df̂ = ‖ẑ‖_0 + ∑_{i=1}^p 1{σ_i > λ_1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2)],
where ẑ is the estimate of z and ‖ẑ‖_0 is the number of nonzero elements of ẑ. Clearly, df = E(df̂) is the degrees of freedom of mixed matrix regression.
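For illustration, formula (17) can be evaluated directly from ẑ and the singular values of B̂_LS. This is a sketch under the theorem's assumptions (distinct positive singular values); the function name and argument layout are ours:

```python
import numpy as np

def df_hat(sing_vals, lam1, p1, p2, z_hat=None):
    """Unbiased df estimate (17): ||z_hat||_0 plus, for each singular value
    sigma_i > lam1 of the p1 x p2 least squares estimate B_hat_LS, the term
    1 + sum_{j!=i}^{p1} s_i(s_i-lam1)/(s_i^2-s_j^2)
      + sum_{j!=i}^{p2} s_i(s_i-lam1)/(s_i^2-s_j^2),
    with the convention sigma_j = 0 for j > p = min(p1, p2)."""
    p = min(p1, p2)
    s = np.zeros(max(p1, p2))
    s[:p] = np.asarray(sing_vals, dtype=float)[:p]   # pad with zeros beyond p
    df = 0.0 if z_hat is None else float(np.count_nonzero(z_hat))
    for i in range(p):
        si = s[i]
        if si <= lam1:
            continue
        term = 1.0
        for j in range(p1):
            if j != i:
                term += si * (si - lam1) / (si ** 2 - s[j] ** 2)
        for j in range(p2):
            if j != i:
                term += si * (si - lam1) / (si ** 2 - s[j] ** 2)
        df += term
    return df
```

A useful sanity check: with λ_1 = 0 the matrix part of the estimate reduces to the full parameter count p_1 p_2, the degrees of freedom of the unregularized least squares fit.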
Theorem 3 is an immediate result of the following two propositions whose proofs are relegated to the Appendix for the sake of presentation.
Proposition 4.
For any λ_1 ≥ 0, the unbiased estimate of the degrees of freedom of the regularized matrix regression model equals tr{D_B̂(B̂_LS)}, given by
(18) tr{D_B̂(B̂_LS)} = ∑_{i=1}^p 1{σ_i > λ_1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2)],
where B̂_LS is the usual least squares estimate of B, assumed to have distinct positive singular values σ_1 > σ_2 > ⋯ > σ_p > 0 (with σ_j = 0 for j > p).
Proposition 5.
For any λ_2 ≥ 0, the unbiased estimate of the degrees of freedom of the Lasso equals tr{D_ẑ(ẑ_LS)}, given by
(19) tr{D_ẑ(ẑ_LS)} = ‖ẑ‖_0.
4. Multivariate Regression with Nuclear-Norm Regularization
This section considers multivariate regression, which has the statistical model
(20) Y = XB + E,
where Y = (y_1, y_2, …, y_n)^T is an n×q response matrix, X = (x_1, x_2, …, x_n)^T is an n×p prediction matrix, B is a p×q unknown coefficient matrix, and the regression random noise satisfies vec(E) ~ N(0, τ^2 I_n ⊗ I_q).
Very recently, Yuan [19] studied the degrees of freedom of multivariate regression with a low rank constraint via the optimization model
(21) min_{B, rank(B)≤k} ‖Y − XB‖_F^2.
Since the above problem with the low rank constraint is computationally difficult (in general NP-hard), the rank constraint is usually relaxed to nuclear-norm regularization. This yields the nuclear-norm regularized multivariate regression model
(22) min_B ‖Y − XB‖_F^2 + λ‖B‖_*.
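Because the relaxation (22) is convex, it can be solved, for example, by proximal gradient descent, where each proximal step soft-thresholds the singular values. The following is a minimal sketch under our own choice of step size and iteration count, not the paper's algorithm:

```python
import numpy as np

def svt(A, lam):
    """Singular value thresholding: soft-threshold the singular values of A."""
    U, a, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(a - lam, 0.0)) @ Vt

def nuclear_norm_regression(Y, X, lam, n_iter=500):
    """Proximal gradient sketch for min_B ||Y - XB||_F^2 + lam*||B||_* (22).
    The smooth part has gradient 2 X^T (XB - Y) with Lipschitz constant 2L,
    L = largest eigenvalue of X^T X, so we use step size 1/(2L)."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    step = 1.0 / (2.0 * L)
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ B - Y)
        B = svt(B - step * grad, step * lam)   # prox of step*lam*||.||_*
    return B
```

With λ = 0 the iteration is plain gradient descent and converges to the least squares solution, a convenient correctness check.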
Following the same technique as in the proof of Theorem 3, we can easily obtain the degrees of freedom of the nuclear-norm regularization multivariate regression. We omit its proof for brevity.
Theorem 6.
Assume that rank(X^TX) = p in (22). Let B̂_LS be the usual least squares estimate and assume that it has distinct positive singular values σ_1 > σ_2 > ⋯ > σ_r > 0, where r = min{p, q}. With the convention σ_i = 0 for i > r, the following expression is an unbiased estimate of the degrees of freedom of the regularized fit (22):
(23) df̂(λ) = ∑_{i=1}^r 1{σ_i > λ}[1 + ∑_{j=1, j≠i}^{p} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{q} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2)].
Thus df = E(df̂(λ)) is the degrees of freedom of nuclear-norm regularized multivariate regression.
5. Consistency of the Unbiased Estimate
The consistency of an estimate is important because it implies that the estimate converges to the true value in probability. Suppose the quantity to be estimated is T(X); we use statistical methods to get an estimate T̂_n(X), which is a function of the sample size. If T̂_n(X) is a consistent estimate of T(X), then as the sample size increases T̂_n(X) converges to T(X) in probability. That is, for any ε > 0,
(24) lim_{n→∞} P(|T̂_n(X) − T(X)| < ε) = 1.
In this section, we prove the consistent property of the estimates of the degrees of freedom given in the former sections. We will first prove the consistency of the unbiased estimate df^ of regularized matrix regression. To do so, we need the following proposition on the continuous property of df^.
Proposition 7.
An unbiased estimate of the degrees of freedom of the regularized matrix regression model is
(25) df̂ = ∑_{i=1}^p 1{σ_i > λ}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2)],
where the σ_i are the singular values of the least squares estimate. As a function of λ, df̂ is continuous precisely on {λ : λ ≠ σ_i, i = 1, 2, …, p}.
Proof.
For any λ ∈ (σ_m, σ_{m−1}), we know that λ < σ_{m−1} < σ_{m−2} < ⋯ < σ_1. So the degrees of freedom of the regularized matrix regression model can be written as
(26) df̂ = ∑_{i=1}^{m−1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ)/(σ_i^2 − σ_j^2)].
It is obvious that df̂ is a linear function of λ on (σ_m, σ_{m−1}). Thus, df̂ is continuous on {λ : λ ≠ σ_i, i = 1, 2, …, p}.
We next prove that df̂ is not continuous at the points {σ_i, i = 1, 2, …, p}. If λ ∈ [σ_m, σ_{m−1}) and λ → σ_m^+, so that λ < σ_{m−1} < σ_{m−2} < ⋯ < σ_1, we obtain
(27) lim_{λ→σ_m^+} df̂ = ∑_{i=1}^{m−1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2)].
If λ ∈ (σ_{m+1}, σ_m) and λ → σ_m^−, so that λ < σ_m < σ_{m−1} < ⋯ < σ_1, we have
(28) lim_{λ→σ_m^−} df̂ = ∑_{i=1}^{m}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2)]
 = ∑_{i=1}^{m−1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2)] + 1 + ∑_{j=1, j≠m}^{p_1} σ_m(σ_m − σ_m)/(σ_m^2 − σ_j^2) + ∑_{j=1, j≠m}^{p_2} σ_m(σ_m − σ_m)/(σ_m^2 − σ_j^2)
 = ∑_{i=1}^{m−1}[1 + ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2) + ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − σ_m)/(σ_i^2 − σ_j^2)] + 1.
Therefore, lim_{λ→σ_m^−} df̂ = lim_{λ→σ_m^+} df̂ + 1. Clearly, df̂ is not continuous at {σ_i, i = 1, 2, …, p}.
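This unit jump can also be seen numerically by evaluating df̂(λ) just below and just above a singular value. The sketch below (names ours) implements formula (25) and measures the jump across σ_m = 2:

```python
import numpy as np

def df_reg(sing_vals, lam, p1, p2):
    """df-hat of (25) for regularized matrix regression: sum over
    sigma_i > lam of 1 + sum_{j!=i}^{p1} ... + sum_{j!=i}^{p2} ...,
    with sigma_j = 0 for j > p = min(p1, p2)."""
    p = min(p1, p2)
    s = np.zeros(max(p1, p2))
    s[:p] = np.asarray(sing_vals, dtype=float)[:p]
    total = 0.0
    for i in range(p):
        si = s[i]
        if si <= lam:
            continue
        total += 1.0
        for j in range(p1):
            if j != i:
                total += si * (si - lam) / (si ** 2 - s[j] ** 2)
        for j in range(p2):
            if j != i:
                total += si * (si - lam) / (si ** 2 - s[j] ** 2)
    return total

# Jump of size ~1 as lam crosses the singular value sigma_m = 2:
eps = 1e-8
sv = [3.0, 2.0, 1.0]
jump = df_reg(sv, 2.0 - eps, 3, 3) - df_reg(sv, 2.0 + eps, 3, 3)
```

The continuous part of df̂ changes by O(ε) across the gap, so the measured jump is 1 up to numerical error, matching the proof above.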
Now, we show the unbiased estimate df^ is consistent to the true degrees of freedom.
Theorem 8.
Suppose the σ_i are the singular values of the least squares estimate of the regularized matrix regression model, and λ_n^* → λ^* > 0, where λ^* does not equal any of the singular values, that is, λ^* ≠ σ_i for i = 1, 2, …, p. Then df̂(λ_n^*) − df(λ_n^*) → 0 in probability.
Proof.
By assumption and Proposition 7, df̂ is a continuous function on {λ : λ ≠ σ_i, i = 1, 2, …, p}. If the sequence λ_n^* satisfies λ_n^* → λ^* with λ^* ≠ σ_i for all i = 1, 2, …, p, the continuous mapping theorem implies that df̂(λ_n^*) → df̂(λ^*) in probability. By the dominated convergence theorem, we get
(29) df(λ_n^*) = E[df̂(λ_n^*)] → E[df̂(λ^*)] = df(λ^*).
Hence, df̂(λ_n^*) − df(λ_n^*) → 0 in probability.
Notice that, for the vector case, Zou et al. [14] not only gave the unbiased estimate of the degrees of freedom of the Lasso model, but also proved the following consistency of the estimate.
Proposition 9.
For the Lasso model, if λn∗/n→λ∗>0 with λ∗ being a nontransition point, df^(λn∗)-df(λn∗)→0 in probability.
Based on Theorems 3 and 8 and Proposition 9, we can easily obtain the following theorem.
Theorem 10.
If (λ1n,λ2n/n)→(λ1∗,λ2∗)>0, where λ1∗ and λ2∗ satisfy the assumptions in Theorem 8 and Proposition 9, then, df^(λ1n,λ2n/n)-df(λ1n,λ2n/n)→0 in probability.
6. Conclusions
In this paper, we mainly obtain the degrees of freedom of the mixed matrix regression model and prove that the resulting estimates of the degrees of freedom are consistent. Note that our results are given under the assumption that the matrix and vector predictors are independent. However, if they are not independent but linearly or nonlinearly related, or if the number of samples is less than the number of variables, what is the analytical form of the degrees of freedom? We leave this as a future research topic.
Appendix
In this part, we give the proofs of Propositions 4 and 5. We first give the proof of Proposition 4. To do so, we need the following results. See [5] for more details.
Proposition A.1.
For a given matrix A with singular value decomposition A = U diag(a) V^T, let f∘σ(B) denote a function of the singular values of B. The optimal solution to
(A.1) min_B (1/2)‖B − A‖_F^2 + f∘σ(B)
shares the same singular vectors as A, and its ordered singular values are the solution to
(A.2) min_b (1/2)‖b − a‖_2^2 + f(b).
An immediate consequence of the above proposition is the well-known singular value thresholding formula for nuclear-norm regularization.
Corollary A.2.
For a given matrix A with singular value decomposition A = U diag(a) V^T, the optimal solution to
(A.3) min_B (1/2)‖B − A‖_F^2 + λ‖B‖_*
shares the same singular vectors as A, and its singular values are b_i = (a_i − λ)_+.
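Corollary A.2 gives a one-line algorithm, singular value thresholding. A minimal sketch (function name ours):

```python
import numpy as np

def svt(A, lam):
    """Singular value thresholding: the minimizer of
    (1/2)||B - A||_F^2 + lam*||B||_*  (Corollary A.2)."""
    U, a, Vt = np.linalg.svd(A, full_matrices=False)
    b = np.maximum(a - lam, 0.0)   # b_i = (a_i - lam)_+
    return U @ np.diag(b) @ Vt
```

The output keeps the singular vectors of A and soft-thresholds its singular values, so its objective value is no larger than at A itself or at the zero matrix.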
Before proving Proposition 4, we also need some lemmas.
Lemma A.3.
Suppose that B̂_LS has singular value decomposition B̂_LS = UΣV^T = ∑_{i=1}^p σ_i u_i v_i^T, p = min{p_1, p_2}. Then the estimate is B̂ = UΣ_{λ_1}V^T, where Σ_{λ_1} has diagonal entries (σ_i − λ_1)_+.
Proof.
According to Corollary A.2, we just need to show that, up to terms not depending on B,
(A.4) ‖y − χ⋆B‖_2^2 = ‖B − B_LS‖_F^2.
Note that, for any matrix A, ‖A‖_F^2 = ⟨vec(A), vec(A)⟩. Thus, writing vec(B_LS) = χ^T y, we get
(A.5) ‖B − B_LS‖_F^2 = ⟨vec(B) − vec(B_LS), vec(B) − vec(B_LS)⟩ = ⟨vec(B) − χ^T y, vec(B) − χ^T y⟩ = ‖B‖_F^2 − 2⟨vec(B), χ^T y⟩ + ‖y‖_2^2.
Direct calculation yields ⟨vec(B), χ^T y⟩ = vec(B)^T χ^T y = (χ vec(B))^T y = (χ⋆B)^T y = ⟨χ⋆B, y⟩ = ⟨y, χ⋆B⟩. We then derive
(A.6) ‖y − χ⋆B‖_2^2 = ⟨y − χ⋆B, y − χ⋆B⟩ = ‖y‖_2^2 − 2⟨y, χ⋆B⟩ + ‖B‖_F^2,
where the identities ‖χ⋆B‖_2^2 = ‖B‖_F^2 and ‖χ^T y‖_2^2 = ‖y‖_2^2 use the orthonormality of χ (as in [5]).
Lemma A.4.
One has
(A.7) tr{D_B̂(v_i) D_{v_i}(B̂_LS)} = 1{σ_i > λ_1} ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2),
 tr{D_B̂(u_i) D_{u_i}(B̂_LS)} = 1{σ_i > λ_1} ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2).
Proof.
Since B̂_LS = UΣV^T, the eigenvectors of the symmetric matrix B̂_LS^T B̂_LS = VΣ^2V^T coincide with the right singular vectors of B̂_LS. Then, by the chain rule,
(A.8) D_B̂(v_i) D_{v_i}(B̂_LS) = D_B̂(v_i) D_{v_i}(B̂_LS^T B̂_LS) D_{B̂_LS^T B̂_LS}(B̂_LS).
Now D_B̂(v_i) = (σ_i − λ_1) 1{σ_i > λ_1} I_{p_2} ⊗ u_i.
By the well-known formula for the differential of an eigenvector, D_{v_i}(B̂_LS^T B̂_LS) = v_i^T ⊗ (σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+, where C^+ denotes the Moore-Penrose generalized inverse of a matrix C.
The Jacobian matrix of the symmetric product is D_{B̂_LS^T B̂_LS}(B̂_LS) = (I_{p_2^2} + K_{p_2 p_2})(I_{p_2} ⊗ B̂_LS^T), where K_{p_2 p_2} is the commutation matrix.
Now, by the cyclic permutation invariance of the trace function, we have
(A.9) tr{1{σ_i>λ_1}(σ_i − λ_1)(I_{p_2} ⊗ u_i)(v_i^T ⊗ (σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+) I_{p_2^2}(I_{p_2} ⊗ B̂_LS^T)} = 1{σ_i>λ_1}(σ_i − λ_1) tr{v_i^T ⊗ (σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+ B̂_LS^T u_i} = 1{σ_i>λ_1} σ_i(σ_i − λ_1) tr{v_i^T ⊗ 0_{p_2}} = 0,
since B̂_LS^T u_i = σ_i v_i and v_i lies in the null space of (σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+. Then,
(A.10) tr{(σ_i − λ_1)1{σ_i>λ_1}(I_{p_2} ⊗ u_i)(v_i^T ⊗ (σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+) K_{p_2 p_2}(I_{p_2} ⊗ B̂_LS^T)}
 = (σ_i − λ_1)1{σ_i>λ_1} tr{(σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+ ⊗ u_i v_i^T B̂_LS^T}
 = 1{σ_i>λ_1} σ_i(σ_i − λ_1) tr{(σ_i^2 I_{p_2} − B̂_LS^T B̂_LS)^+ ⊗ u_i u_i^T}
 = 1{σ_i>λ_1} ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2) tr{u_i u_i^T}
 = 1{σ_i>λ_1} ∑_{j=1, j≠i}^{p_2} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2).
By symmetry, we also have
(A.11) tr{D_B̂(u_i) D_{u_i}(B̂_LS)} = 1{σ_i>λ_1} ∑_{j=1, j≠i}^{p_1} σ_i(σ_i − λ_1)/(σ_i^2 − σ_j^2).
Lemma A.5.
One has
(A.12) tr{D_B̂(σ_i) D_{σ_i}(B̂_LS)} = 1{σ_i > λ_1}.
Proof.
As in the proof of Lemma A.4, we use the fact that σ_i is the positive square root of the eigenvalue η_i of the symmetric matrix B̂_LS^T B̂_LS. Then, by the chain rule and the Jacobian matrix of the fitted values with respect to the responses,
(A.13) D_B̂(σ_i) D_{σ_i}(B̂_LS) = D_B̂(σ_i) D_{σ_i}(η_i) D_{η_i}(B̂_LS^T B̂_LS) D_{B̂_LS^T B̂_LS}(B̂_LS).
Combining D_B̂(σ_i) = 1{σ_i > λ_1} v_i ⊗ u_i, D_{σ_i}(η_i) = 1/(2√η_i) = 1/(2σ_i), D_{η_i}(B̂_LS^T B̂_LS) = v_i^T ⊗ v_i^T, and
(A.14) D_{B̂_LS^T B̂_LS}(B̂_LS) = (I_{p_2^2} + K_{p_2 p_2})(I_{p_2} ⊗ B̂_LS^T),
we obtain
(A.15) tr{D_B̂(σ_i) D_{σ_i}(B̂_LS)} = 1{σ_i>λ_1}(1/(2σ_i)) tr{(v_i ⊗ u_i)(v_i^T ⊗ v_i^T) I_{p_2^2}(I_{p_2} ⊗ B̂_LS^T)} + 1{σ_i>λ_1}(1/(2σ_i)) tr{(v_i ⊗ u_i)(v_i^T ⊗ v_i^T) K_{p_2 p_2}(I_{p_2} ⊗ B̂_LS^T)}
 = 1{σ_i>λ_1}(1/(2σ_i)) tr{v_i v_i^T ⊗ u_i v_i^T B̂_LS^T} + 1{σ_i>λ_1}(1/(2σ_i)) tr{v_i v_i^T ⊗ u_i v_i^T B̂_LS^T}
 = 1{σ_i>λ_1}(1/σ_i) tr{σ_i v_i v_i^T ⊗ u_i u_i^T} = 1{σ_i>λ_1}.
Proof of Proposition 4.
We only need to show that the optimal B̂ of our model is the solution to the following problem:
(A.16) min_B (1/2)‖y − χ⋆B‖_2^2 + λ_1‖B‖_*.
The least squares estimate of B in model (9) solves
(A.17) min_B (1/2)‖y − γ⋆z − χ⋆B‖_2^2,
so vec(B̂_LS) = (χ^Tχ)^{-1}[χ^T(y − γ⋆z)] = (χ^Tχ)^{-1}χ^T y by the independence assumption χ^Tγ = 0. It is interesting to see that B̂_LS does not depend on γ and can be obtained from the following model:
(A.18) min_B ‖y − χ⋆B‖_2^2.
Thus, by Lemma A.3, we have
(A.19) tr{D_B̂(B̂_LS)} = tr{∑_{i=1}^p [D_B̂(v_i) D_{v_i}(B̂_LS) + D_B̂(u_i) D_{u_i}(B̂_LS) + D_B̂(σ_i) D_{σ_i}(B̂_LS)]}.
Lemmas A.4 and A.5 then yield the desired conclusion.
It is worth noting that Zou et al. [14] showed that the degrees of freedom of the Lasso fit are df(λ) = E|B_λ|, where B_λ is the active set of the Lasso coefficient estimate β̂. Thus, we know that df̂ = ‖β̂‖_0 is an unbiased estimate of the degrees of freedom.
Proof of Proposition 5.
As mentioned in Section 3, under a differentiability condition on ŷ(λ), df̂ = tr{D_ŷ(y)} is an unbiased estimate of the degrees of freedom. By the chain rule,
(A.20) df̂ = tr{D_ŷ(y)} = tr{D_ŷ(β̂) D_β̂(β̂_LS) D_{β̂_LS}(y)}.
Because ŷ = Xβ̂, we get D_ŷ(β̂) = X. The usual least squares estimate for the Lasso model is defined by
(A.21) min_β ‖y − Xβ‖_2^2,
so β̂_LS = (X^TX)^{-1}X^T y; if X^TX = I, then D_{β̂_LS}(y) = X^T. Hence
(A.22) df̂ = tr{X D_β̂(β̂_LS) X^T} = tr{X^TX D_β̂(β̂_LS)} = tr{D_β̂(β̂_LS)}.
So we get
(A.23) df̂ = tr{D_β̂(β̂_LS)} = ‖β̂‖_0.
In the mixed case, under our assumptions, the optimal ẑ is the solution of
(A.24) min_z ‖y − γ⋆z‖_2^2 + λ_2‖z‖_1,
which does not depend on χ. Thus, in a similar way, we easily obtain
(A.25) tr{D_ẑ(ẑ_LS)} = ‖ẑ‖_0.
The proof is completed.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the Fundamental Research Funds for the Central Universities (2017JBM323) and the National Natural Science Foundation of China (11671029).
References
[1] Pete B. and Sara V., Springer, 2011.
[2] S. Negahban and M. J. Wainwright, "Estimation of (near) low-rank matrices with noise and high-dimensional scaling," The Annals of Statistics, vol. 39, no. 2, pp. 1069–1097, 2011.
[3] Y. Li, W. Zhang, and X. Liu, "Stability of nonlinear stochastic discrete-time systems," Article ID 356746, 2013, doi:10.1155/2013/356746.
[4] X. Liu, Y. Li, and W. Zhang, "Stochastic linear quadratic optimal control with constraint for discrete-time systems," vol. 228, pp. 264–270, 2014, doi:10.1016/j.amc.2013.09.036.
[5] H. Zhou and L. Li, "Regularized matrix regression," Journal of the Royal Statistical Society, Series B, vol. 76, no. 2, pp. 463–483, 2014.
[6] Y. Zhao and W. Zhang, "Observer-based controller design for singular stochastic Markov jump systems with state dependent noise," vol. 29, pp. 946–958, 2016.
[7] H. Ma and Y. Jia, "Stability analysis for stochastic differential equations with infinite Markovian switchings," vol. 435, no. 1, pp. 593–605, 2016, doi:10.1016/j.jmaa.2015.10.047.
[8] R. Tibshirani, "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267–288, 1996.
[9] C. L. Mallows, "Some comments on Cp," Technometrics, vol. 15, no. 4, pp. 661–675, 1973.
[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," pp. 267–281, Springer, New York, NY, USA, 1973.
[11] B. Efron, "The estimation of prediction error: covariance penalties and cross-validation," Journal of the American Statistical Association, vol. 99, no. 467, pp. 619–642, 2004.
[12] C. M. Stein, "Estimation of the mean of a multivariate normal distribution," The Annals of Statistics, vol. 9, no. 6, pp. 1135–1151, 1981.
[13] T. Hastie and R. Tibshirani, Chapman & Hall, New York, NY, USA, 1990.
[14] H. Zou, T. Hastie, and R. Tibshirani, "On the 'degrees of freedom' of the lasso," The Annals of Statistics, vol. 35, no. 5, pp. 2173–2192, 2007.
[15] J. Ye, "On measuring and correcting the effects of data mining and model selection," Journal of the American Statistical Association, vol. 93, no. 441, pp. 120–131, 1998.
[16] X. Shen and J. Ye, "Adaptive model selection," Journal of the American Statistical Association, vol. 97, no. 457, pp. 210–221, 2002.
[17] R. J. Tibshirani and J. Taylor, "Degrees of freedom in lasso problems," The Annals of Statistics, vol. 40, no. 2, pp. 1198–1232, 2012.
[18] R. J. Tibshirani and J. Taylor, "The solution path of the generalized lasso," The Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, 2011.
[19] M. Yuan, "Degrees of freedom in low rank matrix estimation," Science China Mathematics, vol. 59, no. 12, pp. 2485–2502, 2016.
[20] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.