Journal of Probability and Statistics
Hindawi Publishing Corporation, 2009. ISSN 1687-9538, 1687-952X. doi: 10.1155/2009/373572

Research Article

Data Depth Trimming Counterpart of the Classical t (or T2) Procedure

Yijun Zuo and Zhidong Bai
Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA

Copyright © 2009. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The classical t (or T2 in high dimensions) inference procedure for an unknown mean μ, namely X̅ ± tα(n−1)s/√n (or {μ : n(x̅−μ)′S⁻¹(x̅−μ) ≤ χ²α(p)}), is so fundamental in statistics and so prevalent in practice that it is regarded as an optimal procedure in the minds of many practitioners. In this manuscript we present a new procedure, based on data depth trimming and bootstrapping, that can outperform the classical t (or T2 in high dimensions) confidence interval (or region) procedure.

1. Introduction

Let Xn := {X1, …, Xn} be a random sample from a distribution F with an unknown mean parameter μ. The most prevalent procedure for estimating μ is the classical t confidence interval: a 100(1−2α)% confidence interval (CI) for μ and large n is

X̅ ± tα(n−1)s/√n, (1.1)

where X̅ = (1/n)∑ᵢ₌₁ⁿ Xᵢ is the standard sample mean, s = √((1/(n−1))∑ᵢ₌₁ⁿ (Xᵢ − X̅)²) is the standard sample deviation, and tr(N) is the upper rth quantile of the t distribution with N degrees of freedom. The rule of thumb in most textbooks for the sample size n is: if n < 15, do not use the t procedure; if 15 ≤ n ≤ 40, do not use it when outliers are present; use it if n > 40. The procedure is based on large-sample theory and the central limit theorem, so it is not exact but an approximation, valid for large sample size n and an arbitrary population distribution.
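As a concrete illustration, procedure (1.1) can be sketched in a few lines of Python. This is our own minimal sketch, not part of the paper; for large n the normal quantile is used as a stand-in for tα(n−1) when no critical value is supplied.

```python
import math
import statistics

def t_interval(x, alpha=0.025, t_crit=None):
    """Classical 100(1 - 2*alpha)% CI: xbar +/- t_alpha(n-1) * s / sqrt(n).

    For large n the t quantile is close to the normal quantile, which is
    used as a stand-in when t_crit is not supplied."""
    n = len(x)
    xbar = statistics.mean(x)
    s = statistics.stdev(x)  # sample standard deviation (n - 1 denominator)
    if t_crit is None:
        t_crit = statistics.NormalDist().inv_cdf(1 - alpha)  # ~1.96 for alpha = 0.025
    half = t_crit * s / math.sqrt(n)
    return (xbar - half, xbar + half)
```

For an exact small-sample interval one would supply the tα(n−1) critical value from a table or a statistics library.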

In higher dimensions, the counterpart of procedure (1.1) is the celebrated Hotelling's T2 procedure: a 100(1−α)% confidence region for the unknown mean vector μ and large n is the region

{μ : n(x̅−μ)′S⁻¹(x̅−μ) ≤ χ²α(p)}, (1.2)

where S is the sample covariance matrix and χ²α(p) is the upper αth quantile of the χ² distribution with p degrees of freedom.
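A hedged sketch of the membership check in (1.2) for the bivariate case (p = 2), using an explicit 2x2 inverse; this is our own illustration, and the χ² critical value (e.g., 5.991 for the upper 5% point of χ²(2), a standard table value) must be supplied by the caller:

```python
def in_t2_region(mu, xbar, S, n, chi2_crit):
    """True if mu lies in { mu : n (xbar-mu)' S^{-1} (xbar-mu) <= chi2_crit }
    for p = 2; S is the 2x2 sample covariance matrix ((a, b), (c, d))."""
    d0, d1 = xbar[0] - mu[0], xbar[1] - mu[1]
    (a, b), (c, d) = S
    det = a * d - b * c
    # quadratic form (xbar-mu)' S^{-1} (xbar-mu) via the 2x2 adjugate of S
    q = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    return n * q <= chi2_crit
```

With S the identity, the statistic reduces to n times the squared Euclidean distance between x̅ and μ, which makes the check easy to verify by hand.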

Procedures (1.1) and (1.2) are so prevalent in practice that, in many practitioners' minds, they are regarded as optimal and unbeatable procedures. Are they really unbeatable? In this manuscript we introduce a new procedure that can outperform these seemingly optimal procedures.

The rest of the paper is organized as follows. Section 2 introduces the new procedure and Section 3 conducts some simulation studies. The paper ends in Section 4 with some concluding remarks.

2. A New Procedure for the Unknown μ

2.1. A Univariate Location Estimator

It is well known that the sample mean in the t procedure is extremely sensitive to outliers, heavy-tailed distributions, and contamination. The procedure is therefore not robust. Naturally, one would replace the sample mean with a robust counterpart. We will utilize a special univariate location estimator μ̂ to replace the sample mean X̅ in the t procedure.

Now we consider a special univariate "projection depth-trimmed mean" (PTMβ) of Xn in ℝ¹, β > 0 (see Wu and Zuo [1]; see Zuo [2] for a multidimensional PTMβ):

μ̂(Xn) := PTMβ(Xn) = (1/k) ∑ᵢ₌₁ᵏ Xⱼᵢ,

where j1, …, jk are distinct indices from {1, …, n} such that PD(Xⱼᵢ, Xn) ≥ β for some β > 0, i = 1, …, k, and

PD(Xi, Xn) := 1/(1 + |Xi − Med(Xn)|/MAD(Xn)),

where Med(Xn) is the standard sample median and MAD(Xn) := Med{|Xi − Med(Xn)|, i = 1, …, n} is the standard median of absolute deviations (see Zuo and Serfling [3] and Zuo [4] for the study of PD in high dimensions).
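The definitions above translate directly to code. The following is a minimal univariate sketch of our own (not the authors' implementation):

```python
import statistics

def proj_depth(x, sample):
    """PD(x, Xn) = 1 / (1 + |x - Med(Xn)| / MAD(Xn)) for a univariate sample."""
    med = statistics.median(sample)
    mad = statistics.median([abs(v - med) for v in sample])
    return 1.0 / (1.0 + abs(x - med) / mad)

def ptm(sample, beta):
    """Projection depth-trimmed mean: average of the points with PD >= beta."""
    kept = [v for v in sample if proj_depth(v, sample) >= beta]
    return statistics.mean(kept)
```

For the sample {1, 2, 3, 4, 100} with β = 0.3, the outlier 100 has depth about 0.01 and is trimmed away, so PTMβ = 2.5, while the ordinary sample mean is 22.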

Let Fn be the empirical distribution based on Xn, which places mass 1/n at each point Xi, i = 1, …, n. We sometimes write Fn (for Xn) for convenience. Let F be the distribution of Xi. Replacing Xn with F in the above definition, we obtain the population version. For example, the population version of PTMβ for F on ℝ¹ is

PTMβ(F) = ∫_{PD(x,F)>β} x dF(x) / ∫_{PD(x,F)>β} dF(x).

2.2. The New Procedure

Let Xn* = {X1*, …, Xn*} be a random sample from the empirical distribution Fn; it is often called a bootstrap sample. Let Ym := {Xn1*, …, Xnm*} be m such bootstrap samples from Fn.

We calculate yj := PTMβ(Xnj*) for j = 1, …, m. Next we calculate the depth of each yj with respect to ym := {y1, …, ym}, namely PD(yj, ym), and then order the yj by their depth from smallest to largest: y(1), …, y(m), where PD(y(1), ym) ≤ ⋯ ≤ PD(y(m), ym).

Finally, we simply delete the ⌊2αm⌋ points of smallest depth from y(1), …, y(m). The interval (or, in high dimensions, the closed convex hull) formed by y(⌊2αm⌋+1), …, y(m) is our 100(1−2α)% confidence interval for μ, where ⌊·⌋ is the floor function.
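The three steps above can be sketched end to end as follows. This is a self-contained illustration under our own naming, not the authors' code; the defaults β = 0.078 and α = 0.025 match the simulation settings used later in the paper.

```python
import random
import statistics

def ptm(sample, beta):
    """Projection depth-trimmed mean of Section 2.1."""
    med = statistics.median(sample)
    mad = statistics.median([abs(v - med) for v in sample])
    kept = [v for v in sample if 1.0 / (1.0 + abs(v - med) / mad) >= beta]
    return statistics.mean(kept)

def depth_trimmed_ci(x, beta=0.078, alpha=0.025, m=500, seed=0):
    """Bootstrap m PTM_beta estimates, order them by projection depth,
    drop the floor(2*alpha*m) shallowest, and return the range of the rest."""
    rng = random.Random(seed)
    n = len(x)
    # step 1: PTM_beta of each bootstrap sample
    ys = [ptm([x[rng.randrange(n)] for _ in range(n)], beta) for _ in range(m)]
    # step 2: order the m estimates by their (univariate) projection depth
    med = statistics.median(ys)
    mad = statistics.median([abs(y - med) for y in ys])
    ys.sort(key=lambda y: 1.0 / (1.0 + abs(y - med) / mad))  # depth ascending
    # step 3: trim the floor(2*alpha*m) shallowest points
    kept = ys[int(2 * alpha * m):]
    return (min(kept), max(kept))
```

In one dimension the shallowest points are exactly the most extreme estimates, so the trimming removes the outer 2α fraction of the bootstrap estimates.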

3. Simulation Study

Now we conduct a simulation study to examine the performance of the new and the classical t (or T2) procedures, based on 2000 (replication) samples from various distributions F (including N(0,1), t(3), and others). We set α = 0.025 and β = 0.078 and consider the combinations of n = 100 with the bootstrap numbers m = 300, 500, 1000, and 2000.

We confine attention to the average length (or area) of the confidence intervals (or regions) from both procedures, as well as their coverage frequency of the true parameter μ (assumed to be the mean of F), which ideally should be close to 95%. If both procedures reach the nominal 95% level, then the one with the shorter (or smaller) confidence interval (or region) on average is obviously better.

3.1. One Dimension

Table 1 lists the simulation results at the normal and t(3) distributions.

Table 1. Average coverage (length) of 95% CIs by t and PTMβ (n = 100).

m       method   N(0,1)          t(3)
300     PTMβ     .9390 (.3819)   .9515 (.5855)
        t        .9550 (.3967)   .9605 (.6607)
500     PTMβ     .9470 (.3842)   .9505 (.5902)
        t        .9520 (.3956)   .9520 (.6618)
1000    PTMβ     .9410 (.3864)   .9470 (.5916)
        t        .9490 (.3959)   .9540 (.6582)
2000    PTMβ     .9435 (.3876)   .9550 (.5943)
        t        .9480 (.3960)   .9545 (.6589)

Inspecting the table immediately reveals that the bootstrap number m affects the average coverage of the new procedure: as m increases, the coverage gets closer to the nominal 95% level, while the average interval length grows slightly. Of course, m does not affect the t procedure, which involves no bootstrapping. Overall, both procedures are indeed (roughly) 95% procedures, and the new one produces an interval on average about 2%-3% shorter than that of the classical t procedure even in the N(0,1) case, and 12%-13% shorter in the t(3) case.

Figure 1 displays typical single-run results of the two procedures based on 100 sample points from N(0,1) (a) and t(3) (b). Even in the N(0,1) case, the new procedure outperforms the classical t procedure, with a 95% confidence interval [-0.1422550, 0.2463009] about 10% shorter than the t interval [-0.1644447, 0.2661397]; both cover the target parameter μ = 0. In the t(3) case, the new procedure produces an interval [-0.0874566, 0.357108] about 60% shorter than the t interval [-0.1644486, 0.9482213]. Both cover the target parameter μ = 0.

Figure 1. 95% confidence intervals by the t (red) and new (blue) procedures for the mean of F, based on 100 sample points from F: (a) F = N(0,1); (b) F = t(3).

In our simulation studies, we also compared our new procedure with the existing bootstrap percentile confidence procedure (which orders the means of the m bootstrap samples and trims the upper and lower ⌈αm⌉ points; the remaining points form the bootstrap percentile confidence interval, where ⌈·⌉ is the ceiling function). Our new procedure outperforms this one as well, although the percentile procedure does perform better than the classical t procedure in terms of average interval length at the same confidence level.
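For comparison, the bootstrap percentile interval described above can be sketched as follows (illustrative code of our own, with the ceiling trimming as in the text):

```python
import math
import random
import statistics

def percentile_ci(x, alpha=0.025, m=500, seed=0):
    """Order the m bootstrap sample means and trim the ceil(alpha*m)
    smallest and largest; the remaining points span the percentile CI."""
    rng = random.Random(seed)
    n = len(x)
    means = sorted(statistics.mean([x[rng.randrange(n)] for _ in range(n)])
                   for _ in range(m))
    k = math.ceil(alpha * m)
    kept = means[k:m - k]
    return (kept[0], kept[-1])
```

Unlike the depth-trimmed procedure, this one orders plain bootstrap means by value, which is why it has no direct multivariate analogue without a depth notion.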

Our experiments with n also reveal that small n (the common situation in practice) favors our new procedure. Note that this is exactly the case where it is difficult to determine whether the data are close to normal, and hence whether one may use the classical CI. This is what we expected, since the classical CI is based on a normal F (or on large-sample theory for large n). It does not follow, however, that the classical CI has an edge over the new procedure at a really large sample size n (say, 10,000), even in the perfect N(0,1) case.

In addition to the distributions considered in Table 1, we also conducted simulation studies comparing the performance of the new and classical t procedures at the contaminated normal model (1−ϵ)N(0,1) + ϵN(μ,σ²) with different choices of ϵ and (μ,σ²), since in practice there is never a pure (exact) N(0,1); we may have just a slight departure from the pure normal or some contamination. Our results reveal that the new procedure is overwhelmingly more robust than the classical t. This is what we would expect, since the t procedure depends on the sample mean, which is notorious for its extreme sensitivity to outliers and contamination. We also compared the performance of the two procedures at the Cauchy distribution, since the sample mean x̅ performs extremely well at symmetric light-tailed distributions such as N(0,1) but not at heavy-tailed ones such as the Cauchy distribution.
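Data from the contaminated model can be generated with a sketch like the following (our own helper, not from the paper; the defaults match the (1−ϵ)N(0,1) + ϵN(1.5, 0.1²) setting used below):

```python
import random

def contaminated_normal(n, eps=0.05, mu=1.5, sigma=0.1, seed=0):
    """Draw n points: with probability eps from N(mu, sigma^2),
    otherwise from N(0, 1). The mixture mean is eps * mu."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) if rng.random() < eps else rng.gauss(0.0, 1.0)
            for _ in range(n)]
```

With the defaults the population mean is 0.05 x 1.5 = 0.075, the target parameter quoted in the discussion of Figure 2.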

We first display the typical single run results of 95% confidence intervals in Figure 2 to demonstrate the difference between the two procedures.

Figure 2. 95% CIs by the t (red) and new (blue) procedures for the mean of F, based on 100 sample points from F: (a) F = 0.95N(0,1) + 0.05N(1.5, 0.1²); (b) F = t(1).

Here in Figure 2, on the left-hand side are the 95% CIs by t (red) and by our new procedure (blue) at the model 0.95N(0,1) + 0.05N(1.5, 0.1²), with the t interval [-0.09723137, 0.3047054] and the new-procedure interval [-0.06083767, 0.2763990], which is about 16% shorter than that of t. These intervals estimate the mean parameter μ, which in this case is ϵ·μ = 0.075. Both intervals cover this unknown parameter.

On the right-hand side are the 95% CIs by t (red) and by the new procedure (blue) at the Cauchy distribution, with the t interval [-19.02593, 6.527279] and the new-procedure interval [-0.8909354, 0.5884936], which is about 94% shorter.

Of course, single-run results may not represent the overall performance of the two procedures, so we conduct a simulation with 2000 replications. The results are listed in Table 2.

Table 2. Average coverage (length) of 95% CIs by t and PTMβ (n = 100).

m       method   .95N(0,1)+.05N(1.5,0.1²)   Cauchy
300     PTMβ     .9455 (.3922)              .9585 (1.135)
        t        .9595 (.4070)              .9760 (57.76)
500     PTMβ     .9525 (.3953)              .9765 (1.115)
        t        .9590 (.4081)              .9825 (20.10)
1000    PTMβ     .9530 (.3972)              .9715 (1.164)
        t        .9600 (.4073)              .9775 (31.87)
2000    PTMβ     .9485 (.3994)              .9725 (1.168)
        t        .9525 (.4077)              .9830 (25.61)

Inspecting the table immediately reveals that the classical t procedure becomes useless in the heavy-tailed Cauchy case: it exceeds the nominal 95% level, reaching about 98%, with an extremely wide and no longer informative confidence interval. At the same time, the new procedure roughly attains the nominal 95% level (it is about 97%) and provides a meaningful estimate of the underlying unknown parameter. We also list the results for the contaminated model, a pure N(0,1) with just 5% contamination from a normal distribution centered at 1.5 with small variance 0.01. Under such a realistic scenario, the classical t 95% procedure again falls short: it never reaches exactly the nominal 95% level, behaving roughly as a 96% procedure with an interval slightly longer than that of the new procedure, while the new procedure remains a reasonable 95% procedure with an interval on average 2%-4% shorter than the t interval.

3.2. Higher Dimensions

In higher dimensions, with the multivariate versions of PTM and PD (see Zuo [2] and Zuo [4]), it is straightforward to extend the new procedure described in Section 2. That is, with the m bootstrap samples Ym = {Xn1*, …, Xnm*} we calculate yj := PTMβ(Xnj*) for j = 1, …, m. We then calculate the projection depth of each yj with respect to ym := {y1, …, ym}, namely PD(yj, ym), and order the yj by their depth from smallest to largest: y(1), …, y(m), where PD(y(1), ym) ≤ ⋯ ≤ PD(y(m), ym). The final step is the same as before: trim the ⌊2αm⌋ points of smallest depth from the yj; the remaining points form a convex hull, which is our 100(1−2α)% confidence region for μ. We examine the performance of this procedure and of the classical Hotelling's T2 procedure given in (1.2) in terms of the average area of their confidence regions as well as their coverage frequency of the true parameter μ (assumed to be the mean of F). The latter ideally should be close to 95%. If both procedures reach the nominal 95% level, then the one with the smaller average region area is obviously better.
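Exact bivariate projection depth involves a supremum over all directions (computed exactly in the ExPD2D package). As a crude, self-contained stand-in for that exact computation, one can replace the sup with a max over random directions; the following sketch is our own approximation, not the package's algorithm:

```python
import math
import random
import statistics

def proj_depth_2d(x, sample, ndir=200, seed=0):
    """Approximate PD(x) = 1 / (1 + O(x)), where
    O(x) = sup_u |u'x - Med(u'X)| / MAD(u'X) over unit directions u;
    the sup is replaced by a max over ndir random directions."""
    rng = random.Random(seed)
    out = 0.0
    for _ in range(ndir):
        t = rng.uniform(0.0, math.pi)
        u = (math.cos(t), math.sin(t))
        proj = [u[0] * p[0] + u[1] * p[1] for p in sample]
        med = statistics.median(proj)
        mad = statistics.median([abs(v - med) for v in proj])
        if mad > 0:
            out = max(out, abs(u[0] * x[0] + u[1] * x[1] - med) / mad)
    return 1.0 / (1.0 + out)
```

Ordering the bootstrap PTM vectors y1, …, ym by this depth mirrors the univariate ordering step; the random-direction max only underestimates the outlyingness, so it overestimates the depth.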

We first display single-run results of the two procedures at the bivariate standard normal distribution N2(0,1) and the bivariate t distribution with 3 degrees of freedom, t2(3), in Figure 3.

Figure 3. 95% confidence regions by the T2 (red) and new (blue) procedures for the mean of F, based on 100 sample points from F: (a) F = N2(0,1); (b) F = t2(3).

Of course, single-run results may not represent the overall performance of the two procedures. To see whether the single-run results are repeatable, we list the average coverage and area of the confidence regions of the two procedures over 2000 replications in Table 3. Here we set β = 0.1, n = 100, and α = 0.025.

Table 3. Average coverage (area) of 95% confidence regions by T2 and PTMβ (n = 100).

m       method   N2(0,1)         t2(3)
300     PTMβ     .9501 (.1492)   .9515 (.3045)
        T2       .9524 (.1935)   .9495 (.5137)
500     PTMβ     .9395 (.1582)   .9577 (.3247)
        T2       .9515 (.1947)   .9507 (.5204)
1000    PTMβ     .9436 (.1672)   .9475 (.3438)
        T2       .9547 (.1949)   .9353 (.5111)
2000    PTMβ     .9403 (.1736)   .9586 (.3554)
        T2       .9470 (.1949)   .9488 (.5202)

Inspecting the table reveals that the two procedures are indeed (roughly) 95% confidence procedures, so it makes sense to compare their average region areas. The table entries show that the new procedure produces a confidence region on average 11%-22% smaller in area than that of the classical Hotelling's T2 procedure, even at N2(0,1). This becomes 32%-40% in the t2(3) case.

4. Concluding Remarks

From the last section we see that the new procedure has some advantages over the classical (seemingly optimal) procedures. But these advantages do not come for free. What price must we pay? The price is the intensive computation required to implement the procedure. In our simulation study there are 4 million basic operations (for the case n = 100, replications R = 2000, and bootstrap number m = 2000). Computing data depth in two or more dimensions is very challenging. Fortunately, an R package (called ExPD2D) for the exact computation of the projection depth of bivariate data has already been developed by Zuo and Ye [5] and is now part of CRAN. For high-dimensional computation, see Zuo [6]. In one dimension the computation is straightforward: one can compute the sample median in linear time (i.e., with worst-case time complexity O(n)) by employing a special selection technique (see any computer science algorithms textbook); for further discussion of the properties of the related remedian, see H. Chen and Z. Chen [7]. Fortunately, in practice only one replication is needed. Moreover, with the everlasting advance of computing power, the computational burden should not be an excuse for not using a better procedure.
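The linear-time median computation mentioned above can be sketched with randomized quickselect, which runs in expected O(n) time (the deterministic median-of-medians variant achieves worst-case O(n)); this is a generic illustration, not code from the paper:

```python
import random

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of list a, expected O(n)."""
    if len(a) == 1:
        return a[0]
    p = random.choice(a)
    lows = [v for v in a if v < p]
    pivots = [v for v in a if v == p]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return p
    return quickselect([v for v in a if v > p], k - len(lows) - len(pivots))

def median_linear(a):
    """Sample median via selection rather than a full O(n log n) sort."""
    a, n = list(a), len(a)
    if n % 2:
        return quickselect(a, n // 2)
    return 0.5 * (quickselect(a, n // 2 - 1) + quickselect(a, n // 2))
```

Since MAD is a median of absolute deviations, the same selection routine also yields MAD, and hence PD, in linear time per evaluation.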

A natural question is: why does the new procedure have an advantage over the classical one? The procedure clearly depends on both the bootstrap and data depth. Which is the main contributor? If one uses only the bootstrap, does one gain anything? The answer to the latter question is yes. Indeed, in our simulations we compared the classical procedure with the bootstrap percentile procedure; the percentile procedure does have a mild advantage over the classical one but is still inferior to our new procedure. So both the bootstrap and data depth contribute to the advantages of the new procedure. But note that it is data depth that makes the bootstrap percentile procedure (originally defined only in one dimension) implementable in high dimensions, by ordering the bootstrap sample mean vectors. Without data depth, it would be impossible to implement the procedure in high dimensions. Overall, then, it is data depth that makes the major contribution to the advantages of the new procedure.

We would also like to point out that a different new procedure is introduced and studied in Zuo [8], where a depth-weighted mean is used instead of the depth-trimmed mean of our current procedure. Our simulation studies indicate, however, that the current procedure is superior to the one in Zuo [8], which confines attention mainly to one dimension.

Our empirical evidence for the new procedure in one and higher dimensions is very promising, but theoretical development and justification are still needed; these are beyond the scope of this paper and will be pursued elsewhere. A heuristic argument is that the bootstrap percentile confidence interval has an advantage over the classical confidence interval procedure in that, at the same nominal level, it can produce an asymptotically shorter interval (see Hall [9] and Falk and Kaufmann [10]). But the classical bootstrap percentile interval procedure is limited to one dimension; here we use data depth to order high-dimensional estimators so that the procedure extends to high dimensions, and the advantage of the bootstrap percentile confidence interval carries over to high dimensions.

One practical question remains: how does one choose the value of β? There are at least two ways to deal with this. First, one can choose a fixed value; our empirical experience indicates that a value between 0.01 and 0.1 serves most purposes. Second, one can choose β dynamically by minimizing some objective function, which could be the interval length in our simulation setting or the variance in an efficiency evaluation. With such a data-dependent β, a natural question arises: does the theory in Zuo, established for a fixed constant β, still hold? Fortunately, everything still holds if we employ the more powerful tools of empirical process theory, from Pollard [11] or van der Vaart and Wellner [12], to handle the data-dependent β.

There are a number of depth functions and related depth estimators (see Tukey [13], Liu [14], Zuo and Serfling [3], and Bai and He [15]), but among them the projection depth function used here is our favorite (see Zuo [4, 16]). Furthermore, although the computation of depth functions is in general very challenging, algorithms are at hand for the projection depth function; this is yet another reason for choosing the projection depth function in this paper.

Finally, we comment that the findings in this paper are consistent with the results obtained by Bai and Saranadasa (BS) [17], which demonstrate the effect of high dimension: there are better procedures than classical inference procedures such as Hotelling's T2, which is inferior to alternatives such as Dempster's nonexact test (Dempster [18]) and the test proposed by BS, even for moderately large dimensions and sample sizes.

Acknowledgment

This research was partially supported by NSF Grants DMS-0234078 and DMS-0501174.

References

[1] M. Wu and Y. Zuo, "Trimmed and Winsorized means based on a scaled deviation," Journal of Statistical Planning and Inference, vol. 139, no. 2, pp. 350–365, 2009.
[2] Y. Zuo, "Multi-dimensional trimming based on projection depth," The Annals of Statistics, vol. 34, no. 5, pp. 2211–2251, 2006.
[3] Y. Zuo and R. Serfling, "General notions of statistical depth function," The Annals of Statistics, vol. 28, no. 2, pp. 461–482, 2000.
[4] Y. Zuo, "Projection-based depth functions and associated medians," The Annals of Statistics, vol. 31, no. 5, pp. 1460–1490, 2003.
[5] Y. Zuo and X. Ye, "ExPD2D: exact computation of bivariate projection depth based on Fortran code," R package version 1.0.1, 2009, http://CRAN.R-project.org/package=ExPD2D.
[6] Y. Zuo, "Exact computation of the bivariate projection depth and Stahel-Donoho estimator," accepted to Computational Statistics & Data Analysis.
[7] H. Chen and Z. Chen, "Asymptotic properties of the remedian," Journal of Nonparametric Statistics, vol. 17, no. 2, pp. 155–165, 2005.
[8] Y. Zuo, "Is the t procedure x̄ ± t_{1−α/2}(n−1)s/√n optimal?," accepted to The American Statistician.
[9] P. Hall, "Theoretical comparison of bootstrap confidence intervals," The Annals of Statistics, vol. 16, no. 3, pp. 927–985, 1988.
[10] M. Falk and E. Kaufmann, "Coverage probabilities of bootstrap-confidence intervals for quantiles," The Annals of Statistics, vol. 19, no. 1, pp. 485–495, 1991.
[11] D. Pollard, Convergence of Stochastic Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1984.
[12] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes with Applications to Statistics, Springer Series in Statistics, Springer, New York, NY, USA, 1996.
[13] J. W. Tukey, "Mathematics and the picturing of data," in Proceedings of the International Congress of Mathematicians, vol. 2, pp. 523–531, Vancouver, Canada, August 1974.
[14] R. Y. Liu, "On a notion of data depth based on random simplices," The Annals of Statistics, vol. 18, no. 1, pp. 405–414, 1990.
[15] Z.-D. Bai and X. He, "Asymptotic distributions of the maximal depth estimators for regression and multivariate location," The Annals of Statistics, vol. 27, no. 5, pp. 1616–1637, 1999.
[16] Y. Zuo, "Robustness of weighted L^p-depth and L^p-median," Allgemeines Statistisches Archiv, vol. 88, no. 2, pp. 215–234, 2004.
[17] Z. Bai and H. Saranadasa, "Effect of high dimension: by an example of a two sample problem," Statistica Sinica, vol. 6, no. 2, pp. 311–329, 1996.
[18] A. P. Dempster, "A high dimensional two sample significance test," Annals of Mathematical Statistics, vol. 29, pp. 995–1010, 1958.