Weighted Wilcoxon-type Rank Test for Interval Censored Data

Interval censored (IC) failure time data are often observed in medical follow-up studies and clinical trials, where subjects can only be followed periodically and the failure time can only be known to lie in an interval. In this paper, we propose a weighted Wilcoxon-type rank test for the problem of comparing two IC samples. Under a very general sampling scheme developed by Fay (1999), the mean and variance of the test statistic under the null hypothesis can be derived. Through simulation studies, we find that the proposed test performs better than the two existing Wilcoxon-type rank tests proposed by Mantel (1967) and R. Peto and J. Peto (1972). The proposed test is illustrated with data on patients from an AIDS cohort study.


Introduction
Interval censored (IC) failure time data often arise from medical studies such as AIDS cohort studies and leukemic blood cancer follow-up studies. In these studies, patients are divided into two groups according to different treatments. For example, in leukemic cancer studies, one group of patients is treated with radiotherapy alone, and the other group is treated with initial radiotherapy along with adjuvant chemotherapy. The two groups of patients are examined every month, the failure time of interest is the time until the appearance of leukemia retraction, and the objective is to test the difference in failure times between the two treatments. Some patients miss successive scheduled examinations and come back later with a changed clinical status; they contribute IC observations. For convenience, we assume that in such a medical study the underlying survival function can be either discrete or continuous and that there are only finitely many scheduled examination times. IC data provide only partial information about the lifetime of a subject and are thus a kind of incomplete data. To deal with such incomplete data, Turnbull [1] introduced a self-consistent algorithm to compute the maximum likelihood estimate of the survival function for arbitrarily censored and truncated data. For IC data, there have been related studies in the literature as well. For example, Mantel [2] extended Gehan's [3, 4] generalized Wilcoxon [5] test to interval censored data, and R. Peto and J. Peto [6] developed a different version. Sun [7] applied Turnbull's algorithm to estimate the numbers of failures and at-risk subjects for IC data and then proposed a log-rank type test.
For the purpose of comparing the power of test statistics, Fay [8] proposed a model for generating interval censored observations. A similar selection scheme can also be seen in the urn model of Lee [16] and the mixed case model of Schick and Yu [17]. In this paper, we propose a Wilcoxon-type weighted rank test and compare it with the two existing Wilcoxon-type rank tests proposed by Mantel [2] and R. Peto and J. Peto [6]. We restrict ourselves to Wilcoxon-type rank tests because these tests are simple to use and have the robustness property that their powers are fairly stable under different lifetime distributions.
This paper is organized as follows. In Section 2, we review Turnbull's [1] algorithm and introduce Fay's [8] selection model for generating interval censored data. This selection model can be extended to a more general one, and the consistency property can be found in Schick and Yu [17]. In Section 3, we introduce Mantel's [2] and R. Peto and J. Peto's [6] generalized Wilcoxon-type rank tests and propose our weighted rank test. In Section 4, a simulation study is conducted to compare the performance of the three tests under different configurations. Finally, an application to an AIDS cohort study is presented in Section 5.

Data Treatment
Assume that T is the lifetime random variable of a survival study, measured in discrete units and taking values t_1 < t_2 < ⋯ < t_m. Turnbull [1] proposed an algorithm to estimate the unknown probabilities p = (p_1, p_2, ..., p_m), where p_j = P(T = t_j). The algorithm can be described by the following four steps.
Step 1. Choose an initial estimate p^(0), for example the uniform distribution p_j^(0) = 1/m.

Step 2. For each observation, distribute its unit mass over the support points contained in its interval, proportionally to the current estimate.

Step 3. Update each p_j as the average, over all observations, of the mass assigned to t_j.

Step 4. Stop when the required accuracy has been achieved; otherwise return to Step 2.
The algorithm is simple and converges fairly rapidly. The estimate p̂ = (p̂_1, p̂_2, ..., p̂_m) yielded by the iteration is in fact the unique maximum likelihood estimate of p = (p_1, p_2, ..., p_m) and is a self-consistent estimate.
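As a concrete illustration, the four-step self-consistent iteration can be sketched as follows. This is our sketch, not the paper's code: the support grid, the toy observations, and the function name `turnbull` are illustrative choices.

```python
import numpy as np

def turnbull(intervals, support, tol=1e-8, max_iter=1000):
    """Self-consistent (EM) estimate of p = (p_1, ..., p_m) on a fixed
    support grid from interval-censored observations (L_i, R_i]."""
    n, m = len(intervals), len(support)
    # alpha[i, j] = 1 if support point t_j lies in (L_i, R_i]
    alpha = np.array([[1.0 if L < t <= R else 0.0 for t in support]
                      for L, R in intervals])
    p = np.full(m, 1.0 / m)                  # Step 1: initial guess
    for _ in range(max_iter):
        denom = alpha @ p                    # P(T in (L_i, R_i]) under current p
        mu = alpha * p / denom[:, None]      # Step 2: expected membership of each obs.
        p_new = mu.sum(axis=0) / n           # Step 3: update the probabilities
        if np.max(np.abs(p_new - p)) < tol:  # Step 4: stop at required accuracy
            return p_new
        p = p_new
    return p

# toy example: three support points, five interval observations
obs = [(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)]
p_hat = turnbull(obs, support=[1, 2, 3])
print(p_hat, p_hat.sum())
```

Because each observation's mass is redistributed but never lost, the estimate sums to one at every iteration.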

Return Probability Model
To comply with periodic clinical inspection, Fay [8] proposed a simulation model for generating IC data. He assumed that the indicators of a patient returning to the clinic for inspection at the time points t_1, t_2, ..., t_{m−1} are i.i.d. Bernoulli random variables A_1, A_2, ..., A_{m−1}; that is, P(A_j = 1) = ρ and P(A_j = 0) = 1 − ρ, 0 < ρ < 1, j = 1, 2, ..., m − 1. Here A_j = 1 means that the patient returned to the clinic at the inspection time t_j, and A_j = 0 means that the patient missed the inspection. In our model, we always assume that A_m = 1. The failure time T is independent of (A_1, A_2, ..., A_{m−1}), and the observable random interval runs from the last attended inspection time before T to the first attended inspection time at or after T.

Model Consistency. Under Fay's [8] selection model, the consistency property has been proved. This selection model can be generalized to the case in which the return probability may differ across examination time points, say P(A_j = 1) = ρ_j, j = 1, 2, ..., m. To demonstrate the generalized return model, we set m = 3 and t_1 = 1, t_2 = 2, and t_3 = 3. The selection probabilities for all admissible intervals are shown in Tables 1 and 2. It is not difficult to see that the selection probability of the interval I = (t_i, t_j] is

q{(t_i, t_j]} = (p_{i+1} + p_{i+2} + ⋯ + p_j) ρ_i (1 − ρ_{i+1}) ⋯ (1 − ρ_{j−1}) ρ_j,

where t_0 = 0 and ρ_0 = 1. For instance, the interval (0, 2] may be selected under two scenarios. First, the true value of T is T = 1, and the patient misses the inspection at t_1 = 1 and then attends the inspection at t_2 = 2; in this case, the interval is selected with probability p_1(1 − ρ_1)ρ_2. Second, the true value of T is T = 2, and the patient misses the inspection at t_1 = 1 and then attends the inspection at t_2 = 2; in this case, the interval is selected with probability p_2(1 − ρ_1)ρ_2. Therefore, q{(0, 2]} = (p_1 + p_2)(1 − ρ_1)ρ_2. The generalized return probability model can be viewed as a special case of the mixed case model of Schick and Yu [17]; under very mild conditions, the estimate of p = (p_1, p_2, ..., p_m) computed by Turnbull's algorithm remains consistent.
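A minimal sketch of generating one IC observation under Fay's return model with i.i.d. Bernoulli(ρ) attendance and A_m = 1. The exam times are taken as 1, ..., m for simplicity; the function name `gen_ic_interval` and the toy pmf are our illustrative choices.

```python
import random

def gen_ic_interval(m, rho, pmf, rng=random):
    """One interval-censored observation: exam times are 1..m, the final
    exam is always attended, and pmf gives P(T = j) for j = 1..m."""
    # draw the (latent) discrete failure time T from pmf
    u, cum, T = rng.random(), 0.0, m
    for j, pj in enumerate(pmf, start=1):
        cum += pj
        if u <= cum:
            T = j
            break
    # Bernoulli(rho) attendance at exams 1..m-1; exam m always attended
    attended = [j for j in range(1, m) if rng.random() < rho] + [m]
    L = max([0] + [j for j in attended if j < T])   # last attended exam before T
    R = min(j for j in attended if j >= T)          # first attended exam at/after T
    return (L, R)

random.seed(1)
sample = [gen_ic_interval(m=3, rho=0.5, pmf=[1/3, 1/3, 1/3]) for _ in range(5)]
print(sample)
```

Since the final exam is always attended, every generated interval is bounded, matching the assumption A_m = 1 above.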

Wilcoxon-Type Rank Tests for Interval Censored Data
The two-sample Wilcoxon rank test is a well-known method for testing whether two samples of exact data come from the same population. The method is constructed by ranking the pooled samples and assigning an appropriate rank to each observation. However, this ranking technique is in general not applicable to intervals. In this section, we discuss how to generalize the ranking technique and then propose a Wilcoxon-type rank test for IC data, to be compared with the two existing rank tests proposed by Mantel [2] and R. Peto and J. Peto [6]. Suppose that the two samples of IC data for X and Y are, respectively, (X_{Li}, X_{Ri}], i = 1, 2, ..., n_1 and (Y_{Lj}, Y_{Rj}], j = 1, 2, ..., n_2. Testing whether these two samples come from the same population is equivalent to testing the equality of the survival functions S_X(t) and S_Y(t) for all t ≥ 0; that is,

H_0: S_X(t) = S_Y(t), ∀ t ≥ 0. (5)

Mantel's Test. Mantel [2] extended Gehan's [3, 4] generalized Wilcoxon [5] test to interval censored data by defining the score of the ith observation as the number of observations that are definitely greater than the ith observation minus the number of observations that are definitely less than the ith observation. He proposed a test statistic based on these scores; under H_0, it is approximately normally distributed with mean 0.

R. Peto and J. Peto's Test. In contrast to Mantel's generalized version, R. Peto and J. Peto [6] defined the score of the ith observation as c_i = Ŝ(L_i) + Ŝ(R_i) − 1, where Ŝ is the estimated survival function; that is, the score is φ(u) = 2u − 1 applied to the average of Ŝ(L_i) and Ŝ(R_i). They proposed a chi-squared test statistic based on these scores; under H_0, the statistic χ² is approximately distributed as χ²_1.
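Mantel's score can be computed directly from the interval endpoints. In this sketch we take "(L_j, R_j] is definitely greater than (L_i, R_i]" to mean L_j ≥ R_i, i.e., the two half-open intervals cannot overlap; the function name and toy data are ours.

```python
def mantel_scores(intervals):
    """Mantel's score for each observation (L_i, R_i]:
    (# observations definitely greater) - (# observations definitely less),
    where (L_j, R_j] is definitely greater than (L_i, R_i] when L_j >= R_i."""
    scores = []
    for Li, Ri in intervals:
        greater = sum(1 for Lj, Rj in intervals if Lj >= Ri)
        less = sum(1 for Lj, Rj in intervals if Rj <= Li)
        scores.append(greater - less)
    return scores

# pooled toy sample of four interval observations
obs = [(0, 1), (1, 3), (2, 4), (0, 4)]
print(mantel_scores(obs))  # -> [2, -1, -1, 0]
```

Note that, by symmetry of the "definitely greater/less" relation, the scores always sum to zero over the pooled sample.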

Our Weighted Rank Test
Let W̄_1 and W̄_2 be, respectively, the average weighted ranks of the X and Y samples. To test whether the two IC samples come from the same population, we propose the standardized test statistic W.R.T based on the difference W̄_1 − W̄_2. Under H_0, the central limit theorem implies that W.R.T is approximately distributed as a standard normal random variable. However, the means and variances of W̄_1 and W̄_2 may depend on the probability space on which they are defined; that is, different selection probabilities for the IC intervals in (4) lead to different means and variances of W̄_1 and W̄_2. We therefore only consider the selection model of Fay defined in Section 2.2, in which the selection probability of an IC interval falls into one of two categories, given in (15) and (16). Consider the probability space (Ω, 2^Ω, q), where Ω is the set of admissible intervals and the probability measure q is defined in Section 2. To compute the variance of W̄_1 and W̄_2, we define a random variable W on this space by assigning the value W{(t_i, t_j]} to the interval (t_i, t_j] in Ω. The value W{(t_i, t_j]} can be viewed as the weighted rank of (t_i, t_j]. If the weights W_k, k = 1, 2, ..., m, are chosen as in the Wilcoxon test for exact data, then the proposed statistic W.R.T is a Wilcoxon-type weighted rank test. Under this probability space, the expectation E(W) can be simplified as in the following theorem.
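The paper's exact W.R.T formula is not reproduced above, so the following is only a schematic of the general construction: a difference of the two samples' average scores standardized by its permutation standard error. The function name, the standardization, and the toy scores are our illustrative choices, not the authors' derived mean and variance.

```python
import math

def wrt_statistic(scores_x, scores_y):
    """Schematic standardized two-sample statistic: the difference of the
    samples' average scores divided by its permutation standard error.
    The per-observation 'weighted rank' scores are supplied by the caller."""
    n1, n2 = len(scores_x), len(scores_y)
    pooled = list(scores_x) + list(scores_y)
    n = n1 + n2
    mean = sum(pooled) / n
    var = sum((s - mean) ** 2 for s in pooled) / (n - 1)  # pooled sample variance
    se = math.sqrt(var * (1.0 / n1 + 1.0 / n2))           # permutation std. error
    return (sum(scores_x) / n1 - sum(scores_y) / n2) / se

# toy scores for the two samples (illustrative numbers only)
print(wrt_statistic([2, -1], [-1, 0]))
```

Under H_0 such a standardized difference is compared with standard normal quantiles, mirroring the asymptotic normality claimed for W.R.T.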
Theorem 1. Suppose that W is the random variable defined on the probability space (Ω, 2^Ω, q) according to (17). Then the expectation of W can be simplified as

E(W) = ∑_{j=1}^m W_j p_j,

which is independent of the choice of ρ.
Proof. It is obvious that E(W) can be written as E(W) = ∑_{j=1}^m c_j W_j p_j, where the coefficients c_j, j = 1, 2, ..., m, are to be determined. The theorem is therefore proved if we can show that all the coefficients c_j equal one.
Summing the contributions of the two categories in (15) and (16) shows that the coefficient of p_j is one for each j < m, and the case j = m follows by a similar computation, which completes the proof.

The variance of W is

Var(W) = ∑_{I_k ∈ Ω} q(I_k) W(I_k)² − [E(W)]²,

where q(I_k) and W(I_k) are the selection probability and the weighted rank of the kth admissible interval I_k ∈ Ω, respectively.
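Theorem 1 can be checked numerically. This sketch assumes the selection-probability form given in Section 2.2 (with ρ_0 = 1 and the last exam always attended) and takes the weighted rank of an interval to be the conditional average of the W_k of the support points it contains; both are our reading of the paper's definitions, so this is a consistency check of that reading rather than the authors' code.

```python
def q_interval(i, j, p, rho):
    """Selection probability of (t_i, t_j] under the generalized return
    model: attend t_i, miss t_{i+1}..t_{j-1}, attend t_j (rho_0 = 1)."""
    rho_full = [1.0] + list(rho)            # rho_full[k] = rho_k, rho_0 = 1
    mass = sum(p[k - 1] for k in range(i + 1, j + 1))
    miss = 1.0
    for k in range(i + 1, j):
        miss *= 1.0 - rho_full[k]
    return mass * rho_full[i] * miss * rho_full[j]

def expected_W(p, rho, W):
    """E(W) when each admissible interval carries the conditional average
    of the W_k of the support points it contains."""
    m = len(p)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m + 1):
            q = q_interval(i, j, p, rho)
            mass = sum(p[k - 1] for k in range(i + 1, j + 1))
            w = sum(p[k - 1] * W[k - 1] for k in range(i + 1, j + 1)) / mass
            total += q * w
    return total

p, W = [0.2, 0.5, 0.3], [1.0, 2.0, 3.0]
direct = sum(pk * wk for pk, wk in zip(p, W))     # sum_j W_j p_j = 2.1
# two very different return-probability vectors give the same expectation
for rho in ([0.3, 0.8, 1.0], [0.9, 0.1, 1.0]):
    print(expected_W(p, rho, W), direct)
```

Both runs print the same value as the direct sum ∑_j W_j p_j, illustrating that E(W) does not depend on the choice of ρ.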

Simulation Study
In this section, we carry out simulation studies to compare the performance of the W.R.T test with Mantel's [2] and Peto's [6] tests. In the study, we assume that the failure time random variable is exponentially distributed, the total sample sizes are n = 100 and 200, and each sample has n/2 subjects. The interval censored data are generated by the following four steps.
Step 1. Generate a failure time T_i from some distribution.

Step 2. Discretize T_i onto the scheduled examination times.

Step 3. Generate the return indicators A_1, A_2, ..., A_{m−1} as i.i.d. Bernoulli(ρ) random variables, with A_m = 1.

Step 4. Record the interval from the last attended examination time before T_i to the first attended examination time at or after T_i.
In the case of m = 6 (six return points), we set the hazard to 1/3 for population 1 and (1/3)e^θ for population 2. Figure 2 shows the density plots of the exponential distributions with θ = −0.4, −0.2, 0, 0.2, 0.4. In the case of m = 10 (ten return points), we set the hazard to 1/4 for population 1 and (1/4)e^θ for population 2. Figure 3 shows the density plots of the exponential distributions with θ = −0.6, −0.3, 0, 0.3, 0.6. Tables 5 and 6 present the powers of the three tests with sample sizes n = 100 and 200. The simulation results show that when the failure times come from the exponential distribution, the proposed W.R.T test is the most powerful.
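The failure-time generation for the two populations can be sketched as below. The (1/3)e^θ hazard for population 2 is the parametrization as reconstructed above, and the coding of failures beyond the last exam as m + 1 is our simplification; function names are ours.

```python
import math
import random

def draw_discretized_failure(hazard, m, rng):
    """Exponential failure time with the given hazard, discretized onto the
    exam grid 1..m; failures beyond the last exam are coded m + 1."""
    t = rng.expovariate(hazard)
    return min(math.ceil(t), m + 1)

def sim_groups(n, m, theta, base_hazard, rng):
    """n/2 subjects per group; group 2's hazard is base_hazard * exp(theta),
    so theta = 0 corresponds to the null hypothesis."""
    g1 = [draw_discretized_failure(base_hazard, m, rng) for _ in range(n // 2)]
    g2 = [draw_discretized_failure(base_hazard * math.exp(theta), m, rng)
          for _ in range(n // 2)]
    return g1, g2

g1, g2 = sim_groups(n=100, m=6, theta=0.4, base_hazard=1 / 3.0,
                    rng=random.Random(0))
print(len(g1), len(g2))
```

Applying the return model of Section 2.2 to these discretized failure times then yields the interval censored samples used in the power comparison.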

An Application to AIDS Cohort Study
Consider the data on 262 hemophilia patients in De Gruttola and Lagakos [18]; among them, 105 patients received at least 1,000 μg/kg of blood factor for at least one year between 1982 and 1985, and the other 157 patients received less than 1,000 μg/kg in each year. In this medical study, patients were treated between 1978 and 1988, and the observations (L_i, R_i] for the 262 patients are based on a discretization of the time axis into 6-month intervals. The failure time of interest is the time of HIV seroconversion, and the objective is to test the difference in failure times between the two treatments. Applying the proposed W.R.T test and Mantel's [2] and Peto's [6] tests to this data set yields test statistic values of −7.815, −7.352, and 56.476, respectively. All three p values are less than 0.001, leading to the same conclusion: the times of HIV seroconversion differ significantly between the two groups of patients.

Figure 1 :
Figure 1: CDF of the standard normal distribution and the simulated distribution of W.R.T. Line: standard normal. Points: simulation results for W.R.T (ρ = 0.5).

Table 1 :
The selection probability of each admissible interval.
Note that the observed failure time data in a clinical trial can be discretized if the underlying variable is continuous. The interval (t_{m−1}, ∞) denotes that the failure time of the ith subject occurs after the last examination time t_{m−1}.

Table 3 :
The mean, sample variance, and sample standard deviation of q.

Table 5 :
Power comparison of the tests under the exponential distribution with sample size n = 100.

Table 6 :
Power comparison of the tests under the exponential distribution with sample size n = 200.