Weighted Lin-Wang Tests for Crossing Hazards

Lin and Wang have introduced a quadratic version of the logrank test, appropriate for situations in which the underlying survival distributions may cross. In this note, we generalize the Lin-Wang procedure to incorporate weights and investigate the performance of Lin and Wang's test and weighted versions in various scenarios. We find that weighting does increase statistical power in certain situations; however, none of the procedures was dominant under every scenario.


Introduction
Lin and Wang [1] have recently introduced an ingenious modification of the two-sample logrank statistic, appropriate for crossing hazards alternatives. Through a simulation study, they demonstrated that their modified test had greater power than the commonly used logrank and Wilcoxon tests for detecting differences between crossing survival curves. In this note, we propose weighted versions of the Lin-Wang (LW) test and investigate the performance of these weighted tests in a limited simulation study. Details are given in Section 2, and the simulation results are presented in Section 3. We give an example in Section 4 and offer concluding remarks in Section 5.

Methods
For consistency, we adhere to the notational conventions introduced by Lin and Wang [1]. We have survival data from two groups of subjects, labeled I and II, and are interested in comparing the survival distributions of the two groups. Events (failures or deaths) are observed at distinct time points $t_1 < \cdots < t_k$ across the pooled groups. At time $t_i$, the number of observed failures is denoted by $d_{1i}$ for Group I and $d_{2i}$ for Group II, and the numbers at risk just before time $t_i$ are denoted by $n_{1i}$ and $n_{2i}$, respectively, for $i = 1, 2, \ldots, k$. Consequently, at time $t_i$, there are $d_i = d_{1i} + d_{2i}$ failures out of $n_i = n_{1i} + n_{2i}$ subjects. Subjects may be censored during or at the end of the period of observation. A representative 2 × 2 contingency table of group by status at observed failure time $t_i$ is given in Table 1.
We are interested in assessing the null hypothesis $H_0$: the survival distributions of the two groups are identical, versus the global alternative hypothesis $H_1$: the survival distributions of the two groups are not identical.

Lin and Wang introduced the quadratic statistic
$$\Delta = \sum_{i=1}^{k} \left( d_{1i} - \frac{n_{1i} d_i}{n_i} \right)^2$$
for comparison of the two groups: they argued that $\Delta$ reflects the quadratic distance between the two underlying survival distributions and hence should be sensitive to differences in either direction. They therefore based inference relating to $H_0$ on the standardized version of $\Delta$, which they denoted as $Z^*$.
Computational and Mathematical Methods in Medicine
Let us define a weighted version of $\Delta$ as
$$\Delta_w = \sum_{i=1}^{k} w_i \left( d_{1i} - \frac{n_{1i} d_i}{n_i} \right)^2$$
with arbitrary weights $w_i$, usually nonnegative. Our test statistic for assessing $H_0$ is the standardized version of $\Delta_w$; namely,
$$Z_w = \frac{\Delta_w - E(\Delta_w)}{\sqrt{\mathrm{Var}(\Delta_w)}},$$
where $E(\Delta_w)$ and $\mathrm{Var}(\Delta_w)$ are calculated from the marginal hypergeometric distribution of the $d_{1i}$. In particular,
$$E(\Delta_w) = \sum_{i=1}^{k} w_i \, \mathrm{Var}(d_{1i}),$$
and $\mathrm{Var}(\Delta_w)$ is given by
$$\mathrm{Var}(\Delta_w) = \sum_{i=1}^{k} w_i^2 \left\{ E\left[ \left( d_{1i} - E(d_{1i}) \right)^4 \right] - \left[ \mathrm{Var}(d_{1i}) \right]^2 \right\}.$$
The raw moments of $d_{1i}$ can be readily calculated from the following expression for the factorial moments:
$$E\left[ (d_{1i})_{(r)} \right] = \frac{(d_i)_{(r)} \, (n_{1i})_{(r)}}{(n_i)_{(r)}},$$
where $(x)_{(r)} = x (x - 1) \cdots (x - r + 1)$. For reference,
$$E(d_{1i}^4) = 6 E(d_{1i}^3) - 11 E(d_{1i}^2) + 6 E(d_{1i}) + E\left[ (d_{1i})_{(4)} \right].$$
We note in passing that there are typographical errors in the expressions for $E(d_{1i}^3)$ and $E(d_{1i}^4)$ in Lin and Wang [1].
We adopt the same assumptions as enumerated by Lin and Wang [1]; namely, the underlying failure times are independent; the censoring distributions (if any) for Group I and Group II are independent of each other and of the respective survival distributions; the total number of observed failures and the number of distinct failure times are large; and the weights are positive and bounded. Under these assumptions, $Z_w$ approximately follows a standard normal distribution. We are thus specifying the usual random censorship model, with further conditions to ensure approximate normality of $Z_w$. For assessing the null hypothesis of equality of the underlying survival distributions of the two groups, Lin and Wang propose a two-sided test based on $Z^*$, and we will follow that convention with $Z_w$.

Simulation Studies
In this section, we will investigate the empirical performance of weighted versions of the LW statistic, compared to the original (unweighted) LW statistic.

Empirical Type I Error.
We first investigate achieved significance levels of the LW statistic and three weighted versions. Following LW, we generated two independent random samples from the exponential distribution with mean of 4. The censoring distribution is Uniform (0, 20) in each group. The number of iterations in each simulation study is 5000. The empirical Type I error is calculated as the proportion of the 5000 repeated random samples in which we reject the null hypothesis at the alpha = 0.05 significance level, under the assumption that $Z^*$ and its weighted versions have normal distributions, and two-sided tests are utilized. We report on three weighted versions of the LW statistic, delineated by different sets of weights $w_i$, $1 \le i \le k$: (i) $w_i = n_i$; (ii) $w_i = \sqrt{n_i}$; (iii) $w_i = 1/\mathrm{SD}(d_{1i})$, where $\mathrm{SD}(d_{1i}) = \sqrt{\mathrm{Var}(d_{1i})}$. The empirical Type I errors are given in Table 2.
In this limited simulation study the empirical Type I errors are quite close to the theoretical 0.05 value, for both the LW statistic and the weighted variants. The normal distribution seems an adequate approximation for the sample sizes investigated.
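The null-hypothesis data generation described above can be sketched as follows. This is only an illustration under stated assumptions: the per-group sample size of 50 is hypothetical (the sample sizes investigated appear in Table 2, not in the text), and the helper names `simulate_group` and `risk_tables` are ours. The second helper collapses two censored samples into one `(d1, d2, n1, n2)` tuple per distinct failure time, which is the layout of Table 1.

```python
import random


def simulate_group(n, mean, cens_upper, rng):
    """Return n pairs (observed_time, event_flag): exponential failure times
    with the given mean, subject to Uniform(0, cens_upper) censoring."""
    sample = []
    for _ in range(n):
        t = rng.expovariate(1.0 / mean)      # failure time
        c = rng.uniform(0.0, cens_upper)     # censoring time
        sample.append((min(t, c), t <= c))   # event_flag is True for a failure
    return sample


def risk_tables(group1, group2):
    """Build (d1, d2, n1, n2) at each distinct observed failure time."""
    event_times = sorted({t for t, event in group1 + group2 if event})
    tables = []
    for t in event_times:
        d1 = sum(1 for u, event in group1 if event and u == t)
        d2 = sum(1 for u, event in group2 if event and u == t)
        n1 = sum(1 for u, _ in group1 if u >= t)   # at risk just before t
        n2 = sum(1 for u, _ in group2 if u >= t)
        tables.append((d1, d2, n1, n2))
    return tables


rng = random.Random(2014)                     # fixed seed for reproducibility
g1 = simulate_group(50, 4.0, 20.0, rng)       # hypothetical n = 50 per group
g2 = simulate_group(50, 4.0, 20.0, rng)
tables = risk_tables(g1, g2)
```

Repeating the last four lines 5000 times, applying a test to each set of tables, and recording the rejection proportion would give an empirical Type I error of the kind reported in Table 2.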

Empirical Power.
Following LW, we undertook simulation studies comparing the empirical powers of the unweighted LW statistic with its weighted variants, under the following three scenarios.

Scenario 1.
This scenario entails crossing survival curves. The LW specification is as follows. "In Group I the survival times follow an exponential distribution with mean of 6. In Group II the survival times follow an exponential distribution with mean of 2. However, if the survival time in Group II is greater than or equal to 1.5, then the survival time is regenerated to follow an exponential distribution with mean of 40. The censoring distribution is Uniform (0, 20) in Group I and Uniform (0, 100) in Group II, which result in about 24% censoring rate in Group I and 18% in Group II, respectively."

Scenario 2.
In this situation, the two survival curves are initially close, then cross, and diverge. The LW description is as follows. "In Group I the survival times follow an exponential distribution with mean of 4. In Group II the survival times follow an exponential distribution with mean of 3. However, if the survival time in Group II is greater than or equal to 4, then the survival time is regenerated to follow an exponential distribution with mean of 20. Also, censoring is assumed to occur randomly across the two groups. For each subject in the two groups, an independent Uniform (0, 1) random variable is generated. In Group I, if [this variable] is less than 0.2, then the corresponding time point will be flagged as censored. Otherwise it is not censored. The censoring in Group II is created similarly but with a different rate. The censoring rate is 20% in Group I and 30% in Group II, respectively."

Scenario 3.
Here, the proportional hazards assumption holds. The LW specification is as follows. "The survival times follow an exponential distribution with means 2 and 5 in Groups I and II, respectively. The censoring mechanism is similar to that in Situation (Scenario 2), but this time with 20% censoring rate in Group I and 15% censoring rate in Group II, respectively."

The number of iterations in each simulation study is 5000. The estimated statistical power is calculated as the proportion of the 5000 repeated random samples in which we reject the null hypothesis at the nominal alpha = 0.05 significance level, with two-sided test statistics. The weighted versions of the LW statistic are as above, namely, (i) $w_i = n_i$; (ii) $w_i = \sqrt{n_i}$; (iii) $w_i = 1/\mathrm{SD}(d_{1i})$, where $\mathrm{SD}(d_{1i}) = \sqrt{\mathrm{Var}(d_{1i})}$. Findings for the three scenarios are given in Tables 3, 4, and 5, respectively.
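To make the regeneration rule concrete, the following Python sketch generates data under Scenario 1 as we read the quoted LW specification. It is an interpretation, not a confirmed reproduction of the authors' code: we assume that a Group II time of 1.5 or more is replaced outright by a fresh exponential draw with mean 40 (so the replacement may itself fall below 1.5), and the function names and the sample size used below are hypothetical.

```python
import random


def scenario1_group2_time(rng):
    """Group II failure time: exponential with mean 2, redrawn from an
    exponential with mean 40 when the first draw is at least 1.5."""
    t = rng.expovariate(1.0 / 2.0)
    if t >= 1.5:
        t = rng.expovariate(1.0 / 40.0)   # assumed: a fresh, unshifted draw
    return t


def simulate_scenario1(n, rng):
    """Return two lists of (observed_time, event_flag) pairs of size n each."""
    group1, group2 = [], []
    for _ in range(n):
        t = rng.expovariate(1.0 / 6.0)    # Group I: exponential, mean 6
        c = rng.uniform(0.0, 20.0)        # Group I censoring: Uniform(0, 20)
        group1.append((min(t, c), t <= c))
    for _ in range(n):
        t = scenario1_group2_time(rng)
        c = rng.uniform(0.0, 100.0)       # Group II censoring: Uniform(0, 100)
        group2.append((min(t, c), t <= c))
    return group1, group2


rng = random.Random(12345)                # fixed seed for reproducibility
g1, g2 = simulate_scenario1(5000, rng)    # hypothetical sample size
cens1 = sum(1 for _, event in g1 if not event) / len(g1)
cens2 = sum(1 for _, event in g2 if not event) / len(g2)
```

The observed censoring fractions `cens1` and `cens2` offer a quick check of any such implementation against the censoring rates quoted in the specification.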
Interestingly, none of the procedures is dominant under every scenario. We might tend to favor the LW statistic under

An Example
We will apply the various procedures to data arising from a cancer chemotherapy experiment, as explained in Koziol [2] and Koziol and Yuh [3]. Briefly, sixty leukemic mice were randomly subdivided into two groups of equal size; one group (Group (a)) was treated with a new investigative chemotherapeutic agent, and the other group (Group (b)) served as controls. Survival times of the two cohorts are given in Table 6, and Kaplan-Meier survival curves for the groups are depicted in Figure 1.
Clearly, we are in a crossing hazards setting, and the logrank test and the generalized Wilcoxon test are not necessarily sensitive to this type of alternative. Indeed, with these data, the logrank chi-square statistic (with 1 d.f.) is 1.36 ($P$ = 0.24), and the generalized Wilcoxon chi-square statistic is 1.12 ($P$ = 0.27); we would fail to reject the hypothesis of equality of survival distributions for the two cohorts with either of these tests.
On the other hand, the LW statistic and its weighted variants all point to significantly different survival experiences in the two cohorts, with $P$ values of $10^{-6}$ or smaller. In comparison, the omnibus Kolmogorov-Smirnov, Kuiper, and Cramér-von Mises statistics introduced by Koziol and Yuh [3] were also indicative of significantly different survival distributions but with more modest $P$ values of $10^{-3}$.

Concluding Remarks
The logrank test as described in Section 2 should be ascribed to Mantel [4]: Mantel brilliantly intuited that the Mantel-Haenszel (MH) statistic [5] for assessing association across independent 2 × 2 tables could be applied to survival data, by forming a 2 × 2 table as in Table 1 at each event (death) time and then combining the resulting 2 × 2 tables as in the MH procedure. Correspondingly, our incorporation of weights into the LW statistic as described in Section 2 is not new: our motivation derives from the similar introduction of weights into the Mantel formulation of the logrank statistic, by Tarone and Ware [6] and Leurgans [7] among others. And, anticipating the findings in Section 3, these investigators have shown that the weighted statistics can enjoy improved power properties over the unweighted MH statistics in various settings. We remark that calculation of the LW statistic is rather computationally intensive, but incorporation of weights should cause no additional computational difficulties. Optimal choice of weights remains an open issue, which we are currently pursuing.
The generalized Wilcoxon test and the logrank test are perhaps the best known and most commonly used procedures for the comparison of two survival distributions with observations subject to random censorship. Mantel [4] and others recognized, however, that these tests may not be appropriate whenever the alternative of interest is not that one survival distribution is stochastically larger than the other but merely that the distributions are not equal. Crossing hazards are an example of nonstochastic ordering of survival distributions. For testing equality against such alternatives, Koziol [2] proposed a two-sample Cramér-von Mises type statistic based on the product-limit estimates of the individual survival distributions, and later Koziol and Yuh [3] introduced Kolmogorov-Smirnov and Kuiper as well as Cramér-von Mises statistics for the same omnibus two-sample testing problem. The LW statistic is more closely attuned to the logrank test than these omnibus procedures; and, as seen in the example, the LW statistics may be more sensitive to crossing hazards alternatives.
It should be noted that Mantel [4] also proposed a modification of the Mantel logrank test, appropriate for crossing hazards: Mantel suggested that one construct a "chi-squared" statistic at each event time as in Table 1, sum these individual statistics over the event times, and then treat the resulting sum as an approximate chi-square random variable with $k$ degrees of freedom, $k$ being the number of tables (distinct event times). We explored this statistic in simulation studies, but regrettably we cannot recommend it, owing to its decreased power relative to the other statistics reported herein and to the tenuous assumption that a chi-square distribution for this statistic is adequate (though with larger sample sizes, a normal approximation might be invoked).
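For concreteness, the summed statistic attributed to Mantel above might be sketched in Python as follows. Two choices here are our assumptions rather than anything stated in the source: each per-table statistic is taken to be the squared deviation of the Group I failure count from its hypergeometric mean, divided by the hypergeometric variance, and tables with zero variance are skipped and excluded from the degrees of freedom. The function name `mantel_sum_chisq` and the tuple layout `(d1, d2, n1, n2)` are hypothetical.

```python
def mantel_sum_chisq(tables):
    """Sum of per-table chi-squared statistics over the event times.

    tables : list of (d1, d2, n1, n2), one tuple per distinct failure time
    Returns (statistic, df), where df counts the tables that contribute.
    """
    stat = 0.0
    df = 0
    for d1, d2, n1, n2 in tables:
        n = n1 + n2
        d = d1 + d2
        if n < 2:
            continue                      # a single subject at risk: no variance
        expected = n1 * d / n             # hypergeometric mean of d1
        var = n1 * n2 * d * (n - d) / (n * n * (n - 1))
        if var == 0:
            continue                      # degenerate table (assumed skipped)
        stat += (d1 - expected) ** 2 / var
        df += 1
    return stat, df


# Toy input: three failure times, one failure at each.
stat, df = mantel_sum_chisq([(1, 0, 3, 3), (0, 1, 2, 3), (1, 0, 2, 2)])
```

The sketch returns only the statistic and its degrees of freedom; referring the sum to a chi-square distribution is the step that the text describes as tenuous.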