Power Loss of Stratified Log-Rank Test in Homogeneous Samples

We study the loss of power of the stratified log-rank test (SLRT) compared to the unstratified log-rank test (ULRT) when there is a large number of strata with relatively small stratum sizes, in terms of the asymptotic distributions of the test statistics under local alternatives. When the strata are homogeneous, the SLRT loses information through overstratification. It is therefore advisable to test the homogeneity among strata before using the stratified log-rank test.


Introduction
It is well known in survival analysis that the (unstratified) log-rank test (ULRT) is the most efficient invariant test under contiguous alternatives in the proportional hazards model [1,2]. Gill [3] gave a nice proof of this conclusion by using the Cauchy-Schwarz inequality.
In multicenter clinical trials with time-to-event as the primary outcome variable, we want to compare the treatment effects of two or more treatment methods. In the example in Section 6, patients were randomized in a 3 : 2 ratio to two treatment groups. The primary outcome is the time to death from all causes during the study. Individuals from different centers are assumed to be independent. Even if the treatment effect can be assumed to be the same among centers, each center may have some factors which make the baseline hazard functions differ from center to center. For this kind of data, the stratified log-rank test (SLRT) can be used to account for the baseline difference.
Some previous work has studied the power loss of the log-rank test. Akazawa et al. [4] evaluated through simulation the loss of power of the stratified log-rank test due to heterogeneity in clinical trials. Generally, the power of the stratified log-rank test decreases for two reasons: (i) the stratum size may be too small and (ii) the individuals in the same stratum are heterogeneous. The simulation shows that the loss of power is substantial when the stratum size is very small and "the total number of failures and the treatment effect are fixed". From the stratified Cox regression model in survival analysis, we can see that a small stratum contributes little to the overall test, especially when some of its observations are censored. This may decrease the power of the stratified log-rank test. Note that the stratified log-rank test is the score test from the stratified partial likelihood (see Section 3).
In this paper, we consider the case where there is a large number of strata, but each stratum has a relatively small sample size. We assume that patients are homogeneous within each treatment group. For this kind of data, we can construct both the stratified and unstratified log-rank tests. We derive a variance relation between the SLRT and ULRT and quantify the power loss due to unnecessary stratification by this relation. We illustrate our approach with data from a multi-center clinical trial (MADIT II) to test the treatment effect of an implantable defibrillator on survival of patients with reduced left ventricular function after myocardial infarction.
This paper is organized as follows. Data and notation are described in Section 2. We derive the SLRT and ULRT and their local asymptotic distribution from the Cox proportional hazards regression model in Sections 3 and 4. An association between the SLRT and ULRT is developed in Section 5. We apply the approach to the MADIT II study data in Section 6, and offer concluding remarks in Section 7.

Data Structure and Notations
Suppose there are $n$ centers, with $n_i$ patients in center $i$ randomized to the treatment group ($Z = 1$) or the control group ($Z = 0$). Assume that $n_i$, $i = 1, \ldots, n$, are independent and identically distributed (iid) positive integer random variables with finite second moments.
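The underlying survival times $T_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, are subject to random censoring with censoring times $C_{ij}$. Here we assume that $\{C_{ij}\}$ are independent of $\{T_{ij}\}$ and $\{n_i\}$. We further assume that $\{C_{ij}\}$ are iid. Due to the censoring, the observable data are

$$X_{ij} = \min(T_{ij}, C_{ij}), \qquad \delta_{ij} = 1\{T_{ij} \le C_{ij}\}, \qquad Z_{ij},$$

together with the counting and at-risk processes

$$N_{ij}(t) = 1\{X_{ij} \le t,\ \delta_{ij} = 1\}, \qquad Y_{ij}(t) = 1\{X_{ij} \ge t\}.$$

Therefore, $N_{ij}(t) = 1$ if and only if the event has been observed by time $t$, and $Y_{ij}(t) = 1$ if and only if the patient is still at risk immediately before $t$. Suppose that the hazard functions of $T_{ij}$ are of the form

$$\lambda_{ij}(t) = \lambda_{i0}(t)\exp(\beta Z_{ij}),$$

where $\lambda_{i0}(t)$ is the baseline hazard function of center $i$ and $\beta$ is the treatment effect common to all centers. For homogeneous samples, $\lambda_{i0}(t) = \lambda_0(t)$ for all $i$.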
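The data structure described above can be mimicked with a small simulation. This is a minimal sketch for illustration only: the function name `simulate_trial`, the exponential survival and censoring distributions, the rate parameters, the allocation probability, and the center-size distribution are all assumptions introduced here and are not taken from the paper.

```python
import math
import random

def simulate_trial(n_centers, beta=0.0, base_rate=0.1, cens_rate=0.05,
                   mean_size=16, p_treat=0.6, seed=0):
    """Simulate records (center, time, event, z) for a homogeneous trial.

    Hypothetical choices: exponential survival with hazard
    base_rate * exp(beta * z), exponential censoring, and iid center
    sizes n_i = 1 + Binomial(2 * (mean_size - 1), 0.5).
    """
    rng = random.Random(seed)
    records = []
    for center in range(n_centers):
        # iid positive integer center size with finite second moment
        n_i = 1 + sum(rng.random() < 0.5 for _ in range(2 * (mean_size - 1)))
        for _ in range(n_i):
            z = 1 if rng.random() < p_treat else 0              # treatment indicator
            t = rng.expovariate(base_rate * math.exp(beta * z)) # survival time
            c = rng.expovariate(cens_rate)                      # censoring time
            records.append((center, min(t, c), int(t <= c), z))
    return records
```

Each record holds the center label, the observed time, the event indicator, and the treatment indicator, which is all that either log-rank test needs.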

Stratified Log-Rank Test
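The stratified log-rank test (SLRT) can be derived from the stratified Cox proportional hazards model [5][6][7]. The contribution to the partial likelihood from stratum (center) $i$ is

$$L_i(\beta) = \prod_{j=1}^{n_i} \prod_{t \ge 0} \left[ \frac{\exp(\beta Z_{ij})}{\sum_{k=1}^{n_i} Y_{ik}(t)\exp(\beta Z_{ik})} \right]^{dN_{ij}(t)}.$$

The log partial likelihood is

$$l_1(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ \beta Z_{ij} - \log \sum_{k=1}^{n_i} Y_{ik}(t)\exp(\beta Z_{ik}) \right] dN_{ij}(t).$$

Let

$$S_i^{(0)}(\beta, t) = \sum_{j=1}^{n_i} Y_{ij}(t)\exp(\beta Z_{ij}), \qquad S_i^{(1)}(\beta, t) = \sum_{j=1}^{n_i} Y_{ij}(t) Z_{ij}\exp(\beta Z_{ij}), \qquad E_i(\beta, t) = \frac{S_i^{(1)}(\beta, t)}{S_i^{(0)}(\beta, t)}.$$

The first two derivatives of the log partial likelihood are

$$U_1(\beta) = \frac{\partial l_1(\beta)}{\partial \beta} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ Z_{ij} - E_i(\beta, t) \right] dN_{ij}(t),$$

$$-\frac{\partial^2 l_1(\beta)}{\partial \beta^2} = \sum_{i=1}^{n} I_i(\beta), \qquad I_i(\beta) = \sum_{j=1}^{n_i} \int_0^\infty E_i(\beta, t)\left[ 1 - E_i(\beta, t) \right] dN_{ij}(t),$$

where the form of $I_i(\beta)$ uses $Z_{ij}^2 = Z_{ij}$. Under the null hypothesis $H_0 : \beta = 0$ (no treatment effect), $E_i(0, t) = \bar Z_i(t) = \sum_j Y_{ij}(t) Z_{ij} / \sum_j Y_{ij}(t)$, the proportion of treated patients among those at risk in center $i$, and

$$U_1(0) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ Z_{ij} - \bar Z_i(t) \right] dM_{ij}(t),$$

where $dM_{ij}(t) = dN_{ij}(t) - Y_{ij}(t)\lambda_0(t)\,dt$; the compensator terms cancel because $\sum_j [Z_{ij} - \bar Z_i(t)] Y_{ij}(t) = 0$. Let

$$T_1 = n^{-1/2} U_1(0).$$

The predictable variation of $T_1$ is

$$\langle T_1 \rangle(t) = \frac{1}{n} \sum_{i=1}^{n} \int_0^t \bar Z_i(s)\left[ 1 - \bar Z_i(s) \right] \sum_{j=1}^{n_i} Y_{ij}(s)\,\lambda_0(s)\,ds.$$

Under the null hypothesis, $T_1 \to N(0, \sigma_1^2)$ in distribution, where $\sigma_1^2$ can be consistently estimated by $n^{-1} \sum_{i=1}^{n} I_i(0)$. The stratified log-rank test is defined as

$$\mathrm{SLRT} = \frac{U_1(0)}{\sqrt{\sum_{i=1}^{n} I_i(0)}}.$$

The asymptotic distribution of the SLRT under local alternatives is derived in the appendix.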
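The stratified statistic can be computed directly from its definition as a sum of per-center scores and informations at $\beta = 0$. The sketch below is a plain, quadratic-time illustration; the function names and the record layout `(center, time, event, z)` are hypothetical, and the code assumes no tied event times, as for continuous survival times.

```python
import math
from collections import defaultdict

def score_and_information(rows):
    """Return (U, I) at beta = 0 for one pooled risk set.

    rows: iterable of (time, event, z), event and z in {0, 1}.
    Assumes no tied event times.
    """
    rows = list(rows)
    u = info = 0.0
    for t, event, z in rows:
        if not event:
            continue                                   # censored: no jump of N
        at_risk = [zk for tk, _, zk in rows if tk >= t]  # subjects with Y_k(t) = 1
        zbar = sum(at_risk) / len(at_risk)             # treated fraction at risk
        u += z - zbar                                  # observed minus expected
        info += zbar * (1.0 - zbar)                    # information increment
    return u, info

def stratified_logrank(records):
    """records: iterable of (center, time, event, z).

    Returns (U1, sum of I_i, standardized statistic).
    """
    strata = defaultdict(list)
    for center, t, event, z in records:
        strata[center].append((t, event, z))
    u1 = i1 = 0.0
    for rows in strata.values():                       # each center is its own stratum
        u, info = score_and_information(rows)
        u1 += u
        i1 += info
    stat = u1 / math.sqrt(i1) if i1 > 0 else float("nan")
    return u1, i1, stat
```

A center with no observed event, or with all of its at-risk patients in one arm at every event time, adds zero to both sums. This is how small strata lose information.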

Unstratified Log-Rank Test
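Similar to the SLRT, the ULRT can be derived from the Cox proportional hazards model, now with a single risk set pooled over all centers. The log partial likelihood function is

$$l_2(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ \beta Z_{ij} - \log S^{(0)}(\beta, t) \right] dN_{ij}(t).$$

Let

$$S^{(0)}(\beta, t) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} Y_{ij}(t)\exp(\beta Z_{ij}), \qquad S^{(1)}(\beta, t) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} Y_{ij}(t) Z_{ij}\exp(\beta Z_{ij}), \qquad E(\beta, t) = \frac{S^{(1)}(\beta, t)}{S^{(0)}(\beta, t)}.$$

The first- and second-order derivatives of the log partial likelihood are

$$U_2(\beta) = \frac{\partial l_2(\beta)}{\partial \beta} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ Z_{ij} - E(\beta, t) \right] dN_{ij}(t),$$

$$I(\beta) = -\frac{\partial^2 l_2(\beta)}{\partial \beta^2} = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty E(\beta, t)\left[ 1 - E(\beta, t) \right] dN_{ij}(t).$$

Under the null hypothesis $H_0 : \beta = 0$, with $\bar Z(t) = E(0, t)$ the proportion of treated patients among all those at risk at time $t$,

$$U_2(0) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ Z_{ij} - \bar Z(t) \right] dN_{ij}(t) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^\infty \left[ Z_{ij} - \bar Z(t) \right] dM_{ij}(t),$$

where the first expression is the general (observed minus expected) form of the log-rank statistic used in the literature. Let

$$T_2 = n^{-1/2} U_2(0).$$

The predictable variation of $T_2$ is

$$\langle T_2 \rangle(t) = \frac{1}{n} \int_0^t \bar Z(s)\left[ 1 - \bar Z(s) \right] \sum_{i=1}^{n} \sum_{j=1}^{n_i} Y_{ij}(s)\,\lambda_0(s)\,ds.$$

Under the null hypothesis, $T_2 \to N(0, \sigma_2^2)$ in distribution, where $\sigma_2^2$ can be consistently estimated by $I(0)/n$. The unstratified log-rank test is defined as

$$\mathrm{ULRT} = \frac{U_2(0)}{\sqrt{I(0)}}.$$

The asymptotic distribution of the ULRT under local alternatives can be derived with the same method as in the appendix.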
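Because the unstratified test uses one pooled risk set, its score and information can be accumulated in a single pass over the observations sorted by time. The sketch below illustrates this under the same hypothetical record layout `(center, time, event, z)`, with the center label ignored; it assumes no tied times, and the function name is introduced here for illustration.

```python
import math

def unstratified_logrank(records):
    """records: iterable of (center, time, event, z); center is ignored.

    Returns (U2, I, standardized statistic). Assumes no tied times.
    """
    rows = sorted((t, event, z) for _, t, event, z in records)
    n_at_risk = len(rows)                    # everyone is at risk at time 0
    n_treated = sum(z for _, _, z in rows)   # treated subjects at risk
    u2 = info = 0.0
    for t, event, z in rows:                 # sweep in increasing time
        if event:
            zbar = n_treated / n_at_risk     # pooled treated fraction at risk
            u2 += z - zbar                   # observed minus expected
            info += zbar * (1.0 - zbar)      # information increment
        n_at_risk -= 1                       # subject leaves the risk set
        n_treated -= z
    stat = u2 / math.sqrt(info) if info > 0 else float("nan")
    return u2, info, stat
```

In contrast to the stratified version, a center with a single patient or with no events still enters the pooled risk set, so its follow-up time is not discarded.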

A Relation between the Asymptotic Variances of $T_1$ and $T_2$
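From martingale theory, the predictable covariation of $T_1$ and $T_2$ is

$$\langle T_1, T_2 \rangle(t) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^t \left[ Z_{ij} - \bar Z_i(s) \right]\left[ Z_{ij} - \bar Z(s) \right] Y_{ij}(s)\,\lambda_0(s)\,ds = \langle T_1 \rangle(t),$$

with $\bar Z_i(s)$ and $\bar Z(s)$ the proportions of treated patients among those at risk at time $s$ in center $i$ and in the pooled sample, respectively. The second equality holds because $\sum_j [Z_{ij} - \bar Z_i(s)] Y_{ij}(s) = 0$ for every $i$, so that the term involving $\bar Z(s)$ vanishes and $\bar Z(s)$ can be replaced by $\bar Z_i(s)$. This means that the asymptotic covariance of $T_1$ and $T_2$ satisfies

$$\mathrm{Acov}(T_1, T_2) = \mathrm{Avar}(T_1) = \sigma_1^2.$$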
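From the Cauchy-Schwarz inequality,

$$\sigma_1^4 = \left[ \mathrm{Acov}(T_1, T_2) \right]^2 \le \mathrm{Avar}(T_1)\,\mathrm{Avar}(T_2) = \sigma_1^2 \sigma_2^2,$$

so that $\sigma_1^2 \le \sigma_2^2$. From (22), (25), (A.4), and Lemma 1 (which is readily checked), the ULRT is always asymptotically at least as powerful as the SLRT in homogeneous samples. The loss of power of the SLRT is due to the loss of information from the unnecessary stratification.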
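To see how a variance gap translates into a power gap, the sketch below evaluates the asymptotic power of a two-sided test. It rests on an assumption that is standard for score tests but is stated here rather than quoted from the text: under local alternatives $\beta_n = b/\sqrt{n}$, the standardized SLRT and ULRT are asymptotically $N(b\sigma_1, 1)$ and $N(b\sigma_2, 1)$. The constant `b` and the function names are notation introduced for this illustration.

```python
from statistics import NormalDist

def two_sided_power(drift, alpha=0.05):
    """Power of a two-sided level-alpha test whose statistic is N(drift, 1)."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0)            # two-sided critical value
    return nd.cdf(-z + drift) + nd.cdf(-z - drift)

def asymptotic_powers(b, sigma1_sq, sigma2_sq, alpha=0.05):
    """Return (power of SLRT, power of ULRT) under the assumed drifts.

    Assumed drifts: b * sigma_1 for the SLRT and b * sigma_2 for the ULRT.
    """
    return (two_sided_power(b * sigma1_sq ** 0.5, alpha),
            two_sided_power(b * sigma2_sq ** 0.5, alpha))
```

Since the two-sided power is increasing in the absolute drift, $\sigma_1^2 \le \sigma_2^2$ implies that the first returned value never exceeds the second under this assumption.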

A Real Example
We study the SLRT and ULRT in a multi-center clinical trial [8]. The Multicenter Automatic Defibrillator Implantation Trial II (MADIT-II) was designed to evaluate the potential survival benefit of a prophylactically implanted defibrillator in coronary patients with a prior myocardial infarction and advanced left ventricular dysfunction (ejection fraction ≤ 0.3). The trial started in July 1997 and enrolled 1232 patients from 76 hospital centers (71 in the US and 5 in Europe). The patients were randomized in a 3 : 2 ratio to receive either an implantable defibrillator or conventional medical therapy. We first test the homogeneity of strata by the log-rank test (P-value = 0.06). The estimated variances of $T_1$ and $T_2$ are $\hat\sigma_1^2 = 46.48$ and $\hat\sigma_2^2 = 48.04$, respectively. This shows that the ULRT is asymptotically slightly more powerful than the SLRT. Across the 76 centers, the number of patients ranges from 1 to 55, with a mean of 16.4. There were 21 centers without any event by the end of the study. These centers did not contribute to the stratified log-rank test.
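The two reported variance estimates can be summarized by their ratio. The short calculation below reads $\hat\sigma_1^2 / \hat\sigma_2^2$ as the asymptotic relative efficiency of the SLRT with respect to the ULRT. This reading is an assumption made here, valid when the standardized statistics have drifts proportional to $\sigma_1$ and $\sigma_2$ under local alternatives; the variable names are illustrative.

```python
sigma1_sq = 46.48   # reported estimate for the stratified statistic T_1
sigma2_sq = 48.04   # reported estimate for the unstratified statistic T_2

# Efficiency of the SLRT relative to the ULRT under the assumed drift form.
relative_efficiency = sigma1_sq / sigma2_sq
# Factor by which the sample size would have to grow for equal power.
inflation = sigma2_sq / sigma1_sq

print(f"relative efficiency:   {relative_efficiency:.3f}")  # 0.968
print(f"sample-size inflation: {inflation:.3f}")            # 1.034
```

On this reading, stratifying by center costs roughly 3% in efficiency for these data, in line with the description of the ULRT as only slightly more powerful.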

Discussion
In this paper, we studied the loss of power of the stratified log-rank test in multi-center clinical trials with a large number of centers but relatively small stratum sizes, assuming homogeneous strata. Our results show that the asymptotic variance of the SLRT score statistic $T_1$ is smaller than that of its ULRT counterpart $T_2$, which makes the SLRT less powerful. Overstratification may therefore incur a loss of information relative to the unstratified log-rank test. However, our study has some limitations. First, we assumed that the strata are homogeneous. In that case, the unstratified log-rank test should be the best choice. In practice, it is important to test the homogeneity of strata before using the stratified log-rank test. Second, we considered the case of a large number of strata with small stratum sizes. Another case of interest is a small number of strata with large stratum sizes. Although (23) is always true, the local alternatives cannot then be specified in terms of $n$. We are currently investigating whether (25) still holds in that setting.