New Test for the Comparison of Survival Curves to Detect Late Differences

Background. Survival analysis has attracted the attention of scientists from various domains such as engineering, health, and social sciences. It has been widely used in clinical trials to compare different treatments through their survival probabilities. Kaplan–Meier curves, plotted from the Kaplan–Meier estimates of survival probabilities, are used to depict the general picture in such situations. Methods. The weighted log-rank test has been studied through different weight functions, each of which gives specific strength in specific situations. In this work, we propose a new weight function comprising all numbers at risk, i.e., the overall number at risk and the separate numbers at risk in the groups under study, to detect late differences between survival curves. Results. According to the simulation studies, the new test is a good alternative, second to the FH (0, 1) test, in detecting late differences, and it outperformed all tests in the case of small samples and heavy censoring rates. The new test kept the same strength when applied to real data, where it was among the most powerful tests or even outperformed all other tests under consideration. Conclusion. As the new test stays stronger in the case of small samples and heavy censoring rates, it may be a better choice whenever the aim is to detect late differences between survival curves.


Introduction
Survival analysis has many real-world applications: in engineering, such as testing the lifetime of light bulbs; in medicine, such as testing the efficiency of different treatments; and even in the social sciences. In medical research, the comparison of two treatments is of crucial importance because it helps to decide which treatment works better. This is where the comparison of survival curves plays its role.
The comparison of survival curves is done when two or more samples are submitted to different treatments or drugs. When drugs are compared, they are tested on parallel groups to decide which one is more efficient. Efficiency may refer to the time a drug takes to produce a positive effect, if any, and to the percentage of subjects in which it does so. To compare survival curves, we record the survival probabilities at each instant of interest for the groups or samples under consideration, draw the Kaplan-Meier curves, and compare them using different techniques. Different scenarios are explored, and some tests are more powerful in specific scenarios. Such scenarios are proportional hazards, early differences, and late differences. Some authors also include middle differences, even though these attract less attention, probably because they rarely occur. The test explored in this research is appropriate for investigating late differences between curves.

Weighted Log-Rank Test.
The weighted log-rank test is sometimes used to test the equality of survival distributions. In the case of two groups or two treatments, the hypotheses being tested are of the following form: $H_0: S_1(t) = S_2(t)$ for all $t$, against $H_1: S_1(t) \neq S_2(t)$ for some $t$, where $S_i(t)$ is the survival probability in group $i$ at time $t$.
In the case of nonproportional hazard rates, the comparison of survival curves is preferably done using different weighted log-rank tests. The weight function plays a crucial role: its misspecification leads to inaccurate results and to a loss of power of the test.
The weighted log-rank statistic is written in stochastic integral form as

$$U = \int_0^{\tau} w(t)\,\frac{R_1(t)\,R_2(t)}{R(t)}\left[\frac{dN_1(t)}{R_1(t)} - \frac{dN_2(t)}{R_2(t)}\right],$$

where $\tau$ is the total time of the study, $w(t)$ is the weight function at time $t$, $R_i(t)$ is the number of items/individuals at risk at time $t$ in the $i$th group, $R(t)$ is the overall number of items/individuals at risk at time $t$, and $N_i(t)$ is the number of items/individuals which underwent the event of interest by time $t$ in the $i$th group [1]-[2]. The variance of this weighted log-rank statistic is estimated by

$$\widehat{\mathrm{Var}}(U) = \int_0^{\tau} w(t)^2\,\frac{R_1(t)\,R_2(t)}{R(t)^2}\left(1 - \frac{\Delta N(t) - 1}{R(t) - 1}\right) dN(t),$$

where $N(t) = N_1(t) + N_2(t)$ and $\Delta N(t)$ is the number of events occurring at time $t$. Computationally, the weighted log-rank statistic is written as

$$U = \sum_{j=1}^{k} w_j\left(d_{ij} - d_j\,\frac{r_{ij}}{r_j}\right),$$

where $w_j$ is the weight at time $t_j$, $r_j$ is the overall number of items/individuals at risk at time $t_j$, $r_{ij}$ is the number of items/individuals at risk at time $t_j$ in the $i$th group, $d_{ij}$ is the number of events of interest at time $t_j$ in the $i$th group, and $d_j$ is the overall number of events of interest at time $t_j$. The statistic $U$ is such that its expected value is $E[U] = 0$ and

$$\mathrm{Var}(U) = \sum_{j=1}^{k} w_j^2\,\frac{r_{1j}\,r_{2j}\,d_j\,(r_j - d_j)}{r_j^2\,(r_j - 1)},$$

and hence the statistic to be computed becomes

$$\chi^2_{wl} = \frac{U^2}{\mathrm{Var}(U)}.$$

We recall that this statistic is asymptotically chi-square distributed ($\chi^2_{wl} \sim \chi^2(1)$) and can be reduced to a normally distributed statistic as follows:

$$Z = \frac{U}{\sqrt{\mathrm{Var}(U)}} \sim N(0, 1).$$

The weighted log-rank test statistic contains all three quantities $r_j$, $r_{1j}$, and $r_{2j}$, while the weights considered by different researchers were based on $r_j$ transformed in different ways or on the overall survival probability [3]. Even the survival probabilities considered were those of the overall sample.
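As a minimal sketch of the computational form of U and Var(U), the following Python code evaluates U, Var(U), Z, and the chi-square statistic for two small samples. The function names (`risk_table`, `weighted_logrank`), the toy data, the choice of group 1 as the reference group, and the two-sided p value are assumptions made for illustration only; the default weight of 1 corresponds to the standard log-rank test.

```python
import math
from statistics import NormalDist


def risk_table(times1, events1, times2, events2):
    """For each distinct event time t_j return (t_j, r_1j, r_2j, d_1j, d_2j)."""
    event_times = sorted(
        {t for t, e in zip(times1, events1) if e}
        | {t for t, e in zip(times2, events2) if e}
    )
    rows = []
    for tj in event_times:
        r1 = sum(1 for t in times1 if t >= tj)  # at risk in group 1
        r2 = sum(1 for t in times2 if t >= tj)  # at risk in group 2
        d1 = sum(1 for t, e in zip(times1, events1) if e and t == tj)
        d2 = sum(1 for t, e in zip(times2, events2) if e and t == tj)
        rows.append((tj, r1, r2, d1, d2))
    return rows


def weighted_logrank(rows, weight=lambda tj, r1, r2: 1.0):
    """Return (U, V, Z, chi-square, two-sided p) for a weight w_j = weight(t_j, r_1j, r_2j)."""
    U, V = 0.0, 0.0
    for tj, r1, r2, d1, d2 in rows:
        r, d = r1 + r2, d1 + d2
        if r1 == 0 or r2 == 0:
            continue  # no comparison is possible once a group is empty
        w = weight(tj, r1, r2)
        U += w * (d1 - d * r1 / r)
        V += w ** 2 * r1 * r2 * d * (r - d) / (r ** 2 * (r - 1))
    z = U / math.sqrt(V)
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return U, V, z, z * z, p_two_sided


# Toy data (hypothetical): event indicator 1 = event observed, 0 = censored.
rows = risk_table([1, 3], [1, 1], [2, 4], [1, 1])
U, V, z, chi2, p = weighted_logrank(rows)
print(U, V, z, chi2, p)
```

Passing the weight as a function of the event time and the two numbers at risk keeps the same routine usable for any weight built from $r_j$, $r_{1j}$, and $r_{2j}$.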
Some of the well-known weight functions are displayed in Table 1.
Various modifications and improvements have been made to obtain more powerful weight functions. For example, Garès et al. [4] used the $G^{\rho,\gamma}$ family of tests, proposed by Fleming and Harrington [5], to investigate late effects in controlled trials. Another test statistic, called the Max-combo test statistic, is built from a given number of FH test statistics [6]-[7]. It is calculated as the maximum (linear) combination of a selected set of FH tests, $G^{1,1}$, $G^{1,0}$, $G^{0,1}$, and $G^{0,0}$. This technique was introduced because nearly every test statistic has high power only in a specific situation, so choosing a single test would require knowing the situation beforehand.
However, it is not easy to know whether the situation under study involves early or late effects. FH (0, 1) is more powerful in the case of late effects, or late separation of the survival curves, while FH (1, 0) is more powerful in the case of early effects, or early separation of the survival curves. The lack of prior knowledge about the location of the effects is the reason for combining two or more tests in order to capture every feature [7].
Rückbeil et al. [6] studied the Max-Combo test statistic built from three standardized FH tests, $G^{1,0}$, $G^{0,1}$, and $G^{0,0}$, under five different randomization procedures. They compared the separate FH tests with the Max-Combo test and found that the Max-Combo test was second in power in each case; its highest power, 83%, was observed when assessing late treatment effects.
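The contrast between FH (0, 1) and FH (1, 0) can be seen directly from the usual definition of the Fleming-Harrington weights, $w(t_j) = \hat{S}(t_j^-)^{\rho}\,(1 - \hat{S}(t_j^-))^{\gamma}$, where $\hat{S}$ is the pooled Kaplan-Meier estimate just before $t_j$. The sketch below is illustrative only: the function name `fh_weights` and the toy event counts are assumptions.

```python
def fh_weights(d, r, rho, gamma):
    """Fleming-Harrington weights at ordered event times.

    d[j], r[j]: pooled number of events and pooled number at risk at t_j.
    """
    weights = []
    s_prev = 1.0  # pooled Kaplan-Meier estimate just before the first event time
    for dj, rj in zip(d, r):
        weights.append(s_prev ** rho * (1.0 - s_prev) ** gamma)
        s_prev *= 1.0 - dj / rj  # update the estimate after using its left limit
    return weights


# Toy pooled data (hypothetical): one event at each of four times, no censoring.
d = [1, 1, 1, 1]
r = [4, 3, 2, 1]
print("FH(0,1):", fh_weights(d, r, 0, 1))  # grows with time: stresses late events
print("FH(1,0):", fh_weights(d, r, 1, 0))  # shrinks with time: stresses early events
print("FH(0,0):", fh_weights(d, r, 0, 0))  # constant: standard log-rank
```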
The study by Lee [8] dealt with the standardization of the weighted log-rank test statistics and of the Max-combo test statistics (Lin et al. [9]). The standardized statistic is the statistic divided by the square root of its variance estimate. Three cases were considered for multiple standardized weighted log-rank test statistics. Considering the corresponding Z statistics $Z_1$ and $Z_2$ from $G^{1,0}$ and $G^{0,1}$, respectively, as studied in [8], the first of the three cases is the average of the absolute values, that is, $(|Z_1| + |Z_2|)/2$. Lee [10] evaluated the maximum and average of $G^{0,0}$, $G^{0,2}$, $G^{2,0}$, and $G^{2,2}$. Karrison [11] considered $\max(|Z_1|, |Z_2|, |Z_3|)$, where the Z statistics $Z_1$, $Z_2$, and $Z_3$ were from $G^{0,0}$, $G^{0,1}$, and $G^{1,0}$. This combination covers a good range of possibilities, including early differences, late differences, and proportional hazards. Abou-Shaara [12] studied the similarities between the Kaplan-Meier method and ANOVA and found that the two methods lead to the same conclusion.
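The maximum and the average of absolute standardized statistics mentioned above can be sketched in a few lines. The function name `combine_z` and the Z values are hypothetical; note that a p value for either combination requires the joint distribution of the component statistics, which this sketch does not attempt.

```python
def combine_z(z_values):
    """Return the maximum and the average of |Z| over several standardized tests."""
    abs_z = [abs(z) for z in z_values]
    return {"max": max(abs_z), "average": sum(abs_z) / len(abs_z)}


# Hypothetical Z statistics, e.g. from three different FH weightings.
print(combine_z([1.2, -2.5, 0.3]))
```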
Journal of Probability and Statistics

It may be necessary to estimate a confidence interval for the estimated survival probability [13]; it is found as follows:

$$\hat{S}(t) \pm z_{1-\alpha/2}\,\sqrt{\mathrm{Var}[\hat{S}(t)]},$$

where $\mathrm{Var}[\hat{S}(t)]$ is computed according to Greenwood's formula as follows:

$$\mathrm{Var}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_j \le t} \frac{d_j}{r_j\,(r_j - d_j)}.$$

Klein et al. [14] proposed a test, called a naive test, of the null hypothesis at some fixed time points. Such a test may be obtained from the cumulative hazards $H_i(t)$ or the survival probabilities $S_i(t)$.
Qian and Zhou [15] proposed a family of hazard rate functions of hyperbolic-cosine-shaped (CH) type, and the CH class of weight functions deduced from it produced good test statistics for the detection of late differences.
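A small sketch of the Kaplan-Meier estimate with Greenwood's variance and a normal-approximation confidence interval is given below. The function name `km_with_greenwood`, the truncation of the interval to [0, 1], and the toy counts are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def km_with_greenwood(d, r, level=0.95):
    """Kaplan-Meier estimate, Greenwood variance and CI at ordered event times.

    d[j], r[j]: number of events and number at risk at t_j.
    Returns a list of (S, Var[S], lower, upper).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    s, acc, out = 1.0, 0.0, []
    for dj, rj in zip(d, r):
        s *= 1.0 - dj / rj
        if rj > dj:  # the Greenwood term is undefined when everyone at risk fails
            acc += dj / (rj * (rj - dj))
        var = s * s * acc
        half = z * math.sqrt(var)
        out.append((s, var, max(0.0, s - half), min(1.0, s + half)))
    return out


# Toy data (hypothetical): 4 at risk then 3 at risk, one event each time.
for row in km_with_greenwood([1, 1], [4, 3]):
    print(row)
```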

New Weight Function.
The existing weight functions are built as functions of $r_j$ and hence vary only with the total number of individuals remaining at risk. Using $r_j$ transformed in different ways means that only the overall number of individuals at risk is taken into account. However, the separate numbers $r_{1j}$ and $r_{2j}$ of individuals at risk in each group could also be involved and may help to capture more features. Involving $r_{1j}$ and $r_{2j}$ separately in the weight helps to detect the difference in the occurrence of the event of interest between the two groups at each time point, depending on the relation between the two numbers. There is, therefore, a need for a new weight function comprising $r_j$, $r_{1j}$, and $r_{2j}$ simultaneously, which changes as a function of the three variables and hence probably takes into account the variations between $r_{1j}$ and $r_{2j}$. This new weight is expected to be more adaptive since it captures, to some extent, the difference in variations between $r_{1j}$ and $r_{2j}$ by itself: it is relatively small (big) for small (big) differences between the two quantities. In other words, if the occurrences are roughly equal in both groups, the weight is relatively lighter than when the occurrences are higher in one group than in the other. While $r_j$ accounts for the overall change (and hence general occurrences), the separate changes in the numbers of individuals at risk in the respective groups are needed to improve the accuracy and precision of the test.
The new weight function proposed in this study is of the following form: and, according to its form, this weight function is monotone increasing. For different pairs $(r_{1j}, r_{2j})$ whose sum is $r_j$, the new weight is relatively higher when the difference between $r_{1j}$ and $r_{2j}$ is large than when the two numbers are nearly equal. The stochastic form of the first statistic will be reduced to with its corresponding variance which is as follows: From direct observation, it can be seen that this statistic depends on the variations in the numbers of events in the respective groups, which may explain the predicted sensitivity.
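To make the stated properties concrete, the sketch below uses a purely hypothetical stand-in weight, $w = r/(r_1 r_2)$ with $w = 0$ when either group has nobody at risk. This expression is an assumption, chosen only because it depends on $r_j$, $r_{1j}$, and $r_{2j}$ and, for a fixed sum, grows as the two numbers move apart; it should not be read as the weight function proposed in this study.

```python
def standin_weight(r1, r2):
    """Hypothetical weight built from r_j = r1 + r2, r_1j and r_2j (illustration only)."""
    if r1 == 0 or r2 == 0:
        return 0.0  # no comparison is possible once one group is empty
    return (r1 + r2) / (r1 * r2)


# Pairs with the same total r_j = 20 but growing imbalance between the groups.
for r1, r2 in [(10, 10), (12, 8), (15, 5), (18, 2)]:
    print(r1, r2, standin_weight(r1, r2))
```

With this stand-in, the weight stays small while both groups are large and rises sharply only when few individuals remain at risk, which is the qualitative behavior the text attributes to a weight that reacts to the separate numbers at risk.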
Substituting the new weight function in the general weighted log-rank statistic, we obtain the new statistic which is as follows: or simply

Power and Relative Efficiency of a Test.
The power of a test is by definition $1 - \beta$, where $\beta$ is the probability of a type II error. The weighted log-rank statistic provides the quantities needed to obtain the power. Taking the quantity $U = \sum_{j=1}^{k} w_j\,(d_{ij} - d_j\,(r_{ij}/r_j))$ found in the numerator, we have the corresponding $V = \sum_{j=1}^{k} w_j^2\,(r_{1j}\,r_{2j}\,d_j\,(r_j - d_j))/(r_j^2\,(r_j - 1))$ in the denominator, and they are such that $U/\sqrt{V}$ is asymptotically standard normal [16]. The power of the test statistic is then computed as follows: Since the p value is also one of the ways of testing the hypothesis, it is worth recalling how it is found from the two statistics. With U and V, the one-sided p value is calculated from the standard normal distribution of $U/\sqrt{V}$ [17]. Having two weighted log-rank statistics $T_w$ and $T_l$, the ARE of $T_w$ relative to $T_l$ as proposed by Jiménez et al. [18] is given by

where $\Phi^{-1}$ is the quantile function of the standard normal distribution and $\alpha = 0.05$. Computationally, the power of the Z statistic obtained from the log-rank test is found as follows: where M is the number of simulations performed (for example, 10,000, 5,000, or 1,000); in our computations, we used M = 5000.

Table 1: Tests and weight functions.
Tests | Weight functions
Log-rank | 1
Gehan-Wilcoxon |
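The simulation-based power described here, with M replications and $\alpha = 0.05$, can be sketched as the proportion of simulated Z statistics that exceed the critical value. The function name `empirical_power` and the `two_sided` switch are assumptions; which rejection rule the study used is not restated here, so both are shown.

```python
from statistics import NormalDist


def empirical_power(z_values, alpha=0.05, two_sided=True):
    """Proportion of simulated Z statistics that lead to rejection at level alpha."""
    nd = NormalDist()
    if two_sided:
        crit = nd.inv_cdf(1.0 - alpha / 2.0)
        rejections = sum(1 for z in z_values if abs(z) > crit)
    else:
        crit = nd.inv_cdf(1.0 - alpha)
        rejections = sum(1 for z in z_values if z > crit)
    return rejections / len(z_values)  # len(z_values) plays the role of M


# Hypothetical Z statistics from M = 4 simulated trials.
z_sim = [0.0, 2.5, -3.0, 1.7]
print(empirical_power(z_sim), empirical_power(z_sim, two_sided=False))
```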

Simulation Study Scenario.
The ideal illustration of late separation is depicted in Figure 1. To carry out the simulation study, we used the simsurv R package, which simulates survival times from standard parametric distributions. In our case, we used the Weibull distribution. For one group, the survival times were generated from Weib (1.2, 3.6), while for the second group, 60% of the survival times were generated from Weib (2.9, 5.4) and the remaining 40% from Weib (1.5, 3.6).
In every case, we performed 5,000 simulations, and the analysis was done in R. We considered equal sample sizes in all our simulations. The notation (n1, n2) (c) is used, where n1 = n2 is the number of individuals in each group and c is the overall censoring rate. The censoring rates taken into account are 20%, 40%, and 60%, and c = 0 means that there was no censoring. There are therefore four simulation cases for each sample size. The sample sizes used per group are 20, 50, 80, and 100.
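The data-generating scheme can be sketched in Python with the standard library rather than simsurv. Several points are assumptions made only for illustration: Weib (a, b) is read as (shape, scale), which may differ from the simsurv parameterization; censoring is imposed by marking each observation as censored with probability c at a uniformly drawn earlier time; and the names `simulate_group` and `censor` are hypothetical.

```python
import random


def simulate_group(n, mixture, rng):
    """Draw n survival times from a Weibull mixture.

    mixture: list of (probability, shape, scale) with probabilities summing to 1.
    """
    times = []
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for prob, shape, scale in mixture:
            acc += prob
            if u <= acc:
                break  # component selected
        times.append(rng.weibullvariate(scale, shape))  # stdlib order: (scale, shape)
    return times


def censor(times, rate, rng):
    """Censor each time with probability `rate` at a uniform point before the event."""
    observed, events = [], []
    for t in times:
        if rng.random() < rate:
            observed.append(rng.uniform(0.0, t))
            events.append(0)
        else:
            observed.append(t)
            events.append(1)
    return observed, events


rng = random.Random(2022)
group1 = simulate_group(20, [(1.0, 1.2, 3.6)], rng)
group2 = simulate_group(20, [(0.6, 2.9, 5.4), (0.4, 1.5, 3.6)], rng)
obs1, ev1 = censor(group1, 0.40, rng)
obs2, ev2 = censor(group2, 0.40, rng)
print(sum(ev1), sum(ev2))  # numbers of observed events in each group
```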

Simulation Results.
To make the comparison more visible, Figure 2 shows graphically the variations in power obtained in Table 2. In the plots, NoCens100 stands for no censoring with a sample size of 100 individuals per group, and similarly for 80, 50, and 20. Cens10020 stands for 100 individuals per group with an overall censoring rate of 20%, and the same convention applies to the others.
The new test may be recommended as an alternative test when aiming to detect late differences between treatments. It is a good choice when the sample size becomes smaller; in other words, the new test outperforms the existing ones for small sample sizes (n ≤ 50). To see this more clearly, we used the relative efficiencies of all tests (in power) compared to the standard log-rank test. We mainly look at FH (1, 1) and FH (0, 1), and the new test appears to be relatively more efficient. The standard log-rank test is known to perform best in the case of proportional hazards, but it still keeps some level of power in other scenarios. Even in our case of late differences detection, it was the third choice, after being outperformed by our newly proposed test. Table 3 shows the heatmap of the relative efficiencies of the other tests, at all levels of censoring under consideration, with respect to the LR test.
In Figure 3, the graph on the left is a random simulation for sample size n = 100, while the one on the right is for n = 20; the censoring rate is 20% in both cases. As can be seen, the FH (0, 1) weight, in dashed red, increases gradually, which explains its high power for late differences. The separation of curves usually happens gradually, and hence, as the difference becomes larger, the FH (0, 1) weight becomes larger too.
For the new weight, in solid blue, there is only an abrupt increase at a very small number of time points at the end, while the weight is very small from the beginning of the study. This behavior helps to explain its efficiency for small sample sizes: in such cases, the late separation does not last long, and hence the new weight does not miss many event times of the separation. Apart from this, the new weight could be powerful in the case of an abrupt separation in the very last few event times, which may not happen often in practice. However, in the few cases where it is strong, the new weight can exceed 1, as seen in the graph on the right, where it even reached 2 at one of the last points. The relative weakness of the new weight for large sample sizes clearly lies in its failure to capture some event times at the beginning of the separation, which may normally start around the middle of the study, and in its sensitivity to only a very limited number of last event times, as both graphs show. In contrast, FH (0, 1) gradually captures all of the separation from its onset, as shown by its gradual increase. We recall that the new weight drops to 0 when the number at risk in one of the groups becomes 0, because no comparison is possible from such points onward. To illustrate, assume that the separation happened at time point 30 (graph on the left). We can see how large the difference between the two weights is from then on, and hence the loss of power for the new weight. For the graph on the right, if the separation started from time point 25, for example, the difference between the two weights is not as large as in the left-hand case. Again, the new weight concentrates strictly on the very last few event times, with exceptionally high weights there.
The smaller loss of power for the new weight under censoring comes from the fact that censoring reduces the number of event times; because the new weight needs just the very last few event times, it does not lose as much power as FH (0, 1), which would have benefited from many event times since the beginning of the separation. This is why small sample sizes and heavy censoring favor the new weight, which needs fewer last event times than FH (0, 1). This is not strange, because every weight function excels in power in some circumstances and fails in others. Our newly suggested weight is thus powerful in heavy censoring and/or small sample size cases.

[Figure: Powers for 20%, 40% and 60% censoring rates for n = 100; legend entries include NoCens100 and Cens10060.]

FH (1, 1) is more powerful than LR when the censoring rate is higher than 50%: its relative efficiency exceeded 150% only when the censoring rate was 60%. With no censoring and with 20% censoring, the test was not more efficient than the standard LR test, irrespective of the sample size under consideration. The FH (0, 1) test, usually known to be the most powerful for late differences, still keeps its power, but it is outperformed by the newly proposed test for small sample sizes, that is, for n ≤ 50. We can take two extreme points for the two tests. For n = 100 with no censoring, the RE of FH (0, 1) was 175%, while it was 150% for the new test. The difference in relative efficiency is thus 25% (that is, FH (0, 1) is relatively 25% more powerful than the new test when both are compared to the LR test for n = 100). For n = 20 with a censoring rate of 60%, the RE of FH (0, 1) is 356%, while it is 428% for the new test, so the new test is relatively 72% more powerful than FH (0, 1) when both tests are compared to the LR test.
So, the new test makes a larger difference in relative efficiency where it is relatively powerful than FH (0, 1) does in its own favorable conditions. Given the importance of sample size, the new test may be a good recommendation because of its behavior with small samples and heavy censoring.
To get a more general recommendation between the two tests, we can take an unweighted sum of the differences of relative efficiencies over all cases under study. That is, we take the relative efficiency of FH (0, 1) minus that of the new test (RE (FH (0, 1)) − RE (new test)) in each case and sum up to see which test is generally relatively more efficient. Operating on the data in the heatmap, we obtain −220% in total, which shows that the new test is relatively more efficient than FH (0, 1) in general. This is directly linked to the fact that where the new test is relatively more efficient, it makes bigger differences.
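The unweighted sum of differences can be sketched as follows, using only the two cases whose relative efficiencies are quoted in the text. The dictionary layout and names are assumptions; because the full heatmap is not reproduced here, this partial sum is not the −220% total, which requires every case.

```python
# Relative efficiencies (in %) with respect to the LR test, as quoted in the text.
quoted_re = {
    "n=100, no censoring": {"FH(0,1)": 175, "new": 150},
    "n=20, 60% censoring": {"FH(0,1)": 356, "new": 428},
}

# RE(FH(0,1)) - RE(new test) in each case; negative values favor the new test.
differences = {case: re["FH(0,1)"] - re["new"] for case, re in quoted_re.items()}
partial_sum = sum(differences.values())

for case, diff in differences.items():
    print(case, diff)
print("partial sum over the two quoted cases:", partial_sum)
```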

Application to Real Data.
To check the reliability of the new test, we used two real datasets to make the comparison more reliable. Those datasets are as follows:
(i) Head-and-Neck-Cancer Study by the Northern California Oncology Group (NCOG)
(ii) Time to infection of kidney dialysis patients data
The data from the Head-and-Neck-Cancer Study conducted by the Northern California Oncology Group (NCOG) are found in Efron [19] and have been reused by many other authors, recently including Qian and Zhou [15]. Arm A represents patients who underwent radiation therapy, and those who underwent radiation plus chemotherapy were put in Arm B.
The second dataset, on the time to infection of kidney dialysis patients, is a built-in dataset found in the KMsurv package in R. The groups were formed according to the method of placing catheters in kidney dialysis patients: surgically placed catheters made up group 1 and percutaneously placed catheters made up group 2. The Kaplan-Meier curves for both datasets are shown in Figure 4.
From Figure 4, we notice that for the NCOG data, the curves are close to each other at the beginning but separate later, with Arm B appearing to have higher survival probabilities than Arm A. The two-sided p values for the nine tests have been computed and are given in Table 4.
As seen from the p values in Table 4, the newly proposed test was stronger than all the others, with the smallest p value of 0.0129, followed by the Fleming-Harrington test FH (0, 1) with p value = 0.0223 and lastly by the standard log-rank test with p value = 0.047. This is in accordance with the simulation results, even though the new test here outperforms FH (0, 1), the strongest existing test for late differences.
However, this is not strange, because the difference in power observed in the simulation was not large, so one need not hesitate to recommend the new test as a good choice. The other tests had p values greater than 0.05 because they are usually known to be weak in detecting late differences, which is not surprising given the shape of the two curves. Their failure to detect such a difference may stem from their nature. However, since the difference appears significant at a glance at the graph, if one-sided p values were considered, the majority of these tests could have p values below 0.05, and hence the difference might be detected. In such a case, GW and FH (1, 0) might be the only ones failing to detect the difference. The general observation that remains intact is that the new test performed better than any other test in this case.
As can be immediately observed from the KM curves for the kidney data, the two survival curves cross each other at the early stages, where they are also close to each other. After crossing, they separate quickly, which explains the p value obtained for FH (1, 1) in Table 5. In addition to the two tests expected to detect such differences, another test, FH (1, 1), which is stronger in the detection of middle differences, also detected them. In other words, because it gives heavier weights to middle events, with weights decreasing as events move farther from the median time, it detected the differences in this case: by the middle of the study period, the curves had already separated, as can be seen on the graph. This test was surprising, since it came close to outperforming both expected tests, with a p value of 0.005. However, FH (0, 1) remained the first among the three tests with a p value of 0.0046, and the new test was third with a p value of 0.021. Contrary to the NCOG data, even if we had taken one-sided p values, no change would have been observed in the tests with significant p values.

Figure 4: Graphs for real data application cases.

Conclusion
The newly proposed test is a good alternative for the detection of late differences between survival curves. Like FH (0, 1), it is relatively more efficient and powerful than the LR test, and although a reduction of power as the censoring rate increases is common to all tests, this reduction is relatively small for the new test compared with the others (including the LR test and FH (0, 1)). The new test may, therefore, be the first choice in cases of small sample sizes and heavy censoring rates. The same strength was observed with real datasets, where the new test remained sensitive to late differences in survival. Since small sample sizes and censoring are major threats in survival analysis studies, and given the power and higher relative efficiency of the new test in such cases, one may consider it a better choice for detecting late differences between survival curves.

Data Availability
The data used to support the findings of this study are publicly and freely available. One dataset is accessed through R software, and the other is in the cited research.

Conflicts of Interest
The authors declare that they have no conflicts of interest.