Assessing outcome in randomized clinical trials: Inflammatory bowel disease

EJ IRVINE. Assessing outcome in randomized clinical trials: Inflammatory bowel disease. Can J Gastroenterol 1993;7(7):561-567. Methodological principles or standards to assess the quality of a clinical trial must be simple and user-friendly if we expect busy clinicians to adopt them. When assessing the success or failure of a new treatment in inflammatory bowel disease, the patient characteristics, therapeutic strategy, outcome assessment and interpretation of results are unique to the disease and the study. Because inflammatory bowel disease rarely is fatal, we use surrogate markers, such as disease activity index, endoscopic appearance, histology, blood tests, tissue markers, need for other medications and quality of life scores, to assess treatment efficacy. Primary and secondary outcomes must be identified and uniformly evaluated to ensure unbiased objective assessment. Careful a priori definition of outcome events is essential. The statistical analysis depends on the study design and on the type of data that constitute the outcome events. Both intention to treat and efficacy analyses should be performed. Interpretation of results should address both clinical and statistical importance.


GOALS OF THERAPY
As responsible clinicians with a general objective to improve our patients' health state, we read relevant literature to optimize therapy for different groups of patients and to individualize therapy for our patients with inflammatory bowel disease (IBD).
The questions we pose which are germane to patients with Crohn's disease (CD) or ulcerative colitis (UC) are the following: How can we prevent a disease exacerbation? How can we induce a remission? How do we prevent or minimize the complications of disease, such as cancer in UC or obstruction in patients with CD? Can we avoid treatment toxicity, such as moon facies, acne, hirsutism, weight gain or other unpleasant effects of corticosteroids? How can we improve this patient's quality of life? What is the cost to this patient, to the hospital, to the pharmaceutical or insurance company or to society as a whole for a particular treatment? What are the costs and benefits of one treatment strategy over another?

DEFINING AN OUTCOME EVENT
To answer these questions, we must more closely examine outcome measurement. The modalities to assess success or failure of a new therapy in IBD are unique to the disease, while the statistical methods, interpretation and distribution of the results to the medical community are not.
A generalist might consider a treatment 'success' or 'failure' as the sum of events occurring after a 'treatment intervention'. A purist would demand more careful dissection of phenomena into those that were directly attributable to the treatment and those which were likely coincidental. As well, quantification of the efficacy or lack of efficacy of the new treatment would enhance the meaning of the results. A rudimentary approach to measuring outcome has been to examine the number of fatal versus nonfatal events (mortality) after an intervention. However, as IBD rarely is fatal, staggering numbers of patients would be required for such a clinical trial, and the potential benefit (or harm) of the drug would be overlooked if it neither reduced nor increased mortality.
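The point about 'staggering numbers' can be illustrated with a rough sample size calculation. The following is a minimal sketch, assuming the standard normal-approximation formula for comparing two proportions with a two-sided test; the function name and the event rates are hypothetical and chosen only for illustration, not taken from any trial.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to compare two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type II error
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical rates: a rare fatal outcome versus a common surrogate outcome.
print("mortality 1.0% vs 0.5%:", patients_per_arm(0.01, 0.005), "per arm")
print("exacerbation 40% vs 20%:", patients_per_arm(0.40, 0.20), "per arm")
```

Under these assumed rates, halving a rare event requires thousands of patients per arm, whereas halving a common surrogate event requires fewer than one hundred.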

SURROGATE OUTCOME MEASUREMENT
Outcome assessment in IBD, as for other disease entities, has been replaced by 'surrogate markers', ie, symptoms, signs or laboratory test results which are important measurements of health status for that particular disease.
For IBD, these could include stool frequency, presence of blood in stools, fever, extraintestinal complications, abdominal tenderness, mass or blood parameters, such as erythrocyte sedimentation rate or orosomucoid. More recently, functional change, such as the ability to attend work or school, or changes in quality of life index are becoming more widely accepted.
A fundamental trait of an outcome measurement is that it reflects the effectiveness of the intervention, ie, the new drug or surgical technique. To decide the best outcome, it is helpful to know something of the disease biology and the expected mechanism of the drug or intervention. Although it was appreciated early that sulfasalazine was an 'anti-inflammatory drug' with potent benefits in the treatment of active UC, it was only later discovered that the active moiety was 5-aminosalicylic acid (5-ASA) (1), and later still that some of its effect was due to inhibition of prostaglandin and leukotriene production (2,3). Thus, early outcomes or measurements of health status changes primarily included clinical disease activity index; later, when the mechanisms of drug action were the focus of research questions, blood and tissue inflammatory mediators were more appropriate.
Other features of outcome measurement are variably present. Less commonly, an outcome measurement may represent a single state which was not present at the outset of the trial. An example of this would be a clinical trial testing a new drug to prevent disease exacerbation in UC. Although exacerbation would not have been present in patients at entry, careful definition of 'remission' and 'exacerbation' at entry would be essential to permit recognition of an 'outcome event' and subsequent assessment of the drug efficacy.
In many studies, a comparison is made between the health status at the beginning of the trial (or baseline) and at the end, in which we look for 'improvement' or 'worsening'. This often is referred to as a 'transition event'. Most clinical trials are of a pre-specified duration with repeated measurements over time, sometimes called 'time series' or 'time dependent' measurements.
Other general descriptors for surrogate outcomes reflect our common sense aspirations to prove the intervention beneficial, such as a desirable outcome (benefit) in which the patient improves, or an undesirable outcome (risk) in which the patient worsens or suffers a side effect. These effects may be anticipated if we have prior knowledge of the drug mechanism or toxicity, or unanticipated if they are unexpected or idiosyncratic.

SELECTING AN OUTCOME EVENT
Selecting the tools to measure study outcome (Table 1) is crucial to the scientific rigour and results of a clinical trial, and must be decided before a study begins. The criteria must be objective and interpretable by all participating investigators (or readers) in the same fashion. Thus, clearly defined criteria are necessary. There must be adequate patient follow-up and uniform application of outcome assessment. Subjects dropping out of trials prematurely often do so because a drug is ineffective or has an unpleasant side effect. The importance of blinding outcome assessment bears repeating: it safeguards the objectivity on which a clinical trial result may perilously hang. Finally, this assessment often is undertaken by an adjudication process or a committee to ensure fairness and standardized application of outcome criteria.
When assessing results of (or participation in) a clinical trial, how do we recognize that the outcome events selected were the most appropriate? Is a decrease in the CD Activity Index (CDAI) of 50 points clinically important? Does such a result suggest we should advocate or avoid the treatment? Consider the following hypothetical questions which may have been posed for two clinical trials in IBD.
Question one: Does drug 'X', at a dose of 1 g tid orally, induce remission within eight weeks in patients who have a nonsevere exacerbation of extensive UC? We can discern that the study patients have UC and are experiencing a mild or moderate exacerbation. Drug X will be given orally for eight weeks at 1 g tid. What should be our outcome events? Question two: Does drug 'Y', at 500 mg bid given orally for one year, prevent exacerbation of CD? This study will examine patients with CD who are in remission and who will receive drug Y for one year. Thus, outcome assessment in the two trials will necessarily be different.
In the first study, we expect the patients with UC exacerbation at baseline to be completely well or improved after eight weeks. Thus, we must define 'remission', 'improvement' and 'exacerbation' for these time-dependent events, using the same yardstick. Such a trial would likely assess several outcome events, such as a disease activity index of symptoms and physical signs, endoscopic appearance, possibly histology and adverse drug effects. It must be decided beforehand which combination of these four gauges is the most important and worthy of the title 'primary outcome'. Other secondary outcomes might lend support to the primary measurements but, in the event of a discrepancy between primary and secondary outcomes, would not sway beliefs about the drug efficacy. These may be softer measures, such as quality of life index, biochemical test results, or number of hospitalizations or surgeries for UC. We might arbitrarily define improvement as a "decrease in the disease activity index (4) of, say, five points together with improvement in endoscopic appearance of one grade (5)". Remission could be defined as a score of '0' on the clinical index, normal endoscopic appearance and no histological activity.
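The arbitrary definitions of improvement and remission given above can be written down as an explicit, pre-specified rule, which is one way to make sure every investigator applies the same yardstick. The sketch below is only an illustration: the class and function names are hypothetical, and it assumes that a normal endoscopic appearance is coded as grade 0 and that higher grades are worse.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    activity_index: int      # clinical disease activity index score
    endoscopy_grade: int     # assumed coding: 0 = normal, higher = worse
    histology_active: bool   # True if histological activity is present


def classify_response(baseline, week8, min_index_drop=5, min_endoscopy_drop=1):
    """Apply the a priori definitions to baseline and week-8 assessments."""
    # Remission: clinical index of 0, normal endoscopy, no histological activity.
    if (week8.activity_index == 0 and week8.endoscopy_grade == 0
            and not week8.histology_active):
        return "remission"
    # Improvement: index falls by five points AND endoscopy improves one grade.
    index_drop = baseline.activity_index - week8.activity_index
    endoscopy_drop = baseline.endoscopy_grade - week8.endoscopy_grade
    if index_drop >= min_index_drop and endoscopy_drop >= min_endoscopy_drop:
        return "improvement"
    return "no improvement"


baseline = Assessment(activity_index=12, endoscopy_grade=2, histology_active=True)
print(classify_response(baseline, Assessment(0, 0, False)))
print(classify_response(baseline, Assessment(6, 1, True)))
print(classify_response(baseline, Assessment(9, 1, True)))
```

Because the thresholds are function parameters fixed before the data are seen, the same rule is applied to every patient regardless of treatment arm.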
Turning to the second question, let us examine the outcome measurements used in the Canadian Crohn's Relapse Prevention Trial of chronic low dose cyclosporine administration to prevent disease worsening (6). The primary outcome measurements in that trial were the median time to exacerbation, the proportion of patients having an exacerbation, and the frequency and severity of adverse events over the 18 months of the trial. Exacerbation was defined as an increase in the CDAI (7) of 100 points above baseline. Patients at entry were stratified by activity index (CDAI ≤150, CDAI >150). Adverse events were a descriptive list of events classed as requiring premature withdrawal or not. Secondary outcome measurements included quality of life index (8,9), mean dose of corticosteroids or 5-ASA drugs, and the frequency of hospitalization, surgery or death. These outcomes permitted us to assess the impact of low dose cyclosporine use and to demonstrate that it provided no additional benefit in patients already stable at study entry (6).
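The exacerbation rule and entry stratification described for this trial can be sketched in a few lines. This is a hedged illustration, not the trial's actual analysis: the function names and patient data are hypothetical, a rise of exactly 100 points is assumed to count as an exacerbation, and the median shown is a naive one among patients who exacerbated, whereas a real analysis of time to exacerbation would account for censoring.

```python
from statistics import median

EXACERBATION_RISE = 100   # CDAI points above baseline
STRATUM_CUTOFF = 150      # entry stratification threshold


def stratum(baseline_cdai):
    """Entry stratum by baseline CDAI."""
    return "CDAI <=150" if baseline_cdai <= STRATUM_CUTOFF else "CDAI >150"


def time_to_exacerbation(baseline_cdai, visits):
    """Month of the first visit meeting the rule, or None if never met.

    visits is a chronological list of (month, cdai) pairs.
    """
    for month, cdai in visits:
        if cdai - baseline_cdai >= EXACERBATION_RISE:
            return month
    return None


# Hypothetical follow-up data: patient -> (baseline CDAI, visits).
patients = {
    "P01": (120, [(3, 150), (6, 230), (9, 260)]),
    "P02": (180, [(3, 190), (6, 200), (9, 210)]),
    "P03": (140, [(3, 245), (6, 250)]),
}
times = {pid: time_to_exacerbation(b, v) for pid, (b, v) in patients.items()}
events = [t for t in times.values() if t is not None]
proportion = len(events) / len(patients)
naive_median = median(events)   # ignores censoring; illustration only

for pid, (b, _) in patients.items():
    print(pid, stratum(b), "exacerbation month:", times[pid])
print("proportion with exacerbation:", round(proportion, 2))
print("naive median time (months):", naive_median)
```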
The outcome events must therefore be linked to the questions of the study, be objective and preferably be quantitative.

THE IMPORTANCE OF COMPLIANCE
Compliance occurs at several levels in clinical trials. It includes adherence to the protocol by clinicians and patients, especially the taking of study medication but also keeping appointments and avoiding other nonstudy interventions. Compliance may be considered an outcome measurement in its own right as it may completely explain the results of a study. Thus, it must also be measured objectively when possible, for example by pill count or blood or urine testing, to prevent bias. If drug A appears to be no better than placebo, it is possible that patients presumed to be taking the drug were noncompliant or failed to continue in the trial due to an undesirable side effect, or that those who were 'on placebo' bought or somehow procured and took the active drug.
Criteria for mild and severe UC were developed in an early index (10). However, as with most categorical or 'ordinal' indices, moderate exacerbation was ill-defined, lying somewhere between 'mild' and 'severe', and patients often had some, but not all, features in a single category. Thus, defining improvement or worsening based on such indices was difficult except when patients were in complete remission with absence of all features of active disease. A second generation index was the St Mark's index for extensive UC (Table 2), which graded 11 different features of the disease, including sigmoidoscopy, from 0 to 3, with a full score range from 0 to 22 (4). This permitted a clearer quantitative definition of disease activity or change in activity, ie, improvement or worsening. However, most features of this index are not pertinent for patients with limited colitis or proctitis since few such patients experience anorexia, abdominal tenderness, fever, nausea or signifi-
The CDAI (7), although not used by most of us in clinical practice, is the one most familiar from reported studies and includes eight features of disease activity which are variably weighted to yield a score range from 0 to about 700 (Table 3). Active disease is considered greater than 150 and severe disease greater than 450. The simple index (12) has reduced the number of features from eight to five, and eliminated the weighting of scores and the need for patients to keep a diary (score range 0 to 29). Both indices have been shown to be reliable in clinical trials (as have other similar indices). Garrett and Drossman (13) have summarized the limitations of these and other scales, which have included lack of validation in some cases, substantial interobserver variability and inability to detect functional disability. Some scales show poor responsiveness to change in clinical status or fail to gauge symptoms such as intestinal obstruction or perianal disease (13). The author's group (14) recently developed a simple five-point index to evaluate change in perianal disease activity.
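The CDAI cutoffs quoted above lend themselves to a simple banding rule. The sketch below assumes only the two thresholds stated in the text (greater than 150 for active disease, greater than 450 for severe disease); the function name and the label 'not active' for scores of 150 or less are hypothetical choices made for illustration.

```python
def cdai_category(score):
    """Map a CDAI score to the activity bands quoted in the text."""
    if score > 450:
        return "severe"
    if score > 150:
        return "active"
    return "not active"   # label assumed here; the text does not name this band


for score in (90, 150, 220, 460):
    print(score, cdai_category(score))
```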

EXAMPLES OF OUTCOME EVENTS IN IBD
Radiological assessments: Radiological assessment, such as double contrast barium enema in IBD, undoubtedly is more suited as a diagnostic, rather than an evaluative, tool. For safety reasons, it is undesirable to undertake repeated studies which expose patients to ionizing radiation. Barium studies are useful to evaluate small bowel CD because other techniques are relatively insensitive and the small bowel is relatively inaccessible to them. Most radiological features, however, are nonresponsive to change except in long term studies of many months (15). The computed tomography (CT) scan is expensive and requires considerable radiation exposure. CT, ultrasound or magnetic resonance scanning might reveal nonspecific colonic wall or mesenteric thickening in IBD, or the presence of abscess or fistula in the abdomen, pelvis or perianal regions. For patients with colonic disease, fistula or abscess, indium-111 scanning or fecal indium quantitation are useful for disease localization or severity but, like x-rays, are poorly conducive to repeated measurements (16). Intestinal permeability studies using oral, intestinal or colonic administration of probe molecules, like 51Cr-EDTA, polyethylene glycols, mono- and/or disaccharides with quantification of urinary recovery, also yield a measure of disease activity, but one which may be affected by the use of nonsteroidal or other drugs (17).
Endoscopic assessment also permits procurement of intestinal mucosal biopsies for histological and inflammatory mediator measurements. Histological evaluation in UC has been standardized and, although minor adaptations have been reported, the one used by Riley et al (Table 5) (20), which was validated during their study, has appeared frequently in the literature. Histology has played a lesser role in the evaluation of change in CD activity since the site of disease limits the availability of tissue, endoscopic biopsy evaluates only the mucosa and not the full wall thickness, and evaluation is complicated by the focality of the disease and sampling bias.
Biochemical assessments: Biochemical surrogate marker levels in both CD and UC often are increased due to other conditions, but because these processes are automated, there is less bias and random error in their measurement (21,22). For example, low hemoglobin may be due to blood loss, type of drug therapy or nutritional parameters, and depends somewhat on disease extent. Erythrocyte sedimentation rate has a moderate correlation with disease activity except in proctitis, while platelet count tends to be elevated in more severe colonic disease. The seromucoids, such as the acid alpha-glycoprotein

QUALITY OF LIFE MEASUREMENT IN IBD
Because of limitations of disease activity indices and the chronicity of IBD, several disease-specific quality of life indices have been developed (9,22,23). Although several general instruments, like the Sickness Impact Profile, and other disease-specific instruments have been assessed, only the McMaster Inflammatory Bowel Disease Questionnaire (IBDQ) (9) was developed specifically as an outcome measurement for clinical trials. Quality of life instruments are generally patient-reported (subjective) and quantitative instruments which evaluate patient satisfaction and function in the physical (eg, pain, diarrhea), social (ability to work, attend social engagements) and emotional (anger, irritability) spheres.
The IBDQ score (range 32 to 224; a higher score indicates better quality of life) in a group of healthy volunteers ... did not experience a flare-up over 18 months (24). In a second study, patients experiencing a mild to moderate exacerbation of CD had a significant improvement of 34% (P<0.05) in the total IBDQ score after two weeks of prednisone therapy (25).

ADVERSE TREATMENT EFFECTS
Clearly, the appraisal of adverse effects of therapeutic interventions in the existing literature lacks sophistication and is, perhaps, at a level more rudimentary than even the early activity indices. Most often, adverse effects are described as 'deaths', 'life-threatening adverse events' or a frequency distribution of self-reported symptoms displayed as occurring in patients taking either active or placebo therapy. Although this is a practical foundation on which to build, the full spectrum of toxicity, as for efficacy, must be evaluated. Additional features should include frequency and cause of fatal or nonfatal incidents, frequency and grading of severity of the latter, whether they occur early or late after drug initiation and whether they appear to be dose-related or idiosyncratic. Further elucidation of requirements for complete discontinuation of therapy, dose reduction, the degree of reversibility and whether there are potential interactions with other drugs or implications for child-bearing women is essential. Standardized instruments, like the quantitative activity or quality of

CLINICAL VERSUS STATISTICAL SIGNIFICANCE
As clinicians, we must decide when to adopt a particular management strategy for an individual patient. As investigators, we must decide whether study results support the success or failure of a certain therapy. Statistical analysis is a practical mathematical means to assess success or failure objectively. Nevertheless, it is the clinician who must decide whether 40% of patients in remission taking treatment A versus 35% taking treatment B, or a difference in CDAI of 30 points between two treatment strategies, is sufficient to justify the associated adverse effects or to choose one or the other treatment, despite P<0.05. Clinical significance reflects the relevance of the primary and secondary outcome measurements to the research question, while statistical significance depends on the number of patients, type of data collected, choice of statistical test and arbitrary predefined test parameters (like type I and II error), as well as on the success or failure of the treatment.
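The 40% versus 35% remission example can be used to show how statistical significance depends on the number of patients while the clinical difference stays the same. The following minimal sketch assumes a pooled two-proportion z-test, one common choice for comparing two remission rates; the function name and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided P value from a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same 40% vs 35% remission rates, with increasing numbers per arm.
for n in (100, 1000, 5000):
    p = two_proportion_p_value(round(0.40 * n), n, round(0.35 * n), n)
    print(f"n per arm = {n}: absolute difference = 5 points, P = {p:.4f}")
```

With these assumed sample sizes the same five-point difference is not significant at 100 patients per arm but is at 1000 per arm, so the P value alone cannot tell the clinician whether the difference justifies the associated adverse effects.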
Guidelines for the use of statistical tests are presented in Table 6. Selection of statistical test is dependent on the study design and the type of data which constitute the outcome events. There are four different types of data. Nominal data, such as sex, site of disease or smoking status, describe features that are absent or present. Ordinal or categorical data, such as mild, moderate or severe disease, categorize a particular feature and imply a direction of change, ie, mild is better than moderate or severe. Interval data, such as the difference in CDAI or UC activity score between the end of the study and baseline, tell us the amount of rise or decrease in the activity, while ratio data have well-defined anchor points, such as the normal hemoglobin (~150 g/L) or absolute CDAI score (normal is 0, >150 implies active disease). For each data type and study design, there are one or more appropriate summary statistics best suited to the analysis.
Data analysis and presentation of results must also answer questions of clinical importance (Table 7). Readers or investigators must be able to tell how missing data and patients lost to follow-up, dying of unrelated causes or taking additional treatments were handled in the analysis. It also is important to correct for multiple statistical testing to minimize the risk of a type I error. Omission of such information from study results can alter the interpretation of outcome dramatically. Reporting the balance of prognostic variables between treatment groups, eg, the number of patients who are steroid-dependent or taking 5-ASA versus those who are nonsteroid-dependent or not on 5-ASA, also has important effects on the final result. Statistical evaluation should include both intention to treat and efficacy analyses. The former assesses whether patients are better because of the original treatment group assignment (and thus accounts for patients dropping out due to lack of drug efficacy or side effects), while the latter tests whether the drug worked in those patients who actually took it. While clinicians may not appreciate the subtle differences in the more complex statistical tests, they can at least recognize the common and notable flaws in data analysis.

INTERPRETATION OF RESULTS
Finally, the study results should be interpreted objectively. The discussion section of a manuscript should identify the important findings, but also the limitations, of the study, and why these results are consistent or incongruent with findings of others. They should be presented clearly for the clinician as well as the investigator. The reader (investigator) must question the validity of the conclusions. Problems or new questions to be addressed by future research should be identified. When tying together the results of a study, as clinicians, we ought to be able to predict the impact of the results on state-of-the-art clinical practice. Alternatively, are we satisfied that a negative study has been adequately explained?

TABLE 3
Crohn's Disease Activity Index

TABLE 6
Statistical guidelines

TABLE 7
Statistical
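The difference between the intention to treat and efficacy analyses discussed under clinical versus statistical significance can be made concrete with a small worked example. This is a sketch under stated assumptions: the class, function and data are hypothetical, and patients who dropped out are counted as treatment failures, which is one common convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    assigned: str     # arm assigned at randomization: "drug" or "placebo"
    completed: bool   # True if the assigned treatment was taken as planned
    remission: bool   # outcome; dropouts are assumed to be failures


def remission_rate(patients, arm, completers_only=False):
    """Intention to treat by default; efficacy analysis if completers_only."""
    group = [p for p in patients
             if p.assigned == arm and (p.completed or not completers_only)]
    return sum(p.remission for p in group) / len(group)


# Hypothetical trial: 10 patients per arm, more dropouts on active drug.
patients = (
    [Patient("drug", True, True)] * 5
    + [Patient("drug", True, False)] * 1
    + [Patient("drug", False, False)] * 4
    + [Patient("placebo", True, True)] * 3
    + [Patient("placebo", True, False)] * 6
    + [Patient("placebo", False, False)] * 1
)

for arm in ("drug", "placebo"):
    itt = remission_rate(patients, arm)
    efficacy = remission_rate(patients, arm, completers_only=True)
    print(f"{arm}: intention to treat = {itt:.2f}, efficacy = {efficacy:.2f}")
```

In this made-up data set the drug looks considerably better among those who actually took it than among all patients assigned to it, which is why reporting both analyses, and how dropouts were handled, matters.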