Pediatric and adolescent obesity is a global concern because its health consequences include cardiovascular diseases, diabetes, poor quality of life, disability, mental health problems, and even mortality in adulthood [
Many opine that schools represent “key settings” or “ideal settings” for obesity prevention or intervention, and numerous school-based studies have been conducted worldwide [
The critical issue is that required sample sizes are likely to be underestimated if such clustering effects are not taken into account [
The primary aim of this review is to evaluate the statistical methods applied in published school-based randomized trials addressing weight issues. The evaluation focuses on whether the power analysis and the applied statistical methods appropriately took into account clustering effects, which are often represented by the correlation of subject outcomes within schools, also known as the intracluster (or intraclass) correlation coefficient (ICC) [
We searched PubMed for papers meeting the following inclusion criteria: (1) published in journals indexed in PubMed; (2) published from the PubMed inception year onward; (3) randomized school-based trials; (4) written in English; (5) outcomes involving weight-related issues; and (6) subjects younger than 25 years, which includes young adults.
To this end, we applied a Boolean algebraic combination of MeSH (
We retrieved the full texts of all of those papers, and two authors (MH and SN) applied the following exclusion criteria for the review: (1) protocol paper; (2) not a randomized study; (3) not a school-based intervention; (4) analysis of subjects beyond the age range; (5) outcome not related to weight issues; (6) baseline or subgroup analysis of a parent study; and (7) single-school trial.
When the two raters’ ratings were not concordant due to unclear descriptions in the text regarding study design parameters or analytic methods, the rating was resolved by consensus between them.
We extracted the following study characteristic variables from the included studies: the publication year (the most recent 5 years (2013–2017) versus the prior years (1995–2012)); number of schools; analytic sample size of individuals, defined as the number of completers at the final follow-up when available; randomization level (cluster versus individual); length of follow-up in months (<12 versus ≥12 months); number of repeated measurements including baseline (2 versus >2); location of the trial (the US versus others); and type of school (preschool, elementary/primary school, middle school, high school, and college).
Regarding the randomization level, the “cluster” category includes any higher-level unit in which all subjects received identical interventions or treatments. Therefore, the cluster unit could be schools, classes, health centers, communities, camps, and so on. Nevertheless, for the number of schools variable, we extracted the number of schools involved rather than the number of randomized cluster units. If randomization occurred at the individual/student level within schools or other cluster units, we classified the randomization level for such trials as “individual.” When multiple types of schools were involved in a trial, we classified the trial into the school type of the lowest level. For instance, a trial involving preschools and primary schools was classified as “preschool” for the school type.
With respect to the presence of power analysis and the appropriateness of the applied statistical methods, we extracted the report of power analysis (yes versus no); consideration of the intracluster correlation coefficient (ICC) for power analysis (yes versus no); primary statistical analysis models (multilevel versus no); and the report of the ICC estimated from statistical analysis (yes versus no). If the description of power analysis or sample size determination referred to a protocol paper and was not provided in the text to an evaluable extent, we classified such cases as no report of power analysis. Because the terminology for statistical models is not standardized across papers, we classified the following as “multilevel” models, in which a “clustering” effect at some level of the data hierarchy had presumably been taken into account in the analysis: generalized mixed-effects models, mixed-effects models, linear mixed-effects models, multilevel mixed-effects models, hierarchical models, random-intercept models, mixed models, logistic mixed-effects models, generalized estimating equation models, and the like. We considered the application of those multilevel analyses appropriate. Other models that do not take the clustering effects into account were classified as “no” multilevel model: for example, paired
Descriptive statistics are provided in terms of frequency, percentages, median, and the first (Q1) and the third (Q3) quartiles. There were three missing values for the number of schools, and studies with those missing values were excluded from analyses involving the number of schools. The follow-up length and number of measurements are analyzed as both continuous and dichotomized variables. Comparisons of study characteristics, the presence of power analysis, and appropriateness of statistical methods between the publication years 2013–2017 versus 1995–2012 are made using Wilcoxon rank-sum tests or Fisher exact tests. The same sets of these tests are also applied to testing associations of study characteristics and the presence of power analysis with application of appropriate multilevel analysis. Statistical significance is declared at
The application of the exclusion criteria resulted in 121 papers (Supplementary Materials (available
Reasons for exclusion.
Reasons | n | % |
---|---|---|
Single-school trial | 49 | 34.5 |
Protocol paper | 36 | 25.4 |
Baseline or subgroup analysis of a parent study | 27 | 19.0 |
Not a randomized trial | 18 | 12.7 |
Students of age beyond the age range | 7 | 4.9 |
Not a school-based trial | 5 | 3.5 |
Total | 142 | |
Detailed results are presented in Table
Study characteristics by study years: median (Q1, Q3) or n (%).
Study characteristics | All years (n = 121) | 1995–2012 (n = 63) | 2013–2017 (n = 58) | p |
---|---|---|---|---|
Number of schools | 13 (7, 28) | 14 (8, 31) | 12 (6, 20) | 0.434 |
Analytic sample size | 620 (340, 1182) | 670 (407, 1295) | 610 (310, 1083) | 0.269 |
Follow-up length in months | 12 (6, 24) | 12 (6, 24) | 12 (5, 20) | 0.078 |
Number of repeated measurements | 2 (2, 3) | 2 (2, 3) | 2 (2, 3) | 0.098 |
Randomization at a cluster level | 108 (89.3%) | 57 (90.5%) | 51 (87.9%) | 0.772 |
Follow-up ≥1 year | 68 (56.2%) | 38 (60.3%) | 30 (51.7%) | 0.365 |
Two measurements including baseline | 70 (57.9%) | 32 (50.8%) | 38 (65.5%) | 0.140 |
Trials conducted outside the US | 69 (57.0%) | 32 (50.8%) | 37 (63.8%) | 0.198 |
School type | | | | 0.038 |
Preschool | 11 (9.1%) | 3 (4.8%) | 8 (13.8%) | |
Primary/elementary school | 83 (68.6%) | 50 (79.4%) | 33 (56.9%) | |
Middle school | 20 (16.5%) | 7 (11.1%) | 13 (22.4%) | |
High school | 5 (4.1%) | 3 (4.8%) | 2 (3.5%) | |
College | 2 (1.7%) | 0 (0.0%) | 2 (3.5%) | |
Results from the comparisons of power analysis and application of multilevel models between years are presented in Table
Power analysis and statistical methods by study years: n (%).
Power analysis and statistical methods | All years (n = 121) | 1995–2012 (n = 63) | 2013–2017 (n = 58) | p |
---|---|---|---|---|
Power analysis conducted and reported | 52 (43.0%) | 24 (38.1%) | 28 (48.3%) | 0.258 |
ICC taken into account for power analysis | 26 (21.5%) | 13 (20.6%) | 13 (22.4%) | 0.828 |
Multilevel analysis performed | 83 (68.6%) | 40 (63.5%) | 43 (74.1%) | 0.176 |
ICC estimated from the analysis reported | 10 (8.3%) | 7 (11.1%) | 3 (5.2%) | 0.327 |
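The comparisons of proportions between publication periods rest on Fisher exact tests. As a minimal sketch of that kind of test, the standard-library Python below computes a two-sided Fisher exact p-value for a 2×2 table, using the power-analysis counts from the table above as input. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption; statistical packages may use a different convention, so the result need not match the reported p-value exactly.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed convention: sum the hypergeometric probabilities of every table
    (with the same margins) that is no more probable than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1e-7  # guards against floating-point ties
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + tolerance)
    )
    return min(1.0, p_value)


# Power analysis reported (yes, no) in 1995-2012 and in 2013-2017.
p_power = fisher_exact_two_sided(24, 63 - 24, 28, 58 - 28)
print(f"Two-sided Fisher exact p-value: {p_power:.3f}")
```

The same function can be applied to any other row of counts in the tables by passing the "yes" and "no" counts for each group.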
As presented in Table
Association between multilevel analysis and study characteristics and power analysis: median (Q1, Q3) or n (%).
Study characteristics | Multilevel analysis: yes (n = 83) | Multilevel analysis: no (n = 38) | p |
---|---|---|---|
Number of schools | 16 (8, 32) | 10 (6, 18) | 0.030 |
Analytic sample size | 816 (432, 1323) | 472 (181, 869) | 0.002 |
Follow-up length in months | 12 (6, 24) | 7 (4, 18) | 0.014 |
Number of measurements | 2 (2, 3) | 2 (2, 3) | 0.558 |
Randomization at a cluster level | 82 (98.8%) | 26 (68.4%) | <0.001 |
Follow-up ≥1 year | 52 (62.7%) | 16 (42.1%) | 0.048 |
Two measurements including baseline | 47 (56.6%) | 23 (60.5%) | 0.843 |
Trials conducted outside the US | 47 (56.6%) | 22 (57.9%) | 1.000 |
Power analysis conducted and reported | 39 (47.0%) | 13 (34.2%) | 0.236 |
ICC taken into account for power analysis | 23 (27.7%) | 3 (7.9%) | 0.016 |
ICC estimated from the analysis reported | 10 (12.1%) | 0 (0.0%) | 0.030 |
The primary finding of this review is that the proportion of studies that failed to apply multilevel models when analyzing school-based data is relatively high, at 31.4%. In addition, even when multilevel models were applied, the levels at which clustering effects were specified were rarely described (data not shown). For instance, only the clustering effects of the highest-level units of the data hierarchy seem to have been taken into account, ignoring potential additional clustering effects of lower-level units. Taken together, significance of findings based on
The magnitudes of the ICCs hypothesized for power analysis were mostly low, and the rationales for such hypothesized ICCs were seldom clearly described. The dearth of rationale might have been due to a lack of information, published or not, on ICC estimates pertinent to the studies. This is reflected in the very low proportion of studies (8.3%) that reported ICCs estimated from their data analysis. Reporting the ICC estimated from data analysis is therefore critical for designing future studies with adequate sample sizes. Even if the ICC appears to be very small, it needs to be accounted for in power analysis. For example, when the number of students is as small as 30 per school, the number of required schools with ICC = 0.01 increases by 29% for the same power, compared to when ICC = 0, regardless of the hypothesized effect size. Therefore, the impact of a small ICC on the sample size can be substantial.
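The 29% figure is consistent with the standard design effect for equal cluster sizes, 1 + (m − 1) × ICC, where m is the number of students per school. The sketch below works through that arithmetic. The function names and the figure of 600 students required under independence are hypothetical, chosen only for illustration, and equal cluster sizes are assumed.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation factor 1 + (m - 1) * ICC for equal cluster sizes."""
    return 1.0 + (cluster_size - 1.0) * icc


def schools_required(n_independent: float, cluster_size: int, icc: float) -> int:
    """Schools needed after inflating an independence-based sample size."""
    inflated = n_independent * design_effect(cluster_size, icc)
    return math.ceil(inflated / cluster_size)


students_per_school = 30
deff = design_effect(students_per_school, 0.01)
print(f"Design effect at ICC = 0.01: {deff:.2f}")  # 1.29, a 29% increase

# Hypothetical total of 600 students needed if outcomes were independent.
for icc in (0.0, 0.01, 0.05):
    n_schools = schools_required(600, students_per_school, icc)
    print(f"ICC = {icc:.2f}: {n_schools} schools")
```

Because the design effect multiplies whatever sample size the independence-based calculation yields, the relative increase does not depend on the hypothesized effect size.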
Detailed design characteristics were often deferred to protocol papers published earlier, so the adequacy of power analysis could not be evaluated from the text of the outcome papers. Classifying such outcome papers as having no report of power analysis might have underestimated the proportion of studies with power analysis, because the protocol papers might have properly reported it. However, we surmise that it is fairly rare for a school-based study to be conducted exactly as planned or designed in the protocol paper, because school environments are dynamic and changes may alter aspects of the design or analysis plans, including sample size determination/power analysis, statistical methods, and outcome parameters. If outcome papers do not clearly delineate these aspects, it may be unclear whether the power analysis in the earlier protocol paper is appropriate for the statistical methods applied to the analysis of trial outcomes. We believe that this issue is a research topic worthy of investigation. In any case, the following key design elements should be reported in outcome analysis papers: target power, significance level, hypothesized effect sizes and ICCs and their rationales, the number of clusters, the number of levels, anticipated attrition rates, and the planned sample size of the subjects. Such a description enables readers to evaluate clearly whether the study was analyzed as designed, without being forced to compare it against the detailed elements described in the protocol paper.
Many trials were excluded from evaluation because they were conducted in a single school. These studies were mostly college-based trials, likely because multicollege trials are more difficult to conduct than trials in other types of schools. In single-school/cluster studies, it is not possible to take the clustering effect into account in either the power analysis or the statistical analysis, nor is it possible to estimate ICCs from the data. Although this raises the broader question of the extent to which clustering effects should be considered in general, the limitation should at least be addressed in terms of the limited generalizability or transportability of findings from single-school trials. Identical or similar findings from replicate studies in other school settings would validate the findings.
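To illustrate why an ICC cannot be estimated from a single school, the sketch below implements the one-way ANOVA estimator of the ICC, which requires between-cluster variation and therefore at least two clusters. The function name `anova_icc` and the toy data are hypothetical, and the ANOVA estimator is assumed here only as one common choice; it is not necessarily the estimator used in the reviewed trials, which may instead have derived ICCs from mixed-effects models.

```python
from typing import Sequence


def anova_icc(clusters: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA estimate of the ICC from outcomes grouped by cluster."""
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are needed to estimate an ICC")
    sizes = [len(c) for c in clusters]
    if min(sizes) < 1:
        raise ValueError("every cluster must contain at least one observation")
    total = sum(sizes)
    if total <= k:
        raise ValueError("within-cluster replication is needed")

    grand_mean = sum(sum(c) for c in clusters) / total
    means = [sum(c) / len(c) for c in clusters]

    # Between- and within-cluster sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total - k)

    # Adjusted average cluster size, which handles unequal cluster sizes.
    n0 = (total - sum(n * n for n in sizes) / total) / (k - 1)

    denominator = ms_between + (n0 - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variation in the outcome; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Toy outcomes for two hypothetical schools with three students each.
print(f"Estimated ICC: {anova_icc([[1, 2, 3], [4, 5, 6]]):.3f}")
```

Passing a single cluster raises an error because the between-cluster mean square has zero degrees of freedom, which mirrors the limitation of single-school trials described above.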
Our findings also have implications for doctoral programs training future obesity researchers, particularly those conducting school-based interventions. The programs include but are not limited to health, clinical, and school psychology subspecialties. Researchers should ensure that the most rigorous and appropriate methodologies, including issues related to clustering addressed in this study, are included as part of the core curriculum. In this way, students learn early in their careers the practice of reporting such information. To this point, the American Psychological Association (APA) recently released two APA Publications and Communications Board Task Force reports addressing standards for reporting study results. Separate reports were made for quantitative studies [
Although our review is confined to school-based trials, the findings may be applicable to other cluster randomized trials using different settings and different types of interventions and treatments in other research areas [
There are limitations that should be considered when interpreting the results of this review. First, the search strategy was not comprehensive, as it included only papers indexed in PubMed, and the search keywords may have been coarse. Therefore, studies that would have been eligible for our evaluation might not have been captured, and the scope of potentially missing studies is unknown. As is the case for review papers in general, subjective ratings may not be completely avoidable even when efforts are made to minimize misclassification errors. For instance, we did not evaluate whether potential confounding factors were appropriately controlled for in the data analysis. Such an evaluation would rely more heavily on subjective judgment, would require substantial knowledge of the research topics under study, and might make consensus difficult to reach. Lastly, as noted above, the proportion of studies with a reported power analysis might have been underestimated because studies that deferred the detailed power analysis to a protocol paper without a minimum level of description in the text were counted as papers with no report of power analysis.
In conclusion, the application of multilevel models to the analysis of school-based trials appears to have been inadequate so far. Key elements such as the hypothesized ICC for power analysis or sample size determination and the ICC estimated from data analysis are missing from a majority of studies. Future school-based trials should specifically address these issues at both the design and analysis stages, preferably adhering to the extended CONSORT guideline to increase the rigor and reproducibility of experimental settings and study findings. Clinical implications drawn from school-based trials with rigorous design, conduct, and analysis would be the most useful for advancing knowledge on preventing and treating the pediatric and adolescent obesity epidemic.
The data analyzed for the present study, along with the programming code, will be available upon reasonable request.
The opinions expressed are those of the authors and not necessarily of the NIH or any other organization.
Dr. David B. Allison has received personal payments or promises for the same from IKEA, Law Offices of Ronald A. Marron, Nestle, Paul Hastings LLP, and Tomasik Kotin Kasserman, LLC, and multiple NIH grants to teach, develop, apply, and evaluate statistical methods.
This study was in part supported by R01DK097096, P30DK111022, UL1 TR001073, R25DK099080, and R25HL124208.
This material contains the list of 121 papers that we reviewed and evaluated for this paper.