Confidence Intervals and Smallest Worthwhile Change Are Not a Panacea

Recently, a group of editors from physiotherapy journals wrote a joint editorial on the use of statistics in their journals. Like many editorials before them, the editors, who are not statistical experts themselves, put forth numerous recommendations to physiotherapy researchers on how to perform and report their statistical analyses. This editorial unfortunately suffers from numerous mischaracterizations and outright falsehoods regarding statistics. After a thorough review, two major issues appear throughout the editorial. First, the editors incorrectly state that the use of confidence intervals (CIs) would alleviate some of the issues with significance testing. Second, the editors incorrectly assume that "smallest worthwhile change" statistics are immutable facts related to some ground truth of treatment effects. In this critical review, we briefly outline some of the problematic statements made by the editors, point out why it is premature to adopt an estimation approach relying on a minimal clinically relevant difference, and offer some simple alternatives that we believe are statistically sound and easy for the average physiotherapy researcher to implement.

We read with interest the recent Editorial written by the Editor-in-Chief members of the International Society of Physiotherapy Journal Editors (hereafter referred to as "the Editorial"). We applaud the author group for encouraging clinical researchers to look beyond null-hypothesis significance testing (NHST) and into the realm of effect estimation. In the Frequentist framework, NHST and effect estimation are two sides of the same coin with fundamental mathematical relationships. As methodological tutorials have described previously (Rafi & Greenland, 2020), using estimation or an "unconditional" approach to reporting statistics is a valid alternative to NHST. However, the Editorial (2022) also contains a multitude of incorrect or misleading statements, and its central thesis, that Frequentist confidence intervals (CIs) should always be contrasted against a point estimate of the Smallest Worthwhile Effect (SWE), could be problematic. In this short response, we briefly detail a non-exhaustive list of misleading statements in the Editorial (2022) and expand on the statistical issues with immediately replacing NHST with an examination of CI overlap with SWE metrics.

Misleading Statements about Statistics
At a foundational level, the goal of NHST is to make inferences with an eye towards error control, while the goal of estimation, whether Frequentist or Bayesian, is to quantify the magnitude of an effect and the uncertainty of that estimate. As the Editorial (2022) itself points out (page 2, paragraph 6), there is a mathematical relationship between the p-values calculated through NHST and the confidence intervals around model estimates (Altman & Bland, 2011). For this reason, the number of misstatements within the Editorial (2022) regarding NHST and CIs is surprising. For example, Table 1 in the Editorial (2022) states "Statistically significant findings are not very replicable"; however, when exactly reproducing a study repeatedly in the same population with different samples, one would have the exact same replication characteristics for both p-values and CIs. This appears to be a misinterpretation of Boos and Stefanski (2011), which was primarily focused on the reported precision of p-values (i.e., reporting p = 0.0123 vs. p < 0.025) and on how the average study is underpowered (~67% power), such that an exact replication is unlikely to yield a significant result.1 This is a point directly addressed by Lakens (2022) in his review of the Editorial, but it was seemingly misinterpreted again in the Editors' response (2022). The authors also seem to forget that a move to CIs would suffer from these exact same issues and would not magically solve the problem of replicability (Hoekstra et al., 2014; Morey et al., 2015). Table 1 also makes some very peculiar assumptions about interventions in clinical trials by stating, without evidence, that "Almost all interventions would be expected to have some effect, even if that effect was trivially small".2 It is possible this is inelegant wording, and the intention was to state that, within a given trial, it is highly unlikely for a measured construct to be exactly nil.
It could be that the Editors (2022) are making a vague allusion to Lindley's paradox (Lindley, 1957): given a large enough sample size (i.e., very high statistical power), NHST will yield a significant effect even when the difference itself is of no practical value (Rouder et al., 2009). In fact, in situations of very high statistical power, a p-value close to the significance threshold (e.g., p = 0.045) can be more likely under the null hypothesis than under the alternative hypothesis (Maier & Lakens, 2022). All of this is technically true, but it ignores the fact that alpha does not need to be fixed at 5%. The statement by the Editors (2022) ignores the Neyman-Pearson approach of balancing type 1 and type 2 errors: the alpha level could be lowered in situations where negligible effects could be detected, thereby balancing the type 1 and type 2 error rates (Maier & Lakens, 2022). Even if the null were never exactly true (which we believe is an unjustified claim), secondary equivalence testing could be utilized to prevent small effects from being declared "significant" when they are practically equivalent to zero (Campbell & Gustafson, 2018). The related statement in the Editorial (2022) that "All trials should therefore identify an effect" (Table 1) is simply not justifiable in any case that we can envision. It is often unclear whether the Editorial (2022) is referring to an effect measured by a statistical model/test (which can always be wrong for a variety of reasons) or a "real" effect, which can never be truly known in empirical work.

1 This section of Table 1 of the Editorial (2022) could also reflect a misunderstanding of the replication crisis which, while tangentially related to p-values, is largely believed to be due to systematic publication practices and the behavior of researchers.

2 Bizarrely, the Editors (2022) double down on this assertion in their response to Lakens (2022), who makes a strong case for the null at least sometimes being true. In their response, the Editors simply state that their assertion is "self-evident". For this peculiar claim, we believe Hitchens's razor is an apt response: "What can be asserted without evidence can also be dismissed without evidence."
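The large-sample point can be made concrete with a minimal sketch. Assuming a normal approximation for a two-group mean difference and entirely hypothetical numbers (a trivially small true difference of 1 point, standard deviation of 40), the same negligible effect is far from significant at a typical sample size but overwhelmingly "significant" at an enormous one:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p(diff, sd, n_per_group):
    """Two-sided p-value for a two-group mean difference,
    using a normal approximation (z-test)."""
    se = sd * sqrt(2 / n_per_group)        # SE of the mean difference
    z = abs(diff) / se
    return 2 * (1 - NormalDist().cdf(z))

# Hypothetical: a 1-point difference (SD = 40), of no practical value
p_typical = two_sample_p(1.0, 40.0, 50)        # n = 50 per group: p ~ 0.90
p_enormous = two_sample_p(1.0, 40.0, 500_000)  # n = 500,000 per group: p ~ 0
```

Nothing about the effect changed between the two calls, only the sample size, which is precisely why a significant p-value alone says nothing about practical relevance.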
Finally, there is a bit of irony here: while the Editorial (2022) states that "it is possible to put a confidence interval around any statistic, regardless of its use, including mean difference, risk, odds, relative risk, odds ratio, hazard ratio, correlation, proportion, absolute risk reduction, relative risk reduction, number needed to treat, sensitivity, specificity, likelihood ratios, diagnostic odds ratios, and difference in medians", it omits the fact that SWE or Minimal Clinically Important Difference (MCID) metrics can, and should, also be reported with confidence intervals. These estimates of "clinical relevance" are subject to the same sampling error as an estimate of a treatment effect.

Estimates with Uncertainty
A failure to recognize the empirical ambiguity in the SWE/MCID metric is a fatal flaw in the Editorial (2022), as its primary thesis and remediation for the supposed ills of NHST is to examine the overlap between effect estimates and the SWE/MCID. While the SWE/MCID can be a useful concept, it is not an immutable ground truth. We have no problem with establishing some SWE/MCID threshold for an individual study, but care has to be taken in the usage and interpretation of such thresholds. Many researchers, including one author of this manuscript, have noted that there are a multitude of issues with SWE/MCID measures reported in the literature. Additionally, evidence from previous attempts to abandon NHST (i.e., "magnitude based inference" or "MBI") indicates that researchers are more likely to adopt standard thresholds3 rather than develop empirically based MCIDs. All of these issues with MCIDs/SWEs preclude the immediate use of the "estimation" method suggested in the Editorial (2022) and should give some pause when establishing or utilizing an SWE/MCID.

3 The review by Lohse et al. (2020) indicates that when researchers used MBI, which requires setting a SWE, they always defaulted to 0.2 standard deviations of a difference. This, like a significance cutoff of 0.05, is an arbitrary threshold.

Potential Issues with MCID
1. Not all measures have an SWE or MCID in the literature, something the Editorial (2022) overtly recognizes; therefore, this approach cannot be universally applied to all research questions.
2. There is no consensus-accepted calculation for SWE or MCID metrics. By our count, there are at least nine ways these have been derived in the literature (Ferreira, 2018), and the MCID/SWE may vary depending upon the method used.
3. The vast majority, or nearly all, of the SWE/MCID metrics reported in the physiotherapy literature do not meet the criteria for SWE conventions set out by Ferreira (2018), which is the very SWE manuscript the Editorial (2022) cites in support of SWE/MCID use.
4. The common univariate MCIDs are also biased by regression to the mean. More work improving SWE/MCID estimation would need to be done prior to recommending their use.

The Editorial (2022) states, "If the estimate and the ends of its confidence interval are all more favorable than the smallest worthwhile effect, then the treatment effect can be interpreted as typically considered worthwhile". However, this "smallest worthwhile effect" (SWE/MCID) is being treated as some sort of immutable ground truth. In fact, whether developed via Ferreira's (2018) conventions or any other method, an empirically derived SWE/MCID is, by its very nature, derived from a sample of the population and thus has a confidence interval around its point estimate. Due to these issues, we believe physiotherapy researchers would be justified in rejecting the Editorial's suggestion of universally applying MCID thresholds and instead using NHST with a nil hypothesis. We believe, and others appear to agree (Lakens, 2022), that the premature requirement of testing against an MCID threshold is a counterproductive practice.5

Further, if we ignore points 1-4 on the previous list and pretend the SWE/MCID is only another estimate to compare against, do we have a path forward as the Editorial (2022) suggests? One could:

1. Obtain the treatment effect estimate and its standard error (SE1).
2. Obtain the SWE/MCID estimate and its standard error (SE2).
3. Calculate the difference between the two estimates, d.
4. Calculate the z-score: z = d / sqrt(SE1^2 + SE2^2).
5. The z-score can then be used to test the null hypothesis that, in the population, the difference, d, is zero by referencing the calculated z-score against the normal distribution z-table found in the appendices of many statistics textbooks.
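If such a procedure were adopted, the steps above could be sketched as follows. This is only an illustration: the function name and the numbers are hypothetical, and it assumes the two estimates are independent and approximately normal.

```python
from math import sqrt
from statistics import NormalDist

def z_test_difference(effect, se_effect, swe, se_swe):
    """Two-sided z-test of the null hypothesis that a treatment effect
    estimate and an SWE/MCID estimate differ by zero in the population.
    Assumes independent, approximately normal estimates."""
    d = effect - swe
    se_d = sqrt(se_effect**2 + se_swe**2)  # SEs combine in quadrature
    z = d / se_d
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers: treatment effect of 15 (SE 4) vs. an SWE of 12 (SE 3)
z, p = z_test_difference(15, 4, 12, 3)  # z = 0.6
```

With these hypothetical inputs, the test cannot distinguish the treatment effect from the SWE itself, underscoring that the SWE is an estimate with its own sampling error, not a fixed line in the sand.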

Alternative Hypothesis Tests
The Editorial (2022) sets out to tell researchers that p-values should not be used, yet the only method that makes their proposed NHST alternative statistically valid is, in fact, a p-value. If one can detach from some of the misleading statements in the Editorial (2022), the idea that researchers should think more critically about their research questions and analyses is an excellent suggestion. In fact, if we are willing to accept that SWE/MCIDs are not immutable facts but rather "reasonably good thresholds in certain circumstances", similar to an alpha level of 0.05, there exists an NHST-based framework that accomplishes the goal of comparing sample estimates against a "clinically meaningful bound": superiority, equivalence, non-inferiority, and minimal effects hypothesis tests (Caldwell & Cheuvront, 2019; Mazzolari et al., 2022). Therefore, many of the goals outlined in the Editorial (2022) could very well be accomplished with NHST and p-values.

Vignette on Conditional Equivalence Testing
For this vignette we will revisit a study on glucocorticoid steroid injections for knee osteoarthritis (Deyle et al., 2020), which we believe is an example that physiotherapists will find relevant. In the study (Deyle et al., 2020), patients with osteoarthritis were assigned to glucocorticoid injections (experimental group; GLU) or physical therapy (concurrent control; CON), and outcomes included the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) at 1 year (scores range from 0 to 240). In this case, we may want to perform a simple t-test on the mean difference, where the null hypothesis is zero, and perform two one-sided tests (TOST) to test for equivalence. These tests conceptually examine whether the treatment groups are statistically different and whether the treatment groups are statistically "the same". This type of test can be accomplished in almost any statistical program (e.g., R, SPSS, SAS, jamovi, JASP, or Stata); however, an author of this comment (ARC) has specifically created functions for this purpose in the TOSTER R package and jamovi module. Deyle et al. (2020) state in the article that a difference of 12 units on the WOMAC scale between GLU and CON was considered the SWE, so we can set the equivalence bounds to this value.6 There are many subjective and objective methods of setting an equivalence bound (Lakens et al., 2018), and researchers should be careful in describing why and how they set their equivalence bounds.

6 Some researchers may use some type of SWE/MCID to set the equivalence bounds but, as we mentioned above, even these empirically derived equivalence bounds are subject to sampling error.
The results presented by Deyle et al. (2020) are clear, showing an estimated treatment effect of 18.8 points, 95% C.I. [5.0, 32.6], p = 0.008. From these we can see that the NHST interpretation, at an alpha level of 0.05, would reject the null hypothesis of zero effect. However, we can also perform an equivalence test, using TOST, with the equivalence bounds set at 12 units. Such an analysis would yield a p-value of approximately 0.83. Therefore, we would reject the null hypothesis of no effect but retain the null of non-equivalence. Essentially, we could conclude that there is an effect and that its magnitude is non-negligible. From a clinical perspective, these statistics would indicate that the use of GLU over CON would likely lead to worse outcomes for osteoarthritis patients. Details on how to perform this analysis can be found in the appendix.
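As a rough check on these numbers, the published point estimate and 95% CI are enough to recover the standard error and reproduce both p-values under a normal approximation. This is only a sketch: the TOSTER package's t-distribution-based procedure will differ slightly, and everything here is derived solely from the reported summary statistics.

```python
from statistics import NormalDist

nd = NormalDist()
est = 18.8                   # reported treatment effect (WOMAC points)
ci_low, ci_high = 5.0, 32.6  # reported 95% CI
swe = 12.0                   # equivalence bounds of +/- 12 units

# Recover the SE from the CI width (normal approximation)
se = (ci_high - ci_low) / (2 * nd.inv_cdf(0.975))

# NHST against a nil (zero) effect, two-sided
p_nil = 2 * (1 - nd.cdf(abs(est) / se))  # ~ 0.008

# TOST: two one-sided tests against the +/- 12 bounds
p_lower = 1 - nd.cdf((est + swe) / se)   # H0: effect <= -12
p_upper = nd.cdf((est - swe) / se)       # H0: effect >= +12
p_tost = max(p_lower, p_upper)           # ~ 0.83
```

The same two conclusions follow: reject the nil null hypothesis, but retain the null of non-equivalence.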

Conclusions
We are sad to see yet another example of scientists making claims about statistics that are beyond their expertise. The unfortunate reality is that authoritative papers such as the Editorial (2022) can do real damage to the field of physiotherapy. First, the incorrect information provided in the Editorial (2022) will undoubtedly mislead physiotherapy researchers towards worse statistical practices by instilling misinformed beliefs about NHST and continuing the trend of believing an MCID/SWE is an immutable threshold. We fear that the mindless implementation of another threshold, much like the perfunctory use of the 0.05 significance threshold (Hopewell et al., 2009), will fail to improve the quality of research and only create a new form of publication bias. Similar frameworks, such as MBI, did not improve statistical practice among sport scientists, and when broadly implemented can cause more harm than good (e.g., adopting a new threshold of 0.2 standard deviations as the SWE/MCID).
Second, editors stating their preferred statistical methods implicitly coerces authors submitting to those journals into performing and reporting statistical analyses they do not find useful.
Misguided commentaries from editorial boards are nothing new within academic publishing (Mayo, 2021).

Appendix

We can also provide a plot of the estimates with multiple confidence intervals:

plot(test1)

Figure 1. A visualization of the cumulative distribution function with 4 levels of confidence displayed for the standardized mean difference (top panel) and the mean difference (bottom panel).

The interpretation provided above takes a Neyman-Pearson perspective. Both the NHST and TOST tests have an alpha level of 0.05, and one reached significance while the other did not.
Therefore, an author using this approach would have to conclude that the null hypothesis of no effect is rejected while the null hypothesis of non-equivalence is retained.
However, those who wish to use an estimation approach may have a different interpretation. Under the approach outlined by Rafi and Greenland (2020), we could instead look at the data and see how "compatible" they are with each competing hypothesis (i.e., NHST versus TOST). From this perspective, the interpretation is much more fluid, and one could conclude that the data are more incompatible with "no effect" than with "equivalence" (p-values of 0.008 and 0.83, respectively).
Both perspectives are valid, and it is up to researchers to decide how they plan to test or estimate their effects. Again, it is our belief that researchers, not editorial boards, are usually the better judges of which statistical framework is best for their research questions. However, researchers should be consistent with whatever language/framework (e.g., estimation or NHST) they choose within each study/manuscript.