“Absence of proof is not proof of absence.”

—William Cowper, English poet.

CLINICAL studies frequently test whether one treatment or intervention is superior to another. When a test for superiority is statistically significant, we happily conclude that one method is better than the other, especially if it is in the expected direction! But what can it mean if the test is not significant? Can we conclude that the treatments are “similar” or “equivalent”? Actually, no, because although it could be that no clinically meaningful population difference exists, it is also possible that one does exist—but either the study was underpowered to detect it (sample size was too small) or we just got unlucky (false negative result). So from a nonsignificant test for superiority, we can really conclude only that no population difference was detected, and not equivalence, even if the observed means are very similar!

Fortunately, accepted methods of assessing and claiming equivalence between two randomized interventions do exist. They require an *a priori* definition of clinical “equivalence” between interventions, in the form of limits within which treatments are considered to be effectively the same. Equivalence is then claimed if the observed confidence interval for the difference between groups falls within the *a priori* defined equivalency region. Perhaps more commonly, however, the goal is to demonstrate that a new treatment is “as good as” or “not worse than” a standard. In such cases equivalence is not expected and superiority not needed; a third alternative, called a “noninferiority design,” is best, because it allows one to claim noninferiority by refuting the null hypothesis that the preferred intervention is worse than the comparator.1,2

Suppose an intervention is known to have favorable intraoperative properties on certain key parameters but is suspected of adversely affecting other parameters. For example, Bala *et al.*3 were interested in demonstrating that dexmedetomidine (*vs.* placebo) did *not* have a clinically important effect on evoked potentials in patients undergoing complex spinal surgery, because the drug had other known benefits. Because no effect (*i.e.*, no difference between groups) was the desired outcome, testing for superiority or noninferiority would not have addressed the research question. Rather, an equivalency trial was conducted, in which a clinically acceptable difference between groups on the primary outcome was specified *a priori*, and testing was performed to assess whether the true difference was within the equivalence region.

In an equivalence trial with a continuous outcome, we test the *null* hypothesis (H_0) that the true difference between means is outside of the prespecified equivalency region, either below −δ or above +δ (*i.e.*, “not equivalent”), as H_0: M_T − M_S ≤ −δ *or* M_T − M_S ≥ +δ, where M_T and M_S are the population means for the *test* and *standard* treatments, respectively. The alternative hypothesis (H_1), which one wishes to conclude, is that the true difference lies between the specified limits (*i.e.*, “equivalent”), as H_1: M_T − M_S > −δ *and* M_T − M_S < +δ.

Equivalence is claimed *only* if the treatment difference is concluded to be both significantly above the lower limit (−δ) and significantly below the upper limit (+δ), using a traditional one-sided test against a constant for each of the two components of H_1 (usually *t* tests if the outcome is continuous). Equivalence testing is thus referred to as “two one-sided tests” (or TOST).4 If both tests are significant, the observed confidence interval (CI) for the difference will correspondingly fall within the equivalency region. It is then concluded that the true difference lies between −δ and +δ, as in the first equivalency example (E) in figure 1. For the second equivalency trial result (F) in figure 1, the confidence interval is not within ±δ, and so the conclusion is that equivalence at the specified δ value cannot be claimed. In equivalence testing, we are given a bonus: no correction to the significance criterion for multiple comparisons is needed when performing the two one-sided tests, because both must be significant (say, *P* less than 0.05) before equivalence can be claimed.
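To make the TOST procedure concrete, the following is a minimal Python sketch (assuming NumPy and SciPy are available; it is illustrative only, not software from the trials discussed here). It performs the two one-sided *t* tests against −δ and +δ and reports the (1 − 2α) confidence interval, which falls inside ±δ exactly when both tests are significant at level α:

```python
import numpy as np
from scipy import stats

def tost_equivalence(x, y, delta, alpha=0.05):
    """Two one-sided t tests (TOST) for equivalence of two means.

    H0: mean(x) - mean(y) <= -delta  or  mean(x) - mean(y) >= +delta
    H1: -delta < mean(x) - mean(y) < +delta  (equivalent)
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    d = x.mean() - y.mean()
    # pooled standard error (equal-variance two-sample t test)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    # test 1: H0 d <= -delta vs H1 d > -delta (reject for large t)
    p_lower = stats.t.sf((d + delta) / se, df)
    # test 2: H0 d >= +delta vs H1 d < +delta (reject for small t)
    p_upper = stats.t.cdf((d - delta) / se, df)
    # the CI matching two alpha-level one-sided tests is the (1 - 2*alpha) CI,
    # i.e., a 90% CI when alpha = 0.05
    tcrit = stats.t.ppf(1 - alpha, df)
    ci = (d - tcrit * se, d + tcrit * se)
    equivalent = (p_lower < alpha) and (p_upper < alpha)
    return d, ci, p_lower, p_upper, equivalent
```

Note that no multiplicity correction appears in the code: equivalence requires *both* one-sided *P* values to be below α, mirroring the “bonus” described above.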

Sometimes it is more natural to express the equivalence region in terms of a ratio of means than as an absolute difference. For example, with the primary outcome of opioid consumption, researchers might have difficulty choosing an absolute number for an equivalency δ. Instead, they might *a priori* hypothesize δ to be, say, 0.90 on the ratio scale. Equivalence would imply that the true mean consumption for the *test* intervention was between 90 and 111% (*i.e.*, 1/0.90) of the *standard* mean. When a ratio formulation is used, hypotheses are best specified on the log scale to create symmetric equivalency limits.5,6 For a binary (yes/no) outcome, δ can be an absolute difference between proportions, a relative risk, or an odds ratio.7
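The log-scale formulation can be sketched in the same way (again an illustrative Python example assuming NumPy and SciPy, not code from any cited study): taking logs turns the ratio limits 0.90 and 1/0.90 into symmetric limits ±log(1/0.90) around zero, after which the ordinary TOST applies:

```python
import numpy as np
from scipy import stats

def tost_ratio(x, y, ratio_margin=0.90, alpha=0.05):
    """TOST for equivalence of two means on the ratio scale.

    Equivalence is claimed if the true ratio of geometric means lies in
    [ratio_margin, 1/ratio_margin]; testing is done on the log scale,
    where the limits become the symmetric pair +/- log(1/ratio_margin).
    """
    lx, ly = np.log(np.asarray(x, float)), np.log(np.asarray(y, float))
    delta = -np.log(ratio_margin)          # symmetric log-scale limit
    nx, ny = len(lx), len(ly)
    d = lx.mean() - ly.mean()              # log of geometric mean ratio
    sp2 = ((nx - 1) * lx.var(ddof=1) + (ny - 1) * ly.var(ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    p_lower = stats.t.sf((d + delta) / se, df)
    p_upper = stats.t.cdf((d - delta) / se, df)
    # back-transform the (1 - 2*alpha) CI to the ratio scale
    tcrit = stats.t.ppf(1 - alpha, df)
    gmr = np.exp(d)
    ci = (np.exp(d - tcrit * se), np.exp(d + tcrit * se))
    return gmr, ci, max(p_lower, p_upper) < alpha
```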

Now suppose a new warming device can be made for one third the cost of an existing device, is known to have a better safety profile, or is much easier to use. Demonstrating superiority of the new device over the existing one on rate of rewarming or intraoperative temperature would be a luxury; it would suffice to show that the new device was at least not worse than the existing device (*i.e.*, noninferior) on the main efficacy measure. Noninferiority designs are useful when the goal is to show that a new treatment is at least as effective as the standard (*i.e.*, equivalent or superior), particularly when the new treatment is more favorable in other ways. They are essentially one-sided equivalency designs that test the null hypothesis that the preferred treatment is worse than the comparator by at least δ, against the alternative (which a significant *P* value would conclude) that the preferred treatment is “not more than δ worse than” or “at least as effective as” the comparator.2 When higher values of the outcome are favorable, the null and alternative hypotheses are the same as the first components of the H_0 and H_1 statements above, respectively, as H_0: M_T − M_S ≤ −δ *versus* H_1: M_T − M_S > −δ, and testing is the same as for the lower limit of an equivalence trial. When the noninferiority test is significant (*e.g.*, *P* < 0.05 for α = 0.05), the lower limit of the CI for the difference between means correspondingly lies above −δ, as in the first noninferiority example (C) in figure 1 (or below +δ if lower values are favorable). The second noninferiority example (D) in figure 1 shows a nonsignificant result because the lower limit is below −δ.
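Because noninferiority testing is just the lower half of the TOST, it reduces to a single one-sided *t* test against −δ. A minimal Python sketch (assuming NumPy and SciPy, for an outcome where higher values are favorable; hypothetical function names) might look like:

```python
import numpy as np
from scipy import stats

def noninferiority_test(new, standard, delta, alpha=0.05):
    """One-sided t test of H0: mean(new) - mean(standard) <= -delta
    against H1: mean(new) - mean(standard) > -delta (noninferior),
    for outcomes where higher values are favorable."""
    new, standard = np.asarray(new, float), np.asarray(standard, float)
    n1, n2 = len(new), len(standard)
    d = new.mean() - standard.mean()
    sp2 = ((n1 - 1) * new.var(ddof=1) + (n2 - 1) * standard.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    p = stats.t.sf((d + delta) / se, df)
    # one-sided lower confidence limit; p < alpha exactly when lower > -delta
    lower = d - stats.t.ppf(1 - alpha, df) * se
    return p, lower, p < alpha
```

As the comment notes, a significant *P* value and a lower confidence limit above −δ are two expressions of the same conclusion, which is why figure 1 can display the results as confidence intervals.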

For superiority designs, a two-sided superiority test significant in either direction corresponds to a CI that does not contain zero, as in the first example (A) in figure 1, where the test treatment is found to be superior to standard. The second superiority CI (B) contains zero, so the test must be nonsignificant, and we conclude that no difference was found. Equivalence cannot be claimed here because it was not tested and because no definition of equivalence is specified in the design of a superiority trial. Now notice that the second superiority CI (B) is identical to the first, significant noninferiority CI (C). This prompts a question: in a trial designed for superiority, can researchers test for noninferiority after a nonsignificant test of superiority? No; a nonsignificant superiority test ends the testing, because further testing would inflate the type I error for which the trial was designed. However, it would be acceptable to assess superiority in a noninferiority trial after noninferiority had been established, because a significant noninferiority result implies potential superiority. In other words, additional testing to refine a statistically significant result is appropriate, but changing hypotheses to find statistical significance is not.

Choosing the equivalency δ is an integral part of study design and is best based on both clinical and statistical grounds. δ should be small enough to be of little clinical consequence, well within the range of background variability, and smaller than differences expected in superiority trials of an active treatment *versus* placebo. Too large a δ risks a clinically misleading claim of equivalence, whereas too small a δ wastes sample size resources and makes claiming equivalence unnecessarily difficult.8,9

Sample size calculation for an equivalence or noninferiority design is the same as for a one-tailed superiority trial powered to detect a difference equal to the chosen equivalency δ. However, because the equivalency δ is usually smaller than the superiority difference, a larger sample size is often needed. Often the sample size is calculated by using the postulated equivalency δ and assuming that the population difference is truly zero. If there are prior data and good intuition suggesting that the underlying difference is nonzero, the sample size may be calculated assuming a nonzero effect. For a noninferiority trial in which the underlying difference favored the preferred treatment, the sample size would be decreased.
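These sample size relationships can be illustrated with the standard normal-approximation formulas (a sketch only; dedicated software using the *t* distribution will return slightly larger numbers, and the function names here are hypothetical). For equivalence under a true difference of zero, power must be split across the two one-sided tests, which is why the equivalence design needs more subjects than the corresponding noninferiority design:

```python
import math
from scipy.stats import norm

def n_per_group_noninferiority(sigma, delta, true_diff=0.0, alpha=0.05, power=0.9):
    """Approximate per-group n for a one-sided noninferiority test
    (normal approximation). true_diff is the assumed real mean difference
    (new - standard); a positive true_diff (new truly better) shrinks n."""
    za, zb = norm.ppf(1 - alpha), norm.ppf(power)
    return math.ceil(2 * sigma**2 * (za + zb)**2 / (delta + true_diff)**2)

def n_per_group_equivalence(sigma, delta, alpha=0.05, power=0.9):
    """Approximate per-group n for TOST equivalence, assuming the true
    difference is zero: the beta risk is split between the two one-sided
    tests, so the power quantile is taken at 1 - beta/2."""
    za, zb2 = norm.ppf(1 - alpha), norm.ppf(1 - (1 - power) / 2)
    return math.ceil(2 * sigma**2 * (za + zb2)**2 / delta**2)
```

For example, with σ = 10 and δ = 5, the equivalence design requires more subjects per group than the noninferiority design, and assuming a true difference favoring the preferred treatment reduces the noninferiority sample size, consistent with the text above.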

In reporting results, studies designed to assess equivalency and noninferiority10,11 should be clearly labeled as such. Choice of δ should be determined *a priori* and should be justified clinically; confidence intervals for the treatment difference should be presented in relation to δ.12 In addition, treatments should be labeled “comparable” or “equivalent” only if formal tests for equivalence were done. Incorporating these readily available and widely accepted methods for assessing equivalency and noninferiority will strengthen clinical trial design and reporting.

Edward J. Mascha, Ph.D.

Departments of Quantitative Health Sciences and Outcomes Research, Cleveland Clinic, Cleveland, Ohio. maschae@ccf.org