[Here’s a link to a PDF version that retains the equations from the editor: Playing with Power–Severity Testing as an Anal-lytic Strategy]

As a young anus, this one changed interests from science to philosophy because it had questions science couldn’t answer, like how does all this knowledge hang together in a general sort of way. Then it encountered the Gettier problem. Good god: maybe Jones just had no business believing the company president because CEOs lie for a living when it serves the company’s needs; it might as well be in the job description. Or maybe one can know with all the best available information and still turn out to be wrong; lord knows it happens in research often enough (maybe knowledge is always *probable*). Happily, the anus made it through those tough times and is now back where it started, in science, and in this capacity it uses a lot of statistics in practical research. So it was definitely interested when it came across *Error and the Growth of Experimental Knowledge*, as it definitely errs from time to time but hopes to learn and grow from the experience. This book led the anus to the technical paper which constitutes the core idea of severity testing.

In severity testing (Mayo and Spanos, 2006), one tries to determine how well a null hypothesis can be affirmed. That is, given a statistically insignificant result, one wants to know how confidently one can conclude there is no effect, given the data one has. And more than this: in severity testing one also tries to determine how much this confidence is warranted by ruling out small deviations from no effect. In other words, it determines how well the test could detect a small discrepancy from the null, if such a small effect were present to be detected. If the ability to detect this effect is high and no small discrepancy is found, then one can be confident that the null is true because even small deviations from it are unlikely, given the data one has. Conversely, if severity is low, one should be cautious in affirming the null because deviations not detectable by the test could exist. With this two-pronged approach, severity testing is a post-hoc data analytic strategy that allows one to affirm a null hypothesis more or less confidently. It can also be used when the null is rejected, but that is a separate case of no direct interest here.

Affirming the null hypothesis in any case is a slippery business fraught with risk, liable to lead to confusion and error. On both points, severity testing does not disappoint. For it is both confused and liable to lead one to error. Why this is so should become evident in describing how it proceeds.

Severity testing works as follows. Fail to reject the null. Construct a small interval from the null (the authors use both 12.1 and 12.2 from 12.0), one, perhaps, of substantive interest. Compute the severity probability assuming that the true parameter value is *not* in the interval constructed (u>u1). This probability represents the chances that the parameter estimate found in the data could be larger than it is, given the data one has. If that probability is high, one can reject the assumption that the true parameter falls *outside* the interval of interest, thus affirming the alternative that it is less than the upper boundary of the interval—that is, if there is an effect, that effect is likely very small, smaller than the one of interest. One can then ‘affirm the null’ with even more confidence than before because a larger deviation than one of interest has been ruled out. The converse follows if the severity probability is low, and both alternatives are justified, according to the authors, because 1) the original statistically insignificant result agrees with the hypothesis of no effect, and 2) the power of the test to detect the small effect is high. Severity, as a probability, is both a measure of that power and a test to reject (or not) an effect size that that power could detect. Hence severity is christened as “attained” or “actual” power, even as it tests the inference that a small discrepancy might exist, and that the true parameter value lies somewhere within the upper bound of that interval.

It should be pointed out that the logic of this procedure, such as it is, amounts to establishing a second, ‘nested’ null hypothesis within the original null hypothesis of no effect, even as it works, as the authors insist, under the auspices of the original statistical null. Since the inference it wants to test is ‘the discrepancy from the null is less than the upper bound of the interval of interest,’ it proceeds analogously to a standard statistical test by nullifying the inference of interest, i.e. it converts it to a null hypothesis: “the discrepancy exceeds the upper bound of the interval”. The probability that the parameter estimate could be larger than the one derived from the data is then tested against this assumption, a comparison that allows severity testing to use the standard formulas for null hypothesis testing to evaluate the wisdom of affirming the null in the original hypothesis test. Additionally, consistent with the logic of a power determination, severity also measures the probability of detecting a small effect, to wit, the discrepancy of interest, given the test and data one has. This dual logic of hypothesis testing and power is borne out in the computation of severity as much as it is brought in to inform it.

Specifically, the computation of severity proceeds by finding the probability that one will find a value greater than the difference of two Z scores, the first representing the value of the parameter estimate from the data and the second representing the discrepancy of interest from the null. Computationally this is P(Z > Zx - Zi), where Zx = (x - u0)/SE represents the point estimate in the data and Zi = (u1 - u0)/SE represents the interval. Equivalently, one could write P(Z > Zs), where Zs represents the Z score for severity (it is important to leave the computation in this form rather than P(Z > (x - u1)/SE), for reasons that will become clear in what follows). Ostensibly, the probability yielded by this formula is the probability of finding a value for the parameter estimate as great as or greater than the value found, given that the true parameter value falls outside the interval range. At the same time, given the parameter estimate one has, it is supposed to be the power to detect an effect within this interval of interest, thus incorporating the “power” to detect an effect in the first place. As the authors note: SEV(u <= u1) = P(d(X) > d(x0); u <= u1 false) = P(d(X) > d(x0); u > u1). Thus, consistent with a null hypothesis test, severity testing yields the probability of getting a Z score greater than one that incorporates the parameter estimate found (i.e. P(Z > Zs)), while consistent with a power determination it accounts for an interval of interest between two Z scores, one of the test statistic (x0) and another of a critical value (the upper bound of the discrepancy; P(Z > Zx - Zi)), thereby determining the power to detect effects. In this way, severity both determines the power of the test to detect an effect as small as or smaller than the discrepancy of interest and tests the inference that the parameter lies inside this interval.
So with the dual-purpose severity score, one can affirm the original statistical null with more (or less) warrant, confident (or not) that no discrepancy exists because it would (or would not) have been detected.
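
To make the arithmetic concrete, here is a minimal sketch of the severity computation in the P(Z > Zx - Zi) form described above. The null of 12.0 and the discrepancies of 12.1 and 12.2 come from the authors' example as quoted here; the standard error of 0.2 and the sample mean of 12.3 are assumptions chosen only for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def upper_tail(z):
    """Return P(Z > z) for a standard normal Z."""
    return 1.0 - STD_NORMAL.cdf(z)


def severity(xbar, mu0, mu1, se):
    """Severity for the inference 'mu <= mu1', written as P(Z > Zx - Zi)."""
    z_x = (xbar - mu0) / se  # Z score of the observed estimate
    z_i = (mu1 - mu0) / se   # Z score of the discrepancy of interest
    return upper_tail(z_x - z_i)


# Assumed illustrative values: se and xbar are NOT taken from the source.
mu0, se, xbar = 12.0, 0.2, 12.3
for mu1 in (12.1, 12.2):
    print(mu1, round(severity(xbar, mu0, mu1, se), 4))
```

With these assumed inputs the two severity values come out at roughly 0.16 and 0.31. Since Zx - Zi reduces algebraically to (x - u1)/SE, the null value u0 drops out of the result entirely.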

Except it can’t do any of this. That is, severity testing cannot be both a) a test for the probability of getting an effect larger than the one found in the data (i.e. a hypothesis test) and b) a test to see if that effect can be detected (i.e. a power test), because to do so commits an error so basic that no one has thought to coin a name for it, namely, simultaneously affirming and denying the null hypothesis. As a hypothesis test, severity as the test of an inference about the null (i.e. “the discrepancy (from u0) is less than y”, as part of the original statistical null) requires *assuming* that the null is true in order to generate a probability that data more extreme might occur. However, as a determination of power (“achieved” or “actual” power), severity testing also requires a *denial* of the null in order to determine the chances of detecting an effect, given that the effect exists. These two aspects of severity testing are irreconcilable because they are irreconcilable in any single statistical test. Equivalently, the probability in a single test cannot stand for both power and the chances of getting a more extreme value in a hypothesis test, because one requires the affirmation of the null and the other requires its denial. In short, severity testing is either a hypothesis test or a determination of power. It cannot simultaneously be both.

But severity testing does compute a probability, one might point out. What remains to be seen is a probability of what. It is relatively easy to see that algebraic manipulation allows the computation of a probability that looks like it is related to an interval “analogous” to the interval used to compute power (as the authors note, severity = P(Z > Zx - Zi) and power = P(Z > Zc - Zx)), and the authors try to capture this interval using severity = P(Z > (x - u1)/SE). But severity as a probability yields neither the probability of finding an effect of any size, nor the probability that a parameter estimate will fall into the discrepancy of interest. It is nothing more or less than the probability of getting a value for the parameter estimate greater than the one yielded by the data (x0), evaluated relative to the upper bound of the interval of interest as the ‘new null hypothesis’. That this is the case follows trivially from the standard formula for a null hypothesis test as it would apply in the authors’ stock example or their Ms. Rosy variant (SEV = P(Z > Zx - Zi) = P(Z > Zs) = P(Z > Zx - Zo)), where per a standard hypothesis test Zx = (x - u0)/SE and Zo = 0, meaning there is no interval ((u0 - u0)/SE). Simply insert 12.2 or 12.1 respectively as the value for the null (u0) in the standard test and one can compute “severity”. In effect, then, though not in intent, severity testing amounts to a second null hypothesis test evaluating the data relative to the new null hypothesis of u1, the upper bound of the interval of interest. Only in this computation the p-value is imputed with the power to detect the small interval it is testing, and is therefore called *severity*. But the power imputation is just statistical nonsense, and only the p-value from the second hypothesis test remains.
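
The claim that severity is just a second p-value can be checked numerically. The sketch below computes severity in the P(Z > Zx - Zi) form and, separately, the upper-tail p-value of an ordinary one-sided z-test in which the null value has been replaced by u1. The standard error (0.2) and sample mean (12.3) are assumed for illustration and are not taken from the authors' paper; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p(xbar, null_value, se):
    """Upper-tail p-value of a standard one-sided z-test against null_value."""
    return 1.0 - STD_NORMAL.cdf((xbar - null_value) / se)


def severity(xbar, mu0, mu1, se):
    """Severity for 'mu <= mu1' in the form P(Z > Zx - Zi)."""
    z_x = (xbar - mu0) / se
    z_i = (mu1 - mu0) / se
    return 1.0 - STD_NORMAL.cdf(z_x - z_i)


# Assumed illustrative values: se and xbar are NOT taken from the source.
mu0, se, xbar = 12.0, 0.2, 12.3
for mu1 in (12.1, 12.2):
    sev = severity(xbar, mu0, mu1, se)
    p_new_null = one_sided_p(xbar, mu1, se)  # same test, null moved to mu1
    print(mu1, round(sev, 4), round(p_new_null, 4))
```

The two columns agree for any inputs, because Zx - Zi is algebraically (x - u1)/SE, which is exactly the test statistic of the standard test with u1 as its null.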

If that is not problematic enough, the situation for severity testing may be even worse than this. That is, besides simultaneously affirming and denying the null and imputing meaningless power to a probability, i.e. a second p-value, the probability that is severity contains no more information than the p-value from the original statistical test, and it is arguably far more confusing. In fact, to compute the corresponding p-value from a severity score (or vice versa), one need only ‘put the interval back in’ (or ‘pull it out’) using p = P(Z > Zs + Zi) (or severity = P(Z > Zp - Zi)), with Zi as the Z score for the size of the discrepancy of interest, Zs as the Z score corresponding to the severity probability, and Zp as the Z score for the p-value. So in their Ms. Rosy example, for which a p-value is not offered but an interval and sample mean are, p = .067. One need not do the severity computation at all. Knowing the p-value from the original statistical test, one need not even run a severity test with its mind-numbing logic of ‘testing’ an inverted negative through nulls within nulls based on inequalities within inequalities. One can completely determine severity from the p-value and go from there. While in itself not necessarily dispositive, this one-to-one correspondence between p-values and severity across tests for any standard error and sample size does indicate that severity is no more informative about effect sizes than is a garden variety p-value. For an interesting discussion that this is true of p-values (and by extension of severity) under the null and why it may not necessarily be so when the null is rejected, the reader is referred to Hung, O’Neill, Bauer, and Kohne (1997). In their analysis they show how one might examine p-values for information about effect sizes relative to sample size in cases where the alternative hypothesis is true, i.e. where the null is rejected and an effect size is determinable.
Severity testing falls far short of their approach, however, mainly because (among other things) it simultaneously affirms and rejects the null in its analysis. But even in the best of cases it only computes a second p-value, one that under the null is as insensitive to effect sizes as the original. As such, despite its aspirations, it is not a post-hoc data analysis based on any meta-statistical principle of interest; it is at best a re-asking of the original research question, namely: what would the p-value be if I ran a different test with a different null using this same data—or alternatively, what parameter estimate might I get with this same test under a different null but with the same p-value.[1] Neither question is of any interest to the practical researcher, nor do the answers illuminate the truth or falsity of the original statistical null. In other words, severity allows no more interpretation of the data than is warranted by the p-value from the original statistical test. Severity fails as a post-hoc, data dependent test.
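
The one-to-one correspondence between the original p-value and severity, p = P(Z > Zs + Zi) and severity = P(Z > Zp - Zi), can be sketched as a pair of conversion functions. The inputs below (standard error 0.2, sample mean 12.3, discrepancy bound 12.2) are assumptions chosen so that the original p-value lands near the .067 quoted above; they are not confirmed values from the authors' example, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail(z):
    """Return P(Z > z)."""
    return 1.0 - STD_NORMAL.cdf(z)


def z_for_upper_tail(prob):
    """Return the z with P(Z > z) = prob."""
    return STD_NORMAL.inv_cdf(1.0 - prob)


def severity_from_p(p, z_i):
    """'Pull the interval out': severity = P(Z > Zp - Zi)."""
    return upper_tail(z_for_upper_tail(p) - z_i)


def p_from_severity(sev, z_i):
    """'Put the interval back in': p = P(Z > Zs + Zi)."""
    return upper_tail(z_for_upper_tail(sev) + z_i)


# Assumed illustrative values, NOT taken from the source.
mu0, mu1, se, xbar = 12.0, 12.2, 0.2, 12.3
z_i = (mu1 - mu0) / se
p = upper_tail((xbar - mu0) / se)   # original one-sided p-value
sev = severity_from_p(p, z_i)       # severity recovered from p alone
p_back = p_from_severity(sev, z_i)  # and back again
print(round(p, 3), round(sev, 3), round(p_back, 3))
```

Nothing beyond the p-value and the size of the interval in standard-error units is needed to move in either direction, which is the sense in which severity carries no information the p-value lacks.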

On this note, since severity testing fails, is any test that affirms the null warrantable? One intriguing approach has been tried. In what can be called “equivalence testing,” Schuirmann (1987) has shown that one can reasonably reject the hypothesis ‘the parameter falls outside a discrepancy of interest from the null’ with the same degree of confidence that one would reject a null that there is no effect at α if the 1-2α confidence interval occurs entirely within this discrepancy and is clustered tightly around the null parameter value in a highly powered test with low standard error. Intuitively, this makes good sense, and it represents a formal development of the underlying intuition guiding “Severe Testing as a Basic Concept in Neyman-Pearson Philosophy of Induction”. The reader is therefore referred to Schuirmann’s paper for a method that reasonably warrants “affirming the null.” It is not to be found here.
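
For comparison, the equivalence-testing idea can be sketched as two one-sided tests together with the matching 1-2α confidence-interval check. This is a simplified z-test version with a known standard error, which is an assumption made here for brevity and not necessarily the form used in Schuirmann's paper. The equivalence margin `delta`, the inputs, and the function name `tost_z` are all hypothetical.

```python
from statistics import NormalDist


def tost_z(xbar, se, mu0, delta, alpha=0.05):
    """Two one-sided z-tests for 'mu lies within mu0 +/- delta'.

    Returns (p_lower, p_upper, ci, equivalent).
    """
    z = NormalDist()
    lower, upper = mu0 - delta, mu0 + delta
    # Test 1: null 'mu <= lower'; a small p supports mu > lower.
    p_lower = 1.0 - z.cdf((xbar - lower) / se)
    # Test 2: null 'mu >= upper'; a small p supports mu < upper.
    p_upper = z.cdf((xbar - upper) / se)
    # The (1 - 2*alpha) confidence interval used in the equivalent check.
    half_width = z.inv_cdf(1.0 - alpha) * se
    ci = (xbar - half_width, xbar + half_width)
    # Equivalence is declared only if BOTH one-sided nulls are rejected.
    equivalent = max(p_lower, p_upper) < alpha
    return p_lower, p_upper, ci, equivalent


# Hypothetical inputs: a tight estimate near the null, and a looser one.
for xbar, se in ((12.05, 0.05), (12.3, 0.2)):
    p_lo, p_up, ci, ok = tost_z(xbar, se, mu0=12.0, delta=0.2)
    print(xbar, se, round(p_lo, 4), round(p_up, 4), ok)
```

With these made-up numbers the tight estimate is declared equivalent and the looser one is not. Unlike the severity computation, both one-sided nulls here are stated in advance and each is tested in the ordinary way.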

______________________________________________

[1] This second computation is essentially what severity does when it sets severity to a desired level and determines the upper bound of the maximum discrepancy one can warrantably affirm from the original statistical null, as Spanos (2008) does in illustrating its use and [sic] power.

For reference, the original articles: Mayo and Spanos (2006) and Spanos (2008)