Kenneth J Rothman, Rothman Responds to “Surprise!”, American Journal of Epidemiology, Volume 190, Issue 2, February 2021, Pages 194–195, https://doi.org/10.1093/aje/kwaa137
In this issue of the Journal, Cole et al. (1) present the case for the S value, or the surprise index, a metric that is proposed to aid in or possibly substitute for the interpretation of P values. It should come as no surprise that we need something to help us interpret P values. Many scientists, even those highly trained in statistics, all too often use P values as a statistical machete, slicing their way through studies to dichotomize them all as null or nonnull. Statistical significance testing degrades data that may have been laboriously collected and rich in detail with dichotomous labels that oversimplify to an extreme, and often mislead (2).
Unfortunately, statistical significance testing is still considered the state of the art of statistical analysis by legions of scientists. The long trail of criticism aimed at significance testing (3) has so far had a limited effect on everyday practice, although there have been recent rumblings that raise hope for needed change (4). The problem with the reliance on statistical testing is 2-fold: first, the dichotomization of findings alluded to above (which Greenland has termed “dichotomania” (5)), and second, the routine misinterpretation of the P value as the probability that the null hypothesis is true (2). The latter interpretation can have meaning only as a Bayesian proposition, yet even so it remains a common viewpoint.
Misinterpretation of the P value is a key motivation for replacing it with the S value, so I will share a teaching example that I have found useful for conveying that a P value cannot shed light on the truth of the null hypothesis: This is a true story that makes clear why the incorrect interpretation is wrong. I was playing the memory game on a 4 × 3 layout, so there were 6 pairs of matched cards all face down and placed without my knowledge (there is an explanation for why I was playing the game with only 6 pairs, but it isn’t relevant here). My opponent said, “Why don’t you go first?” I did, turned up 2 cards that matched, and so continued. I turned up another matched pair, and so on through all 6 pairs, the last of which had to match. According to the null hypothesis that this outcome was blind luck, the P value is 1/(11 × 9 × 7 × 5 × 3) = 0.0001. I would wager that most hypothesis-testers would describe such a “highly significant” P value as an indicator that the null hypothesis is almost surely wrong.
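The arithmetic behind the memory-game P value can be sketched in a few lines; the function name and structure here are mine, not the article's, and this is a minimal illustration of the chance calculation, not a general game simulator.

```python
from functools import reduce

def memory_game_p_value(n_pairs: int) -> float:
    """Probability of turning up every pair on the first try by blind luck.

    After the first card of each pair is revealed, its match must be found
    among the remaining face-down cards: 2n - 1 equally likely choices for
    the first pair, 2n - 3 for the second, and so on down to 1 for the last
    pair, which must match.
    """
    odds = range(2 * n_pairs - 1, 0, -2)  # for n = 6: 11, 9, 7, 5, 3, 1
    return 1 / reduce(lambda a, b: a * b, odds)

# The 6-pair game described in the text:
p = memory_game_p_value(6)  # 1/(11 * 9 * 7 * 5 * 3), roughly 0.0001
print(f"P = {p:.6f}")
```

The final factor of 1 reflects the last pair, which is guaranteed to match, so it does not change the product.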
But what is the alternative hypothesis to chance? I would have known if I had cheated—and I didn’t—which leaves only 1 possibility: I am psychic. If the probability that chance is the correct explanation is 0.0001, then according to the way many people interpret a P value, it is 99.99% certain that I am psychic. But they’re wrong. I’m not psychic. The outcome was just blind luck, and though the P value was only 0.0001, the null hypothesis is without question true. The take-home message is this: The P value does not tell you whether the null hypothesis is true or not. All it tells you, assuming that the null hypothesis and related assumptions are true, is how compatible the data are with that hypothesis. In fact, you can use the P value as a measure of compatibility with any hypothesis, not just the null, which produces the much more useful P value function.
If a P value is interpreted as an index of compatibility between the null hypothesis and the data, and if the pressure to dichotomize it is resisted, it remains a helpful statistic. Even better is the full P value function, graphing the compatibility between the data and a continuous range of hypotheses, rather than just the null hypothesis. The P value function is equivalent to a graph of all possible confidence intervals, which may explain why Cole et al. have encouraged replacing the term “confidence interval” with “compatibility interval” (1).
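The P value function described above can be sketched numerically. The article gives no formula, so this is an illustrative construction under a normal approximation for a point estimate with a standard error; the function and parameter names are hypothetical.

```python
import math

def p_value_function(estimate: float, se: float, hypotheses):
    """Two-sided P value for each hypothesized parameter value, under a
    normal approximation. Each P measures compatibility between the data
    and that hypothesis; the curve peaks at 1.0 at the point estimate,
    and the hypotheses with P above a given level form the corresponding
    confidence (compatibility) interval.
    """
    def two_sided_p(h: float) -> float:
        z = abs(estimate - h) / se
        return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))
    return [(h, two_sided_p(h)) for h in hypotheses]

# e.g., a log risk ratio estimated at 0.5 with standard error 0.25:
curve = p_value_function(0.5, 0.25, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Plotting such a curve over a fine grid of hypotheses displays every confidence interval at once, which is the sense in which the P value function is equivalent to a graph of all possible confidence intervals.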
So why the need to add the surprise index, if the P value can be resurrected as a compatibility index? The S value provides several advantages. First, it converts compatibility—or rather its inverse, the lack of compatibility—to an intuitive model of successive coin flips. As Cole et al. described it, “A key point here is that the S value maps directly onto a standard game of coin-tossing, providing the highly heterogeneous set of human observers with an easily taught reference system” (1, p. 193). It is easy to imagine how unusual it would be to flip heads on a fair coin toss 6 or 7, or n, consecutive times without an intervening tail. For the example of the memory game described above, the S value is greater than 13, a huge surprise value.
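The coin-flip mapping is just a base-2 logarithm: the S value is the number of consecutive heads on a fair coin that would be as surprising as the observed P value. A minimal sketch, with the memory-game figure from the text as a check:

```python
import math

def s_value(p: float) -> float:
    """Surprisal in bits: -log2(P). An S value of n means the data are as
    surprising, under the tested hypothesis, as flipping n consecutive
    heads with a fair coin."""
    return -math.log2(p)

p_game = 1 / (11 * 9 * 7 * 5 * 3)  # memory-game P value from the text
print(s_value(p_game))             # about 13.3, i.e., "greater than 13"
print(s_value(0.05))               # about 4.3: like 4 to 5 heads in a row
```

The conventional 0.05 threshold thus corresponds to a surprise no greater than a short run of heads, which is part of the intuitive appeal of the scale.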
The second advantage of the S value is that it doesn’t come burdened with the legacy of dichotomization. As yet, there are no standard cutoffs for a surprise value. Third, the P value is measured on the probability scale, and is readily misinterpreted as a betting odds measure regarding the truth of the null hypothesis, or whatever other hypothesis is being tested. The S value avoids this problem.
Are there drawbacks to using the S value? S values increase as P values decrease, so they require a reverse perspective. That shouldn’t pose any difficulty, although there is some appeal to the idea of a compatibility index that is greatest for the hypothesis with which the data are most compatible. Another drawback is that S values, like P values, combine strength of relationship with precision in a single measure. Of greater concern is the difficulty of getting scientists to accustom themselves to a new paradigm for interpreting data. Based on the experience of converting a country to metric units, a dual approach may be needed for a time, until S values become second nature. On the other hand, in the United States, where metrication has been difficult, scientists readily adopted metric units even though the public at large has not, so it seems that we could do this. If the S value moves us away from focusing on significance testing, it is worth a try.
ACKNOWLEDGMENTS
Author affiliations: RTI Health Solutions, Research Triangle Institute, Research Triangle Park, North Carolina (Kenneth J. Rothman); and Department of Epidemiology, School of Public Health, Boston University, Boston, Massachusetts (Kenneth J. Rothman).
Conflict of interest: none declared.