WHY PUBLISHED SIGNIFICANCE VALUES ARE (MOSTLY) LIES


Jim Wood

In my recent post advocating the abandonment of NHST (null-hypothesis significance testing), I skimmed over two important issues that I think need more elaboration: the multiple-test problem and publication bias (also called the “file-drawer” bias).  The two are deeply related and ought to make everyone profoundly uncomfortable about the true meaning of achieved significance levels reported in the scientific literature.

The multiple-test problem is wonderfully illustrated by this cartoon, xkcd's 'Significant', by the incomparable Randall Munroe.  The cartoon was the subject of Monday's post here on MT (so you can scroll down to see it), but I thought it worthy of further reflection.  (If you don’t know Munroe's cartoons, check ‘em out.  In combination they’re like a two-dimensional “Big Bang Theory”, only deeper.  I especially like “Frequentists vs. Bayesians”.)


xkcd: Frequentists vs. Bayesians

If you don’t get the "Significant" cartoon, you need to study up on the multiple-test problem.  But briefly stated, here it is:  You (in your charming innocence) are doing the Neyman-Pearson version of NHST.  As required, you preset an α value, i.e. the highest probability of making a Type I error (rejecting a true null hypothesis) that you’re willing to accept as compatible with a rejection of the null.  Like most people, you choose α = 0.05, which means that if your test were repeated 20 times on different samples or using different versions of the test/model, you’d expect to find about one “significant” test even when your null hypothesis is absolutely true.  Well, okay, most people seem to find 0.05 a sufficiently small value to proceed with the test and reject the null whenever p < 0.05 on a single test.  But what if you do multiple tests in the course of, for example, refining your model, examining confounding effects, or just exploring the data-set?  If, for example, you do 20 independent tests on different colors like the jellybean scientists, then there’s about a 64% chance (1 − 0.95^20 ≈ 0.64) of getting at least one result with p < 0.05 even if the null hypothesis is eternally true.  If you own up to the multiplicity of tests – or, indeed, if you’re even aware of the multiple-test problem – then there are various corrections you can apply to adjust your putative p values (the Bonferroni correction is undoubtedly the best-known and most widely used of these).  But if you don’t, and if you only publish the one “significant” result (see the cartoon), then you are, whether consciously or unconsciously, lying to your readers about your test results.  Green jellybeans cause acne!!  (Who knew?)
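
To put a number on that, here’s a little back-of-the-envelope calculation (a Python sketch I’m adding purely for illustration, not anything from the cartoon or the original post) showing how fast the chance of at least one spurious “significant” result grows as you pile on independent tests at α = 0.05:

```python
# How the family-wise error rate grows with the number of independent
# tests, each run at alpha = 0.05 on a null hypothesis that is true.
alpha = 0.05
for n_tests in (1, 5, 20, 100):
    # P(at least one p < alpha) = 1 - P(no test rejects)
    familywise_error = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>3} tests: P(at least one false positive) = {familywise_error:.2f}")

# 1 test -> 0.05, 5 tests -> 0.23, 20 tests -> 0.64, 100 tests -> 0.99
```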

If you do multiple tests and include all of them in your publication (a very rare practice), then even if you don’t do the Bonferroni correction (or whatever) your readers can.  But what if you do a whole bunch of tests and include only the “significant” result(s) without acknowledging the other tests?  Then you’re lying.  You’re publishing the apparent green jellybean effect without revealing that you tested 19 other colors, thus invalidating the p < 0.05 you achieved for the greenies.  Let me say it again: you’re lying, whether you know it or not.
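
For concreteness, here’s what that reader-side correction might look like.  The color names and p-values below are invented for illustration (only four of the twenty tests are shown); the point is simply that dividing α by the number of tests raises the bar the “green” result has to clear:

```python
# Hypothetical p-values from the jellybean experiment (four of the
# twenty color tests shown); the numbers are invented for illustration.
p_values = {"green": 0.021, "purple": 0.34, "brown": 0.72, "teal": 0.09}
n_tests = 20                          # total tests actually performed
alpha = 0.05
bonferroni_alpha = alpha / n_tests    # reject only if p < 0.0025

for color, p in p_values.items():
    verdict = "reject null" if p < bonferroni_alpha else "no evidence"
    print(f"{color:>6}: p = {p:.3f} -> {verdict}")

# With the correction, green's p = 0.021 no longer clears the bar,
# and the apparent green-jellybean effect evaporates.
```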

This problem is greatly compounded by publication bias.  Understandably wanting your paper to be cited and to have an impact, you submit only your significant results for publication.  Or your editor won’t accept a paper for review without significant results.  Or your reviewers find “negative” results uninteresting and unworthy of publication.  Then significant test results end up in print and non-significant ones are filtered out – thus the bias.  As a consequence, we have no idea how to interpret your putative p value, even if we buy into the NHST approach.  Are you lying?  Do you even know you’re lying?
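
To see how corrosive that filter is, here’s a deliberately extreme simulation (my own sketch, not taken from any of the studies mentioned below): every study tests an effect that doesn’t exist, and only the studies that happen to hit p < 0.05 make it out of the file drawer.

```python
import random

random.seed(1)

# 10,000 studies of an effect that does not exist, so the null is true
# in every one and each study's p-value is uniform on (0, 1).
n_studies = 10_000
alpha = 0.05
p_values = [random.random() for _ in range(n_studies)]

# Publication filter: significant results get written up,
# the rest go into the file drawer.
published = [p for p in p_values if p < alpha]

print(f"Studies run:       {n_studies}")
print(f"Studies published: {len(published)} (about {100 * len(published) / n_studies:.0f}%)")
print("Published results that are false positives: all of them")
```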

This problem is real and serious.  Many publications have explored how widespread the problem is, and their findings are not encouraging.  I haven’t done a thorough review of those publications, but a few results stick in my mind.  (If anyone can point me to the sources, I’d be grateful.)  One study (in psychology, I think) found that the probability of submitting significant test results for publication was about 75% whereas the probability of submitting non-significant results was about 5%.  (The non-significant results are, or used to be, stuck in a file cabinet, hence the alternative name for the bias.)  Another study of randomized clinical trials found that failure to achieve significance was the single most common reason for not writing up the results of completed trials.  Another study of journals in the behavioral and health sciences found that something like 85-90% of all published papers contain significant results (p < 0.05), which cannot even remotely reflect the reality of applied statistical testing.  Again, please don’t take these numbers too literally (and please don’t cite me as their source) since they’re popping out of my rather aged brainpan rather than from the original publications.

The multiple-test/publication-bias problem is increasingly being seen as a major crisis in several areas of science.  In research involving clinical trials, it’s making people think that many – perhaps most – reported results in epidemiology and pharmacology are bogus.  Ben Goldacre of “Bad Science” and “Bad Pharma” fame has been especially effective in making this point.  There are now several groups advocating the archiving of negative results for public access; see, for example, the Cochrane Collaboration.  But in my own field of biological anthropology, the problem is scarcely even acknowledged.  This is why N. T. Longford, the researcher cited in my previous post, called the scientific literature based on NHST a “junkyard” of unwarranted positive results.  Yet more reason to abandon NHST.
