February 17, 2016


Similarly, Malcolm Gladwell-ish experiments can often be rescued after the fact by comparing multiple effects across subdivisions of the sample. Because you need only a single result that would happen just 5 percent of the time by chance, crunching your data twenty different ways gives you a better-than-even shot at statistical significance (about 64 percent if the twenty analyses were independent).
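
The arithmetic behind that claim can be sketched in a few lines of Python. This is only an illustration, and it assumes the analyses are statistically independent, which subdivisions of a single sample generally are not; the function name `chance_of_false_positive` is made up for the example.

```python
def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of num_analyses independent tests
    comes in below alpha when there is no real effect at all."""
    # Each test stays "clean" with probability (1 - alpha); all of them
    # staying clean at once is that value raised to num_analyses.
    return 1 - (1 - alpha) ** num_analyses


for k in (1, 5, 14, 20):
    print(f"{k:2d} analyses: {chance_of_false_positive(k):.0%}")
```

Under the independence assumption, even odds arrive at about fourteen analyses, not twenty.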

One way to think of the Replication and Repetition Crises is as emanating from opposite abuses of degrees of freedom. That cool-sounding phrase from early-20th-century statistics has been adopted over the years by mechanical engineering, rocket science, and robotics, although its statistical definition (“the number of values in the final calculation of a statistic that are free to vary”) remains notoriously frustrating for statistics instructors to get across verbally.
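
A small numeric illustration may help where words fail. The sketch below uses made-up numbers: once the mean of five values is fixed, four of the deviations from it can be anything, but the fifth is forced. This is the usual textbook reason the sample variance divides by n - 1 rather than n.

```python
import math
import statistics

sample = [4.0, 7.0, 1.0, 9.0, 4.0]  # made-up data for illustration
n = len(sample)
mean = statistics.fmean(sample)

# Deviations from the sample mean always sum to zero ...
deviations = [x - mean for x in sample]

# ... so knowing any n - 1 of them pins down the last one.
last_from_others = -sum(deviations[:-1])

# Only n - 1 deviations are "free to vary", hence the n - 1 divisor.
variance = sum(d * d for d in deviations) / (n - 1)

print(math.isclose(last_from_others, deviations[-1]))
print(math.isclose(variance, statistics.variance(sample)))
```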

The term “degrees of freedom” was popularized by Ronald A. Fisher in the 1920s based on a 1908 paper published under the pseudonym “Student” by a quality-control expert at the Guinness brewery in Dublin. William Sealy Gosset was among the first to think rigorously about how much a statistical analyst’s confidence in his own conclusions ought to be reduced by the limited sample sizes he was forced to work with.

An influential 2011 paper on the Replication Crisis by Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn offered the term “researcher degrees of freedom” as a critique of the growing ability of researchers to slice and dice their way to statistically significant but temporary or even nonexistent correlations:

[I]t is unacceptably easy to publish “statistically significant” evidence consistent with any hypothesis.

The culprit is a construct we refer to as researcher degrees of freedom. In the course of collecting and analyzing data, researchers have many decisions to make: Should more data be collected? Should some observations be excluded? Which conditions should be combined and which ones compared? Which control variables should be considered? Should specific measures be combined or transformed or both?

It is rare, and sometimes impractical, for researchers to make all these decisions beforehand. Rather, it is common (and accepted practice) for researchers to explore various analytic alternatives, to search for a combination that yields “statistical significance,” and to then report only what “worked.” The problem, of course, is that the likelihood of at least one (of many) analyses producing a falsely positive finding at the 5% level is necessarily greater than 5%.
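
The closing point of that passage can be checked with a rough simulation. The sketch below is not from the paper; it is a simplified setup in which every "analysis" compares two groups drawn from the same population, so any significant result is a false positive by construction. Each analysis also uses fresh random data, which is an assumption made for simplicity, since real reanalyses reuse one data set. The function names, group size, and study count are all invented for the example.

```python
import random
from statistics import NormalDist, fmean


def p_value_null_comparison(rng, n=20):
    """Two-sided z-test p-value for two groups drawn from the SAME
    standard normal population, so any apparent effect is noise."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # The difference of two means of n unit-variance draws has variance 2/n.
    z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_studies_with_a_hit(analyses, studies=1000, alpha=0.05, seed=0):
    """Fraction of simulated studies in which at least one of the
    analyses reaches p < alpha despite there being no real effect."""
    rng = random.Random(seed)
    hits = sum(
        any(p_value_null_comparison(rng) < alpha for _ in range(analyses))
        for _ in range(studies)
    )
    return hits / studies


for k in (1, 5, 20):
    print(k, "analyses per study:", share_of_studies_with_a_hit(k))
```

With one analysis per study the share of false positives should sit near the nominal 5%, and it should climb steeply as the researcher is allowed more tries.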

This term, “researcher degrees of freedom,” is even more useful if we recognize that just as analysts can overfit models that therefore won’t be replicable, they can also underfit by not being allowed adequate intellectual degrees of freedom to offer “controversial” explanations, driving them into endless repetitions of aging mantras about racism and sexism. The issue for Student was that data were expensive while potential explanatory factors were cheap. Today, the mirror image often reigns: Data are readily available, but honest explanatory factors can cost you your job.

Too many researcher degrees of freedom permit trickery, but too few cause stupidity.

