Why So Much Psychology Research is Wrong
The replication crisis, fraud, and questionable research practices
Psychology research has a problem.
That problem is the replication crisis. A good portion of psychology research can't be reproduced, casting doubt on a wide swath of psychological findings.
Psych 101 students (myself included) have long been taught that exposing people to words about the elderly makes them walk more slowly, and priming studies made up chapter 4 of the best-selling Thinking, Fast and Slow by Nobel Prize winner Daniel Kahneman. But the priming research has since been described as a train wreck. Many of the findings confidently taught to students or written up in popular books just aren't real.
Priming isn't the only area of psychology under scrutiny. In a high-powered replication effort of 28 classic psychology papers, only 50% replicated.
But it isn't just psychology. More worryingly, medicine is having its own crisis—researchers have tried and failed to replicate much of cancer biology research.
Fraud is bad, mmkay?
From talking to people about it, my impression is that those watching from the outside conflate two separate issues: outright fraud and the replication crisis.
There have been some high-profile cases of fraud and blatant research misconduct that have led to faulty research.
Michael LaCour published a study on how conversations with gay individuals changed minds on the issue of gay marriage. The study received a huge amount of popular press. Then some smart grad students showed definitively that LaCour had made up all the data. Again there was a huge amount of press, this time about the fraud.
The journal retracted the paper, LaCour lost an offer for a tenure-track professorship, and now every time you google "Michael LaCour" everything that comes up is about the study he faked. He had to leave academia and even legally changed his name, but hasn't been able to escape the shame of what he's done (just kidding, if you look at the depths of this guy's lies, he's clearly shameless).
More recently, there has been a ton of controversy over fraud in Dan Ariely's research. Ariely is a massive figure in psychology and has published a bunch of popular press books, including one about dishonesty. To the delight of media outlets everywhere, one of the studies found to use fabricated data was itself about dishonesty. There have also been credible concerns about other studies Ariely has been involved with, casting a shadow over his research.
Fraud in psychology research is bad, but certainly doesn't compare to the damage done by fraudulent research in medicine.
It's suspected that Sylvain Lesné tampered with images in multiple papers supporting one of the major theories of what causes Alzheimer's. A landmark paper in Alzheimer's research was retracted due to fraudulent manipulation of images, but not until after it had shaped the research directions and funding decisions of the field for two decades.
An article by Andrew Wakefield in 1998 kicked off the "vaccine-autism" scare. This issue is still affecting public discourse despite the original article being fraudulent, Wakefield having undisclosed conflicts of interest (he was planning to launch a business that would sell vaccine-caused-autism-detection kits and an alternative to the MMR vaccine), and many follow-up studies showing no link between vaccines and autism.
Research fraud can cause real harm. The people who commit fraud should see major consequences and the research community should build more safeguards to prevent it. It also probably happens more often than people think (one paper estimates 9% of psychology researchers have committed fraud).
Research fraud is bad (look, I even made a meme about it so you understand the severity of it).
That said, it isn't the major cause of the reproducibility crisis.
The real problem: questionable research practices
There aren't many headlines about the boring reality of the real cause of the replication crisis: everyday shitty research practices.
The same paper that estimates 9% of psychology researchers have committed outright fraud also shows a majority admit to having used each of a long list of questionable research practices.
Unlike fraud, these questionable research practices aren't the result of a researcher deliberately trying to deceive. They come from the decisions researchers have to make during analysis, combined with the incentive structure of academic publishing.
To make this concrete, let's go through a study answering that age-old question: Does listening to The Wiggles make you feel old? (This example is inspired by this paper about questionable research practices).
So you set up the experiment. You recruit people to come into the lab to participate. You're not sure how many people you need, so you ask around and settle on 30.
Before the study participants listen to the music, you ask them how old they are feeling on a 5-point scale from "very young" to "very old". Then they listen to a song by the Wiggles, or a control song you'll compare to. After they listen to the song, you ask again how old they are feeling.
It takes three months to set up the experiment, recruit participants, and run the experiment.
Now you have to analyze the data.
Results are inherently probabilistic. Any individual person might have a change in their feeling of how old they are for reasons unrelated to The Wiggles. Maybe one person started feeling older because they found the chair they were sitting in uncomfortable and their back started hurting. Another was daydreaming about frolicking through a meadow and that made them feel younger. Someone else didn't understand the question but was too embarrassed to ask for clarification so they answered randomly.
This is why researchers use statistics. Statistics lets you compare the groups and quantify how likely it is that a difference this large would show up just from idiosyncratic factors like those above.

You want to compare people who listened to the Wiggles to those who didn't, and to know whether any difference between the groups is just due to chance.
The standard way of quantifying this is with a p-value, and every researcher knows that if you want to have a publishable result, you need a p-value below 0.05 (which itself raises a host of other problems we won't get into here). Results with a p-value below 0.05 are referred to as "statistically significant", and having this threshold determine what gets published ostensibly means having a false-positive rate (finding an effect when there isn't one) below 5%.
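To make that concrete, here's a minimal sketch of what the basic comparison might look like in code (Python with SciPy). The ratings below are invented for the hypothetical Wiggles study, purely for illustration.

```python
# Hypothetical post-song "how old do you feel?" ratings on the 1-5 scale,
# one list per group. These numbers are made up for illustration only.
from scipy import stats

wiggles = [4, 3, 5, 4, 4, 3, 5, 4, 2, 4, 3, 4, 5, 3, 4]
control = [3, 2, 4, 3, 3, 2, 4, 3, 3, 2, 3, 4, 2, 3, 3]

# Two-sample t-test: how surprising would a group difference this large be
# if listening to the Wiggles actually had no effect?
t_stat, p_value = stats.ttest_ind(wiggles, control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # "significant" if p < 0.05
```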
Given this threshold, a naïve estimate would be that about one in twenty studies might report a statistically significant result that isn't really there. But attempts to replicate studies seem to show a much higher rate of false results.
What's going on?
Researcher degrees of freedom
Let's go back to the lab with your Wiggles data. As you're analyzing the data, you have a bunch of choices to make:
Are you comparing the answer on the 5-point scale from "very young" to "very old" after listening to music, or the change in their answer from before to after?
You collected some demographic information (gender and age). Do you use statistical tools to control for age, gender, or both?
You used two different control songs to compare to. Do you combine the two control songs into one condition, or separate into two? Do you even use both control songs?
For each of these, you might not have a strong reason for preferring one to another. So you arbitrarily choose one way to do it. If you get what looks like a nice big robust effect from listening to the Wiggles, you happily move on. If you don't, you worry your arbitrary decisions are affecting the outcome—so you go back and tinker with them. You made the choices arbitrarily originally, so what's the harm in changing those arbitrary decisions?
You believe the effect should be there, so when it isn't, it's easy to convince yourself it's just a problem with one of the decisions you made about how to look at the data.
Let's say you've now chosen the way to analyze the data, but you still have a problem: p = 0.064. Just slightly too high to be considered statistically significant. You can't publish that. What do you do? You could walk away—throw away months worth of work. But you only recruited 30 participants, and that was a number you pulled out of a hat. Why not just get a few more participants to rescue the experiment?
You recruit a few more, checking the data every few participants to see whether you've slid under that 0.05 mark yet. Once you do, you stop running the experiment.
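Here's a rough simulation of what that kind of peeking does, under deliberately simple assumptions (no real effect at all, a t-test as the comparison, and a check after every five added participants per group); none of the numbers come from a real study.

```python
# Simulate "optional stopping": there is NO real effect, but we keep adding
# participants and re-testing until p < 0.05 (or we run out of patience).
# All parameters here are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Start with 15 per group; add 5 per group at a time, up to 30 per group.
    group_a = list(rng.normal(0, 1, 15))
    group_b = list(rng.normal(0, 1, 15))
    while True:
        p = stats.ttest_ind(group_a, group_b).pvalue
        if p < 0.05:              # "significant" -- stop and write it up
            false_positives += 1
            break
        if len(group_a) >= 30:    # give up on this experiment
            break
        group_a.extend(rng.normal(0, 1, 5))
        group_b.extend(rng.normal(0, 1, 5))

print(f"False positive rate: {false_positives / n_experiments:.1%}")
# Comes out noticeably above the nominal 5%, even with only a few peeks.
```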
These little decisions are sometimes called researcher degrees of freedom (a joke playing on the concept of statistical degrees of freedom). Researchers have a bunch of different knobs they can play with when analyzing (or collecting) data.
The trouble is, statistics is inherently probabilistic, and each time you look at the data a new way, you're asking a slightly different statistical question that might give a statistically significant result due to chance. It's like flipping a coin to see who gets the bigger slice of cake, and when you lose, deciding the conditions weren't quite right for the coin flip so you have to try it again.
This isn't a small effect, especially if you combine multiple degrees of freedom. A researcher who combines the four degrees of freedom above raises their "false positive rate" (the chance of finding a statistically significant effect when there isn't one) from 5% to over 60%.
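Here's a simplified sketch of that mechanism. It only plays with two of the knobs above (which outcome measure, and which control condition), so the inflation it shows is more modest than the 60% figure, but the logic is the same: simulate data with no real effect, try every combination of analysis choices, and count a "finding" if any of them comes out significant. All of the numbers are illustrative assumptions.

```python
# No real effect exists in this simulated data; we just try several
# defensible-looking analyses and count a win if ANY of them hits p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 5_000
n = 30                      # participants per group
false_positives = 0

for _ in range(n_experiments):
    # Pre- and post-song "feeling old" scores, all drawn from the same
    # distribution, so any apparent group difference is pure noise.
    wig_pre, wig_post = rng.normal(3, 1, (2, n))
    c1_pre, c1_post = rng.normal(3, 1, (2, n))
    c2_pre, c2_post = rng.normal(3, 1, (2, n))

    # Degree of freedom 1: post-song score, or pre-to-post change?
    outcomes = [
        (wig_post, c1_post, c2_post),
        (wig_post - wig_pre, c1_post - c1_pre, c2_post - c2_pre),
    ]

    p_values = []
    for wiggles, ctrl1, ctrl2 in outcomes:
        # Degree of freedom 2: which control condition?
        p_values.append(stats.ttest_ind(wiggles, ctrl1).pvalue)
        p_values.append(stats.ttest_ind(wiggles, ctrl2).pvalue)
        p_values.append(
            stats.ttest_ind(wiggles, np.concatenate([ctrl1, ctrl2])).pvalue
        )

    if min(p_values) < 0.05:    # report whichever analysis "worked"
        false_positives += 1

print(f"False positive rate: {false_positives / n_experiments:.1%}")
# Well above 5%, even though each individual test looks perfectly legitimate.
```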
Many researchers are (perhaps willfully) ignorant of these issues. Studies have shown most psychology researchers engage in these sorts of questionable research practices and see them as defensible. Anecdotally, as a grad student I heard other students and postdocs talk openly about using some of these practices. Even the professor leading my ethics class shrugged off concerns about questionable research practices, making a comment about how you're not supposed to do these things, "but everyone does."
Questionable research practices aren't the whole story, either. A major problem in psychology is demand characteristics, where participants figure out what the study is about and (perhaps subconsciously) alter their responses to better match what the study is "expecting". These effects can be difficult to root out, but subtle expectations of the participants (or experimenters) can have a big impact on participant behavior, causing "effects" that only exist in the lab, not in the real world.
And finally, if a researcher ends up without a statistically significant result, they generally won't publish it. Putting together a manuscript and going through the review process is a lot of work, and most "null results" (those that aren't statistically significant) won't be accepted by any journal that's prestigious enough to be worth publishing in. This is called the file drawer effect: the same experiment might have been carried out in 20 labs, and if 19 found no effect, it's the one lab that did that gets published in the scientific record while the others stay in the researchers' "file drawers" (XKCD did a humorous take on this years ago).
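As a toy illustration of that arithmetic (made-up numbers, not tied to any real literature): with a 5% false-positive rate per lab and no real effect, the chance that at least one of 20 labs lands a publishable fluke is 1 - 0.95^20, roughly 64%.

```python
# Toy sketch of the file drawer effect: 20 labs run the same experiment on
# an effect that doesn't exist, and only "significant" results get written up.
# With a 5% false-positive rate per lab, the chance that at least one of the
# 20 gets a publishable fluke is 1 - 0.95**20, roughly 64%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
published, filed_away = 0, 0

for _ in range(20):
    treatment = rng.normal(0, 1, 30)   # no real difference between groups
    control = rng.normal(0, 1, 30)
    if stats.ttest_ind(treatment, control).pvalue < 0.05:
        published += 1                 # enters the scientific record
    else:
        filed_away += 1                # stays in the file drawer

print(f"Published: {published}, filed away: {filed_away}")
# Readers of the literature only ever see the flukes; the nulls disappear.
```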
Hopium
None of this means that all psychological (or scientific) results are false. But it should give you a lot of pause before drawing confident conclusions from a single published result.
Multiple replications across different labs are required to have a lot of confidence in a result, and there are plenty of findings that have this.
But equally important, there's a huge opportunity here to improve how science is conducted. There is hope.
The trouble with researcher degrees of freedom comes from 1) researchers making decisions based on data they've already looked at and 2) publication being contingent on a "statistically significant" result.
Pre-registration provides a strong counterbalance to both of these problems.
Pre-registration means registering ahead of time a study's methodology, including the analysis to be performed. Ideally, a journal accepts a proposed paper based on the merits of the proposed experiment rather than the results.
Since academia is very prestige-oriented, if the most prestigious journals adopted these practices and/or researchers saw pre-registered studies as more prestigious, it would shift the culture towards these practices.
There are complications with this. Sometimes there is some skill involved in getting an experiment to "work": failing to find a result might be because you had a crappy research assistant who didn't play the Wiggles loud enough (or you didn't take precautions against having wiggly participants in the fMRI scanner, so you got noisy data). But if we saw the experiments that didn't produce significant results more regularly, we would be able to properly assess what it means when someone doesn't find an effect. We should expect some "null results" even for real effects, and some "false positives" even when no effect is really there, and assess accordingly.
There are less extreme alternatives, like adding a checklist of questions to screen for standard questionable research practices during the review process. In my mind, these aren't mutually exclusive options: pre-registration should be more prestigious, but we should at minimum ask about questionable research practices during review.
A world where the default was pre-registered studies (combined with explicitly labeled "exploratory" analyses that weren't pre-planned but generate hypotheses for future pre-registered studies) would be a world with far fewer studies that fail to replicate.