Love the idea of preregistration. Knowing that a study failed could be as important as knowing it succeeded!
Absolutely!
Severe kudos for how well researched and thought out this article is. I shared it with several friends who work in psychology. I’m a subbie for life!
Perhaps the most surprising (and concerning) thing about reading medical literature is how few studies share any raw data. So many seem to give only averages. I understand the data would have to be anonymised, and perhaps there are other barriers I'm not aware of, such as proprietary restrictions?
Physics has more of a culture of sharing raw data, but even there I was quite concerned by how few studies actually shared theirs.
It is much more difficult to hide poor scientific practice with raw data. The Data Colada blog series on Francesca Gino's raw data is a good example.
Agreed. If medicine is anything like psychology, the reason for not sharing the raw data is pretty straightforward: there's no incentive to. Journals rarely ask researchers to, and people rarely look at it, so it's just more work to put the data in a presentable format, write up an explanation of the data, etc.
The incentives need to be aligned, which means, again, attaching prestige to studies that share data and having prestigious journals require it.
That is a good point and makes a lot of sense, Tommy - though I would add that when I was submitting raw data as a physics researcher, it was as raw as it is possible to be. In fact, a lot of the time it would literally be a .raw file! Not presentable, no explanation attached, just a data dump into a repository with a link at the end of the paper. Unlikely to be looked at, but it was more of a full-transparency, nothing-to-hide gesture. That said, I was surprised how often I would end up using a dataset from some 15-year-old paper nobody had cited - plenty of PhD students out there putting off writing their theses!
I do like the idea that the incentives need to be aligned. Thinking out loud: how would one attach prestige to studies that preregister and share data? Would it be something like a two-pronged approach - holding journals to account through something like a Retraction Watch (https://retractionwatch.com/) league table, plus research into which practices actually produce papers that are more reproducible and higher quality? I'm making the assumption raw data would help - but I don't actually know...
Meandering thoughts aside, good article Tommy - it was a pleasure reading it!
The biggest area of leverage for aligning incentives IMO sits with editors. Make it journal policy that studies need to make their data publicly available (some journals already do this, so there's precedent, and as you point out it doesn't need to be onerous). Preregistration is a bit harder since it is more of a structural change, but if the big journals and their editors at least put the pieces in place (making a big deal about submitting pre-registrations to them, clearly marking papers that were pre-registered), the community could align around pre-registered studies holding more weight and things would shift in that direction. There's already some movement here, e.g. https://www.nature.com/nature/for-authors/registered-reports; we just have to push much harder in that direction.
You make a lot of sense, and based on your arguments, I would be inclined to agree that the onus lies with the editors. You can count me in on the push to make preregistration and raw data hold more weight. I'm not entirely sure what that would entail, but I'm in! Thanks for the conversation, Tommy. I really enjoyed it. Happy holidays!
When we were doing landing-page testing for marketing, which generally has high participation but also massive amounts of noise, we would find people stopping tests when the version they liked happened to be winning.
We referred to it as “statistical convenience”
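That kind of optional stopping quietly inflates false positives. As a rough illustration (a hypothetical simulation, not anything from this thread): even when two variants are identical, stopping the first time a peek shows p < .05 makes "significant" results far more likely than the nominal 5%.

```python
# Illustrative simulation of optional stopping: two identical landing-page
# variants, with a peek at the data after every batch of visitors and the
# test stopped as soon as a peek looks "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, peeks, n_per_peek = 2000, 10, 200

false_positives = 0
for _ in range(n_sims):
    a, b = np.empty(0), np.empty(0)
    for _ in range(peeks):
        a = np.append(a, rng.normal(size=n_per_peek))  # variant A, no true effect
        b = np.append(b, rng.normal(size=n_per_peek))  # variant B, no true effect
        if stats.ttest_ind(a, b).pvalue < 0.05:        # "my version is winning!"
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_sims:.1%}")
# A fixed-sample test would hold this at ~5%; repeated peeking pushes it well above that.
```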
Preregistration is a good idea, but it's also true that sometimes the data can answer a question you didn't ask. In which case, I suppose that is worth another string of experiments.
I'm all for using the data you have to try to answer additional questions. This can be a great way to generate new theories and hypotheses and is more economical than running a new experiment for every new idea -- but these should be explicitly labeled as exploratory analyses and the results held as more tentative than pre-registered ones!
You can report unexpected findings even with a preregistration; the thing to do would just be to make clear that this isn't something you preregistered.
Nice one Tommy! In ecology, we had our own major drama over fabrication a few years ago. Search Jonathan Pruitt. He seemed equally unready to take the blame despite dozens of retractions after detailed investigation into his papers, and the fallout for all his coauthors was huge. It even got the name “Pruittgate”.
Preregistration has been called for for a while. I like the idea. Probably works best for clinical stuff or highly controlled experiments though.
Great article. Quantitative research is fraught with paradoxes. Likert scales: are they ordinal or interval-scaled? Do I put a number in front of each answer option or not? Display the options horizontally or vertically? Word the questions negatively or positively? The degrees of freedom are endless…
Tommy, I frequently work in the field of management and leadership and have been exposed on numerous occasions to the academic research there. It feels like an absolute wild west of questionable claims, research practices, and irreplicability. I haven't gone to the depth of research you have on psych research, but when I read the papers, so many of them just don't 'feel' right. There are so many contributing, subjective variables behind how someone performs at work or what effects 'leadership' has, and they frequently aren't duly acknowledged or accounted for, so often I don't understand how the authors reached the conclusions they did.
Leadership is especially hard, since so much of it is about influencing a dependent variable - performance - that there is no universally agreed definition of.
By the way, this is worth checking out if you're interested in replication crisis stuff:
https://www.speakandregret.michaelinzlicht.com/p/revisiting-stereotype-threat
...As is abundantly clear: replication issues are ongoing, and major findings are still crumbling.
Et tu, stereotype threat?
Thank you for the depth you provided. I have read a similar article about this somewhere. I am not schooled in this area, but I have always gravitated towards the "scientific spectrum", and as a lay person I place more value on a scientific result than on just an "opinion" or "belief" of someone. I have also noticed how science gets adjusted over time as the things we can comprehend or access become "better" or different. Personally, I think we are all better off because of the people who work on things such as this. Also, when a person is presented with a scientific conclusion, most people do not consider what goes into reaching that conclusion. I know that there are very strict and stringent frameworks for experiments, and if we can account for every possibility to create a solid conclusion, we must do so. If only because of the reach that comes from publication, we cannot be sure what bedrock a conclusion will become part of - meaning we cannot know in advance how it will be used to build other conclusions. It is a very difficult field. Mind boggling in a way.
I work in psychology, so this article addresses issues in my field. I started my PhD in 2015, a few years into the maelstrom, when the replication crisis was the big topic. Things seem to have quieted down since then. A few changes have been introduced, but I haven't seen the kinds of revolutionary shifts in the field I'd have hoped for.
If anything, I regard the situation as far more bleak. Even if we set aside fraud cases, which are significant but not as big of a contributor to the problem of low replication rates as p-hacking and other bad research practices, there are still deeper issues. Here are a few:
(1) There is a lack of any substantive, unifying theories grounding a great deal of research. One might say (and there are papers on this) that we have a Theory Crisis.
(2) Many studies have extremely poor generalizability. Call this the Generalizability Crisis (see: https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/generalizability-crisis/AD386115BA539A759ACB3093760F4824)
(3) There are problems of measurement and validity. So we may say we have a "validity crisis" (see: https://replicationindex.com/2019/02/16/the-validation-crisis-in-psychology/).
My own work focuses on problems of measurement and validity in the instruments used to assess how nonphilosophers think about philosophical questions, primarily moral realism and antirealism. I believe available evidence hints that the measures we use for these purposes are generally invalid. So, even if they replicated, it would be moot: they aren't measuring what people think they are, and cannot serve the empirical purposes they're put to.
This highlights a more insidious problem than low replication rates: studies can be uninformative or misleading even if they replicate.
The XKCD cartoon was perfect, and I thought the term Hopium was a brilliant name for an addictive factor in research. It seems that the actually fraudulent research papers should be immortalized in some way as cautionary tales so when something is asserted people have a healthy dose of skepticism. Like a vaccine...
Well researched, synthesised and presented, Tommy.
There are plenty of ways to 'paint the target around the arrow', and some good ways to prevent it.
There is some awareness and progress in the field. When I did my Master's thesis (pre-COVID), some professors talked about and were convincing regarding the marvels of pre-registering our studies... but without a universally recognized platform and clear standards, given the laboriousness of the registration process, and given that this was only my second "serious" publication, where I might well make some kind of design mistake I'd legitimately need to fix (before even having analyzed any data), the option was simply too unattractive. Just extra work for questionable benefit and very humble bragging rights at best. Also, depending on the scale and type of study, to my understanding back then you could in theory run your whole study and then register it afterwards, so it's not even a 100% fool-proof system and could still be abused by bad actors (or at least I believed so back then; maybe it has matured since).
However, I still went to some lengths to make sure it was good, like a power/sample-size calculation to figure out how many people I needed to recruit to detect an effect of size X with at least probability Y. Nowadays that type of calculator even comes pre-packaged: https://www.ai-therapy.com/psychology-statistics/power-calculator
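For what it's worth, that kind of a-priori calculation also takes only a few lines in Python. A minimal sketch, with illustrative numbers (Cohen's d = 0.5, alpha = .05, power = .80 are assumptions for the example, not values from the thesis described above):

```python
# A-priori power calculation for a two-group comparison, using statsmodels
# instead of the linked online calculator.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,         # expected Cohen's d (illustrative assumption)
    alpha=0.05,              # two-sided significance level
    power=0.80,              # desired probability of detecting the effect
    alternative="two-sided",
)
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 64
```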
It was also hilarious how utterly crippled my year (and the ones after me) were in our ability to set up a decent study, thanks to an ethics board devoted to Teletubby-level social signalling. If psychology as a field put even 20% of that ethics effort into having people who know their stuff pre-judge whether your study makes statistical sense (and whether it was pre-registered before you ran it), instead of wasting it on bullshit-level nit-picking, we'd probably make some actual scientific advances again.
The pendulum has swung from the Milgram experiment all the way to "barely allowed to ask adults anonymously in an online questionnaire how satisfied they feel with their life lest they become sad, unless I pay a 3-4 digit euro sum to some ethics board that will be perfectly happy to find a reason to deny me my study anyway for utter and complete bullshit reasons".
Imagine getting called in by some HR bimbo twice removed from social reality who proceeds to nit-pick through your entire work life to then settle on the criticism that one of your shoe-laces is a slightly different color of cream white than the other and that one of them was clearly replaced. You blink and ask if anyone has ever complained about that and they stare at you with a straight face and say: "No, I'm the only one who noticed or cares."
I sort of agree… there has been fraud, there has always been fraud, and there always will be fraud. The bigger problem is the over-reliance on arbitrary and mindless rules… like p < .05, and the under-reliance on good science (i.e., replication). The irony is that back when Fisher proposed p < .05 as a decent cut-off, he meant that rejecting the null hypothesis was a good reason for trying TO replicate the result (or so I have read). Similarly, I am not the biggest fan of over-relying on preregistration because it sort of circumvents serendipity. A quick glance at the history of science will reveal the importance of dumb blind luck. Science is not engineering, with neat, predictable results. We need both the messiness of observing the unexpected and highly controlled experimentation for confirmation. Anything that limits either of these slows us down. That’s just my opinion.
There are plenty of ways to share serendipitous discoveries, including in the write-ups of pre-registered studies (you just explicitly mention that these were exploratory analyses that weren't pre-registered). Case reports, reviews, and papers that simply don't claim to be testing a rigorous theoretical prediction all can and do still exist alongside pre-registered studies.
I understand the logic. I am just old enough to know that nothing is without cost. The file drawer problem was and is a real problem. Anything that justifies not publishing findings (null results or non-experimental methods), or that discourages risky studies, slows knowledge. More rules don't tend to fix the problem. Replications do. So, the question is: how do we increase the number of replications?
I'm not sure the number of replications matters if the methods for producing them involve so many researcher degrees of freedom that people could find whatever they wanted. Baumeister claims ego depletion is the most replicated phenomenon in psychology, but plenty of people have failed to replicate it, and most in the field I know of don't believe it's a real effect.
Isn’t that the way science works (or is supposed to)? One researcher observes something and makes a theoretical argument, which holds until someone pokes a hole in it by discovering a situation in which it doesn’t. Our problem is that we treat failure to replicate as proof that the first observation was wrong instead of treating it as an opportunity to learn more about the original observation. Remember that the open science movement really exploded after the big multi-study replication projects.
But if the "observation" is just spurious due to bad statistical methodology, and the alternative theory is just "researcher degrees of freedom caused false positives", and that happens with a majority of theories, I think it's worth using better statistical methodologies (like pre-registration) so the field isn't spending so much time chasing ghosts.
I suspect that we are just going to have to disagree on this. I like chasing ghosts.
Worst case scenario, I say we start a Substack called “Null.”
Seriously, though, I started my career in healthcare research, and now I work with researchers to help promote their work. I often think and talk about the idea that a null finding is still a finding, but you’re right that the incentives to elevate that work just aren’t there. Thanks for this extremely thorough write-up about these issues!