The psychology/social psychology community has made replication a huge focus over the last year. One reason is the recent, public blow-up over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond simple lack of replication. In genomics, a similar scandal occurred due to what amounted to “data fudging”. Although, in the genomics case, much of the blame and focus has been on lack of reproducibility or data availability.
I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already more consistently performed in genomics. There are two forms for this replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group - with a different technology or a different data set. On the other hand, independent replication of results is usually performed by an outside laboratory.
Validation is by far the more common form of replication in genomics. In this article in Science, Ioannidis and Khoury point out that validation has different meaning depending on the subfield of genomics. In GWAS studies, it is now expected that every significant result will be validated in a second large cohort with genome-wide significance for the identified variants.
In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper where we define a new statistical sampling approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility, but also replicability of our results, and we need new statistical approaches for evaluating whether validation/replication have actually occurred.
I finally got around to reading the IOM report on translational omics and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous Duke trials and the growth in retractions in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things. Although I think bigger improvements will come as a result of retirements.
In general, I think the field of genomics (a label that is used quite broadly) is producing great discoveries and I strongly believe we are just getting started. But we can’t help but notice that retraction and questionable findings are particularly high in this field. In my view most of the problems we are currently suffering stem from the fact that a substantial number of the people with positions of power do not understand statistics and have no experience with computing. Nevin’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics savvy biologists I talked to.
Unfortunately Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the necessary statistical and computing skills to be considered experts in genomics become leaders of the field? I think this is due to the speed at which Biology changed from a data poor discipline to a data intensive ones. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid 90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).
Note that typically, statistics is not a requirement of the Biology graduate programs associated with genomics. At Hopkins neither of the two major programs (CMM and BCMB) require it. And this is expected, since outside of genomics one can do great Biology without quantitative skills and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of these decided to delve into this “sexy” new field and using their copious resources were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data analytic component of genomics papers were eager to publish papers in this field, a fact that further compounded the problem.
But I as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistics and computational expertise, while others formed strong partnerships with quantitative types. Here I should mentioned that for these partnerships to be successful the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology and technology savvy or they too can make mistakes that lead to retractions.
Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field are clearly more knowledgeable and appreciative about statistics and computing than the previous ones. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments and those that do not posses these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our Human Genetics program requires our Biostats 615-616 course.
Reproducibility has been a hot topic for the last several years among computational scientists. A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results can lead to delays in evaluating new procedures that affect patients’ health.
But just because a study is reproducible does not mean that it is replicable. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.)
Replicability is getting a lot of attention recently in psychology due to some high-profile studies that did not replicate. First, there was the highly-cited experiment that failed to replicate, leading to a show down between the author of the original experiment and the replicators. Now there is a psychology project that allows researchers to post the results of replications of experiments - whether they succeeded or failed. Finally, the Reproducibility Project, probably better termed the Replicability Project, seeks to replicate the results of every experiment in the journals Psychological Science, the Journal of Personality and Social Psychology,or the Journal of Experimental Psychology: Learning, Memory, and Cognition in the year 2008.
Replicability raises important issues for “big science” projects, ranging from genomics (The Thousand Genomes Project) to physics (The Large Hadron Collider). These experiments are too big and costly to actually replicate. So how do we know the results of these experiments aren’t just errors, that upon replication (if we could do it) would not show up again? Maybe smaller scale replications of sub-projects could be used to help convince us of discoveries in these big projects?
In the meantime, I love the idea that replication is getting the credit it deserves (at least in psychology). The incentives in science often only credit the first person to an idea, not the long tail of folks who replicate the results. For example, replications of experiments are often not considered interesting enough to publish. Maybe these new projects will start to change some of the perverse academic incentives.
Shortly after the Duke trial scandal broke, the Institute of Medicine convened a committee to write a report on translational omics. Several statisticians (including one of our interviewees) either served on the committee or provided key testimony. The report came out yesterday. Nature, Nature Medicine, and Science had posts about the release. Keith Baggerly sent an email with his thoughts and he gave me permission to post it here. He starts by pointing out that the Science piece has a key new observation:
The NCI’s Lisa McShane, who spent months herself trying to validate Duke results, says the IOM committee “did a really fine job” in laying out the issues. NCI now plans to require that its cooperative groups who want to use omics tests follow a checklist similar to that in the IOM report. NCI has not yet decided whether it should add new requirements for omics tests to its peer review process for investigator-initiated grants. But “our hope is that this report will heighten everyone’s awareness,” McShane says.
Some further thoughts from Keith:
First, the report helps clarify the regulatory landscape: if omics-based tests (which the FDA views as medical devices) will direct patient therapy, FDA approval in the form of an Investigational Device Exemption (IDE) is required. This is in keeping with increased guidance FDA has been providing over the past year and a half dealing with companion diagnostics. It seems likely that several of the problems identified with the Duke trials would have been caught by an FDA review, particularly if the agency already had cause for concern, such as a letter to the editor identifying analytical shortcomings.
Second, the report recommends the publication of the full data, code, and metadata used to construct the omics assays prior to their use to guide patient therapy. Had such data and code been available earlier, this would have greatly reduced the amount of effort required for others (including us) to check and potentially extend on the underlying results.
Third, the report emphasizes, repeatedly, that the test must be fully specified (“locked down”) before it is validated, let alone used to guide patient therapy. Quite a bit of effort is given to providing an explicit definition of locked down, in part (we suspect) because both Lisa McShane (NCI) and Robert Becker (FDA) provided testimony that incomplete specification was a problem their agencies encountered frequently. Such specification would have prevented problems such as that identified by the NCI for the Lung Metagene Score (LMS) in 2010, which led the NCI to remove the LMS evaluation as a goal of the Phase III cooperative group trial CALGB-30506.
Finally, the very existence of the report is recognition that reproducibility is an important problem for the omics-test community. This is a necessary step towards fixing the problem.
Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”
My contention is that if someone asks you these questions, start looking for the exits.
Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which dramatically reduced bloodstream infections in hospital intensive care units.
There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures.
Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a talk I gave at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.
I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?
This idea is embedded in the reproducible research policy at Biostatistics, but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time).
It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?
First of all, thanks to Rafa for scooping me with my own article. Not sure if that’s reverse scooping or recursive scooping or….
The latest issue of Science has a special section on Data Replication and Reproducibility. As part of the section I wrote a brief commentary on the need for reproducible research in computational science. Science has a pretty tight word limit for it’s commentaries and so it was unfortunately necessary to omit a number of relevant topics.
The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because Science has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), reproducibility is not needed soley to prevent fraud, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not Science).
One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. We can start modestly and make useful improvements.
A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.
Over the Thanksgiving recent break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related.
Over the past year, I’ve been doing a lot of talking about reproducible research. Talking to people, talking on panel discussions, and talking about some of my own work. It seems to me that interest in the topic has exploded recently, in part due to some recent scandals, such as the Duke clinical trials fiasco.
If you are unfamiliar with the term “reproducible research”, the basic idea is that authors of published research should make available the necessary materials so that others may reproduce to a very high degree of similarity the published findings. If that definitions seems imprecise, well that’s because it is.
Victoria Stodden is an assistant professor of statistics at Columbia University in New York City. She moved to Columbia after getting her Ph.D. at Stanford University. Victoria has made major contributions to the area of reproducible research and has been appointed to the NSF’s Advisory Committee for Infrastructure. She is the recent recipient of an NSF grant for “Policy Design for Reproducibility and Data Sharing in Computational Science”