To those among us who deal with such research, how common is such documentation in your workflow? How often did you feel the tools were failing you?
If you don't document your exact workflow, would you do so if the tools were available?
In recent years, scientists may have inadvertently given up on a key component of the scientific method: reproducibility. That's an argument being advanced by a number of people who have been tracking our increasing reliance on computational methods in all areas of science. An apparently simple computerized analysis may now involve a complex pipeline of software tools; reproducing it will require version control for both software and data, along with careful documentation of the precise parameters used at every step. Some researchers are now concerned that their peers simply aren't up to the challenge, and that we need to start providing the legal and software tools to make it easier for them.
In the past, reproduction was generally a straightforward affair. Given a list of reagents and an outline of the procedure used to generate some results, other labs should be able to see the same things. If a result couldn't be reproduced, that could be a sign that the original result was so sensitive to the initial conditions that it probably wasn't generally relevant; more troublingly, it could be viewed as a sign of error or fraud. In any case, the ability to reproduce a given result is key to its general acceptance and, since a successful experiment is often the foundation of further research, often essential for pushing a field forward.
Computation and the reproducibility crisis
But, when it comes to computational analysis, the equivalents of both reagents and procedures have a series of issues that act against reproducibility. The raw material of computational analysis can be a complex mix of public information and internally generated data—for example, it's not uncommon to see a paper that combines information from the public genome repositories with a gene expression analysis performed by an individual research team.
A lot of this data is in a constant state of flux; new genomes are being completed at a staggering pace, meaning that an analysis performed six months later may produce substantially different results unless careful versioning is used. In some cases, the data resides in proprietary databases. Some of this work may run up against the issues of data preservation, as older information may reside on media that's no longer supported or in file formats that are difficult to read.
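To make the data-versioning problem concrete, here is a minimal sketch of what recording a dataset's provenance might look like. The function names, fields, and file naming below are hypothetical rather than drawn from any particular repository or tool; the point is simply that a release label and a checksum pin down exactly which data an analysis used.

import hashlib
import json
from datetime import date

def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a data file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_dataset(path, source, release):
    """Write a small provenance record alongside the raw data file."""
    record = {
        "file": path,
        "source": source,            # e.g. the public repository it came from
        "release": release,          # the version label the repository assigns
        "sha256": fingerprint(path), # lets a later analysis confirm it is unchanged
        "retrieved": date.today().isoformat(),
    }
    with open(path + ".provenance.json", "w") as out:
        json.dump(record, out, indent=2)
    return record

# Hypothetical usage: pin the exact genome release an analysis was run against.
# record_dataset("genome_build_37.fa", source="public genome repository", release="build 37")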
And that's just the data. An analysis pipeline may involve dozens of specialized software tools chained together in series, each with a number of parameters that need to be documented for their output to be reproduced. Like the data, some of these tools are proprietary, and many of them undergo frequent revisions that add new features, change algorithms, and so on. Some of them may be developed in-house, where commenting and version control often take a back seat to simply getting software that works. Finally, even the best commercial software has bugs.
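A similar sketch applies to the pipeline itself. The snippet below, with hypothetical tool names and flags, shows one way a wrapper could chain external tools in series while logging each step's command line, parameters, and exit status to a file that travels with the results.

import json
import subprocess
import sys

def run_step(log, name, command, **params):
    """Run one external tool, recording its name, parameters, and exit status."""
    args = [command] + [f"--{key}={value}" for key, value in params.items()]
    result = subprocess.run(args, capture_output=True, text=True)
    log.append({
        "step": name,
        "command": args,
        "parameters": params,
        "returncode": result.returncode,
        "python": sys.version,  # interpreter version, one piece of the environment
    })
    return result

# Hypothetical two-step pipeline: an aligner followed by an expression counter.
# The tool names and flags are placeholders, not a real workflow.
log = []
# run_step(log, "align", "aligner", reference="genome_build_37.fa", mismatches=2)
# run_step(log, "count", "counter", annotation="genes.gtf", min_quality=30)
with open("pipeline_log.json", "w") as out:
    json.dump(log, out, indent=2)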
The net result is that, even in cases where all the data and tools are public, it may simply be impossible to produce the exact same results. And that has serious implications for the use of reproducibility in every sense: validating existing results by repeating them, determining their robustness by performing a distinct but overlapping analysis, and using the results as a basis for further research. Perhaps most significantly, without a way of determining precisely what was done, it's impossible to identify the differences that sometimes cause similar analyses to produce different results.
Putting the reproducibility into computational code
At this point, one might be tempted to make a misguided argument that the advent of computational tools means that it's time for a major revision of the scientific method. But reproducibility became a key component of that method precisely because it's so useful, which is why a number of people are pushing for the field to adopt standards to ensure that computational tools conform to the existing scientific method.
Victoria Stodden is currently at Yale Law School, and she gave a short talk at the recent Science Online meeting in which she discussed the legal aspects of ensuring that the code behind computational tools is accessible enough for reproducibility. The obvious answer is some sort of Creative Commons or open source license, and Stodden is exploring the legal possibilities in that regard. But she makes a forceful argument that some form of code sharing will be essential.
"You need the code to see what was done," she told Ars. "The myriad computational steps taken to achieve the results are essentially unguessable—parameter settings, function invocation sequences—so the standard for revealing it needs to be raised to that of when the science was, say, lab-based experiment." This sort of openness is also in keeping with the scientific standards for sharing of more traditional materials and results. "It adheres to the scientific norm of transparency but also to the core practice of building on each other's work in scientific research," she said. But the same worries that apply to more traditional data sharing—researchers may have a competitor use that data to publish first—also apply here. In the slides from her talk, she notes that a survey she conducted of computational scientists indicates that many are concerned about attribution and the potential loss of publications in addition to legal issues. (The biggest worry is the effort involved to clean up and document existing code.)
Still, this sort of disclosure, as with other open source software, should provide a key benefit: more interested parties able to evaluate and improve the code. "Not only will we clearly publish better science, but redesigned and updated code bases will be valuable scientific contributions," Stodden said. "Over time, we won't stagnate forever on one set of published code."
Building reproducibility into the tools
Even if we solve the legal and computational portions of the problem, however, we're going to run into issues with the fact that many of the people who use computational tools understand what they do, but don't feel compelled to learn the math behind them. That's where a paper in the latest edition of Science comes in. Its author, Jill Mesirov of the Broad Institute, describes how many biologists aren't well versed in computational analysis, but are increasingly reliant on tools created by those who are; she then goes on to describe one type of solution, called GenePattern, that she and her colleagues put together with the help of Microsoft Research.
The idea is that researchers who rely on computational techniques as part of their day-to-day activities need an entire "reproducible research system" that will make it easier for them to document the sources of their data and the analyses performed on it. The system they've designed shares features with rapid application development environments: it graphically represents modular computational tools, which can be arranged into an analysis pipeline, with the individual settings for each tweaked as needed. Once complete, the user can trigger the analysis to run; the system documents all of the relevant settings and software information.
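As an illustration of the general idea only (the field names and modules below are invented, not GenePattern's actual format or API), the record such a system captures might look something like this, and replaying an analysis is then just a matter of walking the recorded steps in order.

import json

# A sketch of the kind of record a reproducible-research system might attach to a
# result: which modules ran, in what order, with what versions and settings.
# Module names, versions, and parameters here are purely illustrative.
analysis_record = {
    "pipeline": [
        {"module": "normalize", "version": "1.0", "parameters": {"floor": 20, "ceiling": 20000}},
        {"module": "select_markers", "version": "2.1", "parameters": {"permutations": 1000}},
    ],
}

def rerun(record, registry):
    """Replay a recorded pipeline from a registry mapping (module, version) to callables."""
    data = None
    for step in record["pipeline"]:
        module = registry[(step["module"], step["version"])]
        data = module(data, **step["parameters"])
    return data

# Publishing this record alongside a figure is what enables point-and-click
# reproduction: a reader can inspect the settings or feed the record back to rerun().
print(json.dumps(analysis_record, indent=2))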
Microsoft got involved because Mesirov was apparently inspired by the fact that it's possible to embed a spreadsheet in a word processing document and leave it accessible to all the standard tools for manipulating that data. So, the GenePattern software includes a Word plugin that can embed any output created using these tools, such as a table or chart. When the embedded object is clicked, the full GenePattern environment makes all the information about the experiment—the software and settings used, for example—accessible to the reader. It's even possible to rerun the same analysis, a sort of point-and-click reproducibility.
There are clearly some limitations to this implementation. Although the software is cross-platform, the plugin is not. Publishers have standardized on PDFs (not Word docs) for distributing research papers, and GenePattern is limited to those analysis tools that have been adapted to work with it. But there's no question that it's the right sort of approach, as it implicitly recognizes that just about anyone doing science is now doing some form of computation, and very few of them are likely to be fully aware of everything involved in documenting their work well enough to ensure reproducibility. Unless the tools that do the analysis are designed to do so for them, a lot of this critical information is likely to be lost to the rest of the scientific community.