R-words #5
I will only attempt one because the rest seem too similar/unknown to me. Reimplementable: Is there enough information in the specification (i.e., the journal article, or material referenced within it) to recreate the model (within the theory or account) from scratch? If yes, then the model is reimplementable. If no, then even if the experiments can be carried out within the original (presumably opaque) codebase, the model cannot be reimplemented (given the current specification).
Rerunable: Is it possible to re-run the model (same computer, same system, same program) and get the exact same results? It may seem obvious that the answer is yes, but it is actually not that obvious. For example, if you're using a random generator and did not set or record the seed, then you cannot guarantee a re-run. The same applies if you fed some parameters manually when starting your model without a mechanism to save them, or read them from a file that was changed after the run.
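To make that point concrete, here is a minimal sketch of what recording the seed and the manually supplied parameters could look like; the file name and parameter names are made up for illustration, not taken from any particular model.

```python
import json
import random

# Record everything a re-run will need: the RNG seed and the parameters
# that would otherwise be typed in by hand (names are illustrative only).
params = {"seed": 12345, "learning_rate": 0.01, "n_trials": 1000}
with open("run_parameters.json", "w") as f:
    json.dump(params, f, indent=2)

random.seed(params["seed"])
results = [random.random() for _ in range(params["n_trials"])]

# A later re-run reads the same file back and regenerates the same stream.
with open("run_parameters.json") as f:
    saved = json.load(f)
random.seed(saved["seed"])
rerun = [random.random() for _ in range(saved["n_trials"])]
assert rerun == results  # exact match on the same machine and Python version
```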
What's the history/reason behind the capitalisation of rerun?
Replicable: Is it possible to re-run the model (same program) on a different computer, using a different system or a different version? Does your specification give enough information concerning the required libraries and their respective version numbers? Does your model rely on system-specific libraries? Does it correctly handle system-specific features (float precision, endianness, etc.)?
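As a rough illustration of the kind of information such a specification could record automatically, here is a sketch that writes the library versions and the system-specific numeric features next to the results; the file name and the choice of recorded fields are assumptions made for the example, not a prescription.

```python
import json
import platform
import sys

import numpy as np

# Capture what a replication on another machine would need to compare:
# interpreter and library versions, the platform, endianness, and the
# floating-point precision actually in use.
environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
    "byteorder": sys.byteorder,                      # endianness of the host
    "float64_eps": float(np.finfo(np.float64).eps),  # float precision
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```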
@oliviaguest None, just correct it.
Reproducible = Reimplementable (for me)
Ah, so for me it's more complex (although I might need to think about it more): (Rerun + Reimplement) ~ Reproduce = Replicate, where the R-words on the LHS are somehow weighted. Edited: OK, not sure. But I think maybe it's best to nest them? I have seen other definitions out there.
Something needs to be said about replicating the data also, in my opinion. If you are modelling data, then surely the experiment that produced the training data and the testing data has to also be replicable.
@oliviaguest Your definition of reimplementable looks fine to me, but it should be made clear that it applies to a human-readable document (such as a paper), not to software or computed results like the other R-words. I also think that this term matters to us, because it defines the ideal candidate paper for a replication to be published in ReScience.
@rougier I am fine with your definition of "rerunable". The tricky part is the transition from there to "replicable". The idea is that there are aspects of a computation that should be modifiable without affecting the results, according to expectations shared in the community. A computation is then called "replicable" if it satisfies those expectations. Typically the expectations include results that are independent of minor version changes in everything, and of the use of different compilers and operating systems. The big problem is of course that these expectations are never written down explicitly, and it is unlikely that there is complete consensus about them in any community. But without a clear list of criteria, it is impossible to verify whether a computation is replicable. To make matters worse, some people's expectations are about obtaining the exact same results at the bit level, whereas others consider it normal that "small" variations happen, though nobody ever seems able to define "small" in this context.
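The two notions of "same results" mentioned above can be made concrete in a few lines; the arrays and the tolerance below are invented for illustration, and choosing the tolerance is exactly the unresolved question.

```python
import numpy as np

# Stand-ins for the outputs of the original run and of a run on a
# different system (values are made up).
original = np.array([0.1, 0.2, 0.3])
rerun = original.copy()              # same bits
replicated = original + 1e-12        # a "small" variation

# Bit-level identity: one community's expectation.
print(np.array_equal(original, rerun))       # True
print(np.array_equal(original, replicated))  # False

# Identity up to a tolerance: the other expectation, but who fixes atol?
print(np.allclose(original, replicated, rtol=0.0, atol=1e-9))  # True
```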
Repeatable: = rerunable.
Reproducible: the result of a computational study is reproducible if its human-readable description is reimplementable and if a reimplementation leads to results whose scientific interpretation is the same as for the original results. Note that reproducibility can change over time, for two reasons:
[...]
Reusable: a piece of code or a dataset is reusable if its characteristics are sufficiently well described that it can safely be transferred to another context. For a piece of code, there are two interesting relations to other R-words:
[...]
Where do changes that are not central to the theory, and so can be abstracted away in ideal circumstances, get relegated to? For example, I have come across cases where a non-theoretically-important implementation detail that should not affect the model (e.g., quicksort vs another sorting algorithm) ends up affecting it because the authors were not careful. The type of sorting algorithm used is categorically not part of the theory and should not be, but it was still integral to the replication of the results (because careful, consistent modelling was not carried out). Ideally it should be part of the spec, but it was neither part of the spec nor abstracted away enough during investigations of the model, so the results ended up depending on a theoretically irrelevant point. And - very relatedly - where do details that are important to the theory but have not yet been discovered as such belong in this r-hierarchy of words? It is a similar but importantly different case, in which an implementation detail needs to be promoted to the theory level because it is actually theoretically important, e.g., it is important for quicksort to be mentioned because the theory depends on it, and not just because of sloppy modelling.
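For readers who have not met this kind of failure, here is a toy sketch of how a sorting algorithm, which no theory cares about, can leak into the results; the array and the "pick the best-ranked item" step are invented for the example.

```python
import numpy as np

# Scores with ties: the theory only says "take the best-scoring item".
scores = np.array([0.5, 0.9, 0.5, 0.9, 0.5])

# A stable sort guarantees that tied items keep their input order; quicksort
# makes no such guarantee, so which of the tied 0.9s ends up ranked first
# (and anything computed from it downstream) may change between
# implementations or library versions, though nothing theoretical changed.
order_stable = np.argsort(scores, kind="stable")
order_quick = np.argsort(scores, kind="quicksort")

best_item = order_stable[-1]  # well defined only because the sort is stable
```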
PS: I mentioned "r-word" as a slightly flippant comment. I am now a little sorry it has caught on, as it makes me feel conflicted.
@oliviaguest I'd say that the cases you describe are outside of the R-word universe. They are well covered by traditional terms such as "mistake", "oversight", etc. Their symptom is usually non-reproducibility. In fact, I'd say that a major motivation to test for reproducibility is to catch situations such as those you describe.
Some general comments about the R-word definitions:
[...]
I don't understand; if they are outside the words we are defining, then I'm really confused. 😕
@oliviaguest Don't worry. It's a good shorthand for this discussion. I hope it won't end up in the text of our paper!
@oliviaguest Perhaps "outside" isn't the best term. They are of course related, being specific cases of non-reproducibility. But I don't think we need a specific new term for each cause of non-reproducibility. We don't want to blow up the cost of future editions of the Oxford Dictionary.
Aha! Now I see the confusion, @khinsen. |
@oliviaguest The common category is "reproducibility", in my opinion. Your second case is almost the textbook definition of a cause of non-reproducibility. Scientist A publishes a study. Scientist B tries to reproduce the scientific conclusion using a modified study, and fails. Comparison of the two studies then shows that something everybody considered a technical detail actually is important and should be promoted to a part of the theory. Your first case is very similar, except that the comparison of the two studies shows that study A was not designed carefully enough. The theory has survived another round. So the common point is that a reproduction attempt fails, and the analysis of the failure improves everybody's understanding. Just the happy ending that we need to keep our funders happy.
My feeling is that the term remixable is very dependent on the details of the license assigned to the work. However, it also has a practical aspect. It is very difficult to remix a model if the codebase is not made up of modular pieces (functions).
@jsta My main question concerning remixable is: what is it about? Mixing suggests a large collection of things. What are those things? Functions in a library? If so, what is the mix resulting from mixing functions?
@khinsen I am not sure. A subset of the original? remixable may be a tough one!
Is everything open source remixable? |
If you take "mixing" from a legal point of view, probably yes. Otherwise, we need to decide first what "remixable" really means!
It's not always the case. Imagine that you have a hybrid code (open source and proprietary); then you have to acquire the proprietary license as well, otherwise you cannot mix the hybrid code with any other code. The use of the NAG library would be an example. So I think you have to verify that all of the mixed parts are under an "open source" license.
Perhaps more important than, or equally important to, the definitions: a metric?
Very important indeed, but in my opinion this is a research topic for many years to come. At this time, we can do no more than mention the problem and refer to papers such as the one by Mesnard and Barba. Another reference along these lines is a recent paper in Science about the reproducibility of DFT computations in materials science. Note also that the problem concerns only computational models derived from continuous mathematics. That's of course a huge part of computational science, but not all of it. As a consequence, all the R-words can be defined independently, pretending that all of science can be done using discrete maths. All science is based on simplifying assumptions, so this could be ours.
What about the more general point of criteria?
I'd say that at the level of generality we work at, these criteria follow from the definitions of each R-word, with one option being "domain-specific, we can't say any more here". We should probably say something about the criteria in each definition. As an example, rerunable makes sense only if the criterion is bitwise identical results, not counting metadata such as time stamps. At the other extreme, the criteria for being reproducible are necessarily domain-specific.
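One way to operationalise "bitwise identical results, not counting metadata" is sketched below; the output layout, the file names, and the idea that metadata lives in a "timestamp" field are assumptions made for the example.

```python
import hashlib
import json

def content_digest(path):
    """Hash the scientific content of a JSON output, ignoring metadata."""
    with open(path) as f:
        output = json.load(f)
    output.pop("timestamp", None)  # metadata, not part of the result
    canonical = json.dumps(output, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Under this criterion, two runs count as "the same" iff the digests match:
# assert content_digest("run1.json") == content_digest("run2.json")
```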
I think some general meta-principles might be required, though... I might be talking at cross-purposes with you, but I have a feeling that criteria, or at least meta-criteria (criteria for criteria), can be nailed down.
This article gives definitions for replication and reproduction very clearly: http://biostatistics.oxfordjournals.org/content/10/3/405.full
And here is another: Replicability is not Reproducibility: Nor is it Good Science
@oliviaguest Thanks for those references! I remember the second one well, because I disagree with its conclusion, but its definitions of replicability vs reproducibility are indeed very clear. The first one seems to use the exact inverse definitions, and defines in detail only the one we call replicability. Unfortunately, in the criteria for replicability, there's again the "reasonable bounds for numerical tolerance", which is what Ian wrote about in his blog post.
Slight tangent, but I don't personally think any definition from any paper is Gospel, nor do I like the idea of a prescriptive/normative definition war. Principally because I think modelling and non-modelling have differences that transcend these words, and a modeller telling everybody what word to use just won't work anyway. The best we can do is define terms when we use them.
I know that's not what's being attempted here; it's just that (being Cypriot, and reading above about the OED as if it's dictating as opposed to describing) I'm keenly aware of language centralisation.
I am not interested in prescribing anything either, assuming that we have the power to do so, which I seriously doubt. I would like to see some more standardization of vocabulary, but that's beyond my influence. In the meantime, I just want to be clear about the definitions we use ourselves.
The ACM just issued an announcement Result and Artifact Review and Badging where they proposed some definitions:
[...]
The ACM adoption is unfortunate and ahistorical in the computing community. My students and I have been working for a long time on a literature review to sort through the disarray of terminology on reproducibility. Here I will share some notes. (I have a draft of a blog post or essay, but it's been abandoned for a few months now. These tidied-up notes will help.) The phrase "reproducible research" in computational science is traced back to geophysicist Jon Claerbout at Stanford, who in the '90s started the tradition in his lab that all the figures and tables in their papers should be easily re-created, even by running just one command. The oldest published paper we found that addresses their method is:
Some of the content of that paper is very outdated (it’s 1992 after all), but the way the “goals” of reproducible research are presented is interesting:
Claerbout relates some of the story of “reproducible research” coming out of Stanford in an essay on his website:
He mentions that with Matthias Schwab, they submitted an article to “Computers in Physics” about the reproducible-research concept, but it was rejected—the magazine was later bought by IEEE and turned into “Computing in Science and Engineering,” where it was eventually published years later as:
At Stanford, statistics professor David Donoho learned of Claerbout’s methods in the early 1990s, and began adopting (and later promoting) them. A well-cited early paper from his group is:
This paper is often cited for the quote [...]. Buckheit and Donoho make the commitment [...]. Citing the work of Claerbout, they say [...] and later [...]. See also:
This last article has this interesting response to the imagined argument "True reproducibility means reproducibility from first principles": [...] The influence of Claerbout and Donoho permeates through a large portion of the recent reproducibility movement in computational science. Victoria Stodden, de facto spokeswoman for reproducibility on the conference circuit, was Donoho's PhD student. I mentioned above a paper by Schwab, Karrenbach and Claerbout (2000), published in CiSE, a joint publication of the IEEE Computer Society and AIP (American Institute of Physics). The publication ran a Special Issue on Reproducible Research in 2009 that included several more-or-less well-cited papers:
The use of the term "reproducible research" is consistent throughout them. Enter Roger Peng. His paper makes a clear distinction between reproducible research and a full replication study. The distinction also appears in his earlier publication:
which says [...] and the distinction is accentuated in:
(343 citations, checked today on Google Scholar) But why do we often see an emphasis on reproducing the figures, tables, etc. in a published computational study? I found a nice explanation in:
She writes [...]. OK, so why do we have the terms completely swapped in the ACM adoption? At least within computational fields, the "swapped" terms are traced back to a (frankly, misguided and irate) paper by C. Drummond—already mentioned above in this Issue thread.
For background, bear in mind that the computing community publishes primarily in conferences, which are peer-reviewed. But within the conferences, there are workshops that have less selective review. This is a workshop paper, aimed at the machine-learning community. Drummond admits that he is swapping one term for another: "I use X for what others call Y." He's arbitrarily renaming "replication" … It seems that a lot of people have been influenced by the swapping of terms Drummond made in 2009—I will speculate that the ranting quality of his paper gave it some magnetism (like political news headlines these days). See also:
and commentary by R. Peng in [...]. But especially, read this:
This is an essay by Mark Liberman, Christopher H. Browne Distinguished Professor of Linguistics at the University of Pennsylvania. He teaches introductory linguistics, as well as big data in linguistics, and computational analysis and modeling of biological signals and systems (among other topics). I found this blog post where the author corrected the swapped terminology after becoming aware of this! Additional references using a terminology that is consistent with Claerbout/Donoho/Peng are:
I have more! But I will leave it there for now, as this Issue comment is already 2,000+ words.
On the question of definitions of 'reproducibility' and 'replicability', I think the idea of convergence on definitions noted above might be impossible, because these terms have totally opposite definitions in different fields. @labarba's comprehensive summary of the literature captures what I think is the common and widespread use, outside of the ACM, political science, and one or two other areas. Incidentally, it seems like the two terms are used synonymously in this paper, in this sentence: "However, good intention are not sufficient and a given computational results can be declared reproducible if and only if it has been actually replicated in a the sense of a brand new open-source and documented implementation." The article What does research reproducibility mean? similarly summarises the prevailing definitions for most researchers in my field and related areas. They present reproducibility as
This is distinct from replicability:
They further define some new terms: methods reproducibility, results reproducibility, and inferential reproducibility. But, as Lorena has noted (I'm looking forward to seeing the rest of her review!), the definitions in this Science paper, which are also consistent with a long history of discussions of scientific reproducibility, as noted in the linguistic analysis on the Language Log blog, are totally opposite to the ACM's, which takes its definitions from the International Vocabulary of Metrology. Here are the ACM definitions:
The problem with these definitions is that the IVM is the wrong place to look for modern definitions of these terms, because it is exclusively concerned with the measurement of physical properties. It does not engage at all with computational contexts. Computational contexts are a big part of the contemporary reproducibility discussion, thanks largely to the work of @victoriastodden. We can also note with interest the recent Nature News articles Muddled meanings hamper efforts to fix reproducibility crisis and 1,500 scientists lift the lid on reproducibility. These report on the general problem of a lack of a common definition of reproducibility, despite a widespread recognition that it's a problem. They are helpful for demonstrating that there is a range of definitions in common use. The main point here is that any discussion of definitions of these terms needs to acknowledge this diversity as part of the challenge of promoting these values and behaviours in science broadly. If this diversity is neglected, and you are writing for audiences spanning many fields (I hope this for ReScience!), there is a risk of being irrelevant to researchers in fields that have different definitions from the ones you've adopted. I understand that you do have to present some kind of definitions in this paper, and I guess that the ones you choose will depend on which research community you want to signify your affiliations with. There's no problem with that, so long as you note (perhaps with a brief comment and carefully chosen citations) that there is substantial diversity in how the terms are used across the sciences. It's great for science generally that more people are concerned with these issues, even if they don't agree on the definitions!
@benmarwick writes:
[...]
I wholeheartedly agree—going to IVM for inspiration on what definitions to adopt was misguided. (It is possible, too, that some folks in that committee were influenced by the Drummond papers. SIGH.) The ACM is the Association for Computing Machinery. Although we may resign ourselves to the impossibility of a convergence of terminology across all disciplines, within computational disciplines there is a clear history of adoption. I have given more than a dozen references above, spanning 25 years, and there are many more. (If anyone is adding a counter-example—like, "chemists use the opposite meaning"—please, do include a reference, rather than leaving it as hearsay.) The Science paper @benmarwick cited:
... recognizes that: “… basic terms—reproducibility, replicability, reliability, robustness, and generalizability—are not standardized”, while clearly adopting the Claerbout/Donoho/Peng usage. They say: “ … the modern use of ‘reproducible research’ was originally applied not to corroboration, but to transparency, with application in the computational sciences. Computer scientist [mistake: geophysicist] Jon Claerbout coined the term and associated it with a software platform and set of procedures that permit the reader of a paper to see the entire processing trail from the raw data and code to figures and tables.” [used in] “epidemiology, computational biology, economics and clinical trials…" [refs. provided] Goodman et al. propose a new lexicon as a way out of the confusion:
A good portion of this article derives from a talk given by Goodman at a workshop of the National Academy of Sciences, titled: "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results." Goodman gave there a useful clustering of disciplines into "groups with similar cultures," as follows:
When discussing the diversity of definitions that @benmarwick mentions, we can look at this clustering and see which group each usage falls into. I already gave a dozen-plus references for computational sciences. In epidemiology and social science, the meaning is consistent with Claerbout/Donoho/Peng—cf. Peng, Dominici & Zeger (2006), on epidemiology, and the NSF 2015 [PDF] report for social sciences. In clinical research, there is "no clear consensus as to what constitutes a reproducible study" (Goodman et al., 2016), but the usage of the terms is consistent: one replicates the findings (while reproducibility refers to the process of investigation). For the group of natural-world-based sciences, I don't have in my notes references for astronomy or ecology (yet), but we heard from @benmarwick that the usage in archaeology is consistent with the above. The pattern of usage is clear: reproducible study and replicable findings.
While I agree that the IVM is not the last word on terminology for science at large, I don't consider it absurd either to turn to it for "prior art" in choosing terms. Computational science has different issues than experimental science, but in the end, both are forms of doing science and their practitioners should be able to talk to each other. It makes more sense to me to extend the traditional terms from experimental science to computational scenarios where this is possible.
I wanted to follow on from @labarba's most recent comment with a reference from the natural-world sciences (ecology):
It seems that they follow the most common and widespread use of the terms detailed in @labarba's comprehensive summary, except that they switch out the term Replicability for Repeatability.
I published this on Medium: "Barba group reproducibility syllabus". It's not addressing terminology, but rather is a summary of the top-10 references chosen in my group as the basic reading list on reproducibility. Topical to this thread as a complement to the lit review I started above.
I just added a link on: http://rescience.github.io/about/
I only became aware of this discussion a few days ago. Even though the train left the station a while ago, I would like to add a few comments. That terminology (reproducibility: re-run the same code; replicability: independent implementation) jars with my general (intuitive) understanding of the terms, and was probably the reason why I based the terminology proposed in Crook, Davison and Plesser (2013) on Drummond (2009), while otherwise disagreeing with Drummond. Merriam-Webster differentiates "reproduction" and "replication" as follows (see the Synonym Discussion section of the entry): "reproduction implies an exact or close imitation of an existing thing. ... replica implies the exact reproduction of a particular item in all details ... but not always in the same scale." A reproduction is an "exact or close imitation", while a replica is an "exact reproduction in all details"; thus a replica is a kind of reproduction that is particularly close to the original. Since re-doing the same study by running the same software on the same data is closer to the original than an independent implementation, common usage, in my opinion, suggests that "replication" fits better for re-running using the same software as the original. Furthermore, a quick Google search turned up 18.4 million hits for "reproducible research", but only 0.5 million for "replicable research". So "reproducible" seems to be the far more common term. Now I would think that the (scientific) public at large is first and foremost interested in whether we can trust scientific results, whether they are robust overall, reveal the laws of nature---whether they can be corroborated by independent experimentation. In view of this, is it really sensible to narrow "reproducing" to mean "running the same software on the same input data and obtaining the same results"? In the pioneering work of Claerbout's group (Claerbout and Karrenbach, 1992; Claerbout, undated; Schwab et al., 2000), I haven't found any discussion of why they chose the term "reproducible" for their approach. I wish they had chosen differently, so that the rather young (25 years) reproducible research movement had not ended up with a terminology at odds with the significantly older metrology.
While you place the emphasis on the phrase "in all details," I could place it on "exact" to make the same argument for "reproduction" instead of "replication." In the end, it is seldom practical to try to get help from the dictionary for discussions about terms of art. Reproducibility is a spectrum of concerns. A most basic question is: can you run my code with my data and get my results? This is the minimum requirement, and often referred to as "reproducible research." I would wager that's why the search results for "reproducible research" are most numerous. Replications, in the sense of Peng and others, are (unfortunately) quite rare. But reproducible research is a pre-requisite for replication studies, because if replication fails—as Donoho points out—only if both author teams worked reproducibly is it possible to find the source of any discrepancies.
Personally I don't care much about vague analogies to dictionary definitions that were clearly not written with research in mind. But I do agree with @heplesser's argument about the use of "reproducible" in a wider sense, applying to science in general rather than to the specific problems of computer-aided research. I suspect that Claerbout's choice of the term "reproducible research" was meant to be provocative. All scientific research is supposed to be reproducible (in an ideal world), so what he was arguing for was "merely" to adopt this criterion for computer-aided research as well. Back then, 25 years ago, nobody discussed non-reproducibility in experimental contexts. Today, "reproducibility" in a less well-defined sense has become a widespread concern. Most modern uses of the term can clearly not be interpreted in Claerbout's sense, because they don't refer to computation. In the long run, I doubt computational scientists will be able to claim the "reproducibility" label for their particular, rather technical issue. But this question won't be settled before the scientific community at large, which is dominated by experimentalists, understands the various issues and agrees on a common terminology. I expect this to take at least another decade, during which ReScience can become anything from a mainstream journal to a relic of the past. In the meantime, the definitions we currently use in ReScience are clear and have some historical justification, which is good enough for me.
At the risk of being argumentative by going off-topic, I can't help but react to this statement with a bit of skepticism, given the results of: S.J. Hettrick et al. (2014), UK Research Software Survey doi:10.5281/zenodo.14809
[...]
@labarba The fact that most scientists use software does not imply that the use of software is seen by them as a major cause of non-reproducibility. My impression of the discussion of the "reproducibility crisis" in the literature is that it focuses on statistical issues, such as insufficient sample size in experiments or p-hacking in data analysis.
Prompted by @oliviaguest on this Twitter discussion, I would like to bring to attention more recent (2019) literature Reproducibility and Replicability in Science (https://doi.org/10.17226/25303). This "Consensus Study Report" tackles the terminological confusion on p. 42-46 and arrives at the following definition on p. 46:
While the report itself admits varied usage of the terms, such consensus reports, which have been compiled across many disciplines, are good rallying points for anchoring terminology. The lack of consistency in terminology hinders progress on the topic, and if the ReScience journal takes a stance, it may further help the community by stabilizing the volatile terminology.
@hlageek ReScience adopted the definitions you quote a while ago. I hope we use them consistently. If you find an incompatible use, please open an issue!
@khinsen Might it be useful to link to some of these on the website where we discuss/define it? http://rescience.github.io/faq/
The idea is to converge on definitions of these words according to our respective scientific domains. At this point it is not even certain that all words are relevant to all domains, and this may also depend on the kind of software we consider (see discussion in #4).
Rerunable:
Repeatable:
Replicable:
Reproducible:
Reusable:
Remixable:
Reimplementable: