
A (truly) reproducible R workflow

For more than two years, I have been preaching reproducibility and transparency in data journalism. In the meantime, our team at the Swiss Public Broadcast has published countless RMarkdown scripts on GitHub. This special effort didn’t go unnoticed and even earned us a nomination for the Data Journalism Awards 2017 in the category “data journalism website of the year”. Nowadays, publishing R or Python scripts on GitHub or the like seems to feel natural for many teams.

My personal tool of choice: R and reproducible reports with RMarkdown.

But these reports aren’t really reproducible.

I don’t know about you, but whenever I dig up an old script (let’s say, older than a year) and try to run it, chances are high that the RMarkdown compilation fails at some point. Since neither the input data nor the script itself has changed, the culprit is quickly identified: new package versions.

The R environment evolves quickly, and popular packages like `dplyr` and `ggplot2` got major overhauls in the last few months and years. Quite possibly a plot looks different from the “same” plot of two years ago. Quite possibly a simple process like reading in an Excel sheet fails nowadays because `readxl` now renames columns as `X__n` rather than `Xn`.

The problem is that, up to now, all my scripts always installed the latest versions of packages. And not only my scripts – almost all scripts I see in the wild, whether written by scientists or data journalists, suffer from that shortcoming. But for true reproducibility, someone who wants to re-run a script and replicate the results needs to run it with the exact same packages*.
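To make the problem concrete, here is a minimal base-R sketch that compares the package versions installed on a machine with the versions a script was originally written for. It is not part of the template: the helper name `check_versions` and the pinned version numbers are assumptions chosen for illustration.

```r
# Hypothetical helper (not from the template): compare installed package
# versions against the versions a script was originally written for.
check_versions <- function(pinned) {
  installed <- vapply(
    names(pinned),
    function(pkg) {
      # system.file() returns "" when the package is not installed
      if (nzchar(system.file(package = pkg))) {
        as.character(utils::packageVersion(pkg))
      } else {
        NA_character_
      }
    },
    character(1)
  )
  data.frame(
    package   = names(pinned),
    expected  = unname(pinned),
    installed = unname(installed),
    matches   = !is.na(installed) & unname(installed) == unname(pinned),
    stringsAsFactors = FALSE
  )
}

# Assumed example versions; replace them with the ones your script relied on.
pinned <- c(dplyr = "0.7.1", ggplot2 = "2.2.1")
print(check_versions(pinned))
```

Any row with `matches == FALSE` is a candidate for why an old script suddenly behaves differently. A check like this only detects drift; it does not fix it.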

So I took a few days and tried to figure out how to get around this problem. Of course, there are already packages that solve this: There is `packrat`, which makes some sort of snapshot of a project’s packages (or their source code). I tried it out and, for some reason, couldn’t get it to work flawlessly with my previous setup. Also, I was a bit turned off by versioning the package sources with Git. I then came across `checkpoint`, which lets you install and load packages from a specific point in time (= from a specific mirror of the whole CRAN, hosted by Microsoft). By default, `checkpoint` saves the installed packages not in a global folder (which is the case for the usual setup) but in a `.checkpoint` folder in one’s home directory. To me, that solution seemed more elegant, and it worked (with some tweaks for RMarkdowns I had to figure out the hard way). See below for some downsides of this approach.
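As a rough sketch of what using `checkpoint` looks like in a script: the wrapper `use_snapshot` and the date `2017-07-01` are made up for illustration and are not taken from the template. The sketch assumes the `checkpoint` package is installed and that the snapshot server is reachable.

```r
# Hypothetical wrapper (not from the template): validate the snapshot date,
# then hand it to checkpoint.
use_snapshot <- function(snapshot_date) {
  date <- as.Date(snapshot_date, format = "%Y-%m-%d")
  if (is.na(date)) {
    stop("snapshot_date must look like 'YYYY-MM-DD'")
  }
  if (!requireNamespace("checkpoint", quietly = TRUE)) {
    stop("the checkpoint package is not installed")
  }
  # checkpoint scans the project for library()/require() calls and installs
  # those packages as they were on CRAN on the given date.
  checkpoint::checkpoint(format(date))
}

# Assumed usage at the very top of a script, before any library() call:
# use_snapshot("2017-07-01")
# library(dplyr)
```

The call has to come before any `library()` calls, so that the packages loaded afterwards are the ones from the snapshot.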

And now, for those who have been bored up until now and are only looking for the solution™️:

I packed my two or three years of experience with creating “reproducible” RMarkdowns, working with RStudio and co. into a new template (there is an old one too – now deprecated), which has the following perks:

  • Comes with cutting-edge, tried-and-tested packages for efficient data journalism with R, such as the tidyverse
  • Full reproducibility with package snapshots (thanks to the checkpoint package)
  • Runs out of the box and in one go, user doesn’t have to have anything pre-installed (except R and maybe RStudio)
  • Automatic deployment of knitted RMarkdown files (and zipped source code) to GitHub pages, see this example
  • Code linting according to the tidyverse style guide
  • Preconfigured .gitignore which ignores shadow files, access tokens and the like by default
  • Automatic working directory configuration for multiple users

This template could make your own projects truly reproducible and – most importantly – should allow third parties who don’t have much experience with R or RMarkdown to quickly reproduce your results in one go. It is branded with “data journalism”, but of course it also works for “conventional” science. At the moment, we’re migrating our old projects on srfdata.github.io to this new template (one example) and so far it seems to work – scripts which failed previously miraculously work again once we set the `checkpoint` snapshot to the right point in time.

If you have questions, criticism or issues, use the comments section or GitHub issues (I’m always happy about pull requests) or hit me up at #useR2017 this week.

*or even: with the same R version compiled on the same architecture, but that’s a different story and inconsistencies between package versions are probably way more common.
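For that different story, a small base-R guard can at least flag a mismatching R version. This is only a sketch: the function name `check_r_version` and the version `3.4.0` are assumptions for illustration, not something the template provides.

```r
# Hypothetical guard (not from the template): warn when the running R
# differs from the version the analysis was written with.
check_r_version <- function(expected) {
  # getRversion() can be compared directly with a version string
  same <- getRversion() == expected
  if (!same) {
    warning(sprintf(
      "Script was written for R %s, but this is R %s",
      expected, as.character(getRversion())
    ))
  }
  invisible(same)
}

# Assumed example version; use the one your analysis was written with.
check_r_version("3.4.0")

# The platform R was built for, worth noting next to your results.
print(R.version$platform)
```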

The downsides of this approach

There has been quite some discussion here and on Reddit about this template. Here I just want to quickly state my opinion on a couple of arguments that were brought up (taken from the README).

With `checkpoint`, you can only access archived packages from CRAN, i.e. MRAN. As others have pointed out, GitHub repositories don’t fit into this system. I wouldn’t consider this a big issue, as you can install specific versions (i.e. releases/tags) from GitHub, and as long as the GitHub repository stays alive, you can access these old versions. This is how the checkpoint package itself is installed in this template, by the way:

# Install a tagged release; the repository is given as "owner/repo"
# (the separate `username` argument is deprecated in devtools).
devtools::install_github("RevolutionAnalytics/checkpoint",
                         ref = "v0.3.2")

A second possible disadvantage is the reliance on Microsoft’s snapshot system. Once these snapshots are down, the whole system is futile. I reckon/hope there will be third-party mirrors though, once the system gets really popular.

Some people also mentioned Docker, as it allows you to make an image of your whole system and thus guarantees absolute reproducibility (like a time machine). The reason why I decided against that is predominantly accessibility for less technically inclined useRs. See my comment below.

PS: Some weeks later, I came across a case where package reproducibility turned out to be absolutely crucial. Check it out.
