
A (truly) reproducible R workflow

For more than two years, I have been preaching reproducibility and transparency in data journalism. In the meantime, our team at the Swiss Public Broadcast has published countless RMarkdown scripts on GitHub. That effort didn’t go unnoticed: it even got us a nomination for the Data Journalism Awards 2017 in the category “data journalism website of the year”. Nowadays, publishing R or Python scripts on GitHub or the like seems to feel natural for many teams.

My personal tool of choice: R and reproducible reports with RMarkdown.

But these reports aren’t really reproducible.

I don’t know about you, but whenever I dig up an old script (let’s say older than a year) and try to run it, chances are high that the RMarkdown compilation fails at some point. Since neither the input data nor the script itself has changed, the culprit is quickly identified: new package versions.

The R environment evolves quickly, and popular packages like dplyr and ggplot2 have had major overhauls in the last few months and years. Quite possibly a plot looks different from the “same” plot two years ago. Quite possibly a simple process like reading in an Excel sheet fails nowadays because readxl now renames columns as X__n rather than Xn.

The problem is that, up to now, all my scripts always installed the latest versions of packages. And not only my scripts: almost all scripts I see in the wild, whether by scientists or data journalists, suffer from that shortcoming. But for true reproducibility, someone who wants to re-run a script and replicate the results needs to run it with the exact same packages*.
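To make "the exact same packages" concrete, here is a minimal base R sketch that records the versions a script ran with and later flags any drift. The helper names `record_versions()` and `check_versions()` are hypothetical and are not part of the template; the two base packages only stand in for real dependencies.

```r
# Record the installed version of each package as a named character vector.
record_versions <- function(pkgs) {
  vapply(pkgs,
         function(p) as.character(utils::packageVersion(p)),
         character(1))
}

# Compare pinned versions against what is installed right now.
check_versions <- function(pins) {
  current <- record_versions(names(pins))
  data.frame(
    package = names(pins),
    pinned = unname(pins),
    installed = unname(current),
    ok = unname(current == pins),
    stringsAsFactors = FALSE
  )
}

# Illustration with two base packages (stand-ins for real dependencies).
pins <- record_versions(c("stats", "utils"))
report <- check_versions(pins)

# Simulate an old script whose pinned version no longer matches.
stale <- pins
stale["utils"] <- "0.0.1"
stale_report <- check_versions(stale)
print(stale_report)
```

A check like this only detects the mismatch; it does not reinstall the old versions, which is the part a snapshot-based tool takes care of.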

So I took a few days and tried to figure out how to get around this problem. Of course there are already packages for this: there is packrat, which makes some sort of snapshot of a project’s packages (or their source code). I tried it out and, for some reason, couldn’t get it to work flawlessly with my previous setup. Also, I was a bit turned off by versioning the package sources with Git. I then came across checkpoint, which lets you install and load packages from a specific point in time (that is, from a specific mirror of the whole CRAN, hosted by Microsoft). By default, checkpoint saves the installed packages not in a global folder (as in the usual setup) but in a .checkpoint folder in one’s home directory. To me, that solution seemed more elegant, and it worked (with some tweaks for RMarkdown that I had to figure out the hard way). See below for some downsides of this approach.
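In its simplest form, using checkpoint comes down to one call at the top of a script. The following is a sketch under assumptions, not the template's actual setup: the snapshot date is made up, the wrapper `use_snapshot()` is hypothetical, and the call needs the checkpoint package plus network access to a snapshot server that still serves that date.

```r
# Assumed snapshot date: pick the date on which the script last worked.
snapshot_date <- "2017-07-01"

# Hypothetical wrapper around checkpoint::checkpoint().
use_snapshot <- function(date = snapshot_date) {
  if (!requireNamespace("checkpoint", quietly = TRUE)) {
    stop("The checkpoint package is not installed.")
  }
  # checkpoint() scans the project for library()/require() calls,
  # installs those packages as they were on `date`, and points the
  # library path of the current session at that snapshot.
  checkpoint::checkpoint(date)
}

# Not run here because it downloads packages:
# use_snapshot()
# library(dplyr)  # would now load the version from the snapshot date
```

The date is passed positionally on purpose, since the name of that argument differs between checkpoint releases.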

And now, for those who have been bored up until now and are only looking for the solution™️:

I packed my two or three years of experience with creating “reproducible” RMarkdowns, working with RStudio and co. into a new template (there is an old one too, now deprecated) which has the following perks:

  • Comes with cutting-edge, tried-and-tested packages for efficient data journalism with R, such as the tidyverse
  • Full reproducibility with package snapshots (thanks to the checkpoint package)
  • Runs out of the box and in one go, user doesn’t have to have anything pre-installed (except R and maybe RStudio)
  • Automatic deployment of knitted RMarkdown files (and zipped source code) to GitHub pages, see this example
  • Code linting according to the tidyverse style guide
  • Preconfigured .gitignore which ignores shadow files, access tokens and the like by default
  • Automatic working directory configuration for multiple users

This template could make your own projects truly reproducible and, most importantly, should allow third parties who don’t have much experience with R or RMarkdown to quickly reproduce your results in one go. It is branded with “data journalism”, but of course it also works for “conventional” science. At the moment, we’re migrating our old projects on srfdata.github.io to this new template (one example) and so far it seems to work: scripts which previously failed miraculously work again once we set the checkpoint snapshot to the right point in time.

If you have questions, criticism or issues, use the comments section or GitHub issues (I’m always happy for pull requests) or hit me up at #useR2017 this week.

*or even: with the same R version compiled on the same architecture, but that’s a different story and inconsistencies between package versions are probably way more common.

The downsides of this approach

There has been quite some discussion here and on Reddit about this template. Here I just want to quickly state my opinion on a couple of arguments that were brought up (taken from the README).

With checkpoint, you can only access archived packages from CRAN, or more precisely from MRAN, the CRAN snapshot mirror. As others have pointed out, GitHub repositories don’t fit into this system. I wouldn’t consider this a big issue, as you can install specific versions (i.e. releases/tags) from GitHub, and as long as the GitHub repository stays alive, you can access these old versions. The checkpoint package itself is installed this way in this template, by the way.
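Pinning a GitHub release can be sketched as follows. This is not the template's actual code: it assumes the remotes package, the helper names are hypothetical, and the tag is a placeholder that would have to match a real release of the repository.

```r
# Build a "owner/repo@tag" spec, the form install_github() accepts
# for pinning a specific release or tag.
github_spec <- function(repo, tag) {
  paste0(repo, "@", tag)
}

# Hypothetical helper: install one fixed tag of a GitHub-hosted package.
install_pinned <- function(repo, tag) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("The remotes package is not installed.")
  }
  remotes::install_github(github_spec(repo, tag))
}

# Placeholder tag for illustration only; check the repository's releases.
spec <- github_spec("RevolutionAnalytics/checkpoint", "v0.3.2")
print(spec)

# Not run here because it needs network access:
# install_pinned("RevolutionAnalytics/checkpoint", "v0.3.2")
```

Because the tag is fixed, re-running the script later installs the same source as long as the repository and the tag still exist.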

A second possible disadvantage is the reliance on Microsoft’s snapshot system. Once these snapshots are down, the whole system is futile. I reckon/hope there will be third party mirrors though once the system gets really popular.

Some people also mentioned Docker, as it lets you make an image of your whole system and thus guarantees absolute reproducibility (like a time machine). I decided against it mainly for the sake of accessibility for less technically inclined useRs. See my comment below.

8 Comments

    • You’re welcome. I have considered Docker quickly but a) I’d like to stay as close to the R environment as possible and b) I somehow think it would be a bit overkill. With my proposed workflow, the only thing people need is R and ideally RStudio. It should really be executable without having to install anything else / manually.
      Concerning RStudio projects: I “heard” of them, but do they archive the used packages in some way? To be honest I haven’t really tried that out, but I think that the dependency on RStudio’s functionality (even though I love it and it’s great) should/could be avoided here. Yes: I know, my workflow depends on checkpoint and Microsoft’s CRAN snapshots.

  1. Matt Pancia

    This is nice and pretty minimal, but there’s a small issue with moving away from packrat, which is that you cannot include/version packages that are installed via Git. Some packages aren’t on [C/M]RAN, or you might want to use a development version of a package that has more features / is not broken in some way.

    You can’t do this with checkpoint.

    • Hey Joao
      I quickly looked into pacman. It has a lot of cool functions & features, but I don’t really see a solution for installing specific package versions (which is needed for reproducibility as defined here). You can specify a *minimal* package version, but not a *specific* one. Minimal package versions are better than no versions, but still a package can change slightly in the future and not be backwards compatible. Or did I miss something with pacman?

  2. Thanks for this great post. The more people read about this, the more likely change will happen.

    Regarding Docker, we’ve recently presented (http://o2r.info/2017/07/07/useR2017/) a package that should make things easier for “less technically inclined users”: containerit, see https://github.com/o2r-project/containerit/
    So, together with other packages, you could create, build and execute a Docker image using plain R functions. Would that change your assessment of Docker for your use case?

    • I unfortunately missed your talk but I read about it. Sounds promising, I will have a look into it. I recently stumbled upon an issue because of the wrong R version. This would probably be solved by your approach. Thanks for pointing me to it.
