Recently I made the case for “true” RMarkdown reproducibility via checkpointed package versions. Shortly thereafter I learned the hard way how crucial it is to use exactly the same R packages that were used when the script was initially written.
At work, I was in the process of migrating previously published RMarkdown scripts to the rddj-template introduced in that post. I cloned each of our RMarkdowns to my machine and tried to execute them. Most of them are almost two years old, so I wasn’t surprised that some of them failed at some point. One script that ran through but still produced completely unexpected results was an analysis of Swiss MPs’ vested interests.
After running the script as it was in August 2015 from top to bottom, without errors, I executed git status on the command line and noticed that two output files had changed (a good argument for including script output in version control, too, by the way).
I then ran git diff in order to see what had actually been changed.
To my horror, I noticed that a lot of rows in those two CSVs had been deleted (not shifted, but erased). The resulting CSV now only had 127 rows instead of the expected 200 or so.
If I didn’t use git to version-control my output files, I would never have noticed, or I would have noticed too late.
So why is that? Keep in mind that I executed exactly the same code from two years ago, but with updated package versions. Specifically, I ran it with dplyr Version 0.5.0 (which isn’t even the latest version nowadays), but the original script was written with Version 0.4.2.
I think that dplyr might be the culprit here, more specifically, its inner_join function. Here’s some code from that script:
    mps_and_professions %<>%
      inner_join(professions_lt, by = c("profession_french" = "title.fr")) %>%
      select(-profession_french, -profession_german, -title.de, -title.en, -title.it, -title.ro) %>%
      rename(profession_id = id.y)
Somehow, running this with Version 0.4.2 retains all approx. 200 rows in the mps_and_professions dataframe. Running it with Version 0.5.0 shrinks it down to 126 or 127 rows. I tried to figure out the reason for this by looking at the release notes but couldn’t find a clue (maybe you can help me?).
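One way to narrow such a problem down is to list the rows whose join key has no counterpart in the lookup table, since an inner join drops exactly those rows. The following is only a sketch on invented toy data (the tables mps and professions and all their values are made up for illustration; only the column names profession_french and title.fr come from the script). It deliberately uses base R so that the check does not depend on the package version under suspicion. The last line prints the declared string encodings, which is merely one thing worth looking at and not a confirmed cause.

```r
# Toy stand-ins for the real tables; all values are invented for illustration
mps <- data.frame(
  name = c("MP 1", "MP 2", "MP 3"),
  profession_french = c("avocat", "agriculteur", "juriste"),
  stringsAsFactors = FALSE
)
professions <- data.frame(
  id = 1:2,
  title.fr = c("avocat", "juriste"),
  stringsAsFactors = FALSE
)

# Flag rows of mps whose key also appears in the lookup table;
# rows without a match are the ones an inner join would drop
has_match <- mps$profession_french %in% professions$title.fr
unmatched <- mps[!has_match, ]

message(sum(has_match), " of ", nrow(mps), " rows would survive the join")
print(unmatched)

# Assumption, not a diagnosis: differing declared encodings are one possible
# reason why seemingly identical strings fail to match
print(table(Encoding(mps$profession_french)))
```

Running the same check under both package setups and comparing the unmatched rows would show which keys are affected, even if it does not explain why.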
Using checkpoint as advocated in my rddj-template, and thus the same package versions as back then (actually as of August 1st, 2015), I can run the script and it produces exactly the same results as back then, i.e. git diff doesn’t show anything.
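Independently of checkpoint, a cheap additional safeguard is to let the script compare the installed package versions with the ones it was written for. This is a sketch of such a guard, not part of the rddj-template: check_versions is a hypothetical helper, and the only expected version listed is the dplyr 0.4.2 mentioned above.

```r
# Hypothetical helper: compare installed package versions with expected ones.
# Returns a character vector of mismatch messages (empty if everything matches).
check_versions <- function(expected) {
  mismatches <- character(0)
  for (pkg in names(expected)) {
    # packageVersion() raises an error if the package is not installed
    installed <- tryCatch(
      as.character(utils::packageVersion(pkg)),
      error = function(e) NA_character_
    )
    # Exact string comparison: "0.5" and "0.5.0" count as different
    if (is.na(installed) || installed != expected[[pkg]]) {
      mismatches <- c(
        mismatches,
        sprintf("%s: expected %s, found %s", pkg, expected[[pkg]], installed)
      )
    }
  }
  mismatches
}

# Version the original analysis was written with (taken from the text above)
problems <- check_versions(c(dplyr = "0.4.2"))
if (length(problems) > 0) {
  message("Package versions differ from the original run:\n",
          paste(problems, collapse = "\n"))
}
```

The guard only reports a mismatch; it does not install anything. Restoring the old versions is still the job of checkpoint.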
- Different package versions can introduce tremendous changes – using the same versions is crucial for true reproducibility
- These changes might go unnoticed if script output is not versioned
PS: If you want to reproduce the problem, execute the following commands:
- Clone the repository with git clone https://github.com/srfdata/2015-07-elections-parliament-lobbying.git
- Check out the script as of August 2015 with git checkout 111ee108
- Run the main.Rmd script (with package versions as of today) and look at the changed output with git diff
- Check out the script as of July 2017 with git checkout master and run it again. In this process, packages as of August 1st, 2015, are loaded via the checkpoint mechanism. Now, git diff should display nothing.