Why data journalists should start using R in 2016

Fellow journalists and colleagues know my (sometimes almost comical) enthusiasm for the R project. In fact, for me, 2015 was the year of R: In April, I created Rddj.info, a collection of learning resources for R. Over the last few months it has grown gradually, and I hope that it will grow even faster in the course of 2016 (contributors are still welcome).

Also in 2015, I gave quite a few talks explaining the reasoning behind and the advantages of transparent and reproducible data journalism (#rddj), e.g. at the yearly conference of the German Netzwerk Recherche in Hamburg. And just last week I was informed that I’ve been accepted to talk about it at #NICAR16. Yay!

A GIF created solely in R for our story http://www.srf.ch/news/infografik/stadt-und-land-sind-politisch-in-festen-haenden

At SRF Data, the data journalism unit of Swiss public broadcasting and the place where I work, almost all of our larger projects in 2015 used R in some form or another. On election day, for example, we published countless quickly produced infographics and charts on Twitter that were very well received. With R, we were able to prepare the charts in advance and just had to fetch new results from our SRF-wide API as soon as they were available. Even tweeting the charts directly from within R would have been possible (we’ll do that during the next elections in 2019).

One of the charts published for election day, showing the party strength changes in all 26 provinces of Switzerland.

Last but not least, we set new standards in (European) data journalism by publishing most of our data and analyses on GitHub Pages, with the help of RMarkdown. Others have already started adopting our principles.

So, what advantages does R have? And why should (data) journalists finally start using it in 2016? 2015 (or probably already 2014 or 2013) was the year of listicles, so…

6 reasons why you should start using R in 2016 if you haven’t:

  1. R is good at almost everything, and certainly at 90% of the tasks we as (data) journalists encounter on a daily basis: getting data from a website, transposing a spreadsheet, combining multiple tables, converting JSON to CSV and vice versa, filtering and sorting data, drawing some exploratory plots, preparing data for further use in an interactive data visualization, creating GIFs, you name it. For all these tasks, there are separate, freely available tools: think of Excel, Google Refine, Datawrapper, Outwit Hub, ScraperWiki, etc. But R can do everything within one script. There are thousands of R packages for every imaginable task, and it is very unlikely that nobody has already done what you’re trying to do. The second advantage of R supporting the complete workflow is that you won’t have problems converting data from one tool to another. I mean, how would I get JSON into Excel? I don’t even know, I’d probably have to use an obscure online converter or something. And after I had finally managed to save my spreadsheet as CSV, all the special characters in it would look like hieroglyphs in my web visualization. And in between I would probably have to export and re-import my data from Excel to Refine and back and start all over again, and then I would have forgotten to transform an important column, and…
  2. R is free and open source. And it is available for all major platforms. And so is its most popular IDE, RStudio.
  3. R is easy to learn and lets you get started in 5 minutes. Not everyone will agree with me, but in order to complete simple analyses and data wrangling tasks, the only coding concept you need to know is the function call. In R, a lot of functions are vectorized, meaning they take vectors (sequences of numbers, for example) and return vectors – which liberates you from the hassle of applying or even understanding constructs like for-loops. The hardest thing for me was to get my head around the various data types R uses, and it still bugs me today. But I’d say that 90% of typical data journalism tasks can be done with easy-to-understand data frames and manipulations à la dplyr. And by the way: every piece of software has to be learned – the question is whether you spend your time getting to know countless tools or just one programming language.
  4. R is a language, not a tool. When talking to former fellow students or colleagues in journalism, this is the most often heard argument against using R. In fact, it’s R’s biggest asset, especially in an environment where methods, and not just results, matter. Which leads to the next point:
  5. R supports a transparent and reproducible workflow. With R, lying is difficult. Once you’re ready to publish your script, and not just your results, everyone will know what you did – and people will hopefully point out your methodological flaws or even errors (I recommend reading this blog post, sorry, notebook, if you are still not convinced). And that’s something that’s clearly missing in contemporary data journalism: Still too often, data wrangling tasks or analyses are done by a single person, sitting at a computer with thirty or more Excel tabs open, hardly having a clue which cells were transformed with which function, and in which order the steps were applied. Firstly, no one, neither internally nor externally, can double-check and understand what actually happened with the data, and secondly, working like this is extremely error-prone. Thirdly, what happens if there is a new dataset, or the old one has been updated? Even if you are able to figure out what you did in the first place, applying all the steps again is tedious and even more error-prone. R has answers to all these problems. Because everything is scripted, everything can be reconstructed and – most importantly – criticized. Of course, this also means that everything can be reproduced. Got new data? Start all over with just one command. There’s nothing more rewarding than getting a cup of coffee while your computer does the work somebody else would have spent hours on.
  6. R has an ever-growing community. According to http://githut.info/, R repos have the most new forks on average. A lot of those repos are probably created by people like Hadley Wickham, “a giant among data nerds”, who turn the cumbersome, weird language that R once was into something more accessible and more fun. Platforms like R-Bloggers host a myriad of helpful resources. And of course there’s Rddj.info (okay, enough of the shameless self-promotion).
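
To make reasons 1 and 3 a bit more concrete, here is a minimal sketch in base R (no extra packages). The data frame `results`, its column names, and all the numbers in it are made up purely for illustration; they are not real election results and not taken from any SRF Data script. The sketch shows a vectorized calculation, a filter, a sort, and a CSV export with an explicit encoding, all in one script.

```r
# Hypothetical example data: vote shares (in percent) of four fictional
# parties in two elections. All numbers are invented for illustration.
results <- data.frame(
  party      = c("A", "B", "C", "D"),
  share_2011 = c(26.6, 18.7, 15.1, 12.3),
  share_2015 = c(29.4, 18.8, 16.4, 11.6),
  stringsAsFactors = FALSE
)

# Vectorized arithmetic: the subtraction is applied to all rows at once,
# so no for-loop is needed.
results$change <- results$share_2015 - results$share_2011

# Keep only the parties that gained votes, then sort by the size of the gain.
winners <- results[results$change > 0, ]
winners <- winners[order(winners$change, decreasing = TRUE), ]

# Export to CSV with an explicit UTF-8 encoding, so that special characters
# survive the trip into a web visualization. The file goes to a temp folder.
out_file <- file.path(tempdir(), "winners.csv")
write.csv(winners, out_file, row.names = FALSE, fileEncoding = "UTF-8")

print(winners)
```

If the input data changes, re-running the script repeats every step in the same order, which is the reproducibility argument from reason 5 in miniature.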

So, get out and start using R in 2016. And follow me on Twitter for #rddj-related stuff.

Side note: In case you somehow doubt my enthusiasm… Yes, I have to admit there are things I am not really happy with, such as:

  1. RStudio seems to get successively slower each time I run R code (Ctrl+Alt+R), even when I consistently rm() my objects. In order for it to be “fast” again, I have to restart R or, even better, restart RStudio. Has anybody else observed that problem? I’m running it on a 64-bit Ubuntu VirtualBox machine with 8 GB of virtual RAM. Generally speaking, I think there’s still a lot of room for improvement in terms of performance and management of larger datasets (>1GB). Hadley has a whole chapter on that, though. And is working on it, too.
  2. JSON. In my experience, parsing JSON and especially transforming R structures into JSON can be a pain, although there are quite a few packages for it. I still end up using complicated, “old-school” R stuff such as lapply() too often.
  3. Grasping the difference between vectorized and “atomized” function calls – sometimes I don’t really know whether there is a vectorized version of an action I want to perform on a dataset, and I end up implementing loops (which always leaves me with some sort of guilty feeling). While I try to do as much as possible with dplyr (in my words, some kind of SQL for R), it can get really gnarly and difficult whenever some kind of non-standard, row-wise data manipulation such as regex-enabled search-and-replace has to be performed in a vectorized form. Look at this snippet, where I spent hours to get something like that done with dplyr:

    I won’t even bother trying to explain what that does (apart from the fact that I can’t remember). It is a toxic mix of something called “non-standard evaluation”, “standard evaluation”, absolutely bizarre functions such as paste() and random R weirdness.

    But it apparently does the job.
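
For the simpler cases of regex-enabled search-and-replace, base R's gsub() is already vectorized over the text it operates on, so a whole column can be cleaned in one call. The following is only a minimal sketch under that assumption and is not a reconstruction of the snippet mentioned above; the data frame `cantons`, its values, and the prefix pattern are hypothetical.

```r
# Hypothetical, made-up column with inconsistent prefixes and spacing.
cantons <- data.frame(
  name = c("Kt. Bern", "Kanton Uri", "Kt.  Genf"),
  stringsAsFactors = FALSE
)

# gsub() takes a whole character vector as its third argument and returns
# a vector of the same length, so no explicit loop is needed.
# The pattern removes a leading "Kt." or "Kanton" plus any following spaces.
cantons$clean <- gsub("^(Kt\\.|Kanton)[[:space:]]+", "", cantons$name)

print(cantons)
```

Because gsub() works on vectors, the same call should also be usable inside dplyr's mutate(); things tend to get gnarly only once the pattern or the replacement itself has to vary from row to row.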

9 Comments

  1. MTrost

    Good post.

    Especially since it gives me the opportunity to bring up one small quibble I had with your (otherwise great) election day graph when you published it back then: it would help if you used different colours for decreased and increased vote shares. Or the same colour, but slightly transparent for lower vote shares and with no transparency for higher vote shares.

    On small screens, i.e. smartphones, and with small changes it isn’t that easy to make out if it’s negative or positive. But it could well be that this additional information decreases the already info-rich graph’s readability.

    This being R, trying it out is easy (corroborating the point made by this post).

      • MTrost

        Thanks for the reply. This other graph shows the different outcomes across the cantons for one party much more clearly, but now comparisons between the parties are difficult because of the x-axis’ changing scale and the y-axis’ changing order.

        And you’re right, transparency wouldn’t do the trick either.

        Probably the best thing would be to ‘transpose’ the graph from the blog post, so that the cantons are aligned horizontally and the parties vertically. In my experience such panel graphs work better when you put the more numerous categories on the horizontal axis.

        It’s also easier to imagine and follow a horizontal line than a vertical one, which would help in telling positive from negative values by imagining the line at zero (only my subjective experience, don’t know any studies confirming this).

  2. Interesting post! One nitpick:

    Please don’t feel guilty about using for loops, because it is most often *not* the case that for loops are slower than *apply functions. In my opinion, this is probably the most common myth about the R language.

    See, e.g., the comments here: http://stackoverflow.com/questions/16486661/speed-up-r-loop

    Also see here: http://stackoverflow.com/a/7142982/560791

    In the latter link, Karl Broman discusses the main thing for making sure that for loops are not too slow:

    > Initialize new objects to full length before the loop, rather than increasing their size within the loop.

  3. Thank you for your encouraging post. I say ‘encouraging’ because the uptake of R in my location seems to be really slow. Maybe people are intimidated by the scripting. My encounter with R last year was both exhilarating and liberating, and I’m looking to work at spreading the news about this great tool (or is it language?) among my compatriots.

  4. Pingback: 为什么2016年数据新闻记者应该开始使用R – 数艺智训

  5. Pingback: 2 – Why data journalists (and everybody else) should start using R in 2016

  6. Great post mate!

    Regarding your gripe about rm():
    My understanding is that rm() removes the objects from R, but to clear the memory off of RAM, you’ll need to run gc(), garbage collection.

    Looking forward to reading more! Do you post your posts onto rbloggers?
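
The advice quoted in comment 2 (initialize objects to full length before the loop) can be illustrated with a minimal base R sketch. The vector names and the value of `n` are arbitrary choices for this example, and no timings are claimed here; how large the difference is will depend on the machine and the size of the data.

```r
n <- 10000

# Pattern to avoid: the vector grows by one element per iteration, which
# forces R to copy it into a new, larger vector again and again.
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Recommended pattern: allocate the full-length vector once, then fill it
# by index inside the loop.
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

# Where a vectorized form exists, it is usually the shortest to write.
vectorized <- seq_len(n)^2

# All three approaches produce the same values.
stopifnot(isTRUE(all.equal(grown, prealloc)))
stopifnot(isTRUE(all.equal(prealloc, vectorized)))
```

To compare the three variants on your own machine, each one can be wrapped in system.time().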
