Fellow journalists and colleagues know my (sometimes almost comical) enthusiasm towards the R project. In fact, for me, 2015 was the year of R: In April, I created Rddj.info, a collection of learning resources for R. Over the last months it has grown gradually, and I hope that it will grow even faster in the course of 2016 (contributors are still welcome).
Also in 2015, I gave quite a few talks explaining the reasoning behind and the advantages of transparent and reproducible data journalism (#rddj), e.g. at the annual conference of the German Netzwerk Recherche in Hamburg. And just last week I was informed that I’ve been accepted to talk about it at #NICAR16. Yay!
At SRF Data – the data journalism unit of the Swiss public broadcaster, the place where I work – almost all of our larger projects in 2015 used R in some form or another. On election day, for example, we published countless quick infographics and charts on Twitter that met with great acclaim. With R, we were able to prepare the charts in advance and just had to fetch new results from our SRF-wide API as soon as they were available. Even tweeting the charts directly from within R would have been possible (we’ll do that during the next elections in 2019).
Last but not least we set new standards in (European) data journalism by publishing most of our data and analyses on GitHub pages, with the help of RMarkdown. Others have already started adopting our principles.
So, what advantages does R have? And why should (data) journalists finally start using it in 2016? 2015 (or probably already 2014 or 2013) was the year of listicles, so…
6 reasons why you should start using R in 2016 if you haven’t:
- R is good at almost everything. But certainly at 90% of the tasks we as (data) journalists encounter on a daily basis: Getting data from a website, transposing a spreadsheet, combining multiple tables, converting JSON to CSV and vice versa, filtering and sorting data, drawing some exploratory plots, preparing data for further use in an interactive data visualization, creating GIFs, you name it. For all these tasks, there are separate, freely available tools: Think of Excel, Google Refine, Datawrapper, Outwit Hub, ScraperWiki, etc. But R can do everything within one script. There are thousands of R packages for every imaginable task and it is very unlikely that nobody has done before what you’re trying to do. The second advantage of R being software that supports the complete workflow is that you won’t have problems converting data from one tool to another. I mean, how would I get JSON into Excel? I don’t even know, I’d probably have to use an obscure online converter or something. And after I had finally managed to save my spreadsheet as CSV, all the special characters in it would look like hieroglyphs in my web visualization. And in between I would probably have to export and re-import my data from Excel to Refine and back and start all over again and then I would have forgotten to transform an important column and and…
- R is free and open source. And it is available for all major platforms. And so is its most popular IDE, RStudio.
- R is easy to learn and lets you get started in 5 minutes. Not all people will agree with me, but in order to complete simple analyses and data wrangling tasks, the only coding concept you need to know is the function call. In R, a lot of functions are vectorized, meaning they take vectors (sequences of numbers, for example) and return vectors – which liberates you from the hassle of applying or even understanding constructs like for-loops. The hardest thing for me was to get my head around the various data types R uses, and it still bugs me today. But I’d say in 90% of the typical data journalism tasks, everything can be done with simple-to-understand data frames and manipulations à la dplyr. And by the way: Every piece of software needs to be learned – the question is whether you spend your time getting to know countless tools or just one programming language.
- R is a language, not a tool. When talking to former fellow students or colleagues in journalism, this is the most often heard argument against using R. In fact, it’s R’s biggest asset, especially in an environment where methods, and not just results, matter. Which leads to the next point:
- R supports a transparent and reproducible workflow. With R, lying is difficult. Once you’re ready to publish your script, and not just your results, everyone will know what you did – and people will hopefully point out your methodological flaws or even errors (I recommend reading this blog post, sorry, notebook, if you are still not convinced). And that’s something that’s clearly missing in contemporary data journalism: Still too often, data wrangling tasks or analyses are done by a single person, sitting at a computer with thirty or more Excel tabs open, hardly having a clue which cells were transformed with which function, and which exact steps were applied in which order. Firstly, no one, neither internally nor externally, can double-check and understand what actually happened with the data, and secondly, working like this is extremely error-prone. Thirdly, what happens if there is a new dataset, or the old one has been updated? Even if you are able to figure out what you did in the first place, applying all the steps again is tedious and even more prone to error. R has answers to all these problems. Because everything is scripted, everything can be reconstructed and – most importantly – criticized. Of course this also means that everything can be reproduced. Got new data? Start all over with just one command. There’s nothing more rewarding than getting a cup of coffee while your computer does the work somebody else would have spent hours on.
- R has an ever-growing community. According to http://githut.info/ R repos have the most new forks on average. A lot of those repos are probably created by people like Hadley Wickham – “a giant among data nerds” – who turn the cumbersome, weird language that R once was into something more accessible and more fun. Platforms like R-Bloggers host a myriad of helpful resources. And of course there’s Rddj.info (okay, enough of the shameless self promotion).
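To make the "one script" and "start all over with just one command" points a bit more concrete, here is a minimal sketch in base R. The two tiny tables, their column names and the helper `prepare_results()` are invented for illustration and are not taken from any SRF project; a real script would read files or query an API instead of using inline text.

```r
# Hypothetical example data: in a real script this would come from
# read.csv("some_file.csv") or an API call rather than inline text.
results_csv <- "canton,party,votes
ZH,A,1200
ZH,B,800
BE,A,500
BE,B,1500"

cantons_csv <- "canton,name
ZH,Zurich
BE,Bern"

# The whole analysis lives in one function, so it can be rerun
# unchanged whenever new data arrives.
prepare_results <- function(results, cantons) {
  # Combine the two tables on their shared key column.
  merged <- merge(results, cantons, by = "canton")
  # Vectorized arithmetic: the share is computed for all rows at once,
  # without a for-loop. ave() returns the per-canton total for each row.
  merged$share <- merged$votes / ave(merged$votes, merged$canton, FUN = sum)
  # Filter: keep only the strongest party in each canton.
  is_winner <- merged$votes == ave(merged$votes, merged$canton, FUN = max)
  winners <- merged[is_winner, ]
  # Sort by share, largest first.
  winners <- winners[order(-winners$share), ]
  rownames(winners) <- NULL
  winners
}

results <- read.csv(text = results_csv, stringsAsFactors = FALSE)
cantons <- read.csv(text = cantons_csv, stringsAsFactors = FALSE)

winners <- prepare_results(results, cantons)
print(winners)

# Export for further use, e.g. in a web visualization. Writing UTF-8
# explicitly avoids the "hieroglyphs" problem with special characters.
out_file <- tempfile(fileext = ".csv")
write.csv(winners, out_file, row.names = FALSE, fileEncoding = "UTF-8")
```

If this were saved as a script, rerunning the whole thing on updated data would be a single `source()` call or one `Rscript` invocation.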
So, get out and start using R in 2016. And follow me on Twitter for #rddj-related stuff.
Side note: In case you somehow doubt my enthusiasm… Yes, I have to admit there are things I am not really happy with, such as:
- RStudio seems to get successively slower each time I run R code (Ctrl+Alt+R), even when I consistently `rm()` my objects. In order for it to be “fast” again, I have to restart R or even better, restart RStudio. Has anybody else observed that problem? I’m running it on an Ubuntu 64-bit VirtualBox with 8 GB virtual RAM. Generally speaking, I think there’s still a lot of room for improvement in terms of performance and management of larger datasets (>1GB). Hadley has a whole chapter on that, though. And is working on it, too.
- JSON. In my experience, parsing JSON and especially transforming R structures into JSON can be a pain, although there exist quite a few packages for it. I still too often end up using complicated, “old-school” R stuff such as `lapply()`.
- Grasping the difference between vectorized and “atomized” function calls – sometimes I don’t really know whether there exists a vectorized version of an action I want to perform on a dataset and end up implementing loops (which always leaves me with some sort of guilty feeling). While I try to do as much as possible with dplyr (in my words some kind of SQL for R), it can get really gnarly and difficult whenever some kind of non-standard, row-wise data manipulation such as regex-enabled search-and-replace has to be performed in a vectorized form. Look at this snippet, which took me hours to get working with dplyr:
    library(dplyr)     # mutate_(), select()
    library(magrittr)  # %<>%
    library(lazyeval)  # interp()

    # `combined` and `new_orga_id` hold column names as strings
    direct_matches %<>%
      mutate_(.dots = setNames(
        list(interp(~ ifelse(!is.na(b), paste(a, b), a),
                    a = as.name(combined), b = as.name(new_orga_id))),
        combined))
    direct_matches %<>% select(-match(new_orga_id, names(.)))
    direct_matches %<>%
      mutate_(.dots = setNames(
        list(interp(~ as.numeric(sub("\\D*(\\d+).*", "\\1", a)),
                    a = as.name(combined))),
        combined))
I won’t even bother trying to explain what that does (apart from the fact that I can’t remember). It is a toxic mix of something called “non-standard evaluation”, “standard evaluation”, absolutely bizarre functions such as `paste()` and random R weirdness.
But it apparently does the job.
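For what it’s worth, the last step of that snippet (pulling the first run of digits out of a text column) can also be sketched in base R without any non-standard evaluation, because `sub()` is itself vectorized over its input. The example data, the column name `orga` and the helper `extract_first_number()` are invented for illustration. Treat it as an alternative sketch under those assumptions, not as a drop-in replacement for the dplyr code above.

```r
# Hypothetical stand-in for the real data. As in the snippet above,
# the column name is stored in a variable.
direct_matches <- data.frame(
  orga = c("ID 123 foo", "x45", "no digits"),
  stringsAsFactors = FALSE
)
combined <- "orga"

# sub() processes the whole vector at once, so no explicit loop is needed.
extract_first_number <- function(x) {
  digits <- sub("\\D*(\\d+).*", "\\1", x)
  # sub() leaves entries without a match unchanged, so entries that
  # contain no digit at all are set to NA explicitly.
  digits[!grepl("\\d", x)] <- NA
  as.numeric(digits)
}

# [[ ]] accepts a column name held in a string, which sidesteps the
# standard vs. non-standard evaluation question entirely.
direct_matches[[combined]] <- extract_first_number(direct_matches[[combined]])
print(direct_matches)
```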