In this post we show how to create highly aesthetic bivariate choropleth maps, including annotations and a custom legend – exclusively in R.
Here’s what we found in the Collection #1-5 password leaks
This was an investigation deep into the heart of the “Collection #1-5” password leaks that appeared on the web in early 2019. We showed that more than 3 million Swiss email addresses and – more disquietingly – over 20’000 email addresses of Swiss authorities and providers of critical infrastructure appear in the leaks.
Aside from the usual broadcast and online channels, we also released a short YouTube video for a younger audience that explains the dangers of using a weak password. For demonstration purposes, I gained access to the Instagram account of our host, Lena, within a couple of hours.
In this project, I used a so-called “big data technology”, Spark, for the first time. While we at SRF Data usually publish the source code for our data processing, I decided against doing so in this case. Instead, I wrote a blog post that explains the process and helps other journalists tackle similar “big data” problems.
Deep Fakes – indistinguishable from magic
Any sufficiently advanced technology is indistinguishable from magic. – Arthur C. Clarke
When so-called “deep fakes” popped up in late December on Reddit, they caused quite a stir. At SRF Data, about half a year later, we wanted to explain the technology with self-made examples. Without external help, we created a series of deep fake experiments ourselves, using an open-source AI framework. To our surprise, we could produce an astonishingly good fake, replacing the face of one of our most prominent news anchorwomen.
In a lengthy article, we then went into the nitty-gritty of how the technology behind deep fakes works – publishing probably the nerdiest article about deep fakes in mass media up to that date.
To explain the subject, we only used animated GIFs like the one above, and some demonstration videos.
Aside from the usual broadcast and online channels, we also released a short YouTube video for a younger audience that explains what deep fakes are – and what dangers there could be. To demonstrate this, we faked the face of a famous (straight) Swiss comedian into a gay porn video. While he had been involved from the beginning, you can see from his expression that he’s not very pleased once he sees the results.
The dark chamber of the Swiss prosecution system
If a Swiss prosecutor wants to put somebody in custody during an investigation or needs to employ surveillance methods, he or she needs the approval of a so-called “Zwangsmassnahmengericht”, a special court responsible for these “compulsory measures”, which may be imposed only in special circumstances.
These special courts were introduced in 2011. Since then, they have assessed countless applications by state prosecutors, but have remained largely opaque and secret – an actual “dark chamber”.
For the first time, my team thoroughly investigated the (often hidden) statistics behind the secret court orders. The resulting number is rather disquieting: 97 percent. That’s the share of applications that get accepted. In other words, state prosecutors almost always get a “thumbs up” for their invasive actions. In that sense, these rather new courts don’t really seem to be a barrier for law enforcement.
The research stirred up quite a political uproar – many parliamentarians, for example, assured us that they’d like to review the legislation and make some adjustments. Experts have also come up with the idea of a human rights attorney who should be heard during the decision-making process.
The Swiss criminal justice system employs an intransparent algorithm to assess its inmates
In this research we showed that the Swiss criminal justice system uses a simple but opaque algorithm to categorize inmates into three risk classes: A, B and C. C-class inmates in particular have to undergo more thorough screening and are under increased scrutiny, potentially impacting their right to parole. Our research showed that the algorithm, when applied in different cantons (= Swiss provinces), produced very different results: Sometimes, C-class inmates would be the majority, sometimes A-class, etc. The officials didn’t have an explanation for this and wanted to look into it.
Also, while the existence of the algorithm had previously been publicly known, its weights and exact mechanism had been a black box. Under our journalistic pressure, the responsible authority finally released the inner workings of the system for public scrutiny.
The Swiss police use a dubious risk assessment software
In this investigation, I showed that the Swiss police use dubious software to assess the risk of individuals being a “potential danger for the public”. While the use of the software had already been publicly known, I dug out some studies showing that the software actually does not perform very well. In fact, out of three people it deems “potentially dangerous”, two actually weren’t.
The Swiss police and other state actors use more and more automated & algorithmic systems that take the burden of hard decisions off of them. I think that now is the right time to lay a special focus on such systems and investigate their (hidden) biases.
This story came together with an interactive simulation that explained the trade-off between false positives and false negatives.
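The trade-off can be made concrete in a few lines of Python. This is only a minimal sketch and not the simulation published with the story: the scores, labels and thresholds are invented for illustration, and `confusion_counts` and `precision` are hypothetical helper names.

```python
# Hypothetical risk scores (0..1) and ground truth, invented for illustration only.
SCORES = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
LABELS = [True, False, False, True, False, False]  # True = actually dangerous


def confusion_counts(scores, labels, threshold):
    """Count outcomes when every score >= threshold is flagged as dangerous."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, dangerous in zip(scores, labels):
        flagged = score >= threshold
        if flagged and dangerous:
            counts["tp"] += 1  # correctly flagged
        elif flagged:
            counts["fp"] += 1  # flagged, but harmless (false positive)
        elif dangerous:
            counts["fn"] += 1  # missed (false negative)
        else:
            counts["tn"] += 1  # correctly left alone
    return counts


def precision(counts):
    """Share of flagged people who really were dangerous."""
    flagged = counts["tp"] + counts["fp"]
    if flagged == 0:
        raise ValueError("nobody was flagged, precision is undefined")
    return counts["tp"] / flagged


if __name__ == "__main__":
    # A stricter threshold flags fewer people; a looser one misses fewer.
    for threshold in (0.85, 0.6, 0.35):
        counts = confusion_counts(SCORES, LABELS, threshold)
        print(threshold, counts, round(precision(counts), 2))
```

With the middle threshold in this toy data, three people are flagged and only one of them is actually dangerous, which mirrors the one-in-three ratio mentioned above. Lowering the threshold further removes the false negative but keeps the false positives.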
This investigation won the 2019 Surveillance Studies prize for journalists.
(Big) Data Journalism with Spark and R
At Swiss Public TV and Radio (SRF) we recently published an investigation of the “Collection #1-5” password leaks. In this post, I show how I searched through 900GB+ of data with Spark and R.
Is e-voting in Switzerland secure?
A journalistic assessment of the e-voting vulnerability that the Chaos Computer Club Schweiz demonstrated.
Categorical spatial interpolation with R
How I used the kknn and ggplot2 packages together with some parallel computation to spatially interpolate several hundred thousand points.
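The basic idea of categorical interpolation with nearest neighbours can be sketched in plain Python. Note the assumptions: this is an unweighted majority vote with brute-force distance search, not the kernel-weighted approach of the kknn package, and the sample points, category labels and function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical sample points: ((x, y), category). Invented for illustration.
SAMPLES = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]


def knn_category(query, samples, k=3):
    """Return the most common category among the k samples nearest to query."""
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def interpolate_grid(samples, xs, ys, k=3):
    """Assign a category to every cell of a regular grid (rows follow ys)."""
    return [[knn_category((x, y), samples, k) for x in xs] for y in ys]


if __name__ == "__main__":
    print(knn_category((0.5, 0.5), SAMPLES))
    print(interpolate_grid(SAMPLES, xs=[0.0, 3.0, 6.0], ys=[0.0, 3.0, 6.0]))
```

Each grid cell is independent of the others, which is why this kind of interpolation lends itself to parallel computation once the number of points grows large.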
Identifying a large number of fake followers on Instagram
Being a so-called influencer is the dream job of the moment for a lot of young people. Getting a wealth of free products, or even bare cash, in exchange for an Instagram post is enticing, and the advertising industry seems to have discovered a new, effective form of approaching target groups.
However, accusations started appearing that the followers of many influencers are, in fact, fake. Could these accusations be true? There has never been a systematic study on the subject – nobody, neither in nor outside of Switzerland, has ever tried to thoroughly quantify the fake follower problem on Instagram.
So that’s what we at SRF Data did. We trained a machine learning model to automatically classify 7 million Instagram accounts regarding their “fakeness”. By doing so, we found out that roughly a third of these accounts, following 115 Swiss influencers, are indeed fake.
Some influencers had more than 50% fake followers, which raises questions about the integrity and authenticity of these follower bases. Consequently, the publication caused quite a stir in the influencer economy.
If you want to know more about the methodology behind this project, look at this making-of or at the original source code behind the analysis.
Pendlerland – A personalized view on Swiss commuting patterns
For some, Switzerland is one big city with high-speed trains and highways functioning as tram and bus lines – people commute from Bern to Zurich like they would from one district to another. Indeed, Switzerland has some of the highest commuter rates in the world, and there’s a plethora of statistics available.
At SRF Data, we took these data sets and tried to find a personal approach to them, beyond just reporting numbers with charts. We came up with an interactive and adaptive article that changes its content and layout based on the commuting journey the reader fills in.
Based on this, the reader is presented with a fully personalized view on the topic of commuting.
I not only preprocessed and analyzed the raw data with R, but also came up with some pretty neat GIF graphics that add some eye candy to the article.
These are basically a sequence of ggplot’s `geom_point` graphics.
My colleagues from swissinfo.ch translated the article into Russian, Chinese, Spanish and Japanese.
This is what happens when you use different package versions, Larry!
Recently I made the case for “true” RMarkdown reproducibility via checkpointed package versions. Shortly thereafter I learned the hard way how crucial it is to use exactly the same R packages that were used when the script was initially written.
A (truly) reproducible R workflow
For more than two years, I have been preaching reproducibility and transparency in data journalism. My tool of choice: R and reproducible reports with RMarkdown.
But these reports aren’t really reproducible. A solution.
Mapping urban sprawl in Switzerland
Urban sprawl is one of Switzerland’s (few) biggest environmental problems. Since 1985, the population has grown by more than 30 percent, and since then, an area the size of Lake Geneva has been plastered with concrete.
In our interactive explainer «Bauland» we present facts and figures regarding urban sprawl, but the core element is a feature where readers can choose their own municipality and then switch between different years to see how urban sprawl has changed its face. This visualization is based on a very detailed Swiss statistic, where every hectare (10k square meters) is surveyed every couple of years and classified into the categories forest (dark green), farmland (bright green), settlement (dark grey) and unproductive area such as glaciers (bright grey).
The project was nominated as one of three projects in the prestigious Swiss Press Online Award 2017.
Here’s how 670’000 people speak German
In April 2015, when I was still working at Tages-Anzeiger, we published a hugely successful dialect quiz. After a week or two, we had over 2 million unique visitors, also thanks to the co-publication by Spiegel Online.
The quiz predicted someone’s most likely cities of residence, and users could give feedback on that (see the form on the right side above).
Now comes the thing that stunned me the most: Over a third of all visitors actually filled that form out – we ended up with over 670’000 responses, i.e. people’s answers to the 25 questions and their self-proclaimed location of residence as WGS84 coordinates.
In the R statistical environment, I summarized these point data to hexagons and exported them to GeoJSON/TopoJSON. Now we had 25 different maps (for the 25 different initial questions, better: words) that showed the regional distribution of answers (better: pronunciations), based on the biggest online dialect survey ever conducted in Europe. We published these maps on the online presence of the Swiss Public Broadcast (SRF) as well as on tagesanzeiger.ch and spiegel.de.
Vested interests of Swiss universities
In our largest data-driven research so far we examined the vested interests of Swiss universities. We researched, among other things, more than 1000 secondary employments of professors and more than 300 sponsored professorships. The investigation resulted in publications in dozens of different radio and television programs of the Swiss Public Broadcast SRF.
The research launched a national debate on the independence of the Swiss Universities. Over the course of the following year, some universities have already implemented systems for more transparency. In the meantime, we were transparent ourselves and published our curated and tediously preprocessed database on GitHub.
The project was awarded the prestigious “Prix Média Newcomer” of the Swiss Academies of the Arts and Sciences.
As I outlined in this blog post, the statistical software environment R is becoming more and more popular among journalists.
However, finding an entry point to the R programming language is not that easy, especially for people without programming experience.
That’s why I built the continuously updated Rddj.info – a resource collection for learning how to do data journalism with R. It showcases a great variety of tutorials for every skill level and a lot of helpful, quick recipes.
Beautiful thematic maps with ggplot2 (only)
In this blog post, I explain step by step how I (eventually) achieved a nice thematic map with pure ggplot2 – from a very basic, useless, ugly, default map to the publication-ready and (in my opinion) highly aesthetic choropleth.
Implications of the new Swiss surveillance law
At the end of September 2016, the Swiss people accepted a new federal law that grants the intelligence agency new powers, e.g. the agency can now search Internet traffic that “leaves or enters” the country for suspicious keywords (similar to XKEYSCORE).
Questioning whether one can actually speak of an “outside” or “inside” when it comes to Internet traffic, we at SRF Data wanted to explain to the reader whether it is theoretically possible to be surveilled when browsing a Swiss website (even one physically hosted in Switzerland). It turns out that the large majority of requests to the top 180 Swiss websites “leave” Switzerland and are routed over Germany or France or even the US – and are thus subject to surveillance. In order to visualize this, we rebuilt a terminal that allows the reader to fire off “traceroute” requests (nerds galore yay!).
Switzerland’s dual-use exports
May 2015 to January 2018
Dual-use goods are goods that can be used for both civilian and military purposes. One example of these kinds of goods is so-called IMSI catchers, devices used to surveil mobile phones. In Switzerland, these goods are governed by special legislation – unlike in other countries, where they are treated as conventional arms exports. At SRF Data, we took the effort to parse and visualize the recently released data from the State Secretariat for Economic Affairs SECO.
Because our data processing workflow is fully reproducible, we can re-publish the visualization every year as soon as SECO releases new data. We already did that in 2015, 2016, 2017 and 2018. The data and methodology are freely available on GitHub, as with other stories published by SRF Data.
Technologies used: D3.js, DC.js.
«Sandalen im Schnee»
As one of the first projects at SRF Data we published a web documentary, called «Sandalen im Schnee» (sandals in the snow), about the asylum seeking process in Switzerland. The migrant crisis was at its peak and we wanted to explain and document the controversially discussed Swiss asylum seeking process – without resorting too much to the emotional side and without taking sides.
We interviewed six asylum seekers from different colors and enriched the explanatory parts (entry, registration, the long wait, decision stay/go, etc.) with their personal stories. The core element and common thread of the web documentary is – among video, interactive data visualizations and, of course, text – an interactive navigation that a) always tells readers where they are in the text and b) shows a simplified schema of the asylum seeking process in Switzerland.
The project was nominated for the German Reporter Prize 2015 and for the Grimme Online Award 2016.
«Sprachatlas» – an interactive dialect quiz for the German-speaking region
Inspired by the hugely successful New York Times dialect quiz, our team at Tages-Anzeiger, consisting of me and Marc Brupbacher, teamed up with language scientist Dr. Adrian Leemann (back then conducting research at the University of Zurich) to launch a similar application for the German-speaking region. Dr. Leemann provided the data and I was responsible for coding the frontend. I mainly used AngularJS, and LeafletJS for the map.
Our quiz was very similar to the NYT one: The user had to answer 25 questions on how certain words are pronounced in their region (I was baffled to learn that there exist more than 20 different ways of saying that you have the hiccups).
Based on the answers, a probabilistic model calculated the most likely cities of residence, and based on that, a kind of heatmap showed the likely region of residence.
On the result screen, users were also able to rate their personal prediction and to share the results via Twitter. The answers from the feedback form are now an invaluable data basis for new research conducted by Dr. Leemann.
Fortunately, we found a worthy partner in Spiegel Online for publishing the project. This allowed us to reach a huge audience: In the first 2-3 days, we had over 1.5 million unique visitors. To me, this project is a very good example of how journalists and scientists can launch awesome applications together and profit from each other – journalists get interesting and new data sets and scientists have a platform to publish their research which would otherwise remain in the academic domain.
Why data journalists should start using R in 2016
For me, 2015 was the year of R. The year I finally started to use R productively and on an almost daily basis (after years of learning and forgetting and learning all over again). In this post, I share my experiences and tell you why you should start using it for your next data journalism project in 2016.
The Spatial Distribution of Swiss Soccer Fans
Together with fellow data journalists Mario Stäuble, Patrice Siegrist and Julian Schmidli, I created this CartoDB-based map of the distribution of Swiss soccer fans while I was still working for Tages-Anzeiger. Find a detailed description and implementation details on the CartoDB-Map-of-the-Week-Blog.
This project won a prize in the category “data journalism” of the 16th European Newspaper Award (p. 44), together with another project we created at Tages-Anzeiger.
Back when I was working at Tages-Anzeiger, I was asked to find a way to condense the content of several hundred PDF files into one spreadsheet. These PDFs contained indicator variables about the performance of nursing and retirement homes, and for some strange reason, they were only available as individual PDFs. I took it as an opportunity to learn new features of Node.js and it turned out to be a really good solution. In this post, I explain what I came up with.
Tippinho – The Data Driven Soccer Prediction Game
About a month before the start of the 2014 soccer World Cup in Brazil, my colleague and fellow data journalist Julian Schmidli, with whom I worked at Tages-Anzeiger and now work at SRF Data, came up with the idea of creating an interactive soccer prediction game. Not an ordinary one, but one that would assist users in finding an optimized prediction for all encounters, based on the variables they would judge important for winning the tournament.
In the end, after the user has weighted each of the eight variables, such as ball possession, scored goals, etc., he or she is presented with a complete tournament simulation. As he or she alters the weights on the right-hand side of the screen, the tournament is dynamically recalculated.
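How such a weighted recalculation could work is sketched below in Python. This is an assumption-laden illustration, not the game's actual model: the team names, the two variables, their normalized values, the linear scoring rule and all function names are hypothetical.

```python
# Hypothetical, normalized team statistics (0..1). Invented for illustration.
STATS = {
    "A": {"ball_possession": 0.6, "scored_goals": 0.2},
    "B": {"ball_possession": 0.4, "scored_goals": 0.9},
    "C": {"ball_possession": 0.5, "scored_goals": 0.5},
    "D": {"ball_possession": 0.7, "scored_goals": 0.1},
}


def team_score(stats, weights):
    """Weighted sum of a team's variables (assumed linear scoring rule)."""
    return sum(weight * stats.get(name, 0.0) for name, weight in weights.items())


def predict_winner(team_a, team_b, all_stats, weights):
    """The team with the higher weighted score wins; ties go to team_a."""
    score_a = team_score(all_stats[team_a], weights)
    score_b = team_score(all_stats[team_b], weights)
    return team_a if score_a >= score_b else team_b


def simulate_knockout(teams, all_stats, weights):
    """Play single-elimination rounds; expects a power-of-two number of teams."""
    remaining = list(teams)
    while len(remaining) > 1:
        remaining = [
            predict_winner(remaining[i], remaining[i + 1], all_stats, weights)
            for i in range(0, len(remaining), 2)
        ]
    return remaining[0]


if __name__ == "__main__":
    bracket = ["A", "B", "C", "D"]
    # Changing the weights changes the predicted champion.
    print(simulate_knockout(bracket, STATS, {"ball_possession": 1.0, "scored_goals": 0.0}))
    print(simulate_knockout(bracket, STATS, {"ball_possession": 0.0, "scored_goals": 1.0}))
```

Because the whole bracket is a pure function of the weights, re-running it on every slider change is cheap, which is one plausible way to get the dynamic recalculation described above.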
The application was designed as “mobile first” and is perfectly playable on mobile devices, too. There, less information is shown in the tournament overview, and the weighting controls do not show accumulated scores, but current weights only.
This project won a prize in the category “data journalism” of the 16th European Newspaper Award (p. 44), together with another project we created at Tages-Anzeiger.
Truth and Beauty in Georeferenced Social Media
As georeferenced data from social media, be it in the form of Tweets, Foursquare check-ins, Instagram photos, Flickr pictures, etc., become increasingly available, (geospatial) analyses and visualizations based on them are becoming more and more popular. Often, such studies and applications claim to be able to infer social, cultural, and even political insights from these data, spatially fine-grained and referenced down to the level of countries and cities.
I haven’t seen a single one which actually succeeded in plausibly explaining the how to me.
Together with my friends from the SoMePolis project, I drafted, designed and implemented an interactive application that allows users to monitor and analyze the Twitter accounts of Swiss parliament members. This blog post sums up pretty well what is possible with the app – unfortunately it’s written in German.
In short, the app allows the user to rank parliament members based on their activity and interactivity on Twitter — two concepts which were operationalized with key figures such as the number of Tweets within a timespan and the number of interactions with other users. The goal was to provide a simple yet transparent and not too complex measure of how politicians perform on Twitter, i.e., exactly the opposite of what other services such as Klout provide — nontransparent, proprietary ranking algorithms.
The user interface supports common interactions such as sorting, filtering and searching, and all these actions automatically trigger a dynamic update of the visible data set.
One thing that I am quite proud of is the responsiveness of the app. This was not implemented with the help of a common framework such as Bootstrap.js and thus does not rely on CSS media queries (but probably could have). In fact, the window size is extracted and, based on this information, the corresponding data items are arranged dynamically. Naturally, each resize of the window triggers a re-arrangement if needed (try it out by resizing your browser window).
I again implemented this visualization with D3. In this case, it was particularly helpful for data filtering and sorting, but also for the animations that happen on page change and window resize.
Global Oil Production & Consumption Since 1965
Instead of writing yet another paper, I handed in this visualization for the LERU Bright 2013 Student Conference which will be held in August in Freiburg, Germany. This year’s conference topic is “Energy Transition in the 21st Century” and I am part of the “Dependencies” working group.
This “Atlas der Globalisierung”-inspired visualization, based on very recent data by BP, allows the reader to quickly grasp the temporal and spatial differences in oil consumption and production. On one hand, during certain periods of history, some nations consumed almost as much oil as the rest of the world together. On the other hand, the data of the last ten years show a growing divergence between consumption and production. After all, I hope this work makes clear that nations are heavily interdependent when it comes to oil – the main driver of our global economy.
Crafted with D3.js.
The visualization was later adapted by the Swiss daily newspaper “Neue Zürcher Zeitung”:
In my opinion, the team around Sylke Gruhnwald did a very good job in taking the essence out of my visualization and also in improving its usability. The corresponding time series are not shown on mouseover, but triggered by mouse clicks and touch events, which makes it easier for mobile device users to study a country’s oil consumption as well as production. I also prefer the adaptive “stacking” of all chart lines in the background, which does not necessarily give more information but is still very aesthetic. What is missing, in my opinion, is the bar chart that directly compares the huge differences in consumption and production between different countries. The small data tables at the lower right corner give an impression of this, but do not visually convey it.
If you are interested in coming projects, follow me on Twitter.
Swiss Votes Explorer
I implemented this interactive visualization for a university course about geovisualization. It allows the user to visually compare two variables (either the outcome of a vote or a socio-demographic factor, per district). Two maps are used to show the geographical distribution of the variables, and quartiles are computed to distribute the values into four classes.
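The quartile classification can be illustrated with Python's standard library. This is a minimal sketch under assumptions: the district names and values are invented, `quartile_classes` is a hypothetical helper, and the cut points come from the default method of `statistics.quantiles`, which may differ slightly from the method used in the original visualization.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical yes-vote shares (percent) per district. Invented for illustration.
YES_SHARE = {
    "D1": 10, "D2": 20, "D3": 30, "D4": 40,
    "D5": 50, "D6": 60, "D7": 70, "D8": 80,
}


def quartile_classes(values_by_district):
    """Map each district to a class 0..3 according to the quartile of its value."""
    # Three cut points split the sorted values into four classes.
    cuts = quantiles(values_by_district.values(), n=4)
    return {
        district: bisect_right(cuts, value)
        for district, value in values_by_district.items()
    }


if __name__ == "__main__":
    print(quartile_classes(YES_SHARE))
```

Each class index can then be mapped to one of four colors of a sequential palette, so both maps use the same number of classes regardless of the variable's unit.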
A scatterplot allows the user to contrast the two chosen variables and explore possible associations. Both views (map and scatterplot) are linked, so the user always knows which district he or she is dealing with.
On top of that, the user has the possibility to draw a rectangle around the points in the scatterplot and thus select a set of districts (“brushing”). After this, the corresponding districts are highlighted in the map. This helps, for instance, to quickly identify regions where certain variables have high or low values, etc.
In order to better compare differences between subsequent variables, a linear, visual transition of half a second is used. This helps to mitigate the problem of change blindness.
Crafted with D3.js, color palettes by Cynthia Brewer.
How to install SpatiaLite and SpatiaLite GUI on Ubuntu 12.04
SpatiaLite is an OpenGIS-enabled spatial extension to SQLite, similar to PostGIS. Unfortunately, the packaged version for Ubuntu, especially the GUI, is rather outdated. Therefore, this post shows how to compile and install the latest versions (4.0.0 and 1.6 respectively) and all needed dependencies from source.
When the mobile phone finally becomes a bug (Part 2)
In the second part of my series on the analysis of mobile phone location data, I put the Swiss providers to the test and discuss the data protection situation in Switzerland. May personal location data be resold at all? And why do anonymized location data not fall under the Data Protection Act?
Anonymous on Twitter?
Voicing your opinion anonymously on Twitter? Is that even possible? A reply to a tweet.
When the mobile phone finally becomes a bug (Part 1)
According to a press release, the Spanish telecommunications provider Telefónica wants to sell location data of mobile subscribers to advertising customers. It goes without saying that this is highly sensitive from a privacy perspective. In the first post of this two- or three-part series, I present the numerous technical possibilities for analyzing such location data and show how supposedly anonymous data can be traced back to individual users.
What if Facebook knows your shopping basket?
It recently became known that Facebook cooperates with third parties to analyze the purchasing behavior of users. In my first German-language post, I summarize the technical details and venture a prediction of what this would make possible.
Whats Up With WhatsApp? A Summary Of The Recent Security Flaws For The Ignorant User
Over the past few months, serious security flaws have been reported for WhatsApp. This blog post gives an updated, easy-to-read summary of how and why the app is vulnerable to certain attacks.
Stunning Examples of The Modern Graphical Web with SVG and WebGL
SVG and WebGL will revolutionize the graphical web, and this blog post shows some impressive examples of what is already happening.
Updated: How Simple It Is To Remove Facebook Ads In All Major Browsers
In this post I will demonstrate how to remove ads on Facebook in all major browsers, including Firefox, Chrome, Safari and Internet Explorer.