At Swiss Public TV and Radio (SRF) we recently published an investigation of the “Collection #1-5” password leaks. In this post, I show how I searched through 900GB+ of data with Spark and R.
A journalistic assessment of the e-voting vulnerability demonstrated by the Chaos Computer Club Switzerland.
How I used the kknn and ggplot2 packages together with some parallel computation to spatially interpolate several hundred thousand points.
Being a so-called influencer is the dream job of the moment for a lot of young people. Getting a wealth of free products, or even bare cash, in exchange for an Instagram post is enticing, and the advertising industry seems to have discovered a new, effective form of approaching target groups.
However, accusations started appearing that the followers of many influencers are, in fact, fake. Could these accusations be true? There has never been a systematic study on the subject – nobody, neither in nor outside of Switzerland, has ever tried to thoroughly quantify the fake follower problem on Instagram.
So that’s what we at SRF Data did. We trained a machine learning model to automatically classify 7 million Instagram accounts regarding their “fakeness”. By doing so, we found out that roughly a third of these accounts, following 115 Swiss influencers, are indeed fake.
Some influencers had more than 50% fake followers, which raises questions about the integrity and authenticity of these follower bases. Consequently, the publication caused quite a stir in the Influencer economy.
For some, Switzerland is one big city with high-speed trains and highways functioning as tram and bus lines – people commute from Bern to Zurich like they would from a district to another. Indeed, Switzerland has some of the highest commuter rates in the world, and there’s a plethora of statistics available.
At SRF Data, we took these data sets and tried to find a personal approach to them, beyond just reporting numbers with charts. We came up with an interactive and adaptive article that changes its content and layout based on the commuting journey the reader enters.
Based on this, the reader is presented with a fully personalized view on the topic of commuting.
I not only preprocessed and analyzed the raw data with R, I also came up with some pretty neat GIF graphics that add some eye candy to the article.
These are basically a sequence of ggplot’s
Recently I made a point for “true” RMarkdown reproducibility via checkpointed package versions. Shortly thereafter I learned the hard way how crucial it is to use exactly the same R packages that were used when the script was initially written.
For more than two years, I have been preaching reproducibility and transparency in data journalism. My tool of choice: R and reproducible reports with RMarkdown.
But these reports aren’t really reproducible. A solution.
Urban sprawl is one of Switzerland’s (few) biggest environmental problems. Since 1985, the population has grown by more than 30 percent, and in that time an area the size of Lake Geneva has been plastered with concrete.
In our interactive explainer «Bauland» we present facts and figures regarding urban sprawl, but the core element is a feature where readers can choose their own municipality and then switch between different years to see how urban sprawl has changed its face. This visualization is based on a very detailed Swiss statistic, where every hectare (10,000 square meters) is surveyed every couple of years and classified into the categories forest (dark green), farmland (bright green), settlement (dark grey) and unproductive area such as glaciers (bright grey).
The project was nominated as one of three projects in the prestigious Swiss Press Online Award 2017.
In April 2015, when I was still working at Tages-Anzeiger, we published a hugely successful dialect quiz. After a week or two, we had over 2 million unique visitors, also thanks to the co-publication by Spiegel Online.
The quiz predicted someone’s most likely cities of residence, and users could give feedback on that (see the form on the right side above).
Now comes the thing that stunned me the most: Over a third of all visitors actually filled that form out – we ended up with over 670’000 responses, i.e. people’s answers to the 25 questions and their self-proclaimed location of residence as WGS84 coordinates.
In the R statistical environment, I summarized these point data to hexagons and exported them to GeoJSON/TopoJSON. Now we had 25 different maps (for the 25 different initial questions, better: words) that showed the regional distribution of answers (better: pronunciations), based on the biggest online dialect survey ever conducted in Europe. We published these maps on the online presence of the Swiss Public Broadcast (SRF) as well as on tagesanzeiger.ch and spiegel.de.
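The summarizing step described above was done in R. As a rough illustration only, here is a minimal Python sketch of hexagonal binning under some loud assumptions: coordinates are treated as planar (real WGS84 points would need projecting first), hexagons are pointy-top with a hypothetical `size` parameter, and each hexagon is labelled with its most frequent answer. The function names and the sample data are invented for this example and are not taken from the original analysis.

```python
import math
from collections import Counter, defaultdict


def hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon (cube rounding)."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the component with the largest rounding error so x + y + z == 0.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def point_to_hex(x, y, size):
    """Map a planar point to the axial (q, r) index of a pointy-top hexagon."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return hex_round(q, r)


def dominant_answer_per_hex(responses, size):
    """responses: iterable of (x, y, answer); returns {hexagon: most common answer}."""
    counts = defaultdict(Counter)
    for x, y, answer in responses:
        counts[point_to_hex(x, y, size)][answer] += 1
    return {cell: c.most_common(1)[0][0] for cell, c in counts.items()}


# Hypothetical responses: planar coordinates plus the chosen pronunciation.
responses = [
    (0.1, 0.1, "variant_a"),
    (-0.1, 0.05, "variant_a"),
    (0.0, -0.1, "variant_b"),
    (1.8, 0.1, "variant_b"),
]
result = dominant_answer_per_hex(responses, size=1.0)
```

Each hexagon could then be exported as a polygon with its dominant answer as a property, which is roughly what a GeoJSON/TopoJSON export would carry.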
In our largest data-driven research so far we examined the vested interests of Swiss universities. We researched, among other things, more than 1000 secondary employments of professors and more than 300 sponsored professorships. The investigation resulted in publications in dozens of different radio and television programs of the Swiss Public Broadcast SRF.
The research launched a national debate on the independence of the Swiss Universities. Over the course of the following year, some universities have already implemented systems for more transparency. In the meantime, we were transparent ourselves and published our curated and tediously preprocessed database on GitHub.
The project was awarded the prestigious “Prix Média Newcomer” of the Swiss Academies of the Arts and Sciences.
As I outlined in this blog post, the statistical software environment R is becoming more and more popular among journalists.
However, finding an entry point to the R programming language is not that easy, especially for people without programming experience.
That’s why I built the continuously updated Rddj.info – a resource collection for learning how to do data journalism with R. It showcases a great variety of tutorials for every skill level and a lot of helpful, quick recipes.
In this blog post, I explain step by step how I (eventually) achieved a nice thematic map with pure ggplot2 – from a very basic, useless, ugly, default map to the publication-ready and (in my opinion) highly aesthetic choropleth.
End of September 2016, the Swiss people accepted a new federal law that grants the intelligence agency new competences, e.g. the agency can now search Internet traffic that “leaves or enters” the country for suspicious keywords (similar to XKEYSCORE).
Questioning whether one can actually speak of an “outside” or “inside” when it comes to Internet traffic, we at SRF Data wanted to explain to the reader whether it is theoretically possible to be surveilled when browsing a Swiss website (even one physically hosted in Switzerland). It turns out that the large majority of requests to the top 180 Swiss websites “leave” Switzerland and are routed over Germany or France or even the US – and are thus subject to surveillance. In order to visualize this, we rebuilt a terminal that allows the reader to fire up “traceroute” requests (nerds galore yay!).
May 2015 to January 2018
Dual-use goods are goods that can be used for civil and military purposes. One example of these kinds of goods are so-called IMSI catchers, devices used to surveil mobile phones. In Switzerland, these goods are governed by special legislation – unlike in other countries, where they are treated as conventional arms exports. At SRF Data, we made the effort to parse and visualize the recently released data from the State Secretariat for Economic Affairs SECO.
Because our data processing workflow is fully reproducible, we can re-publish the vis again and again as soon as new data are available. One of the cool things about this project is that it can be updated every year once the SECO releases new data. We already did that in 2015, 2016, 2017 and 2018. The data and methodology are freely available on GitHub, as with other stories published by SRF Data.
Technologies used: D3.js, DC.js.
As one of the first projects at SRF Data we published a web documentary, called «Sandalen im Schnee» (sandals in the snow), about the asylum seeking process in Switzerland. The migrant crisis was at its peak and we wanted to explain and document the controversially discussed Swiss asylum seeking process – without resorting too much to the emotional side and without taking sides.
We interviewed six asylum seekers from different colors and enriched the explanatory parts (entry, registration, the long wait, decision stay/go, etc.) with their personal stories. The core element and common thread of the web documentary is – among video, interactive data visualizations and, of course, text – an interactive navigation that a) always tells readers where they are in the text and b) shows a simplified schema of the asylum seeking process in Switzerland.
The project was nominated for the German Reporter Prize 2015 and for the Grimme Online Award 2016.
Inspired by the hugely successful New York Times dialect quiz, our team at Tages-Anzeiger, consisting of me and Marc Brupbacher, teamed up with language scientist Dr. Adrian Leemann (back then conducting research at the University of Zurich) to launch a similar application for the German-speaking region. Dr. Leemann provided the data and I was responsible for coding the frontend. I mainly used AngularJS, and LeafletJS for the map.
Our quiz was very similar to the NYT one: The user had to answer 25 questions on how certain words are pronounced in their region (I was baffled to learn that there exist more than 20 different ways of saying that you have the hiccups).
Based on the answers, a probabilistic model calculated the most likely cities of residence, and based on that, a kind of heatmap showed the likely region of residence.
On the result screen, the user was also able to rate his personal prediction and to share the results via Twitter. The answers from the feedback form are now an invaluable data basis for new research conducted by Dr. Leemann.
Fortunately, we found a worthy partner in Spiegel Online for publishing the project. This allowed us to reach a huge audience: In the first 2-3 days, we had over 1.5 million unique visitors. To me, this project is a very good example of how journalists and scientists can launch awesome applications together and profit from each other – journalists get interesting and new data sets and scientists have a platform to publish their research which would otherwise remain in the academic domain.
For me, 2015 was the year of R. The year I finally started to use R productively and on an almost daily basis (after years of learning and forgetting and learning all over again). In this post, I share my experiences and tell you why you should start using it for your next data journalism project in 2016.
Together with fellow data journalists Mario Stäuble, Patrice Siegrist and Julian Schmidli, I realized this CartoDB-based map of the distribution of Swiss soccer fans while I was still working for Tages-Anzeiger. Find a detailed description and implementation details on the CartoDB-Map-of-the-Week-Blog.
This project won a prize in the category “data journalism” of the 16th European Newspaper Award (p. 44), together with another project we realized at Tages-Anzeiger.
Back when I was working at Tages-Anzeiger, I was asked to find a way to condense the content of several hundred PDF files into one spreadsheet. These PDFs contained indicator variables about the performance of nursing and retirement homes, and for some strange reason, they were only available as individual PDFs. I took it as an opportunity to learn new features of Node.js and it turned out to be a really good solution. In this post, I explain what I came up with.
About a month before the start of the 2014 soccer world cup in Brazil, my colleague and fellow data journalist Julian Schmidli, with whom I worked together at Tages-Anzeiger and now at SRF Data, came up with the idea of creating an interactive soccer prediction game. Not an ordinary one, but one that would assist users in finding an optimized prediction for all encounters, based on the variables they would judge important for winning the tournament.
In the end, after the user has weighted each of the eight variables such as ball possession, scored goals, etc., they are presented with a complete tournament simulation. As they alter the weights on the right-hand side of the screen, the tournament is dynamically recalculated.
The application was designed as “mobile first” and is perfectly playable on mobile devices, too. There, less information is shown in the tournament overview, and the weighting controls do not show accumulated scores, but current weights only.
This project won a prize in the category “data journalism” of the 16th European Newspaper Award (p. 44), together with another project we realized at Tages-Anzeiger.
As georeferenced data from social media, be it in the form of Tweets, Foursquare Check-Ins, Instagram photos, Flickr pictures, etc., become increasingly available, (geospatial) analyses and visualizations done with them are becoming more and more popular. Often, such studies and applications claim to be able to infer social, cultural, and even political insights from these data, spatially fine-grained and referenced down to the level of countries and cities.
I haven’t seen a single one which actually succeeded in plausibly explaining the how to me.
Together with my friends from the SoMePolis project, I drafted, designed and implemented an interactive application which allows users to monitor and analyze the Twitter accounts of Swiss parliament members. This blog post sums up pretty well what is possible with the app; unfortunately, it is written in German.
In short, the app allows the user to rank parliament members based on their activity and interactivity on Twitter — two concepts which were operationalized with key figures such as the number of Tweets within a timespan and the number of interactions with other users. The goal was to provide a simple yet transparent and not too complex measure of how politicians perform on Twitter, i.e., exactly the opposite of what other services such as Klout provide — nontransparent, proprietary ranking algorithms.
The user interface supports common interactions such as sorting, filtering and searching, and all these actions automatically trigger a dynamic update of the visible data set.
One thing that I am quite proud of is the responsiveness of the app. This was not implemented with the help of a common framework such as Bootstrap.js and thus does not rely on CSS media queries (but probably could have). In fact, the window size is extracted and, based on this information, the corresponding data items are arranged dynamically. Naturally, each resize of the window triggers a re-arrangement if needed (try it out by resizing your browser window).
I again implemented this visualization with D3. In this case, it was particularly helpful for data filtering and sorting, but also for the animations that happen on page change and window resize.
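To make the window-size-based arrangement described above more concrete, here is a minimal sketch in Python rather than the D3/JavaScript used in the app. It assumes a simple grid layout; the `arrange` function, the item dimensions and the gap are hypothetical and only illustrate the idea of deriving positions from the window width instead of from CSS media queries.

```python
def arrange(n_items, window_width, item_width=220, item_height=120, gap=10):
    """Return the top-left (x, y) pixel position of each item for a window width."""
    # As many columns as fit side by side, but always at least one.
    columns = max(1, (window_width + gap) // (item_width + gap))
    step_x, step_y = item_width + gap, item_height + gap
    return [((i % columns) * step_x, (i // columns) * step_y) for i in range(n_items)]


# Recomputed on every resize: a wide window yields several columns,
# a narrow one collapses to a single column.
wide = arrange(5, window_width=700)
narrow = arrange(3, window_width=200)
```

In a browser, the same calculation would run in a resize handler, with the new positions handed to the animation layer.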
Instead of writing yet another paper, I handed in this visualization for the LERU Bright 2013 Student Conference which will be held in August in Freiburg, Germany. This year’s conference topic is “Energy Transition in the 21st Century” and I am part of the “Dependencies” working group.
This “Atlas der Globalisierung”-inspired visualization, based on very recent data by BP, allows the reader to quickly grasp the temporal and spatial differences in oil consumption and production. On one hand, during certain periods of history, some nations consumed almost as much oil as the rest of the world together. On the other hand, the data of the last ten years show a growing divergence between consumption and production. After all, I hope this work makes clear that nations are heavily interdependent when it comes to oil – the main driver of our global economy.
Crafted with D3.js.
The visualization was later adapted by the Swiss daily newspaper “Neue Zürcher Zeitung”:
In my opinion, the team around Sylke Gruhnwald did a very good job in taking the essence out of my visualization and also in improving its usability. The corresponding time series are not shown on mouseover, but triggered by mouse clicks and touch events, which makes it easier for mobile device users to study a country’s oil consumption as well as production. I also prefer the adaptive “stacking” of all chart lines in the background, which does not necessarily give more information but is still very aesthetic. What is missing, in my opinion, is the bar chart that directly compares the huge differences in consumption and production between different countries. The small data tables at the lower right corner give an impression of this, but do not visually convey it.
If you are interested in coming projects, follow me on Twitter.
I implemented this interactive visualization for a university course about geovisualization. It allows the user to visually compare two variables (either the outcome of a vote or a socio-demographic factor, per district). Two maps are used to show the geographical distribution of the variables, and quartiles are computed to distribute the values into four classes.
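The quartile classification mentioned above can be sketched as follows. This is an illustrative Python example, not the code behind the visualization: the quartile method (`inclusive`), the rule that a value equal to a break falls into the lower class, and the sample values are all assumptions.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_classes(values):
    """Assign each value to one of four classes (0 = lowest, 3 = highest)."""
    # Three break points split the data into four classes.
    breaks = quantiles(values, n=4, method="inclusive")
    # bisect_left places a value equal to a break into the lower class.
    return [bisect_left(breaks, v) for v in values]


# Hypothetical yes-vote shares (in percent) for eight districts.
yes_share = [31.2, 44.8, 52.1, 47.5, 63.9, 38.4, 55.0, 49.3]
classes = quartile_classes(yes_share)
```

Each class index can then be mapped to one of four colors in the choropleth map.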
A scatterplot allows the user to contrast the two chosen variables and explore possible associations. Both views (map and scatterplot) are linked, so the user always knows which district he or she is dealing with.
On top of that, the user has the possibility to draw a rectangle around the points in the scatterplot and thus select a set of districts (“brushing”). After this, the corresponding districts are highlighted in the map. This helps, for instance, to quickly identify regions where certain variables have high or low values, etc.
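The brushing interaction described above boils down to a point-in-rectangle test. The following Python sketch shows that logic only, independent of any drawing code; the `District` type, the `brush` function and the sample values are hypothetical.

```python
from typing import NamedTuple


class District(NamedTuple):
    name: str
    x: float  # value of the first variable (scatterplot x axis)
    y: float  # value of the second variable (scatterplot y axis)


def brush(districts, x0, y0, x1, y1):
    """Return the names of all districts inside the dragged rectangle."""
    # The user may drag in any direction, so normalize the corners first.
    xmin, xmax = sorted((x0, x1))
    ymin, ymax = sorted((y0, y1))
    return {d.name for d in districts if xmin <= d.x <= xmax and ymin <= d.y <= ymax}


districts = [
    District("A", 42.0, 18.5),
    District("B", 55.5, 30.1),
    District("C", 61.0, 12.0),
]
# Rectangle dragged from the lower right to the upper left corner.
selected = brush(districts, 60.0, 10.0, 40.0, 35.0)
```

The resulting set of names is what the linked map view would use to highlight the matching districts.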
In order to better compare differences between subsequent variables, a linear, visual transition of half a second is used. This helps to mitigate the problem of change blindness.
SpatiaLite is an OpenGIS-enabled spatial extension to SQLite, similar to PostGIS. Unfortunately, the packaged version for Ubuntu, especially the GUI, is rather outdated. Therefore, this post shows how to compile and install the latest versions (4.0.0 and 1.6 respectively) and all needed dependencies from source.
In the second part of my series on the analysis of mobile phone location data, I put the Swiss providers through their paces and look at the data protection situation in Switzerland. May personal location data be resold at all? And why does anonymized location data not fall under the Data Protection Act?
Voicing your opinion anonymously on Twitter? Is that even possible? A reply to a tweet.
According to a press release, the Spanish telecommunications provider Telefónica wants to sell the location data of mobile subscribers to advertising customers. There is no question that this is extremely sensitive from a privacy perspective. In the first post of this three-part series, I present the numerous technical possibilities for analyzing such location data and show how supposedly anonymous data can be traced back to individual users.
It recently became known that Facebook is cooperating with third parties to analyze the purchasing behavior of its users. In my first German-language post, I summarize the technical details and venture a forecast of what this would make possible.
Over the past few months, tremendous security leaks have been reported for WhatsApp. This blog post gives an updated, easy-to-read summary of how and why the app is vulnerable to certain attacks.
SVG and WebGL will revolutionize the graphical web, and this blog post shows some impressive examples of what is already happening.
In this post I will demonstrate how to remove ads on Facebook in all major browsers, including Firefox, Chrome, Safari and Internet Explorer.