This was an investigation deep into the heart of the “Collection #1-5” password leaks that appeared on the web in early 2019. We showed that more than 3 million Swiss email addresses and – more disquietingly – over 20’000 email addresses of Swiss authorities and providers of critical infrastructure appear in the leak.
Aside from the usual broadcast and online channels, we also released a short YouTube video for a younger audience that explains the dangers of using a weak password. For demonstration purposes, I gained access to the Instagram account of our host, Lena, within a couple of hours.
In this project, I used a so-called “big data technology”, Spark, for the first time. While we at SRF Data usually publish the source code for our data processing, I decided against doing so in this case. Instead, I wrote a blog post that explains the process and helps other journalists tackle similar “big data” problems.
Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke
When so-called “deep fakes” popped up in late December on Reddit, they caused quite a stir. At SRF Data, about half a year later, we wanted to explain the technology with self-made examples. Without external help, we created a series of deep fake experiments ourselves, using an open-source AI framework. To our surprise, we could produce an astonishingly good fake, replacing the face of one of our most prominent news anchorwomen.
In a lengthy article, we then went into the nitty-gritty of how the technology behind deep fakes works – publishing probably the nerdiest article about deep fakes in mass media up to that date.
To explain the subject, we only used animated GIFs like the one above, and some demonstration videos.
Aside from the usual broadcast and online channels, we also released a short YouTube video for a younger audience that explains what deep fakes are – and what dangers there could be. To demonstrate this, we faked the face of a famous (straight) Swiss comedian into a gay porn video. While he had been involved from the beginning, you can see from his expression that he’s not very pleased once he sees the results.
If a Swiss prosecutor wants to put somebody in custody during an investigation or needs to employ surveillance methods, he or she needs to get the okay of a so-called “Zwangsmassnahmengericht”, a special court responsible for these “compulsory measures”, which may be applied only in special circumstances.
These special courts were established in 2011. Since then, they have assessed countless applications by state prosecutors, but have remained largely opaque and secret – an actual “dark chamber”.
For the first time, my team has thoroughly investigated the (often hidden) statistics behind the secret court orders. The resulting number is rather disquieting: 97 percent. That’s the share of applications that get accepted. In other words, state prosecutors almost always get a “thumbs up” for their invasive actions. In that sense, these rather new courts don’t really seem to be a barrier for law enforcement.
The research has stirred up quite some political uproar – many parliamentarians, for example, have assured us that they’d like to review the legislation and make some adjustments. Also, experts have come up with the idea of an attorney for human rights who should be heard during the decision-making process.
In this research we showed that the Swiss criminal justice system uses a simple but opaque algorithm to categorize inmates into three risk classes: A, B and C. C-class inmates in particular have to undergo more thorough screening and are under increased scrutiny, potentially impacting their right to parole. Our research showed that the algorithm, when applied in different cantons (= Swiss provinces), produced very different results: sometimes C-class inmates would be the majority, sometimes A-class, etc. The officials didn’t have an explanation for this and wanted to look into it.
Also, while the algorithm had previously been publicly known, its weights and exact mechanism had been a black box. Under our journalistic pressure the responsible authority finally released the inner workings of the system for public scrutiny.
In this investigation, I showed that the Swiss police use dubious software to assess the risk of individuals being a “potential danger for the public”. While the use of the software had already been publicly known, I exclusively dug out some studies showing that the software does not actually perform very well. In fact, out of three people it deemed “potentially dangerous”, two actually weren’t.
The Swiss police and other state actors use more and more automated & algorithmic systems that take the burden of hard decisions off of them. I think that now is the right time to lay a special focus on such systems and investigate their (hidden) biases.
This story came together with an interactive simulation that explained the trade-off between false positives and false negatives.
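The trade-off can be illustrated with a few lines of Python. This is only a sketch with made-up risk scores and labels, not the software the police use and not the code behind our simulation; the function names and the thresholds are assumptions chosen for illustration. Lowering the threshold catches more truly dangerous people (fewer false negatives) but flags more harmless ones (more false positives).

```python
def confusion_counts(scores, is_dangerous, threshold):
    """Count outcomes when everyone scoring at or above the threshold is flagged."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, truth in zip(scores, is_dangerous):
        flagged = score >= threshold
        if flagged and truth:
            counts["tp"] += 1   # flagged and actually dangerous
        elif flagged and not truth:
            counts["fp"] += 1   # flagged but harmless (false positive)
        elif not flagged and truth:
            counts["fn"] += 1   # missed a dangerous person (false negative)
        else:
            counts["tn"] += 1   # correctly left alone
    return counts


def precision(counts):
    """Share of flagged people who really are dangerous."""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else 0.0


# Hypothetical data: eight people, two of whom are actually dangerous.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_dangerous = [True, False, False, True, False, False, False, False]

strict = confusion_counts(scores, is_dangerous, threshold=0.85)
medium = confusion_counts(scores, is_dangerous, threshold=0.65)
lenient = confusion_counts(scores, is_dangerous, threshold=0.5)

for name, counts in [("strict", strict), ("medium", medium), ("lenient", lenient)]:
    print(name, counts, round(precision(counts), 2))
```

With the medium threshold in this toy data set, three people are flagged and two of them are harmless, which mirrors the "two out of three" situation described above.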
Being a so-called influencer is the dream job of the moment for a lot of young people. Getting a wealth of free products, or even hard cash, in exchange for an Instagram post is enticing, and the advertising industry seems to have discovered a new, effective form of approaching target groups.
However, accusations started appearing that the followers of many influencers are, in fact, fake. Could these accusations be true? There has never been a systematic study on the subject – nobody, neither inside nor outside of Switzerland, has ever tried to thoroughly quantify the fake follower problem on Instagram.
So that’s what we at SRF Data did. We trained a machine learning model to automatically classify 7 million Instagram accounts regarding their “fakeness”. By doing so, we found out that roughly a third of these accounts, following 115 Swiss influencers, are indeed fake.
Some influencers had more than 50% fake followers, which raises questions about the integrity and authenticity of these follower bases. Consequently, the publication caused quite a stir in the influencer economy.
For some, Switzerland is one big city with high-speed trains and highways functioning as tram and bus lines – people commute from Bern to Zurich like they would from one district to another. Indeed, Switzerland has some of the highest commuter rates in the world, and there’s a plethora of statistics available.
At SRF Data, we took these data sets and tried to find a personal approach to them, beyond just reporting numbers with charts. We came up with an interactive and adaptive article that changes its content and layout based on the commuting journey the reader enters.
Based on this, the reader is presented with a fully personalized view on the topic of commuting.
I not only preprocessed and analyzed the raw data with R, but also came up with some pretty neat GIF graphics that add some eye candy to the article.
These are basically a sequence of ggplot’s `geom_point` graphics.
Urban sprawl is one of Switzerland’s (few) major environmental problems. Since 1985, the population has grown by more than 30 percent, and since then, an area the size of Lake Geneva has been plastered with concrete.
The interactive feature that lets you have a look at your own municipality
In our interactive explainer «Bauland» we present facts and figures regarding urban sprawl, but the core element is a feature where readers can choose their own municipality and then switch between different years to see how urban sprawl has changed its face. This visualization is based on a very detailed Swiss statistic, where every hectare (10k square meters) is surveyed every couple of years and classified into the categories forest (dark green), farmland (bright green), settlement (dark grey) and unproductive area such as glaciers (bright grey).
In April 2015, when I was still working at Tages-Anzeiger, we published a hugely successful dialect quiz. After a week or two, we had over 2 million unique visitors, also thanks to the co-publication by Spiegel Online.
The resulting prediction
The quiz predicted someone’s most likely cities of residence, and users could give feedback on that (see the form on the right side above).
Now comes the thing that stunned me the most: Over a third of all visitors actually filled that form out – we ended up with over 670’000 responses, i.e. people’s answers to the 25 questions and their self-proclaimed location of residence as WGS84 coordinates.
In the R statistical environment, I summarized these point data to hexagons and exported them to GeoJSON/TopoJSON. Now we had 25 different maps (for the 25 different initial questions, better: words) that showed the regional distribution of answers (better: pronunciations), based on the biggest online dialect survey ever conducted in Europe. We published these maps on the online presence of the Swiss Public Broadcast (SRF) as well as on tagesanzeiger.ch and spiegel.de.
One of the resulting maps, showing the distribution of pronunciations for the phrase “quarter past 10” in German-speaking Europe
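The step of summarizing point data into hexagons can be sketched as follows. This is a Python illustration, not the R code actually used for the maps, and it assumes the WGS84 coordinates have already been projected to planar x/y values; the function names, the hexagon size and the sample points are all made up. Each point is assigned to a pointy-top hexagon via axial coordinates, and the points per hexagon are counted.

```python
import math
from collections import Counter


def round_axial(q, r):
    """Round fractional axial hex coordinates to the nearest hexagon cell."""
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    # Recompute the coordinate with the largest rounding error so that
    # the cube coordinates of the cell still sum to zero.
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def hex_cell(x, y, size):
    """Map a planar point to the axial (q, r) index of a pointy-top hexagon."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return round_axial(q, r)


def bin_points(points, size):
    """Count how many points fall into each hexagon."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Hypothetical, already projected coordinates of three responses.
points = [(0.1, 0.1), (-0.2, 0.05), (1.7, 0.0)]
counts = bin_points(points, size=1.0)
print(counts)
```

For a dialect map, one would count per hexagon and per answer instead of per hexagon only, and then color each hexagon by its most frequent answer.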
In our largest data-driven research so far we examined the vested interests of Swiss universities. We researched, among other things, more than 1000 secondary employments of professors and more than 300 sponsored professorships. The investigation resulted in publications in dozens of different radio and television programs of the Swiss Public Broadcast SRF.
As I outlined in this blog post, the statistical software environment R is becoming more and more popular among journalists.
However, finding an entry point to the R programming language is not that easy, especially for people without programming experience.
That’s why I built the continuously updated Rddj.info – a resource collection for learning how to do data journalism with R. It showcases a great variety of tutorials for every skill level and a lot of helpful, quick recipes.
At the end of September 2016, the Swiss people accepted a new federal law that grants the intelligence agency new powers, e.g. the agency can now search Internet traffic that “leaves or enters” the country for suspicious keywords (similar to XKEYSCORE).
Questioning whether one can actually speak of an “outside” or “inside” when it comes to Internet traffic, we at SRF Data wanted to explain to the reader whether it is theoretically possible to be surveilled when browsing a Swiss website (even one physically hosted in Switzerland). It turns out that the large majority of requests to the top 180 Swiss websites “leave” Switzerland and are routed over Germany or France or even the US – and are thus subject to surveillance. In order to visualize this, we rebuilt a terminal that allows the reader to fire off “traceroute” requests (nerds galore yay!).
Dual-use goods are goods that can be used for civil and military purposes. One example of such goods is the so-called IMSI catcher, a device used to surveil mobile phones. In Switzerland, these goods are governed by special legislation – unlike in other countries, where they are treated as conventional arms exports. At SRF Data, we took the effort to parse and visualize the recently released data from the State Secretariat for Economic Affairs SECO.
The interactive visualization allows the reader to dig into the highly detailed dual-use exports data.
Because our data processing workflow is fully reproducible, we can re-publish the visualization every year, as soon as the SECO releases new data. We already did that in 2015, 2016, 2017 and 2018. The data and methodology are freely available on GitHub, as with other stories published by SRF Data.
As one of the first projects at SRF Data we published a web documentary, called «Sandalen im Schnee» (sandals in the snow), about the asylum seeking process in Switzerland. The migrant crisis was at its peak and we wanted to explain and document the controversially discussed Swiss asylum seeking process – without resorting too much to the emotional side and without taking sides.
We interviewed six asylum seekers from different backgrounds and enriched the explanatory parts (entry, registration, the long wait, decision stay/go, etc.) with their personal stories. The core element and common thread of the web documentary is – among video, interactive data visualizations and, of course, text – an interactive navigation that a) always tells readers where they are in the text and b) shows a simplified schema of the asylum seeking process in Switzerland.
The web documentary was published as a full page piece with a navigation that shows the asylum seeking process at the same time.
One of the data visualizations showing origins of asylum seekers over time.
The project was nominated for the German Reporter Prize 2015 and for the Grimme Online Award 2016.
Inspired by the hugely successful New York Times dialect quiz, our team at Tages-Anzeiger, consisting of me and Marc Brupbacher, teamed up with language scientist Dr. Adrian Leemann (back then conducting research at the University of Zurich) to launch a similar application for the German-speaking region. Dr. Leemann provided the data and I was responsible for coding the frontend. I mainly used AngularJS, and LeafletJS for the map.
Our quiz was very similar to the NYT one: The user has to answer 25 questions on how certain words are pronounced in his or her region (I was baffled to learn that there exist more than 20 different ways of saying that you have the hiccups).
The user has to choose between a variety of ways to pronounce a commonly used word
Based on the answers, a probabilistic model calculated the most likely cities of residence, and based on that, a kind of heatmap shows the likely region of residence.
The resulting prediction
On the result screen, the user was also able to rate his personal prediction and to share the results via Twitter. The answers from the feedback form are now an invaluable data basis for new research conducted by Dr. Leemann.
Fortunately, we found a worthy partner in Spiegel Online for publishing the project. This allowed us to reach a huge audience: In the first 2-3 days, we had over 1.5 million unique visitors. To me, this project is a very good example of how journalists and scientists can launch awesome applications together and profit from each other – journalists get interesting and new data sets and scientists have a platform to publish their research which would otherwise remain in the academic domain.
About a month before the start of the 2014 soccer World Cup in Brazil, my colleague and fellow data journalist Julian Schmidli, with whom I worked at Tages-Anzeiger and now work at SRF Data, came up with the idea of creating an interactive soccer prediction game. Not an ordinary one, but one that would assist users in finding an optimized prediction for all encounters, based on the variables they would judge important for winning the tournament.
According to how a user weights a certain variable such as ball possession, the teams score differently. Each dot on the horizontal axis represents a team, and the height of the bar represents its accumulated score.
In the end, after the user has weighted each of the eight variables such as ball possession, scored goals, etc., he or she is presented with a complete tournament simulation. As he or she alters the weights on the right-hand side of the screen, the tournament is dynamically recalculated.
At the end, the user is presented with a simulated tournament overview.
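The basic idea of weighting variables and recalculating the tournament can be sketched in Python. This is a simplified illustration, not the scoring model of the actual application: it assumes each variable is normalized to a value between 0 and 1, that a team's score is a plain weighted sum, and that the team with the higher score always wins. Team names, variable names and values are placeholders.

```python
def team_score(stats, weights):
    """Accumulated score: weighted sum over all weighted variables."""
    return sum(weight * stats.get(name, 0.0) for name, weight in weights.items())


def predict_winner(team_a, team_b, stats_by_team, weights):
    """The team with the higher accumulated score wins (ties go to team_a)."""
    score_a = team_score(stats_by_team[team_a], weights)
    score_b = team_score(stats_by_team[team_b], weights)
    return team_a if score_a >= score_b else team_b


def simulate_knockout(teams, stats_by_team, weights):
    """Play a knockout bracket; expects a power-of-two number of teams."""
    current_round = list(teams)
    while len(current_round) > 1:
        current_round = [
            predict_winner(current_round[i], current_round[i + 1], stats_by_team, weights)
            for i in range(0, len(current_round), 2)
        ]
    return current_round[0]


# Hypothetical, normalized values for two of the variables.
stats_by_team = {
    "Team A": {"ball_possession": 0.9, "scored_goals": 0.2},
    "Team B": {"ball_possession": 0.4, "scored_goals": 0.8},
    "Team C": {"ball_possession": 0.6, "scored_goals": 0.5},
    "Team D": {"ball_possession": 0.3, "scored_goals": 0.6},
}
teams = ["Team A", "Team B", "Team C", "Team D"]

possession_fan = {"ball_possession": 1.0, "scored_goals": 0.0}
goals_fan = {"ball_possession": 0.0, "scored_goals": 1.0}

print(simulate_knockout(teams, stats_by_team, possession_fan))
print(simulate_knockout(teams, stats_by_team, goals_fan))
```

Re-running `simulate_knockout` whenever a weight changes corresponds to the dynamic recalculation described above.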
The application was designed as “mobile first” and is perfectly playable on mobile devices, too. There, less information is shown in the tournament overview, and the weighting controls do not show accumulated scores, but current weights only.
On mobile devices, the scores are only shown for the current variable/weights.
Together with my friends from the SoMePolis project, I drafted, designed and implemented an interactive application that makes it possible to monitor and analyze the Twitter accounts of Swiss parliament members. This blog post sums up pretty well what is possible with the app – unfortunately it’s written in German.
In short, the app allows the user to rank parliament members based on their activity and interactivity on Twitter — two concepts which were operationalized with key figures such as the number of Tweets within a timespan and the number of interactions with other users. The goal was to provide a simple yet transparent and not too complex measure of how politicians perform on Twitter, i.e., exactly the opposite of what other services such as Klout provide — nontransparent, proprietary ranking algorithms.
The user interface supports common interactions such as sorting, filtering and searching, and all these actions automatically trigger a dynamic update of the visible data set.
The user interface allows for sorting, filtering and searching.
One thing that I am quite proud of is the responsiveness of the app. This was not implemented with the help of a common framework such as Bootstrap.js and thus does not rely on CSS media queries (but probably could have). In fact, the window size is extracted and, based on this information, the corresponding data items are arranged dynamically. Naturally, each resize of the window triggers a re-arrangement if needed (try it out by resizing your browser window).
The arrangement of the data items is dynamically changed based on the window size and thus based on the type of device.
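The principle of arranging items from the measured window size, rather than through CSS media queries, can be sketched as follows. This is a Python illustration of the arithmetic only, not the app's D3 code; the item dimensions, the gap and the function names are assumptions.

```python
def columns_for_width(window_width, item_width, gap):
    """Number of items that fit side by side, but never fewer than one."""
    return max(1, (window_width + gap) // (item_width + gap))


def layout(n_items, window_width, item_width=200, item_height=120, gap=20):
    """Return the (x, y) position of each item for the given window width."""
    columns = columns_for_width(window_width, item_width, gap)
    positions = []
    for index in range(n_items):
        row, col = divmod(index, columns)
        positions.append((col * (item_width + gap), row * (item_height + gap)))
    return positions


# The same five items on a wide and on a narrow window.
desktop = layout(5, window_width=1000)
phone = layout(5, window_width=320)
print(desktop)
print(phone)
```

Calling `layout` again on every resize event and animating each item to its new position gives the re-arrangement behaviour described above.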
I again implemented this visualization with D3. In this case, it was particularly helpful for data filtering and sorting, but also for the animations that happen on page change and window resize.
Instead of writing yet another paper, I handed in this visualization for the LERU Bright 2013 Student Conference which will be held in August in Freiburg, Germany. This year’s conference topic is “Energy Transition in the 21st Century” and I am part of the “Dependencies” working group.
This “Atlas der Globalisierung”-inspired visualization, based on very recent data by BP, allows the reader to quickly grasp the temporal and spatial differences in oil consumption and production. On the one hand, during certain periods of history, some nations consumed almost as much oil as the rest of the world combined. On the other hand, the data of the last ten years show a growing divergence between consumption and production. Above all, I hope this work makes clear that nations are heavily interdependent when it comes to oil – the main driver of our global economy.
A linked view lets the reader interactively explore the data on oil consumption and production throughout the years.
Adaptation of the visualization by the “Neue Zürcher Zeitung”
In my opinion, the team around Sylke Gruhnwald did a very good job in taking the essence out of my visualization and also in improving its usability. The corresponding time series are not shown on mouseover, but triggered by mouse clicks and touch events, which makes it easier for mobile device users to study a country’s oil consumption as well as production. I also like the adaptive “stacking” of all chart lines in the background, which does not necessarily give more information but is still very aesthetic. What is missing, in my opinion, is the bar chart that directly compares the huge differences in consumption and production between different countries. The small data tables at the lower right corner give an impression of this, but do not visually convey it.
If you are interested in coming projects, follow me on Twitter.
I implemented this interactive visualization for a university course about geovisualization. It allows the user to visually compare two variables (either the outcome of a vote or a socio-demographic factor, per district). Two maps are used to show the geographical distribution of the variables, and quartiles are computed to distribute the values into four classes.
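The quartile classification can be sketched in a few lines of Python. This is an illustration with made-up district values, not the code of the visualization; it uses `statistics.quantiles` from the standard library, whose default interpolation method may differ slightly from the one used originally.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_classes(values_by_district):
    """Assign each district a class from 0 (lowest quarter) to 3 (highest quarter)."""
    cuts = quantiles(values_by_district.values(), n=4)  # three cut points
    return {
        district: bisect_right(cuts, value)
        for district, value in values_by_district.items()
    }


# Hypothetical share of yes votes (in percent) for eight placeholder districts.
yes_share = {
    "District 1": 1, "District 2": 2, "District 3": 3, "District 4": 4,
    "District 5": 5, "District 6": 6, "District 7": 7, "District 8": 8,
}

classes = quartile_classes(yes_share)
print(classes)
```

Each class index can then be mapped to one of four colors on the map.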
A scatterplot allows the user to contrast the two chosen variables and explore possible associations. Both views (map and scatterplot) are linked, so the user always knows which district he or she is dealing with.
The scatterplot, linked with the map, allows exploratory analysis of the data.
On top of that, the user has the possibility to draw a rectangle around the points in the scatterplot and thus select a set of districts (“brushing”). After this, the corresponding districts are highlighted in the map. This helps, for instance, to quickly identify regions where certain variables have high or low values, etc.
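Brushing boils down to a rectangle test on the scatterplot coordinates. The following Python sketch shows the idea with made-up districts and values; it is not the D3 code of the visualization, and the class and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class District:
    name: str
    x: float  # value of the first variable
    y: float  # value of the second variable


def brush(districts, x0, y0, x1, y1):
    """Return the names of all districts inside the drawn rectangle.

    The rectangle may be dragged in any direction, so the corners are sorted first.
    """
    left, right = sorted((x0, x1))
    bottom, top = sorted((y0, y1))
    return {
        d.name for d in districts
        if left <= d.x <= right and bottom <= d.y <= top
    }


# Hypothetical districts with two variables each.
districts = [
    District("District 1", 10.0, 55.0),
    District("District 2", 35.0, 40.0),
    District("District 3", 60.0, 70.0),
    District("District 4", 80.0, 20.0),
]

# Rectangle dragged from the upper right to the lower left corner.
selected = brush(districts, 65.0, 75.0, 30.0, 35.0)
print(selected)
```

The returned set of names is what the linked map would use to highlight the corresponding districts.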
In order to better compare differences between subsequent variables, a linear, visual transition of half a second is used. This helps to mitigate the problem of change blindness.
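A linear transition of this kind simply interpolates between the old and the new value in proportion to the elapsed time. The Python sketch below illustrates the arithmetic; it is not the visualization's own code, and the function name and the sample values are assumptions.

```python
def linear_transition(start, end, elapsed, duration=0.5):
    """Value shown `elapsed` seconds into a linear transition of `duration` seconds."""
    # Clamp progress to [0, 1] so the value stays at `end` once the transition is over.
    progress = min(max(elapsed / duration, 0.0), 1.0)
    return start + (end - start) * progress


# Hypothetical example: a value changes from 10 to 20 when the variable is switched.
for elapsed in (0.0, 0.25, 0.5, 1.0):
    print(elapsed, linear_transition(10.0, 20.0, elapsed))
```

Because the intermediate states are drawn, the eye can follow which districts change and by how much, instead of seeing one map abruptly replaced by another.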