As georeferenced data from social media, be it in the form of Tweets, Foursquare check-ins, Instagram photos, or Flickr pictures, become increasingly available, (geospatial) analyses and visualizations built on them grow ever more popular. Often, such studies and applications claim to be able to infer social, cultural, and even political insights from these data, spatially fine-grained and referenced down to the level of countries and cities.
I haven’t seen a single one that actually succeeded in plausibly explaining the how to me.
In my MSc thesis at the Department of Geography, University of Zurich, I set out to critically examine the properties of such data and how they can be used to infer statements about the mobility patterns of a society. As I was looking at dozens of very recently published studies and examples in popular media, I quickly realized how incautiously such data are dealt with, and how boldly certain, unverifiable assumptions are made. The thesis is available here. Ready to be taken apart, free of charge. I already got my degree. 🙂
Now, you may think that my rants belong to the academic domain. But think of all the beautiful maps, the aesthetic visualizations that have been published in the last few months and that have been soaked up by popular media and newspapers all around the world. Think of “Sentiment in New York”, think of Phototrails, of SelfieCity. Think of Eric Fisher’s “Locals and Tourists”, of Moritz Stefaner’s Stadtbilder. Think of the “Twitter Heartbeat” and a dozen other publications that go in the same direction.
While some of these projects come from academia and others are certainly intended to be artistic rather than scientific, and while some of them actually make statements about society as a whole and others only want to show the visual beauty of such data, they still have a common denominator: their visual products look real. They convey a certain feeling of ultimate reality and actuality. Or, in a more general sense, their output, that is, their findings, their statements, and their visual products, is based on data produced and shared by (allegedly) real people.
But who are these people? Exactly that question is the crux of the matter.
In this — admittedly quite long — post, I will identify the most common methodological and conceptual flaws in the above mentioned works, and I will back these allegations with the results of my Master’s thesis, namely, with a detailed analysis of the spatial and socio-demographic representativeness of a sample of approximately 12 million geotagged Tweets collected over the course of a full year.
Origin and Authenticity: Locals? Tourists? Bots?
In my thesis, I had the goal of using these Tweets, and more importantly, the quasi-continuous location history of their roughly 25,000 authors, as the basis for studying mobility in Switzerland. Since I wanted to study the mobility behavior of Swiss residents, and compare my findings to official statistics, I first needed to verify the origin of my subjects.
I quickly discovered that this sounds easier than it is. On Twitter and on other location-enabled API platforms such as Instagram, it is almost impossible to pinpoint the actual home town or even the country of users by just looking at profile information. In the same vein, it is (at least currently) impossible to gather only users coming from a certain region; the only thing one can do is collect Tweets originating from a certain region. But is the origin of somebody’s content a reliable indicator of somebody’s actual residence? I don’t think so.
Let’s say we only sample a few georeferenced Tweets from a user over the period of two weeks. Even if all these Tweets originate from more or less the same region, we have no way of assuredly telling whether the user is actually a local or a tourist. They could also be a temporary resident, an exchange student, for example. Or somebody who only uses the location feature of Twitter while on holidays, perhaps to show off. We just don’t know.
So why should somebody who has tweeted from the same “city” over the course of a month be deemed a local? It’s not so much the strictly one-dimensional measure used to distinguish locals from tourists that bothers me in this example; it’s the lack of critical reflection about it. Why was it chosen over other measures? Would a different temporal period have yielded significantly different results?
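To make the arbitrariness concrete, here is what such a one-dimensional measure boils down to in code. This is a minimal sketch; the function name, the tuple format, and the 30-day default are my own illustrative choices, not taken from any of the cited works:

```python
from datetime import datetime, timedelta

def classify_user(tweets, min_span_days=30):
    """Naively label a user 'local' if all their geotagged Tweets come
    from a single city over at least `min_span_days` days.
    `tweets` is a list of (timestamp, city) tuples."""
    cities = {city for _, city in tweets}
    if len(cities) != 1:
        return "unknown"  # multiple cities: could be tourist, commuter, ...
    timestamps = sorted(ts for ts, _ in tweets)
    span = timestamps[-1] - timestamps[0]
    return "local" if span >= timedelta(days=min_span_days) else "unknown"
```

Note how the verdict flips with the choice of `min_span_days`: the same user may be a “local” under a 30-day threshold and “unknown” under a 60-day one, which is exactly the kind of sensitivity that deserves explicit discussion.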
And while the authors of “Sentiment in New York City” did not explicitly attempt to map the mood of residents of New York, neither did they explicitly point out that their maps, popularly published in The Atlantic and other outlets, show Tweets, not the individual sentiments of residents. The map, though, suggests that people living around the area marked with E1, for example, are miserable, something that cannot be verified. It could just as well be that the apparent bad smell at this location motivates random visitors as well as residents to complain about it in a Tweet.
Another problem arises from the fact that Twitter houses quite a few bots, for instance, automatic earthquake or weather broadcast services. How can they be detected and removed, so that they don’t bias or distort an analysis? How are they dealt with in the above examples?
In my case, I came up with a filtering mechanism that discards users based on several indicators of their spatio-temporal behavior. For instance, a user who maintains an average velocity of about 100 km/h over an extended period is unlikely to be human. A user who posted the majority of his or her Tweets outside the area of interest is also not likely to be a resident of that area.
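For illustration, the velocity indicator alone can be sketched as follows. This is a simplified stand-in for the combined filter used in the thesis; the function names, the point format, and the threshold handling are mine:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def looks_like_bot(points, max_mean_kmh=100.0):
    """`points`: chronologically sorted (unix_ts, lat, lon) tuples.
    Flags a user whose average velocity across consecutive geotagged
    Tweets exceeds `max_mean_kmh`, one of several possible indicators."""
    dist, hours = 0.0, 0.0
    for (t1, la1, lo1), (t2, la2, lo2) in zip(points, points[1:]):
        dist += haversine_km(la1, lo1, la2, lo2)
        hours += (t2 - t1) / 3600.0
    return hours > 0 and dist / hours > max_mean_kmh
```

In practice, such a check has to be combined with further indicators (for example, the share of Tweets posted outside the area of interest), since a single long-haul flight in the sample would otherwise flag a perfectly human user.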
I initially gathered a lot of these users, and I would be happy if somebody told me how to circumvent this problem, as it is not possible to query the API for “Tweets coming from residents of area A but not area B”. Using a very simple heuristic, I looked for the authors of georeferenced Tweets posted in the study area and started tracking them, but I could only approximately tell whether they were locals after having collected their Tweets over the course of several months.
To summarize, it is very difficult to automatically verify someone’s origin, and a lot of assumptions need to be made. I don’t claim to have found the best method for this, but my problem with the above-mentioned works (especially the ones coming from academia) is that the question of origin and authenticity is almost never even considered. This is especially problematic in conjunction with visual output such as maps: pinpointing something on a map makes it look static; it conveys the notions of residency, origin, and authenticity, none of which can be taken for granted when dealing with data from social media.
To further illustrate this, look at a map that shows census data about religious affiliations in Swiss municipalities. Clearly, what the map shows in terms of religious affiliation can be pinpointed to particular locations, to particular origins, because census data are used. A darker red means that there are indeed proportionally more Roman Catholic residents than in other municipalities, a fundamentally different ontology than the one underlying the sentiment maps and other maps made with data from social media. In my opinion, the authors of the latter do not state clearly enough that the assumption of origin and authenticity likely does not hold true.
Representativeness or Why The “Geography of Twitter” Does Not Exist
I am certainly not the first and only one to doubt that social media data gathered via an API are socio-demographically representative at all:
“Taken together, participation inequality, demographic bias and spatial bias point to a very skewed group that is producing most of the content that we see on the GeoWeb.”
And Crampton et al. (2013, p. 132) argue that “… there is little that can be said definitively about society-at-large using only these kinds of user-generated data, as such data generally skews toward a more wealthy, more educated, more Western, more white and more male demographic.”
Now, in my opinion, and as my own research shows, not every analysis and application has to rely on completely socio-demographically representative data. For the analysis of commuter flows, for example, it does not really matter whether one has data from a mostly male cohort or a mostly female cohort as these are likely to exhibit the same spatio-temporal behavior when it comes to commuting.
Still, this statement is based on the assumption that men and women have more or less the same or similar jobs and go to the same workplaces — an assumption that likely holds true in the Western sphere but might be wrong in other places of the world, which leads me to my next point.
In many studies and visualizations that look for global patterns, it is implicitly assumed that Twitter, especially its geotagging feature, is used in the same fashion all over the globe. The question one should initially pose is whether usage contexts and user cohorts differ fundamentally in different parts of the world. Secondly, one should ask what implications this has for analyses that try to make statements about society as a whole.
I haven’t verified it and, unfortunately, I haven’t found a study which actually deals with this problem, but I would assume that there are essential differences on the macro scale. For example, it seems plausible that Twitter is more widespread throughout different parts of society where it has been established for quite a long time (e.g., the USA and Canada). In contrast, in countries where it is still in the early-adopter phase, it is more likely used by a mostly urban, tech-savvy, well-educated, and male cohort. One might also assume that, in developing countries, Twitter geolocation is predominantly popular among people with access to location-enabled smartphones and thus among a rather small and high-income cohort.
While it is quite difficult to study the distribution of social media among different cohorts all over the world, this can at least be done on the regional or micro scale. As part of my thesis, I heuristically identified the residential municipality of about 2,200 Swiss users (those who passed the above-mentioned plausibility test), using roughly 2 million georeferenced Tweets. With these figures, I could then compare the observed count of Twitter users per municipality with the expected one, under the assumption that Twitter users are spatially distributed according to the true population. It turns out that while the observed count is more or less correlated with the actual population …
… there are stark regional differences. Using geodemographic typologies as defined by the Swiss Federal Statistical Office, I was able to run the same analysis on the level of different types of regions.
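Stripped of the actual geodata handling, the observed-versus-expected comparison reduces to a few lines. This is a toy sketch with invented names; the thesis uses the official population figures and geodemographic typologies of the Swiss Federal Statistical Office, not this dictionary format:

```python
def representation_ratios(observed, population):
    """observed: {municipality: Twitter users located there}
    population: {municipality: resident count}
    Returns observed / expected users per municipality, where the
    expected count assumes users are spread proportionally to population."""
    total_users = sum(observed.values())
    total_pop = sum(population.values())
    return {
        m: observed.get(m, 0) / (total_users * population[m] / total_pop)
        for m in population
    }
```

A ratio above 1 means a municipality (or region type) is overrepresented on Twitter relative to its population; a ratio below 1 means it is underrepresented.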
For instance, there are significant differences between urban and rural municipalities. Overall, the core cities’ population is clearly overrepresented while the rural population, which amounts to almost the same proportion in reality, is heavily underrepresented on Twitter.
Even more striking are the differences between the four linguistic regions of Switzerland, especially between the Western, French-speaking part, and the more populous German-speaking part.
A closer look at so-called “metropolitan regions” gives even more clues as to why this is the case.
It seems that the cities of Geneva and Lausanne in the far south-west are likely not only responsible for the stark imbalance between language regions but also for the divide between urban and rural areas.
To further investigate this, I looked at a few hundred profiles sampled from different language regions and came to another important conclusion. While the typical (geoactive) German-speaking Twitter user is around 30 years old, male, and has a job in either tech, journalism, PR, or politics, the demography of the French-speaking users is completely different. According to my sample, their gender is more evenly distributed, they are in their late teens, they are likely to have an immigration background, and they use Twitter in a fundamentally different fashion than their German-speaking counterparts.
In the French-speaking region, especially in the metropolitan region of Geneva-Lausanne, Twitter seems to be used more as a messaging application than as a broadcast service and information source, as is the case in the German-speaking region. The French-speaking user group is also far more prolific when it comes to georeferencing Tweets, which makes it disproportionately more likely to sample a French Tweet than a German one when collecting “Swiss” Tweets.
What does this tell us? Motivations to georeference a Tweet or an Instagram picture differ: some do it for a concrete purpose, others for pure self-representation; some don’t do it because they still care about their geoprivacy, and some probably do not even know that they’re doing it and/or forgot to opt out. But, as my research shows, motivations and usage contexts do not only fundamentally differ by cohort but also by region, which inevitably results in spatial bias: some regions are significantly underrepresented and others overrepresented. And this in a small country like Switzerland, which is sometimes looked at as one large urban area, even a large city-state.
While I do not claim to know how this spatial bias manifests itself in the New Yorks and Londons of the world that every researcher and data artist is keen on visualizing, I am pretty confident that these inequalities exist, everywhere, on different geographical scales. And they certainly need to be considered before making statements about the “residents” of a city, of a country, of the world.
User Contribution Bias: The Few and The Many
Last but not least, the beautiful Pareto distribution.
Data from social media is extremely prone to it. Leetaru et al., authors of “Mapping the Global Twitter Heartbeat”, found a striking figure after having looked at a ridiculous amount of georeferenced Tweets: 1% of all observed users account for 66% of all Tweets. Now that really is something to be savoured.
The same authors also found out that only between 1% and 3% of all Tweets are georeferenced, with significant differences from place to place. I just wonder how they were able to geographically pinpoint these figures to cities as they actually have no way of telling where the other 97-99% are from. But I digress.
Thought experiment: Let’s say you sample a thousand georeferenced Tweets from a hundred distinct users (who you assume to live in a certain, spatially confined region). 660 Tweets are produced by the guy who dislikes the smell at E1, the rest are unevenly (!) distributed among the others. Is this even close to representative? To meaningful?
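The thought experiment translates directly into a statistic one can compute for any sample. Again a sketch, with names of my own choosing:

```python
from collections import Counter

def top_share(author_ids, top_fraction=0.01):
    """Share of all Tweets contributed by the most prolific
    `top_fraction` of distinct users. `author_ids` holds one entry
    per Tweet: the Tweet's author."""
    counts = Counter(author_ids).most_common()  # users, busiest first
    k = max(1, int(len(counts) * top_fraction))
    return sum(n for _, n in counts[:k]) / len(author_ids)
```

For the hypothetical sample above, with one user contributing 660 of 1,000 Tweets, the top 1% of the 100 users account for a share of 0.66, the very figure Leetaru et al. report at global scale.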
In my thesis, I also looked at the Tweeting behavior of my sample, and I found similar figures.
Note that the curve shows the data before preprocessing, i.e., before removing users that were unlikely to be authentic residents.
After removing those, the curve looked pretty much exactly the same.
How to deal with this? Analyze users, not content! Or, at least, throw out very prolific users (and be disappointed that your sample just shrank by two thirds).
Unfortunately, too many academic papers still only look at content instead of users: this many Tweets at this place around 10 AM, that many at the other place, so many fewer an hour later, et cetera. Of course, many results gathered with such methodologies look absolutely plausible, but often, they do not tell us more than we already know. If we want to gain new insights, we need to rethink our approach to user-generated content on the “Geoweb”. Yes, it is more complicated to algorithmically analyze the behavior of individual users than to count Tweets in certain regions at certain times. And it is more frustrating, because the individual data are in many cases so darn sparse: what we collectively label “big data” is often produced by a small minority and not by our beloved study subjects. But we need to do it, and if you have some spare time, you can read about a possible approach in my thesis or wait until I condense another blog post from it.
To sum up:
- Origin and authenticity of users can never be assumed; thus, the data need to be thoroughly preprocessed.
- Representativeness is neither given nor always necessary, but it should be accounted for, especially with regard to spatial bias.
- Too few produce too much. This is probably the hardest problem to deal with, as only user-centered analyses seem able to produce new, meaningful insights.
Of course, not everything produced with data from the “Geoweb” is questionable per se. I thoroughly admire many of the above-mentioned authors and artists for their work, and I think they are pioneers. I just wish some of them would pay a little more attention to the above-posed questions of representativeness, origin, and authenticity, and reflect more transparently on their methodological decisions.
That this can be done is proven, for example, by Floating Sheep, a group of people who regularly publish small analyses done with georeferenced data. In contrast to others, they often question their own methods and try to justify their decisions, for instance in the case of their controversially discussed map of racist Tweets.
In a world where everybody is talking about big data and the benefits it apparently has for our society, we, as scientists, owe it to the public to properly document our methodology and to reflect upon possible biases; we, as visual artists and designers, owe it to the public to communicate our data sources and our preprocessing steps; and lastly, we, as journalists and editors, owe it to the public to critically examine and question the works of the former two before we share them with a larger audience.
“For every two degrees the temperature goes up,
check-ins at ice cream shops go up by 2%.”