Big Data in the social sciences

Post date: Jan 29, 2016 3:50:08 PM

Alexandros Tokhi and Christian Rauh

Big Data has become a ubiquitous buzzword. Social networks, smartphones, and various online applications and websites constantly produce and provide information on an unprecedented scale and level of detail. Data storage is cheap, and ever-improving analytical tools seem to herald a revolution in the way we understand the world. Some argue that this also fundamentally transforms science. Just crunching more data, it is assumed, allows us to make better decisions and solve social and political problems. So is the Big Data revolution the end of the social sciences as we know them?

We contend that it is not. As social scientists we are particularly well trained to know that observations alone – no matter how many of them there are – hardly lead to meaningful inferences without carefully specifying the underlying assumptions and constructing valid research designs. It is the combination of the principles of social scientific inquiry with novel methods of automated data generation and processing that holds the potential to generate new insights into socially relevant questions.

Don’t believe the hype – at least not all of it

The bold statements on the transformative power of the Big Data revolution are often rooted in rather naïve views of data analysis and largely unclear specifications of what Big Data actually is. One major pitfall is the belief – popular amongst computer scientists, Internet pundits and data journalists – that Big Data signals the end of theory. Corresponding arguments essentially equate Big Data with an N=all approach in which simple correlations reveal the ultimate truth about the world. For social scientists this clearly misses a crucial point: data is meaningful only with regard to a particular theory, and we need to explicate the assumptions that drive our conclusions. Theory influences and even determines what we observe in a given data set, which aspects or phenomena we identify as relevant, and how we distinguish spurious correlations from causally meaningful relationships. Google’s search routines, for example, are built on the assumption that more incoming links indicate that a website is more important. You might or might not buy into this assumption, but you should be aware of it when interpreting the results.
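To make this concrete, consider a deliberately naïve Python sketch of that link-counting assumption. The web graph below is invented, and Google’s actual ranking algorithms (e.g. PageRank) are of course far more elaborate:

```python
# Toy illustration of "more incoming links = more important".
# The link graph is invented; this is not Google's algorithm.
from collections import Counter

# Hypothetical web graph: page -> pages it links to
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "d.example": ["b.example", "c.example"],
}

# Count incoming links for every page
inlinks = Counter(target for targets in links.values() for target in targets)

# Rank pages by in-link count, most "important" first
for page, n in inlinks.most_common():
    print(page, n)  # c.example 3, b.example 2
```

Whether an incoming link really signals importance is exactly the kind of assumption a careful analyst has to make explicit before trusting the ranking.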

A related pitfall is reducing Big Data to really, really large data sets. In the social sciences we judge the number (and the composition) of our observations relative to the population of interest. The Big Data hype usually invokes images of the billions of observations that, for example, social media produce. But individuals select themselves into these networks – do all of your friends use Twitter? – and if we ignore these selection processes we might quickly draw skewed conclusions. In other, equally socially relevant areas – such as the political commitment to international treaties discussed below – a comparatively modest number of observations might already represent the universe of cases quite well. From a social science point of view, there is simply no absolute criterion for drawing the line between ‘big’ and ‘small’ data without having a specific social phenomenon in mind. Accordingly, big and small data are subject to the same analytical challenges when it comes to measurement validity or selection bias. The modern social sciences have developed an encompassing toolkit for drawing valid inferences from observational data – and with the abundance of digitally available information, this asset has become more rather than less important.

Thus, from our perspective, Big Data does not put an end to the scientific method – but it holds significant potential to enhance it. What we do consider ‘revolutionary’ is the ever-expanding set of methods to automatically collect, process, and analyze digital information. In various research fields, these methods help us systematically analyze only loosely structured data sources such as websites or text documents. Big Data, in other words, provides inexpensive and time-saving means to tap uncharted sources of empirical evidence. Combined with explicit theory and sound social scientific methods, these innovative and replicable means of data collection and analysis can contribute to answering substantive questions.

Analyzing politics beyond the nation state with big data technologies

The study of international and supranational politics is probably not the first research area that comes to mind in the context of Big Data. Still, two examples from our own research are well suited to exemplify the potential of Big Data methods.

The first example concerns a central question in international relations research: do demanding legal obligations encourage or discourage the ratification of international treaties by states? Past scholarship has produced equivocal findings because it analyzed only subsets of the relevant treaties; as a result, the question remains unresolved.

The usual expectation is that states avoid demanding treaties in order to retain their freedom of action. We argue, however, that this decision varies with the substantive issue area a treaty covers. Where tougher treaty rules commit not only oneself but, crucially, other states to a specific course of action (e.g. reducing air pollution or lowering trade barriers), states should be more willing to sign on to demanding obligations. Where demanding obligations mainly constrain a state’s own freedom of action without binding others, they discourage ratification. The latter mechanism prevails in international human rights law, while the former applies in areas where inter-state cooperation is needed to provide common public goods (e.g. clean air, security).

Our research question and theory inform our data choices: we need to capture individual states’ willingness to ratify treaties – here measured by the time needed to take this step – to isolate issue areas and their treaties from one another, and to cover the entire variation in the demandingness of treaty obligations therein. We are quickly confronted with several thousand observations if we consider each of the 193 states for about 80 treaties from human rights and environmental regulation over the past 50 years.
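How might such willingness be measured in practice? A minimal sketch, assuming we already hold signature and ratification dates per state and treaty (all names and dates below are invented placeholders), is to compute the number of days a state takes to ratify after signing:

```python
# Illustrative sketch only: one way to operationalize the "time needed
# to ratify" measure. All states, treaties, and dates are invented.
import pandas as pd

treaties = pd.DataFrame({
    "state":        ["State A", "State B", "State C"],
    "treaty":       ["Convention X", "Convention X", "Convention X"],
    "signature":    ["1990-01-26", "1990-09-14", "1991-03-02"],
    "ratification": ["1990-09-02", "1995-06-15", None],  # C never ratified
})

for col in ("signature", "ratification"):
    treaties[col] = pd.to_datetime(treaties[col])

# Days from signature to ratification; missing values mark states that
# have not (yet) ratified, which event-history models treat as censored
treaties["days_to_ratification"] = (
    treaties["ratification"] - treaties["signature"]
).dt.days
print(treaties)
```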

This calls for Big Data methods. Accordingly, we program and implement a webscraping algorithm in the programming language Python that automatically extracts and reshapes data from the United Nations Treaty Collection Database. Within 2.5 minutes we gather around 140,000 observations. Speed, though, is not the only advantage. To isolate demanding obligations, we exploit the fact that different types of treaties regulating the same set of rights differ only in their degree of obligation: framework conventions impose fewer obligations on states than their optional protocols. The Python algorithm automatically recognizes the type of treaty and codes our demandingness indicator accordingly. The statistical analysis of these data provides consistent evidence in support of our argument and resolves the question about demanding obligations.
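For readers curious about what such a scraper looks like, here is a minimal Python sketch of the general logic – not our actual research code. The assumed page layout (a participants table with state, signature, and ratification columns) and the protocol-detection rule are simplifications for illustration:

```python
# Minimal webscraping sketch (not the original research code). The HTML
# structure assumed below is hypothetical; the real UN Treaty Collection
# pages may be organized differently.
import requests
from bs4 import BeautifulSoup

def scrape_treaty_status(url):
    """Return one record per state party listed on a treaty status page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Treaty title, used to code the demandingness indicator below
    title = soup.find("title").get_text(strip=True)

    records = []
    # Assumed layout: one table row per participant
    for row in soup.select("table.participants tr")[1:]:  # skip header
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if len(cells) >= 3:
            records.append({
                "state": cells[0],
                "signature": cells[1],
                "ratification": cells[2],
                # Optional protocols impose more demanding obligations
                # than their framework conventions
                "demanding": int("optional protocol" in title.lower()),
            })
    return records
```

Looping such a function over all treaty pages yields the state–treaty data set in minutes rather than the weeks of manual coding it would otherwise require.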

The second example from our current research focuses on EU affairs in national parliaments. To the extent that parliaments provide publicly visible debates about the EU, they might mitigate the democratic deficits of supranational policy-making. While some argue that the incentives to politicize EU issues have increased with each transfer of political competences to the supranational level, others claim that selective partisan incentives drive the public emphasis of EU topics. Extant research restricts itself to selected parliamentary debates in which explicit EU questions figure on the formal agenda – ignoring the fact that the EU creates constraints and opportunities across almost the entire range of domestic issues possibly discussed in parliament.

For a systematic evaluation of these competing claims, we need information on the parliamentary salience of EU affairs over time across various issue areas, different levels of EU authority, and different settings of domestic partisan competition. To meet these considerable informational requirements, we scrape all plenary protocols between 1991 and 2013 from the document server of the German Bundestag. Using our method of automated pattern recognition, we split these texts into more than 148,000 individual speeches by parliamentarians from all parties. Finally, we count all term-level references to the policies, politics, and polity of the EU in each speech with a dictionary-based text-mining algorithm implemented in the R environment.
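Our counting step is implemented in R; the following Python sketch merely illustrates the dictionary-based logic. The dictionary terms and the example sentence are invented placeholders, not our actual coding scheme:

```python
# Sketch of dictionary-based counting of EU references in a speech.
# The terms below are illustrative; the actual dictionary is larger.
import re

eu_terms = [
    r"\beuropean union\b", r"\beu\b", r"\beuropean commission\b",
    r"\beuropean parliament\b",
]
pattern = re.compile("|".join(eu_terms), flags=re.IGNORECASE)

def count_eu_references(speech: str) -> int:
    """Count all term-level matches of the EU dictionary in one speech."""
    return len(pattern.findall(speech))

speech = "The European Commission has proposed new rules, and the EU..."
print(count_eu_references(speech))  # -> 2
```

Applied to the more than 148,000 speeches, such counts turn raw parliamentary text into a measure of EU salience that can then be analyzed statistically.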

The data reveal that the number of EU references in the German Bundestag has indeed systematically increased with each successive revision of the EU treaties. Our results are robust to the inclusion of partisan differences and other control variables. Whether this holds beyond the German case remains to be seen, and we are currently extending this data collection strategy to the parliaments of other EU member states.

Big Data is what we make of it

Big Data alone is hardly a panacea for all the challenges modern societies face. If no meaningful theory is leveraged to contextualize and make sense of the abundant information, we at best stare at huge piles of numbers. At worst, we infer policy recommendations from spurious correlations and skewed samples. That’s why the social sciences should make their voice heard in the current debates. Big Data does not end theory. Rather, it highlights the need for social scientists to critically reflect on and give meaning to the massive flows of digits.

But this also requires that the social sciences open up to Big Data technologies. First, we should be able to understand the assumptions that drive modern algorithms when assessing their societal implications. Second, we can make technologies such as webscraping, pattern recognition, and text mining part of our methodological toolkit in order to reduce the time and other costs involved in procuring the information we need for the questions we consider relevant. Big Data will not transform the social sciences, but we have a lot to add to – and a lot to gain from – the technologies it offers.

This post originally appeared in the December 2015 issue of the WZB-Mitteilungen.