Crowdsourcing and Games with a Purpose

Crowdsourcing - using workers recruited via the Web to label data - has become the de facto standard for small- and medium-scale annotation in CL ever since the Snow et al. (2008) paper (Poesio et al., 2017). We too have used crowdsourcing systematically in our research, in particular for summarization (first to prepare the 2009 Arabic summarization data for MULTILING, more recently in the SENSEI project) and text classification (in our KTP with Minority Rights Group). But we have also developed crowdsourcing technology ourselves, in two areas in particular: using Games-With-A-Purpose (Poesio et al., 2013) to collect data, and analyzing crowdsourced data using Bayesian models (Paun et al., 2018a, 2018b).

Phrase Detectives

Phrase Detectives (Poesio et al., 2008; Chamberlain et al., 2008; Poesio et al., 2013; Poesio et al., 2017; Poesio et al., 2019) is a Game-With-A-Purpose developed to annotate anaphoric information. It is one of the most successful GWAPs for Computational Linguistics, having collected more than 4 million judgments over the years. The second release of the dataset, consisting of 542 completely annotated English documents, half from Wikipedia and half fiction from Project Gutenberg, for a total of slightly over 408,000 tokens and 2.5 million judgments, was released in 2019 (Poesio et al., 2019).

Other GWAPs

Three other GWAPs have been developed so far as part of the DALI project.

Analysing and aggregating crowdsourced data

The data collected using crowdsourcing tend to be very noisy; some method is required to identify unreliable workers and to assign a reliability to the labels they produce. Bayesian models of annotation (Dawid and Skene, 1979; Carpenter, 2008; Hovy et al., 2013; Passonneau and Carpenter, 2014; Paun et al., 2018a) have proven much more effective than majority voting at assessing the reliability of the labels and are becoming the new standard. In our research, we have used such methods to assess the reliability of labels obtained with a variety of methods, and to identify the most reliable workers. One of the main contributions of the DALI project is the development of a Bayesian annotation model for anaphoric information, Mention Pair Annotation (MPA) (Paun et al., 2018b). We also organized an EMNLP 2019 workshop on aggregating non-standard labels.
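Majority voting treats all workers alike; Bayesian models of annotation instead estimate each worker's reliability jointly with the true labels. As a minimal illustration (a sketch of the classic Dawid and Skene (1979) EM model, not of MPA itself; all names and the toy setup are ours, not from any of the cited papers):

```python
import numpy as np

def dawid_skene(labels, n_items, n_workers, n_classes, n_iter=50):
    """EM for the Dawid & Skene (1979) model of annotator reliability.

    labels: list of (item, worker, label) triples.
    Returns (per-item posterior over true classes, per-worker confusion matrices).
    """
    # Initialise the item posteriors with soft majority voting.
    T = np.zeros((n_items, n_classes))
    for i, j, k in labels:
        T[i, k] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices
        # (rows = true class, columns = observed label), lightly smoothed.
        prior = T.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i, j, k in labels:
            conf[j, :, k] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true class,
        # combining the prior with every worker's judgment.
        logT = np.tile(np.log(prior), (n_items, 1))
        for i, j, k in labels:
            logT[i] += np.log(conf[j, :, k])
        logT -= logT.max(axis=1, keepdims=True)
        T = np.exp(logT)
        T /= T.sum(axis=1, keepdims=True)
    return T, conf
```

On a toy dataset with two accurate workers and one systematically wrong one, the estimated confusion matrices expose the unreliable worker while the label posteriors recover the truth, which is exactly the behaviour that makes such models preferable to majority voting.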

Using crowdsourcing for summary evaluation

As part of the ongoing SENSEI project, we organized the Online Forums Summarization Task of MULTILING-2015, in which system summaries were evaluated using crowdsourcing (Kabadjov et al., submitted).


Projects (in inverse chronological order)

  • The DALI project, funded by ERC (2016-2021), is concerned with using the data collected through Games-With-A-Purpose to study anaphora and disagreements in anaphora.
  • SENSEI, funded by the EU (2013-2016). This ongoing project is concerned with the use of discourse to summarize spoken and online conversations such as those in online forums.
  • AnaWiki, funded by EPSRC (2007-2009), was the project in which Phrase Detectives was developed.

Workshops (in inverse chronological order)

  • The AnnoNLP workshop at EMNLP 2019.

Main publications (in inverse chronological order)