Aggregating and Learning from Multiple Annotators
(EACL 2021)

You can watch the recorded video here. The slides can be downloaded from here.

Most of the material discussed (and a lot more) is now part of a book you can read more about here.

Tutorial abstract: The success of NLP research is founded on high-quality annotated datasets, which are usually obtained from multiple expert annotators or crowd workers. The standard practice when training machine learning models is first to adjudicate the disagreements and only then train. To this end, there has been a lot of work on aggregating annotations, particularly for classification tasks. However, many other tasks, particularly in NLP, have unique characteristics not considered by standard models of annotation, e.g., label interdependencies in sequence labelling tasks, unrestricted labels for anaphoric annotation, or preference labels for ranking texts. In recent years, researchers have picked up on this and are closing the gap. A first objective of this tutorial is to connect NLP researchers with state-of-the-art aggregation models for a diverse set of canonical language annotation tasks. There is also a growing body of recent work arguing that following the convention and training with adjudicated labels ignores any uncertainty the labellers had in their classifications, which results in models with poorer generalisation capabilities. Therefore, a second objective of this tutorial is to teach NLP researchers how they can augment their (deep) neural models to learn from data with multiple interpretations.


The disagreement between annotators stems from ambiguous or subjective annotation tasks as well as annotator errors. Crowdsourcing with non-expert annotators is especially prone to annotation errors, sometimes caused by workers who do not attempt to provide correct annotations (spammers). The traditional resolution to this problem is redundant labeling: collect multiple interpretations from distinct coders, allowing the resource creators to later aggregate these labels. To this end, probabilistic models of annotation have been successfully used to learn the coders’ behavior and distill the labels from noise.
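The simplest form of such aggregation is plain majority voting over the redundant labels. As a toy illustration (the data and names below are purely hypothetical, not from any dataset discussed in the tutorial):

```python
# Hypothetical sketch: aggregating redundant crowd labels by majority vote,
# the simplest adjudication baseline. Annotations are (item, worker, label)
# triples; all identifiers are illustrative.
from collections import Counter, defaultdict

def majority_vote(annotations):
    """Return one adjudicated label per item (ties broken arbitrarily)."""
    votes = defaultdict(Counter)
    for item, worker, label in annotations:
        votes[item][label] += 1
    return {item: counter.most_common(1)[0][0] for item, counter in votes.items()}

annotations = [
    ("s1", "w1", "POS"), ("s1", "w2", "POS"), ("s1", "w3", "NEG"),
    ("s2", "w1", "NEG"), ("s2", "w3", "NEG"),
]
print(majority_vote(annotations))  # {'s1': 'POS', 's2': 'NEG'}
```

Majority voting treats all coders as equally reliable, which is exactly the assumption the probabilistic models of annotation discussed below relax: they additionally learn each coder's behaviour, so that a spammer's votes count for less.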

The research on models of annotation spans multiple decades (going back to work on latent structure analysis in the early 70s), and has been substantially debated over the years at dedicated conferences such as HCOMP and workshops, from The People’s Web Meets NLP (Gurevych and Zesch, 2009) to CrowdML and, more recently, AnnoNLP (Paun and Hovy, 2019). The plethora of models that have been published even prompted some researchers to ask, challengingly, whether the problem of aggregating crowd labels had been solved (Zheng et al., 2017). As anticipated, there are still unaddressed issues – in particular, the bulk of the work has focused on classification tasks, leaving room for innovation in other areas. The NLP field specifically contains a number of tasks with unique characteristics not considered by standard models of annotation. For example, in sequence labelling tasks such as part-of-speech tagging or named entity recognition, nearby labels have known interdependencies. In other tasks, such as anaphoric annotation for coreference resolution, the coders are asked to provide labels that are not drawn from a fixed set of categories but consist of textual mentions. Another example is pairwise preference labelling, where coders are asked to choose the instance from a pair that most strongly reflects a quality of interest, such as relevance to a topic or convincingness of an argument, with the goal of inferring an overall ranking of text instances. Researchers have observed these gaps in the literature and are addressing them. A key objective of this tutorial is to connect NLP researchers with state-of-the-art aggregation methods suitable for canonical NLP tasks, covering classifications (Yan et al., 2014), sequence labels (Nguyen et al., 2017; Simpson and Gurevych, 2019), anaphoric interpretations (Paun et al., 2018b) and pairwise preference labels (Simpson and Gurevych, 2020).
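The classical starting point for the classification case is the confusion-matrix model of Dawid and Skene (1979), cited below, in which each worker has a per-class confusion matrix and EM alternates between inferring each item's true class and re-estimating the workers' reliabilities. A minimal sketch of that idea (illustrative code written for this page, not taken from any of the cited papers; class labels are integers, smoothing and iteration counts are arbitrary choices):

```python
# Minimal EM sketch of the Dawid & Skene (1979) annotation model.
# conf[w][k][l] = P(worker w answers l | true class is k).
def dawid_skene(annotations, n_classes, n_iter=20, smoothing=0.01):
    items = sorted({i for i, _, _ in annotations})
    workers = sorted({w for _, w, _ in annotations})
    by_item = {i: [(w, l) for i2, w, l in annotations if i2 == i] for i in items}

    # Initialise posteriors over true classes from per-item vote proportions.
    post = {}
    for i in items:
        counts = [0.0] * n_classes
        for _, l in by_item[i]:
            counts[l] += 1.0
        post[i] = [c / sum(counts) for c in counts]

    for _ in range(n_iter):
        # M-step: re-estimate class prior and worker confusion matrices.
        prior = [smoothing] * n_classes
        conf = {w: [[smoothing] * n_classes for _ in range(n_classes)]
                for w in workers}
        for i in items:
            for k in range(n_classes):
                prior[k] += post[i][k]
                for w, l in by_item[i]:
                    conf[w][k][l] += post[i][k]
        prior = [p / sum(prior) for p in prior]
        for w in workers:
            for k in range(n_classes):
                row_sum = sum(conf[w][k])
                conf[w][k] = [c / row_sum for c in conf[w][k]]

        # E-step: posterior over each item's true class given the annotations.
        for i in items:
            scores = []
            for k in range(n_classes):
                s = prior[k]
                for w, l in by_item[i]:
                    s *= conf[w][k][l]
            
            # (collect the score for class k)
                scores.append(s)
            post[i] = [s / sum(scores) for s in scores]

    return {i: max(range(n_classes), key=lambda k: post[i][k]) for i in items}

# Two reliable coders agree with the true labels; a third always answers 0.
true = [0, 1, 0, 1]
ann = [(i, w, t) for w in ("ann1", "ann2") for i, t in enumerate(true)]
ann += [(i, "spammer", 0) for i in range(4)]
print(dawid_skene(ann, n_classes=2))  # {0: 0, 1: 1, 2: 0, 3: 1}
```

Because the spammer's confusion matrix is estimated to be uninformative, its constant votes stop pulling items towards class 0, which is the kind of behaviour that distinguishes these models from majority voting.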

Resource creators can use aggregation methods to adjudicate the disagreements inherent in annotated data, but when the resource is to serve as training data for a machine learning model, the noise distillation need not be a separate step: it can be integrated into the learning process itself. In fact, by following the convention and training with adjudicated labels, we ignore any uncertainty the labellers had in their classifications. Including the coders’ disagreements in the learning signal offers the models a richer source of information than adjudicated labels: they convey not only the consensus, but may also indicate ambiguity and how humans make mistakes. This improves the generalisation capability of the models and gives them a more graceful degradation, with fewer egregious mistakes (Peterson et al., 2019; Guan et al., 2018). Some of these approaches can also be used for their noise distillation capabilities, as their learning processes also produce aggregated labels that leverage not only coder annotation patterns but also the knowledge of the task accumulated by the model (Cao et al., 2018; Rodrigues and Pereira, 2018; Albarqouni et al., 2016; Chu et al., 2020). Often, this means that fewer redundant labels are required to attain the desired level of accuracy for the aggregated labels. Thus, a second objective of the tutorial is to teach NLP researchers how they can augment their existing (deep) neural architectures to learn from data with disagreements.
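As a toy illustration of the difference (a hypothetical example, not code from any of the cited papers): the empirical distribution of annotator labels for an item can replace the one-hot adjudicated label as the target of a standard cross-entropy loss, which is the basic mechanism behind soft-label training such as Peterson et al. (2019).

```python
# Hypothetical sketch: using the distribution of annotator labels as a
# soft training target instead of the adjudicated one-hot label.
import math

def soft_targets(labels, n_classes):
    """Empirical distribution of annotator labels for one item."""
    counts = [0.0] * n_classes
    for l in labels:
        counts[l] += 1.0
    return [c / sum(counts) for c in counts]

def cross_entropy(target, predicted):
    """Accepts both one-hot (adjudicated) and soft (disagreement-aware) targets."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Three annotators said class 1, one said class 0.
soft = soft_targets([1, 1, 1, 0], n_classes=2)  # [0.25, 0.75]
hard = [0.0, 1.0]                               # majority-adjudicated label
pred = [0.3, 0.7]                               # model's predicted distribution
print(cross_entropy(hard, pred), cross_entropy(soft, pred))
```

The soft target still rewards the consensus class, but it also penalises a model that becomes overconfident on items the annotators themselves found ambiguous.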


S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab. 2016. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, 35(5):1313–1321.

Peng Cao, Yilun Xu, Yuqing Kong, and Yizhou Wang. 2018. Max-mig: an information theoretic approach for joint learning from crowds. In International Conference on Learning Representations.

Zhendong Chu, Jing Ma, and Hongning Wang. 2020. Learning from crowds by modeling common confusions.

Alexander Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28.

Melody Guan, Varun Gulshan, Andrew Dai, and Geoffrey Hinton. 2018. Who said what: Modeling individual labelers improves classification.

Iryna Gurevych and Torsten Zesch, editors. 2009. Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web). Association for Computational Linguistics, Suntec, Singapore.

An Thanh Nguyen, Byron Wallace, Junyi Jessy Li, Ani Nenkova, and Matthew Lease. 2017. Aggregating and predicting sequence labels from crowd annotations. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 299–309, Vancouver, Canada. Association for Computational Linguistics.

Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326.

Silviu Paun, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio. 2018a. Comparing bayesian models of annotation. Transactions of the Association for Computational Linguistics, 6:571–585.

Silviu Paun, Jon Chamberlain, Udo Kruschwitz, Juntao Yu, and Massimo Poesio. 2018b. A probabilistic annotation model for crowdsourcing coreference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1926–1937, Brussels, Belgium. Association for Computational Linguistics.

Silviu Paun and Dirk Hovy, editors. 2019. Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP. Association for Computational Linguistics, Hong Kong.

Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

Filipe Rodrigues and Francisco C Pereira. 2018. Deep learning from crowds. In Thirty-Second AAAI Conference on Artificial Intelligence.

Edwin Simpson, Erik-Lân Do Dinh, Tristan Miller, and Iryna Gurevych. 2019. Predicting humorousness and metaphor novelty with Gaussian process preference learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5716–5728, Florence, Italy. Association for Computational Linguistics.

Edwin Simpson and Iryna Gurevych. 2020. Scalable Bayesian preference learning for crowds. Machine Learning, pages 1–30.

Edwin D. Simpson and Iryna Gurevych. 2019. A Bayesian approach for sequence tagging with crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1093–1104, Hong Kong, China. Association for Computational Linguistics.

Yan Yan, Rómer Rosales, Glenn Fung, Ramanathan Subramanian, and Jennifer Dy. 2014. Learning from multiple annotators with varying expertise. Machine Learning, 95(3):291–327.

Li’ang Yin, Jianhua Han, Weinan Zhang, and Yong Yu. 2017. Aggregating crowd wisdoms with label-aware autoencoders. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1325–1331.

Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. 2017. Truth inference in crowdsourcing: Is the problem solved? Proc. VLDB Endow., 10(5):541–552.