Part 1. Motivation and Early Approaches to Annotation Analysis
Introduction to the field. Shortcomings of early practices.
Modeling the annotation process with a probabilistic model. How to encode our assumptions about the coders, the difficulty of the items, and their interactions. Using hierarchical models to alleviate sparsity.
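To make this setup concrete, below is a minimal numpy sketch of the classic Dawid and Skene (1979) EM estimator, the prototypical probabilistic model of annotation: each coder gets a confusion matrix, each item a posterior over true classes, and the two are updated alternately. The function and variable names are our own illustration, not the tutorial's reference implementation, and the hierarchical priors used to alleviate sparsity are omitted for brevity.

```python
import numpy as np

def dawid_skene(ann, n_classes, n_iter=50):
    """ann: (n_items, n_coders) array of label indices, -1 where a coder skipped an item."""
    n_items, n_coders = ann.shape
    # Initialise the true-class posteriors from per-item vote fractions.
    T = np.zeros((n_items, n_classes))
    for i in range(n_items):
        labels = ann[i][ann[i] >= 0]
        T[i] = np.bincount(labels, minlength=n_classes) + 1e-6
    T /= T.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class prevalence and per-coder confusion matrices.
        pi = T.mean(axis=0)
        conf = np.full((n_coders, n_classes, n_classes), 1e-6)
        for i in range(n_items):
            for j in range(n_coders):
                if ann[i, j] >= 0:
                    conf[j, :, ann[i, j]] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over each item's true class given the coders' labels.
        logT = np.tile(np.log(pi), (n_items, 1))
        for i in range(n_items):
            for j in range(n_coders):
                if ann[i, j] >= 0:
                    logT[i] += np.log(conf[j, :, ann[i, j]])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T, conf
```

Unlike majority voting, the model downweights unreliable coders automatically: a coder who always outputs the same class ends up with a near-degenerate confusion matrix and contributes almost nothing to the posterior.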
Recommended readings: (Passonneau and Carpenter, 2014; Paun et al., 2018a)
Part 2. Advanced Models of Annotation
Aggregating sequence labels. In such tasks the labels of nearby items have known interdependencies. We discuss probabilistic approaches that model these sequential dependencies, both among the ground truth labels and among the annotations. We illustrate the utility of these methods on a named entity recognition (NER) task.
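As a toy illustration of why sequential dependencies matter (our own sketch, not one of the tutorial's methods): token-wise majority voting can produce invalid BIO sequences such as an I tag directly after an O. The sketch below instead runs a Viterbi decode over vote-based emission scores with hard transition constraints, so the adjudicated sequence is always a valid BIO chain.

```python
import numpy as np

LABELS = ["O", "B", "I"]
# Transitions allowed in a valid BIO sequence (I may only follow B or I).
ALLOWED = {("O", "O"), ("O", "B"), ("B", "O"), ("B", "B"),
           ("B", "I"), ("I", "O"), ("I", "B"), ("I", "I")}

def viterbi_aggregate(annotations):
    """annotations: list of annotator label sequences over the same tokens."""
    n_tokens, K = len(annotations[0]), len(LABELS)
    # Emission scores: smoothed log vote fractions per token.
    votes = np.full((n_tokens, K), 1e-2)
    for seq in annotations:
        for t, lab in enumerate(seq):
            votes[t, LABELS.index(lab)] += 1
    emis = np.log(votes / votes.sum(axis=1, keepdims=True))
    trans = np.full((K, K), -np.inf)
    for a, b in ALLOWED:
        trans[LABELS.index(a), LABELS.index(b)] = 0.0
    # Viterbi decode under the BIO constraints.
    delta = emis[0].copy()
    delta[LABELS.index("I")] = -np.inf  # a sequence cannot start with I
    back = np.zeros((n_tokens, K), dtype=int)
    for t in range(1, n_tokens):
        scores = delta[:, None] + trans + emis[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [LABELS[i] for i in reversed(path)]
```

The probabilistic sequence models discussed in this part go further, also learning per-annotator reliabilities rather than treating all votes equally.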
Aggregating anaphoric judgements for coreference resolution. For this task the annotation scheme does not use a fixed class space. The judgements consist of labels assigned to textual mentions, marking when new entities are introduced into the discourse, non-referring expressions such as expletives or predicative NPs, and the most recent antecedents of previously discussed entities. We explain how to apply a probabilistic mention-pair model to aggregate the labels and build coreference chains.
Preference labels: why comparisons can be more reliable than ratings or classifications. We show how to reformulate NLP tasks with ambiguous categories or scores as preference learning, giving an example application related to argument persuasiveness. We introduce probabilistic approaches for aggregating preference judgements to infer a gold-standard ranking.
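A minimal sketch of preference aggregation, assuming the classic Bradley-Terry model (a simpler relative of the Gaussian process preference learning of Simpson et al. (2019)): each item gets a latent score, a comparison is won with probability given by the logistic of the score difference, and we fit the scores by gradient ascent on the likelihood.

```python
import numpy as np

def bradley_terry(n_items, comparisons, n_iter=200, lr=0.1):
    """comparisons: list of (winner, loser) index pairs collected from annotators.
    Returns latent scores; sorting items by score gives the aggregate ranking."""
    s = np.zeros(n_items)
    for _ in range(n_iter):
        grad = np.zeros(n_items)
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(s[l] - s[w]))  # P(w beats l)
            grad[w] += 1 - p
            grad[l] -= 1 - p
        s += lr * grad
        s -= s.mean()  # identifiability: scores are only defined up to a shift
    return s
```

Because each judgement is a relative one, a few inconsistent comparisons shift the inferred scores only slightly rather than corrupting an absolute rating scale.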
Aggregation with Variational Autoencoders. This framework allows us to use neural networks to capture complex non-linear relationships between the annotations and the ground truth. By doing so, we avoid having to manually identify and specify these relationships, as in standard probabilistic models.
Recommended readings: (Simpson et al., 2019; Yin et al., 2017)
Part 3. Learning with Multiple Annotators
Learning with human uncertainty. The standard practice when training classifiers is to learn from data where each example has a single label. In doing so, however, any uncertainty the labellers had about their classifications is ignored. We discuss a few approaches to learning from the label distributions produced by the coders, which can improve classifier performance.
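As a small illustration (our own notation, not a specific method from the readings), the per-item label distributions can be built directly from annotation counts and plugged into the usual cross-entropy loss in place of one-hot targets:

```python
import numpy as np

def soft_targets(vote_counts):
    """Turn per-item annotation counts into target label distributions."""
    counts = np.asarray(vote_counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def soft_cross_entropy(logits, targets):
    """Cross-entropy against soft label distributions rather than one-hot labels."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

An item labelled positive by 8 of 10 coders then contributes a target of [0.8, 0.2] instead of [1, 0], so the classifier is penalised for being overconfident on items humans themselves found ambiguous.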
Humans are noisy. The success of the approaches from the previous point relies on the quality of the target distributions, i.e., whether the collected annotations offer a good representation of the coders' dissent. That may not always be the case, e.g., when the number of annotators is too low to obtain a good proxy for the human uncertainty, or when noise intervenes and skews the distributions. To address this, we discuss a few training approaches that also model the accuracy of the coders and alleviate their biases, with an emphasis on neural methods.
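A toy numpy illustration of the problem and its remedy: a biased coder's confusion matrix skews the observed label distribution away from the truth, and if the confusion matrix is known the bias can be inverted. Neural approaches such as Rodrigues and Pereira (2018) instead learn such per-annotator confusion matrices jointly with the classifier; the numbers below are made up for the example.

```python
import numpy as np

# Hypothetical confusion matrix: C[k, l] = P(coder outputs label l | true class k).
# This coder over-uses class 0 when the true class is 1.
C = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p_true = np.array([0.5, 0.5])      # actual class distribution
p_observed = p_true @ C            # raw annotation distribution is biased toward class 0
p_recovered = np.linalg.solve(C.T, p_observed)  # inverting the noise model recovers p_true
```

Training directly on p_observed would teach the classifier the coder's bias; modelling the confusion matrix lets the downstream model target p_true instead.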
Recommended readings: (Peterson et al., 2019; Rodrigues and Pereira, 2018)
Part 4. Practical Session
Introduce the audience to an implementation of a probabilistic (Dawid and Skene, 1979) and a neural (Rodrigues and Pereira, 2018) model of annotation. The instructors will provide an example dataset and implementations of the two models, then run through a few short exercises that will help the audience understand and apply the methods to a real NLP task. The exercises will include comparing majority voting with the model of Dawid and Skene (1979), and training a downstream model on adjudicated labels compared to training directly on crowdsourced labels with the approach of Rodrigues and Pereira (2018). The dataset and code will be provided freely on the tutorial website.
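As a reference point for the first exercise, the majority-voting baseline is only a few lines (a sketch, assuming per-item lists of labels; ties are broken arbitrarily):

```python
from collections import Counter

def majority_vote(labels_per_item):
    """Baseline adjudication: pick the most frequent label for each item."""
    return [Counter(labels).most_common(1)[0][0] for labels in labels_per_item]
```

The exercise then contrasts these adjudicated labels with the Dawid and Skene (1979) posteriors, which additionally account for each coder's reliability.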
The practical session will take you through a notebook, which you can obtain by checking out the GitHub repo here: https://github.com/UKPLab/arxiv2018-bayesian-ensembles, then running `jupyter notebook` on the command line from inside the repository, then opening `aggregation_tutorial.ipynb`.
References
Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28.
Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2:311–326.
Silviu Paun, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz, and Massimo Poesio. 2018a. Comparing Bayesian models of annotation. Transactions of the Association for Computational Linguistics, 6:571–585.
Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Filipe Rodrigues and Francisco C Pereira. 2018. Deep learning from crowds. In Thirty-Second AAAI Conference on Artificial Intelligence.
Edwin Simpson, Erik-Lan Do Dinh, Tristan Miller, and Iryna Gurevych. 2019. Predicting humorousness and metaphor novelty with Gaussian process preference learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5716–5728, Florence, Italy. Association for Computational Linguistics.
Li’ang Yin, Jianhua Han, Weinan Zhang, and Yong Yu. 2017. Aggregating crowd wisdoms with label-aware autoencoders. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1325–1331.