Poster Session

POSTERS:

Poster #1: How (Non-)Optimal is the Lexicon?

Author: Tiago Pimentel

Abstract: The mapping of lexical meanings to wordforms is a major feature of natural languages. While usage pressures might assign short words to frequent meanings (Zipf's law of abbreviation), the need for a productive and open-ended vocabulary, local constraints on sequences of symbols, and various other factors all shape the lexicons of the world's languages. Despite their importance in shaping lexical structure, the relative contributions of these factors have not been fully quantified. Taking a coding-theoretic view of the lexicon and making use of a novel generative statistical model, we define upper bounds for the compressibility of the lexicon under various constraints. Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality and to explore the relative costs of major constraints on natural codes. We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes -- as measured by code length.
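
The law of abbreviation mentioned above is easy to check descriptively. Below is a minimal sketch of such a check; it is not the paper's generative model or its compressibility bounds, and `corpus.txt` is a hypothetical whitespace-tokenised corpus file.

```python
# A minimal descriptive check of Zipf's law of abbreviation (not the paper's
# generative model or its coding-theoretic bounds).
from collections import Counter

from scipy.stats import spearmanr

with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().split())

words = list(counts)
freqs = [counts[w] for w in words]
lengths = [len(w) for w in words]

# Under the law of abbreviation, frequency and length should correlate
# negatively: frequent meanings get short codes.
rho, p = spearmanr(freqs, lengths)
print(f"Spearman rho = {rho:.3f} (p = {p:.2g})")
```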

Poster #2: Measuring and Increasing Context Usage in Context-Aware Machine Translation

Author: Patrick Fernandes

Abstract: Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context -- context from sentences other than those currently being translated. However, while many current methods present model architectures that theoretically can use this extra context, it is often not clear how much they do actually utilize it at translation time. In this paper, we introduce a new metric, conditional cross-mutual information, to quantify the usage of context by these models. Using this metric, we measure how much document-level machine translation systems use particular varieties of context. We find that target context is referenced more than source context, and that conditioning on a longer context has a diminishing effect on results. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models. Experiments show that our method increases context usage and that this is reflected in translation quality according to metrics such as BLEU and COMET, as well as in performance on anaphoric pronoun resolution and lexical cohesion contrastive datasets.
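
The metric lends itself to a direct Monte Carlo estimate. A minimal sketch of conditional cross-mutual information as the abstract describes it, assuming the reference-token log-probabilities have already been scored by a context-aware and a context-agnostic model (obtaining them is not shown):

```python
# Monte Carlo estimate of CXMI = H(Y|X) - H(Y|X, C): the average gain in
# log-probability of the reference tokens when the model also sees context C.
import numpy as np

def cxmi(logp_with_context: np.ndarray, logp_without_context: np.ndarray) -> float:
    """Estimate of H(Y|X) - H(Y|X, C) in nats; positive = context is used."""
    assert logp_with_context.shape == logp_without_context.shape
    return float(np.mean(logp_with_context - logp_without_context))

# Made-up numbers: context raises the probability of the reference tokens.
with_ctx = np.log([0.40, 0.25, 0.60])
without_ctx = np.log([0.30, 0.20, 0.55])
print(f"CXMI ≈ {cxmi(with_ctx, without_ctx):.3f} nats")
```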

Poster #3: NLP Across Languages and/or Tasks

Author: Rob van der Goot

Abstract: This poster consists of two parts: 1) multi-lingual dataset creation. Recent efforts have led to a dataset collection of individual datasets for the lexical normalization task for 13 languages. For the tasks of slot and intent detection for digital assistants, an English dataset has been translated and annotated for 12 other languages. 2) multi-task multi-lingual learning. While neural networks have opened up new possibilities for multi-task learning, massively pre-trained contextual embeddings have led to a boost in performance for multi-lingual setups. We propose MaChAmp, a toolkit that can be used to easily train models for multiple tasks and/or languages. It supports 8 task-types, including sequence labeling, generation and masked language modeling.
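
As an illustration of the multi-task paradigm only (this is not MaChAmp's actual interface, which is driven by configuration files rather than code like this), a minimal PyTorch sketch of one shared encoder with per-task heads and alternating task batches:

```python
# Illustration only: one shared encoder, one head per task, alternating
# batches. The tensors are random stand-ins for real task data.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, encoder, hidden, task_sizes):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_sizes.items()}
        )

    def forward(self, task, x):
        return self.heads[task](self.encoder(x))

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # stand-in for a pretrained encoder
model = MultiTaskModel(encoder, 32, {"normalization": 5, "slots": 9})
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    for task, n_labels in [("normalization", 5), ("slots", 9)]:
        x, y = torch.randn(8, 16), torch.randint(n_labels, (8,))
        loss = loss_fn(model(task, x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```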

Poster #4: AAA: Fair Evaluation for Abuse Detection Systems Wanted

Author: Agostina Calabrese

Abstract: User-generated web content is rife with abusive language that can harm others and discourage participation. Thus, a primary research aim is to develop abuse detection systems that can be used to alert and support human moderators of online communities. Such systems are notoriously hard to develop and evaluate. Even when they appear to achieve satisfactory performance on current evaluation metrics, they may fail in practice on new data. This is partly because datasets commonly used in this field suffer from selection bias, and consequently, existing supervised models overrely on cue words such as group identifiers (e.g., gay and black) which are not inherently abusive. Although there are attempts to mitigate this bias, current evaluation metrics do not adequately quantify their progress. In this work, we introduce Adversarial Attacks against Abuse (AAA), a new evaluation strategy and associated metric that better captures a model's performance on certain classes of hard-to-classify microposts, and for example penalises systems which are biased on low-level lexical features. It does so by adversarially modifying the model developer's training and test data to generate plausible test samples dynamically. We make AAA available as an easy-to-use tool, and show its effectiveness in error analysis by comparing the AAA performance of several state-of-the-art models on multiple datasets. This work will inform the development of detection systems and contribute to the fight against abusive language online.
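
A toy probe in the spirit of the bias described above, though not AAA's actual adversarial generation procedure: inject group identifiers into posts the model already treats as benign and count how often the decision flips. `predict` is a hypothetical stand-in for any abuse classifier.

```python
# Illustrative cue-word bias probe (hypothetical, not the AAA tool itself).
def identifier_flip_rate(predict, benign_posts, identifiers=("gay", "black")):
    """predict(text) -> 1 for abusive, 0 otherwise."""
    flips, total = 0, 0
    for post in benign_posts:
        if predict(post) == 1:
            continue  # only count posts the model already treats as benign
        for ident in identifiers:
            total += 1
            if predict(f"{ident} {post}") == 1:
                flips += 1  # the identifier alone flipped the decision
    return flips / max(total, 1)

# A model over-relying on cue words scores high here; a robust one near zero.
```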

Poster #5: What About the Precedent: An Information-Theoretic Analysis of Common Law

Author: Josef Valvoda

Abstract: In common law, the outcome of a new case is determined mostly by precedent cases, rather than by existing statutes. However, how exactly does the precedent influence the outcome of a new case? Answering this question is crucial for guaranteeing fair and consistent judicial decision-making. We are the first to approach this question computationally by comparing two longstanding jurisprudential views: Halsbury's, who believes that the arguments of the precedent are the main determinant of the outcome, and Goodhart's, who believes that what matters most is the precedent's facts. We base our study on the corpus of legal cases from the European Court of Human Rights (ECtHR), which allows us to access not only the case itself, but also cases cited in the judges' arguments (i.e. the precedent cases). Taking an information-theoretic view, and modelling the question as a case outcome classification task, we find that the precedent's arguments share 0.38 nats of information with the case's outcome, whereas the precedent's facts only share 0.18 nats of information (i.e., 58% less), suggesting that Halsbury's view may be more accurate in this specific court. In a qualitative analysis, however, we find that there are specific statutes where Goodhart's view dominates, and we present some evidence that these are the ones where the legal concept at hand is less straightforward.
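
A minimal sketch of how such quantities can be estimated: mutual information in nats as the reduction in outcome uncertainty, MI ≈ H(Y) - H(Y|X), with H(Y|X) approximated by a trained classifier's held-out cross-entropy. This mirrors the information-theoretic framing, not the authors' exact estimator.

```python
# MI(X; Y) ≈ H(Y) - H(Y|X) in nats, with H(Y|X) approximated by a trained
# outcome classifier's cross-entropy on held-out cases.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())  # natural log, so the unit is nats

def mutual_information(labels: np.ndarray, probs: np.ndarray) -> float:
    """labels: gold outcomes; probs: classifier's predicted outcome probabilities."""
    cond_entropy = float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))
    return entropy(labels) - cond_entropy

# Run once with argument-based features and once with fact-based features to
# compare quantities like the 0.38 vs. 0.18 nats reported above.
```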

Poster #6: Mirror-BERT: Transforming BERT to Universal Language Encoders with Self-Supervised Learning

Author: Fangyu Liu

Abstract: Previous work has indicated that pretrained Masked Language Models (MLMs) are not effective as universal lexical and sentence encoders off-the-shelf, i.e., without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective lexical and sentence encoders even without any additional data, relying simply on self-supervision. We propose an extremely simple, fast, and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in less than a minute with no access to additional external knowledge. Mirror-BERT relies on identical and/or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during “identity fine-tuning”. We report huge gains over off-the-shelf MLMs with Mirror-BERT both in lexical-level and in sentence-level tasks, across different domains and different languages. Notably, in sentence similarity (STS) and question-answer entailment (QNLI) tasks, our self-supervised Mirror-BERT model even matches the performance of the Sentence-BERT models from prior work which rely on annotated task data. Finally, we delve deeper into the inner workings of MLMs, and suggest some evidence on why this simple Mirror-BERT fine-tuning approach can yield effective universal lexical and sentence encoders.
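
A simplified sketch of identity fine-tuning, assuming the standard Hugging Face transformers API (the actual Mirror-BERT recipe additionally uses random span masking as augmentation): the same strings are encoded twice, dropout makes the two views differ, and an InfoNCE loss pulls matching pairs together.

```python
# Simplified identity fine-tuning in the spirit of Mirror-BERT.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout on: it is the source of the two "views"
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["a cat sat on the mat", "machine translation", "the lexicon"]
batch = tok(texts, return_tensors="pt", padding=True)

def embed(batch):
    out = model(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)  # mean-pool over real tokens
    return (out * mask).sum(1) / mask.sum(1)

z1, z2 = embed(batch), embed(batch)               # two dropout views of each string
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / 0.04  # temperature
labels = torch.arange(len(texts))                 # the i-th pair is the positive
loss = F.cross_entropy(sim, labels)
loss.backward()
opt.step()
```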

Poster #7: Smoothing and Shrinking the Sparse Seq2Seq Search Space

Author: Ben Peters

Abstract: Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one unsatisfying aspect is its length bias: models give high scores to short, inadequate hypotheses and often make the empty string the argmax—the so-called cat got your tongue problem. Recently proposed entmax-based sparse sequence-to-sequence models present a possible solution, since they can shrink the search space by assigning zero probability to bad hypotheses, but their ability to handle word-level tasks with transformers has never been tested. In this work, we show that entmax-based models effectively solve the cat got your tongue problem, removing a major source of model error for neural machine translation. In addition, we generalize label smoothing, a critical regularization technique, to the broader family of Fenchel-Young losses, which includes both cross-entropy and the entmax losses. Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion and deliver improvements and better calibration properties on cross-lingual morphological inflection and machine translation for 7 language pairs.
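
To make the sparsity idea concrete, here is sparsemax, the simplest member of the entmax family (the paper uses entmax losses more generally; a maintained implementation lives in the `entmax` package). Because the projection can assign exact zeros, hypotheses such as the empty string can receive zero probability and drop out of the search space entirely.

```python
# Sparsemax: Euclidean projection of scores onto the probability simplex.
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum       # which coordinates stay nonzero
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z     # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.2, -0.5])))  # [0.9 0.1 0. ] - exact zeros
```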

Poster #8: Time-Aware Evidence Ranking for Fact-Checking

Author: Liesbeth Allein

Abstract: Truth can vary over time. Fact-checking decisions on claim veracity should therefore take into account temporal information of both the claim and supporting or refuting evidence. In this work, we investigate the hypothesis that the timestamp of a Web page is crucial to how it should be ranked for a given claim. We delineate four temporal ranking methods that constrain evidence ranking differently and simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard. Evidence ranking in three fact-checking models is ultimately optimized using a learning-to-rank loss function. Our study reveals that time-aware evidence ranking not only surpasses relevance assumptions based purely on semantic similarity or position in a search results list, but also improves veracity predictions of time-sensitive claims in particular.
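
One plausible instance of a temporal learning-to-rank objective (the paper compares four temporal ranking methods; this pairwise hinge loss is illustrative, not necessarily one of them): evidence that the timestamp-derived gold ranking places higher should be scored above evidence placed lower.

```python
# Pairwise hinge loss against a gold ranking derived from evidence timestamps.
import torch

def pairwise_rank_loss(scores: torch.Tensor, gold_rank: torch.Tensor, margin: float = 1.0):
    """scores: model relevance scores; gold_rank: 0 = should rank highest."""
    loss = scores.new_zeros(())
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if gold_rank[i] < gold_rank[j]:  # evidence i should outrank evidence j
                loss = loss + torch.clamp(margin - (scores[i] - scores[j]), min=0)
    return loss / max(n * (n - 1) / 2, 1)

# Toy example: the gold order comes from the pages' timestamps.
scores = torch.tensor([0.2, 1.5, 0.7], requires_grad=True)
gold_rank = torch.tensor([0, 2, 1])  # e.g. most recent page first
pairwise_rank_loss(scores, gold_rank).backward()
```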

Poster #9: BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief

Author: Nora Kassner

Abstract: Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs - a BeliefBank - that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component - a weighted MaxSAT solver - revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.
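
A toy illustration of the reasoning component. BeliefBank uses an off-the-shelf weighted MaxSAT solver; the brute-force enumeration below only works for a handful of beliefs, but it shows how a low-confidence raw answer gets revised to satisfy weighted constraints.

```python
# Brute-force weighted MaxSAT-style belief revision (toy scale only).
from itertools import product

# belief name -> (raw PTLM answer, model confidence used as a weight)
beliefs = {"is_bird": (True, 0.9), "can_fly": (False, 0.3), "has_wings": (True, 0.8)}
# (antecedent, consequent, weight): e.g. "is_bird -> can_fly"
constraints = [("is_bird", "can_fly", 2.0), ("can_fly", "has_wings", 2.0)]

names = list(beliefs)
best, best_score = None, float("-inf")
for assignment in product([True, False], repeat=len(names)):
    a = dict(zip(names, assignment))
    score = sum(w for n, (v, w) in beliefs.items() if a[n] == v)      # agree with PTLM
    score += sum(w for p, q, w in constraints if (not a[p]) or a[q])  # satisfy rules
    if score > best_score:
        best, best_score = a, score

print(best)  # the low-confidence "can_fly=False" gets revised to True
```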

Poster #10: On Sparsifying Encoder Outputs in Sequence-to-Sequence Models

Author: Biao Zhang

Abstract: Sequence-to-sequence models usually transfer all encoder outputs to the decoder for generation. In this work, by contrast, we hypothesize that these encoder outputs can be compressed to shorten the sequence delivered for decoding. We take Transformer as the testbed and introduce a layer of stochastic gates in-between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0 penalty, resulting in completely masking out a subset of encoder outputs. In other words, via joint training, the L0DROP layer forces Transformer to route information through a subset of its encoder states. We investigate the effects of this sparsification on two machine translation and two summarization tasks. Experiments show that, depending on the task, around 40–70% of source encodings can be pruned without significantly compromising quality. The decrease of the output length endows L0DROP with the potential of improving decoding efficiency, where it yields a speedup of up to 1.65× on document summarization and 1.20× on character-based machine translation against the standard Transformer. We analyze the L0DROP behaviour and observe that it exhibits systematic preferences for pruning certain word types, e.g., function words and punctuation get pruned most. Inspired by these observations, we explore the feasibility of specifying rule-based patterns that mask out encoder outputs based on information such as part-of-speech tags, word frequency and word position.
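
A sketch of such a gate layer using the hard-concrete relaxation of Louizos et al. (2018), which this line of work builds on; the paper's exact parameterisation may differ. Each encoder state receives a stochastic scalar gate, and the expected-L0 term penalises gates that stay open.

```python
# Per-position stochastic gates with a differentiable expected-L0 penalty.
import math

import torch
import torch.nn as nn

class L0DropGate(nn.Module):
    def __init__(self, dim, beta=2 / 3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one gate logit per source position
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, h):  # h: (batch, seq, dim) encoder outputs
        log_alpha = self.scorer(h).squeeze(-1)  # (batch, seq)
        u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0, 1)  # exact 0s possible
        # Probability that each gate is nonzero, summed: the expected-L0 term.
        penalty = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
        return h * z.unsqueeze(-1), penalty

gate = L0DropGate(16)
states, l0 = gate(torch.randn(2, 5, 16))  # add l0 (scaled) to the training loss
```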

Poster #11: Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks

Author: Iacer Calixto

Abstract: Masked language models have quickly become the de facto standard when processing text. Recently, several approaches have been proposed to further enrich word representations with external knowledge sources such as knowledge graphs. However, these models are devised and evaluated in a monolingual setting only. In this work, we propose a language-independent entity prediction task as an intermediate training procedure to ground word representations on entity semantics and bridge the gap across different languages by means of a shared vocabulary of entities. We show that our approach effectively injects new lexical-semantic knowledge into neural models, improving their performance on different semantic tasks in the zero-shot cross-lingual setting. As an additional advantage, our intermediate training does not require any supplementary input, allowing our models to be applied to new datasets right away. In our experiments, we use Wikipedia articles in up to 100 languages and already observe consistent gains compared to strong baselines when predicting entities using only the English Wikipedia. Further adding extra languages leads to improvements in most tasks up to a certain point, but overall we found it non-trivial to scale improvements in model transferability by training on ever increasing amounts of Wikipedia languages.
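
A schematic of the intermediate task (names and sizes here are illustrative, not the authors' implementation): hyperlink anchors from Wikipedia in any language are classified into one entity vocabulary shared across all languages.

```python
# Entity prediction over a shared cross-lingual entity vocabulary (schematic).
import torch
import torch.nn as nn

n_entities, hidden = 250_000, 768        # shared entity vocabulary, encoder size
entity_head = nn.Linear(hidden, n_entities)
loss_fn = nn.CrossEntropyLoss()

# Suppose `anchor_states` are encoder states pooled over hyperlink spans drawn
# from Wikipedia in any language, and `entity_ids` index the linked pages.
anchor_states = torch.randn(4, hidden)   # random stand-in for pooled states
entity_ids = torch.randint(n_entities, (4,))
loss = loss_fn(entity_head(anchor_states), entity_ids)
loss.backward()
```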

Poster #12: The paradox of the compositionality of natural language: a neural machine translation case study

Author: Verna Dankers

Abstract: To move towards human-like linguistic performance, neural networks need to show compositional generalisation. Whether they exhibit this property is often studied using artificial languages and highly controlled datasets, for which the compositionality of input fragments can be guaranteed and their meanings algebraically composed. However, natural language contains many phenomena that are not strictly compositional, which makes it unclear how conclusions drawn about the effectiveness of modelling techniques for compositional generalisation would also hold for models of natural language. In this work, we re-instantiate three tests from the literature measuring compositional generalisation, applying them to neural machine translation (NMT). The results highlight two main issues: the inconsistent behaviour of NMT models and their failure to operate at the right level of processing. Our work presents an experimental study and is a call to action: we should rethink the evaluation of compositionality in neural networks of natural language, where composing meaning is not as straightforward as doing the math.

Poster #13: The Causal Effect of Grammatical Gender on Contextualized Representations

Author: Afra Amini

Abstract: Probing is increasingly being used to interpret and analyze deep neural models in natural language processing. Yet, the limitations and weaknesses of probes have recently become a subject of much debate. In this work, we investigate the ability of probes to ascertain causal relationships. To this end, we propose a causal framework, which incorporates input-level interventions on real-world data, in order to compute the causal effect of a property of interest on a pre-trained deep neural model. We apply this methodology in a case study on the effects of gender on contextualized representations of pre-trained models. In our experiments, we see that standard probes fail to learn gender in a counterfactual setting. We additionally find that the causal effect of gender is stable across different datasets, and further, is aligned with the first principal component of the gender subspace, which we show is a low-dimensional space.
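
A schematic of the input-level intervention (the encoder and the counterfactual pair below are toy stand-ins; the paper constructs such pairs carefully from real data): encode each sentence and its gender-swapped counterfactual, then average the representation shift.

```python
# Schematic average-treatment-effect computation over counterfactual pairs.
import numpy as np

def encode(sentence: str) -> np.ndarray:  # hypothetical stand-in for an encoder
    rng = np.random.default_rng(abs(hash(sentence)) % 2**32)
    return rng.normal(size=16)

pairs = [
    ("la cuchara está aquí", "el cucharón está aquí"),  # feminine vs. masculine noun
]
effects = np.stack([encode(a) - encode(b) for a, b in pairs])
ate = effects.mean(axis=0)  # average effect of the intervention on representations
print(np.linalg.norm(ate))
```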

Poster #14: Analysing Human Strategies of Information Transmission

Author: Mario Giulianelli

Abstract: Speakers are thought to use rational information transmission strategies for efficient communication; for example, they keep the information density of their sentences uniform over the course of written texts (Genzel and Charniak, 2002; 2003)—especially so within coherent contextual units, such as paragraphs. In this work, we test whether, and within which contextual units, speakers adhere to the principle of uniform information density in monologue and in task-oriented dialogue. Using a pre-trained Transformer-based language model, which provides more robust measurements than the n-gram models used in prior work, we confirm that speakers adhere to the principle in newspaper articles and present new evidence that they also do in written cooperative reference games as well as in spoken dialogues involving instruction giving and following. Because patterns of information transmission vary within different contextual units, we then use the context window of our language model to estimate information density as a function of the relevant utterance context; this was never explicitly measured in previous related work. We find that, when context is explicitly factored in, speakers transmit information at a stable rate in newspaper articles but that this rate decreases in spoken open domain and written task-oriented dialogues. We suggest that a more faithful model of communication should include production efforts and goal-oriented rewards. Our hope is that this line of work will inform the development of dialogue generation models that organise the transmission of information in a more human-like fashion.
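
The core measurement is per-token surprisal under a pretrained language model. A minimal sketch using GPT-2 via the transformers library, without the utterance-context conditioning that the paper adds on top:

```python
# Mean per-token surprisal (in nats) of a sentence under GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_surprisal(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Surprisal of token t is -log p(token_t | tokens_<t).
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -logp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return surprisal.mean().item()

for s in ["The cat sat on the mat.", "Quantum entanglement defies intuition."]:
    print(f"{mean_surprisal(s):.2f} nats/token  {s}")
```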

Poster #15: PADA: A Prompt-based Autoregressive Approach for Adaptation to Unseen Domains

Author: Eyal Ben-David

Abstract: Natural Language Processing algorithms have made incredible progress, but they still struggle when applied to out-of-distribution examples. We address a challenging and underexplored version of this domain adaptation problem, where an algorithm is trained on several source domains, and then applied to examples from an unseen domain that is unknown at training time. Particularly, no examples, labeled or unlabeled, or any other knowledge about the target domain are available to the algorithm at training time. We present PADA: A Prompt-based Autoregressive Domain Adaptation algorithm, based on the T5 model. Given a test example, PADA first generates a unique prompt and then, conditioned on this prompt, labels the example with respect to the NLP task. The prompt is a sequence of unrestricted length, consisting of pre-defined Domain Related Features (DRFs) that characterize each of the source domains. Intuitively, the prompt is a unique signature that maps the test example to the semantic space spanned by the source domains. In experiments with 3 tasks (text classification and sequence tagging), for a total of 14 multi-source adaptation scenarios, PADA substantially outperforms strong baselines.
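
A schematic of the two-step inference with T5 (prompt format and prefixes here are illustrative, not the authors' exact setup, and an off-the-shelf t5-base stands in for a trained PADA model): first generate a DRF-based prompt for the test example, then condition on it to predict the task label.

```python
# Two-step prompt-then-predict inference, schematically.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for trained PADA

example = "The battery drains within an hour of unplugging."

# Step 1: generate a domain signature (a sequence of DRFs) for the example.
ids = tok("generate prompt: " + example, return_tensors="pt").input_ids
prompt = tok.decode(model.generate(ids, max_new_tokens=20)[0], skip_special_tokens=True)

# Step 2: prepend the generated prompt and generate the task label.
ids = tok(prompt + " classify: " + example, return_tensors="pt").input_ids
label = tok.decode(model.generate(ids, max_new_tokens=5)[0], skip_special_tokens=True)
print(prompt, "->", label)
```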

Poster #16: Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions

Author: Daniel Rosenberg

Abstract: Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.
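
One plausible formalisation consistent with the abstract (the paper's exact definition may differ): among examples answered correctly in their original form, measure the fraction whose augmented counterparts are also answered correctly.

```python
# A RAD-style consistency measure between original and augmented examples.
def rad(correct_original: list[bool], correct_augmented: list[bool]) -> float:
    both = sum(o and a for o, a in zip(correct_original, correct_augmented))
    base = sum(correct_original)
    return both / base if base else 0.0

# Toy example: accuracy on originals looks fine, robustness is low.
print(rad([True, True, True, False], [True, False, False, False]))  # 0.33...
```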

Poster #17: Mixing Formal Linguistics and Optimization

Author: Lucas Weber

Abstract: Language models (LMs) have become the backbone of state-of-the-art NLP, but their learning dynamics remain an understudied domain. Recently, Weber et al. (2021) suggested investigating the learning dynamics of LMs by considering them implicit multi-task learners, where language construction rules are seen as implicit subtasks.

For a number of reasons, Weber et al.'s approach of filtering specific phenomena out of the training set is restricted in the types of phenomena it can be expanded to. Here, we solve these problems by introducing the easy-to-apply and computationally efficient method of similarity probing, and we use it to create a topology of a wide range of linguistic phenomena. We further interconnect multi-task learning and language modelling by analysing our models’ gradients following Yu et al. (2020)’s idea of forgetting through conflicting gradients. Interestingly, we find that gradients in language models are (almost) never conflicting, making it impossible to confirm Yu et al.’s relationship in language models; however, data-points that contain different variations of the same linguistic phenomenon tend to have similar gradients. With our study, we connect formal linguistic phenomena with the core feature of optimization: gradients. This connection sparks exciting questions for computational linguistics (e.g. what connects phenomena that have similar gradients?) as well as for machine learning practitioners (e.g. how can we use linguistic knowledge to improve model performance?). Further, we show that despite the similarities of implicit and explicit multi-task learning settings, gradients in both types of problems differ substantially, raising the question of what underlies this difference.
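
A minimal sketch of the gradient analysis: compute gradients for two batches of data-points (two linguistic phenomena, or two explicit tasks) and measure their cosine similarity; negative values correspond to "conflicting" gradients in the sense of Yu et al. (2020). The linear model below is a stand-in for a language model.

```python
# Cosine similarity between the gradients induced by two batches.
import torch

def flat_grad(model, loss):
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

model = torch.nn.Linear(16, 4)  # stand-in for a language model
loss_fn = torch.nn.CrossEntropyLoss()

x_a, y_a = torch.randn(8, 16), torch.randint(4, (8,))  # phenomenon A batch
x_b, y_b = torch.randn(8, 16), torch.randint(4, (8,))  # phenomenon B batch

g_a = flat_grad(model, loss_fn(model(x_a), y_a))
g_b = flat_grad(model, loss_fn(model(x_b), y_b))
cos = torch.nn.functional.cosine_similarity(g_a, g_b, dim=0)
print(f"gradient cosine similarity: {cos:.3f}")  # < 0 would mean conflict
```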

Poster #18: Interpreting knowledge graph relation representation from word embeddings

Author: Carl Allen

Abstract: Many models learn representations of knowledge graph data by exploiting its low-rank latent structure, encoding known relations between entities and enabling unknown facts to be inferred. To predict whether a relation holds between entities, embeddings are typically compared in the latent space following a relation-specific mapping. Whilst their predictive performance has steadily improved, how such models capture the underlying latent structure of semantic information remains unexplained. Building on recent theoretical understanding of word embeddings, we categorise knowledge graph relations into three types and for each derive explicit requirements of their representations. We show that empirical properties of relation representations and the relative performance of leading knowledge graph representation methods are justified by our analysis.
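
For concreteness, a generic sketch of the setting the analysis covers: entity embeddings compared after a relation-specific mapping. The multiplicative form below (as in DistMult) is one common instance; the paper derives which functional forms different relation types actually require.

```python
# DistMult-style trilinear triple scoring; embeddings are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
entity = {e: rng.normal(size=16) for e in ["paris", "france", "berlin"]}
relation = {"capital_of": rng.normal(size=16)}  # one vector per relation

def score(subj: str, rel: str, obj: str) -> float:
    # Higher = the triple (subj, rel, obj) is predicted more plausible.
    return float(np.sum(entity[subj] * relation[rel] * entity[obj]))

print(score("paris", "capital_of", "france"), score("berlin", "capital_of", "france"))
```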

Poster #19: Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals

Author: Yanai Elazar

Abstract: A growing body of work makes use of probing to investigate the working of neural models, often considered black boxes. Recently, an ongoing debate emerged surrounding the limitations of the probing paradigm. In this work, we point out the inability to infer behavioral conclusions from probing results and offer an alternative method that focuses on how the information is being used, rather than on what information is encoded. Our method, Amnesic Probing, follows the intuition that the utility of a property for a given task can be assessed by measuring the influence of a causal intervention that removes it from the representation. Equipped with this new analysis tool, we can ask questions that were not possible before, e.g. is part-of-speech information important for word prediction? We perform a series of analyses on BERT to answer these types of questions. Our findings demonstrate that conventional probing performance is not correlated to task importance, and we call for increased scrutiny of claims that draw behavioral or causal conclusions from probing results.
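
A simplified sketch of the amnesic intervention: a single nullspace-projection step that removes the linear direction a probe uses for some property (the actual method iterates this with retrained probes), after which the property is linearly unrecoverable and downstream behaviour can be re-measured.

```python
# One nullspace-projection step removing a probe's direction from representations.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(500, 64)                    # stand-in for BERT representations
y = (X[:, 0] + 0.1 * np.random.randn(500)) > 0  # a "property" (e.g. POS) to remove

probe = LogisticRegression().fit(X, y)
w = probe.coef_ / np.linalg.norm(probe.coef_)   # direction the probe relies on
P = np.eye(X.shape[1]) - w.T @ w                # projection onto its nullspace
X_amnesic = X @ P

# After projection the property is (linearly) unrecoverable; feed X_amnesic to
# the downstream head and compare performance with and without the removal.
print(LogisticRegression().fit(X_amnesic, y).score(X_amnesic, y))  # ≈ chance
```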

Poster #20: Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT

Author: Lena Voita

Abstract: Unlike traditional statistical MT, which decomposes the translation task into distinct separately learned components, neural machine translation uses a single neural network to model the entire translation process. Despite neural machine translation being the de facto standard, it is still not clear how NMT models acquire different competences over the course of training, and how this mirrors the different models in traditional SMT. In this work, we look at the competences related to three core SMT components and find that during training, NMT first focuses on learning target-side language modeling, then improves translation quality approaching word-by-word translation, and finally learns more complicated reordering patterns. We show that this behavior holds for several models and language pairs. Additionally, we explain how such an understanding of the training process can be useful in practice and, as an example, show how it can be used to improve vanilla non-autoregressive neural machine translation by guiding teacher model selection.