Explainable Natural Language Inference


This collaborative project between Stony Brook University (SBU) and the University of Arizona (UA) aims to develop explanation-centered approaches for natural language inference applications. This work is supported in part by the National Science Foundation.

Award title:

III: Small: Collaborative Research: Explainable Natural Language Inference


This material is based upon work supported by the National Science Foundation under Grant Numbers:

1815358 (Stony Brook University)

1815948 (University of Arizona)


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


September 1, 2018 to August 31, 2021



1. Niranjan Balasubramanian, Stony Brook University. (PI)

2. Peter Jansen, University of Arizona. (Co-Investigator)

3. Mihai Surdeanu, University of Arizona. (Co-Investigator)

Graduate Students

  1. Heeyoung Kwon (SBU)
  2. Harsh Trivedi (SBU)
  3. Noah Weber (SBU)
  4. Qingqing Cao (SBU)
  5. Mohaddeseh Bastan (SBU)
  6. Zeyu Zhang (UA)


Collaborators

  1. Tushar Khot, Allen Institute for Artificial Intelligence
  2. Ashish Sabharwal, Allen Institute for Artificial Intelligence
  3. Oyvind Tafjord, Allen Institute for Artificial Intelligence
  4. Peter Clark, Allen Institute for Artificial Intelligence
  5. Aruna Balasubramanian, Stony Brook University

Project Goals

The major goal of this project is to develop explainable inference methods. Text-based inference methods today support question answering and information extraction capabilities. However, a key deficiency of these methods is that it is not easy to explain how the models arrive at their decisions. Our work aims to address this gap. In particular, our focus is to develop explanation-centered approaches to inference for complex question answering and information extraction tasks.

Research Challenges

The main research challenges are:

  • the lack of large-scale annotated datasets that support explainable reasoning
  • the combinatorial possibilities when aggregating information for inference, requiring targeted exploration
  • the presence of distracting information that can cause inference drift, requiring careful incorporation of information

Summary of Current Results

  • Measuring and Reducing Non-multifact Reasoning: Despite the availability of large-scale datasets for multi-hop QA, models trained on them do not appear to aggregate information from multiple facts. This is a central issue to address if we are to build explainable models that aggregate information to find answers. We propose a formal condition (called DiRe) for deciding when a model is not performing valid multi-hop reasoning. The DiRe condition captures a form of bad reasoning in which a model identifies the facts needed to answer a question independently of one another, without connecting the information in them to arrive at the answer. For a recent large-scale model (XLNet) on HotpotQA, we show that only 18% of its answer score is obtained through multifact reasoning, roughly the same as that of a simpler RNN baseline. We also introduce a transformation of existing datasets that substantially reduces disconnected reasoning (by nearly 19 points in answer F1). It is complementary to other adversarial approaches for reducing bad reasoning, yielding further reductions when used in conjunction with them.
  • Unsupervised strategies for constructing explanation texts for multi-hop QA: The team proposed multiple unsupervised methods that extract and aggregate sentences that explain answers to multi-hop questions, i.e., questions that require multiple inference steps to reach the correct answer. The team showed that these approaches produce better explanations than other, supervised neural strategies. Further, when these evidence sentences are fed into a supervised neural answer classification component, they lead to better answer selection on two multi-hop QA datasets (MultiRC and QASC).
  • Entailment-based Question Answering: A fundamental issue when reasoning with multiple pieces of text-based information is that distracting information can easily derail inference. One way to address this is to design an effective mechanism that controls which pieces of information are aggregated. Our primary result in this space is based on the intuition that both filtering (i.e., finding which pieces to aggregate) and aggregation can be seen as forms of textual entailment. We showed that a pre-trained neural entailment model can be repurposed to do multi-hop question answering. On two complex QA datasets that require reasoning with multiple sentences, this approach improves answer performance while also improving our ability to locate the important sentences that support the answer, a step towards explainability. We obtain gains of 3 absolute points in F1 on MultiRC and 1.7 absolute points in F1 on OpenBookQA compared to the OpenAI Transformer, a large model with much higher capacity.
  • Predicting when the QA model has an answer: One way of understanding and explaining a QA model’s answers is to learn a separate function that can tell when the QA model has found an answer and when it cannot reliably find one. Such a capability can improve the trustworthiness of a QA system and also has implications for efficiency, since no further documents need to be processed once the answer has been found. We developed an early stopping algorithm that inspects the QA model’s internals and its scores to decide when the correct answer has been found. A simple score-based classifier is able to predict when further processing is unnecessary with 60% accuracy. Our initial attempts at using the internal representations of the QA models were largely unsuccessful.
  • Understanding the information needs of a question: In currently submitted work, we have developed (to the best of our knowledge) the largest and most detailed question classification dataset, which classifies questions into hundreds of detailed problem domains. We have also paired this with a question classification model that achieves state-of-the-art performance across several benchmark open-domain and biomedical-domain datasets. (To preserve blind review, more details will be released upon acceptance.)
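
The score-based early stopping idea described under "Predicting when the QA model has an answer" can be illustrated with a minimal sketch. This is not the project's implementation: the `Candidate` type, the `answer_with_early_stopping` function, and the fixed `threshold` are hypothetical, and the threshold stands in for the learned score-based classifier mentioned above. The sketch assumes the QA model yields one scored candidate answer per document, in retrieval order.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Candidate:
    """Best answer a (hypothetical) QA model extracted from one document."""
    answer: str
    score: float  # assumed model confidence in [0, 1]


def answer_with_early_stopping(
    per_document_candidates: Iterable[Candidate],
    threshold: float = 0.8,  # assumption: stands in for a learned stopping classifier
) -> Tuple[Optional[Candidate], int]:
    """Return the best candidate seen and how many documents were processed."""
    best: Optional[Candidate] = None
    processed = 0
    for candidate in per_document_candidates:
        processed += 1
        if best is None or candidate.score > best.score:
            best = candidate
        # Stop reading further documents once the best score is high enough.
        if best.score >= threshold:
            break
    return best, processed


ranked = [
    Candidate("Paris", 0.35),
    Candidate("Lyon", 0.41),
    Candidate("Paris", 0.92),
    Candidate("Nice", 0.10),
]
best, processed = answer_with_early_stopping(ranked)
print(best.answer, processed)
```

In this toy run the fourth document is never read, which is the efficiency benefit the summary refers to. A learned classifier would replace the fixed threshold test.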


Publications

1. Trivedi, H., Kwon, H., Khot, T., Sabharwal, A., & Balasubramanian, N. (2019). Repurposing Entailment for Multi-Hop Question Answering Tasks. NAACL-HLT.

2. Cao, Q., Weber, N., Balasubramanian, N., & Balasubramanian, A. (2019). DeQA: On-Device Question Answering. MobiSys.

3. Yang, X., Liu, Y., Xie, D., Wang, X. and Balasubramanian, N. (2019). Latent Part-of-Speech Sequences for Neural Machine Translation. EMNLP.

4. Trivedi, H., Balasubramanian, N., Khot, T. and Sabharwal, A., 2020. Measuring and Reducing Non-Multifact Reasoning in Multi-hop Question Answering. arXiv preprint arXiv:2005.00789.

5. Yadav, Bethard, and Surdeanu. (2019). Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering. EMNLP 2019.

6. Thiem and Jansen (2019). Extracting Common Inference Patterns from Semi-Structured Explanations. COIN 2019.

7. Jansen and Ustalov (2019). TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration. TextGraphs 2019.

8. Xie, Thiem, Martin, Wainwright, Marmorstein, Jansen. (2020). WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns supporting Multi-Hop Inference. LREC 2020.

9. Smith, Zhang, Culnan, Jansen. (2020). ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition. LREC 2020.

10. Xu, Jansen, Martin, Xie, Yadav, Madabushi, Tafjord, Clark. (2020). Multi-class Hierarchical Question Classification for Multiple Choice Science Exams. LREC 2020.

11. Yadav, Bethard, and Surdeanu. (2020). Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering. ACL 2020.


Talks

  1. [Slides] NAACL 2019 Talk on Repurposing Entailment for Multi-Hop Question Answering Tasks by Harsh Trivedi.
  2. [Slides] TextGraphs 2019 Shared Task talk on Explanation Regeneration by Peter Jansen.


Datasets

  1. Explanation Bank -- All of the datasets developed at the University of Arizona are available at the Explanation Bank, including the (a) detailed multi-hop inference datasets, (b) fine-grained question classification dataset, (c) fine-grained high-density science-domain NER dataset, and (d) collections of inference patterns and visualizations.



Broader Impacts

Algorithmic Advances:

  1. We show that "naturally" collected datasets can be transformed to new ones where models that train on these transformed datasets cannot rely on cheating as much as they were able to on the original dataset. This increases the reliability of the models we build in addition to the specific gains in explainability.
  2. A new entailment-based method for question answering, which helps advance the accuracy of QA techniques and takes a step towards explainability. The technique developed for repurposing entailment models for QA is a general contribution that can be used to take a model pre-trained on task A and apply it to task B. This has implications for multi-task learning in NLP.
  3. We show that a pre-trained function can be broken into parts and recombined for use in new tasks that may have different types of inputs (e.g., using pre-trained functions trained on sentences to work on documents). This is a promising idea that can be useful, in general, for reconfiguring ML models.
  4. A new method for evaluating the relevance, overlap, and coverage of multiple facts for inference in a QA setting. (Paper currently submitted)
  5. A new method for performing question classification, which is the first single method to generalize and show state-of-the-art performance across multiple question classification datasets. (Paper currently submitted)
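
As an illustration of the entailment-based question answering idea in item 2, the following minimal sketch treats both filtering and aggregation as entailment scoring. It is a toy under stated assumptions, not the project's model: `overlap_entailment` is a word-overlap stand-in for a pre-trained neural entailment model, and `tokens`, `score_hypothesis`, `choose_answer`, and `top_k` are hypothetical names. Each candidate answer is assumed to have already been rewritten as a hypothesis statement.

```python
import re
from typing import Callable, Dict, List, Tuple

EntailFn = Callable[[str, str], float]


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_entailment(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an entailment model: share of hypothesis words in the premise."""
    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    prem = set(tokens(premise))
    return sum(word in prem for word in hyp) / len(hyp)


def score_hypothesis(
    sentences: List[str],
    hypothesis: str,
    entail: EntailFn = overlap_entailment,
    top_k: int = 2,
) -> Tuple[float, List[str]]:
    # Filtering: rank sentences by how strongly each one supports the hypothesis.
    ranked = sorted(sentences, key=lambda s: entail(s, hypothesis), reverse=True)
    support = ranked[:top_k]
    # Aggregation: score the selected sentences jointly as a single premise.
    return entail(" ".join(support), hypothesis), support


def choose_answer(
    sentences: List[str],
    hypotheses: Dict[str, str],
    entail: EntailFn = overlap_entailment,
) -> Tuple[str, List[str]]:
    """Pick the answer whose hypothesis is best entailed; also return its support."""
    scored = {a: score_hypothesis(sentences, h, entail) for a, h in hypotheses.items()}
    best = max(scored, key=lambda a: scored[a][0])
    return best, scored[best][1]


sentences = [
    "Plants make food using sunlight.",
    "Making food from sunlight is called photosynthesis.",
    "Cats are mammals.",
]
hypotheses = {
    "photosynthesis": "Plants make food by photosynthesis",
    "respiration": "Plants make food by respiration",
}
answer, support = choose_answer(sentences, hypotheses)
print(answer, support)
```

The returned supporting sentences are what makes the prediction inspectable: they show which text the chosen answer was inferred from.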

Dataset Advances:

  1. A new challenge dataset for detailed question classification, substantially exceeding previous datasets on metrics of size, complexity, the level of detail in problem classification, and the difficulty of the task. (Paper currently submitted)
  2. A continuation of collaborative work to develop a large dataset of supervised training data for the information aggregation/multi-hop inference and explanation generation tasks. At the time of writing this report, the website to disseminate the initial version of that dataset (the Explanation Bank, http://cognitiveai.org/explanationbank/) has been accessed 1,721 times.

Integration with the broader research community:

  1. Co-I Surdeanu and Co-I Jansen are (with the help of collaborators) currently running a shared task on explanation regeneration at TextGraphs 2019 using the above data, in an effort to increase the ultimate impact of this data by broadening its usage and speeding the algorithmic advances made possible through its use.

Educational Material


Highlights and Press Releases


Point of Contact

Niranjan Balasubramanian, Stony Brook University (PI)

Last Updated

July 17, 2020