Explainable Natural Language Inference


This collaborative project between Stony Brook University (SBU) and University of Arizona (UA) aims to develop explanation-centered approaches for natural language inference applications. This work is supported in part by the National Science Foundation.

Award title:

III: Small: Collaborative Research: Explainable Natural Language Inference


This material is based upon work supported by the National Science Foundation under Grant Numbers:

1815358 (Stony Brook University)

1815948 (University of Arizona)


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


September 1, 2018 to August 31, 2021



1. Niranjan Balasubramanian, Stony Brook University. (PI)

2. Peter Jansen, University of Arizona. (Co-Investigator)

3. Mihai Surdeanu, University of Arizona. (Co-Investigator)

Graduate Students

  1. Heeyoung Kwon (SBU)
  2. Harsh Trivedi (SBU)
  3. Noah Weber (SBU)
  4. Qingqing Cao (SBU)
  5. Zeyu Zhang (UA)


  1. Tushar Khot, Allen Institute for Artificial Intelligence
  2. Ashish Sabharwal, Allen Institute for Artificial Intelligence
  3. Oyvind Tafjord, Allen Institute for Artificial Intelligence
  4. Peter Clark, Allen Institute for Artificial Intelligence
  5. Aruna Balasubramanian, Stony Brook University

Project Goals

The major goal of this project is to develop explainable inference methods. Text-based inference methods today support question answering and information extraction capabilities. However, a key deficiency of these methods is that it is not easy to explain how the models arrive at their decisions. Our work aims to address this gap. In particular, our focus is to develop explanation-centered approaches to inference for complex question answering and information extraction tasks.

Research Challenges

The main research challenges are:

  • The lack of large-scale annotated datasets that support explainable reasoning.
  • The combinatorial possibilities when aggregating information for inference, requiring targeted exploration.
  • The presence of distracting information that can cause inference drift, requiring careful incorporation of information.

Summary of Current Results

  • Entailment-based Question Answering: A fundamental issue when reasoning with multiple pieces of text-based information is that distracting information can easily derail inference. One way to address this is to design an effective mechanism that controls which pieces of information are aggregated. Our primary result in this space is based on the intuition that both filtering (i.e., finding which pieces to aggregate) and aggregation can be seen as forms of textual entailment. We showed that a pre-trained neural entailment model can be repurposed to do multi-hop question answering. On two complex QA datasets that require reasoning with multiple sentences, this approach improves answer quality while also improving our ability to locate important sentences that support the answer -- a step towards explainability. We obtain gains of 3 absolute points in F1 on MultiRC and 1.7 absolute points in F1 on OpenBookQA over the OpenAI Transformer, a large model with much higher capacity.
  • Predicting when the QA model has an answer: One way to understand and explain a QA model’s answers is to learn a separate function that can tell when the QA model has found an answer and when it cannot reliably find one. Such a capability can improve the trustworthiness of a QA system and also has implications for efficiency -- there is no need to process more documents once the answer has been found. We developed an early stopping algorithm that inspects the QA model’s internals and its scores to decide when the correct answer has been found. A simple score-based classifier is able to predict when further processing is unnecessary with 60% accuracy. Our initial attempts at using the internal representations of the QA models were largely unsuccessful.
  • Understanding the information needs of a question: In currently submitted work, we have developed (to the best of our knowledge) the largest and most detailed question classification dataset, which narrows questions into hundreds of detailed problem domains. We have also paired this with a question classification model that achieves state-of-the-art performance across several benchmark open-domain and biomedical-domain datasets. (To preserve blind review, more details will be released upon acceptance.)
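
The score-based early stopping described in the second item can be illustrated with a minimal sketch. It is not the project's implementation: the function name `read_until_confident`, the fixed score threshold (standing in for the learned classifier), and the word-overlap `toy_qa_model` are all assumptions made for illustration. The sketch reads documents one at a time and stops as soon as the best answer score seen so far crosses the threshold.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interface: a QA model maps (question, document) to (answer, score).
QAModel = Callable[[str, str], Tuple[str, float]]


def read_until_confident(
    question: str,
    documents: Iterable[str],
    qa_model: QAModel,
    threshold: float = 0.8,
) -> Tuple[Optional[str], float, int]:
    """Return (best_answer, best_score, docs_read).

    Processing stops early once the best score reaches `threshold`
    (an assumed stand-in for a learned stopping classifier).
    """
    best_answer: Optional[str] = None
    best_score = float("-inf")
    docs_read = 0
    for doc in documents:
        answer, score = qa_model(question, doc)
        docs_read += 1
        if score > best_score:
            best_answer, best_score = answer, score
        if best_score >= threshold:
            break  # further documents are skipped
    return best_answer, best_score, docs_read


def toy_qa_model(question: str, document: str) -> Tuple[str, float]:
    """Toy stand-in for a real QA model (assumption, for illustration only).

    Score: fraction of question words that appear in the document.
    Answer: the last word of the document.
    """
    q_words = set(question.lower().rstrip("?").split())
    d_words = document.lower().rstrip(".").split()
    overlap = sum(1 for w in q_words if w in d_words)
    score = overlap / len(q_words) if q_words else 0.0
    answer = d_words[-1] if d_words else ""
    return answer, score


if __name__ == "__main__":
    docs = ["Rocks are hard.", "Plants absorb the gas co2.", "Fish live in water."]
    print(read_until_confident("What gas do plants absorb?", docs, toy_qa_model, 0.5))
```

In this toy run the third document is never read, which is the efficiency benefit the item describes; a real system would replace the fixed threshold with the trained score-based classifier.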


1. Trivedi, H., Kwon, H., Khot, T., Sabharwal, A., & Balasubramanian, N. (2019). Repurposing Entailment for Multi-Hop Question Answering Tasks. NAACL-HLT.

2. Cao, Q., Weber, N., Balasubramanian, N., & Balasubramanian, A. (2019). DeQA: On-Device Question Answering. MobiSys.


  1. [Slides] NAACL 2019 Talk on Repurposing Entailment for Multi-Hop Question Answering Tasks by Harsh Trivedi.


  1. Question Classification Dataset -- currently in submission
  2. Explanation Bank -- a continuing collaborative effort to develop a corpus of detailed semi-structured explanations to serve as training data for multi-hop inference, as well as an instrument for providing a detailed characterization of the information aggregation abilities of multi-hop inference algorithms.



Broader Impacts

Algorithmic Advances:

  1. A new entailment-based method for question answering, which helps advance the accuracy of QA techniques and takes a step towards explainability. The technique developed for repurposing entailment models for QA is a general contribution: it can be applied to take a model pre-trained on one task (task A) and apply it to another (task B). This has implications for multi-task learning in NLP.
  2. We show that a pre-trained function can be broken into parts and recombined for use in new tasks that may have different types of inputs (e.g., using functions pre-trained on sentences to work on documents). This is a promising idea that can be useful, in general, for re-configuring ML models.
  3. A new method for evaluating the relevance, overlap, and coverage of multiple facts for inference in a QA setting. (Paper currently submitted)
  4. A new method for performing question classification, which is the first single method to generalize and show state-of-the-art performance across multiple question classification datasets. (Paper currently submitted)
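
The idea in items 1 and 2 of reusing a sentence-level function on multi-sentence input can be sketched as follows. This is an illustration under stated assumptions, not the published model: `sentence_entails` is a hypothetical word-overlap stand-in for a pre-trained neural entailment model, and the filter threshold and the concatenation-based aggregation are arbitrary choices made for the example.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical interface: scores how strongly a premise supports a hypothesis, in [0, 1].
SentenceScorer = Callable[[str, str], float]


def _tokens(text: str) -> Set[str]:
    """Lowercase the text and split on whitespace, ignoring '.' and '?'."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())


def sentence_entails(premise: str, hypothesis: str) -> float:
    """Toy stand-in for a pre-trained entailment model (assumption).

    Returns the fraction of hypothesis words covered by the premise.
    """
    hyp = _tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & _tokens(premise)) / len(hyp)


def score_hypothesis(
    sentences: List[str],
    hypothesis: str,
    scorer: SentenceScorer,
    keep_threshold: float = 0.3,
) -> Tuple[float, List[str]]:
    """Apply a sentence-level scorer to a multi-sentence passage.

    Filtering: keep sentences whose individual score reaches `keep_threshold`.
    Aggregation: score the hypothesis against the kept sentences taken together.
    Returns the aggregate score and the kept sentences, which act as the explanation.
    """
    support = [s for s in sentences if scorer(s, hypothesis) >= keep_threshold]
    if not support:
        return 0.0, []
    return scorer(" ".join(support), hypothesis), support


if __name__ == "__main__":
    passage = [
        "Plants absorb carbon dioxide.",
        "The sky is blue.",
        "Carbon dioxide is a gas.",
    ]
    print(score_hypothesis(passage, "Plants absorb a gas.", sentence_entails))
```

In the toy passage neither supporting sentence covers the hypothesis on its own, but the two together do, while the distracting sentence is filtered out. The kept sentences double as the evidence shown to a user.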

Dataset Advances:

  1. A new challenge dataset for detailed question classification, substantially exceeding previous datasets on metrics of size, complexity, the level of detail in problem classification, and the difficulty of the task. (Paper currently submitted)
  2. A continuation of collaborative work to develop a large dataset of supervised training data for the information aggregation/multi-hop inference and explanation generation tasks. At the time of writing this report, the website that disseminates the initial version of that dataset (the Explanation Bank) has been accessed 1,721 times. (http://cognitiveai.org/explanationbank/)

Integration with the broader research community:

  1. Co-I Surdeanu and Co-I Jansen are (with the help of collaborators) currently running a shared task on explanation regeneration at TextGraphs 2019 using the above data, in an effort to increase the ultimate impact of this data, both in terms of usage and in speeding the algorithmic advances made possible through its use.

Educational Material


Highlights and Press Releases


Point of Contact

Niranjan Balasubramanian, Stony Brook University (PI)

Last Updated

June 29, 2019