Science Quiz Generation
Quizzes can be hard, but making valid questions at the right difficulty level can be even harder!
Abstract
AI tools can be a valuable aid for building educational resources. One popular assessment method uses quizzes or flashcards to test students' understanding of the material they are taught. This project aims to design a system that can generate multiple-choice questions, answers, and possible choices/distractors for a given paragraph of text. Importantly, the way questions are formulated and the choices offered can significantly influence quiz difficulty. For this reason, you will also be asked to evaluate the quality of the generated question-answer pairs and the accompanying answer options.
Description
In this project, you will be using the SciQ dataset, which contains question-answer pairs along with three alternatives (also called distractors) and a context/support passage. The questions in this dataset come from the fields of Physics, Chemistry, and Biology, among others.
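For reference, here is a minimal sketch of loading the dataset with the `datasets` library and inspecting one record; the field names below are those listed on the dataset card, so double-check them on the Hub.

```python
# Minimal sketch: load SciQ from the Hugging Face Hub and inspect one example.
from datasets import load_dataset

sciq = load_dataset("sciq")          # splits: train / validation / test
example = sciq["train"][0]

print(example["support"])            # context/support paragraph
print(example["question"])           # question text
print(example["correct_answer"])     # gold answer
print(example["distractor1"], example["distractor2"], example["distractor3"])
```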
Your main goal is to build a system that can generate question-answer pairs and possible distractors given a text segment as input. For example, given the context:
"Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
your model should generate a question-answer pair like:
Question: What type of plate boundaries produce huge mountain ranges in the ocean basin?
Answer: divergent
and add plausible distractors to use in the quiz, like:
"tractional"
"coherent"
"parallel"
You will then need to evaluate the generated multiple-choice questions with qualitative and quantitative methods. Are the generated questions coherent? Are the distractors relevant?
In natural language generation (NLG) tasks such as this one, many different outputs can be correct for a given input. One way to evaluate a model-generated question is to check how similar it is to a (set of) reference question(s) given in the test set. This can be done with text-overlap metrics such as BLEU and the ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-L). Alternatively, neural metrics used for NLG evaluation include BLEURT and BARTScore.
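As a starting point, a sketch of computing overlap metrics with the Hugging Face `evaluate` library is shown below (the library and its metric dependencies are assumed to be installed; the reference question is invented purely for illustration).

```python
# Sketch: score generated questions against reference questions from the test set.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["What type of plate boundaries produce huge mountain ranges in the ocean basin?"]
references = [["What type of plate boundary produces huge underwater mountain ranges?"]]

print(rouge.compute(predictions=predictions, references=references))  # rouge1 / rouge2 / rougeL
print(bleu.compute(predictions=predictions, references=references))   # corpus-level BLEU
```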
Additionally, the task of multiple-choice question generation also requires generating appropriate distractors. This, too, can be challenging: distractors should be neither too similar to the correct answer (e.g., synonyms) nor too dissimilar (e.g., unrelated terms from a different field), and they are often not included in the provided paragraph. Start by qualitatively assessing some of the produced distractors. Can word/sentence embeddings be used to create a metric and automate the evaluation of the generated distractors?
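One possible ingredient for such a metric, sketched below, is the cosine similarity between the embeddings of the correct answer and each distractor; the `sentence-transformers` model name is just an example, and the "middle band" heuristic in the comments is an assumption you would need to validate.

```python
# Sketch of an embedding-based distractor check using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

answer = "divergent"
distractors = ["tractional", "coherent", "parallel"]

emb_answer = model.encode(answer, convert_to_tensor=True)
emb_distractors = model.encode(distractors, convert_to_tensor=True)

# Cosine similarity to the correct answer: plausible distractors arguably sit in a
# middle band -- not near-synonyms (too high) and not unrelated terms (too low).
scores = util.cos_sim(emb_answer, emb_distractors)[0]
for d, s in zip(distractors, scores):
    print(f"{d}: {s:.3f}")
```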
Ideas for research directions:
Generating multiple-choice questions and distractors might be seen as a two-step process: first, the question is generated from the context (also known as support) and the answer; then relevant but wrong alternatives (distractors) are generated. A different approach is to build a single fine-tuned system that, given the context and correct answer, generates both the question and the distractors in one pass (a data-formatting sketch for both approaches follows this list). Evaluate both approaches: how do the generated multiple-choice questions compare in terms of quality and the time needed?
Tarrant et al. (2006) established 19 guidelines for "writing high-quality multiple-choice questions". Using these criteria, manually evaluate a subset of the generated multiple-choice questions to see if they are of high quality. Can this process be automated for some or all criteria (e.g., using a fine-tuned LLM or a rule-based system)? A small rule-based sketch follows this list.
[Challenge 🏆] Can we extend this system for real-world usage? Create a Gradio or Streamlit demo hosted on the Hugging Face Hub in which a user provides a URL and a relevant term; based on these, the fine-tuned system extracts suitable content and generates multiple-choice questions and distractors for the given term (a minimal demo skeleton is sketched below). Does the fine-tuned system perform better on science content (i.e., content similar to the training data) than on dissimilar topics (e.g., history)?
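For the first idea above, one possible way to turn SciQ records into (input, target) pairs for a seq2seq model is sketched below; the prompt prefixes and separators are illustrative choices, not part of the dataset or a prescribed recipe.

```python
# Illustrative input/target formats for the two-step and the joint approach.

def make_two_step_examples(record):
    """Step 1 generates the question; step 2 generates distractors given the question."""
    q_input = f"generate question: answer: {record['correct_answer']} context: {record['support']}"
    q_target = record["question"]
    d_input = f"generate distractors: question: {record['question']} answer: {record['correct_answer']}"
    d_target = " ; ".join([record["distractor1"], record["distractor2"], record["distractor3"]])
    return (q_input, q_target), (d_input, d_target)

def make_joint_example(record):
    """A single model emits the question and all distractors in one pass."""
    joint_input = f"generate mcq: answer: {record['correct_answer']} context: {record['support']}"
    joint_target = (f"question: {record['question']} "
                    f"distractors: {record['distractor1']} ; {record['distractor2']} ; {record['distractor3']}")
    return joint_input, joint_target
```

For the second idea, a few of the criteria lend themselves to simple rule-based checks. The sketch below implements two illustrative rules that commonly appear in MCQ-writing guidance; it is not the full set of Tarrant et al. criteria.

```python
# Sketch of a rule-based checker for a couple of common MCQ-writing guidelines.

def check_mcq(question: str, answer: str, distractors: list[str]) -> dict:
    options = [answer] + distractors
    return {
        # "All/none of the above" options are generally discouraged.
        "uses_all_or_none_of_the_above": any(
            o.lower() in {"all of the above", "none of the above"} for o in options
        ),
        # The correct answer should not stand out by being much longer than the rest.
        "answer_much_longer_than_distractors": len(answer) > 2 * max(len(d) for d in distractors),
        # Options should be unique.
        "duplicate_options": len(set(o.lower() for o in options)) < len(options),
    }

print(check_mcq(
    "What type of plate boundaries produce huge mountain ranges in the ocean basin?",
    "divergent", ["tractional", "coherent", "parallel"],
))
```

For the challenge, a minimal Gradio skeleton is sketched below; `generate_mcq` is a placeholder standing in for your fine-tuned system, and fetching/filtering the page content is deliberately left out.

```python
# Minimal Gradio skeleton for the challenge demo.
import gradio as gr

def generate_mcq(url: str, term: str) -> str:
    # TODO: fetch the page, extract passages relevant to `term`,
    # then run the fine-tuned question/distractor generator on them.
    return f"(generated multiple-choice questions about '{term}' from {url})"

demo = gr.Interface(
    fn=generate_mcq,
    inputs=[gr.Textbox(label="URL"), gr.Textbox(label="Term")],
    outputs=gr.Textbox(label="Generated quiz"),
    title="Science Quiz Generator",
)

if __name__ == "__main__":
    demo.launch()
```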
Materials
The SciQ dataset is available on the Hugging Face Hub (article).
Refer to the dataset card on the Dataset Hub for all information related to available features and an example from the dataset.
Refer to the Gradio and Streamlit documentation for the challenge question.
Multiple models fine-tuned for question generation, semantic similarity, machine translation, and instruction following are available on the Hugging Face Hub. A general heuristic is to try some of the most downloaded ones on your use case and pick the one that works best. Comparing different models for the same task is useful, but should occupy only a minor part of your overall work.
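As an example of such a quick try-out, the sketch below prompts an off-the-shelf instruction-following checkpoint through the `transformers` pipeline; "google/flan-t5-base" is just one example model, and the prompt wording is an assumption to adapt to whichever models you shortlist.

```python
# Sketch: quickly try a Hub model on the task before committing to it.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
prompt = f"Generate a quiz question about this text, then give the answer: {context}"

print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```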
References
Evaluation Metrics: Assessing the quality of NLG outputs
Sentence Transformers Quickstart
Automatic distractor generation for domain specific texts
A systematic review of automatic question generation for educational purposes
Distractor Generation for Multiple Choice Questions Using Learning to Rank
Lost in the Middle: How Language Models Use Long Contexts