Abstract
AI tools can be a valuable aid in building educational resources. One popular assessment method uses quizzes or flashcards to test students' understanding of the material they are taught. This project aims to design a system that, given a context (e.g., a textbook paragraph), a question, and its answer, can generate plausible wrong choices (known as distractors). Importantly, the way in which distractors are generated can significantly influence the difficulty of the quiz. For this reason, a large part of this project is the careful evaluation of the system's output.
Description
In this project, you will be using the SciQ dataset, which contains Question-Answer (QA) pairs along with three alternatives (also called distractors) and a context/support. The questions in this dataset originate from the fields of Physics, Chemistry, and Biology, among others.
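To get a first look at the data, the dataset can be loaded with the `datasets` library. The sketch below is a minimal example; the field names (`question`, `correct_answer`, `distractor1`–`distractor3`, `support`) are those listed on the SciQ dataset card.

```python
from datasets import load_dataset

# Load the SciQ dataset from the Hugging Face Hub.
sciq = load_dataset("sciq")

# Inspect one training example: a question, the correct answer,
# three distractors, and a supporting passage.
example = sciq["train"][0]
print(example["question"])
print(example["correct_answer"])
print(example["distractor1"], example["distractor2"], example["distractor3"])
print(example["support"])
```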
Your main goal is to build a system that can generate suitable distractors given a context, a question, and the correct answer (a minimal generation sketch appears after the example below). For example, given:
The context: "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin.",
the question: "What type of plate boundaries produce huge mountain ranges in the ocean basin?",
and the answer: "divergent"
your system should generate distractors like:
"tractional"
"coherent"
"parallel"
You will then need to evaluate the generated distractors with qualitative and quantitative methods. Are the distractors plausible? Are they too easy or too difficult?
In natural language generation (NLG) tasks such as this one, many different generations can be correct for a given input. One way to evaluate a model-generated distractor is to check how similar it is to the (set of) reference distractors given in the test set. This can be done with text-based metrics such as BLEU and the ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-L). Alternatively, neural metrics used for NLG evaluation include BLEURT and BARTScore.
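As a sketch of how these reference-based metrics might be computed, the Hugging Face `evaluate` library provides ROUGE and BLEU implementations. The generated and reference distractors below are invented for illustration, and pairing each generated distractor with a single reference is itself a design choice you will need to justify.

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

# Invented example: model-generated distractors and reference distractors
# from the test set, paired one-to-one for illustration.
predictions = ["tractional", "coherent", "parallel"]
references = [["convergent"], ["transform"], ["subduction"]]

# ROUGE returns rouge1/rouge2/rougeL F-scores; BLEU returns a corpus-level score.
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```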
This task is further challenging because the distractors should be neither too similar (e.g., synonyms) nor too dissimilar (e.g., unrelated terms from a different field) to the correct answer, and they are often not included in the provided paragraph. Furthermore, for a given QA pair, there could be good distractors that are not among those in the test set. Start by qualitatively assessing some of the produced distractors. Can word/sentence embeddings be used to create a metric and automate the evaluation of the generated distractors?
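A minimal way to prototype such an embedding-based metric, assuming the `sentence-transformers` library and the `all-MiniLM-L6-v2` model (an illustrative choice), is to compare each generated distractor to the correct answer with cosine similarity. The similarity band used below is an arbitrary placeholder, not a validated threshold.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

answer = "divergent"
generated = ["tractional", "coherent", "parallel",
             "divergent boundary", "photosynthesis"]

answer_emb = model.encode(answer, convert_to_tensor=True)
generated_embs = model.encode(generated, convert_to_tensor=True)

# Cosine similarity between the correct answer and each generated distractor.
sims = util.cos_sim(answer_emb, generated_embs)[0]

for text, sim in zip(generated, sims.tolist()):
    # Placeholder heuristic: very high similarity suggests a near-synonym of
    # the answer, very low similarity suggests an off-topic term. The 0.4-0.8
    # band is an arbitrary example, not a recommended value.
    verdict = "plausible?" if 0.4 < sim < 0.8 else "suspicious"
    print(f"{text:20s} similarity={sim:.2f} -> {verdict}")
```

Similarity to the reference distractors or to the support paragraph could be incorporated in the same way.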
Ideas for research directions:
Could we use a Question-Answering model to evaluate the generated distractors? Beyond using classical quantitative metrics (BLEU, ROUGE, BLEURT, etc.), we could also treat a Question-Answering model as a "student"/test-taker. Are implausible/bad distractors assigned less confidence (e.g., measured by softmax probabilities) by the student model? A minimal scoring sketch appears after this list.
Tarrant et al. (2006) established 19 guidelines for "writing high-quality multiple-choice questions". Using these criteria, manually evaluate a subset of the generated sets of distractors to see if they are of high quality. Can this process be automated for some or all criteria (e.g., using a fine-tuned LLM, or a rule-based system)? A small rule-based sketch appears after this list.
[Challenge 🏆] Could we extend the system such that it converts an open question + answer to a Multiple-Choice Question with appropriate distractors? Extend and evaluate your distractor-generation system with a component that allows open QA pairs (e.g., wiki_qa, web_questions) to be used as input. When no context/passage is provided, this becomes trickier. Could we generate a passage before generating the distractors, or is a context paragraph not needed at all?
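As a sketch of the "student model" idea from the first research direction, one possible setup (an assumption, not the only one) is to score each answer option by the total log-likelihood a seq2seq model such as `google/flan-t5-base` assigns to it, and then normalise over the options with a softmax. Implausible distractors would then be expected to receive little probability mass.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative "student" model; any QA-capable model could be used instead.
name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

context = ("Divergent plate boundaries produce huge mountain ranges "
           "underwater in every ocean basin.")
question = ("What type of plate boundaries produce huge mountain ranges "
            "in the ocean basin?")
options = ["divergent", "tractional", "coherent", "parallel"]

prompt = (f"Answer the question using the context.\n"
          f"Context: {context}\nQuestion: {question}\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt")

log_likelihoods = []
with torch.no_grad():
    for option in options:
        labels = tokenizer(option, return_tensors="pt").input_ids
        # The returned loss is the mean negative log-likelihood of the option
        # tokens; multiplying by the token count gives the total log-likelihood.
        # (No length normalisation is applied here -- a simplification.)
        loss = model(**inputs, labels=labels).loss
        log_likelihoods.append(-loss.item() * labels.shape[1])

# Normalise the option scores into a "student confidence" distribution.
confidences = torch.softmax(torch.tensor(log_likelihoods), dim=0)
for option, conf in zip(options, confidences.tolist()):
    print(f"{option:12s} confidence={conf:.3f}")
```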
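For the second research direction, some commonly cited item-writing guidelines (e.g., avoiding "all/none of the above" options, avoiding duplicated options, keeping options of comparable length) translate naturally into rule-based checks. The checks below are illustrative approximations, not a faithful implementation of Tarrant et al.'s 19 guidelines.

```python
def check_distractors(answer: str, distractors: list[str]) -> dict[str, bool]:
    """Run a few illustrative rule-based checks on a set of distractors.

    These approximate only a small subset of item-writing guidelines;
    passing them does not guarantee a high-quality item.
    """
    options = [d.lower().strip() for d in distractors]
    answer = answer.lower().strip()
    return {
        # No "all/none of the above" style options.
        "no_catch_all_options": not any(
            o in {"all of the above", "none of the above"} for o in options
        ),
        # Distractors should not duplicate each other or the correct answer.
        "no_duplicates": len(set(options + [answer])) == len(options) + 1,
        # Options should be roughly comparable in length to the answer
        # (within a factor of 3 here -- an arbitrary threshold).
        "comparable_length": all(
            len(o) <= 3 * len(answer) and len(answer) <= 3 * max(len(o), 1)
            for o in options
        ),
    }


print(check_distractors("divergent", ["tractional", "coherent", "parallel"]))
```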
Materials
The SciQ dataset is available on the Hugging Face Hub (article).
Refer to the dataset card on the Dataset Hub for all information related to available features and an example from the dataset.
Refer to the Gradio and Streamlit documentation for the challenge question (a minimal Gradio sketch appears at the end of this section).
Multiple models fine-tuned for question generation, semantic similarity, machine translation, and instruction following are available on the Hugging Face Hub. A general heuristic is to try out some of the most downloaded ones on your use case and pick the best one. Comparing different models used for the same task is good, but it should occupy only a minor part of your overall work.
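For the challenge, a small demo interface can make qualitative inspection easier. The sketch below shows a minimal Gradio app wired to a placeholder function; your actual distractor-generation model would replace the stub.

```python
import gradio as gr


def generate_distractors(context: str, question: str, answer: str) -> str:
    # Placeholder: call your distractor-generation model here.
    return "tractional, coherent, parallel"


demo = gr.Interface(
    fn=generate_distractors,
    inputs=["text", "text", "text"],  # context, question, correct answer
    outputs="text",                   # comma-separated distractors
    title="Distractor generator demo",
)

if __name__ == "__main__":
    demo.launch()
```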
References
Evaluation Metrics: Assessing the quality of NLG outputs
Sentence Transformers Quickstart
Automatic distractor generation for domain specific texts
A systematic review of automatic question generation for educational purposes
Distractor Generation for Multiple Choice Questions Using Learning to Rank
Lost in the Middle: How Language Models Use Long Contexts