Science Quiz Generation

Quizzes can be hard, but making valid questions at the right difficulty level can be even harder!


Abstract


AI tools can be a valuable aid for building educational resources. Among these, a very popular testing method uses quizzes or flashcards to check students' understanding of the material they are taught. This project aims to design a system that generates multiple-choice questions, answers, and plausible choices (distractors) for a given paragraph of text. Importantly, the way questions are formulated and the choices offered can significantly influence the difficulty of a quiz. For this reason, you will also be asked to evaluate the quality of the generated question/answer pairs and of the other provided options.


Description


In this project, you will be using the SciQ dataset, which contains question-answer pairs along with three alternatives (also called distractors) and a supporting context. The questions in this dataset come from fields such as Physics, Chemistry, and Biology.
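A copy of SciQ is available on the Hugging Face Hub; assuming you use that copy (the dataset id and field names below follow it), a minimal loading sketch could look like this:

```python
# Minimal sketch: load SciQ from the Hugging Face Hub (assumed id: "allenai/sciq").
# Field names follow that copy: question, correct_answer, distractor1-3, support.
from datasets import load_dataset

sciq = load_dataset("allenai/sciq")   # splits: train / validation / test
example = sciq["train"][0]

print(example["support"])             # context paragraph
print(example["question"])            # question text
print(example["correct_answer"])      # gold answer
print([example[f"distractor{i}"] for i in (1, 2, 3)])   # the three alternatives
```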


Your main goal is to build a system that can generate question-answer pairs and plausible distractors given a text segment as input. For example, given the context:


"Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."


your model should generate a question-answer pair like: 


Question: What type of plate boundaries produce huge mountain ranges in the ocean basin?
Answer: divergent


and add plausible distractors to use in the quiz, for example:

Distractors: convergent, transform, continental



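As a starting point, one possible (and deliberately simple) baseline is to prompt an off-the-shelf instruction-tuned sequence-to-sequence model; the checkpoint name and prompt below are illustrative assumptions, and fine-tuning such a model on SciQ (context in, question and answer out) is a natural next step.

```python
# Illustrative baseline sketch (assumed checkpoint: "google/flan-t5-base"):
# prompt a pretrained seq2seq model to draft a question-answer pair from a context.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = ("Divergent plate boundaries produce huge mountain ranges "
           "underwater in every ocean basin.")
prompt = f"Write a quiz question about the following text and give its answer: {context}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```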
You will then need to evaluate the generated multiple-choice questions with qualitative and quantitative methods. Are the generated questions coherent? Are the distractors relevant? 


In natural language generation (NLG) tasks such as this one, many different outputs can be correct for a given input. One way to evaluate a model-generated question is to check how similar it is to a (set of) reference question(s) given in the test set. This can be done with text-overlap metrics such as BLEU and the ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-L). Alternatively, neural metrics such as BLEURT and BARTScore are also used for NLG evaluation.
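As a rough sketch of how such overlap metrics can be computed (assuming the Hugging Face evaluate library as the tooling choice; sacrebleu or rouge_score can be used directly instead):

```python
# Sketch of reference-based evaluation with the `evaluate` library (assumed tooling).
import evaluate

predictions = ["What kind of plate boundaries produce underwater mountain ranges?"]
references  = ["What type of plate boundaries produce huge mountain ranges in the ocean basin?"]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))  # rouge1 / rouge2 / rougeL

bleu = evaluate.load("bleu")
# BLEU accepts several references per prediction, hence the nested list.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```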


Additionally, the task of multiple-choice question generation also requires generating appropriate distractors. This, too, can be challenging: distractors should be neither too similar to the correct answer (e.g., synonyms) nor too dissimilar from it (e.g., unrelated terms from a different field), and often they do not appear in the provided paragraph. Start by qualitatively assessing some of the produced distractors. Can word/sentence embeddings be used to create a metric and automate the evaluation of the generated distractors?
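One possible starting point (a sketch under assumptions, not a prescribed metric) is to embed the answer and each distractor with a sentence-embedding model and compare their cosine similarities; a useful distractor might be expected to fall in an intermediate similarity range, neither a near-duplicate of the answer nor completely unrelated. The model name below is an assumption.

```python
# Sketch: embedding-based look at distractor quality (assumed model: "all-MiniLM-L6-v2").
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

answer = "divergent"
distractors = ["convergent", "transform", "photosynthesis"]  # last one is deliberately off-topic

answer_emb = model.encode(answer, convert_to_tensor=True)
distractor_embs = model.encode(distractors, convert_to_tensor=True)

similarities = util.cos_sim(answer_emb, distractor_embs)[0]
for distractor, score in zip(distractors, similarities):
    print(f"{distractor}: cosine similarity to the answer = {score.item():.3f}")
```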


Ideas for research directions:


Materials

References

Evaluation Metrics: Assessing the quality of NLG outputs

Sentence Transformers Quickstart

Automatic distractor generation for domain specific texts

A systematic review of automatic question generation for educational purposes

Distractor Generation for Multiple Choice Questions Using Learning to Rank

Lost in the Middle: How Language Models Use Long Contexts