Abstract
Semantic search is increasingly incorporated into search applications to surface semantically relevant results that simple lexical techniques would otherwise miss. In this project, you will combine semantic search with techniques such as question answering and question generation to build a simple study assistant for the NLP course.
Description
A flashcard is a card bearing a question on one side and an answer on the other, intended as an aid to memorization. We believe flashcards are effective learning aids, but they are a lot of work to make! Wouldn’t it be nice if there were a way to automate the process of making them?
For this project, we parsed the portions of the “Speech and Language Processing” book by Jurafsky and Martin that are relevant to our course and semi-automatically cleaned them into a machine-readable format. Every paragraph of text has been associated with its chapter, section, and subsection, and a small set of questions, each paired with its answer and source paragraph, has been created to kickstart your efforts. Your main goal is to develop a system that, given a question, extracts an answer from the book.
For example, if provided with the query “What is stemming?”, a well-performing system would return the sentence “Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word” taken from the introduction of Chapter 2. You can use the provided questions and answers to validate your system, but you are encouraged to come up with your own and extend the list.
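To make the task concrete, here is a minimal sketch of one possible retrieve-then-read baseline: embed every paragraph, retrieve the candidates closest to the question, and run an extractive reader over them. The model names are illustrative defaults, not part of the project materials.

```python
# A minimal retrieve-then-read sketch. Model names are illustrative
# assumptions, not part of the project materials.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

paragraphs = [
    "Stemming refers to a simpler version of lemmatization in which we "
    "mainly just strip suffixes from the end of the word.",
    # ... the remaining paragraphs of the parsed book
]

retriever = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # any bi-encoder
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

corpus_embeddings = retriever.encode(paragraphs, convert_to_tensor=True)

def answer(question, top_k=3):
    # 1) Retrieve the paragraphs most similar to the question.
    query_embedding = retriever.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    # 2) Let the reader extract a span from each hit; keep the best-scoring one.
    candidates = [
        reader(question=question, context=paragraphs[hit["corpus_id"]])
        for hit in hits
    ]
    return max(candidates, key=lambda c: c["score"])["answer"]

print(answer("What is stemming?"))
```

Anything with the same two-stage shape, a bi-encoder retriever plus a reader, would serve as a comparable starting point.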
Ideas for research directions:
The trade-off between speed and accuracy is a well-known one in retrieval. You might be interested in making your search efficient with tools such as FAISS and Elasticsearch, and in measuring how this trade-off affects the performance of your systems in practice.
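For instance, the sketch below contrasts exact search with an approximate inverted-file index in FAISS; the random vectors stand in for real paragraph embeddings, and nlist/nprobe are the knobs that move you along the trade-off.

```python
# Exact vs. approximate nearest-neighbor search in FAISS. The random
# vectors below stand in for real paragraph embeddings.
import faiss
import numpy as np

corpus_embeddings = np.random.rand(10000, 384).astype("float32")
faiss.normalize_L2(corpus_embeddings)  # so inner product == cosine similarity
dim = corpus_embeddings.shape[1]

# Accuracy end of the trade-off: exhaustive inner-product search.
exact = faiss.IndexFlatIP(dim)
exact.add(corpus_embeddings)

# Speed end: an inverted-file index that only scans a few clusters per query.
nlist = 64                       # number of coarse clusters
quantizer = faiss.IndexFlatIP(dim)
approx = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
approx.train(corpus_embeddings)
approx.add(corpus_embeddings)
approx.nprobe = 4                # clusters searched per query: recall vs. speed

query = corpus_embeddings[:1]
for index in (exact, approx):
    scores, ids = index.search(query, 5)
    print(type(index).__name__, ids[0])
```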
[Challenge 🏆] The cleaning process we adopted is far from perfect, and much of the content may be irrelevant to your goals. Moreover, if you plan to use pre-trained systems for retrieving contexts and answering questions, you are likely to face a significant domain shift. Explore ways to adapt models to your low-resource setting, or to automatically filter the pool of texts to ensure relevance. The work by Wang et al. (2021) is a good starting point.
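If you explore the filtering route, even simple heuristics can prune obvious extraction debris before any model sees it. A minimal sketch, with illustrative and deliberately untuned thresholds:

```python
# Heuristic pool filtering. Thresholds are illustrative assumptions.
import re

paragraphs = [
    "Stemming refers to a simpler version of lemmatization in which we "
    "mainly just strip suffixes from the end of the word.",
    "Figure 2.1 | | | 0.3 0.7",  # typical extraction debris
]

def looks_like_prose(paragraph, min_words=10, max_symbol_ratio=0.1):
    """Keep paragraphs that read like running text rather than extraction
    debris (figure fragments, equation residue, table rows, ...)."""
    if len(paragraph.split()) < min_words:
        return False
    symbols = len(re.findall(r"[^A-Za-z0-9\s.,;:'\"()?-]", paragraph))
    return symbols / len(paragraph) <= max_symbol_ratio

clean_pool = [p for p in paragraphs if looks_like_prose(p)]
print(len(clean_pool))  # -> 1
```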
What about looking at the problem backward? If extracting salient concepts turns out to be easier, it may be worth using question generation models to augment your data. Are question generation models biased towards specific question types in this setting?
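As a starting point, the sketch below uses an off-the-shelf answer-aware question generation model. The model name and its “answer: … context: …” prompt format are assumptions taken from that model’s card; any QG model with a similar interface would do.

```python
# Answer-aware question generation. The model and its prompt format are
# assumptions taken from that model's card; swap in any QG model you prefer.
from transformers import pipeline

qg = pipeline("text2text-generation",
              model="mrm8488/t5-base-finetuned-question-generation-ap")

context = ("Stemming refers to a simpler version of lemmatization in which "
           "we mainly just strip suffixes from the end of the word.")
answer = "strip suffixes"

prompt = f"answer: {answer} context: {context}"
print(qg(prompt, max_length=64)[0]["generated_text"])
```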
Materials
A HuggingFace dataset built from the parsed data is available on the Dataset Hub. Refer to its dataset card for all information on the available features and for an example from the dataset.
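Loading it is then a one-liner; the identifier below is a placeholder, so substitute the actual name from the dataset card:

```python
# The dataset identifier below is a placeholder; use the actual name
# given on the dataset card.
from datasets import load_dataset

dataset = load_dataset("your-course-org/slp-parsed")  # hypothetical identifier
print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # split and field names are documented on the card
```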
References
Chaudhary, Amit. “Evaluation Metrics for Information Retrieval.” Blog post (2020).
Thakur, Nandan et al. “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” NeurIPS, Datasets and Benchmarks Track (2021).
Lin, Jimmy J. et al. “Pretrained Transformers for Text Ranking: BERT and Beyond.” Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2021).
Ma, Xueguang et al. “A Replication Study of Dense Passage Retriever.” arXiv abs/2104.05740 (2021), and the PyGaggle library.
Wang, Kexin et al. “GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval.” arXiv abs/2112.07577 (2021).
Sentence Transformers Domain Adaptation Tutorial.