Project: Dataset Collection for NLP

Project 2023-24


Project Title: Dataset Collection for NLP

Professor: Mohit Iyyer

Lab/Research Group: UMass NLP

We work on natural language processing, which involves building computer programs that can understand and generate human language. The field is at the intersection of artificial intelligence and linguistics. My lab is particularly interested in building computational models of storytelling.

Project Description

An undergraduate team could collect a new dataset of literary text and then train models to perform a task on that data. Example tasks include speaker attribution (given some dialogue from a story, identify which character in that story said it) and question answering (given a snippet of a story, answer a question about why a character did or said something).
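
As a rough illustration of what a speaker attribution example might look like, the sketch below pulls quoted dialogue out of a passage and pairs each quote with a manually supplied speaker label. The function names, the record fields (context, quote, speaker), and the quote-matching heuristic are all assumptions made for illustration; the project does not prescribe a data format.

```python
import re

# Heuristic (an assumption): dialogue is the text between straight or curly
# double quotes. Real literary text will need more careful handling.
QUOTE_PATTERN = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(passage):
    """Return the quoted spans found in a passage, in order of appearance."""
    return [match.group(1).strip() for match in QUOTE_PATTERN.finditer(passage)]


def build_examples(passage, speakers):
    """Pair each extracted quote with a manually annotated speaker label.

    The record layout here is hypothetical, not a required format.
    """
    quotes = extract_quotes(passage)
    if len(quotes) != len(speakers):
        raise ValueError("need exactly one speaker label per quote")
    return [
        {"context": passage, "quote": quote, "speaker": speaker}
        for quote, speaker in zip(quotes, speakers)
    ]
```

For a passage such as '"Where are you going?" asked Anna. "Home," said Ben.', calling build_examples with the labels ["Anna", "Ben"] would yield two records, one per quote. A model trained on such records would be asked to predict the speaker field from the context and the quote.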

Learning Outcomes

Students should become familiar with the basics of text processing (e.g., how to extract clean data from messy sources) and computational modeling of language.
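
To give a sense of what extracting clean data from a messy source can involve, here is a minimal sketch that assumes the raw text was scraped from a web page and still contains HTML tags, character entities, and irregular whitespace. The clean_text function is a hypothetical helper, and the specific cleaning steps are assumptions; the right steps depend on where the data comes from.

```python
import html
import re
import unicodedata

TAG_PATTERN = re.compile(r"<[^>]+>")      # crude HTML tag matcher
WHITESPACE_PATTERN = re.compile(r"\s+")   # any run of whitespace


def clean_text(raw):
    """Strip tags, decode entities, normalize Unicode, and collapse whitespace."""
    text = TAG_PATTERN.sub(" ", raw)            # replace tags with spaces
    text = html.unescape(text)                  # e.g. "&amp;" becomes "&"
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space becomes a plain space
    return WHITESPACE_PATTERN.sub(" ", text).strip()
```

A regular expression is only a rough way to remove tags; for real web pages, a proper HTML parser is usually a safer choice.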

Prerequisites

Experience with Python programming would be a plus!