Final assignment F22 guidelines

Final assignment due (15 Dec, 23:59 - updated!)

The final assignment asks you to find a corpus of interest, to match it with a method you have been exposed to this term and to analyze it using that method, all the while bringing a critical eye to the method. This assignment can be done in pairs or alone.

The first step will be to choose a corpus. Here are some possibilities:

(1) You can use the corpora we have worked with during the semester (Little Cousin series, etc.).

(2) You can use a fan fiction corpus (Archive of Our Own), but choose something other than Harry Potter.

(3) You can choose any dataset from Kaggle which has an unstructured text element to it (please choose something other than the Airbnb dataset).

(4) You can use AntCorGen to generate a corpus of professional writing.

(5) You can also choose texts from Wikisource, from Project Gutenberg or from other text bases. When you are preparing the files for use, you should save them as plain text (.txt) and make sure that the encoding is UTF-8.

(6) You can scrape data from anywhere on the web, as far as you are able to within the time allotted.

NB: If you choose to work with word embeddings, your instructor can train the model for you. Feel free to seek out any help needed in preparing your corpus.
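If your files are not already UTF-8, the preparation described in (5) could be sketched in Python roughly as follows. This is only a minimal illustration: the function names are made up for this example, and Latin-1 is assumed as the source encoding, which may not match your files.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def save_as_utf8(src, dst, source_encoding="latin-1"):
    """Read src with an assumed source encoding and rewrite it to dst as UTF-8.

    source_encoding is a guess you must supply yourself; if it is wrong,
    accented characters will come out garbled rather than raise an error.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst
```

A typical use would be to call is_valid_utf8 on each file first and convert only those that fail the check, then open one or two converted files to confirm that accented characters still look right.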

The next step is to consider a method in line with the size and nature of the corpus. Ideally, the methods we have studied will help you to approach a research question about the corpus. For example:

  • What kinds of words does one author or text use that distinguish it from another?

  • Who do we suspect is the author of this text? Can we predict authorship based on a set of words?

  • What are general trends in word use across my corpus?

  • What words tend to be used in similar contexts in my corpus? If I divide the corpus into parts, what is the meaning of the trends I might find?

Methods F22

Length and requirements:

The essay should be 1500-2000 words including visuals. Be sure to include the following:

  • a rationale for the corpus chosen and some contextualization of the material

  • a link to the corpus as a downloadable file

  • a description of the process (including the potential dead ends) of compiling and studying the corpus

  • an overview of what was learned through the analysis, relevant takeaways, "Eureka" moments

  • a statement about what else there is to learn, and how the specific method at hand was useful, or even limiting, for understanding the phenomenon as you saw it