Course Description

Data 88: Language Modeling & Text Analysis

University of California, Berkeley | 2020 Spring Session | Section: SEM 001 | Class Number: 32957

Dates: Monday 2:00P-3:59P | Location: 458 Evans

First Day of Class: January 27 | Final Exam Schedule: April 27

Instructor: Professor Adam G. Anderson | admndrsn@berkeley.edu

Connector Assistant: Richard Yu | richard682yu@berkeley.edu

Request Consultations (in place of my office hours): https://dlab.berkeley.edu/consultation/adam-g-anderson



Welcome to the Data 88 course on Language Models and Text Analysis! This semester we will explore large collections of digitized books as data, using computational text analysis for language modeling. Each student's goal will be to build a series of language models using computational text analysis, and to describe and compare the usefulness of these models for a text corpus or collection of books of their choosing.

In this course, we will evaluate a variety of language models in order to critically assess the value of the new knowledge being generated, and to weigh that knowledge in terms of the ever-growing set of tools for deep learning, often referred to as Muppetware. We will explore the fundamental arguments being advanced about these new methods and how they interact with the humanities’ interpretive underpinnings. This course will prepare students to apply Natural Language Processing (NLP) and Computational Text Analysis (CTA) methods in ethical, reflective, and responsible ways by understanding the potentials and limitations of these language models.

As an area of scholarly research, NLP acts as a bridge between Linguistics on the one hand and Computer Science on the other, and therefore cultivates the ability of each student to translate between these disciplines. In order to make such translations, each student project takes a ‘both-and’ approach:

  1. it begins with the artisanal craft of the Arts and Humanities through imaging, digitization, and digital curation of a corpus or collection of sources;

  2. it then prepares the relevant sources for critical analysis (i.e. exegesis and hermeneutics) through qualitative analysis, tagging, encoding, and linking datasets with the same specialist knowledge from each discipline within Linguistics through computational tools;

  3. it then approaches these curated sources empirically, posing new questions which are pursued through language modeling, exploratory network analysis, visualization, and documentation;

  4. lastly it deals responsibly with the interpretation of the results, and conveys any assumptions made through testing the reproducibility and replicability of the study at hand, along with analysis of the theoretical assumptions and ethical implications of existing technologies and methods.

The emphasis of this course is in the hands of each student, as they select their sources and apply the discussed tools and methods for language modeling and text analysis. We will explore the ways the union of data analysis and corpus linguistics research is both productive and in tension.

Students will engage with the theoretical and methodological foundations of language modeling and NLP by participating in discussion of the readings, crafting a proposal for their unique textual corpus project in the language of their choosing, and creating posters and videos of their results. Through this course, students will learn to evaluate computational models and data-driven arguments and to develop reflective, ethical, and critically aware projects.

Course Requirements

  • Attendance is required as the class discussions are central to the course. Students are responsible for making up any missed material. Please email me to make arrangements if you need to miss class.

  • Complete readings thoroughly and contribute to the corresponding collaborative Google Doc. Come to class ready to discuss the daily questions with ideas supported by the readings.

  • Assignment 1 (Due 2/10/20): Complete a proposal Poster for your Project.

  • Assignment 2 (Due 3/30/20): Draft of the Poster for your Project.

  • Provide feedback on peer project posters.

  • Assignment 3 (Due 5/11/20): Final Video Presentation.

  • Assignment 4 (Due 5/11/20): Final Project Poster.

Learner Support

The main “hub” for this course will be a folder in Google Drive, where students and the instructor will collaborate and share readings and slides for the course. Students will receive feedback from the instructor and peers on their Project Diagram Posters and Final Movie Projects through bCourses.

Please notify me in writing by the second week of the term about any known or potential extracurricular conflicts (such as religious observances, graduate or medical school interviews, or team activities). I will try my best to help you with making accommodations, but cannot promise them in all cases. In the event there is no mutually-workable solution, you may be dropped from the class.

4 Assignments:

1. Poster Diagram of your Project (Due 2/10/20)

Prepare a poster (24x36) illustrating your ideas on the text corpus you will use for your project and the type of analysis you intend to explore. Describe the language(s), methods & tools (what & how?) of your project, and any questions for inductive and deductive estimation and analysis.

2. Finalized Poster Diagram of your Project (Due 3/30/20)

Submit the final version of your project diagram for this course: a unique project that visually explains the models you've created and critically engages with the results of the language models derived from the corpus of your choosing.

3. Video Tutorial of your Project (Due 4/27/20)

Prepare a 5-7 minute video tutorial on your language modeling project, hosted on YouTube. Your presentation should include a walkthrough of the major features of the project, a description and assessment of the project goals, design, data sources, and analytical methods, and an evaluation of the project using at least two of the course readings (include citations). The grading rubric will consist of the following questions:

  • Why did you select the language & corpus?

  • What tools & methods were used?

  • How did your methods shape the results?

4. Final Posters (Due 4/27/20), with the following rubric:

  • (1) Course Title & Instructor, Your Name & Date (who?)

  • (1) Descriptions of your Dataset (what?)

  • (5) Questions for Exploratory Data Analysis (what?)

  • (5) Descriptions of Tools & Methods (how?)

  • (5) Interpretations of results (why?)

  • (3) Works Cited, Links & References

Course Grading Rubric:

  • Attendance & Participation (Class discussion and evidence of reading): 5%

  • Assignment 1 Poster Diagram: 25%

  • Assignment 2 Poster Project: 25%

  • Final Video Tutorial: 20%

  • Final Poster: 25%

Suggested Digital Collections to Review

Language & Text Archives

Textual Analysis

Accommodations for Students with Disabilities

Please email me as soon as possible if you need particular accommodations, and we will work out the necessary arrangements.

Academic Integrity

You are a member of an academic community at one of the world’s leading research universities. Universities like Berkeley create knowledge that has a lasting impact in the world of ideas and on the lives of others; such knowledge can come from an undergraduate paper as well as the lab of an internationally known professor. One of the most important values of an academic community is the balance between the free flow of ideas and the respect for the intellectual property of others. Researchers don't use one another's research without permission; scholars and students always use proper citations in papers; professors may not circulate or publish student papers without the writer's permission; and students may not circulate or post materials (handouts, exams, syllabi--any class materials) from their classes without the written permission of the instructor.

Any paper or project submitted by you and that bears your name is presumed to be your own original work that has not previously been submitted for credit in another course unless you obtain prior written approval to do so from your instructor. In all of your assignments, including your homework or drafts of papers, you may use words or ideas written by other individuals in publications, web sites, or other sources, but only with proper attribution (MLA citation). If you are not clear about the expectations for completing an assignment or taking a test or examination, be sure to seek clarification from me beforehand. Finally, you should keep in mind that as a member of the campus community, you are expected to demonstrate integrity in all of your academic endeavors and will be evaluated on your own merits. The consequences of cheating and academic dishonesty—including a formal discipline file, possible loss of future internship, scholarship, or employment opportunities, and denial of admission to graduate school—are simply not worth it. You will not pass my class if you are found to be cheating or plagiarizing, so please use the proper citations and attributions.

Emergency plan:

https://dac.berkeley.edu/evans-hall

Selected Readings:

Suggested Readings:

Graded Components:

Assignment 1: Intro Survey & Project Diagrams (Due 2/10/20)

Assignment 2: 2nd Draft: Poster Diagram of your Project (Due 3/30/20)

Assignment 3: Video Tutorial of your Project (Due 5/11/20)

Assignment 4: Final Poster of your Project (Due 5/11/20)