Syllabus

CS 159 - Natural Language Processing

He then led me to the frame, about the sides, whereof all his pupils stood in ranks. It was twenty feet square, placed in the middle of the room. The superfices was composed of several bits of wood, about the bigness of a die, but some larger than others. They were all linked together by slender wires. These bits of wood were covered, on every square, with paper pasted on them; and on these papers were written all the words of their language, in their several moods, tenses, and declensions; but without any order. The professor then desired me “to observe; for he was going to set his engine at work.” The pupils, at his command, took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame; and giving them a sudden turn, the whole disposition of the words was entirely changed. He then commanded six-and-thirty of the lads, to read the several lines softly, as they appeared upon the frame; and where they found three or four words together that might make part of a sentence, they dictated to the four remaining boys, who were scribes. This work was repeated three or four times, and at every turn, the engine was so contrived, that the words shifted into new places, as the square bits of wood moved upside down.

Jonathan Swift, Gulliver’s Travels (1726)

The Basics

Instructor: Prof. Xanda Schofield

Office Hours: TBA

Class Time: Tu/Th 12:45-2 PM PST (Section 1), 2:30-3:45 PM PST (Section 2)

For information about office hours and links to the course resources, visit the Home page.


Motivation

Natural language processing, or NLP, is broadly the study of how to get computers to draw meaningful information from language produced by people and, in some cases, to generate language in response. The idea of having machines that work with language is old, but in the era in which we find ourselves, the technologies of NLP are all around us: in our phone assistants, search engines, social media sorting algorithms, and YouTube captions, even changing the way we write. The approaches in NLP span disciplines, from linguistics and psychology to statistics, information theory, and the theory of programming. It’s messy, and it’s impossible to do perfectly, but I think there’s so much to learn about both algorithms and people from trying. I’m excited to share that with you this semester.

Course Goals

This NLP class focuses on core concepts and problems in NLP centering on how text is stored, predicted, categorized, dismantled, and interpreted by computers. By the end of this semester, you should be able to

  • Implement classic algorithms for natural language processing such as text normalization, statistical language models, vector representations of words, word sense disambiguation, part-of-speech tagging, and text classification,

  • Read and analyze primary literature (that is, academic papers) in popular subfields of NLP,

  • Use and interpret appropriate metrics of evaluation of how NLP technologies perform in practice,

  • Critique the assumptions and approximations that are made about language when developing NLP problems, algorithms, datasets, and evaluations,

  • Practice the art of writing and presenting research, including experimental design and implementation, literature review, and effective use of LaTeX and slide decks.

Course Structure

Prerequisites

This course expects the completion of CS 81. Though it also officially requires a course in probability (and is easier with such a course), this requirement has been waived for this semester. If this doesn’t fit your prior courses, please reach out to me (Prof. Xanda) immediately.

Grading

Your grade will be computed using the following components:

  • 10% Class participation and worksheets

  • 40% Weekly Labs (first half)

  • 15% Midterm Exam

  • 10% Special Topic Presentation (second half)

  • 25% Final Project (second half)

This course will be graded using the Harvey Mudd scale (which contains neither A+ nor D-). Below are the intervals for each letter, with the interval [x, y) including all numbers greater than or equal to x and strictly less than y. (For instance, a 92.9 will be treated as an A-, while a 93 is an A.)

  • A: [93, 100]

  • A-: [90, 93)

  • B+: [87, 90)

  • B: [83, 87)

  • B-: [80, 83)

  • C+: [77, 80)

  • C: [73, 77)

  • C-: [70, 73)

  • D+: [67, 70)

  • D: [63, 67)

  • F: [0, 63)
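
As an illustration of how the component weights and letter intervals above combine, here is a minimal Python sketch. It is not an official grade calculator: the component names are hypothetical labels chosen for this example, each component score is assumed to be a percentage from 0 to 100, and no rounding is applied because the syllabus does not specify any.

```python
# Component weights from the grading list above (percent of semester grade).
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "participation": 10,
    "labs": 40,
    "midterm": 15,
    "presentation": 10,
    "final_project": 25,
}

# Inclusive lower bound of each letter's interval, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"),
]


def weighted_total(scores):
    """Combine per-component percentages (0-100) into one semester percentage."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Return the letter whose interval [lower, upper) contains total."""
    for lower, letter in CUTOFFS:
        if total >= lower:
            return letter
    return "F"  # anything below 63


# Example with made-up scores.
example = {
    "participation": 100,
    "labs": 95,
    "midterm": 80,
    "presentation": 90,
    "final_project": 88,
}
total = weighted_total(example)
print(total, letter_grade(total))  # 91.0 A-
```

Checking cutoffs from the highest down means each comparison only needs the lower bound, which matches the half-open intervals listed above.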

Schedule

The tentative course schedule can be found here. Links will be added and updated for topic sign up and lab instructions as we progress through the semester.

Readings and Worksheets

This course will combine readings from the online copy of Jurafsky and Martin’s Speech and Language Processing, academic papers, and relevant blog posts and podcasts about topics in NLP. Reading discussions will happen on Tuesdays, so you will be expected to complete weekly readings prior to Tuesday’s class. In class, you’ll get a chance both to ask about the shared technical reading (usually from Jurafsky and Martin) and to discuss in small groups a more modern reading on a subtopic you sign up for. Readings (and sign-ups for subtopics) will be posted the Friday prior to a Tuesday class.

Tuesday classes will typically be accompanied by worksheets with questions about comfort with the concepts from the required reading, a few conceptual questions, and space to reflect on your discussion in your small group. After the first week, worksheets are expected to be completed during class on Gradescope (not before) and submitted by 5 PM PST on Tuesdays. Worksheets are graded on completion to assess whether you've done the reading and thought about the content (first half of the worksheet) and how discussion went (second half of the worksheet).

Labs

Lab assignments will be posted on GitHub Classroom just prior to class on Thursday mornings, with lab descriptions linked on the Schedule page. You’ll have the chance to start these labs (individually or in groups) during Thursday’s class. While lab attendance is encouraged, it isn’t required; however, if you feel you’ll likely need to miss lab, please reach out. Assignments will typically be implemented in Python 3, with a Docker image available to ensure that you have the correct Python installation to successfully complete the assignment. More information on how to set up Docker can be found here. Labs are due the Wednesday after they are assigned at 10 PM PST on Gradescope.

Midterm

The midterm exam will consist of short-answer and multiple-choice questions covering concepts from the first half of the semester. The exam will be made available via Gradescope on Wednesday, March 31 at 10 PM PDT. It will be due Monday, April 5 at 10 PM. Once the exam is released, you may not speak to anybody other than Prof. Xanda about the exam until the exam period is over.

This exam will be distributed and collected via Gradescope. You will complete this exam individually (no team exam submissions). During the exam, you will have access only to your own notes, the course textbook, course website, and course Gradescope. Though the exam will be untimed, you will be expected to take it in one sitting; however, if you need to take a break for your own needs or the needs of those around you, you may do so (as long as you report the time on your exam).

Your exam will be submitted as a PDF using any source you wish (Markdown, LaTeX, exported from Microsoft Word, handwritten, etc.). Please make sure it is easy to locate and read your answers to each question, and be sure to mark the answer regions in Gradescope.

This exam is worth 15% of your semester grade. The breakdown of points is as follows.

  • 25% of your exam score will be for successful completion of the midterm exam review (which must include reasonable definitions for 10 terms from the reading and 2 suggested review questions).

  • 50% of your exam score will be on providing clear and correct answers to 10 short-answer questions. These questions will be primarily sourced through your submitted questions, though a few other questions will likely be added that are on par with past worksheet questions.

  • 25% of your exam score will be on answering three out of six reflective short-answer questions based on 2-3 short passages from recent papers. These questions will not have one right answer; they will be graded on whether your response meaningfully reflects on “why” questions.

When your midterm is graded, you will have 1 week in which you may rewrite your answers. Rewrites can restore up to half the points lost.
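
To make the rewrite arithmetic concrete, here is a small Python sketch of one possible reading of this policy. It assumes the rewritten exam is regraded on the same scale and that the gain is capped at half the points originally lost; the syllabus does not say whether the cap applies per question or to the whole exam, so the function name, its parameters, and the whole-exam treatment are all assumptions made for illustration.

```python
def score_after_rewrite(original, rewritten, max_points=100.0):
    """Score after a rewrite, restoring at most half of the points lost.

    Assumption: the cap applies to the exam as a whole, and a rewrite
    never lowers the original score.
    """
    lost = max_points - original
    cap = original + lost / 2  # best score a rewrite can reach
    return max(original, min(rewritten, cap))


# An 80 that is rewritten perfectly can rise to at most 90.
print(score_after_rewrite(80, 100))  # 90.0
```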

Special Topic Presentation

In the second half of the semester, you will work with a small team of ~3 students to lead a 35-minute class discussion of a modern topic in natural language processing, in which you will present the core problem and some ideas from recent work. Information about signing up for topics and expectations for these presentations will be shared when we reach that point.

Final Project

In the second half of the semester, you’ll also work on a final project in a group of 1-3 people. The final deliverables of this project will be (1) a short ACL-style paper describing your project and (2) a short recorded presentation of the highlights of your project. Your final project grade will be determined by these two deliverables, as well as several smaller milestone assignments in the second half of the semester.

Class Policies

Time Zones

I want us to have the chance to have meaningful discussions together as a class, but I know that synchronous class sessions can be hard to attend if you're not in the US. If you're in a time zone that will make attending synchronous class times on Tuesdays difficult, please reach out to me right away so we can come up with a plan that still allows you to engage in the conversation with classmates.

Extensions

If you need a 1-day extension in this class for any reason (lots of exams, busy with a family obligation, or just tired), please email me to let me know prior to the time the assignment is due. Any request for a 1-day extension will be honored, no matter the reason. If you’re working in a group, only one group member needs to request the late day.

If something comes up that will prevent you from completing classwork for multiple days (e.g. an illness or family emergency), please reach out to me as soon as you can so we can together determine a healthy plan for you to continue your coursework when it makes sense. My priority is always to make sure you can focus on the situation at hand while ensuring you won't spend the rest of the semester working to catch up again.

Accommodations

If you anticipate or experience academic barriers based on a disability (including mental health conditions and chronic or temporary medical conditions), please let me know right away so that we can privately discuss options. Any student with a documented disability who requires reasonable accommodations should contact ability@g.hmc.edu (if at Mudd) or your home college’s disability officer and have them reach out to me.

Course Conduct

Natural language processing as a field is prone to touch on topics related to culture and identity. The unique experiences you bring to the class not only strengthen our community but also actively contribute to the learning of everyone in the classroom. On the flip side, false cultural assumptions and negative comments about others both cause harm and make our collective work as conscientious NLP scholars harder.

As your instructor, I am committed to creating a classroom environment that welcomes all students, regardless of race, gender, social class, religious beliefs, etc. We all have implicit biases, and I will try to continually examine my judgments, words and actions to keep my biases in check and treat everyone fairly. I expect that you will do the same with respect to me and the other members of the class, and that you will let me know if there is anything I can do to make sure everyone is encouraged to succeed in this class.

Honor Code

All students—even those from other colleges—are expected to understand and comply with Harvey Mudd College’s Honor Code. If you haven’t already done so, you must read, sign, and abide by the computer-science department’s interpretation of the Honor Code to participate in this course. Specifically:

  • You must not exchange literal copies of material, whether that material consists of code, program output, or English-language text (e.g., documentation). You also may not copy material from published or online sources, with or without cosmetic changes (such as altering variable names), without explicit permission. If you do have permission to use externally written material, you must attribute it properly and clearly indicate which material is yours and which material is not yours. Publishing your own homework or exams from this class on the web (e.g., in a public GitHub repository) violates this policy.

  • You should not do anything that a reasonable student peer would describe as “subverting the clear intent of the assignment,” unless you have asked for and received permission to do so. Finding open-sourced code that you can use to solve an assigned problem, for example, would typically be subverting the intent of the assignment because your shortcut means that you do not learn what the assignment aims to teach.

  • If you use any sources to assist you, you must document them. For example, if you use code from an existing researcher’s code repository, paper, or blog post to build your class project, you must attribute that source in a sufficiently specific way for the course staff to easily find the original source.

  • If you aren’t sure whether something you’ve done or plan to do is allowed, you should explicitly document what you did and—if at all possible—consult with the course staff, ideally before you take the questionable action. Similarly, document any extensive or particularly important help you obtain, even if that help seems legitimate. If you’ve been helped so much that we can’t consider the work truly your own, you might not be able to get full credit for it, but proper attribution will avoid an Honor Code violation (or academic integrity case at your institution).

  • Academic integrity also involves being careful enough to avoid unintentionally breaking the rules. Thus, you must read instructions in assignments and exams carefully so that you are aware of any limitations they place on you, such as time restrictions or restrictions on information sources you may consult. Similarly, if you see something that plausibly seems like it ought to be off-limits to you, such as a GitHub directory belonging to another student or files from a previous semester, you should immediately contact us to let us know that something doesn’t seem right, rather than looking further at something that perhaps should have been off-limits.

These principles apply to all methods and media of discussion or exchange (voice, writing, email, etc.).