Syllabus

CS 159 - Natural Language Processing

He then led me to the frame, about the sides, whereof all his pupils stood in ranks. It was twenty feet square, placed in the middle of the room. The superfices was composed of several bits of wood, about the bigness of a die, but some larger than others. They were all linked together by slender wires. These bits of wood were covered, on every square, with paper pasted on them; and on these papers were written all the words of their language, in their several moods, tenses, and declensions; but without any order. The professor then desired me “to observe; for he was going to set his engine at work.” The pupils, at his command, took each of them hold of an iron handle, whereof there were forty fixed round the edges of the frame; and giving them a sudden turn, the whole disposition of the words was entirely changed. He then commanded six-and-thirty of the lads, to read the several lines softly, as they appeared upon the frame; and where they found three or four words together that might make part of a sentence, they dictated to the four remaining boys, who were scribes. This work was repeated three or four times, and at every turn, the engine was so contrived, that the words shifted into new places, as the square bits of wood moved upside down.

Jonathan Swift, Gulliver’s Travels (1726)

The Basics

Instructor: Prof. Xanda Schofield

Office Hours: (tentative) Tuesday 2-3:30 PM, Friday 10-11:30 AM

Class Time: M/W 9:35-10:50 AM PDT

For information about office hours and links to the course resources, visit the Home page.

Motivation

Natural language processing, or NLP, is broadly the study of how to get computers to draw meaningful information from language produced by people, and, in some cases, to generate language in response. The idea of having machines that work with language is old, but in the era we find ourselves in, the technologies of NLP are all around us: in our phone assistants, search engines, social media sorting algorithms, YouTube captions, even changing the way we write. The approaches in NLP span disciplines, from linguistics and psychology to statistics, information theory, and the theory of programming. It’s messy, and it’s impossible to do perfectly, but I think there’s so much to learn about both algorithms and people from trying. I’m excited to explore it with you this semester.

Course Goals

This NLP class focuses on core concepts and problems in NLP centering on how text is stored, predicted, categorized, dismantled, and interpreted by computers. By the end of this semester, you should be able to

  • Implement classic algorithms for natural language processing such as text normalization, statistical language models, vector representations of words, word sense disambiguation, part-of-speech tagging, and text classification,

  • Read and analyze primary literature (that is, academic papers) in the field of NLP in popular subfields,

  • Use and interpret appropriate metrics of evaluation of how NLP technologies perform in practice,

  • Critique the assumptions and approximations that are made about language when developing NLP problems, algorithms, datasets, and evaluations,

  • Practice the art of writing and presenting research, including experimental design and implementation, literature review, and effective use of LaTeX and slide decks.

Course Structure

Prerequisites

This course expects the completion of CS 81 and a course covering introductory topics in probability (which can include MATH 62, MATH 157, or another appropriate course at the instructor's discretion). If this doesn’t fit your prior courses, please reach out to me (Prof. Xanda) immediately.

Grading

Your grade will be computed using the following components:

  • 10% Class participation and worksheets

  • 40% Weekly Labs (first half)

  • 15% Midterm Exam

  • 10% Special Topic Presentation (second half)

  • 25% Final Project (second half)

This course will be graded using the Harvey Mudd scale (which contains neither A+ nor D-). Below are the intervals for each letter, with the interval [x, y) including all numbers greater than or equal to x and strictly less than y. (For instance, a 92.9 will be treated as an A-, while a 93 is an A.)

  • A: [93, 100]

  • A-: [90, 93)

  • B+: [87, 90)

  • B: [83, 87)

  • B-: [80, 83)

  • C+: [77, 80)

  • C: [73, 77)

  • C-: [70, 73)

  • D+: [67, 70)

  • D: [63, 67)

  • F: [0, 63)
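As an unofficial illustration of how the weights and intervals above combine, here is a minimal Python sketch. The component names in the dictionary and the two function names are made up for this example, each component score is assumed to be on a 0-100 scale, and no rounding is applied, since this syllabus does not specify a rounding rule.

```python
# Component weights listed under Grading (they sum to 1.0).
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "participation": 0.10,
    "labs": 0.40,
    "midterm": 0.15,
    "presentation": 0.10,
    "project": 0.25,
}

# Lower bound of each letter interval, highest first.
# Anything below 63 falls through to an F.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"),
]


def weighted_score(scores):
    """Combine per-component scores (each assumed 0-100) into one score."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 score to a letter using the half-open intervals above."""
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "F"


if __name__ == "__main__":
    example = {
        "participation": 100,
        "labs": 92,
        "midterm": 85,
        "presentation": 90,
        "project": 95,
    }
    total = weighted_score(example)
    print(round(total, 2), letter_grade(total))
```

Because each interval includes its lower bound and excludes its upper bound, `letter_grade(92.9)` gives an A- while `letter_grade(93)` gives an A, matching the example in the text.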

Schedule

The tentative course schedule can be found here. Links will be added and updated for topic sign up and lab instructions as we progress through the semester.

Readings and Worksheets

This course will combine readings from the online copy of Jurafsky and Martin’s Speech and Language Processing, academic papers, and relevant blog posts and podcasts about topics in NLP. Reading discussions will happen on Monday, so you will be expected to complete weekly readings prior to Monday’s class. In class, you’ll get a chance both to ask about the shared technical reading (usually from Jurafsky and Martin) and to discuss in small groups a more modern reading on a subtopic you sign up for. Readings (and sign-ups for subtopics) will be posted the Thursday prior to a Monday class.

When you complete the reading, you'll fill out a short worksheet on Gradescope with questions about comfort with the concepts from the required reading, a few conceptual questions, and a check-in about how things are going. Worksheets are expected to be completed by Monday class. Worksheets are graded on completion to assess whether you've done the reading and thought about the content. Additionally, during Monday class, you'll have the opportunity to write some notes down about what you learned in discussion to turn in; these will also be graded based on completion as part of your class participation grade.

Labs

Lab assignments will be posted on GitHub Classroom just prior to class on Thursday mornings, with lab descriptions linked on the Schedule page. You’ll have the chance to start these labs in class (individually or in groups) during Thursday class. While lab attendance is encouraged, it isn’t required; however, if you feel you’ll need to miss lab, please reach out to let me know. Assignments will typically be implemented in Python 3, with a Docker image available to ensure that you have the correct Python installation to successfully complete the assignment. More information on how to set up Docker can be found here. Labs are due the Tuesday after they are assigned at 10 PM PST on Gradescope.

Midterm

The midterm exam will consist of short-answer and multiple-choice questions covering concepts from the first half of the semester. The exam will be made available via Gradescope on Thursday, November 4 at 10 PM. It will be due Tuesday, November 9 at 5 PM. Once the exam is released, you may not speak to anybody other than Prof. Xanda about the exam until the exam period is over.

This exam will be distributed and collected via Gradescope. You will complete this exam individually (no team exam submissions). During the exam, you will have access only to your own notes, the course textbook, course website, and course Gradescope. Though the exam will be untimed, you will be expected to take it in one sitting; however, if you need to take a break for your own needs or the needs of those around you, you may do so as long as you report the break time on your exam.

Your exam will be submitted as a PDF using any source you wish (Markdown, LaTeX, exported from Microsoft Word, handwritten, etc.). Please make sure it is easy to locate and read your answers to each question, and be sure to mark the answer regions in Gradescope.

This exam is worth 15% of your semester grade. The breakdown of points is as follows.

  • 25% of your exam score will be on successful completion of the midterm exam review (which must include reasonable definitions for 10 terms from the reading + 2 suggested review questions).

  • 50% of your exam score will be on providing clear and correct answers to 10 short-answer questions. These questions will be primarily sourced through your submitted questions, though a few other questions will likely be added that are on par with past worksheet questions.

  • 25% of your exam score will be on answering three out of six reflective short-answer questions based on 2-3 short passages of recent papers. These questions will not have one right answer and will be based on having meaningfully reflected on “why” questions in your response.

When your midterm is graded, you will have one week in which you may rewrite your answers. Rewrites can restore up to half the points lost.
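To make the rewrite arithmetic concrete, here is a small sketch of the best score a rewrite could reach. It assumes the rule is applied to the exam total, on a hypothetical 100-point scale. Whether points are restored per question or overall is not stated here, so treat this as an illustration only; the function name is invented for the example.

```python
def best_score_after_rewrite(original_score, total_points=100.0):
    """Upper bound on the exam score after a rewrite.

    Assumes (hypothetically) that the rule "restore up to half the
    points lost" is applied to the exam total rather than per question.
    """
    points_lost = total_points - original_score
    return original_score + points_lost / 2


if __name__ == "__main__":
    # An original score of 80 loses 20 points, so a rewrite can add up to 10.
    print(best_score_after_rewrite(80))
```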

Special Topic Presentation

In the second half of the semester, you will work with a small team of 3-4 students to lead a 35-minute class discussion of a modern topic in natural language processing, in which you will present the core problem and some ideas from recent work. Information about signing up for topics and expectations for these will be shared when we reach that point.

Final Project

In the second half of the semester, you’ll also work on a final project in a group of 1-3 people. The final deliverables of this project will be (1) a short 3-4 page ACL-style paper describing your project and (2) a short presentation of the highlights of your project. Your final project grade will be determined by these two deliverables, as well as several smaller milestone assignments in the second half of the semester. Final deliverables will be due on Gradescope at the end of the corresponding final exam period for our class, December 16 at 12 PM.

Class Policies

Extensions

If you need a 1-day extension in this class for any reason (lots of exams, busy with a family obligation, or just tired), please email me to let me know prior to the time the assignment is due. Any request for a 1-day extension on something besides an exam or the final project submission will be honored, no matter the reason. If you’re working in a group, only one group member needs to request the late day.

If something comes up that will prevent you from completing classwork for multiple days (e.g. an illness or family emergency), please reach out to me as soon as you can so we can together determine a healthy plan for you to continue your coursework when it makes sense. My priority is always to make sure you can focus on the situation at hand while ensuring you won't spend the rest of the semester working to catch up again.

Accommodations

HMC is committed to providing an inclusive learning environment and support for all students. As we return to in-person instruction, we recognize that the challenges facing students may be different and student accommodation needs may change. Students with a disability (including mental health, chronic or temporary medical conditions) who may need accommodations in order to fully participate in this class are encouraged to contact the Office of Accessible Education at access@g.hmc.edu to request accommodations. Students from the other Claremont Colleges should contact their home college's Accessible Education officer.

For my part, I am very open to making alternate plans if accommodations should be needed for any reason. I mainly just need time to respond and agreement with you on what the plan is so that you don't end up with more on your plate at once than is reasonable. I value your privacy: you don't have to disclose more than you want to about why you are requesting an accommodation, though for some more substantive types of accommodations in this class, I will ask that you have the Accessible Education officer or a relevant dean reach out to me to confirm that a plan makes sense.

Course Conduct

Natural language processing as a field is prone to touch on topics related to culture and identity. The unique experiences you bring to the class not only strengthen our community but also actively contribute to the learning of everyone in the classroom. On the flip side, false cultural assumptions and negative comments about others both cause harm and make our collective work as conscientious NLP scholars harder.

As your instructor, I am committed to creating a classroom environment that welcomes all students, regardless of race, gender, social class, religious beliefs, etc. We all have implicit biases, and I will try to continually examine my judgments, words and actions to keep my biases in check and treat everyone fairly. I expect that you will do the same with respect to me and the other members of the class, and that you will let me know if there is anything I can do to make sure everyone is encouraged to succeed in this class.

Honor Code

All students—even those from other colleges—are expected to understand and comply with Harvey Mudd College’s Honor Code. If you haven’t already done so, you must read, sign, and abide by the computer-science department’s interpretation of the Honor Code to participate in this course. Specifically:

  • You must not exchange literal copies of material, whether that material consists of code, program output, or English-language text (e.g., documentation). You also may not copy material from published or online sources, with or without cosmetic changes (such as altering variable names), without explicit permission. If you do have permission to use externally written material, you must attribute it properly and clearly indicate which material is yours and which material is not yours. Publishing your own homework or exams from this class on the web (e.g., in a public GitHub repository) violates this policy.

  • You should not do anything that a reasonable student peer would describe as “subverting the clear intent of the assignment,” unless you have asked for and received permission to do so. Finding open-sourced code that you can use to solve an assigned problem, for example, would typically be subverting the intent of the assignment because your shortcut means that you do not learn what the assignment aims to teach.

  • If you use any sources to assist you, you must document them. For example, if you use code from an existing researcher’s code repository, paper, or blog post to build your class project, you must attribute that source in a sufficiently specific way for the course staff to easily find the original source.

  • If you aren’t sure whether something you’ve done or plan to do is allowed, you should explicitly document what you did and—if at all possible—consult with the course staff, ideally before you take the questionable action. Similarly, document any extensive or particularly important help you obtain, even if that help seems legitimate. If you’ve been helped so much that we can’t consider the work truly your own, you might not be able to get full credit for it, but proper attribution will avoid an Honor Code violation (or academic integrity case at your institution).

  • Academic integrity also involves being careful enough to avoid unintentionally breaking the rules. Thus, you must read instructions in assignments and exams carefully so that you are aware of any limitations they place on you, such as time restrictions or restrictions on information sources you may consult. Similarly, if you see something that plausibly seems like it ought to be off-limits to you, such as a GitHub directory belonging to another student or files from a previous semester, you should immediately contact us to let us know that something doesn’t seem right, rather than looking further at something that perhaps should have been off-limits.

These principles apply to all methods and media of discussion or exchange (voice, writing, email, etc.).