LIN353C Introduction to Computational Linguistics

LIN353C Introduction to Computational Linguistics: Syllabus

Course: LIN 353C Introduction to Computational Linguistics, unique number 40215

Semester: Fall 2022

Course Canvas page: https://utexas.instructure.com/courses/1338579

Place and time: Wednesdays and Fridays 3-4:30pm, PAR 206. Directions to Parlin Hall: click here.

Instructor: Katrin Erk, office: RLP 4.734, email: katrin.erk@utexas.edu
Office hours: Monday 10-12 in person, RLP 4.734; Thursday 2-3 on Zoom, see Canvas for the link.

Teaching Assistant: Will Sheffield, office: W4 area of the open office space for grad students
in the linguistics department, RLP 4th floor
email: sheffieldw@utexas.edu
Office hours: Wed 1-2pm, Thu 12-2pm over Zoom -- link will be on Canvas

Prerequisites: Upper-division standing.

Textbook and readings: Jurafsky, D. and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd Edition). Prentice-Hall, 2008.
The book is also available through the Longhorn Textbook Access program -- more information below.

Additional required readings will be made available for download from the course website.

Flags: Quantitative Reasoning

Course overview and objectives

Text is everywhere, in huge amounts: books, emails, web pages, scientific papers, and so on. To be able to use the information laid down in all this text, we need technology that can help us make sense of it, for example: automatically translating texts from one language to another; building better search engines that can deal with complex questions instead of just keywords; figuring out automatically whether blogs are saying good or bad things about a particular product; extracting useful facts from repositories of scientific papers about medicine.

Computational linguistics uses mathematical and computational methods to describe how language works, and it develops methods for automatic language understanding and for language technology applications. Computational linguistics is an interdisciplinary field between linguistics and computer science.

This course gives an introduction to central problems and methods in computational linguistics in theory and practice. The course includes hands-on exercises with language processing techniques. The course also includes a short introduction to programming using the Python programming language.

By the end of this course, you will:

understand, and be able to use, core algorithms and data structures used in Natural Language Processing (NLP) to automatically analyze text (we will work on English, but pointers to tools and methods for other languages will be given)
have a theoretical understanding of some of the main techniques for machine learning that are being used in NLP. (Machine learning systems learn from examples; for instance, a system may learn, from examples of texts and their translations, how to automatically translate text.)
be able to write non-trivial programs for NLP using the Python programming language
know firsthand about some of the possibilities and difficulties of automatic meaning analysis of text
know how to implement different methods for labeling words with their part of speech (noun, adjective, verb, article, ...)
know how to describe the structure of English sentences through formal grammar, and how to put those grammars to work in practice for automatic analyses
know how to build language models, systems that know about typical word sequences in a language, which can be used for example to predict next words you want to type in text messages, or for error correction
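
As a rough illustration of the last objective above, here is a minimal sketch of a bigram language model that suggests likely next words. The toy corpus, the function names `train_bigrams` and `predict_next`, the `<s>` start marker, and the use of plain unsmoothed counts are all assumptions made for this example; they are not taken from the course materials.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is a made-up start-of-sentence marker for this sketch
        words = ["<s>"] + sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            following[previous][current] += 1
    return following

def predict_next(following, word, k=3):
    """Return up to k most frequent next words after `word`, with relative frequencies."""
    counts = following.get(word.lower())
    if not counts:
        return []  # the word was never seen with a successor in training
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]

# Hypothetical toy corpus, far too small for real use
toy_corpus = [
    "see you at the meeting",
    "see you at the office",
    "see you at the meeting tomorrow",
]
model = train_bigrams(toy_corpus)
print(predict_next(model, "the"))
```

In this toy corpus "the" is followed by "meeting" twice and "office" once, so "meeting" is ranked first. A model trained on real text would also need a strategy for word sequences it has never seen.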

Quantitative Reasoning

This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.

Course requirements and grading

Assignments: 8 assignments, each 9% of the grade, for 72 % overall.
Assignments will be made available on Canvas. Tentative assignment due dates are marked in the schedule. The homework assignments will be a mixture of programming assignments (appropriate to beginners), questions that involve using NLP algorithms and data structures, both by hand and through programming, and small-scale NLP applications.

Course Project: 23 % of grade.
You will turn in an intermediate report, for 8% of your grade, and the final report, for 15% of the grade. Tentative due dates are listed in the schedule. See below, under "Course Project", for more information on the course project.

Attendance: 5 % of grade

This course does not have a final exam or midterm exam.

The course will use plus-minus grading, using the following scale:

A: >= 93%

A-: >= 90%

B+: >= 87%

B: >= 83%

B-: >= 80%

C+: >= 77%

C: >= 73%

C-: >= 70%

D+: >= 67%

D: >= 63%
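
To see how the components and the scale fit together, here is a small sketch that combines component scores into an overall percentage and looks up the letter grade. The function names, the assumption that every component is scored from 0 to 100, and the absence of rounding are choices made for this example; actual grades are computed by the instructor.

```python
# Weights from the grading section, in percent of the course grade
WEIGHTS = {
    "assignment": 9,            # each of the 8 assignments
    "intermediate_report": 8,
    "final_report": 15,
    "attendance": 5,
}

# Lower bounds from the plus-minus scale, highest first
SCALE = [(93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
         (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D")]

def course_percentage(assignments, intermediate_report, final_report, attendance):
    """Combine component scores (each assumed to be 0-100) into an overall percentage."""
    if len(assignments) != 8:
        raise ValueError("expected scores for 8 assignments")
    total = sum(score * WEIGHTS["assignment"] for score in assignments)
    total += intermediate_report * WEIGHTS["intermediate_report"]
    total += final_report * WEIGHTS["final_report"]
    total += attendance * WEIGHTS["attendance"]
    return total / 100

def letter_grade(percentage):
    """Look up the letter grade for an overall percentage."""
    for lower_bound, letter in SCALE:
        if percentage >= lower_bound:
            return letter
    return "below D"  # the scale does not list grades under 63%

# Hypothetical scores: 90 on every assignment, 85 and 95 on the reports, full attendance
overall = course_percentage([90] * 8, 85, 95, 100)
print(overall, letter_grade(overall))
```

With these hypothetical scores the weighted total is 90.85%, which falls in the A- band.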

Textbook, and the Longhorn Textbook Access (LTA) program

The textbook for the class is: Jurafsky, D. and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd Edition). Prentice-Hall, 2008.

The materials for this class are available through the Longhorn Textbook Access (LTA) program, a new initiative between UT Austin, The University Co-op and textbook publishers to significantly reduce the cost of digital course materials for students. You are automatically opted into the program but can easily opt-out (and back in) via Canvas through the 12th class day. If you remain opted-in at the end of the 5th class day you will receive a bill through your “What I Owe” page and have until the end of the 20th class day to pay and retain access. If you do not pay by the 20th class day, you will lose access to the materials and your charge will be removed.

To summarize, students have the ability to opt-out (or back in) from class days 1 through 12. All students opted in on the 5th class day will be billed through their “What I Owe” account but can still opt-out through the 12th class day (any charges after the 5th class day would be removed). Students have through the 20th class day to pay their What I Owe account.

More information about the LTA program is available at https://www.universitycoop.com/longhorn-textbook-access

Schedule

Assignments are due right before class (3pm) on their due date unless noted otherwise. Assignment due dates are marked in red in the schedule.

Readings and course materials will be linked from the schedule below, under the date in question.

Unless otherwise noted, all readings can be done after class time.

This schedule is subject to change.

Week 1:

August 24: Introduction, and discussing the syllabus

Introductory slides are on Canvas

August 26: Introduction to programming: first steps

We'll be using Jupyter Notebooks in class. Please download notebooks to your computer. Then either open them in Anaconda with notebooks, or open a terminal, go to the directory with the notebooks, and run the command jupyter notebook
Here is a Notebook on how to use Jupyter Notebooks.
Use this notebook to check that your installation works.
Jupyter notebook: First steps in Python.

Week 2:

August 31: Text normalization and regular expressions

Reading:
- Jurafsky and Martin (2nd edition) ch. 2 sec. 2.1, which is pretty much the same as Jurafsky and Martin 3rd edition sec. 2.1.
- In the 3rd edition chapter, also look at sec. 2.2 (what are words), sec. 2.3 (corpora), and sec. 2.4.4 (stemming).
Jupyter notebook: Python regular expressions
We will use regular expressions on the OPUS online corpus collection.
Here is a paper that uses text patterns to automatically learn, say, that a broken bone IS-A injury
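
The text-pattern idea can be sketched with a Python regular expression. The "such as" pattern, the example sentence, and the function name `find_is_a_pairs` are assumptions chosen for illustration; they are not necessarily the patterns used in the linked paper.

```python
import re

# Assumed pattern for this sketch: "<category> such as <example>".
# The example is read up to the next " and ", comma, period, or end of text.
SUCH_AS = re.compile(r"(\w+) such as ([\w ]+?)(?= and |,|\.|$)")

def find_is_a_pairs(text):
    """Return (example, category) pairs suggested by the "such as" pattern."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text)]

# Hypothetical example sentence
sentence = "The clinic treats injuries such as broken bones and sprains."
for example, category in find_is_a_pairs(sentence):
    print(example, "IS-A", category)
```

The sketch returns the words as they appear in the text (here in the plural) and only picks up the first example after "such as". Handling coordinated lists and reducing words to their base form would be natural next steps.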

Sep 2: Introduction to programming: Conditions and lists

Jupyter notebook: Python conditions, lists and loops

Week 3:

Sep 7: Language models: Learning about typical word co-occurrences. We start with probabilistic n-gram language models

Due: Assignment 1

Sep 9: Introduction to programming: loops

Week 4:

Sep 14: Language models, continued

- Readings: Jurafsky and Martin (2nd edition) chapter 4: n-grams, sections 4.1-4.3. You can find a shorter version, without the discussion of counting words, in Jurafsky and Martin 3rd edition, chapter 3 section 3.1
- Jupyter notebook: an n-gram language model demo

Sep 16: Intro to programming: Dictionaries

- Jupyter notebook: Python dictionaries
- Jupyter notebook: Python list comprehensions

Week 5:

Sep 21: Yet more language models. Then: Classification: Nearest neighbors

Continuing on the same Jupyter notebook: an n-gram language model demo
Due: Assignment 2

Sep 23: Introduction to programming: making your own functions

- Jupyter notebook: Python functions
- Jupyter notebook: sorting in Python
- We also discussed this notebook on Pandas, a Python package for spreadsheet handling and graphing. The matching dataset is available on Canvas under "Files".

Week 6:

Sep 28: In-class discussion of project ideas. Bring your ideas!
Classification: Nearest neighbors, continued

Sep 30: Classification: Naive Bayes, Sentiment Analysis

Readings: Jurafsky and Martin 3rd edition chapter 4, sections 4.1 through 4.3, and 4.7, 4.8.

Week 7:

Oct 5: Naive Bayes, continued.

Jupyter notebook: Making a Naive Bayes classifier
Due: Assignment 3

Oct 7: Classification: logistic regression

- Readings: Jurafsky and Martin 3rd edition chapter 5, through 5.4. This is rather math-heavy; feel free to skim.
- Featurization: the case of word sense disambiguation

Week 8:

Oct 12: Word embeddings: Characterizations of word usage

Readings: Jurafsky and Martin 3rd edition chapter 6, through 6.5
Due: Assignment 4
Slides
Jupyter notebook: text encodings, reading from a URL, removing HTML formatting, and using XML-annotated data; with sample XML file

Oct 14: Neural word embeddings, and neural language models.

- Readings: Jurafsky and Martin 3rd edition chapter 6, sec. 6.8-6.12
- Jurafsky and Martin 3rd edition, chapter 7. This is again very math-heavy. Read section 7.1, and skim 7.3 and 7.4 through 7.4.2

Week 9:

Oct 19: Discussing word embeddings and neural language models

Due: Assignment 5

Oct 21: Word2vec demo. Then: Part-of-speech tagging.

Word2Vec:
- Notebook on using and creating prediction-based spaces
- Here is a word2vec demo that can list nearest neighbors for words
- Here is a word2vec demo that can visualize embeddings in 3-D space. To use it, enter a word on the line on the right that says "Search", and hit the "Isolate selection" button.
- Here is a demo of word occurrence embeddings from a contextualized language model, BERT
Part-of-speech tagging:
- Readings: Jurafsky and Martin 2nd edition chapter 5 through 5.4
- We will use the online NLTK book, chapter 5.
- Here is the Penn Treebank tagset.
- And here is the Brown corpus tagset.
- And here is the Jabberwocky poem.

Week 10:

Oct 26: Part-of-speech tagging continued

- Readings: Jurafsky and Martin 2nd edition chapter 5 section 5.5
- Due: Intermediate project report

Oct 28: Part-of-speech tagging, continued

Week 11:

Nov 2: Describing syntactic structure with phrase-structure grammar

- Readings: Jurafsky and Martin 2nd edition chapter 12 through 12.3.5, 12.4.1, and 12.7.1
- We will also use chapter 7 of the NLTK book, in particular section 7.2
- Due: Assignment 6

Nov 4: Parsing

Readings: Jurafsky and Martin 2nd edition chapter 13 through 13.3, plus 13.4.1

Week 12:

Nov 9: Parsing, continued.

- Readings: Jurafsky and Martin 3rd edition chapter 13, through 13.2

Nov 11: Statistical parsing. Then: Describing sentence meaning: logic-based representations and semantic roles

- Due: Assignment 7
- Semantic role resources:
  - The Unified Verb Index shows PropBank, FrameNet, and VerbNet
  - FrameNet
- Large-scale representation of natural language meaning with logic: The Groningen MeaningBank, which has a tool for inspecting sentence representations
- Abstract Meaning Representations
- Readings: Jurafsky and Martin 3rd edition, chapter 15.1; Jurafsky and Martin 3rd edition, chapter 19 through 19.5

Week 13:

Nov 16: Semantic role labeling with neural sequence models

- Readings: Jurafsky and Martin 3rd edition chapter 9 through 9.4; feel free to skim the math.

Nov 18: Discussing sentence structure, and the Chomsky hierarchy

- Due: Assignment 8

Nov 21-25 Thanksgiving break

Week 14:

Nov 30: Project presentations

3:00 Finn Haddon
3:08 Gabriela Garcia, Erika Gonzalez
3:16 Christian Coplin, Ethan Martin
3:24 Dylan Moses
3:32 Veronica King, Ella Thompson
3:40 Eloragh Espie
3:48 Sangdon Lim
3:56 Kaustub Navalady

Dec 2: Project presentations

3:00 Isabel Erwin, Annmarie Chang, Sridevi Hariharan
3:08 Keziah Reina, Arianna Rivera
3:16 Veronica Alejandro, Ethan Glass
3:24 Katie McGhee
3:32 Elizabeth Pena
3:40 Gabriela O'Connor
3:48 Nidhi Dubagunta, Malvika Vaidya

Final paper due date: Friday, December 9, 9am

Attendance

It is crucial for your success in this class that you attend the lectures, do the in-class exercises and participate in in-class discussions. The TA will keep an attendance sheet. Please remember to enter your name into the attendance sheet each time you come to class. You can miss three class sessions without penalty. For each missed class session beyond three, your attendance grade will decrease by 4 out of 100 points. Exceptions to this rule (due to medical emergencies, etc.) are at the discretion of your teacher. An important rule of thumb for an extension-related conversation: be communicative, be proactive, and let us know ahead of time.

Extension policy

If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, such that we can make sure that you get the time necessary to complete the assignments.

If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
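
The default deduction can be written out as a short calculation. This is only a sketch of one reading of the rule: it assumes that a 24-hour period counts as soon as it has started, and the function name `late_penalty` is made up for the example. Ask the instructor if you need to know exactly how a particular case is handled.

```python
import math

def late_penalty(hours_late):
    """Points (out of 100) deducted under the default late policy.

    Assumption: a 24-hour period counts as soon as it has started.
    """
    if hours_late <= 0:
        return 0  # on time: nothing is deducted
    periods = math.ceil(hours_late / 24)
    # 5 points for being late at all, plus 1 point per period beyond the second
    return 5 + max(0, periods - 2)

for hours in (0, 6, 48, 49, 120):
    print(hours, "hours late ->", late_penalty(hours), "points deducted")
```

Under this reading, anything up to 48 hours late costs 5 points, 49 hours costs 6 points, and five full days costs 8 points.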

Note that there are always some points to be had, even if you turn in your assignment late. So if you would like to know if you should still turn in the assignment even though it is late, the answer is yes. The last day in the semester on which the class meets (Dec 2, 2022) is the last day to turn in late assignments for grading.

Academic honesty

Students who violate University rules on scholastic dishonesty are subject to disciplinary penalties, including the possibility of failure in the course and/or dismissal from the University. Since such dishonesty harms the individual, all students and the integrity of the University, policies on scholastic dishonesty will be strictly enforced. For further information, please visit the Office of Student Conduct and Academic Integrity website at http://deanofstudents.utexas.edu/conduct/.

Notice about students with disabilities

The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.

Notice about missed work due to religious holy days

A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.

Emergency Evacuation Policy

Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.

Behavior Concerns Advice Line (BCAL)

If you are worried about someone who is acting differently, you may use the Behavior Concerns Advice Line to discuss by phone your concerns about another individual's behavior. This service is provided through a partnership among the Office of the Dean of Students, the Counseling and Mental Health Center (CMHC), the Employee Assistance Program (EAP), and The University of Texas Police Department (UTPD). Call 512-232-5050 or visit http://www.utexas.edu/safety/bcal

Senate Bill 212 and Title IX Reporting Requirements

Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.

Use of E-mail for Official Correspondence to Students

All students should be familiar with the University’s official e-mail student notification policy. It is the student’s responsibility to keep the University informed as to changes in his or her e-mail address. Students are expected to check e-mail on a frequent and regular basis in order to stay current with University-related communications, recognizing that certain communications may be time-critical. The complete text of this policy and instructions for updating your e-mail address are available at http://www.utexas.edu/its/policies/emailnotify.html .

Lecture capture

This class is using the Lectures Online recording system. This system records the audio and video material presented in class for you to review after class. Links for the recordings will appear in the Lectures Online tab on the Canvas page for this class. You will find this tab along the left side navigation in Canvas.

To review a recording, simply click on the Lectures Online navigation tab and follow the instructions presented to you on the page. You can learn more about how to use the Lectures Online system at http://sites.la.utexas.edu/lecturesonline/students/how-to-access-recordings/.

You can find additional information about Lectures Online at: https://sites.la.utexas.edu/lecturesonline/.

Course project

LIN 353C course projects will be done in groups of 2 students. If you want to work in a larger or smaller group, you need prior approval of the instructor.

The section "Project topic suggestions" below gives a list of possible project topics. I would suggest that you choose something from this list, but you can also choose something different.

All projects need to address an NLP problem and involve programming. All projects need to be evaluated in some form.

As you need some initial results to discuss in your intermediate progress report, I suggest that you start out with some simple rule-based approach, then improve on it, maybe with some change in technique, in the second half of the semester.

The process

Around week 6 we discuss project ideas. Be prepared to come with ideas for what you might want to do. The aim is to make sure all projects are right-sized and doable.
Around week 9: Intermediate progress reports due. We also discuss projects in class in order to clear any roadblocks and to share data and methods that may be helpful to other projects.
Around week 14: Final project reports due.
Last week of classes: In-class presentations about projects. These are not graded.

Project topic suggestions

Create a system that can guess what language a given text is written in. To start out, you could use hand-written cues, for example typical words or character sequences. For more of a challenge, use training data to learn about frequent words in each language, or frequent character sequences.
Create a dialog agent that pretends to be a specific kind of person (for example, Parry was a 20-something single guy with specific hobbies and some paranoid tendencies). This can be done with rules that hard-code behaviors, or react to particular key words. You can also add probabilities if you like, such that your agent would, for example, have a particular probability of reacting with anger.
Create a system that will extract meeting specifics (time, place) from emails. I can make available part of the Enron emails (which were made publicly available by the Federal Energy Regulatory Commission) for you to work with. Some rules encoding frequent patterns will let you extract some times and places of meetings, but time and date expressions can be surprisingly complicated. If you want a challenge, take on relative time expressions like "next Monday".
Create a morphological analyzer for a language other than English, ideally a morphologically complex one.
Create a system that can identify different types of "named entities", such as person names, locations, and organizations. To start out, you can use rules to identify named entities, for example a sequence of capitalized words of which the first is "Ms." is likely a person name. You can also train a machine learning system to do this task.
Create a grammar checker. To start out, you can use some rules, for example you could flag sentences that end in a preposition. Then you can experiment with a language model to flag unlikely word sequences.
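
To give a sense of what a simple rule-based starting point can look like, here is a minimal sketch for the first suggestion above, guessing the language of a text from hand-written cue words. The cue word lists, the three languages, and the function name `guess_language` are hypothetical choices for this example; a real project would need many more cues and an evaluation on held-out text.

```python
# Hypothetical hand-written cue words; a real system would need far more.
CUE_WORDS = {
    "English": {"the", "and", "is", "of"},
    "German": {"der", "und", "ist", "nicht"},
    "Spanish": {"el", "y", "es", "de"},
}

def guess_language(text):
    """Return the language whose cue words occur most often, or None if none occur."""
    tokens = text.lower().split()
    # For each language, count how many tokens are among its cue words
    scores = {lang: sum(token in cues for token in tokens)
              for lang, cues in CUE_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("Der Hund ist nicht hier und die Katze auch nicht"))
```

Splitting on whitespace ignores punctuation and gives no principled way to break ties, which is one reason why learning frequent words or character sequences from training data is the more interesting second step.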

Previous projects have also included:

Automatically determining the genre of a tweet, for example news, sports, entertainment
Automatically summarizing a document by automatically selecting the most important sentences
Automatically producing text in the style of some famous person, using language models to learn to mimic their style
Automatically producing a syntactic analysis of a sentence, or automatically assigning part-of-speech tags
Detecting the level of reading difficulty of a text
Authorship attribution: automatically detecting who, out of a number of possible authors, wrote a given text

Working in groups

In the intermediate report: Include a short paragraph that briefly describes which part of the project each group member is responsible for.
In the final report: The report is written jointly by the group, but needs to include a separate section from each group member that describes the part of the project done by that group member.

Requirements

You will write two documents about your course project, the intermediate report and the final report. You will also have a chance to discuss your project in the last week of classes. But this discussion will not be graded.

Intermediate report

At the time of the intermediate report, you need to have some system that addresses your problem. This can be a very simple, rule-based system. It need not be the final system.

2-3 pages
- contents:
  - Introduction with motivation: What is the project about, and why is it important?
  - What algorithms, rules, and data structures you are using
  - What corpus resources (if any) you are using
  - Initial results. At the least, this is discussion of some things your system is currently doing right or wrong. You can report some performance by some performance measure, but you do not have to.
  - If you are working in a group: who does what (as described above)

Final report

The final report is about your final system. This should improve over the system as it was at the time of the intermediate report, either by using a different technique, or by using the same technique in a more sophisticated way.

4-5 pages
- This is a revised version of your intermediate report. Do take into account all feedback that you got on your intermediate report. Do not omit introduction and motivation just because they were already in the intermediate report: The final report has to be self-contained.
- Write this as a research report to an audience of computational linguists.
- contents:
  - Introduction with motivation: What is the project about, and why is it important?
  - What algorithms, rules, and data structures you are using
  - What corpus resources (if any) you are using
  - Results: Describe as clearly as possible what it is your system can (and cannot) do. You can show examples of things your system is getting correct and of errors it is making. If you can, measure performance by some performance measure.
  - If you are working in a group: separate section describing who did what (as described above)

Resources

Are you building a supervised classification system? Then check the NLTK chapter on classification, chapter 6.

Also, you may want to use scikit-learn, a Python machine learning package.

Whatever kind of system you build, you will need to do an error analysis. Contrary to its somewhat negative-sounding name, an error analysis is not just a sad list of errors, but an in-depth look at how your system deals with the language data it sees: both what it does right and where it does something wrong. For a discussion of the general spirit of error analysis, check Emily Bender's blog post on "putting the linguistics in computational linguistics".

Useful links

List of software we will use in the class

Python and Python packages:

- We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
  - If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
  - Please do this, and choose to add gensim.
- Alternatively, you can individually install:
  - Python: Any version >= 3.4 should be fine.
  - The Natural Language Toolkit.
    - Please follow the instructions here to download NLTK data. It is enough to download the "popular" datasets; you don't need all of them.
  - Jupyter notebooks
  - scipy
  - gensim

To test your Python installation, use this Jupyter notebook.

Using Jupyter notebooks

Jupyter notebooks that we use in class are listed on the class schedule page. Click on a Jupyter notebook link there to download the file. To access the notebook:

- put the file in some directory
- In Anaconda, click "jupyter", then navigate to the directory with the notebook file, and select the file
- Or, if you're not using Anaconda, open a terminal, go to the directory where you put the notebook file, and type: jupyter notebook
- This will open a tab in your browser where you can select a notebook file to work with.

For info on how to format text and write code in Jupyter notebooks, see this Jupyter notebook.

Learning Python

Learning Python:

- How To Think Like A Computer Scientist is a very good and accessible online Python textbook
- A Python tutorial

General Python pages:

- The Python library documentation

The Natural Language Toolkit:

- The Natural Language Toolkit.
- NLTK documentation.
