Computational Linguistics
CMSC 723 / LING 723 / INST 735
Course Staff
Jordan Boyd-Graber (Professor)
Office Hours by appointment
Philip Resnik (Professor)
Office Hours by appointment
Khanh Nguyen (TA)
Office Hours: 10:00-11:00 AM, no appointment needed (Zoom link in pinned Piazza post)
Book
Important Links
Jan 26: Welcome to CL1!
Required Lectures:
How to take a course with Jordan (Except for the pre-Corona bit about nametags, just use Zoom handles for that!)
In-class exercise [slides]
Optional Readings:
Python (and programming) introductory course
You may also want to look at the first couple of weeks (through February) of INST 414 as a review of Python and probability.
(Optional, if you want more detail) Grinstead and Snell, Chapters 1-2, Chapter 4.1
How the statistical revolution changes (computational) linguistics
Optional Lectures (Stuff you should already know):
Jan 28: Review
Required Lectures [slides]:
In class exercise [slides]
Required Readings (skim if you know this already):
Also of interest:
A great and very intuitive video introduction to Bayes' Rule, ~15min long
A great 20-minute video about Bayes' Rule and the odds ratio
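To make the rule in those videos concrete, here is a minimal worked example in Python (the test sensitivity, specificity, and prevalence numbers are invented for illustration):

```python
# Bayes' Rule: P(H|E) = P(E|H) * P(H) / P(E)
# Hypothetical numbers: a test with 99% sensitivity and a 5% false-positive
# rate, for a condition with 1% prevalence.
p_h = 0.01                  # prior P(H)
p_e_given_h = 0.99          # likelihood P(E|H)
p_e_given_not_h = 0.05      # false-positive rate P(E|not H)

# Law of total probability: overall chance of a positive test
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior: probability of the condition given a positive test
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))  # 0.167 -- far lower than the 99% sensitivity suggests
```

The counterintuitive posterior (about 1 in 6 despite a "99% accurate" test) is exactly the point the videos make about priors.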
Feb 2: Historical Background and the NLP Pipeline
Required Lectures:
A famous quote: "Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte" ("I have made this one longer only because I have not had the leisure to make it shorter"). And another: "Il meglio è l'inimico del bene" ("The best is the enemy of the good"). Such is the case with this week's video lectures. Hopefully the slightly greater length will be worthwhile in terms of providing you with interesting things to think about and discuss in class.
Some things to think about for class:
What levels of analysis are being illustrated in the Alexa examples?
In terms of those levels, where is Alexa doing well, or poorly, or not even trying?
Based on what you've seen so far, how does dealing with language computationally compare/contrast with other computational problems you've looked at -- e.g. vision, planning, gene sequencing, cryptography, ...?
Required Readings:
Other readings possibly of interest:
Tenney, Das, and Pavlick (2019), BERT Rediscovers the Classical NLP Pipeline. This reading is not intended for general class discussion but may be of interest to students who have already encountered deep learning models applied to text.
Feb 11: Deep Learning
Required Lectures:
In class exercise [Slides]
Required Readings:
Pytorch tutorial chapters 1-2
Feb 16: Words, Words, Words
Required Lectures:
Required Readings:
SLP 2 through section 2.4
Also potentially of interest:
How to do tokenization using spaCy
One of many useful discussions out there illustrating dealing with encoding issues
Converting from Windows encoding to UTF-8 while ignoring errors using iconv:
iconv -f windows-1252 -t utf-8 -c
-c When this option is given, characters that cannot be converted are silently discarded, instead of leading to a conversion error.
-f encoding, --from-code=encoding
Specifies the encoding of the input.
-t encoding, --to-code=encoding
Specifies the encoding of the output.
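The same conversion can be done in Python, which may be handier inside a script. A rough analogue of the iconv invocation above (the sample string is just illustrative; nothing here actually needs discarding):

```python
# Rough Python analogue of: iconv -f windows-1252 -t utf-8 -c
# errors="ignore" plays the role of -c: undecodable bytes are silently dropped.
windows_bytes = "café – naïve".encode("cp1252")   # pretend this came from a Windows file

text = windows_bytes.decode("cp1252", errors="ignore")  # bytes -> str
utf8_bytes = text.encode("utf-8")                       # str -> UTF-8 bytes
print(utf8_bytes.decode("utf-8"))
```

For real files, the same idea applies with open(path, encoding="cp1252", errors="ignore").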
Feb 23: Word Meaning
Required Lectures:
Word meanings, part 1 - lexicographic and ontology-based approaches
Word meanings, part 2 - inheritance, homonymy, and polysemy
Word meanings, part 3 - word sense disambiguation and non-enumerated representation
Required Readings:
SLP 18
Not required
(We were originally going to include SLP 20 for this lecture but will not)
To think about
Without looking in a dictionary, what meanings would you enumerate for the verb "break"?
Feb 25: Sequential Structure
Required Lectures:
N-gram models, part 1 - defining n-gram models
N-gram models, part 2 - understanding perplexity
N-gram models, part 3 - probability estimation and smoothing
Note that the slides I showed in these lectures are linked with Chapter 3 in the SLP book draft.
Required Readings:
Pinker, pp. 89-97
SLP 3
SLP 8 through 8.4
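To make the SLP 3 definitions of add-one smoothing and perplexity concrete, here is a bare-bones bigram model sketch (the two-sentence corpus is invented for illustration):

```python
import math
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def prob(w, prev):
    """Add-one (Laplace) smoothed bigram probability P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def perplexity(sent):
    """exp of the average negative log-probability per bigram."""
    logs = [math.log(prob(w, prev)) for prev, w in zip(sent, sent[1:])]
    return math.exp(-sum(logs) / len(logs))

print(perplexity(["<s>", "the", "cat", "sat", "</s>"]))     # seen in training: lower
print(perplexity(["<s>", "the", "dog", "barked", "</s>"]))  # unseen bigrams: higher
```

Note how the sentence with unseen bigrams gets higher perplexity, and how smoothing keeps its probability from collapsing to zero.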
Mar 2: Sequential Structure, continued
Required Lectures:
Sequence models, part 1 - The noisy channel model
Sequence models, part 2 - HMMs
Sequence models, part 3 - Forward and Viterbi
Sequence models, part 4 - Named entities
Required Readings:
SLP Appendix A (with a focus on the forward-backward algorithm)
Pinker 97-103
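As a companion to the Viterbi lecture, here is a minimal Viterbi decoder over a two-state HMM (the tag set and all probabilities are invented for illustration; real taggers estimate them from data):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under a simple discrete HMM."""
    # chart[t][s] = (best prob of any path ending in state s at time t, backpointer)
    chart = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        chart.append({})
        for s in states:
            prev = max(states, key=lambda p: chart[t - 1][p][0] * trans_p[p][s])
            chart[t][s] = (chart[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                           prev)
    # Backtrace from the best final state
    state = max(states, key=lambda s: chart[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = chart[t][state][1]
        path.append(state)
    return list(reversed(path))

# Hypothetical two-tag example; all numbers made up
states = ["DET", "NOUN"]
start_p = {"DET": 0.8, "NOUN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9},
           "NOUN": {"DET": 0.4, "NOUN": 0.6}}
emit_p = {"DET": {"the": 0.9, "cat": 0.1},
          "NOUN": {"the": 0.1, "cat": 0.9}}
print(viterbi(["the", "cat"], states, start_p, trans_p, emit_p))  # ['DET', 'NOUN']
```

Swapping max for sum in the chart recurrence gives the forward algorithm from SLP Appendix A.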
Mar 4: Syntactic Structure
Required Lectures:
Constituency syntax, part 1 - Constituents
Constituency syntax, part 2 - Context-free grammars
Constituency syntax, part 3 - "Movement" and slash categories
Constituency syntax, part 4 - A brief look at CCG
Constituency syntax, part 5 - An even briefer look at supervised estimation for probabilistic CFGs
If these links to Panopto do not work (they worked for me, but someone has reported an error), the videos (.mp4) should now be downloadable by anyone from a UMD account at https://umd.box.com/s/ypl9lrv6u462ldxs9h6ocmzf5lgo38gq.
Required Readings:
SLP 12
Pinker 103-125
Mar 9: Syntactic Structure, continued
Required Lectures:
Parsing, part 1 - CKY preliminaries
Parsing, part 2 - CKY
Parsing, part 3 - From recognition to parsing
Parsing, part 4 - Probabilistic parsing using CKY
Parsing, part 5 - Beyond CKY
Parsing, part 6 - Brief introduction to dependency representations
Required Readings:
SLP 13
Main focus is the start of the chapter through Section 13.2
In Section 13.3, focus on the discussion of Equations 13.6-13.10 (and how they relate to CKY)
Section 13.4 will be covered in a later lecture on evaluation; you can just skim if you like
Read Section 13.5
Read Sections 13.6.3-13.6.4, with a focus on the idea of agenda-driven parsing (and again the relationship to CKY)
Recommended Readings:
SLP 14
Our main focus is on the idea of dependency representations, not dependency parsing algorithms
Recommended ahead of class
To get a feel for what dependency parsing looks like, try using the online demo of the spaCy dependency parser, e.g. to parse "the busy professor chose the morning flight to denver".
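To connect the CKY lectures to code, here is a minimal CKY recognizer sketch (the CNF grammar is a toy I made up, with the unary chain V -> VP folded into the lexicon by hand; SLP 13 covers the general case):

```python
def cky_recognize(words, lexicon, binary_rules, start="S"):
    """True iff words is derivable from start under a CNF grammar."""
    n = len(words)
    # chart[i][j]: set of nonterminals that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                  # fill in lexical cells
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):                   # shorter spans before longer ones
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # every split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= binary_rules.get((B, C), set())
    return start in chart[0][n]

# Toy CNF grammar (invented; unary chains pre-folded into the lexicon)
lexicon = {"the": {"DET"}, "cat": {"N"}, "sat": {"VP"}}
binary_rules = {("DET", "N"): {"NP"}, ("NP", "VP"): {"S"}}
print(cky_recognize("the cat sat".split(), lexicon, binary_rules))  # True
```

Going from recognition to parsing means storing backpointers in each chart cell instead of bare nonterminals, as discussed in part 3 of the lectures.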
Mar 11: Sentence Meaning
Required Lectures:
Sentence meaning, part 1 - Preliminaries
Sentence meaning, part 2 - Model-theoretic semantics
Sentence meaning, part 3 - Meaning representation languages
Sentence meaning, part 4 - Semantic analysis
Sentence meaning, part 5 - Using semantic representations
Required Readings:
Mar 16-18: Spring Break
Have fun!
Mar 23: Evaluation of NLP Systems
Required Lectures:
Evaluation, part 1 - Evaluation preliminaries and some relevant terminology
Evaluation, part 2 - Shared tasks and leaderboards
Evaluation, part 3 - Evaluation nuts and bolts
Note an error I've caught starting around 2:25 that I'll post about on Piazza.
Evaluation, part 4 - Evaluating label, set, or text outputs
Evaluation, part 5 - Evaluating structured outputs and outputs on a scale
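For the label/set evaluation material in part 4, the standard definitions fit in a few lines (the gold and predicted sets below are made up for illustration):

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall, and F1."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                    # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)         # harmonic mean
    return precision, recall, f1

# Toy example: the system predicts 3 items, 2 of which are in the gold set of 4
p, r, f = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
print(p, r, f)  # precision 2/3, recall 1/2, F1 4/7
```

The harmonic mean punishes imbalance: a system can trade precision against recall, which is why F1 (or a precision/recall curve) is reported rather than either number alone.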
Required Readings:
Resnik and Lin (2010): Evaluation of NLP Systems. In Clark, Alexander, Chris Fox, and Shalom Lappin, eds. The Handbook of Computational Linguistics and Natural Language Processing. John Wiley & Sons.
Ethayarajh, Kawin, and Dan Jurafsky. "Utility Is in the Eye of the User: A Critique of NLP Leaderboard Design." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4846-4853. 2020.
Mar 25: Pre-midterm discussion
Take-home midterm to be distributed at end of class
Mar 30: Topic Modeling
Required Lectures:
In class exercise [slides]
Required Readings:
Optional Reading:
Apr 1: Topic Modeling Continued, Variational Inference
Required Lectures:
In-class exercise [pdf]
Required Readings:
Optional Readings:
Apr 6: Neural Language Models
Required Lectures:
Required Readings:
Highly Recommended Reading:
Apr 8: Non-neural ML for sequences - Structured Perceptron & PYLM
Required Lectures:
Structured Perceptron Violation Analysis
Required Readings:
Chapter 17 of CIML (Structured Prediction)
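As a companion to CIML Chapter 17, here is a tiny structured perceptron sketch: decode the best whole tag sequence, and update weights on the feature difference between gold and predicted sequences. Everything here (the feature templates, the two-example training set) is invented for illustration, and the exhaustive argmax stands in for Viterbi on these toy inputs:

```python
from itertools import product

def features(words, tags):
    """Emission + transition indicator features for one tag sequence."""
    feats = {}
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("emit", w, t)] = feats.get(("emit", w, t), 0) + 1
        feats[("trans", prev, t)] = feats.get(("trans", prev, t), 0) + 1
        prev = t
    return feats

def score(weights, feats):
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def predict(weights, words, tagset):
    # Exhaustive argmax over tag sequences -- fine for toy inputs;
    # Viterbi replaces this in practice
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: score(weights, features(words, tags)))

def train(data, tagset, epochs=5):
    weights = {}
    for _ in range(epochs):
        for words, gold in data:
            pred = list(predict(weights, words, tagset))
            if pred != list(gold):                 # update only on violations
                for f, v in features(words, gold).items():
                    weights[f] = weights.get(f, 0.0) + v
                for f, v in features(words, pred).items():
                    weights[f] = weights.get(f, 0.0) - v
    return weights

data = [("the cat".split(), ["DET", "N"]), ("the dog".split(), ["DET", "N"])]
w = train(data, ["DET", "N"])
print(predict(w, "the cat".split(), ["DET", "N"]))
```

The "violation" framing from the lecture shows up in the update condition: weights move only when the decoder's best sequence beats (or ties) the gold sequence.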
Apr 13: Named Entities / Entity Linking / Coreference
Required Lectures:
Required Readings:
Apr 15: Machine Translation
Required Lectures:
MT part 1 - the problem of translation
MT part 2 - lexical correspondences (word translations)
MT part 3 - MT concepts via MT history at 90mph
MT part 4 - neural MT
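For the "lexical correspondences" lecture, here is a tiny EM sketch in the spirit of IBM Model 1 (the three-sentence parallel corpus is invented, and the NULL word and alignment machinery of the full model are omitted):

```python
from collections import defaultdict

# Toy parallel corpus (invented English/French pairs)
pairs = [("the house".split(), "la maison".split()),
         ("the car".split(), "la voiture".split()),
         ("the book".split(), "le livre".split())]

foreign_vocab = {f for _, fs in pairs for f in fs}
t = defaultdict(lambda: 1.0 / len(foreign_vocab))  # t[(f, e)] ~ P(f | e), uniform init

for _ in range(10):                                # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in pairs:                           # E-step: expected alignment counts
        for f in fs:
            z = sum(t[(f, e)] for e in es)         # normalize over possible alignments
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():                # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))
```

Even on three sentences, co-occurrence statistics pull "house" toward "maison" and away from "la", since "la" also shows up alongside other English words.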
Required Readings:
Exercises:
Also of interest:
Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. "A survey of cross-lingual word embedding models." Journal of Artificial Intelligence Research 65 (2019): 569-631.
Wu, S. and Dredze, M., 2020. "Are All Languages Created Equal in Multilingual BERT?" Proceedings of the 5th Workshop on Representation Learning for NLP.
Lauscher, Anne, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. "From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers." EMNLP (2020).
Apr 20: Ethical AI
Required Lectures:
Ethical AI, part 1 - Intro: Situating discussions of ethical issues and bias
Ethical AI, part 2 - Discussion of Abebe et al., Roles for Computing in Social Change
Ethical AI, part 3 - Watch "Language (technology) is power: A critical survey of 'bias' in NLP"
The conference video for this paper is short and nicely done, and the in-class discussion will be an opportunity for me to elaborate or provide my own take, so I think it makes more sense for you to get the info straight from the source!
Ethical AI, part 4 - Bias in word embeddings
Required Readings:
Abebe, Rediet, Solon Barocas, Jon Kleinberg, Karen Levy, Manish Raghavan, and David G. Robinson. "Roles for computing in social change." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 252-260. 2020.
Blodgett, Su Lin, Solon Barocas, Hal Daumé III, and Hanna Wallach. "Language (technology) is power: A critical survey of 'bias' in NLP."
Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings."
Recommended
SLP Section 6.11 (very short overview: ~2 pages)
Andrey Kurenkov, "Lessons from the PULSE Model and Discussion", The Gradient, 2020. Useful recap of, and reflections on, an important mid-2020 public discussion about bias (and more) on Twitter involving Yann LeCun, Timnit Gebru, and many others from the AI community. (As the author notes, this piece includes an element of subjectivity; it should be viewed as a starting point for thinking, not as a list of required conclusions. I highly recommend looking at the CVPR tutorial by Timnit Gebru and Emily Denton on fairness, accountability, transparency and ethics in computer vision.)
Bryson, Joanna J. "Robots should be slaves." Close Engagements with Artificial Companions: Key social, psychological, ethical and design issues 8 (2010): 63-74. Provocative and important philosophical argument that humanization of robots contributes to dehumanization of real people and encourages poor human decision making in the allocation of resources and responsibility. (Bryson is also co-author of an important paper by Caliskan et al. 2017 showing that word embeddings incorporate and amplify human biases.)
Lane, J., & Schur, C. (2010). Balancing access to health data and privacy: A review of the issues and approaches for the future. Health Services Research, 45(5 Pt 2), 1456–1467. https://doi.org/10.1111/j.1475-6773.2010.01141.x Very good discussion of privacy-related issues in connection with healthcare research, and data enclaves as a promising solution.
Gonen, Hila, and Yoav Goldberg. "Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them." NAACL 2019.
Apr 27: NLP and Computational Social Science
Required Lectures:
NLP and computational social science, part 1 - Concepts
NLP and computational social science, part 2 - Category-based lexicons
NLP and computational social science, part 3 - Ethical concepts in human subjects research
Required Readings:
SLP Ch 20 sections up to Section 20.3; Section 20.5.1
Apr 29: NLP Communication and Reviewing
Required Lectures:
Required Readings:
Optional Reading:
May 4: The Computational Linguistics Community Landscape
Required Lectures:
Suggested Readings:
List of NLP Venues https://medium.com/@robert.munro/the-top-10-nlp-conferences-f91eed97e950
JMLR's declaration of independence from the Machine Learning journal (MLJ) https://jmlr.csail.mit.edu/statement.html
How JMLR works https://blogs.harvard.edu/pamphlet/2012/03/06/an-efficient-journal/
Philip’s discussion with Jason Baldridge about NLP careers on the NLP Highlights podcast https://nlphighlights.allennlp.org/090_research_in_academia_versus_industry_with_philip_resnik_and_jason_baldridge
ImageNet https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world
Quality vs. Quantity for Annotation https://dl.acm.org/doi/abs/10.5555/1868720.1868728
Data Sheets https://arxiv.org/abs/1803.09010
Model Cards https://arxiv.org/pdf/1810.03993.pdf
May 6: Linguistics as a Scientific Endeavor
This will be a live lecture for half the class time, followed by discussion. Readings below are optional and not required.
Recommended to view ahead of time (<20min):
Recommended Readings
Marr, D.; Poggio, T. (1976). "From Understanding Computation to Understanding Neural Circuitry". Artificial Intelligence Laboratory. A.I. Memo. Massachusetts Institute of Technology. hdl:1721.1/5782. AIM-357. (Also widely cited for levels of analysis/explanation: Marr, D. (1982), Vision: A Computational Approach, San Francisco, Freeman & Co., Introduction and Chapter 1)
Simon, H. (1957), The Sciences of the Artificial, esp. Ch 1
Pinker Ch 4, How Language Works
May 11: Project Presentations
Post your presentation to Piazza (we'll provide a thread) by midnight May 9, and we'll use the time in class to ask questions.