This is an introductory class in Natural Language Processing, where you will learn some fundamentals of this broad field. You will find most of the administrative information you need on this website.
Instructor: Niranjan Balasubramanian.
Office hours (Tentative for now):
1) Tuesdays 9:00-10:00 am - On Zoom here.
2) Thursdays 9:00-10:00 am - My office NCS 257.
TAs and their Office Hours:
TBD
Detailed syllabus with lecture topics and dates will be made available in the first week of class.
Topics
Here is a tentative list of topics and some example sub-topics within each. The topic list and order are subject to change.
NLP and Language Modeling - What are they and why are they seemingly inseparable?
Model Architectures - RNNs, LSTMs, and Transformers, Position Embeddings, Tokenization
Language Generation - Auto-regressive generation, sampling techniques, and non-auto-regressive formulations
Large Language Models - Scaling, Instruction following, Alignment, Reasoning, and Thinking models
Language Models as Agents - Agentic architectures, Retrieval Augmented Generation, Tool Use
Small Language Models - Knowledge Distillation, Synthetic Data Generation, Mixture-of-Experts, Mamba, Energy-based Models
Understanding NLP Models - Interpretability, Explainability, Factuality, and Guard railing
Efficiency, Evaluation, and Ethics - Parameter Efficient Training, Compression, Quantization, Benchmarking, and Ethics
Interspersed among these topics we will touch upon specific applications such as Machine Translation, Question Answering, and Summarization.
Themes:
We will view the language processing problem through multiple frames: the language frame, the machine learning frame, the algorithmic frame, the AI frame, and (somewhat infrequently) the systems frame.
Exams (70%)
Two midterms - 25% each
Comprehensive Final Exam - 20%
Programming Assignments (15%)
3 assignments (10 days long each)
Final Project (15%)
I will likely make adjustments to the grading scheme based on the overall performance of the class. Here is a tentative grading rubric:
A: 90 and above
A-: 80 or more but less than 90
B+: 75 or more but less than 80
B: 70 or more but less than 75
B-: 65 or more but less than 70
Five-point intervals for lower letter grades.
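To see how the tentative weights and letter cutoffs above combine, here is a minimal sketch in Python. It is only an illustration, not the official grade calculation: the names WEIGHTS, CUTOFFS, overall_score, and letter_grade are hypothetical, the example scores are made up, and it assumes every component is scored on a 0-100 scale. The scheme may be adjusted based on overall class performance, as noted above.

```python
# Weights taken from the tentative grading scheme (they sum to 1.0).
WEIGHTS = {
    "midterm1": 0.25,
    "midterm2": 0.25,
    "final_exam": 0.20,
    "assignments": 0.15,
    "project": 0.15,
}

# Lower bounds from the tentative rubric. Grades below B- are not listed
# explicitly in the rubric, so this sketch does not guess them.
CUTOFFS = [(90, "A"), (80, "A-"), (75, "B+"), (70, "B"), (65, "B-")]


def overall_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Return the first letter whose lower bound the score meets."""
    for lower_bound, letter in CUTOFFS:
        if score >= lower_bound:
            return letter
    return "below B-"


# Made-up scores, purely for illustration.
example = {
    "midterm1": 80,
    "midterm2": 90,
    "final_exam": 85,
    "assignments": 95,
    "project": 90,
}
total = overall_score(example)
print(round(total, 2), letter_grade(total))
```

With these made-up scores the weighted total is about 87.25, which falls in the A- range of the tentative rubric.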
Week 1: NLP and Language Modeling
Week 2: Classification + Representations: Logistic Regression, Word Vectors, RNNs, LSTMs, CNNs, DNNs
Week 3: The architectures of NLP - Transformers, Tokenization, and Position Embeddings + Assignment 1
Week 4: Language Generation + Machine Translation
Week 5: Large Language Models - Scaling, Prompting, Instruction Following + Midterm 1 (Feb 25)
Week 6: Large Language Models - RLHF, Alignment + Assignment 2
Week 7: Language Models as Agents: Tool Use, MCP + Final Project Proposal Due
Week 8: Spring Recess
Week 9: Language Models as Agents: Retrieval Augmented Generation
Week 10: Small Language Models: Knowledge Distillation, Synthetic Data Generation + Assignment 3
Week 11: Small Language Models: Application Areas + New Architectures
Week 12: Interpretability + Explainability + Midterm 2 (April 15)
Week 13: Factuality + Guard Railing
Week 14: Efficiency - Parameter Efficient Training, Compression, Quantization
Week 15: Evaluation + Ethics + Final Project Presentation
Final Exam: TBD
No specific course is a prerequisite.
Here is a list of things that would be useful for this class. I won't be able to respond to individual requests on whether your background is suitable. Please use the following to make your own determination.
The following are critical. If you are completely unfamiliar with them, you will likely have difficulty following the material in class.
Strongly Recommended
Basic probability and statistics (joint and conditional probabilities, Bayes' rule, etc.)
Basic linear algebra (vector and matrix operations)
Basic calculus (differential calculus)
Machine learning basics (classification, basic ML recipe)
Python programming
There are no required textbooks. The course content and structure will not follow any specific book. If I were to recommend one book, it would be:
Brightspace for coursework (assignments, course content, etc.)
Piazza for discussion forums
If you have a physical, psychological, medical, or learning disability, please contact the Department of Student Affairs. They will determine with you what accommodations, if any, are necessary and appropriate. All information and documentation of disability is confidential.
We will make every effort to support accessibility needs for all parts of the course. Please contact me via email to make specific arrangements.
Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Student Conduct and Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn.
You will get a total of 8 grace days to use for late assignment submissions. You can use them all for one assignment or split them across different assignments as you see fit.
Once you use up all your grace days, you cannot make late submissions for future assignments.
NOTE: Policy will be finalized by the end of the first week of classes.
AI Use:
Programming Assignments:
You are not allowed to use AI tools for generating, debugging, or otherwise editing code to complete the programming assignments. Code completion tools cannot be used either.
Collaboration with other students:
In this class, we encourage collaboration with other students. Whenever possible we will clearly state what forms of collaboration are allowed and what aren't. Of course, it is near impossible to list all forms of unethical or dishonest behavior. You can consult the SBU website on Academic Integrity for more information.
Cheating
Grades serve some needs in classes and can be stressful, but please don't cheat.
It is hardly worth the risk.
It is often very easily detected.
Part of your training is to learn how to make ethical decisions.
If you are under difficult circumstances of any kind, come talk to me about it.
When in doubt, cite the sources from which you got content/code/ideas and give credit to the people you worked with.
When in doubt, ask the instructor or the TAs before engaging in any specific forms of collaboration or use of outside material.
Here is the official statement from SBU on academic integrity, which I endorse and will follow for this class:
Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always wrong. Faculty is required to report any suspected instances of academic dishonesty to the Academic Judiciary. Faculty in the Health Sciences Center (School of Health Technology & Management, Nursing, Social Welfare, Dental Medicine) and School of Medicine are required to follow their school-specific procedures. For more comprehensive information on academic integrity, including categories of academic dishonesty please refer to the academic judiciary website at http://www.stonybrook.edu/commcms/academic_integrity/index.html