LIN350 Analyzing Linguistic Data: Syllabus
Course: LIN 350 Analyzing Linguistic Data, unique number 39885
Semester: Spring 2023
Course Canvas page: https://utexas.instructure.com/courses/1359438
Place and time: Tuesday/Thursday 11-12:30, CBA 4.342. Directions to CBA: click here.
Instructor: Katrin Erk. Office: RLP 4.734, email: katrin.erk@utexas.edu
Office hours: Monday 1-2 on Zoom (see Canvas for link), Tuesday 1:30-3:30 in person, RLP 4.734
Teaching Assistant: Urvi Paresh Shah. Contact information on Canvas.
Office hours: Tuesdays 3:00-4:30 PM and Wednesdays 1:30-3:00 PM.
Prerequisites: Upper-division standing.
Textbook and readings:
P. R. Hinton (2004): Statistics Explained: A Guide for Social Science Students. Psychology Press; 3rd edition, 2014
Additional required readings will be made available for download from the course website.
Flags: Quantitative Reasoning, Independent Inquiry
Course overview and objectives
Today, huge amounts of text are available in electronic form. We can poke these electronic text collections to answer questions about language, and questions about the people who use it. For example, we can test whether passive constructions are increasingly falling out of favor in English, and we can trace how words change their meaning over time. We can also study a politician's word choices in political debates to find out more about their personality, or we can see how inaugural addresses have changed over time.
This course provides a hands-on introduction to working with text data. This includes an introduction to programming in Python, with a focus on text processing and data exploration, and a "cookbook" of programming examples that will enable you to analyze texts on your own very quickly. Most of the conclusions that we want to draw from text are "risky conclusions": they are trends rather than yes-or-no answers. The course therefore also includes an introduction to statistical techniques for data exploration and for making and assessing "risky conclusions". Finally, the course includes a course project where you can test your text analysis skills on a question of your own choice.
By the end of this course, you will:
know how to use simple word counts to answer many questions about people and about language, and know how to choose the right words for counting
know how to write programs in the Python programming language to access and analyze texts
know how to visualize and graph descriptive statistics about texts
know what hypothesis tests in statistics are, know some types of hypothesis tests, and know how to implement them in practice using Python packages
know what basic regression models in statistics are, know what they are used for, and know how to implement them in practice using Python packages
be familiar with a toolkit of linguistic text preprocessing tools, and know how to use it to normalize and filter words in a text
know what hypothesis testing is, and how to use it to distinguish actual findings from random variations in the data
know how clustering and topic modeling can be used to gain a quick overview of topics and themes that appear in written texts, and know how to apply these techniques in practice using Python packages
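To give a flavor of the first objective, here is a minimal sketch of word counting in Python. The sample sentence and the list of ignored words are made up for illustration; they are not course data, and the course materials may approach this differently.

```python
from collections import Counter

# Hypothetical sample text, made up for this illustration.
text = "We the people ask not what the country can do for the people"

# Lowercase and split on whitespace: a very rough way of finding words.
words = text.lower().split()

# Count how often each word occurs.
counts = Counter(words)

# Words we choose to ignore (an illustrative list, not an official stopword list).
stopwords = {"the", "we", "not", "what", "can", "do", "for"}
content_counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(3))
print(content_counts.most_common(2))
```

Which words to count, and which to filter out, is exactly the kind of choice that shapes the answer you get.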
Quantitative Reasoning
This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.
Independent Inquiry
This course carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work.
For more information on the project you will do in this course, see below under "Course project".
Course requirements and grading
Assignments: 48% (4 assignments, 12% each)
Assignments will be made available on Canvas. Tentative assignment due dates are marked in the schedule. The homework assignments will mostly be programming assignments appropriate to beginners, with a focus on text processing and statistics, and more theoretical exercises about statistics. Homework assignments are designed to provide the foundation needed for course projects.
"Food for thought": 12% (4 mini-assignments, 3% each)
"Food for thought" assignments are smaller assignments that require you to think about larger questions such as the ethics of text processing, or that let you try out demos of text processing tools.
Course project: 35%
You will turn in an initial project report and an intermediate report, each for 5% of your grade, and a final report, for 20% of the grade. You will also do a project presentation for 5% of your grade. Tentative due dates are listed in the schedule. See below, under "Course Project", for more information on the course project.
Course projects are typically done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.
Project presentations will be in the final week of classes, in the order given on the schedule page (which will be generated via Python's random.shuffle()). If possible, all members of a project team should get some time to speak.
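For the curious, a random presentation order of this kind can be sketched as follows. The team names are placeholders, and this is only an illustration of random.shuffle(), not the exact script used for the schedule.

```python
import random

# Hypothetical team names, used only for illustration.
teams = ["Team A", "Team B", "Team C", "Team D"]

# Shuffle a copy so the original list stays unchanged.
order = list(teams)
random.shuffle(order)  # reorders the list in place and returns None

# Print the resulting presentation slots.
for slot, team in enumerate(order, start=1):
    print(slot, team)
```

Because the shuffle is random, the printed order will differ from run to run.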
This course does not have a midterm or final exam.
Attendance: 5% of grade
The course will use plus-minus grading, using the following scale:
A: >= 93%
A-: >= 90%
B+: >= 87%
B: >= 83%
B-: >= 80%
C+: >= 77%
C: >= 73%
C-: >= 70%
D+: >= 67%
D: >= 63%
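The grade components and the scale above can be combined in a short calculation. The sketch below assumes each component is scored from 0 to 100 and that no rounding is applied before the letter is looked up; the syllabus does not state a rounding rule, and the example scores are made up.

```python
# Component weights from the grading breakdown above (they sum to 100).
WEIGHTS = {
    "assignments": 48,
    "food_for_thought": 12,
    "project": 35,
    "attendance": 5,
}

# Lower bounds of the plus-minus scale, highest first.
SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"),
]

def overall_percentage(scores):
    """Weighted average of component scores, each given as 0-100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100

def letter_grade(percentage):
    """Return the first letter whose lower bound is met, or None below the listed scale."""
    for lower_bound, letter in SCALE:
        if percentage >= lower_bound:
            return letter
    return None  # the scale above lists no grades below 63%

# Hypothetical scores, for illustration only.
example = {"assignments": 90, "food_for_thought": 100, "project": 85, "attendance": 100}
total = overall_percentage(example)
print(total, letter_grade(total))
```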
Schedule
Assignments are due right before class (11am) on their due date unless noted otherwise. Assignment due dates are marked in red in the schedule.
Readings and course materials will be linked from the schedule below, from the date in question.
Unless otherwise noted, all readings can be done after class time.
This schedule is subject to change.
Week 1: Introduction
January 10: Introduction. Some simple examples of text analysis.
Counting words to find out about people:
Google trends: word counting to get a sense of what people are interested in
Word watchers: word counting in politicians' speeches, and what that tells us about them
Political text analysis: variety of politicians' vocabulary
Counting words to find out about linguistic questions:
January 12: Foundations of programming
We'll be using Jupyter Notebooks in class. To use a "Code for download" file, download it to your computer. Your computer will probably complain that it doesn't know how to open the file. This is not a problem; just ignore it. Then either open the file in Anaconda with notebooks, or open a terminal, go to the directory with the notebook, and type the command: jupyter notebook
Code for download: First steps in Python
Code for download: a notebook to check your installation
Code for download: how to use Jupyter notebooks
Week 2: Exploring and visualizing data
Jan 17: Exploring and visualizing data: the Inaugural Address collection
We finish up first steps in Python (for the code, see above)
Code for download: Exploring and visualizing data
Data set to go with it
Jan 19: Exploring and visualizing data, continued
Week 3: Python basics
Jan 24: Python programming basics: conditions, lists, and loops
Code for download: conditions, lists, and loops
Jan 26: Python programming basics: conditions, lists, and loops, continued
Food for Thought 1 due
Week 4: Python basics
Jan 31: no school, ice day
Feb 2: no school, ice day.
Homework 1 due: now Sunday end of day
Week 5: Python basics
Feb 7: We discuss your project ideas in class
Feb 9: Finishing up loops, then:
Python basics: dictionaries for word counting
Code for download: accessing files from Python (we'll only need the first part of the notebook for now)
Python dictionaries, continued
Week 6: Python spreadsheets
Feb 14: Python dictionaries, continued
Feb 16: Making your own Python spreadsheets from word counts
Code for download: Making Pandas data frames
Code for download: Merging Pandas data frames
Initial project description due
Week 7: Text processing
Feb 21: Tools for text processing: Splitting text into sentences and words, mapping words to their base form, filtering away stopwords, labeling words with their part of speech
Feb 23: Accessing texts in different writing systems, accessing texts from the web
Code for download: Accessing text data, and different writing systems
Code for download: Accessing multiple files in a directory
Homework 2 due
Week 8: Identifying themes in text: tf/idf and clustering
Feb 28: Identifying important words in a text: tf/idf and pointwise mutual information for computing word importance weights
Mar 2: Clustering to identify main themes in a text
Food for Thought 2 due
Week 9: Topic modeling
Mar 7: Clustering continued, and topic modeling
Mar 9: Topic modeling, continued
Week 10: Spring break
Week 11: Probabilities and hypothesis testing
Mar 21: Descriptive statistics, probabilities and hypothesis testing
Mar 23: Hypothesis testing, and starting on the t-test
Week 12: Hypothesis testing, and more programming
Mar 28: Finishing up the t-test. Then:
Python list comprehensions, and how to use them with Pandas.
Code for download, same as above: the t-test, the chi-squared test
Food for Thought 3 due
Mar 30: defining your own Python functions, and structuring your programs
Project progress report due
Week 13: Correlation and regression
April 4: Correlation
April 6: Linear regression
Homework 3 due
Week 14: Regression
April 11: Logistic regression
Food for Thought 4 due
April 13: Practicing regression
Week 15:
April 18: Project presentations:
11:00 Jillian Plant and Blake Griffin
11:15 Evely Ludington, William Hartman, and Keziah Reina
11:30 Olivia Tucker
11:45 Gayoung Jeon
12:00 Isabel Erwin and Gabby Garcia
Homework 4 due
April 20: Project presentations:
11:00 Anna Alvis
11:15 Tran Nguyen and Erika Gonzalez
11:30 Doan Nguyen and Harini Shanmugam
11:45 Elizabeth Pena
Final paper due date: Thursday April 27, end of day
Attendance
It is crucial for your success in this class that you attend the lectures, do the in-class exercises, and participate in in-class discussions. The TA will keep an attendance sheet. Please remember to enter your name into the attendance sheet each time you come to class. You can miss three class sessions without penalty. For each missed class session beyond three, your attendance grade will decrease by 4 out of 100 points. Exceptions to this rule (due to medical emergencies, etc.) are at the discretion of your teacher. An important rule of thumb for an extension-related conversation is to be communicative, be proactive, and let us know ahead of time.
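The attendance rule can be written as a small calculation. This is a sketch: the function name is made up, and the assumption that the score cannot drop below 0 is ours, not stated in the policy.

```python
def attendance_grade(missed_sessions, free_absences=3, penalty=4):
    """Attendance score out of 100: 4 points off per missed session beyond three."""
    penalized = max(0, missed_sessions - free_absences)
    # Assumption: the score is floored at 0 rather than going negative.
    return max(0, 100 - penalty * penalized)

# Hypothetical example: five missed sessions means two are penalized.
print(attendance_grade(5))
```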
Extension policy
If you turn in your assignment late and we have not agreed on an extension beforehand, expect points to be deducted. Extensions will be considered on a case-by-case basis. I urge you to let me know if you are in need of an extension, so that we can make sure that you get the time necessary to complete the assignments.
If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late.
Note that there are always some points to be had, even if you turn in your assignment late. So if you would like to know if you should still turn in the assignment even though it is late, the answer is yes. The last class day in the semester (April 24, 2023) is the last day to turn in late assignments for grading.
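The default late deduction described above can be sketched as a calculation. The wording leaves some room for interpretation, so this sketch makes two assumptions: only completed 24-hour periods count, and the additional 1-point deductions begin after the second such period. The function name is made up for illustration.

```python
import math

def late_penalty(hours_late):
    """Points (out of 100) deducted for a late assignment without an agreed extension.

    Assumption: only completed 24-hour periods count, and the extra
    1-point deductions start after the second such period.
    """
    if hours_late <= 0:
        return 0  # on time: nothing deducted
    full_periods = math.floor(hours_late / 24)
    return 5 + max(0, full_periods - 2)

# Hypothetical example: an assignment turned in five days (120 hours) late.
print(late_penalty(120))
```

Under these assumptions, even a very late assignment loses only a modest number of points, which is why turning it in is still worthwhile.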
Academic honesty
Students who violate University rules on scholastic dishonesty are subject to disciplinary penalties, including the possibility of failure in the course and/or dismissal from the University. Since such dishonesty harms the individual, all students and the integrity of the University, policies on scholastic dishonesty will be strictly enforced. For further information, please visit the Office of Student Conduct and Academic Integrity website at http://deanofstudents.utexas.edu/conduct/.
Notice about students with disabilities
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 471-6259.
Notice about missed work due to religious holy days
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Emergency Evacuation Policy
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
Behavior Concerns Advice Line (BCAL)
If you are worried about someone who is acting differently, you may use the Behavior Concerns Advice Line to discuss by phone your concerns about another individual's behavior. This service is provided through a partnership among the Office of the Dean of Students, the Counseling and Mental Health Center (CMHC), the Employee Assistance Program (EAP), and The University of Texas Police Department (UTPD). Call 512-232-5050 or visit http://www.utexas.edu/safety/bcal
Senate Bill 212 and Title IX Reporting Requirements
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
Use of E-mail for Official Correspondence to Students
All students should be familiar with the University’s official e-mail student notification policy. It is the student’s responsibility to keep the University informed as to changes in his or her e-mail address. Students are expected to check e-mail on a frequent and regular basis in order to stay current with University-related communications, recognizing that certain communications may be time-critical. The complete text of this policy and instructions for updating your e-mail address are available at http://www.utexas.edu/its/policies/emailnotify.html.
Course project
Course project requirements:
Initial project description:
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It is enough for each team to submit one single project description. Just put all team members' names.
The document needs to contain the following information:
Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
Method: What are the relevant words, multi-word expressions, or constructions you need to analyze? What descriptive data analyses do you plan to do? Do you plan to do statistical significance tests, and do you know already which ones will be the right ones? (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
Data: It is vital that you figure out as early as possible what data you can use. Is there enough data? Is it freely available? Do you have to contact someone to get it?
Splitting the work: Who in the team will be doing what?
Intermediate report:
Submit the following document, which should be the same for the whole team: a 1-2 page document (single-spaced, single column) that describes what the status of your project is at this point. This is a revised version of your initial project description, which needs to take into account the feedback you got on the initial description.
Research questions: any changes?
Method: any changes?
Status:
Describe the data that was obtained: source, size, anything else that is relevant
Describe at least two (smaller, and preliminary) concrete results that you have at this point
In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project so far.
Short presentation:
This is a short presentation to the class. You should discuss:
Research questions/linguistic phenomena/main questions you are addressing
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data: source, size (say how many words overall you have)
Results
You will need to prepare slides for this, which you submit to the instructor ahead of time.
It is okay if you don't have all results in place at this point. This does not lead to points being taken away for the presentation.
Final report:
Submit the following document, which should be the same for the whole team:
A 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:
Research questions/linguistic phenomena covered/main ideas pursued
Data: source, size, other relevant statistics
Method
Findings
If you build on previous work, you need to discuss it, and give references.
Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.
In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project.
Course project ideas
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:
How do people with different political affiliations talk about the same topic? Do they use different words? To study this, you can use word association weights, clustering and topic modeling to identify themes in documents
Themes in song lyrics for different genres: To study this, you can use word association weights, clustering and topic modeling to identify themes in documents
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Language and ratings: What kinds of words are being used to describe, for example, cheap versus expensive wines?
Some ideas from Language Log's breakfast experiments:
Which words are used to describe white and black NFL prospects? Links here, here (data for download in the 2nd link)
State of the Union: what are signature words of Obama, of earlier presidents? (And why?)
The statistics of real estate listings: linking real estate price to the language in the descriptions
Contrasting "almost" and "nearly": discussed here, here, here, and here
Noah Smith has a few nice datasets to analyze:
Movie corpus: predicting movie revenue from review texts
Congressional bill corpus: predicting whether a bill will survive from the text in the bill
Corporate reports corpus: predicting how well a company will do from the annual reports that it issues
Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
For your course project, you will need to apply statistical analyses yourself. Google books n-gram charts, while pretty, do not count.
Useful links
List of software we will use in the class
Python and Python packages:
We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Alternatively, you can individually install:
Python: Any version >= 3.4 should be fine.
pandas: https://pandas.pydata.org/
numpy: https://numpy.org/
matplotlib: https://matplotlib.org/
Statsmodels: https://www.statsmodels.org/stable/index.html
The Natural Language Toolkit.
Please follow the instructions here to download NLTK data. It is enough to download the "popular" datasets; you don't need all of them.
To test your Python installation, use this Jupyter notebook.
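As a quick complement, the following sketch checks whether the packages listed above can be found by your Python installation. It is not the course's installation notebook, and the function name is made up for illustration; it only checks that each package is present, not that it works correctly.

```python
import importlib.util
import sys

# Import names of the packages listed above (gensim is the one Anaconda users add by hand).
PACKAGES = ["pandas", "numpy", "matplotlib", "statsmodels", "nltk", "gensim"]

def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The software list above asks for Python 3.4 or later.
python_ok = sys.version_info >= (3, 4)
missing = missing_packages(PACKAGES)

print("Python version OK:", python_ok)
print("Missing packages:", missing if missing else "none")
```

If a package shows up as missing, install it as described above and run the check again.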
Using Jupyter notebooks
Jupyter notebooks that we use in class are listed on the class schedule page. Click on a Jupyter notebook link there to download the file. To access the notebook:
put the file in some directory
In Anaconda, click "jupyter", then navigate to the directory with the notebook file, and select the file
Or, if you're not using Anaconda, open a terminal, go to the directory where you put the notebook file, and type
jupyter notebook
This will open a tab in your browser where you can select a notebook file to work with.
For info on how to format text and write code in Jupyter notebooks, see this Jupyter notebook.
Learning Python
Learning Python:
How To Think Like A Computer Scientist is a very good and accessible online Python textbook
General Python pages:
The Natural Language Toolkit:
Fun with statistics
Language Log: a language and linguistics blog written by Mark Liberman and others
Bad science: Ben Goldacre's blog with lots of illustrations of what not to do in statistics
xkcd: A webcomic of romance, sarcasm, math, and language.