LIN350 Analyzing Linguistic Data Spring 2025
Course: LIN350 Analyzing Linguistic Data, unique number 40145
Semester: Spring 2025
Course Canvas page: https://utexas.instructure.com/courses/1406622
Place and time: Tuesday/Thursday 12:30-2, WCP 5.102. Directions to WCP: click here.
Instructor: Katrin Erk. Office: RLP 4.734. Email: katrin.erk@utexas.edu
Office hours: to be announced. Until office hours are determined, please email me to set up a meeting.
Teaching Assistant: Sooji Lee. Contact information on Canvas.
Office hours: to be announced.
Prerequisites: Upper-division standing.
Textbook and readings:
Readings will be made available for download from the course website.
Flags: Quantitative Reasoning, Independent Inquiry
Course Syllabus
Course overview and objectives
Today, huge amounts of text are available in electronic form. We can poke these electronic text collections to answer questions about language, and questions about the people who use it. For example, we can test whether passive constructions are increasingly falling out of favor in English, and we can trace how words change their meaning over time. We can also study a politician's word choices in political debates to find out more about their personality, or we can see how inaugural addresses have changed over time.
This course provides a hands-on introduction to working with text data. This includes an introduction to programming in Python, with a focus on text processing and data exploration, and a "cookbook" of programming examples that will enable you to analyze texts on your own very quickly. Most of the conclusions that we want to draw from text are "risky conclusions": they are trends rather than yes-or-no answers. The course therefore also includes an introduction to statistical techniques for data exploration and for making and assessing "risky conclusions". Finally, there is a course project where you can test your text analysis skills on a question of your own choice.
By the end of this course, you will:
know how to use simple word counts to answer many questions about people and about language, and know how to choose the right words for counting
know how to write programs in the Python programming language to access and analyze texts
know how to visualize and graph descriptive statistics about texts
know what hypothesis tests in statistics are, know some types of hypothesis tests, and know how to implement them in practice using Python packages
know what basic regression models in statistics are, know what they are used for, and know how to implement them in practice using Python packages
be familiar with a toolkit of linguistic text preprocessing tools, and know how to use it to normalize and filter words in a text
know what hypothesis testing is, and how to use it to distinguish actual findings from random variations in the data
know how clustering and topic modeling can be used to gain a quick overview of topics and themes that appear in written texts, and know how to apply these techniques in practice using Python packages
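As a rough illustration of the first objective, here is a minimal sketch of counting words in a text with the Python standard library. The sample sentence and the function name count_words are made up for this example, and the simple regular-expression tokenizer is an assumption; class materials may use different tools (for example NLTK) for tokenization.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    # Assumption: a "word" is a run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up example sentence, not course data.
sample = "The cat sat on the mat. The dog sat too."
counts = count_words(sample)

# Relative frequency: occurrences of a word divided by the total token count.
total = sum(counts.values())
relative_the = counts["the"] / total

print(counts.most_common(2))
print(round(relative_the, 2))
```

Relative frequencies, rather than raw counts, are what make texts of different lengths comparable.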
Course project
Course project requirements:
Initial project description:
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. It is enough for each team to submit a single project description. Just list all team members' names on it.
The document needs to contain the following information:
Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
Method: What are the relevant words, multi-word expressions, or constructions you need to analyze? What descriptive data analyses do you plan to do? Do you plan to do statistical significance tests, and do you know already which ones will be the right ones? (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
Data: It is vital that you figure out as early as possible what data you can use. Is there enough data? Is it freely available? Do you have to contact someone to get it?
Splitting the work: Who in the team will be doing what?
Intermediate report:
Submit the following document, which should be the same for the whole team: a 1-2 page document (single-spaced, single column) that describes what the status of your project is at this point. This is a revised version of your initial project description, which needs to take into account the feedback you got on the initial description.
Research questions: any changes?
Method: any changes?
Status:
Describe the data that was obtained: source, size, anything else that is relevant
Describe at least two (smaller, and preliminary) concrete results that you have at this point
In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project so far.
Short presentation:
This is a short presentation to the class. You should discuss:
Research questions/linguistic phenomena/main questions you are addressing
Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
Data: source, size (say how many words overall you have)
Results
You will need to prepare slides for this, which you submit to the instructor ahead of time.
It is okay if you don't have all results in place at this point. This does not lead to points being taken away for the presentation.
Final report:
Submit the following document, which should be the same for the whole team:
A 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:
Research questions/linguistic phenomena covered/main ideas pursued
Data: source, size, other relevant statistics
Method
Findings
If you build on previous work, you need to discuss it, and give references.
Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.
You need to take into account the feedback that you got on the Initial project description and Intermediate report.
In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project.
Course project ideas
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:
How do people with different political affiliations talk about the same topic? Do they use different words? To study this, you can use word association weights, clustering, and topic modeling to identify themes in documents
Themes in song lyrics for different genres: To study this, you can use word association weights, clustering and topic modeling to identify themes in documents
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Language and ratings: What kinds of words are being used to describe, for example, cheap versus expensive wines?
Some ideas from Language Log's breakfast experiments:
Which words are used to describe white and black NFL prospects? Links here, here (data for download in the 2nd link)
State of the Union: what are signature words of Obama, of earlier presidents? (And why?)
The statistics of real estate listings: linking real estate price to the language in the descriptions
Contrasting "almost" and "nearly": discussed here, here, here, and here
Noah Smith has a few nice datasets to analyze:
Movie corpus: predicting movie revenue from review texts
Congressional bill corpus: predicting whether a bill will survive from the text in the bill
Corporate reports corpus: predicting how well a company will do from the annual reports that it issues
Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
For your course project, you will need to apply statistical analyses yourself. Google Books n-gram charts, while pretty, do not count.
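To give a sense of what applying a statistical analysis yourself can look like, here is a minimal sketch of a permutation test that checks whether a word is used at different rates in two groups of documents. This is only one possible technique and not necessarily the test that fits your project. The numbers are made-up per-document rates (occurrences per 1,000 words), and the function name permutation_test is hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed absolute difference and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign documents to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up rates of some word per 1,000 words, one number per document.
rates_group_a = [4.1, 3.8, 5.0, 4.6, 4.4]
rates_group_b = [2.9, 3.1, 2.5, 3.6, 3.0]

observed, p_value = permutation_test(rates_group_a, rates_group_b)
print("observed difference:", round(observed, 2))
print("estimated p-value:", round(p_value, 4))
```

A small p-value says that a difference this large would rarely arise if the group labels were assigned at random. It is still a "risky conclusion", not a proof.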
Useful links
List of software we will use in the class
Python and Python packages:
We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Alternatively, you can individually install:
Python: Any version >= 3.4 should be fine.
pandas: https://pandas.pydata.org/
numpy: https://numpy.org/
matplotlib: https://matplotlib.org/
Statsmodels: https://www.statsmodels.org/stable/index.html
The Natural Language Toolkit.
Please follow the instructions here to download NLTK data. It is enough to download the "popular" datasets; you don't need all of them.
A third alternative: You can also execute your jupyter notebooks online using Google colab. The downside of this is that it's a bit of a pain to upload the data that you want to work with.
To test your Python installation, use this Jupyter notebook.
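As a quick complement to the test notebook (not a replacement for it), the following sketch checks the Python version and whether the packages listed above can be found. The import names in REQUIRED are assumed to be the usual ones for these packages, and the function name check_installation is made up for this example.

```python
import importlib.util
import sys

# Assumption: these are the import names of the packages listed above.
REQUIRED = ["pandas", "numpy", "matplotlib", "statsmodels", "nltk", "gensim"]


def check_installation(packages=REQUIRED):
    """Map each package name to True if Python can locate it, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in packages
    }


# The syllabus asks for Python 3.4 or newer.
python_ok = sys.version_info >= (3, 4)
status = check_installation()

print("Python version OK:", python_ok)
for name, found in status.items():
    print(name, "found" if found else "MISSING")
```

This only checks that a package can be located, not that it works or that NLTK data has been downloaded.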
Using Jupyter notebooks
We'll be using Jupyter Notebooks in class. To use a "Code for download" file, download it to your computer. Your computer will probably complain that it doesn't know how to open the file. This is not a problem; ignore it. You then have multiple options for opening the file: (1) If you have Anaconda on your computer, you can open the file with Jupyter Notebook. (2) If you have Anaconda, or another Python installation, on your system, you can open a terminal, go to the directory with the notebook, and type the command jupyter notebook, or the command python -m jupyterlab. (3) If you have a Google Colab account, you can open the file online in Colab by selecting "Upload" from the left-hand side menu.
For info on how to format text and write code in Jupyter notebooks, see this Jupyter notebook.
Learning Python
Learning Python:
How To Think Like A Computer Scientist is a very good and accessible online Python textbook
General Python pages:
The Natural Language Toolkit:
Fun with statistics
Language Log: a language and linguistics blog written by Mark Liberman and others
xkcd: A webcomic of romance, sarcasm, math, and language.