LIN350 Analyzing Linguistic Data Spring 2025

Course Syllabus

Course overview and objectives

Course project

Course project requirements:

Course project ideas

Useful links

List of software we will use in the class

Using Jupyter notebooks

Learning Python

Course: LIN350 Analyzing Linguistic Data, unique number 40145 (linked document: docs.google.com/document/d/1YJ_Jw9GzQ6jdIUy2vwAJVwmJt3V3nCNyMuwxK5PeNWY/edit?usp=sharing)

Semester: Spring 2025

Course Canvas page: https://utexas.instructure.com/courses/1406622

Place and time: Tuesday/Thursday 12:30-2, WCP 5.102. Directions to WCP: click here.

Instructor: Katrin Erk. Office: RLP 4.734, email: katrin.erk@utexas.edu
Office hours: to be announced. Until office hours are determined, please email me to set up a meeting.

Teaching Assistant: Sooji Lee. Contact information on Canvas.
Office hours: to be announced.

Prerequisites: Upper-division standing.

Textbook and readings:

Readings will be made available for download from the course website.

Flags: Quantitative Reasoning, Independent Inquiry

Course Syllabus

Link

Course overview and objectives

Today, huge amounts of text are available in electronic form. We can poke these electronic text collections to answer questions about language, and questions about the people who use it. For example, we can test whether passive constructions are increasingly falling out of favor in English, and we can trace how words change their meaning over time. We can also study a politician's word choices in political debates to find out more about their personality, or we can see how inaugural addresses have changed over time.

This course provides a hands-on introduction to working with text data. It includes an introduction to programming in Python, with a focus on text processing and data exploration, and a "cookbook" of programming examples that will enable you to analyze texts on your own very quickly. Most of the conclusions that we want to draw from text are "risky conclusions": they are trends rather than yes-or-no answers. The course therefore also includes an introduction to statistical techniques for data exploration and for making and assessing "risky conclusions". Finally, there is a course project in which you can test your text analysis skills on a question of your own choice.
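To give a flavor of what assessing a "risky conclusion" can look like, here is a minimal sketch of a permutation test in plain Python. It is only an illustration, not necessarily the method used in class: the function name permutation_test and the numbers in the two lists are invented for this example. The idea is to ask how often a difference at least as large as the observed one shows up when the group labels are shuffled at random.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Return the share of random relabelings whose absolute difference
    in means is at least as large as the observed difference."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two new "groups"
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Invented numbers for illustration only: passive constructions per
# 1,000 words in six hypothetical older and six hypothetical newer texts.
older_texts = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
newer_texts = [10.2, 11.0, 9.8, 10.9, 10.4, 11.1]

p_value = permutation_test(older_texts, newer_texts)
print("Estimated p-value:", p_value)
```

A small value suggests that the difference between the two groups is unlikely to be just random variation; a large value means the data do not let us tell the difference apart from chance.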

By the end of this course, you will:

- know how to use simple word counts to answer many questions about people and about language, and know how to choose the right words for counting
- know how to write programs in the Python programming language to access and analyze texts
- know how to visualize and graph descriptive statistics about texts
- know what hypothesis tests in statistics are, know some types of hypothesis tests, and know how to implement them in practice using Python packages
- know what basic regression models in statistics are, know what they are used for, and know how to implement them in practice using Python packages
- be familiar with a toolkit of linguistic text preprocessing tools, and know how to use it to normalize and filter words in a text
- know what hypothesis testing is, and how to use it to distinguish actual findings from random variations in the data
- know how clustering and topic modeling can be used to gain a quick overview of topics and themes that appear in written texts, and know how to apply these techniques in practice using Python packages
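As a first taste of the "simple word counts" mentioned in the objectives, here is a minimal sketch that uses only the Python standard library. The function name word_counts, the very rough tokenization rule, and the sample sentence are assumptions made for this illustration; in class we will use more careful preprocessing tools.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split it into rough word tokens, and count them."""
    # Very rough tokenization: runs of letters and apostrophes count as words
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Invented sample text for illustration
sample = "The cat sat on the mat. The dog sat too."
counts = word_counts(sample)

# Show the three most frequent words with their counts
print(counts.most_common(3))
```

The same function can be applied to a whole document read from a file, and the resulting counts can be compared across texts or authors.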

Course project

Course project requirements:

Initial project description:
This is a 1-2 page document (single-spaced, single column) that describes what your project will be about. Each team only needs to submit one project description; just list all team members' names on it.
The document needs to contain the following information:

- Research questions: What are the main questions that you want to answer, the main language phenomena you want to address, or the main ideas you want to explore?
- Method: What are the relevant words, multi-word expressions, or constructions you need to analyze? What descriptive data analyses do you plan to do? Do you plan to do statistical significance tests, and do you know already which ones will be the right ones? (Yes, I know you will not have worked out every detail at this point, but strive to work out as many as you can.)
- Data: It is vital that you figure out as early as possible what data you can use. Is there enough data? Is it freely available? Do you have to contact someone to get it?
- Splitting the work: Who in the team will be doing what?

Intermediate report:
Submit the following document, which should be the same for the whole team: a 1-2 page document (single-spaced, single column) that describes the current status of your project. This is a revised version of your initial project description, and it needs to take into account the feedback you got on the initial description.
- Motivation: Why should readers be interested in your study? Please devote half a page to this.
- Research questions: any changes?
- Method: any changes?
- Status:
  - Describe the data that was obtained: source, size, anything else that is relevant
  - Describe at least two (smaller, and preliminary) concrete results that you have at this point

In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project so far.

Short presentation:
This is a short presentation to the class. You should discuss:

- Research questions/linguistic phenomena/main questions you are addressing
- Motivation: Why is this relevant? (Spend a lot of time on the research questions and their relevance. Describing the big picture is important!)
- Data: source, size (say how many words overall you have)
- Results

You will need to prepare slides for this, which you submit to the instructor ahead of time.

It is okay if you don't have all results in place at this point. This does not lead to points being taken away for the presentation.

Final report:
Submit the following document, which should be the same for the whole team:
A 5-6 page document (single-spaced, single column) that describes the results of your project. This is a revised version of your intermediate project description. It needs to contain the following information:

- Motivation
- Research questions/linguistic phenomena covered/main ideas pursued
- Data: source, size, other relevant statistics
- Method
- Findings

If you build on previous work, you need to discuss it, and give references.
Published papers (at conferences, in journals) go into the references list at the end of the paper. Links to blog posts and the like go in a footnote. Also, links to websites containing data go in a footnote, not in the references list.

You need to take into account the feedback that you got on the Initial project description and Intermediate report.

In addition, each team member submits a short (half page) document describing their individual contribution and reflecting on what they learned in the project.

Course project ideas

Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers:

- How do people with different political affiliations talk about the same topic? Do they use different words? To study this, you can use word association weights, clustering, and topic modeling to identify themes in documents.
- Themes in song lyrics for different genres: To study this, you can use word association weights, clustering, and topic modeling to identify themes in documents.
- Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
- Language and ratings: What kinds of words are being used to describe, for example, cheap versus expensive wines?
- Some ideas from Language Log's breakfast experiments:
  - Which words are used to describe white and black NFL prospects? Links here, here (data for download in the 2nd link)
  - State of the Union: what are signature words of Obama, of earlier presidents? (And why?)
  - The statistics of real estate listings: linking real estate price to the language in the descriptions
  - Degrees of plurality
  - Contrasting "almost" and "nearly": discussed here, here, here, and here
- Noah Smith has a few nice datasets to analyze:
  - Movie corpus: predicting movie revenue from review texts
  - Congressional bill corpus: predicting whether a bill will survive from the text in the bill
  - Corporate reports corpus: predicting how well a company will do from the annual reports that it issues
and further datasets...

Please discuss your topic with the instructor to make sure that it is both substantial and feasible.

For your course project, you will need to apply statistical analyses yourself. Google Books n-gram charts, while pretty, do not count.

Useful links

List of software we will use in the class

Python and Python packages:

We strongly recommend installing Anaconda, as that includes Python along with all Python packages we need.
- If you install Anaconda, you will have to add gensim. Here is a tutorial on how to add a package to Anaconda: https://docs.anaconda.com/anaconda/navigator/tutorials/manage-packages/#installing-a-package
Alternatively, you can individually install:
- Python: Any version >= 3.4 should be fine.
- Jupyter notebooks: https://jupyter.org/
- pandas: https://pandas.pydata.org/
- numpy: https://numpy.org/
- matplotlib: https://matplotlib.org/
- Statsmodels: https://www.statsmodels.org/stable/index.html
- The Natural Language Toolkit (NLTK).
  Please follow the instructions here to download NLTK data. The "popular" datasets are enough; you don't need all of them.
- gensim
A third alternative: You can also execute your Jupyter notebooks online using Google Colab. The downside of this is that it's a bit of a pain to upload the data that you want to work with.

To test your Python installation, use this Jupyter notebook.
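For a quick check before you open the test notebook, the following sketch reports which of the packages listed above Python can find. It is not a replacement for the test notebook: the helper name check_packages is made up for this example, and the check only looks for the packages without importing them or testing that they work.

```python
import importlib.util
import sys

# Import names of the packages listed above
REQUIRED_PACKAGES = ["pandas", "numpy", "matplotlib", "statsmodels", "nltk", "gensim"]


def check_packages(names):
    """Map each package name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python version:", sys.version.split()[0])
for name, found in check_packages(REQUIRED_PACKAGES).items():
    status = "found" if found else "MISSING"
    print("{:<12} {}".format(name, status))
```

If a package shows up as MISSING, install it (for Anaconda, see the tutorial linked above) and run the check again.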

Using Jupyter notebooks

We'll be using Jupyter notebooks in class. To use a "Code for download" file, download it to your computer. Your computer will probably complain that it doesn't know how to open the file. This is not a problem; just ignore it. You then have several options for opening the file:
(1) If you have Anaconda on your computer, you can open the file with Jupyter Notebook.
(2) If you have Anaconda, or another Python installation, on your system, you can open a terminal, go to the directory with the notebook, and type the command jupyter notebook or the command python -m jupyterlab.
(3) If you have a Google Colab account, you can open the file online in Colab by selecting "Upload" from the left-hand side menu.

For info on how to format text and write code in Jupyter notebooks, see this Jupyter notebook.

Learning Python

Learning Python:

- How To Think Like A Computer Scientist is a very good and accessible online Python textbook
- A Python tutorial

General Python pages:

- The Python library documentation

The Natural Language Toolkit:

- The Natural Language Toolkit.
- NLTK documentation.

Fun with statistics

Language Log: a language and linguistics blog written by Mark Liberman and others
xkcd: A webcomic of romance, sarcasm, math, and language.
