Schedule

Fall 2022 Schedule

The schedule for the Fall 2022 (F22) semester is broken down week by week below. Registered students will receive a link via email for access. Anyone interested in the course but not yet enrolled through Albert can contact the instructor directly.

Materials:

The course learning materials consist of online articles and tutorials, interdisciplinary writing from the blogosphere, videos, and digital projects, in addition to traditional academic readings. There are no books to purchase. Students will have access to ebook chapters through NYU Libraries.

For this course you will need to create some accounts and download some software. To create the accounts, you can use your NYU account or a "burner" account made for the class. We will use AntConc, AntCorGen, and Sublime Text (or another text editor), as well as a number of web-based tools. You will also be given access to RStudio Cloud within a few weeks. If you are familiar with RStudio and already have it installed, you can use it on your own machine. For beginners, I recommend the cloud-based version. All of these resources are free of charge.

Part I : Digital Textual Analysis - Thinking with Corpora

Week 1 (30 Aug, 1 Sept) Beginnings

Introduction, Reviewing course components & syllabus, creating your own Google Site

Materials: Speed Reading (Tonight Show) (6 mins) | How to Speed Read (Ferriss) (9 mins)| "How Many Books Will You Read Before You Die?" | "The beginning of silent reading"

Questions: How do you read? How do people read in a country where you have lived? What might it mean to "read like a computer"? What is the tension in reading between speed and care? How much information do you capture when you read?

Please register the address of your site here.

Google Sites: video guidance here and here.

Week 2 (6 Sept, 8 Sept) Reading, Fast/Slow, Close/Distant

Can Computers Read? Can Computers Help Us Read?

Read: Can Computers Read (Literature)? (Kestemont / Herman) | What is Fan Fiction? | Harry Potter fandom | The Pitfalls of Using Google Ngram

Listen: Distant Reading (a conversation with Ama Bemma Adwetewa-Badu) - 14 mins.

Explore: Google n-gram vs. BookWorm (extra: tutorial for HT+Bookworm and "My Secret Editing Weapon")

Download: Laurence Anthony’s AntConc (Thurs)

Search: Look for a tutorial about AntConc on Youtube, try to learn how to do something with it and show us on Thursday.

In-Class: Harry Potter fan fiction with AntConc (source: AOOO) (Thursday). I have placed a selection from the 390K+ fan fiction works in the course drive.

Response 1 (on your site) - see rubric and guidelines (can be completed anytime between 9 and 20 September). Using AntConc 4.1, explore the Harry Potter fan fiction texts we worked with. What are you able to observe by looking at a list of words? How do the word lists differ from text to text? How can you use the concordance function, as well as clusters / n-grams and collocations, to gain insight into the different texts? Are there certain expressions which begin in Rowling and are popularized by fan fiction? What things seem to come about only in fan fiction? Also, feel free to look at AOOO to build your own HP fan fiction corpus using specific search terms of your choice.

If you would like to try a corpus other than Harry Potter, try the corpora of African-American or colonial South Asian literature assembled by Amardeep Singh. Alternatively, you can also use Jonathan Reeve's corpus of English fiction (~250 novels of the Corpus of English Novels with the Txtlab corpus of English novels).

Be sure to include illustrative visuals (screenshots) in your blog which allow your reader to follow your investigative process about a specific question of interest to you.
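To make the terms in Response 1 concrete, here is a minimal sketch of what a concordance (keyword in context) view and an n-gram list compute. It is an illustration in plain Python, not AntConc's own implementation; the function names and the sample sentence are hypothetical, and in the course itself these steps are done through the AntConc interface.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return keyword-in-context lines with `window` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text, standing in for a file from the corpus.
sample = "The boy who lived looked at the wand. The wand chose the boy."
tokens = tokenize(sample)

for line in kwic(tokens, "wand"):
    print(line)
print(ngrams(tokens, 2).most_common(3))
```

Reading the context lines side by side is the "close" half of the exercise; the n-gram counts are the "distant" half. Note that this sketch lets n-grams run across sentence boundaries, which a real tool may or may not do depending on its settings.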

Week 3 (13 Sept, 15 Sept) Reading All of Anything

Corpora | Text Mining

Skim: Introduction to Text Mining | About open access journals | About Plos One Open Access Journals | About a paywall | AntCorGen tutorial

Read: Building the Invisible College (ch4, Crymble)

Watch: How can we make our own corpus? (10 mins) | AntCorGen - Getting Started (6 mins) | Finding collocations (8 mins) | Finding clusters (6 mins) | AntConc 4.0 tutorial 1 / tutorial 2 / tutorial 3 / tutorial 4 / tutorial 5 / tutorial 6 / tutorial 7 / tutorial 8 / tutorial 9

Download: Laurence Anthony's AntCorGen (1.2)

In-class: Reading a corpus of Harry Potter fan fiction with AntConc (Tues) | Reading a corpus of specialized academic language with AntCorGen (Thurs). The shared Drive contains a number of downloaded articles in seven different specialized fields of knowledge (AI Support Vector Systems, Citizen Science, Computer Vision, Neurolinguistics, Paleoecology, Refugee Studies, Synthetic Biology).

Discuss: Thinking about Crymble's notion of the "invisible college," what would you like to learn how to do that university is not currently teaching you? (Tues)

Week 4 (20 Sept, 22 Sept) Voyant and Project Gutenberg

Presentations | Voyant

Prepare: Mini-presentation - 22 Sept (5 mins maximum, in pairs, sign up here) - Choose a specialized field of knowledge and prepare a small analysis of it using AntCorGen. Use a maximum of 3 slides in Google Drive so that you can present whatever form of delivery we are in. (Thurs)

Watch: Introduction to Text Mining with Voyant Tools (23 mins)| Reading all of Jane Austen with Voyant Tools (11 mins) | About Project Gutenberg (6 mins)

Skim: About Jane Austen | What is Project Gutenberg? PG Mission Statement| Project Gutenberg Blocks Access in Germany

In-class: Reading a book from Project Gutenberg with Voyant | Reading Harry Potter & HP FF with Voyant / embedding a visual into your site | (Tues)

Assignment 1 (due 7-27 October) - see the rubrics and guidelines. Use one of the corpora already downloaded with AntCorGen, or follow the steps described in the video above to download AntCorGen and build your own. Use AntConc to analyze the corpus. What are the most common words? Specialized words? Clusters? Collocations? Are there groups of articles with common trends? Did your results match your hypotheses? Feel free to use AntConc or Voyant Tools for your analysis, but be aware that Voyant is more difficult to use with many documents at a time.

Week 5 (27 Sept | 29 Sept) Text Analysis with R

RStudio Cloud | Reading Corpora from Project Gutenberg with R

Sign up: get your free account in RStudio.Cloud. Once you have your account, please submit the email address and name that you used for it so that I can add you to our class space. Use the form here. Sign up for RStudio Cloud even if you would like to use an instance of R on your own machine.

Skim: About Qur'an | The Watsons | Pride and Prejudice | Tidy Data (Wickham)

Watch: Part I and Part II (15 mins) | Intro to RStudio Cloud (6 mins) | Introduction to R and Tidyverse (12 mins) - once you have an RStudio Cloud account, you can do everything this video teaches by creating a "project" | Two short videos I made about using RStudio Cloud: Part 1 (11 mins) and Part 2 (12 mins)

Notebooks : Reading Jane Austen with R | Reading Qur'an with R

A word cloud of the Sahih English meanings of Qur'an with stopwords removed (quRan package)

Week 6 (4 Oct | 6 Oct) More Text Analysis with R - most distinctive words; Wordclouds from Project Gutenberg

This week we will use a digital library, Project Gutenberg, to look at prolific authors of children's literature.

Read: Children's literature; Who Was Mary Hazelton Blanchard Wade?; Manifest Destiny's Child: Mary Hazelton Blanchard Wade and the Literature of American Empire (Tunc)

Notebooks: My Little Cousin with GutenbergR

Making a Wordcloud from any text in Project Gutenberg

Discussion: What are the benefits of using a library such as Project Gutenberg? the disadvantages? What happens when code goes wrong? What are the policies of CRAN? How do we mitigate the potential problems when working with open source software? What are other domains of prolific popular literature we can access in Project Gutenberg?

Week 7 (11 Oct | 13 Oct) Sentiment and Wrap up

Sentiment analysis, sometimes called opinion mining, attempts to extract affective or subjective information from texts. The kind we will look at here is a fairly simple one: the automated extraction, classification and interpretation of sentiment from texts using some techniques in R. It is one of the ways we might say that we can “read like a computer.” Sentiment can also be derived from image or even biometric data.

We will look at sentiment using a hand-curated list of words that are considered to be negative or positive, called a lexicon. The tidytext package that we have used comes pre-packaged with three different lexicons, described briefly here.
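As a rough illustration of the lexicon approach, the sketch below scores a sentence by summing the values of the words it shares with a lexicon. It is written in Python rather than R, and the six-word lexicon is a hypothetical toy, not one of the tidytext lexicons; the course notebook remains the reference for the actual workflow.

```python
import re

# Hypothetical toy lexicon for illustration only: +1 is positive, -1 is negative.
# Real lexicons contain thousands of hand-curated entries.
LEXICON = {
    "happy": 1, "delight": 1, "love": 1,
    "miserable": -1, "angry": -1, "grief": -1,
}

def sentiment_score(text, lexicon=LEXICON):
    """Return (net score, number of lexicon words found) for a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    matched = [lexicon[t] for t in tokens if t in lexicon]
    return sum(matched), len(matched)

score, hits = sentiment_score("She was happy, though he remained angry and miserable.")
print(score, hits)
```

Words that are not in the lexicon contribute nothing, and negation ("not happy") is ignored entirely, which is one reason this kind of reading is called a somewhat simple one.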

Read: Sentiment analysis in ecommerce

Explore: Want to know if the languages you know have a sentiment lexicon? Check out this dataset at Kaggle: Sentiment Analysis in 81 Languages

Notebook: Detecting Sentiment with lexicons (Austen and Gutenberg) (Thurs)

Watch: Data Lit (sentiment analysis) (starting 1:00) (4 mins) | How to See Sentiment on Twitter (5 mins)

Response 2 (due 10 Oct-10 Nov) Use the Detecting Sentiment or the My Little Cousin notebook with a corpus of your choice to discuss either sentiment in one text or the most distinctive words in the texts of your choice. There is code if you choose to use texts from Project Gutenberg and additional code if you want to use your own txt files. NB: you will have to rename variables throughout the code depending on what you choose!

Wrap up of first half of term (Thurs)

Break!

Part II : Digital Text Analysis (+ Creating Digital Text) - Project-Based Learning

Week 8 (25 Oct | 27 Oct) OCR and HTR

Building a Corpus from Print or Handwritten Documents

Skim: Working with Batches of PDF files (Mähr) | Intro to PowerShell (Windows) and Bash Command (macOS) | documentation for ocrmypdf | A (brief!) introduction to OCR in Digital Humanities | Automatic Transcription of BnF ms fr 24428 with Transkribus | The eScriptorium VRE for Manuscript Cultures

Sample texts:

Computers and Automation 1970 | Toad Computers 1996 | Games Computers Play Manual 1986 | System of Water Supply (Bahrain, 1936) | Agreement relating to the Abu Dhabi Oil Concession (1940) | Richard Helms departs Beirut (CIA) (1974) | Syrian Protestant College 1896 | Shipping and congestion at Stores in Basrah (1916) | The India-Pakistan Border (CIA) (1966)

Watch: Transkribus makes breakthrough in understanding medieval texts (Euronews) | Transkribus in 10 Steps | A (brief!) Introduction to OCR in the Digital Humanities

Explore: NewsEye | Viral Texts Project (two projects which use quite messy OCR'd data) | transkribus.ai (drop in a sample of handwriting)

Optional Download: Tesseract (instructions in Mähr for macOS or Windows). If you are unsure about the command line or encounter issues, we can work on this slowly in class or in office hours.

Class demos: Tesseract (Tues), Transkribus (Thurs)

Discussion with Suphan Kirmizialtin (NYUAD) on Handwritten Text Recognition (HTR) (Thurs)

Response 3 : see rubric (due 25 Oct-10 Nov) Imagine that you would like to build a corpus from typewritten PDFs that you have been given or can find easily with library resources. What would the subject of your corpus be? How would you leverage the techniques you have learned thus far this semester? Use databases, archive.org or another digital library to identify about 10 PDFs you would like to work with and explain why you chose them. If a text layer is available, try to copy and paste it into a text editor to see how accurate it is. (This can also be done using Acrobat Pro, if you have it.) Do you think the OCR quality will make a difference for the kind of analysis you carry out?


Week 9 (1 Nov | 3 Nov) Converting Recorded Video into Text

Building a corpus from videos with Stream | Watching some of Fall 2020-Spring 2021 Rooftop Rhythms

Skim: What are speech-to-text algorithms? | Accent bias is an unchecked sign of racism in the workplace | If we all end up sounding like Americans, you can probably blame voice assistants (Olyeinka)

Watch: Voice Recognition Elevator in Scotland | Watch one of the sessions of Rooftop Rhythms (full list here and transcriptions here).

Skim listen: The Late Wire EP 5 (interview with Raffy Akinwande) – Nigerian social podcast | Iraq Matters#30: (interview with Moussa AlNasari) Remembering Mutanabbi Street 10 Years Later | SG Explained (Willy, Elliot, Rovik) Talking about Racism – Singaporean “regular guys” podcast | Scotland Outdoors, Mark and Euan Visit the Mysterious Goblin Ha’ – BBC Radio Scottish nature podcast | Chini Ya Maji podcast (interview with Don Okoth) – Kenyan podcast on startup culture | Cornish Soccer Talking Football (interview with Andy Watkins) – football podcast from the SW United Kingdom | AWR Colloquial English Sudan – a Christian podcast from Sudan

Discussion: What does accent bias have to do with AI/STT? Which of the podcasts above would STT do best with? What potential issues will we have with STT and Rooftop Rhythms?

Notebook: Word Vectors for RR (two explanations here and here) (Thurs)

Week 10 (8 Nov | 10 Nov) Word Embeddings 1

Word Embeddings, Know a Word by the Company it Keeps...

Skim: An Introduction to Word Vectors (WWP); Word embeddings quantify 100 years of gender and ethnic stereotypes (Garg et al)

Discussion:

Notebook: Using some pre-computed vector spaces with three different corpora, we will explore word embeddings using the Word2Vectors package in R. The corpora we will look at include STT-generated transcriptions of recorded episodes of Rooftop Rhythms, 10 volumes of colonial correspondence from the 19th-century Arabian Gulf, and an expanded science fiction corpus.

Assignment 2: Using the notebook and the pre-computed vector space models, carry out a basic analysis of your choice of the provided corpora (Rooftop Rhythms, Little Cousin series, 1950s sci-fi, AntCorGen disciplinary corpora, HTR-created Gulf corpus, etc). Which clusters are most interesting? Most surprising? What do they tell you about a part of your corpus? Do some vector arithmetic to go deeper with a handful of words. Your final project will include some of the same corpora, but you will be asked to work on a different one--figure that into your choice of corpus at this point. Please do not choose the same corpus as Assignment 1--you will not receive full credit. Due date TBA
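For anyone unsure what "vector arithmetic" means in Assignment 2, the following is a minimal sketch in Python. The five 3-dimensional vectors are hypothetical values invented for the example, not taken from any of the course models (trained models typically have hundreds of dimensions), and the `analogy` helper is an illustrative name rather than a function from the R package used in the notebook.

```python
import math

# Hypothetical toy vectors, chosen by hand so the example is easy to follow.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "poem":  [0.5, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors=VECTORS):
    """Add the `positive` vectors, subtract the `negative` ones,
    and return the closest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)

# "king" - "man" + "woman": which remaining word is nearest?
print(analogy(["king", "woman"], ["man"]))
```

With a real model the interesting part is not the textbook analogy but what comes back when you try the same arithmetic on words that matter in your own corpus.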

Week 11 (15 Nov | 17 Nov) Word Embeddings 2

Skim: How to identify hot topics in psychology using topic modeling (Bittermann and Fischer); Application of Topic Modeling to Tweets to Learn Insights on the African American Lived Experience of COVID-19 (Odlum et al); Gendered Language on the Economics Job Market Rumors Forum (Wu); A Cross-Verified Database of Notable People, 3500BC-2018AD (Laouenan et al); Impact of COVID 19 on Indian Migrant Workers: Decoding Twitter Data by Text Mining (Misra & Gupta)

Guest speaker: Minda Belete, NYUAD (Tues)

Demo in class: Word embeddings with a corpus from psychology collected with AntCorGen; topic modeling RR (Thurs)

Discussion: How do the results of our word embeddings experiment compare with the discussion of Bittermann and Fischer? What are fields you would want to use word embeddings to understand better? How would you build the corpus?

Week 12 (22 Nov | 24 Nov) Stylometry

Attend a virtual conference, "Building Digital Humanities" (Australasian DH), or do asynchronous project work (Tues). Signup information forthcoming.

Read:

Classification by style (Thurs) - LAB

Week 13 (29 Nov) LAB (Thursday off, National Day)

Final project work - LAB (Tues)

Link to demo of stylometry

Week 14 (6 Dec | 8 Dec) - LAB / final presentations

Final project work (Tues) - LAB

Final presentations (Thurs)

This is the opportunity for groups to present their almost completed work in short 10-minute presentations. These presentations are ungraded, but they provide an opportunity to receive feedback before finalizing the written work.

Response 4: Go back to the original podcast featuring Adwetewa-Badu and evaluate what you have learned this semester.

Week 15 (12 Dec) - final presentations

Today's class will be a wrap-up, featuring the second half of the group presentations.