UA HLT - LING 593 Internship - (Ahn) Michael Howell
American Speech ML/NLP Internship, SUMMER/FALL 2022
with the support of Drs. Sonja Lanehart and Gus Hahn-Powell, and Ayesha Malik (annotation) (University of Arizona)
To develop a machine-learning classifier which, trained on academic journal articles, could learn to identify the presence of certain sociologically and politically significant thematic categories, mainly those representing racial/ethnolinguistic or otherwise marginalised groups:
The 13 categories, primarily defined by Dr. Lanehart, were as follows (a minimal encoding sketch follows the list):
UNK - article theme unclear
CAT_0 - article theme not matching any other category
CAT_1 - African-American Language / AAVL
CAT_2 - African Americans
CAT_2_1 - African Diaspora
CAT_3 - Mexican Americans & Latinx Peoples
CAT_3_1 - Latinx / Hispanic Diaspora
CAT_4 - Native Americans
CAT_4_1 - Indigenous Peoples (World)
CAT_5 - Asian Americans / Pacific Islanders
CAT_5_1 - Asian Diaspora
CAT_6 - Women's Language
CAT_7 - LGBTQ Speech
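As an illustration, below is a minimal sketch (not the project's actual code) of how these 13 labels could be turned into a binary indicator matrix for multi-label training; the use of scikit-learn's MultiLabelBinarizer here is an assumption, not necessarily what the repository does.

    from sklearn.preprocessing import MultiLabelBinarizer

    # The 13 annotation categories listed above
    CATEGORIES = [
        "UNK", "CAT_0", "CAT_1", "CAT_2", "CAT_2_1", "CAT_3", "CAT_3_1",
        "CAT_4", "CAT_4_1", "CAT_5", "CAT_5_1", "CAT_6", "CAT_7",
    ]

    # Each article is annotated with zero or more categories; the binarizer
    # turns those per-article label sets into an (n_articles, 13) 0/1 matrix
    mlb = MultiLabelBinarizer(classes=CATEGORIES)
    example_annotations = [["CAT_1", "CAT_2"], ["CAT_6"], ["CAT_0"]]
    y = mlb.fit_transform(example_annotations)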
The model should be able to predict which categories are present in a newly presented / unseen journal article, for the purposes of:
(1) facilitating future annotation, speeding up the previously manual process,
(2) research insights (such as identifying historical trends in marginalised authorship or in-group / out-group perspectives on language groups), and
(3) potentially being used by academic journal systems/databases to assist users searching for these themes
The above-listed 13 thematic categories
272 articles from the American Speech journal dataset (manually annotated by Dr. Lanehart and assistant Ayesha Malik), selected because they contain information relevant to the above categories; these were organised in a Google Sheet / XLSX
We compiled all complete, verified rows (article metadata) into a single sheet containing no partial data or negative article instances
Once downloaded, each article PDF's filename was derived from its academic-database identifier (for example, 10-2307_454860.pdf, a JSTOR ID)
Each article's raw text (minimally preprocessed), paired with its annotations for the 13 categories (see the loading sketch below)
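The loading step below is a hypothetical sketch: the spreadsheet filename, the article_id column, the articles_txt directory, and the 0/1 label columns are assumptions for illustration, not the real spreadsheet schema.

    from pathlib import Path
    import pandas as pd

    sheet = pd.read_excel("annotations.xlsx")                          # assumed filename
    label_cols = [c for c in sheet.columns if c == "UNK" or c.startswith("CAT_")]

    texts, labels = [], []
    for _, row in sheet.iterrows():
        # Filenames are keyed by the database identifier, e.g. 10-2307_454860
        txt_path = Path("articles_txt") / f"{row['article_id']}.txt"   # assumed column/folder
        if txt_path.exists():
            texts.append(txt_path.read_text(encoding="utf-8"))
            labels.append([c for c in label_cols if row[c] == 1])      # assumed 0/1 annotation columns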
Some preprocessing currently applied (a sketch follows this list):
Lowercasing
Removing citation references (Regex): for example, (Person 1900), (Person et al. 2000), Person (2011)
Removing excess white space
Collapsing numbers to NUM
Removing non-period punctuation
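For concreteness, here is a sketch of those steps; the regex patterns are illustrative rather than the exact ones in the repository, and citations are stripped before lowercasing so the capitalised-name pattern still applies.

    import re

    # Matches (Person 1900), (Person et al. 2000), and Person (2011)
    CITATION_RE = re.compile(
        r"\(\s*[A-Z][A-Za-z-]+(?:\s+et al\.)?\s+\d{4}[a-z]?\s*\)"
        r"|[A-Z][A-Za-z-]+\s+\(\d{4}[a-z]?\)"
    )

    def preprocess(text: str) -> str:
        text = CITATION_RE.sub(" ", text)         # remove citation references
        text = text.lower()                       # lowercase
        text = re.sub(r"\d+", "NUM", text)        # collapse numbers to NUM
        text = re.sub(r"[^\w\s.]", " ", text)     # remove non-period punctuation
        text = re.sub(r"\s+", " ", text).strip()  # collapse excess whitespace
        return text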
Which categories?
A first consideration was whether the classifier should initially aim to predict every category (including the negative categories UNK and CAT_0) or only the positive categories
We initially decided to focus only on positive instances (articles where a theme was certainly present) from what Dr. Lanehart and Ayesha had annotated
However, once the data was collected, its distribution (visible under Key Insights and Limitations) suggested there may not yet be enough positive instances to give reliable predictions (without further techniques such as k-fold cross-validation)
Which tools to build with?
We also considered using OpenAI's GPT or other LLMs to build the classification system (while reflecting on data-use licenses), as they already seem very capable and increasingly apt for the task, which would reduce build time and likely improve results considerably
Additionally, I have been building with that toolset professionally in recent months, so I am familiar with it
Overall, I chose to pursue a more custom solution (using Python and its libraries) to explore building a model more from the ground up, with the goal of further developing my ML skills
This project is written in Python. The most recent version of my code is visible here (Github): https://github.com/michaelinwords/ua-americanspeech
There are also performance examples and other information included in the Github repo
An overview of latest performance:
This training was performed on 218 (80%) of the 273 available articles, using a TF-IDF vectoriser
Though the report below specifies 5 k-fold splits, only one fold was used for this classification report (the Stratified K-Fold loop is not fully implemented)
The subset_accuracy is very low, though this is a harsh metric (an "exact match ratio" or "zero-one loss": a sample counts as correct only if its predicted labels exactly match the actual labels), while the per_label_accuracy is somewhat high
It is clear that support is a major issue for most of the categories (thus the need for more cross-validation as well as more articles generally, but especially for underrepresented categories)
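The sketch below shows one way such a run could be set up, assuming TF-IDF features feeding a one-vs-rest logistic regression; the repository's actual estimator, parameters, and split logic may differ.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline

    # texts: preprocessed article strings; y: (n_articles, 13) binary label matrix
    X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42)

    pipeline = make_pipeline(
        TfidfVectorizer(max_features=20000),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    subset_accuracy = accuracy_score(y_test, y_pred)   # exact-match ratio: harsh for multi-label
    per_label_accuracy = (y_pred == y_test).mean()     # average over every (article, label) cell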
To reach a next level of functionality / utility, the project needs:
Design: perhaps temporarily classifying only for the largest category, then introducing smaller categories (especially after more journal data is annotated/available)
Data: more annotated data, to leave fewer gaps in the category distribution
Data: further preprocessing of each article's raw text (for example, removing header/footer info from the journal, as well as potentially author data)
Code: to implement the Stratified K-Fold and other previously in-progress approaches to bolster performance
Code: testing with different parameters to identify what functions best for this task
Code: strong documentation
Code: to be cleared of redundant lines (some were left over from my different attempts)
Code: the other modes (final-train mode and predict mode) need to be implemented:
Final-train mode: this would involve the usual training steps, but none of the data would be kept out of training (none reserved for testing); this mode would be used to prepare a version of the model to save, which would be used for predictions later
Predict mode: this would involve loading the previously trained model/its weights (using the joblib library), reading in an XLSX of newly presented journal articles (similar to the current setup, but without training on them), then predicting for those articles and saving the output to a prediction XLSX (a rough sketch follows)
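The sketch below illustrates that planned flow; the model filename, the raw_text column, and the output filename are placeholders, not the project's final interface.

    import joblib
    import pandas as pd

    pipeline = joblib.load("americanspeech_model.joblib")        # model saved by final-train mode
    new_sheet = pd.read_excel("new_articles.xlsx")                # newly presented articles
    new_texts = new_sheet["raw_text"].fillna("").tolist()         # assumes a raw_text column

    pred_matrix = pipeline.predict(new_texts)                     # one 0/1 column per category
    pred_cols = [f"pred_{i}" for i in range(pred_matrix.shape[1])]  # placeholder column names

    output = pd.concat([new_sheet, pd.DataFrame(pred_matrix, columns=pred_cols)], axis=1)
    output.to_excel("predictions.xlsx", index=False)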