UA HLT - LING 593 Internship - (Ahn) Michael Howell
American Speech ML/NLP Internship, SUMMER/FALL 2022
with the support of Drs. Sonja Lanehart and Gus Hahn-Powell, and Ayesha Malik (annotation) (University of Arizona)
To develop a machine-learning classifier which, trained on academic journal articles, could learn to identify the presence of certain sociologically and politically significant thematic categories, mainly those representing racial/ethnolinguistic or otherwise marginalised groups:
The 13 categories, primarily defined by Dr. Lanehart, were as follows (a minimal encoding sketch follows the list):
UNK - article theme unclear
CAT_0 - article theme not matching any other category
CAT_1 - African-American Language / AAVL
CAT_2 - African Americans
CAT_2_1 - African Diaspora
CAT_3 - Mexican Americans & Latinx Peoples
CAT_3_1 - Latinx / Hispanic Diaspora
CAT_4 - Native Americans
CAT_4_1 - Indigenous Peoples (World)
CAT_5 - Asian Americans / Pacific Islanders
CAT_5_1 - Asian Diaspora
CAT_6 - Women's Language
CAT_7 - LGBTQ Speech
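As an illustration, below is a minimal sketch (not the project's actual code) of how these 13 labels could be turned into a binary indicator matrix for multi-label training; the use of scikit-learn's MultiLabelBinarizer here is an assumption, not necessarily what the repository does.

    from sklearn.preprocessing import MultiLabelBinarizer

    # The 13 annotation categories listed above
    CATEGORIES = [
        "UNK", "CAT_0", "CAT_1", "CAT_2", "CAT_2_1", "CAT_3", "CAT_3_1",
        "CAT_4", "CAT_4_1", "CAT_5", "CAT_5_1", "CAT_6", "CAT_7",
    ]

    # Each article is annotated with zero or more categories; the binarizer
    # turns those per-article label sets into an (n_articles, 13) 0/1 matrix
    mlb = MultiLabelBinarizer(classes=CATEGORIES)
    example_annotations = [["CAT_1", "CAT_2"], ["CAT_6"], ["CAT_0"]]
    y = mlb.fit_transform(example_annotations)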
The model should be able to predict which categories are present in a newly presented / unseen journal article, for the purposes of:
(1) facilitating future annotation, speeding up the previously manual process,
(2) research insights (such as identifying historical trends in marginalised authorship or in-group / out-group perspectives on language groups), and
(3) potentially being used by academic journal systems/databases to assist users searching for these themes
The above-listed 13 thematic categories
272 articles from the American Speech journal dataset (manually annotated by Dr. Lanehart and assistant Ayesha Malik), selected because they contain information relevant to the above categories; these were organised in a Google Sheet / XLSX
We compiled all complete, verified rows (article metadata) into a single sheet containing no partial data or negative article instances
Once downloaded, each article PDF's filename was derived from its academic-database identifier (for example, 10-2307_454860.pdf, a JSTOR ID)
Each article's raw text (minimally preprocessed), paired with its annotations for the 13 categories (see the loading sketch below)
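The loading step below is a hypothetical sketch: the spreadsheet filename, the article_id column, the articles_txt directory, and the 0/1 label columns are assumptions for illustration, not the real spreadsheet schema.

    from pathlib import Path
    import pandas as pd

    sheet = pd.read_excel("annotations.xlsx")                          # assumed filename
    label_cols = [c for c in sheet.columns if c == "UNK" or c.startswith("CAT_")]

    texts, labels = [], []
    for _, row in sheet.iterrows():
        # Filenames are keyed by the database identifier, e.g. 10-2307_454860
        txt_path = Path("articles_txt") / f"{row['article_id']}.txt"   # assumed column/folder
        if txt_path.exists():
            texts.append(txt_path.read_text(encoding="utf-8"))
            labels.append([c for c in label_cols if row[c] == 1])      # assumed 0/1 annotation columns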
Some preprocessing currently applied (a sketch follows this list):
Lowercasing
Removing citation references (Regex): for example, (Person 1900), (Person et al. 2000), Person (2011)
Removing excess white space
Collapsing numbers to NUM
Removing non-period punctuation
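For concreteness, here is a sketch of those steps; the regex patterns are illustrative rather than the exact ones in the repository, and citations are stripped before lowercasing so the capitalised-name pattern still applies.

    import re

    # Matches (Person 1900), (Person et al. 2000), and Person (2011)
    CITATION_RE = re.compile(
        r"\(\s*[A-Z][A-Za-z-]+(?:\s+et al\.)?\s+\d{4}[a-z]?\s*\)"
        r"|[A-Z][A-Za-z-]+\s+\(\d{4}[a-z]?\)"
    )

    def preprocess(text: str) -> str:
        text = CITATION_RE.sub(" ", text)         # remove citation references
        text = text.lower()                       # lowercase
        text = re.sub(r"\d+", "NUM", text)        # collapse numbers to NUM
        text = re.sub(r"[^\w\s.]", " ", text)     # remove non-period punctuation
        text = re.sub(r"\s+", " ", text).strip()  # collapse excess whitespace
        return text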
Which categories?
A first consideration was whether the classifier should initially aim to predict every category (including the negative categories UNK and CAT_0) or only the positive categories
We initially decided to focus only on positive instances (articles where a theme was certainly present) from what Dr. Lanehart and Ayesha had annotated
However, once the data was collected, its distribution (visible under Key Insights and Limitations) suggested there may not yet be enough positive instances to give reliable predictions (without further techniques such as k-fold cross-validation)
Which tools to build with?
We also considered using OpenAI's GPT or other LLMs to build the classification system (while reflecting on data-use licenses), as they already seem very capable and increasingly apt for the task, which would reduce build time and likely improve results considerably
Additionally, I have been building with that toolset professionally in recent months, so I am familiar with it
Overall, I chose to pursue a more custom solution (using Python and its libraries) to explore building a model more from the ground up, with the goal of further developing my ML skills
This project is written in Python. The most recent version of my code is visible here (Github): https://github.com/michaelinwords/ua-americanspeech
There are also performance examples and other information included in the Github repo
An overview of latest performance:
This training was performed on 218 (80%) of the 273 available articles, using a TF-IDF vectoriser
Though the report below specifies 5 k-fold splits, only one fold was used for this classification report (the Stratified K-Fold loop is not fully implemented)
The subset_accuracy is very low, though this is a harsh metric (an "exact match ratio" or "zero-one loss": a sample counts as correct only if its predicted labels exactly match the actual labels), while the per_label_accuracy is somewhat high
It is clear that support is a major issue for most of the categories (thus the need for more cross-validation as well as more articles generally, but especially for underrepresented categories)
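The sketch below shows one way such a run could be set up, assuming TF-IDF features feeding a one-vs-rest logistic regression; the repository's actual estimator, parameters, and split logic may differ.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline

    # texts: preprocessed article strings; y: (n_articles, 13) binary label matrix
    X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42)

    pipeline = make_pipeline(
        TfidfVectorizer(max_features=20000),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    subset_accuracy = accuracy_score(y_test, y_pred)   # exact-match ratio: harsh for multi-label
    per_label_accuracy = (y_pred == y_test).mean()     # average over every (article, label) cell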
To reach a next level of functionality / utility, the project needs:
Design: perhaps temporarily classifying only for the largest category, then introducing smaller categories (especially after more journal data is annotated/available)
Data: more annotated data, to leave fewer gaps in the category distribution
Data: further preprocessing of each article's raw text (for example, removing header/footer info from the journal, as well as potentially author data)
Code: to implement the Stratified K-Fold and other previously in-progress approaches to bolster performance
Code: testing with different parameters to identify what functions best for this task
Code: strong documentation
Code: to be cleared of redundant lines (some were left over from my different attempts)
Code: the other modes (final-train mode and predict mode) need to be implemented:
Final-train mode: this would involve the usual training steps, but none of the data would be kept out of training (none reserved for testing); this mode would be used to prepare a version of the model to save, which would be used for predictions later
Predict mode: this would involve loading the previously trained model/its weights (using the joblib library), reading in an XLSX of newly presented journal articles (similar to the current setup, but without training on them), then predicting for those articles and saving the output to a prediction XLSX (a rough sketch follows)
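The sketch below illustrates that planned flow; the model filename, the raw_text column, and the output filename are placeholders, not the project's final interface.

    import joblib
    import pandas as pd

    pipeline = joblib.load("americanspeech_model.joblib")        # model saved by final-train mode
    new_sheet = pd.read_excel("new_articles.xlsx")                # newly presented articles
    new_texts = new_sheet["raw_text"].fillna("").tolist()         # assumes a raw_text column

    pred_matrix = pipeline.predict(new_texts)                     # one 0/1 column per category
    pred_cols = [f"pred_{i}" for i in range(pred_matrix.shape[1])]  # placeholder column names

    output = pd.concat([new_sheet, pd.DataFrame(pred_matrix, columns=pred_cols)], axis=1)
    output.to_excel("predictions.xlsx", index=False)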