I've created the following forced alignment dictionaries, acoustic models, and G2P models:
Catalan MFA Bundle
The bundle includes a Catalan Pronunciation Dictionary, Acoustic Model, and G2P Model for the Montreal Forced Aligner (McAuliffe et al. 2017). It also includes phone-level alignments of the training data, for easy acoustic analysis!
Trained on 90 hours of clean training data from the ParlamentParla Speech Corpus
The dictionary was generated using the WikiPron Web Scraper Catalan Broad Transcription dataset (Lee et al. 2020) and expanded using the G2P model to generate pronunciations of additional out-of-vocabulary tokens from the ParlamentParla Corpus.
The dictionary contains pronunciations for 155,595 words and can easily be expanded using the G2P model. Two versions of the dictionary exist, one with word probabilities and one without, for easy customization.
I've written the following guides and programs to help with the data processing pipeline:
Density-controlled 3D Vowel Space Volumetric Calculator:
This R script is based on Story & Bunton's (2017) Vowel Space Area workflow and was inspired by Annie Helms's 2D calculator program, in which the area of the F1-by-F2 plane is calculated from the convex hull of a density-controlled plot. In this R script, I extend vowel space density into the third dimension! The script calculates vowel space volumes from F1, F2, and F3 and graphs three-dimensional convex hulls in RStudio.
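To give a rough idea of what a vowel space volume is, here is a minimal Python sketch of a convex hull volume over (F1, F2, F3) points. It is not the R script itself: it skips the density-control step entirely, the function name `hull_volume` and the formant values are made up for illustration, and the brute-force face search assumes no four hull points lie in the same plane.

```python
from itertools import combinations

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def hull_volume(points, eps=1e-9):
    """Convex hull volume of 3D points (hypothetical helper, brute force).

    Assumes general position: no four hull points are coplanar.
    """
    n = len(points)
    # The centroid of the points always lies inside their convex hull.
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    volume = 0.0
    for a, b, c in combinations(points, 3):
        normal = _cross(_sub(b, a), _sub(c, a))
        if _dot(normal, normal) < eps:
            continue  # the three points are collinear, so no plane
        sides = [_dot(normal, _sub(p, a)) for p in points]
        # A triangle is a hull face if every point lies on one side of it.
        if all(s <= eps for s in sides) or all(s >= -eps for s in sides):
            # Add the tetrahedron spanned by this face and the centroid.
            volume += abs(_dot(normal, _sub(centroid, a))) / 6.0
    return volume

# Made-up (F1, F2, F3) values in Hz, for illustration only.
tokens = [(310, 2250, 3010), (720, 1240, 2480), (340, 880, 2390),
          (510, 1830, 2620), (480, 1490, 2870), (450, 1500, 2600)]
print(hull_volume(tokens))  # volume in cubic Hz
```

The brute-force search is only practical for small point sets; a real analysis would use a dedicated convex hull routine.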
View the Density-Controlled Vowel Space Volume Calculator here
Onset of Nasalization:
This Praat script takes a phoneme-by-phoneme aligned TextGrid file and scans it for every instance of a /VN/ sequence. The script then analyzes a variety of phonetic cues for nasalization to predict when anticipatory vowel nasalization begins. It can analyze multiple files at once.
It's particularly useful if you want to analyze variation or patterns in vowel nasalization but don't have articulatory data from a nasometer or velotrace. No speaker-specific models need to be trained.
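The /VN/ scanning step can be pictured with a short Python sketch. This is not the Praat script: the interval format `(start, end, label)`, the function name `find_vn_sequences`, and the vowel and nasal label sets are all assumptions for illustration, and the nasalization cue analysis itself is not shown.

```python
# Hypothetical label sets; a real phone tier may use different symbols.
VOWELS = {"a", "e", "i", "o", "u"}
NASALS = {"m", "n", "ɲ", "ŋ"}

def find_vn_sequences(intervals):
    """Return (vowel, nasal) pairs of adjacent intervals.

    `intervals` is a list of (start, end, label) tuples in time order,
    as they might be read from a phone-level tier.
    """
    hits = []
    for current, following in zip(intervals, intervals[1:]):
        if current[2] in VOWELS and following[2] in NASALS:
            hits.append((current, following))
    return hits

# Made-up alignment of a short word, for illustration only.
phones = [(0.00, 0.10, "k"), (0.10, 0.25, "a"),
          (0.25, 0.33, "n"), (0.33, 0.40, "t"), (0.40, 0.50, "o")]
for vowel, nasal in find_vn_sequences(phones):
    print(vowel[2], nasal[2], "vowel starts at", vowel[0])
```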
Note: I'm still optimizing and testing the accuracy of the software's predictions. I'm also developing a much more user-friendly application and interface for nasal analysis, which will be released in the coming months. If you have any questions about acoustic analyses of nasal vowels or using this software, please email me and I'll be happy to help.
View the Onset of Nasalization program here
SRT to TextGrid:
This script cleans up silent intervals in .srt subtitle files and converts them into sentence-by-sentence parsed TextGrids. It can analyze multiple files at once.
It's particularly useful if you run your audio files through AI transcribers like Otter.ai or Sonix.ai or are using subtitle files from a movie or TV show and need to convert your transcriptions into a TextGrid. A great first step for preparing to force-align your data!
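As a rough illustration of the conversion idea, the Python sketch below parses subtitle blocks and fills the silent gaps between them with empty intervals, which is the shape an interval tier needs. It is not the script itself: the function names are hypothetical, writing the actual TextGrid file is left out, and it assumes well-formed, non-overlapping subtitles with millisecond timestamps.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:02,500 to seconds."""
    hours, minutes, seconds, millis = TIMESTAMP.match(stamp.strip()).groups()
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

def srt_to_intervals(srt_text, total_duration):
    """Return (start, end, text) tuples covering 0..total_duration.

    Stretches with no subtitle become intervals with an empty label.
    """
    intervals = []
    cursor = 0.0
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not index / times / text
        start_raw, end_raw = lines[1].split("-->")
        start, end = to_seconds(start_raw), to_seconds(end_raw)
        if start > cursor:
            intervals.append((cursor, start, ""))  # silent gap
        intervals.append((start, end, " ".join(lines[2:])))
        cursor = end
    if cursor < total_duration:
        intervals.append((cursor, total_duration, ""))  # trailing silence
    return intervals

# Made-up subtitle text, for illustration only.
example = """1
00:00:01,000 --> 00:00:02,500
Hola

2
00:00:03,000 --> 00:00:04,000
Bon dia
a tothom
"""
for interval in srt_to_intervals(example, 5.0):
    print(interval)
```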
View the SRT to TextGrid converter here
View a corpus processing workflow that uses the SRT to TextGrid program (and teaches how to do forced alignment!)
R Graphing Tutorial for Linguists:
Learn how to use ggplot2 and graph like a pro! R script with several built-in datasets!
All of the datasets are of (artificially-constructed) linguistic data, to give you a better idea of how to make some awesome graphs.
Praat Mass Analyzer:
This script takes a phoneme-by-phoneme aligned TextGrid file and converts it into .csv spreadsheet data. It can analyze multiple files at once.
It collects Formants (F1-F5), Pitch (F0), Formant Bandwidths (F1-F5), Formant Slope, Harmonicity, Intensity, Intensity Max, Intensity Min, Intensity difference (from Bongiovanni 2015), Preceding Phoneme, Following Phoneme, Phone Duration, Jitter, Center of Gravity, Center of Gravity Standard Deviation, Skewness, Kurtosis.
Great for analyzing several linguistic measurements in bulk. If you'd like to analyze only a certain subset of phoneme types or need to transform your data, the script's output makes spreadsheet filtering and spreadsheet equations quite easy to do.
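To show the kind of spreadsheet the script produces, here is a small Python sketch that turns aligned intervals into CSV rows. It covers only the context and duration columns (Preceding Phoneme, Following Phoneme, Phone Duration); the acoustic measurements need Praat and are not shown. The function names, column names, and sample alignment are assumptions for illustration.

```python
import csv
import io

FIELDS = ["phone", "preceding", "following", "duration"]

def intervals_to_rows(intervals):
    """Build one row per labeled interval from (start, end, label) tuples."""
    labels = [label for _, _, label in intervals]
    rows = []
    for i, (start, end, label) in enumerate(intervals):
        if not label:
            continue  # skip silent or unlabeled intervals
        rows.append({
            "phone": label,
            "preceding": labels[i - 1] if i > 0 else "",
            "following": labels[i + 1] if i < len(labels) - 1 else "",
            "duration": round(end - start, 4),
        })
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up alignment, for illustration only.
phones = [(0.00, 0.10, "k"), (0.10, 0.25, "a"), (0.25, 0.33, "n")]
print(rows_to_csv(intervals_to_rows(phones)))
```

With one row per phone, filtering to a subset of phoneme types is a single spreadsheet filter on the phone column.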
If you'd like help with using the script, please email me at julianvargo@berkeley.edu
View the Praat Mass Analyzer program here
Guide - Processing linguistic corpora in Sonix AI
If you are interested in using sonix.ai as a tool to transcribe large datasets, I've written up a short guide on how to process that data. The guide was originally written for undergraduate researchers at UC Berkeley who corrected transcriptions for the Multilingual Hispanic Speech in California Corpus.
I'm the creator of two downloadable keyboards: the Ladino Meruba Phonetic Keyboard and the Mam Mayan Keyboard.
Guide to download the Mam Mayan Keyboard here
A comprehensive bilingual Spanish/English guide to download the keyboard for Windows and Mac, written by Julian Vargo and Henry Sales Hernandez
Guide to download the Ladino Meruba Keyboard here
A comprehensive bilingual Ladino/English guide to download the keyboard for Windows and Mac, written by Julian Vargo
I'm currently developing a Rashi Ladino keyboard. Please email me to be notified when it's ready to be published.
Feel free to email julianvargo@berkeley.edu if you have any questions about downloading a keyboard.