NLU @ Meta (2020)

Below are the projects I worked on as a contract i18n Linguist at Meta on the Assistant Linguistic Engineering team.

I focused on building efficient internationalization and localization workflows for developing new NLU domains and on improving the accuracy of existing models.

Simultaneous i18n domain development

I piloted a project that developed new domains across multiple languages and dialects simultaneously. The typical domain development timeline begins with English and only later expands to other languages, with long manual translation turnarounds. To shorten that timeline, I relied on machine translation and vendor partner support: I created data creation and annotation guidelines, machine-translated them, and sent them to the language leads at our vendor partner for a manual quality check, which greatly reduced the usual translation time. I then set up data collection and annotation jobs across languages in parallel, relying on our vendor partner to maintain consistency across languages and to curate test sets.
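
As a rough illustration of the guideline-translation step, here is a minimal sketch that batches machine translation across target locales before vendor QC. The mt_translate function, the locale list, and the document structure are hypothetical stand-ins, not the actual internal tooling.

    # Sketch: machine-translate English guideline documents into each target
    # locale, then hand the drafts to vendor language leads for manual QC.
    # `mt_translate` is a hypothetical stand-in for a real MT service.

    TARGET_LOCALES = ["fr_FR", "de_DE", "es_ES", "it_IT"]  # illustrative only

    def mt_translate(text: str, source: str, target: str) -> str:
        # Placeholder: a real implementation would call an MT service here.
        return f"[{source}->{target}] {text}"

    def localize_guidelines(guidelines: dict) -> dict:
        """Map locale -> {doc_id: machine-translated draft}."""
        return {
            locale: {
                doc_id: mt_translate(text, source="en_US", target=locale)
                for doc_id, text in guidelines.items()
            }
            for locale in TARGET_LOCALES
        }

    drafts = localize_guidelines({"annotation_guide": "Label each utterance..."})
    # Each locale's drafts then go to that locale's language lead for QC.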

Localization of existing domains

I ran an experiment to determine whether adding locale-specific data to an existing, healthy domain would improve model accuracy. This involved cleaning up and curating high-, mid-, and low-priority test sets across five English-speaking locales, in a domain already supported for US English. Results indicated that adding locale-specific data does slightly improve model accuracy even in a robustly supported domain. This suggests that collecting and curating locale-specific data during domain development is good practice, but also that a domain that is healthy for US English is likely healthy in other English-speaking locales as well.
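
A minimal sketch of how such a comparison can be scored, assuming each test example carries a locale, a priority tier, a gold intent, and a model prediction (the field names here are hypothetical):

    # Sketch: accuracy broken down by (locale, priority tier), so a baseline
    # US-English model can be compared against one trained with added
    # locale-specific data on the same curated test sets.
    from collections import defaultdict

    def tiered_accuracy(examples):
        totals, correct = defaultdict(int), defaultdict(int)
        for ex in examples:
            key = (ex["locale"], ex["priority"])  # e.g. ("en_GB", "high")
            totals[key] += 1
            correct[key] += int(ex["gold"] == ex["pred"])
        return {key: correct[key] / totals[key] for key in totals}

    # Run once with the baseline model's predictions and once with the
    # locale-augmented model's, then compare the per-tier deltas.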

Improving NLU model robustness

I identified incorrect annotations and gaps in training data to improve NLU model accuracy in English and French domains. Combing through error analyses for newly annotated and modeled datasets, I identified two main classes of annotation error: semantic mislabeling of utterances, and the incorrect inclusion or exclusion of particular tokens. Using data transformers, I batch re-annotated the affected utterances. I then used data templating to generate training data for utterance types that appeared in test sets but were not reflected in the existing training utterances.
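
The sketch below illustrates both fixes under stated assumptions: the intent names, slot structure, and templates are invented for the example, and the actual data transformers and templating tools were internal.

    import itertools
    import string

    # 1) Batch re-annotation: rule-based transformers covering the two
    #    error classes surfaced by the error analyses.
    def fix_semantic_mislabel(example):
        # Hypothetical rule: playlist requests were mislabeled as songs.
        if example["intent"] == "music/play_song" and "playlist" in example["text"]:
            example["intent"] = "music/play_playlist"
        return example

    def fix_slot_tokens(example):
        # Hypothetical rule: a leading article was wrongly included in slot spans.
        for slot in example.get("slots", []):
            if slot["tokens"] and slot["tokens"][0].lower() == "the":
                slot["tokens"] = slot["tokens"][1:]
        return example

    # 2) Data templating: generate training utterances for patterns that
    #    appeared in test sets but not in existing training data.
    TEMPLATES = ["wake me up at {time}", "set an alarm for {time} on {day}"]
    SLOT_VALUES = {"time": ["7 am", "noon"], "day": ["Monday", "Friday"]}

    def expand(template):
        fields = [f for _, f, _, _ in string.Formatter().parse(template) if f]
        for combo in itertools.product(*(SLOT_VALUES[f] for f in fields)):
            yield template.format(**dict(zip(fields, combo)))

    new_training_data = [utt for t in TEMPLATES for utt in expand(t)]
    # e.g. "wake me up at 7 am", "set an alarm for noon on Friday", ...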

Data budget recalculation for cost reduction

As part of a team-wide cost-reduction effort, I recalculated the requested utterance counts in existing data collection guidelines to bring future data requests in line with a new, reduced budget. I then used the updated guidelines to request English-locale data for ten existing high-priority domains as a pilot of the revised budget. The annotation and curation of these locale data contributed to the robustness of the overall English model, while the revised guidelines ensured that subsequent requests would stay within budget.
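
As a rough illustration of the recalculation itself, the sketch below scales per-label utterance counts down to a reduced total while preserving their proportions; the labels, counts, and floor value are invented for the example.

    # Sketch: rescale requested utterance counts to fit a reduced budget,
    # keeping relative proportions and enforcing a per-label minimum.
    def rescale_counts(requested, new_budget, floor=50):
        total = sum(requested.values())
        return {
            label: max(floor, round(count * new_budget / total))
            for label, count in requested.items()
        }

    old_request = {"alarm/set": 2000, "alarm/cancel": 800, "alarm/snooze": 400}
    print(rescale_counts(old_request, new_budget=1600))
    # {'alarm/set': 1000, 'alarm/cancel': 400, 'alarm/snooze': 200}
    # Note: rounding and the floor can push the sum slightly off budget,
    # so a final pass may need to trim the largest buckets.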