The First Workshop on Corpus Generation and Corpus Augmentation for Machine Translation @ AMTA 2022
ABOUT
The First Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) will be co-located with AMTA 2022 in Orlando, Florida, USA on September 16th, 2022.
CoCo4MT sets out to be the first workshop centered around research that focuses on corpora creation, cleansing, and augmentation techniques specifically for machine translation.
We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance, thereby encouraging new dataset creation for multiple languages and, in turn, establishing the workshop as a general reference point for future corpora needs.
News
September 13, 2022 - Program Released
August 25, 2022 - Graham Neubig added as guest speaker
August 24, 2022 - Request for materials released
August 10, 2022 - Accepted papers released
August 2, 2022 - Acceptance decisions released
July 6, 2022 - Jörg Tiedemann, Julia Kreutzer, and Maria Nadejde finalized as guest speakers
July 6, 2022 - Paper submissions deadlines extended to July 20th 2022
June 29, 2022 – Third and Final call for papers released
June 15, 2022 – Second call for papers released
June 1, 2022 – First call for papers released
SCOPE
It is well known that machine translation systems, especially those based on deep learning, require massive amounts of data. For many languages, human-created resources are scarce. The resource types that do exist include monolingual corpora, multilingual corpora, translation memories, and lexicons. Parallel resources are generally created for formal purposes, such as parliamentary collections, while monolingual resources tend to come from more informal settings. As a result, corpora created for formal purposes are generally higher in quality and abundance than those created for informal purposes. Additionally, corpora for low-resource languages, i.e., languages with fewer digital resources available, tend to be less abundant and of lower quality.
CoCo4MT sets out to be the first workshop centered on research into corpora creation, cleansing, and augmentation techniques specifically for machine translation. We accept work covering any spoken language, but we are especially interested in submissions on languages with limited existing resources (low-resource languages). Since techniques developed for high-resource languages are generally statistical in nature and can serve as generic solutions for any language, we also welcome submissions on high-resource languages.
The goal of this workshop is to begin to close the gap between the corpora available for low-resource translation systems and those available for high-resource ones, and to promote high-quality data for online systems that can be used by native speakers of low-resource languages. Therefore, it will be beneficial if research papers report the impact of the presented techniques on the quality of MT output and how they can be applied in the real world.
CoCo4MT aims to encourage research on new and undiscovered techniques. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance, thereby encouraging new dataset creation for multiple languages. The workshop’s success will be measured by the following key performance indicators:
Promotes the ongoing increase in quality of machine translation systems when measured by standard measurements,
Provides a meeting place for collaboration from several research areas to increase the availability of commonly used corpora and new corpora,
Drives innovation to address the need for higher quality and abundance of low-resource language data.
TOPICS
Topics of the workshop include but are not limited to:
Difficulties with using existing corpora (e.g., political considerations or domain limitations) and their effects on final MT systems,
Strategies for collecting new MT datasets (e.g., via crowdsourcing),
Data augmentation techniques,
Data cleansing and denoising techniques,
Quality control strategies for MT data,
Exploration of datasets for pretraining or auxiliary tasks for training MT systems.
SUBMISSION INFORMATION
The workshop has a single submission track covering research, review, and position papers. Each paper should be at least four (4) and at most ten (10) pages long, plus unlimited pages for references. Submissions should be formatted according to the official AMTA 2022 style templates (PDF, LaTeX, Word). Accepted papers will be published online in the AMTA 2022 proceedings, which are included in the ACL Anthology, and will be presented at the conference either orally or as a poster.
Submissions must be anonymized and should be made through the official conference management system (https://cmt3.research.microsoft.com/AMTA2022). Scientific papers that have been or will be submitted to other venues must be declared as such, and must be withdrawn from those venues if accepted and published at CoCo4MT. Reviewing will be double-blind.
We would like to encourage authors to cite papers written in ANY language that are related to the topics, as long as both original bibliographic items and their corresponding English translations are provided.
Registration will be handled by the main conference. (To be announced)
IMPORTANT DATES
June 1, 2022 – Call for papers released
June 15, 2022 – Second call for papers
June 29, 2022 – Third and final call for papers
July 13, 2022 – Paper submissions due
July 20, 2022 – Paper submissions due (extended deadline)
July 27, 2022 – Notification of acceptance
August 7, 2022 – Camera-ready due
August 31, 2022 – Video recordings due
September 16, 2022 – CoCo4MT workshop
KEYNOTE SPEAKERS
Ankur Parikh, Google NYC
Ankur Parikh is a staff research scientist at Google NYC. His research interests are in natural language processing and machine learning with a recent focus on high precision text generation. Ankur received his PhD from Carnegie Mellon in 2015 and has received a best paper runner up award at EMNLP 2014 and a best paper in translational bioinformatics at ISMB 2011.
Graham Neubig, Carnegie Mellon University
Graham Neubig is an associate professor at the Language Technologies Institute of Carnegie Mellon University. His research focuses on multilingual natural language processing, natural language interfaces to computers, and machine learning methods for NLP, with the final goal of enabling every person in the world to communicate with each other, and with computers, in their own language. He also contributes to making NLP research more accessible through open publishing of research papers, advanced NLP course materials and video lectures, and open-source software, all of which are available on his web site.
Jörg Tiedemann, University of Helsinki
Jörg Tiedemann is professor of language technology at the Department of Digital Humanities at the University of Helsinki. He received his PhD in computational linguistics from Uppsala University for work on bitext alignment and machine translation, before moving to the University of Groningen for five years of post-doctoral research on question answering and information extraction. His main research interests are connected with massively multilingual data sets and data-driven natural language processing, and he currently runs an ERC-funded project on representation learning and natural language understanding.
Julia Kreutzer, Google Research
Julia Kreutzer is a research scientist at Google Research, Montreal, where she works on improving machine translation. She is generally interested in the intersection of natural language processing (NLP) and machine learning. During her PhD at Heidelberg University, Germany, she investigated how reinforcement learning algorithms can be used to turn weak supervision signals from users into meaningful updates for a machine translation system.
Maria Nadejde, Amazon
Maria Nadejde is a Senior Applied Scientist at Amazon AWS AI working on improving quality and customization of Amazon Translate. Before joining Amazon, Maria was an Applied Research Scientist at Grammarly developing deep learning applications that enhance written communication. She obtained a PhD in Informatics from the University of Edinburgh on the topic of syntax-augmented machine translation.
PANEL
Kenneth Ward Church, Northeastern University
Kenneth Church is a senior principal research scientist at the Institute for Experiential AI at Northeastern University. He earned his undergraduate and graduate degrees from the Massachusetts Institute of Technology, and has worked at AT&T, Microsoft, Hopkins, IBM and Baidu. His work in computational linguistics includes web search, language modeling, text analysis, spelling correction, word-sense disambiguation, terminology, translation, lexicography, compression, optical character recognition, speech (recognition, synthesis, and diarisation), and more. He was an early advocate of empirical methods and a founder of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Kenneth was the president of the Association for Computational Linguistics (ACL) in 2012 and SIGDAT (the group that organizes EMNLP) from 1993 until 2011. He became an AT&T Fellow in 2001 and ACL Fellow in 2015.
Marine Carpuat, University of Maryland
Marine Carpuat is an Associate Professor in Computer Science at the University of Maryland. Her research focuses on multilingual natural language processing and machine translation. Before joining the faculty at Maryland, Marine was a Research Scientist at the National Research Council Canada. She received a PhD in Computer Science and an MPhil in Electrical Engineering from the Hong Kong University of Science & Technology, and a Diplôme d’Ingénieur from the French Grande École Supélec. Marine is the recipient of an NSF CAREER award, research awards from Google and Amazon, best paper awards at the *SEM and TALN conferences, and an Outstanding Teaching Award.