The Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation @ MT Summit 2023

ABOUT

The Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) will be co-located with the MT Summit conference in Macau, China in September 2023. 

CoCo4MT sets out to be a workshop centered around research that focuses on corpora creation, cleansing, and augmentation techniques specifically for machine translation.

We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance, thus encouraging new dataset creation for multiple languages and, in turn, making the workshop a general resource to consult for future corpora needs.

News

SCOPE

It is well known that machine translation systems, especially those that use deep learning, require massive amounts of data. For many languages, resources are not available in human-created form. The types of resources that are available include monolingual and multilingual corpora, translation memories, and lexicons. Parallel resources are generally created for formal purposes, such as parliamentary collections, while monolingual resources tend to come from more informal situations. The quality and abundance of resources, including corpora, created for formal purposes are generally higher than those of resources created for informal purposes. Additionally, corpora for low-resource languages (languages with fewer digital resources available) tend to be less abundant and of lower quality.

CoCo4MT is a workshop centered around research that focuses on manual and automatic corpus creation, cleansing, and augmentation techniques specifically for machine translation. We accept work that covers any language (including sign language), but we are specifically interested in submissions that explicitly report on work with languages that have limited existing resources (low-resource languages). Since techniques from high-resource languages are generally statistical in nature and could be used as generic solutions for any language, we also welcome submissions on high-resource languages.

CoCo4MT aims to encourage research on new and undiscovered techniques. We hope that the methods presented at this workshop will lead to the development of high-quality corpora that will, in turn, lead to high-performing MT systems and the creation of new datasets. We hope that submissions will provide high-quality corpora that are publicly available for download and can be used to improve machine translation performance, thus encouraging new dataset creation for multiple languages and, in turn, making the workshop a general resource to consult for future corpora needs. The workshop’s success will be measured by the following key performance indicators:


TOPICS

Topics of the workshop include but are not limited to:


SUBMISSION INFORMATION



CoCo4MT will accept research, review, or position papers. Each paper should be at least four (4) and at most ten (10) pages long, plus unlimited pages for references. Submissions should be formatted according to the official MT Summit 2023 style templates (https://www.overleaf.com/latex/templates/mt-summit-2023-template/knrrcnxhkqxd). Accepted papers will be published in the MT Summit 2023 proceedings, which are included in the ACL Anthology, and will be presented at the conference either orally or as a poster.

Submissions must be anonymized and should be made to the workshop through the Softconf conference management system (https://softconf.com/mtsummit2023/CoCo4MT). Scientific papers that have been or will be submitted to other venues must be declared as such, and must be withdrawn from the other venues if accepted and published at CoCo4MT. The review will be double-blind.

We would like to encourage authors to cite papers written in ANY language that are related to the topics, as long as both original bibliographic items and their corresponding English translations are provided.

Registration will be handled by the main conference. (To be announced)

IMPORTANT DATES


KEYNOTE SPEAKERS

Manuel Mager, Amazon AWS

Manuel Mager is an Applied Scientist at AWS AI Labs and is completing his Ph.D. at the University of Stuttgart, Germany. He graduated in informatics from the National Autonomous University of Mexico (UNAM) and did a Master's in Computer Science at the Metropolitan Autonomous University, Mexico (UAM). His research is focused on Natural Language Processing for low-resource languages, mainly indigenous languages of the American continent that are polysynthetic. He has also worked on graph-to-text generation and information extraction.

Jack Halpern, The CJK Dictionary Institute

Jack Halpern (春遍雀來), CEO of The CJK Dictionary Institute, is a lexicographer by profession, specializing in Japanese and Chinese. His work as editor-in-chief of learner’s dictionaries resulted in various renowned standard reference works. He has been a resident of Japan for over 40 years but was born in Germany and has lived in France, Brazil, Japan, and the United States. He is an avid polyglot who has studied 18 languages and speaks 11.

Marta R. Costa-jussà, Meta AI

Marta R. Costa-jussà has been a research scientist at Meta AI since February 2022. She received her PhD from the UPC in 2008. Her research experience is mainly in Machine Translation. She has worked at LIMSI-CNRS (Paris), Barcelona Media Innovation Center, Universidade de São Paulo, Institute for Infocomm Research (Singapore), Instituto Politécnico Nacional (Mexico), the University of Edinburgh, and Universitat Politècnica de Catalunya (UPC, Barcelona), co-leading the MT-UPC Group. She has participated in 18 European/Spanish research projects, organised 12 workshops in top venues, and published more than 100 papers. She has been part of the Editorial Board of the Computer Speech and Language journal. She has received an ERC Starting Grant and two Google Faculty Research Awards (2018 and 2019).

PANEL

Silvio Amir, Northeastern University

Silvio Amir works on Natural Language Processing and Machine Learning, with an emphasis on methods to analyze personal and user-generated text, such as social media and clinical notes from Electronic Health Records. He is primarily interested in tasks involving subjective, personalized, or user-level inferences (e.g. opinion mining and digital phenotyping). Amir's research is part of ongoing efforts to develop Human-centered AI (i.e. to empower rather than replace humans). To that end, he often collaborates with domain experts in multidisciplinary projects to address real-world problems in the social sciences, medicine, and epidemiology.

CONTACT

CoCo4MT Workshop Organizers:

coco4mt-2023-organizers@googlegroups.com

CoCo4MT Shared Task Organizers:

coco4mt-shared-task@googlegroups.com