Workshop of Multimodal, Multilingual and Multitask Modeling Technologies for Oriental Languages (M3Oriental)

to be held in conjunction with ACM Multimedia Asia 2023

M3Oriental Workshop of ACM Multimedia Asia 2023 (Dec. 8th afternoon, GMT+8, Tainan City, Taiwan)

Abstract 

This M3Oriental workshop is designed to address the challenges of low-resource languages in speech recognition. Traditional methods are often limited to a single modality and a single task. The workshop aims to leverage advanced methods from computer vision (CV) and natural language processing (NLP) to overcome these limitations. It focuses on integrating multimodal, multilingual, and multitask modeling technologies using large-scale pretrained models. The goal is to explore their potential in multimodal tasks and cross-lingual communication, key features of next-generation artificial intelligence. The workshop covers multiple tasks, such as machine translation (MT), speech translation (ST), speech recognition (ASR), speech synthesis (TTS), voice conversion (VC), and speech emotion recognition (SER), and aims to incorporate complementary information across multiple languages and modalities.

Scope of the Workshop

We are not result-oriented. We welcome any original, interdisciplinary research related to M3Oriental (the topic of this workshop), including but not limited to:

Call for Papers

Paper submission

Submission: Papers for M3Oriental can be submitted to the workshop through the ACM Multimedia Asia 2023 author console of the paper management system (CMT), Track 7: Paper Submission

Select Track: M3Oriental: Workshop of Multimodal, Multilingual and Multitask Modeling Technologies for Oriental Languages.

We invite submissions of original technical papers related to M3Oriental (the topic of this workshop), including but not limited to:

Paper format: Submitted papers should be within the scope of the workshop. Workshop papers generally follow the ACM Multimedia Asia 2023 paper style and format and should be no longer than 6 pages (see the Paper Submission Guidelines); however, the review is single-blind.

In-person or online: The satellite workshops will be held with in-person attendance (online only as a fallback). Accordingly, each accepted workshop paper must be presented in person by one of the authors; online presentation is allowed only in cases of personal health issues, visa problems, or national travel restrictions.

Registration: Registration for workshop papers is provided through the main conference as its official entry. One main conference paper registration can cover one workshop paper. Note that authors of workshop papers who would like to attend the main conference must register for the main conference.

Publication: Upon acceptance, authors will have the opportunity to present their paper at our workshop, and the paper will be included in the workshop proceedings of ACM MMAsia 2023.

Important dates

Accepted resource/toolkit/benchmark papers

1. Qiwei Li (Wuhan University); Zuchao Li (Wuhan University); Xiantao Cai (Wuhan University); Bo Du (Wuhan University); Hai Zhao (Shanghai Jiao Tong University), Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

2. Chengxi Lei (Massey University); Satwinder Singh (Massey University); Feng Hou (Massey University); Xiaoyun Jia (Shandong University); Ruili Wang (Massey University), PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

3. Zhaojie Luo (Osaka University); Stefan Christiansson (KTH Royal Institute of Technology); Bence Ladóczki (Budapest University of Technology and Economics); Kazunori Komatani (Osaka University), Speech Emotion Recognition Using Threshold Fusion for Enhancing Audio Sensitivity

4. Zom Yang (Qinghai Minzu University); Kuntharrgyal Khysru (Qinghai Minzu University); Yi Zhu (Tianjin University); Long Daijicuo (Qinghai Minzu University); Jianguo Wei (School of Computer Software, Tianjin University, Tianjin, China), Automatic Labeling of Tibetan Prosodic Boundary Based on Speech Synthesis Tasks

5. Qie Yangzhuoma (Qinghai Guide); Kuntharrgyal Khysru (Qinghai Minzu University); Wan Maji (Qinghai Minzu University); Jianguo Wei (School of Computer Software, Tianjin University, Tianjin, China), Research on the classification method of knowledge question intention for Tibetan language curriculum

6. Hay Mar Soe Naing (University of Computer Studies, Yangon, Myanmar); Win Pa Pa (University of Computer Studies, Yangon), A Large Vocabulary End-to-End Myanmar Automatic Speech Recognition

7. Wangjin Zhou (Kyoto University); Zhengdong Yang (Kyoto University); Sheng Li (National Institute of Information & Communications Technology (NICT)); Chenhui Chu (Kyoto University), KyotoMOS: An Automatic MOS Scoring System for Speech Synthesis

8. Ye Kyaw Thu (NECTEC); Thazin Myint Oo (Language Understanding Lab); Thepchai Supnithi (NECTEC), Reinforcement Learning Fine-tuning for Improved Neural Machine Translation of Burmese Dialects

We are pleased to have an invited paper:

Zhao Ren (University of Bremen); Kun Qian (Beijing Institute of Technology); Tanja Schultz (University of Bremen); Björn W. Schuller (Imperial College London), An Overview of the ICASSP Special Session on AI Security and Privacy in Speech and Audio Processing

We also have a speech recognition challenge in preparation. 

Indic language speech task: please contact Dr. Raj Dabre (raj.dabre-a-t-nict.go.jp)

Workshop Schedule (Dec. 8th, GMT+8; hybrid: online and on-site at National Cheng Kung University)

Paper presentations are in poster format. Please see the 3-minute videos and PDF posters on Gather Town.

Communication is possible throughout the entire meeting.

333 Qiwei Li (Wuhan University), et al., Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

336 Chengxi Lei (Massey University), et al., PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

340 Zhaojie Luo (Osaka University), et al., Speech Emotion Recognition Using Threshold Fusion for Enhancing Audio Sensitivity

341 Zom Yang (Qinghai Minzu University), et al., Automatic Labeling of Tibetan Prosodic Boundary Based on Speech Synthesis Tasks

344 Hay Mar Soe Naing (University of Computer Studies, Yangon, Myanmar), et al., A Large Vocabulary End-to-End Myanmar Automatic Speech Recognition

345 Qie Yangzhuoma (Qinghai Guide), et al., Research on the classification method of knowledge question intention for Tibetan language curriculum

354 Wangjin Zhou (Kyoto University), et al., KyotoMOS: An Automatic MOS Scoring System for Speech Synthesis

358 Zhao Ren (University of Bremen), et al., An Overview of the ICASSP Special Session on AI Security and Privacy in Speech and Audio Processing

359 Ye Kyaw Thu (NECTEC), et al., Reinforcement Learning Fine-tuning for Improved Neural Machine Translation of Burmese Dialects



The following is the keynote speech schedule, held in Zoom meeting room 3 and on-site:

The Zoom meeting link is: https://reurl.cc/NydOK9

The password of all Zoom meeting rooms is: mmasia2023


Welcome speech

Session chairs: Dr. Sheng Li, Dr. Raj Dabre, and Dr. Bei Liu

14:00-14:30 Prof. Zhizheng Wu: AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

14:30-15:00 Dr. Jianlong Fu: Text-to-Video Generation Based on Large-Scale Multimodal Diffusion Models

15:00-15:30 Dr. Xianchao Wu: Adapting Diffusion Models for Speech Recognition and Translation

15:30-16:00 Prof. Yu Tsao: Utilizing Deep Learning for Speech Enhancement in Assistive Oral Communication Technologies

Session chairs: Prof. Jiyi Li and Dr. Zhao Ren

16:00-16:30 Prof. Björn W. Schuller: Navigating AI through the Peaks and Valleys of Emotion in Spoken Mandarin

16:30-17:00 Prof. Tatsuya Kawahara: Making a Robot to Communicate with Social Signals

17:00-17:30 Prof. Zuchao Li: Document Understanding and Beyond: Towards Multi-modal Language Processing

17:30-18:00 Prof. Emilia Barakova and Dr. Vacaru Stefania: Unlocking Potential: Tech Advances in Disability Care

Workshop closing speech

Invited Speakers

Prof. Tatsuya Kawahara received the B.E. in 1987, M.E. in 1989, and Ph.D. in 1995, all in information science, from Kyoto University, Kyoto, Japan. From 1995 to 1996, he was a Visiting Researcher at Bell Laboratories, Murray Hill, NJ, USA. Currently, he is a Professor in the School of Informatics, Kyoto University, and from 2020 to 2023 he was the Dean of the School. Before that, he was also an Invited Researcher at ATR and NICT.

He has published more than 450 academic papers on automatic speech recognition, spoken language processing, and spoken dialogue systems. He has been conducting several projects, including the open-source speech recognition software Julius, the automatic transcription system deployed in the Japanese Parliament (Diet), and the autonomous android ERICA.

Dr. Kawahara received the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT) in 2012. From 2003 to 2006, he was a member of IEEE SPS Speech Technical Committee. He was a General Chair of IEEE ASRU 2007. He also served as a Tutorial Chair of INTERSPEECH 2010, a Local Arrangement Chair of ICASSP 2012, and a General Chair of APSIPA ASC 2020. He was an editorial board member of Elsevier Journal of Computer Speech and Language and IEEE/ACM Transactions on Audio, Speech, and Language Processing. From 2018 to 2021, he was the Editor-in-Chief of APSIPA Transactions on Signal and Information Processing. Dr. Kawahara is the President of APSIPA, a board member of ISCA, and a Fellow of IEEE. 

Prof. Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich, Germany. He is Full Professor of Artificial Intelligence and the Head of GLAM at Imperial College London, UK; Full Professor and Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany; co-founding CEO and current CSO of audEERING, an Audio Intelligence company based near Munich and in Berlin, Germany; independent research leader within the Alan Turing Institute as part of the UK Health Security Agency; and permanent Visiting Professor at HIT, China, amongst other professorships and affiliations. Previous stays include Guest Professor at Southeast University in Nanjing, China; Full Professor at the University of Passau, Germany; Key Researcher at Joanneum Research in Graz, Austria; and CNRS-LIMSI in Orsay, France. He is a Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, Elected Full Member of Sigma Xi, and Senior Member of the ACM. He has (co-)authored 1,200+ publications (50,000+ citations, h-index of 100+, ranking him number 7 in the UK for Computer Science), is Field Chief Editor of Frontiers in Digital Health, and was Editor-in-Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community. His 50+ awards include having been honoured as one of 40 extraordinary scientists under the age of 40 by the WEF in 2015, and most recently being named an IEEE Signal Processing Society Distinguished Lecturer for 2024. He has served as Coordinator/PI in 15+ European projects, is an ERC Starting and DFG Reinhart-Koselleck Grantee, and a consultant for companies such as Barclays, GN, Huawei, Informetis, and Samsung. Schuller counts more than 300 public press appearances, including in Business Insider, Guardian, International Business Times, Newsweek, Scientific American, Times, The Economist 1843, and UK Daily Mail, as well as national and international podcast, radio, and television contributions such as in MIT Technology Review and "The World" and "The Why".

Prof. Emilia Barakova received her Ph.D. in Mathematics and Natural Sciences from the University of Groningen in 1999 and her master's degree in Electronics and Automation Engineering from the Technical University of Sofia, Bulgaria. She is presently affiliated with the Industrial Design department and serves as the Head of the Social Robotics Lab at the Eindhoven University of Technology. She formerly worked at the RIKEN Brain Science Institute, Wako-shi, Japan; the German-Japanese Robotics Research Lab, Kitakyushu, Japan; the University of Groningen in the Netherlands; and the Bulgarian Academy of Sciences. Barakova specializes in embodied social interaction with and through technology and in social and cognitive robotics. She has expertise in modelling social behaviour by merging artificial intelligence, cognitive sciences, and robotics. Her present research focuses on the use of social robots for enhancing the well-being of people with visual impairments, intellectual disabilities, and dementia, as well as on education and special education (e.g., social skills training for children with autism spectrum disorders). Barakova has served as a program chair for several conferences (including IJSR, IEEE RO-MAN, and IEEE Hybrid Intelligent Systems), and she is an Associate Editor of the International Journal of Social Robotics, as well as an editor of Personal and Ubiquitous Computing, Interaction Studies, and Transactions on Human-Machine Systems. She has co-authored over 250 peer-reviewed papers.

Prof. Zhizheng Wu is an associate professor at the Chinese University of Hong Kong, Shenzhen. Prior to that, he led teams and performed research at Meta, JD.com, Apple, the University of Edinburgh, and Microsoft Research Asia. Zhizheng received his Ph.D. from Nanyang Technological University, Singapore in 2015. Zhizheng is the creator of Merlin, an open-source speech synthesis toolkit. He initiated and co-organized the first speaker verification spoofing and countermeasures challenge as a special session at Interspeech 2015, the Voice Conversion Challenge 2016, and the Blizzard Challenge 2019. He also gave a tutorial on spoofing detection at APSIPA ASC 2015 and a tutorial on deep learning-based speech synthesis at Interspeech 2017. Zhizheng is an associate editor of IEEE/ACM Transactions on Audio Speech and Language Processing and a member of the IEEE Speech and Language Processing Technical Committee. He is also the General Chair of IEEE Spoken Language Technology Workshop 2024.

Prof. Yu Tsao (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher with the National Institute of Information and Communications Technology, Tokyo, Japan, where he engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is currently a Research Fellow (Professor) and the Deputy Director of the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. He is also a Jointly Appointed Professor with the Department of Electrical Engineering, Chung Yuan Christian University, Taoyuan, Taiwan. His research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He is currently an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Letters. He received the Academia Sinica Career Development Award in 2017, national innovation awards in 2018–2021, the Future Tech Breakthrough Award in 2019, the Outstanding Elite Award of the Chung Hwa Rotary Educational Foundation in 2019–2020, and the NSTC FutureTech Award in 2022. He is the corresponding author of a paper that received the 2021 IEEE Signal Processing Society (SPS) Young Author Best Paper Award.

Dr. Jianlong Fu is currently a senior research manager responsible for research and innovation in the multimodal computing group at Microsoft Research Asia (MSRA). He received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences.

His research focuses on multimedia content understanding and multi-modal perceptual computing in images, videos, and embodied agents. He has published over 100 peer-reviewed technical papers and holds over 20 US patents. His Google Scholar h-index is currently 47.

Dr. Fu serves as the vice-chair of the Automotive CE Applications Technical Committee under the IEEE Consumer Technology Society, as an editorial board member for IEEE TMM and IEEE CTSoc-NCT, and as a guest editor for IEEE TPAMI from 2019 to 2021. He has also chaired several specialized committees at international multimedia flagship conferences such as ACM Multimedia 2021 and ACM ICMR 2021/2023. He has received multiple awards, including the ACM SIGMM Rising Star Award 2022, the Best Paper Award at the 2018 ACM Multimedia Conference, and over 10 international competition championships at CVPR/ICCV/ECCV. Additionally, his research has been applied to various Microsoft products such as Windows, Office, Bing, Edge, and XiaoIce.

Prof. Zuchao Li is currently an Associate Researcher at Wuhan University. He obtained his PhD degree from the Department of Computer Science and Technology at Shanghai Jiao Tong University, under the guidance of Prof. Hai Zhao. Prof. Li has also spent time at the National Institute of Information and Communications Technology (NICT) in Japan as a Limited Technical Researcher, hosted by Dr. Masao Utiyama and Dr. Eiichiro Sumita. His research interests encompass language sequence modeling, linguistic structure parsing, and language representation learning from various types of data, such as unlabeled or noisy data and structured data like trees and graphs. Specifically, Prof. Li concentrates on theoretical and algorithmic approaches for syntax/semantic parsing, self-supervised/weakly supervised learning, structure learning, and related NLP tasks. Recently, he has delved into the realm of Multimodal Large Language Models (LLMs).

Dr. Xianchao Wu received his Ph.D. degree in natural language processing from The University of Tokyo in 2010. He is currently a senior solution architect and data scientist at NVIDIA. He worked at Baidu until 2015 and then at Microsoft until 2020. He helped develop the chatbots Rinna and XiaoIce, which have 32 million users in Japan and more than 80 million users in China. His research interests include large-scale pretrained language models, conversational AI, creative AI, and financial NLP. He is a (co-)author and (co-)inventor of 50+ papers and 110+ patents in the conversational AI and creative AI fields. Over the past 10+ years, he has served as a PC member or session chair of ACL, EMNLP, NeurIPS, ICML, ICLR, NAACL, COLING, AAAI, AACL, and UAI more than 30 times.

Organizers and Program Committee

This workshop is partly supported by NICT international funding.

For questions, please contact the organizers at sheng.li-a-t-nict.go.jp, or reach out to any of the other organizers listed below.

Speech

Dr. Eng Siong Chng, Nanyang Technological University (NTU), Singapore, Associate Professor (ASESChng-a-t-ntu.edu.sg)

Dr. Zhizheng Wu, Chinese University of Hong Kong, Shenzhen, Associate Professor (wuzhizheng-a-t-cuhk.edu.cn)

Dr. Xugang Lu, NICT, Kyoto, Japan, Senior Researcher (xugang.lu-a-t-nict.go.jp)

Dr. Sheng Li, NICT, Kyoto, Japan, Researcher (sheng.li-a-t-nict.go.jp)

Dr. Xinhui Hu, RoyalFlush AI, China, Chief Scientist (huxinhui-a-t-myhexin.com)

NLP

Dr. Chenhui Chu, Kyoto Univ., Kyoto, Japan, Associate Professor (chu-a-t-i.kyoto-u.ac.jp)

Dr. Jiyi Li, Univ. Yamanashi, Japan, Assistant Professor (jyli-a-t-yamanashi.ac.jp)

Dr. Raj Dabre,  NICT, Kyoto, Japan, Researcher (raj.dabre-a-t-nict.go.jp)

Dr. Qianying Liu, Rinna Inc., Tokyo, Japan, Researcher (ying-a-t-nlp.ist.i.kyoto-u.ac.jp)

Haiyue Song, Kyoto Univ., Kyoto, Japan, Star student (haiyue.song-a-t-nict.go.jp or song-a-t-nlp.ist.i.kyoto-u.ac.jp)

Multimodal

Dr. Xianchao Wu, NVIDIA, Tokyo, Japan, Senior Solution Architect (xianchaow-a-t-nvidia.com)

Dr. Bei Liu, MSRA, Beijing, China, Senior Researcher (bei.liu-a-t-microsoft.com)

Dr. Zuchao Li, Wuhan Univ., China, Associate Professor (zcli-charlie-a-t-whu.edu.cn)

Dr. Zhaojie Luo, Osaka Univ., Japan, Assistant Professor

Security

Dr. Yang Cao, Hokkaido Univ., Sapporo, Japan, Associate Professor (yang-a-t-ist.hokudai.ac.jp)

Dr. Zhao Ren, Univ. Bremen, Germany, Researcher (zren-a-t-uni-bremen.de)

Gallery, Presentation Videos, Data/Recipe/Model Releases

These will be released after the workshop.