Efficiency in African NLP @ Deep Learning Indaba

The Natural Language Processing workshop at the Deep Learning Indaba 2023 takes place on Friday, September 8, 2023. It is organized by Hatem Haddad, David Adelani, Salomey Osei, Everlyn Asiko, Shester Gueuwou, Kwadwo Amedi, Atnafu Lambebo Tonja, Millicent Ochieng, and the Masakhane community.


Please register your email to participate in the hands-on session on Named Entity annotation using this Google Form. A short tutorial on the annotation is on YouTube, and the hands-on materials are on GitHub.

Join the entire programme using the Zoom link.


Schedule

Friday, September 8, 2023. Location: --

Morning Sessions 

Afternoon Sessions

Invited Speakers

Paul Azunre
Co-Founder, GhanaNLP

Orevaoghene Ahia
PhD student at the University of Washington

Machel Reid
Research scientist at Google DeepMind

David Ifeoluwa Adelani
DeepMind Academic Fellow at University College London, UK

Monojit Choudhury

Principal Data and Applied Scientist, Microsoft India

Spotlight Speakers

Bonaventure F. P. Dossou
Ph.D. student at McGill University and Research Scientist at Lelapa AI

Jessica Ojo 

Research and Machine Learning Engineer at Lelapa AI

Invited Talk Details

Paul Azunre

Paul Azunre holds a PhD in Computer Science from MIT and has served as a Principal Investigator on several DARPA research programs. He founded Algorine Inc., a research lab dedicated to advancing AI/ML and identifying scenarios where it can have significant social impact. Paul also co-founded GhanaNLP, an open-source initiative focused on using NLP and transfer learning with Ghanaian and other low-resource languages. He also serves as Director of Research at Dun & Bradstreet, a company helping businesses manage supply chain risk and other business analytics challenges. He is the author of the recently published book "Transfer Learning for Natural Language Processing" (Manning Publications).

African NLP - Beyond Global Big Tech

In this talk, I will describe how GhanaNLP and Algorine Research built Khaya AI, the world's first machine translation and speech recognition app and API for several Ghanaian, Nigerian, and Kenyan languages. I will discuss how this system was deployed on a shoestring budget to hundreds of thousands of users across available platforms. Our discussion will challenge prevailing research paradigms backed by prominent Big Tech corporations, particularly those emphasizing "zero-resource translation" predicated on extensive scraping of monolingual data and the relentless expansion of model sizes. By comparing recent evaluations of our system with those advanced by Big Tech entities, we will demonstrate that prioritizing the curation of superior-quality multilingual data, and thereby investing in the communities that actually own the language and culture, yields models that are simultaneously smaller and better. This comparison will also underscore the detrimental impact of exaggerated claims and hype surrounding suboptimal models, as exemplified by Meta's NLLB, which undermine the local communities doing this important work. I will explore a collaborative initiative involving the Distributed AI Research Institute and Lesan AI to build a unified system with better coverage of African languages by uniting local organizations serving their own communities and building the best tools for their own languages. I will argue that this, as opposed to a monopoly by global Big Tech, is in the cultural and national security interests of our communities.

Orevaoghene Ahia

Orevaoghene Ahia is a PhD student in Computer Science and Engineering at the University of Washington, advised by Noah A. Smith and Yulia Tsvetkov. Previously, she was a Research Engineer at Instadeep working on AI solutions for enterprise clients. Her research interests include multilingual NLP, model efficiency, and model interpretability.

Subword tokenization 

Subword tokenization has become the de facto method for segmenting text in NLP. It is particularly useful when learning multilingual representations because it has been shown to facilitate cross-lingual transfer across languages that share properties such as script and alphabet. However, subword tokenizers have also been reported to cause disproportionate fragmentation rates for languages with different scripts. The majority of commercial LMs are multilingual, and users who speak languages with higher fragmentation rates are usually disadvantaged in terms of inference cost and model utility. In this talk, I will give a brief history of tokenization in NLP and then discuss in depth the flaws in recent subword tokenization methods and their effect on model utility, inference cost, and financial cost in the current age of commercial LLMs.
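To make the fragmentation disparity concrete, here is a minimal sketch that measures "fertility" (subword tokens per whitespace word) with a multilingual tokenizer. It assumes the Hugging Face transformers library; the choice of xlm-roberta-base and the sample sentences are illustrative assumptions, not taken from the talk.

```python
# Fragmentation sketch: fertility = subword tokens per whitespace word.
from transformers import AutoTokenizer

# Assumption: xlm-roberta-base as an example multilingual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative, roughly parallel sentences (not from the talk).
samples = {
    "English": "The children are going to school this morning.",
    "Swahili": "Watoto wanaenda shuleni asubuhi ya leo.",
}

for language, sentence in samples.items():
    words = sentence.split()
    subwords = tokenizer.tokenize(sentence)
    fertility = len(subwords) / len(words)
    print(f"{language}: {len(words)} words -> {len(subwords)} subwords "
          f"(fertility {fertility:.2f})")
```

Since commercial APIs typically bill per token, a language with higher fertility pays more and waits longer for the same sentence, which is exactly the disparity the talk examines.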

Machel Reid

Machel Reid is a research scientist at Google DeepMind working on NLP research, with a focus on multilingual NLP. Recently, he has been working extensively with LLMs and instruction tuning, developing recipes to boost multilingual capabilities in LLMs.

Extending Large Language Models Beyond English

Large language models (LLMs) have achieved state-of-the-art results on a wide range of natural language processing tasks. However, most highly performant LLMs are trained on largely monolingual data in English, which limits their ability to generalize to other languages. This talk will discuss recent advances in extending LLMs beyond English. We will cover approaches in multilingual pre-training, zero-shot cross-lingual transfer in an in-context learning setting, and multilingual instruction tuning. We will also discuss the challenges of working with LLMs in low-resource languages, and the opportunities for future research.
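As a concrete illustration of the in-context learning setting mentioned above, here is a minimal sketch of cross-lingual prompting: English demonstrations prime the model for a task, and the query is posed in another language. The demonstrations, the Yoruba query, and the final generation step are illustrative assumptions, not a specific system from the talk.

```python
# Cross-lingual in-context learning sketch: English few-shot demonstrations
# followed by a query in another language. All examples are illustrative.
def build_prompt(demonstrations, query):
    """Format (text, label) pairs as few-shot examples, then append the query."""
    lines = [f"Text: {text}\nSentiment: {label}\n" for text, label in demonstrations]
    lines.append(f"Text: {query}\nSentiment:")
    return "\n".join(lines)

english_demos = [
    ("I really enjoyed this film.", "positive"),
    ("The service was terrible.", "negative"),
]

# Illustrative Yoruba query ("This food is very delicious.").
prompt = build_prompt(english_demos, "Ounjẹ yii dun pupọ.")
print(prompt)  # send `prompt` to the LLM of your choice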

David Ifeoluwa Adelani

David Ifeoluwa Adelani is a DeepMind Academic Fellow at University College London, UK, collaborating with the UCL NLP Group. He was formerly a PhD student in the Spoken Language Systems group and a member of the Saarbrücken Graduate School of Computer Science at the Saarland Informatics Campus.

Hands-on session

Monojit Choudhury

Monojit Choudhury is a Principal Data and Applied Scientist at Microsoft India. He builds large universal language models that form the backbone of various Microsoft products. Prior to this, he was a Principal Researcher at Microsoft Research Lab India, and he still collaborates closely with his colleagues from MSR. His research interests cut across the areas of linguistics, cognition, computation, and society. He has a B.Tech and a PhD in Computer Science and Engineering from IIT Kharagpur and has been at Microsoft Research since 2007.

Spotlight Talk

Bonaventure F. P. Dossou

Bonaventure F. P. Dossou is a Ph.D. student in the Probability Vision Group of the Center for Research on Intelligent Machines at McGill University (supervised by Professor Tal Arbel) and a Research Scientist at Lelapa AI. He holds a Bachelor of Science in Mathematics (Russia) and a Master of Science in Computer Science and Data Engineering (Germany). Previously, he worked as a researcher at the Mila Quebec AI Institute (under Professor Yoshua Bengio), Google Research, Roche Canada, and Modelis, among others. Bonaventure is interested in machine learning for healthcare (medical imaging, AI-powered drug discovery, and gene therapy) and natural language processing for low-resourced languages.

Bridging Linguistic Frontiers: Machine Learning & NLP Innovations Empowering African Languages: Challenges, Progress, and Promising Futures

Whether written, spoken, or signed, language is crucial for human communication and ensures understanding between people across regions. With the growing awareness of and effort to include more low-resourced languages in NLP research, African languages have recently become a major subject of research in natural language processing. Accounting for more than 31% of living spoken languages, African languages are morphologically rich, culturally rich, diverse, and low-resourced. Covering a range of topics from (multilingual) machine translation, speech recognition, language modelling, named entity recognition, and part-of-speech tagging to datasets and Lanfrica, in this presentation I will share my journey into AfricaNLP research, the challenges faced, the progress made, and insights for the future.

Jessica Ojo 

Jessica Ojo is a Research and Machine Learning Engineer at Lelapa AI and a researcher at Masakhane. She is passionate about low-resource language technology.

How Good are Large Language Models on African Languages?

Recent advancements in Natural Language Processing (NLP) have led to the proliferation of large pre-trained language models. These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. They have also been exposed as commercial APIs, as a form of language-model-as-a-service, with wide adoption. However, their performance in African languages is largely unknown. We present a preliminary analysis of commercial large language models on multiple tasks across several African languages, spanning different language families and geographical areas. Our results suggest that commercial language models produce below-par performance in African languages. Overall, our findings are a call to action to ensure African languages are well represented in commercial large language models, given their growing popularity.
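To illustrate the shape of such an evaluation, here is a minimal sketch that scores an LLM on a sentiment task per language. The `query_llm` helper is a hypothetical stand-in for whichever commercial API client is used, and the single-example datasets are illustrative; neither comes from the study itself.

```python
# Evaluation sketch: score an LLM on a per-language sentiment task.
# `query_llm` and the dataset records below are hypothetical placeholders.
def query_llm(prompt: str) -> str:
    # Replace this placeholder with a real commercial API call.
    return "positive"

def accuracy(examples, language):
    correct = 0
    for text, gold in examples:
        prompt = (f"Classify the sentiment of the following {language} text "
                  f"as positive or negative.\nText: {text}\nSentiment:")
        prediction = query_llm(prompt).strip().lower()
        correct += prediction == gold
    return correct / len(examples)

datasets = {  # illustrative single-example "datasets"
    "Hausa": [("Ina son wannan fim.", "positive")],
    "Yoruba": [("Fiimu yii ko dara rara.", "negative")],
}

for language, examples in datasets.items():
    print(f"{language}: accuracy {accuracy(examples, language):.2f}")
```

Running the same harness over many tasks, languages, and commercial models is what allows the cross-family, cross-region comparison the abstract describes.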