Over the past few years, the number of languages, and with them the cultures, covered by pre-trained language and vision models has grown. Across works that aim to increase the representation of diverse cultures and languages and works that evaluate how well current models represent the diverse realities of the world, a common denominator is the lack of representative data. In this panel, we will learn from the experiences of three institutions focused on collecting culturally and linguistically representative data. We will touch upon challenges, financing, data licensing, and community engagement. By engaging with experts who work in diverse locations, we hope to shed light on the similarities and differences in collecting representative data in different parts of the world.
Panelists:
Ekaterina (Katya) Artemova, Toloka AI is a Research Scientist at Toloka AI and holds a PhD from HSE University. Her research is in data-centric NLP, with a particular focus on benchmarking strategies, multilingual and low-resource settings, and LLM evaluation. Ekaterina has published in leading NLP and AI conferences and journals. She has also co-organized the 1st NLP Power! workshop at ACL '22, a tutorial on artificial text detection at INLG '22, a tutorial on hybrid data collection at COLING '25, and multiple shared tasks at the Dialogue and CLEF conferences.
Gloria Emezue, Lanfrica is a professor of English at the Federal University Ndufu Alike Ikwo in Ebonyi State, Nigeria, and a postdoctoral fellow of the American Council of Learned Societies' African Humanities Program. She is also a fellow of the International League of Conservation Writers. Her major research foci include Postcolonial studies, Gender studies, and Digital Humanities. At Lanfrica, among other responsibilities, Prof. Emezue spearheaded the NaijaVoices project, which collects audio datasets for African languages.
Neha Sengupta, Inception AI is a Director of R&D at Inception. Neha joined Inception in September 2018 after completing a PhD at the Indian Institute of Technology (IIT), Delhi, India. Prior to her PhD, she spent three years working at IBM Research Labs in India. At Inception, she leads the bilingual LLM team that brings applications of LLM technology to non-English languages. She also works on agentic solutions that use LLMs in various downstream applications across English and non-English languages.
Vivek Seshadri, Karya Inc. is the Cofounder of Karya and a Principal Researcher at Microsoft Research India. Vivek received his BTech in Computer Science from IIT Madras and his PhD in Computer Science from Carnegie Mellon University, where he worked on designing efficient memory systems. Inspired to use his skills to solve problems faced by underserved communities, Vivek moved back to India and started working in the broad area of technology for development. At Karya, along with his team, Vivek is focused on creating economic and skilling opportunities for people in underserved communities by connecting them to AI-enabled digital work. He has led Karya into multiple global impact-focused accelerator programs and to several awards. Outside of work, he is a competitive squash player.
Sougata Saha, MBZUAI (Moderator) is a postdoctoral researcher in the NLP department of MBZUAI, Abu Dhabi, advised by Dr. Monojit Choudhury. His current research focuses on the intersection of culture and LLMs, encompassing the cultural alignment of LLMs and gearing LLMs for diverse aspects of culture. He graduated with a PhD in Computer Science and Engineering from the University at Buffalo, New York, in Spring 2024.