Ilaria Manco is a Research Scientist in the Magenta team at Google DeepMind. Her research spans music generation and understanding, with a current focus on new forms of musical interaction via controllable, real-time generative models. Ilaria received her PhD from Queen Mary University of London, where she developed multimodal representation learning approaches to connect music and language. During her doctoral work she also collaborated with Universal Music Group on large-scale audio-caption datasets and audio-language models.
About her talk
Title: The Musical Highs and Lows of Audio-Language Models
Audio-Language Models (ALMs) promise a future where AI can hear and reason about music with human-like fluency. The past couple of years have brought remarkable progress towards this goal, but the depth of musical understanding in these models remains under scrutiny. This talk retraces the evolution of ALMs, analyzing the shift toward unified architectures and exploring the implications for model design and learning paradigms. Drawing on insights from the evaluation literature and benchmarks, we will reflect on the "musical highs and lows" of these models, examining persistent bottlenecks where systems excel at "reading" music through linguistic or symbolic priors but struggle to truly "hear" and ground their reasoning in the acoustic signal. Finally, we will discuss how we can tighten the link between language and music to achieve genuine musical understanding.
Enrico Palumbo researches and builds Generative AI features for Search and Recommendations at Spotify, with a focus on agentic technologies and generative recommendations. Before joining Spotify, he was a Research Scientist at Amazon, developing language understanding models for Alexa in non-English locales. He holds a PhD on Knowledge Graph Embeddings for Recommender Systems, carried out jointly between the Polytechnic University of Turin, EURECOM, and Links Foundation.
About his talk
Title: You Say Search, I Say Recs: A Language-based Approach to Music Recommendation
Search and Recommendations are generally implemented as separate systems in online content platforms: Search retrieves content based on natural language queries, while Recommender Systems provide suggestions based on the user's listening history.
In recent years, Large Language Models (LLMs) have started to challenge this traditional divide. Thanks to their impressive query understanding abilities and extensive world knowledge, LLMs enable recommendation experiences that leverage both highly sophisticated user prompts and users' past preferences.
This talk goes through recent research on LLMs for music recommendation at Spotify, leveraging techniques such as generative retrieval and agentic workflows to provide language-based recommendations.
Harin Lee is a Junior Research Fellow at King's College, University of Cambridge, combining big data analysis with cross-cultural experiments. His research focuses on the psychological foundations of music cognition, the characteristics and evolutionary patterns of music globally, and how one's aesthetic taste is shaped by the environment. Harin's work ranges from field experiments with Tsimané villagers in the Bolivian Amazon to online paradigms investigating cultural evolution in artificial worlds. He earned his PhD from the Max Planck Institute for Human Cognitive and Brain Sciences and an MSc in Music, Mind, and Brain from Goldsmiths, and completed a research internship at Deezer in Paris. He is one of the founders of 'aiar', an art-science collective that integrates real-time brain imaging into live audiovisual performances, including recent events at venues such as Berghain in Berlin.
About his talk
Title: Understanding music in the age of big data
We live in an exciting time in which the abundance of digital media and advanced computational tools, such as machine learning, enables the study of music and culture on an unprecedented scale. What people listen to across different regions, communities, and historical periods offers valuable insights into how human culture evolves, the cognitive processes that underlie it, and societal dynamics. In this talk, I will showcase big-data projects that explore these themes using billions of music events and cross-cultural psychological experiments.
Willem Zuidema is an associate professor of NLP and Explainable AI at the Institute for Logic, Language & Computation (ILLC) at the University of Amsterdam. He has published widely on computational models of language, including comparisons with music and animal communication and explorations of their evolutionary origins. He has done pioneering work in deep learning for NLP and in interpretability methods for LSTMs and Transformers. He leads the InDeep consortium, which focuses on interpretability for text, speech and music and involves 7 PhD students and 5 universities in the Netherlands.
About his talk
Title: What does your GenAI model know about language, speech and music? Using the full interpretability toolbox to find out.
How much has a modern AI model, trained on massive datasets of music, really learned about the building blocks of music? The modern (mechanistic) interpretability toolbox offers a rich set of methods to find out. I will demonstrate the usefulness of this toolbox, and the need to use a variety of tools, by showing results from applying it to speech models. From the internals of models such as wav2vec2, HuBERT, WavLM and ParlerTTS, we can decode articulatory movements, phoneme repertoires, syllable categories, knowledge of grammar and more, and diagnose their limitations. I will close with some initial parallel results in the domain of music.
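As a rough illustration of the kind of probing this toolbox includes (not a method from the talk itself), the sketch below trains a simple linear probe on frame-level hidden states from a pretrained wav2vec2 checkpoint; the checkpoint name, the probed layer, and the phoneme labels are placeholder assumptions chosen only to make the example self-contained.

```python
# Minimal probing sketch (illustrative only): fit a linear classifier on
# frame-level hidden states from a pretrained wav2vec2 model to test whether
# a chosen layer encodes phoneme-like categories. The labels below are random
# placeholders; a real probe would use forced-aligned phoneme annotations.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-base", output_hidden_states=True
)
model.eval()

# Stand-in for one second of 16 kHz audio; replace with real speech.
waveform = np.random.randn(16000).astype(np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

layer = 6  # probe one intermediate layer; sweep layers to localize information
frames = outputs.hidden_states[layer].squeeze(0).numpy()  # (num_frames, hidden_dim)

labels = np.random.randint(0, 5, size=len(frames))  # placeholder phoneme IDs
probe = LogisticRegression(max_iter=1000).fit(frames, labels)
print("probe accuracy (placeholder labels):", probe.score(frames, labels))
```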