Project Description


Music is built from sound, ultimately the result of an elaborate interaction between the sound-generating properties of physical objects (e.g. musical instruments) and the sound-perception abilities of the human auditory system. Humans, even without any formal music training, are typically able to extract, almost unconsciously, a great deal of relevant information from a musical signal. The beat of a musical piece, the main melody of a complex arrangement, the sound sources and events occurring in a complex mixture, or the overall song structure are just some examples of the knowledge that a naive listener can commonly extract simply by listening to a piece of music. To do so, the human auditory system relies on a variety of perceptual grouping cues, such as similarity, proximity, harmonicity and common fate, among others [1].

Typical computational systems for sound analysis and Music Information Retrieval (MIR) represent the entire polyphonic or complex sound mixture statistically (e.g. [2, 3]), without any attempt to first identify the different sound entities or events that may coexist in the signal. There is, however, evidence that this approach has reached a 'glass ceiling' [4] in terms of analysis and retrieval performance.
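To make the contrast concrete, the "whole-mixture" representation criticized above can be illustrated by a minimal bag-of-frames sketch: per-frame features are pooled into global statistics for the entire recording, so all coexisting sound events collapse into a single vector. The function name `bag_of_frames_summary` and the choice of spectral centroid as the feature are illustrative assumptions, not part of the referenced systems.

```python
import numpy as np

def bag_of_frames_summary(signal, sr=22050, frame_len=1024, hop=512):
    """Summarize a whole recording by per-frame spectral centroids,
    pooled with mean/std across all frames. The entire mixture is
    reduced to one statistical vector, with no attempt to separate
    the sound events it contains (illustrative sketch only)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    centroids = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        centroids.append(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    centroids = np.array(centroids)
    # global statistics over the mixture: one vector per recording
    return np.array([centroids.mean(), centroids.std()])

# toy example: a 440 Hz tone mixed with noise
np.random.seed(0)
sr = 22050
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 440.0 * t) + 0.1 * np.random.randn(sr)
features = bag_of_frames_summary(mix, sr)
```

Whatever the specific features, the point is the same: the tone and the noise are summarized jointly, so their individual identities are lost before any retrieval step.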

The main problem this project addresses is the identification and segregation of sound events in 'real-world' polyphonic music signals, including monaural audio signals. The goal is to individually characterize the different sound events that make up the polyphonic mixture, and to use this structured representation to improve the extraction of perceptually relevant information from complex audio and musical mixtures.

The proposed project will follow a Computational Auditory Scene Analysis (CASA) approach to modeling perceptual grouping in music listening [5]. This approach is inspired by current knowledge of how listeners perceive sound events in music signals, be they music notes, harmonic textures, melodic contours, instruments or other types of events [1], and requires a multidisciplinary approach to the problem [6, p. 14]. Although the demanding challenges faced by such CASA approaches still leave their performance quite limited when compared to the human auditory system, some recent results already provide alternative and improved approaches to common sound analysis and MIR applications [T1].
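The normalized-cuts formulation cited in [T1] can be sketched in a few lines: spectral components are treated as graph nodes, similarity encodes a grouping cue, and the second eigenvector of the normalized graph Laplacian partitions the components into sources. The function name `ncut_labels`, the use of frequency proximity as the sole cue, and the toy component frequencies are all illustrative assumptions; the actual system in [T1] combines several cues over time-frequency components.

```python
import numpy as np

def ncut_labels(freqs, sigma=120.0):
    """Two-way normalized-cut partition of spectral components.
    Components close in frequency get high similarity; the sign of
    the second eigenvector of the normalized Laplacian splits the
    similarity graph into two weakly connected groups."""
    f = np.asarray(freqs, dtype=float)
    # Gaussian similarity on frequency proximity (a single grouping
    # cue for illustration; a full system would combine several)
    W = np.exp(-(f[:, None] - f[None, :]) ** 2 / (2 * sigma ** 2))
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(f)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    fiedler = vecs[:, 1]                              # second eigenvector
    return (fiedler > 0).astype(int)

# six sinusoidal components drawn from two sources (toy example):
# three near 440 Hz, three near 880 Hz
labels = ncut_labels([440.0, 442.0, 441.0, 880.0, 879.0, 882.0])
```

On this toy input the two frequency neighborhoods end up with opposite labels, which is the structured, event-level representation the project aims to build on.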

Project Objectives

The overall purpose of this project is to build upon the research results already obtained by the proposed team, which places it in a good position to articulate knowledge from the different disciplines in order to design, implement and validate innovative methodologies and technologies for sound and music analysis using computer systems, namely:
  1. an efficient, extensible and open source CASA software framework for modeling perceptual grouping in music listening, which results in a mid-level, structured and perceptually inspired representation of polyphonic music signals,
  2. software technologies for the visualization, sonification, interaction and evaluation of sound events automatically segregated from polyphonic music signals, 
  3. evaluation datasets for sound segregation in music signals.
In order to pursue these objectives, seven tasks have been planned, including research work on:
  • sound analysis front-ends, new grouping cues and sequential grouping methods that model the perceptual mechanisms involved in human hearing, 
  • new methods for the extraction of descriptors (e.g. pitch, timbre) directly from the mid-level representation of music signals, 
  • design, development and optimization of software modules and framework, 
  • contributions to new approaches for the evaluation of computational sound analysis and segregation systems.
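As an illustration of the descriptor-extraction task above, a pitch descriptor can be sketched with a standard autocorrelation estimator applied to a single segregated event. The function name `autocorrelation_pitch` and its parameters are illustrative assumptions; the project's own methods would operate on the mid-level representation rather than on raw frames.

```python
import numpy as np

def autocorrelation_pitch(signal, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency of a mono frame from the
    peak of its autocorrelation within a plausible lag range
    (generic textbook method, shown for illustration)."""
    ac = np.correlate(signal, signal, mode='full')[len(signal) - 1:]
    lag_min = int(sr / fmax)   # shortest period considered
    lag_max = int(sr / fmin)   # longest period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# toy example: a pure 220 Hz tone (A3)
sr = 22050
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
pitch = autocorrelation_pitch(tone, sr)  # close to 220 Hz
```

The same idea extends to other descriptors (e.g. timbre), the difference being that here the input is assumed to be a single, already segregated sound event rather than the full mixture.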


References

  • [1] Bregman, A. (1990). Auditory Scene Analysis – The Perceptual Organization of Sound. MIT Press.
  • [2] Pachet, F. and Cazaly, D. (2000). A classification of musical genre. In Proc. RIAO Content-Based Multimedia Information Access Conference.
  • [3] Tzanetakis, G. and Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302.
  • [4] Aucouturier, J.-J. and Pachet, F. (2004). Improving timbre similarity: How high’s the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1).
  • [5] Wang, D. and Brown, G. J., editors (2006). Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley-IEEE Press.
  • [6] Scheirer, E. D. (2000). Music-Listening Systems. PhD thesis, Massachusetts Institute of Technology (MIT).
  • [T1] Lagrange, M., Martins, L. G., Murdoch, J., and Tzanetakis, G. (2008). Normalized cuts for predominant melodic source separation. IEEE Transactions on Audio, Speech, and Language Processing, 16(2). Special Issue on MIR.
  • [T2] Martins, L. G. (2009). A Computational Framework for Sound Segregation in Music Signals. PhD thesis, Faculdade de Engenharia da Universidade do Porto (FEUP).
  • [T3] Martins, L. G., Burred, J. J., Tzanetakis, G., and Lagrange, M. (2007). Polyphonic instrument recognition using spectral clustering. In Proc. International Conference on Music Information Retrieval (ISMIR), Vienna, Austria.
  • [T4] Proceedings of the International Symposium on Performance Science 2007, edited by Aaron Williamon and Daniela Coimbra, published by the European Association of Conservatoires (AEC), Utrecht, The Netherlands. ISBN 978-90-9022484-8.
  • [T5] Gouyon, F., Dixon, S., and Widmer, G. (2007). Evaluating low-level features for beat classification and tracking. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).