Project Summary

Automatic Music Analysis

Automatic music analysis is the automated extraction of relevant perceptual information (notes, instruments, etc.) from music files such as MP3s. First attempted in the 1970s at Stanford University [Moorer], it remains an unsolved problem.
The problem is highly multifaceted and interdisciplinary, requiring the extraction of musical notes, instruments, percussion, emotion, etc., and drawing from fields as varied as computer science, mathematics, biology, physics, psychology, and electrical engineering. Its difficulty lies ultimately in the need to reverse-engineer the human brain.


The problem is motivated by several important applications. First, musicologists must transcribe thousands of hours of music by hand; an automatic score transcriber would make this process tremendously more efficient. Second, useful perceptual features can play a key role in efficient music database organization and retrieval, a growing field of interest due to the recent explosion of digital musical data. Third, from a psychoacoustics standpoint, discovering techniques for perceptual analysis may provide insights into the inner workings of the human auditory system. Fourth, automatic music analysis has roles in musical education, where music analysis programs can provide live feedback to music students. Finally, the ability to represent a music file by its salient features makes for efficient manipulation and coding.

Top-Down vs. Bottom-Up Systems

Music analysis is clearly a multifaceted problem, involving note transcription, instrument classification, emotion recognition, etc. Most current systems analyze these facets independently, in a top-down fashion: rather than building on the underlying components, they take ad hoc, heuristic-driven approaches that perform poorly. While it is widely agreed that bottom-up systems are more theoretically sound than top-down ones, they carry the inherent risk that errors in low-level analysis propagate upward through the system. My research has therefore focused on developing sound foundational algorithms to form a solid framework for a bottom-up music analysis system.

My Novel Algorithms

The first foundational algorithm extracts descriptive physical features, namely sinusoids and onsets, from a music signal. Difficulties with current methods range from the Heisenberg uncertainty principle to the diversity of onset types. The novel approach I developed, based on linear programming rather than Fourier analysis, achieves three times the frequency resolution of the Fourier transform over the same analysis frame length. It also achieves 98.1% onset detection accuracy on a benchmark set, dramatically outperforming the best previously published result on this set, 85.5%, which appeared just last year.

The second foundational algorithm is a novel theoretical machine learning mechanism designed to analyze the sinusoidal models extracted by the first. Current machine learning mechanisms lack the theoretical computational power to process sinusoidal models, either because they require a fixed number of input features or because they rest on prior probabilistic assumptions that cannot realistically be made. I generalized sinusoidal models to a higher-level class of inputs: sparse signal representations on complete vector spaces (SSRoCVS). The novel algorithm I developed, called the continuous-weight neural network, has the theoretical computational power to process SSRoCVS where previous machine learning mechanisms do not.
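The summary above does not specify the linear-programming formulation itself. As a hedged illustration of the general idea only (not the author's algorithm), the following sketch uses basis pursuit: representing a frame as a sparse combination of atoms from an overcomplete cosine dictionary by solving a linear program, which can separate two sinusoids closer together than one FFT bin. The sample rate, frame length, frequency grid, and test frequencies are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

# Two sinusoids 4 Hz apart -- below the 10 Hz FFT bin width for this frame
fs, n = 1000, 100                 # assumed sample rate (Hz) and frame length
t = np.arange(n) / fs
f1, f2 = 200.0, 204.0             # FFT resolution is fs/n = 10 Hz
x = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)

# Overcomplete cosine dictionary on a 1 Hz grid (an assumed design choice)
freqs = np.arange(0.0, fs / 2 + 1.0, 1.0)
A = np.cos(2 * np.pi * np.outer(t, freqs))   # shape (n, len(freqs))

# Basis pursuit: minimize ||c||_1 subject to A c = x, posed as an LP
# by splitting c = p - q with p, q >= 0.
k = len(freqs)
res = linprog(
    c=np.ones(2 * k),
    A_eq=np.hstack([A, -A]),
    b_eq=x,
    bounds=[(0, None)] * (2 * k),
    method="highs",
)
coef = res.x[:k] - res.x[k:]      # recovered (sparse) dictionary coefficients
strongest = freqs[np.argmax(np.abs(coef))]
```

A plain 100-point FFT of `x` cannot separate 200 Hz from 204 Hz, while the sparse LP solution concentrates its energy on atoms near the true frequencies; this is the sense in which an LP-based model can exceed Fourier resolution over the same frame.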

Wide Applicability

The problems I solved were inspired by music analysis, but I made no music-specific assumptions in forming my solutions. My work thus has wide applicability outside of music analysis, to nearly any field reliant on Fourier or wavelet analysis, including but not limited to:
  • Speech recognition [McAulay]
  • Image processing [Watson]
  • Electrocardiogram analysis [Haque]
  • Cancer detection [Sahu]
  • Seismology [Xu]
Example 1: Sinusoidal modeling is often used as a denoising technique for electrocardiograms (ECGs). By sharply improving current sinusoidal modeling techniques, my work could recover more accurate estimates of denoised ECG signals. Additionally, since the shape of an ECG signal is important for diagnosing heart conditions, more precise time-varying parameter estimation can lead to improved shape detection and thus more consistent diagnoses. My onset detection algorithm can also be applied to ECG analysis, as a method for tracking arrhythmias: abrupt, short-lived changes in the signal indicative of harmful cardiac conditions. Finally, my novel learning mechanism can be used to process the extracted sinusoidal models for information related to heart disease.
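To make the onset-detection application concrete without claiming the author's method, here is a generic short-time-energy novelty detector (a standard baseline) applied to a synthetic signal containing one abrupt, short-lived burst, loosely analogous to the arrhythmia-tracking use case. The frame and hop sizes and the test signal are illustrative assumptions.

```python
import numpy as np

def onset_novelty(x, frame=64, hop=32):
    """Half-wave-rectified difference of short-time energy (baseline detector)."""
    windows = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    energy = (windows ** 2).sum(axis=1)
    return np.maximum(np.diff(energy), 0.0)

# Synthetic low-noise signal with an abrupt burst starting at sample 500
rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal(1000)
x[500:520] += np.sin(2 * np.pi * 0.2 * np.arange(20))

hop = 32
nov = onset_novelty(x, hop=hop)
onset_sample = (int(np.argmax(nov)) + 1) * hop   # frame where energy jumps
```

The detector localizes the burst to within a frame or two of its true position; a real arrhythmia tracker would of course need far more care (band-limiting, adaptive thresholds, clinical validation), which this sketch does not attempt.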

Example 2: Users of current hearing aid technology often have difficulty separating sound sources [Ellis]. My onset detection technique can segment speech signals into phonemes, while my sinusoidal modeling algorithm can precisely determine the time-varying frequencies present in the speech signal. The harmonic structure of speech can then be exploited alongside these methods to separate multiple-voice settings into individual speakers, improving hearing aid technology.


Novel Foundational Algorithms for Automatic Music Analysis with Wide Applicability in Signal Processing