This project pursues automatic extraction of vocal melodies from recordings of accompanied singing. The extraction is based on a model of vocal pitch likelihood that integrates acoustic-phonetic knowledge and real-world data. For each pitch candidate, the likelihood model evaluates a timbral fitness score as well as the loudness. The timbral fitness is measured on the partial amplitudes of the pitch candidate, with respect to a small set of vocal timbre examples. Because the measurement is pitch-specific, each timbre example first undergoes an acoustic-phonetic pitch modification to match the candidate. In the loudness part of the likelihood model, sinusoids are detected, tracked, and pruned to give loudness values that minimize the interference from the accompaniment. The final pitch estimate is determined by a prior model of pitch sequences in addition to the likelihood model. The extraction is completed by detecting voiced time positions according to the variations in singing voice loudness given by the estimated pitch sequence.
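To illustrate how a per-frame likelihood and a sequence prior can jointly determine a pitch track, here is a minimal sketch using Viterbi decoding. This is a standard technique substituted for illustration; the description above does not state which decoding procedure the project uses. The function name `viterbi_pitch_track`, the `jump_penalty` parameter, and the semitone-distance prior are all hypothetical, and the log-likelihood values are assumed to come from some likelihood model.

```python
import math

def viterbi_pitch_track(log_likelihood, candidates_hz, jump_penalty=1.0):
    """Choose one pitch candidate per frame by maximizing the sum of
    per-frame log-likelihoods plus a log-prior on pitch transitions.

    log_likelihood: list of frames; each frame is a list with one score
                    per pitch candidate (assumed to be given).
    candidates_hz:  candidate pitches in Hz.
    jump_penalty:   hypothetical prior weight, cost per semitone of change.
    """
    n_cand = len(candidates_hz)
    # Express candidates in semitones so the prior depends on the interval.
    semitones = [12.0 * math.log2(f / 440.0) for f in candidates_hz]

    def log_prior(i, j):
        # Assumed prior: penalize the absolute semitone distance of a jump.
        return -jump_penalty * abs(semitones[i] - semitones[j])

    score = list(log_likelihood[0])  # best score ending at each candidate
    backptr = []                     # best predecessor per frame and candidate
    for frame in log_likelihood[1:]:
        new_score, ptr = [], []
        for j in range(n_cand):
            best_i = max(range(n_cand),
                         key=lambda i: score[i] + log_prior(i, j))
            new_score.append(score[best_i] + log_prior(best_i, j) + frame[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)

    # Trace back from the best final candidate to recover the whole path.
    j = max(range(n_cand), key=lambda k: score[k])
    path = [j]
    for ptr in reversed(backptr):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates_hz[k] for k in path]


# Toy input: the middle frame slightly favors the upper octave.
frames = [[0.0, -1.0], [-1.0, 0.0], [0.0, -1.0]]
smooth = viterbi_pitch_track(frames, [220, 440], jump_penalty=1.0)
greedy = viterbi_pitch_track(frames, [220, 440], jump_penalty=0.0)
```

With the prior switched off (`jump_penalty=0.0`), the result is the frame-wise maximum of the likelihood and jumps an octave in the middle frame. With the prior on, the octave jump costs more than the small likelihood gain, so the track stays at 220 Hz.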
For a detailed description, see:
Y.-R. Chien, H.-M. Wang, and S.-K. Jeng, "An acoustic-phonetic model of F0 likelihood for vocal melody extraction," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1457-1468, September 2015.