IJCNN competition - Speech-Music discrimination algorithms in the context of unseen audio data
Organizer of this IJCNN competition:
Aggelos Pikrakis, Eng., PhD.,
Lecturer, Dept. of Informatics, Univ. of Piraeus, Greece
pikrakis@unipi.gr, https://sites.google.com/site/aggelospikrakis/
80 Karaoli & Dimitriou Str., 18534, Piraeus, Greece
Tel: 2104142148, FAX: 2104142264
Category: C (no associated special session is planned; please refer to the IJCNN website for more details).
Starting date: Feb. 1st, 2015.
End date: March 15th, 2015, 23:59 (inclusive; extended deadline)
The aims and objectives of the competition
The goal of the proposed competition is to evaluate the generalization capabilities of Speech-Music
discrimination algorithms in the context of unseen audio data without making any assumptions with
respect to the origin of the signals. The main idea is to first train each competing algorithm on a
small, publicly available dataset and then evaluate the trained algorithms on several hours of
audio recordings from various audio content distribution channels: commercial CDs, audio streams
from video-sharing sites (YouTube) and radio broadcasts over the Internet. Speech-Music
discrimination is here defined as the binary problem of classifying pre-segmented audio data into the
speech or music class. We impose the restriction that each segment is 30 s long and contains either
speech or music; mixed (speech over music) segments are not allowed [1].
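As an informal illustration (not part of the competition specification), the kind of short-term feature that classical discriminators rely on, such as the zero-crossing rate used in [1, 2], can be computed in a few lines. The tone and noise signals below are synthetic stand-ins chosen for this sketch, not competition data:

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose sign differs."""
    return float(np.mean(np.abs(np.diff(np.signbit(x).astype(int)))))

# Synthetic stand-ins: a low-frequency pure tone crosses zero rarely,
# while white noise changes sign roughly every other sample.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)                     # 220 Hz sine, 1 s
noise = np.random.default_rng(0).standard_normal(fs)   # white noise, 1 s

print(zero_crossing_rate(tone))   # low (a few percent)
print(zero_crossing_rate(noise))  # close to 0.5
```

Features of this kind, computed over short frames of each 30 s segment, are what a competing method would feed to its classifier.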
The rules of the competition
1. To enter this competition, a method can be previously published as a journal or conference paper
and must fall in the topics covered by the IJCNN conference or journal, as described in the
respective call for papers (http://www.ijcnn.org/call-for-papers). Unpublished work is also
welcome, provided that it is accompanied by a short description (1-page abstract).
2. Each submitted method must provide an implementation of a training algorithm and an
implementation of a testing function, in MATLAB, Python, or as a Windows/Linux executable.
Other implementation languages are also permitted but a prior arrangement with the proposer of the
competition is needed to ensure that the respective infrastructure is available.
3. The training algorithm will receive as input
(a) a text file, with one pathname per line. Each pathname points to a .WAVE file.
(b) a corresponding text file, with one label per line. Each label can be either 0 (for speech)
or 1 (for music).
The training algorithm is allowed to create temporary files during its operation and it is also allowed
to save its output to a file (to be used by the testing algorithm).
4. The testing algorithm will receive as input
(a) a text file, with one pathname per line. Each pathname points to a .WAVE file of
unknown label.
5. The testing algorithm will provide as output
(a) a corresponding text file, with one label per line. Each label can be either 0 (for speech)
or 1 (for music) and corresponds to the audio type of the respective file in 4.(a).
6. The training algorithm will only use the GTZAN Music/Speech Dataset [9]
(http://marsyasweb.appspot.com/download/data_sets/)
7. Pretrained modules are not allowed.
8. All submitted methods will be evaluated by the proposer (who is not taking part in the
competition) at his home institution (and/or IJCNN organizers if desired) on a dataset consisting of
50 hours of recordings, whose detailed description will be made available to all contestants after the
competition has been completed. The following is a non-exclusive list of audio data sources to be
used: commercial CDs (both compressed and uncompressed), YouTube audio data and radio
broadcasts recorded over the Internet. Each audio file will be 30 s long. The dataset will be
augmented with spectrally distorted and/or time-stretched versions of randomly selected recordings
(mild distortions).
9. The proposer has the right to publish the evaluation results in a research report/paper. The results
of the competition will be made available to the submitters and the public in accordance with the
rules of IJCNN.
10. The starting date for this competition is Feb. 1st, 2015. The end date is March 15th, 2015, 23:59 (inclusive; extended deadline).
11. Rules 1-11 can be made publicly available by the IJCNN organizing committee.
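To make the I/O contract of rules 4-5 concrete, here is a minimal sketch in Python of a testing-function wrapper; the file names and the trivial always-music classifier are illustrative assumptions, not part of the rules:

```python
import os
import tempfile

def run_test(pathlist_file, output_file, classify):
    """Rule 4: read one .WAVE pathname per line; rule 5: write one label
    per line (0 = speech, 1 = music), in the same order as the input."""
    with open(pathlist_file) as f:
        paths = [line.strip() for line in f if line.strip()]
    with open(output_file, "w") as f:
        for path in paths:
            f.write(f"{int(classify(path))}\n")

# Tiny demo with a throwaway file list and a dummy always-music classifier.
demo_dir = tempfile.mkdtemp()
list_path = os.path.join(demo_dir, "files.txt")
labels_path = os.path.join(demo_dir, "labels.txt")
with open(list_path, "w") as f:
    f.write("clip1.wav\nclip2.wav\n")
run_test(list_path, labels_path, classify=lambda wav_path: 1)
print(open(labels_path).read())  # two lines, each "1"
```

A real submission would replace the lambda with a classifier that loads the model saved by the training algorithm (rule 3) and processes each 30 s .WAVE file.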
How the competition will serve the artificial intelligence community/ society
Although the Speech-Music classification problem is now almost two decades old [1] and
several algorithms have been proposed over the years, e.g., [1-8], the question remains whether
the proposed solutions can handle audio streams from diverse sources, or audio streams that
suffer from (mild) spectral and time-stretching distortions. Most published work evaluates the
respective methods on radio broadcasts over the Internet or, more generally, on non-reproducible
audio streams.
Furthermore, to the best of our knowledge, no competition has yet addressed this generalization
issue, although the demand for it is growing, both because of the big-data issues that arise in
the audio analysis context and because of the increasing need for efficient, audio-based
preprocessing and filtering mechanisms that facilitate content-based indexing and retrieval on
audio-sharing sites.
Overall, this competition will permit the machine learning community to gain a better
understanding of what Speech-Music classification algorithms learn and why they may fail to
produce the expected results on certain occasions.
How to enter the competition and how evaluation is performed
To enter the competition, the potential submitter has to
(a) Send an email to the proposer, Aggelos Pikrakis, pikrakis@unipi.gr, to specify the submission
details, e.g., submission code format and communication channel to submit the code (e.g., via
Dropbox or e-mail).
(b) Send the submission in adherence with the “Rules of the competition” section.
The submissions will be evaluated as described in the “Rules of the competition” section.
References
[1] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature
speech/music discriminator,” in Proceedings of IEEE ICASSP, 1997, vol. 2, pp. 1331–1334.
[2] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-
crossings,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, 2005.
[3] A. Pikrakis, T. Giannakopoulos, and S. Theodoridis, “A speech/music discriminator of radio
recordings based on dynamic programming and Bayesian networks,” IEEE Transactions on Multimedia,
vol. 10, no. 5, 2008.
[4] J. Schlüter and R. Sonnleitner, “Unsupervised feature learning for speech and music
detection in radio broadcasts,” in Proceedings of DAFx, 2012.
[5] A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,”
IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[7] W.-H. Tsai and C.-H. Ma, “Speech and singing discrimination for audio data indexing,” in
IEEE International Congress on Big Data (BigData Congress), 2014, pp. 276–280.
[8] A. Pikrakis and S. Theodoridis, “Speech-music discrimination: A deep learning perspective,”
in Proceedings of EUSIPCO, Lisbon, Portugal, 2014.
[9] G. Tzanetakis and P. Cook, “MARSYAS: A framework for audio analysis,” Organised Sound,
vol. 4, no. 3, 2000.