IJCNN competition - Speech-Music discrimination algorithms in the context of unseen audio data

Organizer of this IJCNN competition:

Aggelos Pikrakis, Eng., PhD.,

Lecturer, Dept. of Informatics, Univ. of Piraeus, Greece

pikrakis@unipi.gr, https://sites.google.com/site/aggelospikrakis/

80 Karaoli & Dimitriou Str., 18534, Piraeus, Greece

Tel: 2104142148, FAX: 2104142264

Category: C (no plan to organize an associated special session; please refer to the IJCNN website for more details).

Starting date: Feb. 1st, 2015.

End date: March 15th, 2015, 23:59 (inclusive; extended deadline)

The aims and objectives of the competition

The goal of the proposed competition is to evaluate the generalization capabilities of Speech-Music discrimination algorithms on unseen audio data, without making any assumptions about the origin of the signals. The main idea is to first train each competing algorithm on a small, publicly available dataset and then evaluate the trained algorithms on several hours of audio recordings drawn from various audio content distribution channels: commercial CDs, audio streams from video-sharing sites (YouTube) and radio broadcasts over the Internet. Speech-Music discrimination is here defined as the binary problem of classifying pre-segmented audio data into the speech or music class. We impose the restriction that each segment is 30 s long and contains either speech or music; mixed (speech over music) segments are not allowed [1].

The rules of the competition

1. To enter this competition, a method may have been previously published as a journal or conference paper and must fall within the topics covered by the IJCNN conference or journal, as described in the respective call for papers (http://www.ijcnn.org/call-for-papers). Unpublished work is also welcome, provided that it is accompanied by a short description (1-page abstract).

2. Each submitted method must provide an implementation of a training algorithm and an implementation of a testing function, in MATLAB, Python, or as a Windows/Linux executable. Other implementation languages are also permitted, but a prior arrangement with the proposer of the competition is needed to ensure that the respective infrastructure is available.

3. The training algorithm will receive as input
(a) a text file, with one pathname per line. Each pathname points to a .WAVE file.
(b) a corresponding text file, with one label per line. Each label is either 0 (for speech) or 1 (for music).
The training algorithm is allowed to create temporary files during its operation and is also allowed to save its output to a file (to be used by the testing algorithm).

4. The testing algorithm will receive as input
(a) a text file, with one pathname per line. Each pathname points to a .WAVE file of unknown label.

5. The testing algorithm will provide as output
(a) a corresponding text file, with one label per line. Each label is either 0 (for speech) or 1 (for music) and corresponds to the audio type of the respective file in 4.(a). A minimal illustration of this input/output contract is sketched after this list of rules.

6. The training algorithm will only use the GTZAN Music/Speech Dataset [9] (http://marsyasweb.appspot.com/download/data_sets/).

7. Pretrained modules are not allowed.

8. All submitted methods will be evaluated by the proposer (who is not taking part in the competition) at his home institution (and/or by the IJCNN organizers, if desired) on a dataset consisting of 50 hours of recordings, whose detailed description will be made available to all contestants after the competition has been completed. The following is a non-exclusive list of audio data sources to be used: commercial CDs (both compressed and uncompressed), YouTube audio data and radio broadcasts recorded over the Internet. Each audio file will be 30 s long. The dataset will be augmented with spectrally distorted and/or time-stretched versions of randomly selected recordings (mild distortions).

9. The proposer has the right to publish the evaluation results in a research report/paper. The results of the competition will be made available to the submitters and the public in accordance with the rules of IJCNN.

10. The starting date for this competition is Feb. 1st, 2015. The end date is March 15th, 2015, 23:59 (inclusive; extended from the original deadline of Feb. 28th, 2015).

11. Rules 1-11 can be made publicly available by the IJCNN organizing committee.
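
For concreteness, the following short Python sketch illustrates one way the training/testing interface of rules 3-5 could be wired up. It is a minimal, hedged example only: the zero-crossing-rate feature and the nearest-mean "classifier" are placeholder choices made purely for illustration, the script and model file names are hypothetical, and nothing in it is a prescribed or competitive implementation.

# sketch.py -- minimal illustration of the I/O contract of rules 3-5
# (standard library only; assumes 16-bit PCM .wav files; hypothetical file names).
import sys
import wave
import struct

def zero_crossing_rate(wav_path):
    # Average zero-crossing rate of the first channel of a 16-bit PCM .wav file.
    with wave.open(wav_path, 'rb') as w:
        n_channels = w.getnchannels()
        raw = w.readframes(w.getnframes())
    samples = struct.unpack('<%dh' % (len(raw) // 2), raw)[::n_channels]
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / float(max(len(samples) - 1, 1))

def train(wav_list_file, label_file, model_file):
    # Rule 3: one pathname per line, one 0/1 label per line; the "model" saved
    # here is just the per-class mean feature (a deliberately naive training step).
    paths = [line.strip() for line in open(wav_list_file) if line.strip()]
    labels = [int(line) for line in open(label_file) if line.strip()]
    feats = [zero_crossing_rate(p) for p in paths]
    mean_speech = sum(f for f, l in zip(feats, labels) if l == 0) / max(labels.count(0), 1)
    mean_music = sum(f for f, l in zip(feats, labels) if l == 1) / max(labels.count(1), 1)
    with open(model_file, 'w') as f:
        f.write('%f %f\n' % (mean_speech, mean_music))

def test(wav_list_file, model_file, output_file):
    # Rules 4-5: read pathnames of unlabeled files and write one 0/1 label per line,
    # assigning each file to the class with the nearest mean feature.
    mean_speech, mean_music = map(float, open(model_file).read().split())
    paths = [line.strip() for line in open(wav_list_file) if line.strip()]
    with open(output_file, 'w') as f:
        for p in paths:
            feat = zero_crossing_rate(p)
            f.write('%d\n' % (0 if abs(feat - mean_speech) <= abs(feat - mean_music) else 1))

if __name__ == '__main__':
    # e.g. python sketch.py train wavs.txt labels.txt model.txt
    #      python sketch.py test wavs.txt model.txt output.txt
    if sys.argv[1] == 'train':
        train(sys.argv[2], sys.argv[3], sys.argv[4])
    else:
        test(sys.argv[2], sys.argv[3], sys.argv[4])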

How the competition will serve the artificial intelligence community/society

Although the Speech-Music classification problem is already almost two decades old [1] and several algorithms have been proposed over the years, e.g., [1-8], the question remains whether the proposed solutions can handle audio streams from diverse sources or audio streams that suffer from (mild) spectral and time-stretching distortions. Most published work focuses on evaluating the respective methods on radio broadcasts over the Internet or, more generally, on non-reproducible audio streams.

Furthermore, to the best of our knowledge, no competition has yet addressed this generalization issue, although the issue is of increasing importance due to the big-data challenges of the audio analysis context and the growing need for efficient, audio-based preprocessing and filtering mechanisms that facilitate content-based indexing and retrieval on audio-sharing sites.

Overall, this competition will permit the machine learning community to gain a better understanding of what Speech-Music classification algorithms learn and why they may fail to produce the expected results on certain occasions.

How to enter the competition and how evaluation is performed

To enter the competition, the potential submitter has to
(a) Send an email to the proposer, Aggelos Pikrakis, pikrakis@unipi.gr, to specify the submission details, e.g., the format of the submitted code and the communication channel for submitting it (e.g., via Dropbox or e-mail).
(b) Send the submission in adherence with the “Rules of the competition” section.

The submissions will be evaluated as described in the “Rules of the competition” section.
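
The call does not specify the exact evaluation metric, so, purely as an illustration, the following Python lines show how plain classification accuracy could be computed by comparing ground-truth labels with the label file produced per rule 5 (both the metric and the file names are assumptions here, not part of the rules):

# Assumes plain classification accuracy; the actual metric is not stated in the call.
def accuracy(ground_truth_file, predicted_file):
    # Fraction of lines on which the 0/1 labels in the two files agree.
    truth = [int(line) for line in open(ground_truth_file) if line.strip()]
    pred = [int(line) for line in open(predicted_file) if line.strip()]
    assert len(truth) == len(pred), "label files must have the same number of lines"
    return sum(t == p for t, p in zip(truth, pred)) / float(len(truth))

# e.g. accuracy('eval_labels.txt', 'output.txt')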

References

[1] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proceedings of IEEE ICASSP, 1997, vol. 2, pp. 1331–1334.

[2] C. Panagiotakis and G. Tziritas, “A speech/music discriminator based on RMS and zero-crossings,” IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, 2005.

[3] A. Pikrakis, T. Giannakopoulos, and S. Theodoridis, “A speech/music discriminator of radio recordings based on dynamic programming and Bayesian networks,” IEEE Transactions on Multimedia, vol. 10, no. 5, 2008.

[4] J. Schlüter and R. Sonnleitner, “Unsupervised feature learning for speech and music detection in radio broadcasts,” in Proceedings of DAFx, 2012.

[5] A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14–22, 2012.

[7] W.-H. Tsai and C.-H. Ma, “Speech and singing discrimination for audio data indexing,” in Proceedings of the 2014 IEEE International Congress on Big Data (BigData Congress), pp. 276–280, 2014.

[8] A. Pikrakis and S. Theodoridis, “Speech-Music discrimination: A deep learning perspective,” in Proceedings of EUSIPCO 2014, Lisbon, Portugal.

[9] G. Tzanetakis and P. Cook, “Marsyas: A framework for audio analysis,” Organised Sound, vol. 4, no. 3, 2000.