Datasets

Building Corpora and Datasets Discussion

14:30-15:30, Wednesday 15th October

8 people in attendance

Datasets are culture-specific and topic-specific.

Dataset exchange is still not happening in any structured or cohesive way.

Geoffroy Peeters and Karën Fort paper:

Towards a (better) definition of the description of annotated MIR corpora

http://recherche.ircam.fr/equipes/analyse-synthese/peeters/ARTICLES/Peeters_2012_ISMIR_Annotation.pdf

This addresses: What does it mean to set up and publish a dataset for the research community? What is the minimum metadata you should provide with the dataset?

It’s hard to obtain full legal clearance for commercial recordings.

CompMusic has about 100 CDs - they can make streaming audio available via Dunya. Users of Dunya consent to being evaluators of the system in order to listen to the content.

Full access to the audio doesn’t happen this way.

Do all MIR researchers really need the audio? Why not simply use features? Isn’t that enough?

Pop/Western music is very complicated from a legal perspective. Universities don’t want to be legally liable for potential consequences.

In CompMusic context, copyright clearance can be obtained from artists directly by signing a consent form.

Could we find an incentive for artists/labels to work with us?

MusicBrainz model: have a centralised metadata hub that everyone, both commercial and non-commercial, can benefit from. If you’re not there, you don’t exist.

Distinction between corpus and dataset.

How can we control datasets so that systems can be reliably trained and validated?

What are the strategies? Top-down and bottom-up?

Principled approach to sampling, e.g. SALAMI.
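
One way to read "principled sampling" is stratified sampling: fix the strata (e.g. genre) in advance and draw a set number of tracks from each, so the dataset is not dominated by whatever is easiest to collect. The following is a minimal sketch under that assumption. The catalogue, the function name stratified_sample and the quota are hypothetical and are not taken from SALAMI or any existing tool.

```python
import random
from collections import defaultdict


def stratified_sample(catalogue, per_stratum, seed=0):
    """Draw up to `per_stratum` track ids from each stratum (here: genre)."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    by_stratum = defaultdict(list)
    for track_id, stratum in catalogue:
        by_stratum[stratum].append(track_id)

    sample = {}
    for stratum in sorted(by_stratum):
        tracks = sorted(by_stratum[stratum])  # stable order before sampling
        k = min(per_stratum, len(tracks))     # small strata contribute all they have
        sample[stratum] = sorted(rng.sample(tracks, k))
    return sample


# Hypothetical catalogue of (track_id, genre) pairs, for illustration only.
catalogue = [
    ("t01", "jazz"), ("t02", "jazz"), ("t03", "jazz"),
    ("t04", "pop"), ("t05", "pop"), ("t06", "pop"),
    ("t07", "makam"),
]

sample = stratified_sample(catalogue, per_stratum=2)
for genre, tracks in sample.items():
    print(genre, tracks)
```

Recording the seed and the quota alongside the dataset would let others reproduce the selection and see which strata were under-filled.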

There should always be a paper to accompany a dataset.

The dataset should be attached to a problem.

There should be a methodology for dataset compilation. How can we validate that it’s fit for purpose?

A formal process: evaluation of the assumptions behind a dataset.

Annotations can be version-controlled.

NYU have a new JSON format for storing annotations.

It can cope with annotations for multiple tasks, and with multiple annotations per task.

It comes with a set of Python tools. There are currently memory problems in porting this to Matlab.

It can include richer information about annotations, e.g. how they were made.
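
To make the idea concrete, here is a minimal sketch of what a JSON annotation record with these properties might look like. It is not the NYU format itself: the field names (file_metadata, annotations, task, annotator, method, data) and the helper functions are assumptions made up for illustration. It shows several tasks and several annotators stored for one track, with a note on how each annotation was made.

```python
import json


def make_annotation(task, annotator, method, data):
    """Bundle one annotation with provenance (who made it, and how)."""
    return {"task": task, "annotator": annotator, "method": method, "data": data}


# Hypothetical record for a single track; all values are invented.
record = {
    "file_metadata": {"title": "Example track", "duration": 180.0},
    "annotations": [
        make_annotation("beat", "annotator_A", "manual tapping",
                        [{"time": 0.5}, {"time": 1.0}]),
        make_annotation("beat", "annotator_B", "automatic, then corrected",
                        [{"time": 0.52}, {"time": 1.01}]),
        make_annotation("chord", "annotator_A", "manual",
                        [{"time": 0.0, "duration": 2.0, "value": "C:maj"}]),
    ],
}


def annotations_for(rec, task):
    """Return every annotation in the record for the given task."""
    return [a for a in rec["annotations"] if a["task"] == task]


# Sorted keys and indentation give stable, line-oriented text,
# which keeps diffs readable when the file is under version control.
text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(text)

print(len(annotations_for(restored, "beat")), "beat annotations")
```

Because the record is plain text, it can be kept in the same version control repository as the paper and code that accompany the dataset.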

We don’t want to discourage researchers from making datasets.