Evaluation

First part of the discussion, on rhythm similarity:

[...]

It is hard to decouple rhythm similarity from genre and sub-genre. Perceptual studies may help with that.

Second part of the discussion, on beat and downbeat tracking:

It may be advantageous to define the purpose of your method first, in order to choose the appropriate evaluation measure. For example, do you want to use your beat tracker by itself or as an intermediate step for something else?

However, you do not always know the purpose, or you want to stay as generic as possible.

The best evaluation would probably be to assess the subjective response of a committee of listeners to your system. That is hard to do in practice.

1) The most practical way is to use a fixed evaluation measure defined by heuristics (the F-measure, for example), computed between your system's output and a single ground truth. This is what is done in most cases right now.
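As a rough illustration of such a heuristic measure, here is a minimal sketch of a beat-tracking F-measure. The function name, the greedy one-to-one matching and the 70 ms tolerance window are assumptions made for this example (70 ms is a commonly used value, but it is exactly the kind of heuristic choice discussed here), not a reference implementation.

```python
def beat_f_measure(estimated, reference, tolerance=0.07):
    """F-measure between two lists of beat times (in seconds).

    An estimated beat counts as a hit if it lies within `tolerance`
    seconds of a reference beat; each estimate is used at most once.
    The tolerance value is an assumption of this sketch.
    """
    estimated = sorted(estimated)
    reference = sorted(reference)
    if not estimated or not reference:
        return 0.0

    used = [False] * len(estimated)
    hits = 0
    for ref in reference:
        # Greedy matching: take the closest unused estimate in the window.
        best_index, best_distance = None, tolerance
        for i, est in enumerate(estimated):
            if used[i]:
                continue
            distance = abs(est - ref)
            if distance <= best_distance:
                best_index, best_distance = i, distance
        if best_index is not None:
            used[best_index] = True
            hits += 1

    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy beat times, invented for illustration only.
reference = [0.5, 1.0, 1.5, 2.0]
estimated = [0.52, 1.0, 1.7, 2.01]
print(beat_f_measure(estimated, reference))
# A tracker locked to every other beat is penalised on recall only.
print(beat_f_measure([0.5, 1.5], reference))
```

The second call shows why the choice of a single ground truth matters: an output at half the annotated tempo is musically plausible, yet it scores well below 1.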

2) However, how can we choose the single ground truth? The perception of beat and downbeat is subjective, and there is often not one single answer. Annotations need to be coherent at the song level. So instead of annotating one interpretation, we could annotate several meaningful ones. It takes more time to annotate songs like that, but it is still feasible in practice. Several annotators could be included in the process for robustness.

-> For that purpose, it could be useful to share the annotations and allow edits more easily (with GitHub?)
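A minimal sketch of how a system could be scored against several annotations of the same song follows. The function names, the dictionary layout and the simple stand-in measure are hypothetical choices for this example; any evaluation measure could be passed in instead, and keeping the best score is only one possible way to combine the results.

```python
def fraction_matched(estimated, reference, tolerance=0.07):
    """Stand-in measure (an assumption of this sketch): the share of
    reference beats that have an estimated beat within `tolerance` seconds."""
    if not reference:
        return 0.0
    matched = sum(
        1
        for ref in reference
        if any(abs(est - ref) <= tolerance for est in estimated)
    )
    return matched / len(reference)


def score_against_annotations(estimated, annotations, measure=fraction_matched):
    """Score one output against every annotation of a song.

    `annotations` maps a label (for example a metrical level or an
    annotator name) to a list of beat times in seconds.
    Returns the label with the highest score and all individual scores.
    """
    scores = {label: measure(estimated, ref) for label, ref in annotations.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy data, invented for illustration: two coherent interpretations of one song.
annotations = {
    "quarter_note": [0.5, 1.0, 1.5, 2.0],
    "half_note": [0.5, 1.5],
}
estimated = [0.51, 1.49]

best_label, scores = score_against_annotations(estimated, annotations)
print(best_label, scores)
```

Keeping the per-annotation scores, rather than only the maximum, makes it possible to see which interpretation the system followed.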

3) But how can we choose a proper evaluation measure? Ideally, it would be an evaluation measure that matches listeners' perception.

It can be the evaluation measure that correlates best with listeners' perception. Matthew and Sebastian worked on that recently: "M. E. P. Davies and S. Böck. Evaluating the Evaluation Measures for Beat Tracking. To appear in Proc. of ISMIR 2014, Taipei, Taiwan, October, 2014."

We would like to extend this approach by:

- selecting a committee of beat and downbeat trackers

- selecting a corpus that is representative of different styles, tempi, rhythms, expressiveness, meters, changes in tempo/time signature, accompaniment, onset types, sound quality...

- carrying out experiments with an appropriate number of listeners (including listeners familiar with the concept of downbeat, for example) and an appropriate evaluation process (length of the excerpts, rating from 1 to 5?)

- finding relevant evaluation measures or annotations (for example, an appropriate window shape at the song or the section level, one that depends on the tempo, the style...)

- finding the one that has the best correlation with the listeners' responses
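The last step of this plan can be sketched as follows, assuming each excerpt has one mean listener rating and one score per candidate measure. Spearman rank correlation is used here as an assumption, since ratings on a 1 to 5 scale are ordinal; the function names and the numbers are hypothetical and serve only to show the mechanics.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_measures(measure_scores, listener_ratings):
    """Sort candidate measures by their correlation with listener ratings."""
    correlations = {
        name: spearman(scores, listener_ratings)
        for name, scores in measure_scores.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


# Invented numbers: one value per excerpt, in the same excerpt order.
listener_ratings = [1, 2, 3, 4, 5]
measure_scores = {
    "measure_a": [0.1, 0.3, 0.5, 0.7, 0.9],
    "measure_b": [0.9, 0.2, 0.6, 0.4, 0.5],
}

ranking = rank_measures(measure_scores, listener_ratings)
print(ranking)
```

With real data, the same comparison could be run per style or per tempo range to check whether one measure tracks the listeners' responses consistently.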