SOUNDS OUT OF PLÄCE? SCORE-INDEPENDENT DETECTION OF CONSPICUOUS MISTAKES IN PIANO PERFORMANCES

Additional Content

The purpose of this companion page is to:
1) Show a few labelled examples of PF, SR and Synthetic data.
2) Show examples of our model predictions.
3) Provide the BM data under a CC BY-NC-SA license.

PF and SR Data Examples

To illustrate what is meant by a conspicuous error, we share a few audio snippets of labelled mistake regions from our gathered data.

PF Example 1

section without error label

first-030-036.wav

section with error label

first-151-200..wav

PF Example 2

section without error label

second-215-222.wav

section with error label

second-225-233.wav

PF Example 3

section without error label

third-055-103.wav

section with error label

third-103-118.wav

SR Example 1

section without error label

sr-1-002-009.wav

section with error label

sr-1-10-18.wav

SR Example 2

section without error label

sr-2.wav

section with error label

sr-2-29-36.wav

Synthetic Mistake Examples:

Model Predictions

We show a few piano-roll examples of the predicted labels for our 5 selected models (Baseline, SYNTH, SYNTH-FT, AE, AE-SYNTH). The topmost row is the ground truth label. On the right, we provide a description of the error mode encountered.

BM subset (Evaluation set)

b-07-annot.mid

b-07-fragment.wav

False positive for inconspicuous Pitch Insertion

Before frame 1450, an inconspicuous pitch insertion is detected by the SYNTH, SYNTH-FT, and AE-SYNTH models. This is plausible because this error mode is heavily represented in the synthetic mistakes. This kind of pitch insertion was detected by the score follower system.

b-07-annot-pitchinsertion-before1450.wav

Correct Prediction of Missed Notes

A missing note in a locally consistent rhythmic pattern is estimated as an error. This occurs twice between frames 1480 and 1570; however, only the second occurrence is detected by most models.



b-10-annot.mid

b-10-annot-from-15s.wav

Strange rhythm is estimated as an error (frame 250-280)

In this example, there are two labelled portions separated by a very short gap. This annotation is sensible because they sound like two separate mistakes. However, this is not always the case, and sometimes it is not clear when an error starts and ends.

All models (except AE) predicted two separately labelled regions.
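To make the notion of "separately labelled regions" concrete, the following is a minimal sketch of how frame-wise binary labels could be grouped into regions, and how two closely spaced regions merge into one once a gap tolerance is allowed. The function names, the `max_gap` parameter, and the frame values are hypothetical and chosen for illustration only; this is not the code used to produce the predictions shown here.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open frame interval [start, end)


def frames_to_regions(labels: List[int]) -> List[Region]:
    """Group consecutive non-zero frames into (start, end) regions."""
    regions: List[Region] = []
    start = None
    for i, value in enumerate(labels):
        if value and start is None:
            start = i  # a new region begins
        elif not value and start is not None:
            regions.append((start, i))  # the region ended at the previous frame
            start = None
    if start is not None:
        regions.append((start, len(labels)))  # region runs to the last frame
    return regions


def merge_close(regions: List[Region], max_gap: int) -> List[Region]:
    """Merge neighbouring regions whose gap is at most max_gap frames.

    Assumes regions are sorted by start frame and do not overlap.
    """
    merged: List[Region] = []
    for start, end in regions:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Made-up labels: two mistakes separated by a 5-frame gap.
labels = [0] * 250 + [1] * 10 + [0] * 5 + [1] * 15 + [0] * 20
regions = frames_to_regions(labels)   # [(250, 260), (265, 280)]
print(merge_close(regions, max_gap=3))  # stays as two regions
print(merge_close(regions, max_gap=5))  # [(250, 280)]
```

Whether the two portions count as one mistake or two therefore depends on a tolerance that has to be chosen by whoever post-processes the labels.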

b-02-annot.mid

b-02-annot.wav

Hitting adjacent keys is estimated as an error (frame 150)

b-02-annot-pitchinsertion.wav

However, it is not clear why the baseline model consistently predicts short mistakes between frames 50 and 100.

b-05-annot-15-29.wav

Abrupt silence (potentially a hesitation) during a run is estimated as an error (frame 400)

b-05-annot-15s.wav

More Failure Modes (False Positives)

Around frame 150 - as the flipside of detecting erroneous pauses in a performance, the system sometimes mistakes notated musical pauses for errors.

Around frame 1700 - the right hand contains three repetitions of a motif (ascending thirds). Some trained models consider this an error, presumably because a common error mode in beginner pianists is to pause and repeat parts where a mistake has been made. Humans presumably distinguish composed repetition from erroneous performance by looking at a longer musical context and the metric structure.

Around frame 120-200 - a climactic held chord is detected as an error, presumably because novice pianists tend to hold the pressed keys when sight-reading.

Around frame 2300 - this piece contains graceful ornaments (appoggiaturas), which are often mistaken for errors. This presumably happens because hitting adjacent keys is a common error mode, so it is difficult to distinguish intentional playing of adjacent keys (ornaments) from mistakes.

Observed Patterns in Model Predictions


Open Questions

Can there ever be a consensus on the 'correct' span of a label? When does a mistake start or end, especially when the mistake is the absence of a note?
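One way to see why the span question matters is to compare a frame-level score with an event-level one for the same prediction. The sketch below is a hypothetical illustration, not the evaluation used in the paper: the helper names, the "any overlap counts as a hit" rule, and the region values are all assumptions.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open frame interval [start, end)


def overlap(a: Region, b: Region) -> int:
    """Number of frames shared by two regions."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def frame_precision_recall(truth: List[Region], pred: List[Region]) -> Tuple[float, float]:
    """Frame-level precision and recall.

    Assumes the regions inside each list do not overlap one another.
    """
    true_positive = sum(overlap(t, p) for t in truth for p in pred)
    n_pred = sum(end - start for start, end in pred)
    n_truth = sum(end - start for start, end in truth)
    precision = true_positive / n_pred if n_pred else 0.0
    recall = true_positive / n_truth if n_truth else 0.0
    return precision, recall


def event_recall(truth: List[Region], pred: List[Region]) -> float:
    """Fraction of labelled regions touched by at least one predicted region."""
    if not truth:
        return 0.0
    hits = sum(any(overlap(t, p) > 0 for p in pred) for t in truth)
    return hits / len(truth)


# Made-up example: the prediction finds the mistake but disagrees on its span.
truth = [(100, 120)]
pred = [(105, 130)]
print(frame_precision_recall(truth, pred))  # (0.6, 0.75)
print(event_recall(truth, pred))            # 1.0
```

Under these assumptions the same prediction is a perfect hit at the event level but is penalised at the frame level, purely because the annotated and predicted spans differ.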


BM Data