Multi-f0 MIDI-to-MIDI Similarity Evaluation for Piano Transcription Model Testing (2022)
In collaboration with Alex Chuk
Final project for Music Information Retrieval (Fall 2022)
This project builds a pipeline for evaluating the piano transcription model proposed in the following paper:
Hawthorne, C., Elsen, E., Song, J., Roberts, A., Simon, I., Raffel, C., ... & Eck, D. (2017). Onsets and frames: Dual-objective piano transcription. arXiv preprint arXiv:1710.11153.
using mir_eval.multipitch, a multi-f0, MIDI-to-MIDI comparison metric function.
The test dataset is composed of 42 audio clips recorded with 3 different microphones, performed by 2 players on a Yamaha Disklavier, which outputs ground-truth (reference) MIDI files perfectly aligned with the actual performance. MIDI estimations are generated from the test audio samples with Piano-Scribe, a web-application implementation of the paper.
By calculating the multi-f0 accuracy scores for each reference-estimation pair, we analyze the likely causes of MIDI mismatch and provide insight into improving the robustness of future transcription models and evaluation metrics.
KEYWORDS:
Piano Transcription; MIDI; Multi-f0; Music Information Retrieval (MIR)
The testing dataset consists of 42 audio samples, either pristine or processed recordings, along with ground-truth MIDI data perfectly in sync with the recordings.
The following 3 excerpts, chosen from different time periods and artistic styles in the classical piano solo repertoire, are performed by 2 players:
J. S. Bach: Prelude in C (mm. 1 - 11), labeled Bach
Debussy: Arabesque No.2 (mm. 5 - 14), labeled Debussy1
Debussy: Arabesque No.2 (mm. 28 - 36), labeled Debussy2
One excerpt from an original piano composition is performed by 1 player to test whether the model performs significantly better on classical excerpts, which may have been included in the transcription model's training set:
Mao: Gingko Leaves (mm. 1 - 17), labeled Leaves
All of the 7 performances are recorded with 3 studio microphones:
Spot: Sennheiser MKH 8040
Near-field: Shure SM58, Neumann U87
All of the 21 studio recordings are further augmented with pink noise via internal recording in Pro Tools, summing up to a 42-track testing set (21 original, 21 noised).
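The noising itself was done inside Pro Tools, but the same augmentation can be sketched programmatically. The snippet below is an illustrative sketch only, assuming a mono WAV input, the soundfile library for I/O, and a hypothetical snr_db parameter (the actual noise level used in the Pro Tools session is not specified here):

```python
import numpy as np
import soundfile as sf

def add_pink_noise(in_path, out_path, snr_db=30.0):
    """Mix pink (1/f) noise into a mono recording at a given SNR (sketch)."""
    audio, sr = sf.read(in_path)          # assumes a mono file
    n = len(audio)
    # Shape white noise to a 1/f power spectrum in the frequency domain.
    spectrum = np.fft.rfft(np.random.randn(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                           # avoid division by zero at DC
    pink = np.fft.irfft(spectrum / np.sqrt(f), n=n)
    # Scale the noise so signal power / noise power matches snr_db.
    gain = np.sqrt(np.mean(audio ** 2) / np.mean(pink ** 2)) * 10 ** (-snr_db / 20)
    sf.write(out_path, audio + gain * pink, sr)
```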
To compare a MIDI estimation generated by Piano-Scribe against its reference (ground-truth) MIDI with mir_eval.multipitch, both MIDI files are parsed by the parse_midi() function into a timebase and a corresponding list of arrays of frequency estimates.
A shared timebase is generated from the user-defined parameter fps, slicing the timeline into equal frames of length 1/fps seconds; the starting time of each frame serves as that frame's timebase entry. parse_midi() then runs through all the MIDI notes in a reference/estimation MIDI file and stores the frequency values of the notes present in each time frame.
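The project's actual parse_midi() is not reproduced here; the following is a minimal sketch of the representation it produces, assuming pretty_midi for note parsing:

```python
import numpy as np
import pretty_midi

def parse_midi(path, fps=20):
    """Sketch: convert a MIDI file into the (times, freqs) representation
    expected by mir_eval.multipitch."""
    pm = pretty_midi.PrettyMIDI(path)
    times = np.arange(0.0, pm.get_end_time(), 1.0 / fps)  # frame start times
    freqs = [[] for _ in times]
    for inst in pm.instruments:
        for note in inst.notes:
            # Mark the note's fundamental as active in every frame it overlaps.
            first = int(np.floor(note.start * fps))
            last = min(int(np.ceil(note.end * fps)), len(times))
            for i in range(first, last):
                freqs[i].append(pretty_midi.note_number_to_hz(note.pitch))
    return times, [np.array(f) for f in freqs]
```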
All metrics are computed with mir_eval.multipitch.metrics(), which accumulates counts across all time frames to yield macro-level scores. Precision, recall, and accuracy scores are compared for analysis.
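A hedged usage example (file names are placeholders; the first three values returned by mir_eval.multipitch.metrics() are precision, recall, and accuracy):

```python
import mir_eval

ref_time, ref_freqs = parse_midi("reference.mid", fps=20)
est_time, est_freqs = parse_midi("estimation.mid", fps=20)

# mir_eval resamples the estimate onto the reference timebase if needed.
precision, recall, accuracy = mir_eval.multipitch.metrics(
    ref_time, ref_freqs, est_time, est_freqs)[:3]
print(f"P={precision:.3f}  R={recall:.3f}  Acc={accuracy:.3f}")
```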
Compare audio quality: recordings with added pink noise result in higher accuracy scores
Compare pieces: Bach and Debussy2, the excerpts with fewer tempo fluctuations and fewer ornaments, yield more accurate estimations than Debussy1 and Leaves
Compare microphones: spot-microphone (Sennheiser MKH 8040) recordings generally yield more accurate estimations than near-field microphones (Shure SM58, Neumann U87)
Compare players: Jiawen scores higher in overall accuracy, while Alex is more consistent in quality
One major factor affecting estimation accuracy is the amount of reverb in the recording. Piano-Scribe interprets a pedaled note as an elongated note instead of estimating a sustain-pedal (CC 64) message, which decreases similarity. Adding the pink noise effectively raises the noise floor and masks some of the pedaled reverb, resulting in shorter note estimations. Closer microphone positions also capture less reverb.
A flexible parsing frame rate is recommended for more detailed analysis: coarse-grained parsing (smaller fps) for metrics at the macro level, and fine-grained time frames (larger fps) to examine specific note clusters, such as ornaments (e.g., trills); see the sketch below.
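Using the hypothetical parse_midi() sketch above, varying the frame rate is a one-line change (the fps values here are illustrative, not the project's actual settings):

```python
# Coarse frames for macro-level scores, fine frames for ornament clusters.
for fps in (10, 100):
    ref_t, ref_f = parse_midi("reference.mid", fps=fps)
    est_t, est_f = parse_midi("estimation.mid", fps=fps)
    accuracy = mir_eval.multipitch.metrics(ref_t, ref_f, est_t, est_f)[2]
    print(f"fps={fps:>3}: accuracy={accuracy:.3f}")
```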