88 x 127: Deconstructing a Sample Library
Music and Speech Modelling 2018
Abstract
It is difficult to get a life-like performance from sample-based virtual instruments without expert tweaking of parameters. This project produced a tool which shows the user the amplitude and pitch variations of every sample, and suggests the best sample for a given note, so that a musical performance can be reproduced or created more accurately.
Introduction
Sample-based virtual instruments (VIs) work by having a corpus of thousands of recordings of a live instrument, which are triggered (and combined and looped) on receipt of a MIDI note-on message. The VI chooses the best sample to play for the given note number and velocity value, and then plays the sample (or combination of samples) until it receives a corresponding note-off message. Other MIDI controller messages can be used to control certain aspects of the VI output, such as vibrato, timbre and articulations.
In order to best reproduce (or create) a musical performance, the current best practice is for an expert to manually set the values (both settings within the instrument and within the note input) based on knowledge of the original instrument and of the method of simulation. Music notation software also has algorithms for setting these values based on the musical input.
In this project, I have made a software tool to automatically catalogue and quantify the full range of the output of such a sample-based VI, and also shown how the data can be used to aid the reproduction of a live performance. In order to reduce the difficulty of signal analysis, I decided to use a VI of a single instrument, namely the fretless bass guitar. This was due to the percussive and predictable nature of the sound, the expressive possibilities of a continuous pitch spectrum, and my own familiarity with the instrument.
Previous Work
There has been much previous work in this research field, some of which was consulted to inform this project and avoid duplication. Dannenberg and Derenyi's 1998 paper [1] was a seminal work on deriving data from musical performances, particularly on applying neural networks to classify and predict note transitions in wind and brass performances. In his work from 2014 [2], Abeßer also uses machine learning techniques to analyse musical nuances, in this case to identify the style in which a musician is playing. Schuller et al. in 2015 [3] used similar cues to try to identify physical attributes of an instrument, with the ultimate goal of making a physical-modelling VI of the instrument; they also use timbre cues to identify which string a note is played on. Mitchell's work published in 2012 [4] is similar to this project, in that it aims to automate synthesiser settings, but using simpler FM synthesisers. Finally, the AVA work of Yang, Rajab and Chew from 2016 [5] was very relevant for the state of the art in extracting portamento and vibrato information from the audio of a performance.
Method
The proposed approach to the challenge is shown in Figure 1. The two possible paths in the centre of the flow chart indicate the manual and automatic annotation approaches that could be pursued for using the output data. At the time of submission, the tool was able to automatically pick the most appropriate velocity setting and annotate a performance for amplitude and pitch variations, but manual annotation of note start times was still required.
Figure 1: System flow chart
The initial work on the project was to design and code the algorithms for the analysis of the musical signals. This prototyping work was done in MATLAB. Once the basic template for splitting an audio file into separate notes (given the start and end points, as well as the pitch, of each note) was in place, various analysis algorithms were implemented and tested. I decided to concentrate on loudness and pitch, as these are relatively easy to quantify, and both are affected by the velocity of the note.
For loudness, the RMS method was used, with a 1024-sample window and 512-sample hop size, which was shown to capture the characteristic curve of the decaying percussive note. For pitch, the YIN pitch identification algorithm [6] was used, as it gives very good resolution (even at the lower bass pitches) and a confidence rating which can be used to determine whether to trust the output. Its use was greatly aided by the fact that the intended pitch of each note is already known: only a semitone above and below that pitch is searched, so octave errors are eliminated and some pitches can be identified even in the transient period at the start of a note. Portamento and vibrato values were identified by looking at the initial pitch, the time to the intended pitch, and the amount and rate of variation. This was inspired by the methods of the AVA project [5], but did not use their more advanced algorithms because of time constraints.
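As an illustration, a minimal MATLAB sketch of the windowed RMS curve described above is given here; the variable names (noteAudio, fs) are illustrative rather than taken from the actual tool:

    winLen = 1024;                               % analysis window length (samples)
    hopLen = 512;                                % hop size (samples)
    numFrames = floor((length(noteAudio) - winLen) / hopLen) + 1;
    rmsCurve = zeros(numFrames, 1);
    for k = 1:numFrames
        frame = noteAudio((k-1)*hopLen + (1:winLen));
        rmsCurve(k) = sqrt(mean(frame.^2));      % RMS of this frame
    end
    [maxRMS, maxFrame] = max(rmsCurve);          % peak of the loudness curve
    maxRMSTime = (maxFrame - 1) * hopLen / fs;   % convert frame index to seconds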
Twelve parameters were identified as together characterising the loudness and pitch of a note, namely: maximum amplitude value, maximum amplitude time, maximum RMS value, maximum RMS time, time to 50% / 25% / 10% / 5% of maximum RMS, initial pitch variation, portamento time, vibrato rate and vibrato extent.
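For example, the four decay-time parameters can be derived from the RMS curve along the following lines (continuing the sketch above; a hypothetical illustration rather than the tool's exact code):

    fractions = [0.5 0.25 0.1 0.05];             % 50% / 25% / 10% / 5% of max RMS
    decayTimes = nan(1, numel(fractions));
    for n = 1:numel(fractions)
        % first frame after the peak where the curve falls below the fraction
        f = find(rmsCurve(maxFrame:end) < fractions(n) * maxRMS, 1, 'first');
        if ~isempty(f)
            decayTimes(n) = (maxFrame + f - 2) * hopLen / fs;
        end
    end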
The user interface for the application was then developed. It is split into five tabs, which take the user through the data collection, analysis and prediction stages. Figure 2 shows the first tab, where the MIDI file for testing is generated. The MIDI code was based on the work of Ken Schutte [7], with alterations made to the "matrix2midi.m" file to allow it to output controller values as well as the basic note on/off capabilities of the original. The user sets the range of notes and velocities to test, the duration of the notes and the velocity increment. A file name is set and a MIDI file generated. This tab also allows the user to save and load data from previous settings, which proved to be crucial as the analysis can be very time-consuming.
Figure 2: 88x127 Tab 1
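The test file generated on the first tab sweeps every note and velocity combination in turn. A minimal sketch of this step, assuming illustrative values for the note range, durations and increments, and assuming the six-column note-matrix layout [track, channel, note, velocity, start time, end time] used by Schutte's matrix2midi.m [7] (the actual tool also writes controller values via the modified version):

    notes      = 28:67;                 % note range under test (assumed values)
    velocities = 67:5:127;              % velocity sweep used in the project
    noteDur    = 2;                     % note duration in seconds (assumed)
    gap        = 0.5;                   % silence between notes (assumed)
    M   = zeros(numel(notes) * numel(velocities), 6);
    row = 0;
    t   = 0;
    for n = notes
        for v = velocities
            row = row + 1;
            M(row, :) = [1, 1, n, v, t, t + noteDur];
            t = t + noteDur + gap;
        end
    end
    midi = matrix2midi(M);              % from Schutte's toolbox [7]
    writemidi(midi, 'sweep_test.mid');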
Figure 3 shows the output MIDI file opened in a digital audio workstation (Logic), with the instrument to be tested (the EXS24 sampler's standard fretless bass guitar) opened on the track. This is used to bounce an audio file to disk. The audio file is then selected and opened in the project application; Figure 4 shows the second tab, where the file is selected and the analysis started. When the user presses the "Analyse" button, the audio file is first checked to be of sufficient duration (it is then assumed to contain notes of the lengths set on the first tab) before the analysis is performed, with a progress bar showing the user which note is currently being processed. Analysis of the full range of the bass guitar, at velocity values from 67 to 127 (in increments of 5), took 28 minutes to complete. Values below 67 were ignored, as the VI produces only percussive noises for them.
Figure 3: 88x127 MIDI output opened in Logic
Figure 4: 88x127 Tab 2
The third tab, shown in Figure 5, shows the results of the analysis for a single note and velocity in three graphs: the amplitude of the signal, the RMS curve, and the fundamental frequency variation (as MIDI pitch). The RMS and pitch graphs also show the positions of the relevant derived parameters. The buttons allow the user to step through the range of notes and velocities, and the graphs update to show the data for the currently selected combination.
Figure 5: 88x127 Tab 3
The basic method for matching an input audio note is to find, for the given pitch, the velocity value whose RMS curve and pitch variations best match the input, and then to find the amplitude attenuation needed so that the sample's output volume matches the input.
The fourth tab, shown in Figure 6, allows the user to input the parameters of a target note (possibly derived from analysis done by hand in an audio editing application). Pressing the button triggers a nearest-neighbour search over the parameters of all velocities for the given note number, and the best match is displayed, along with a value representing the closeness to the input values and the amount by which the volume will need to be changed to match the input.
Figure 6: 88x127 Tab 4
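The search on the fourth tab amounts to a nearest-neighbour match over the twelve derived parameters. A minimal sketch, with illustrative variable names (paramTable holds the stored parameters for each tested velocity of the note, storedPeaks their peak amplitudes, and targetParams and targetPeak the values measured from the input note):

    diffs = paramTable - targetParams;           % one row per stored velocity (implicit expansion, R2016b+)
    dists = sqrt(sum(diffs .^ 2, 2));            % Euclidean distance to the target
    [bestDist, bestRow] = min(dists);            % closest stored velocity and its distance
    bestVelocity = velocities(bestRow);
    % gain needed so the chosen sample's peak matches the input note
    gainDB = 20 * log10(targetPeak / storedPeaks(bestRow));

In practice the parameters would be normalised (or weighted, as discussed under Future developments) before the distance is computed.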
The final tab, shown in Figure 7, performs the same search on an array of notes, which the user can supply as a .mat file. The best fit for each note is found, and the values are output as a MIDI file.
Figure 7: 88x127 Tab 5
Some research was done into how to match the volume of the found velocity value to that of the input note audio. There are two MIDI controllers that affect the relative volume of a given pitch / velocity pair: CC7 (MIDI volume) and CC11 (expression). An exhaustive brute-force search of value combinations showed that the output amplitude variation is not consistent throughout the range of notes or velocities, although it is a predictable curve for a fixed note and velocity. Hence another lookup table is required to adjust the amplitude of the found velocity, although not every volume and expression value needs to be included, as interpolation can be used. This is the case for this sampler and this library; it may not be the case for all sample libraries.
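As an illustration of the interpolation step, assuming the measured curve of output level against controller value is monotonic for a fixed note and velocity (a hypothetical sketch; measuredCC and measuredDB stand for the sparse lookup table, and the numbers shown are invented example values, not measurements from the library):

    measuredCC = 0:16:127;                            % sparse set of CC11 values tested
    measuredDB = [-60 -24 -16 -11 -7.5 -4.8 -2.3 0];  % example output levels (dB) at those values
    targetDB   = -4.5;                                % attenuation required to match the input note
    % invert the measured curve: find the controller value giving the required level
    ccValue    = round(interp1(measuredDB, measuredCC, targetDB, 'linear'));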
Results
To test the tool, I picked three extracts from the track "Bright Size Life", from the album of the same name, released by the Pat Metheny Trio in 1976. This was chosen for the virtuosic and expressive performances of fretless bass guitarist Jaco Pastorius, and because Metheny's guitar and Pastorius's bass are panned relatively hard to the right and left respectively, which made it easy to isolate the bass in the recording (after some low-pass filtering to reduce the cymbal sounds).
The bass part of each extract was transcribed by the author, and the audio was manually annotated for the start and end timings of each note using Sonic Visualiser, with the values exported as a .csv file. A MATLAB script was written to read this file, along with the audio file, and derive the same 12 parameters for each note using the same algorithms as the main tool. The resulting array was used with the fifth tab of the 88x127 tool to create a MIDI file for each extract. Figure 8 shows the output file opened in Logic with the velocity settings visible, and Figure 9 shows the same data with the expression values (which control the relative output level of each note).
Figure 8: Logic piano roll - BSL extract 1 velocity output
Figure 9: Logic piano roll - BSL extract 1 expression output
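A minimal sketch of how the annotation file can be turned into per-note parameters, assuming the .csv holds plain numeric columns of start time, end time and MIDI note number (the file names and column order here are illustrative, not the project's actual files):

    [audio, fs] = audioread('bsl_extract1.wav');         % isolated bass extract (assumed name)
    ann = csvread('bsl_extract1_notes.csv');              % Sonic Visualiser export (assumed name)
    for k = 1:size(ann, 1)
        s = max(1, round(ann(k, 1) * fs));                % start sample
        e = min(length(audio), round(ann(k, 2) * fs));    % end sample
        noteAudio = audio(s:e);
        targetPitch = ann(k, 3);                          % intended MIDI note number
        % ...derive the 12 parameters with the same routines as the main tool
    end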
Figures 10 & 11, 12 & 13, and 14 & 15 show, for each extract, the score analysed and the audio signals of the original recording (top), the output from the 88x127 tool (middle), and a raw MIDI performance (bottom). Audio files of the corresponding three versions of each extract are below each pair of figures.
Figure 10: BSL Extract 1
Figure 11: BSL Extract 1 audio comparison
Figure 12: BSL Extract 2
Figure 13: BSL Extract 2 audio comparison
Figure 14: BSL Extract 3
Figure 15: BSL Extract 3 audio comparison
Finally, an audio file of the 88x127 tool output mixed in with the original file (with the original bass removed) is presented for comparison.
Conclusion
The articulation of the notes (as well as some of the pitch variation) is captured and reproduced, and there is a noticeable improvement in realism over the raw MIDI performance. This shows that the technique is valid, and further refinement of the system should produce further improvements.
Future developments
Improvements to the tool would start by including the sweep of possible volume / expression values in the main tool. For percussive instruments, such as the fretless bass guitar, only a very short amount of audio (c. 0.05 s) is needed to find the maximum amplitude. Further analysis of the audio signal, such as features associated with timbre, could also be captured, and the additional data points used either to improve the sample matching or to provide controller settings for the synthesiser.
The only noticeably artificial-sounding elements of the output are those moments where the performer uses "hammer-ons" and "pull-offs", that is, does not re-articulate the second note of a pair. Further work could identify these moments and provide control data to the VI to better model them.
Improvements could also be made in the sample matching / suggestion process by testing the output and using machine learning to derive a set of weights for the derived features, rather than giving them all equal weighting.
This tool is a step towards the ultimate goal of improved automatic computer musical performance, based on machine learning analysis of expert performance.
Bibliography
[1] Dannenberg & Derenyi. 1998. Combining instrument and performance models for high-quality music synthesis.
[2] Abeßer. 2014. Automatic transcription of bass guitar tracks applied for music genre classification and sound synthesis.
[3] Schuller et al. 2015. Parameter extraction for bass guitar sound models including playing styles.
[4] Mitchell. 2012. Automated evolutionary synthesis matching: Advanced evolutionary algorithms for difficult sound matching problems.
[5] Yang, Rajab & Chew. 2016. AVA: An interactive system for visual and quantitative analyses of vibrato and portamento performance styles.
[6] De Cheveigné & Kawahara. 2002. YIN, a fundamental frequency estimator for speech and music.
[7] Schutte. MIDI file tools for MATLAB (including matrix2midi.m).