0.0 Abstract
A computational model is described that seeks to extract from music a continuous surface – a monophonic line consisting of the most salient elements of a musical score at any moment in time. This musical surface is suggested as the primary feature that occupies a listener’s attention when listening to music – especially when hearing a piece for the first time. However, most computational models of polyphonic music use all parts in their entirety. This means all notes are given equal weight in the model, regardless of their actual perceptual prominence, leading to a lack of ecological validity. The implementation of the proposed model would allow for modelling that functions closer to cognitive processes and may ultimately lead to less demanding computation.
The model is based on the principles of rhythmic grouping – where musical material that occurs simultaneously and with the same durations is heard as one – and of entropy – a measure of the most informationally rich (or most perceptually salient) musical material. A moving window is employed to identify the surface throughout an entire musical work. The model is implemented in MathWorks MATLAB and demonstrated on three stylistically distinct pieces.
The model and test files can be found at the end of this article.
1.0 Introduction
Most empirical modelling of polyphonic music takes approaches that examine pieces or segments of music in their entirety – all notes in all voices are treated equally, no matter their positioning and relative importance within the texture. However, this process encounters problems of ecological validity, especially for those models that seek to understand more about cognition; it most likely does not reflect how we actually listen to music, especially upon hearing a new piece (a condition on which many studies are based). Instead, it is suggested that a musical surface may exist within a work – the most salient features within the texture – that the listener uses as cues to build up the piece in its entirety (Cambouropoulos, 2001, 2006; Deliège, 2001; Deliège et al., 1996).
While this task is simple in a qualitative approach – the surface could be imagined as the line one might sing or hum to describe a wider polyphonic piece – to be useful for future empirical applications, the tracking of the salient cues must be modelled. This model attempts to identify the most salient surface cues at any given moment in a piece. The basis for the model is informed by ideas already present in music theory, where these salient features are more often discussed in terms of musical motifs. Of the large range of parameters likely to influence such decisions – e.g. dynamics, instrument and timbre, as well as a whole range of performance-related cues – the present scope of the model is limited to pitch and rhythm only.
2.0 Background
The musical surface – while often considered in musicology, analysis and composition – has been the subject of limited empirical psychological research. The distinction between musical surface and deeper structural elements features largely in Lerdahl and Jackendoff’s A Generative Theory of Tonal Music and associated work (Lerdahl & Jackendoff, 1983, 1985). Here, hierarchies are generated within music according to the perceptual prominence of various features, and these are used to build up the overall structure of the piece. In particular, larger structural elements are inferred from the musical surface by means of ‘grouping structure’, ‘metrical structure’, ‘time-span reduction’ and ‘prolongational reduction’ (Lerdahl & Jackendoff, 1983, p. 231) – techniques informed by music theory (especially Schenkerian analysis). While this work acknowledges the importance of the musical surface and discusses the features of which it consists, the approach is confined to theory.
The psychological validity of the musical surface is yet to be specifically tested; however, other research has supported the importance of prominent features when listening to music and the role melodic contour plays (Krumhansl, 1991). Deliège et al. (1996) investigate how music is represented in listeners’ recent memory and how subjects use any knowledge of tonal structure in the organisation of material. Three experiments were conducted to identify the cues used by listeners, their ability to identify the temporal location of cues, and their use of tonal structure. In the first experiment, listeners (non-musicians) were asked to identify the elements of a piece (Schubert’s Valse sentimentale) that could act as a cue (‘events they found most salient, those that might be considered as points of reference or “landmarks”’). The results appear to show that the subjects mostly relied on musical elements at the surface level when forming cognitive representations. In the second experiment, subjects heard the piece three times before being asked to order segments of the piece. This tested how musical information was stored in memory and the subjects’ ability to locate the segments temporally, forming a ‘mental line’. The non-musician listeners had difficulty in localising the cues of the piece. In the third experiment, the constructional abilities of both non-musicians and musicians were tested: subjects were asked to order the segments that made up the original work to create a coherent piece. In this experiment, ‘non-musicians exhibited little if any sensitivity to the tonal functions of segments in respect of a complete piece’. Trained musicians, however, were more capable of using knowledge about tonal relations in respect of the completed piece.
Although the results from these three experiments showed differences between non-musicians and musicians, all subjects better identified the surface cues rather than any larger musical structure. These ideas on cue abstraction are important for the study of musical surface. The cues discussed in Deliège et al. (1996) could be seen as all contributing to an overall musical surface.
The ideas presented by Deliège et al. on cue abstraction are continued and developed in several papers by Emilios Cambouropoulos (Cambouropoulos & Widmer, 2000; Cambouropoulos, 2001, 2006). In the first of these, a computational model is developed that attempts to organise melodic surface cues into ‘significant’ musical categories such as motifs. The algorithm constructs a representation of each segment of musical surface in terms of its melodic and rhythmic aspects. A clustering algorithm (Unscramble) is then applied to organise the segments into ‘meaningful’ categories. Cambouropoulos (2001) uses a similar algorithm based on formal definitions to test melodic clustering and membership prediction. The model was then applied to data taken from two psychological experiments (Deliège, 1996, 1997), with results supporting the models of cue abstraction and categorisation.
3.0 Model Design
The computational model to extract the musical surface draws on several elements present in the literature. Firstly, as with Lerdahl and Jackendoff, an approach based in music theory is taken: the model seeks to extract the surface from a musical score, rather than from a recording of a performance. Also in common with Lerdahl and Jackendoff is the decision to use the grouping of rhythmical and metrical elements as a key indicator of surface. Secondly, the work on cue abstraction (in particular Deliège et al., 1996 and Krumhansl, 1991) shows that the idea of a salient musical surface has some validity in our perception of music. Additionally, this work discusses the kinds of features that lead to salient musical ‘landmarks’, allowing an implementation to be chosen that best identifies them.
3.1 Initial Assumptions
The model proposed here is by no means intended to provide a full and accurate prediction of a piece’s surface – the actual surface perceived by a listener is likely to be highly subjective, certainly varying between listeners and possibly even between different hearings by the same listener. What this model does intend to do, therefore, is provide a robust extraction of the surface based on low-level, theoretically grounded attributes. The model should then be able to provide a result that is generalisable across many listeners for use in further psychological study and computational modelling.
Three basic assumptions are made in the model design: 1. that rhythms are only considered the same if the notes have exactly the same onset and duration; 2. that the highest pitch in any vertical rhythmic grouping is the most prominent; and 3. that salient material can be identified by its having the highest entropy at that moment. Grouping rhythms this way would be a major issue in a performance-based model; in the score-based approach taken here, however, this strict grouping is far less of a problem – rhythms to be played the same are written the same (if there are no errors in the score) and, in fact, a composer can distinguish between separate parts by small differences in notation. Entropy measures the level of information complexity; in music, it is often thought of as the inverse of predictability. In this model, the entropy calculation is based on note distributions. While this does not directly take into account rhythmic complexity, which may be wished for here, it provides some useful analogies to cognition that account for it in other ways – sections with many shorter notes will have a higher entropy if the notes vary widely; if they are all the same (and so musically uninteresting), entropy will be low.
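The entropy assumption can be illustrated with a short sketch. The implementation itself is in MATLAB using the MIDI Toolbox; the following is only an illustrative Python translation of the principle, in which the function name and the representation of pitches as a plain list of MIDI numbers are assumptions for the example.

```python
import math
from collections import Counter

def pitch_entropy(pitches):
    """Shannon entropy of a pitch distribution, normalised so that a
    uniform distribution over the 12 pitch classes gives 1.0."""
    counts = Counter(pitches)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(12)

# A line repeating one pitch is fully predictable (entropy 0);
# a line touching several different pitches is less predictable.
repetitive = [60, 60, 60, 60, 60, 60, 60, 60]
varied = [60, 62, 64, 65, 67, 65, 64, 62]
print(pitch_entropy(repetitive), pitch_entropy(varied))
```

As the text notes, a run of many short notes on a single repeated pitch scores low here, while the same number of notes spread over varied pitches scores high.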
4.0 The Model
The model comprises three stages to extract the musical surface: 1. Vertical rhythmic grouping, which identifies all the notes in the piece that could be considered to act as single vertical chords; the highest pitch from each vertical group is then used to represent that group. 2. Layer extraction, in which the highest notes from the rhythmic groupings are separated into monophonic layers. 3. Entropy calculation, in which a moving window of entropy is calculated for each line; the highest relative entropy at each point is then used to identify the salient features that make up the musical surface.
The model is implemented in MathWorks MATLAB, taking MIDI representations of musical scores as its input. It makes use of Toiviainen and Eerola’s MIDI Toolbox for some manipulations and analysis of the material (Toiviainen & Eerola, 2016).
4.1 Rhythmic Grouping
After importing a MIDI file and performing some basic cleaning – removing any channel information and rounding note lengths and onsets to a specified resolution (e.g. a quaver) – the rhythmic groups are identified. A pairwise comparison is made for every two-note combination in the note matrix to identify those instances where a pair shares the same onset and duration. These pairs are added to a new sameRhythms matrix and repeated values are removed. The highest-pitched note in each vertical grouping is then identified and a new note matrix created containing only these notes. Notes that had no rhythmic pair are then identified from the file and added back in, forming the matrix promLines. This creates a simplified representation of the score in which all rhythmic information is preserved and all the inner pitches are removed. Examples of this vertical grouping in Schubert's Valse sentimentale No. 6 (as used in Deliège et al., 1996) are shown in Figure 1.
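The grouping step described above can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB implementation; the tuple representation of a note as (onset, duration, pitch) is an assumption standing in for rows of a note matrix.

```python
from collections import defaultdict

# Hypothetical note format: (onset, duration, pitch), one tuple per note,
# mirroring rows of a note matrix.
notes = [
    (0.0, 1.0, 60), (0.0, 1.0, 64), (0.0, 1.0, 67),  # one vertical group
    (1.0, 0.5, 72),                                  # a note with no rhythmic pair
]

def prominent_notes(notes):
    """Group notes sharing both onset and duration, then keep only the
    highest pitch of each group. Notes without a pair form their own
    group, so they are retained automatically."""
    groups = defaultdict(list)
    for onset, duration, pitch in notes:
        groups[(onset, duration)].append(pitch)
    return sorted((onset, duration, max(pitches))
                  for (onset, duration), pitches in groups.items())

print(prominent_notes(notes))  # [(0.0, 1.0, 67), (1.0, 0.5, 72)]
```

All rhythmic positions survive, but the inner pitches 60 and 64 of the chord are discarded, as in the promLines step.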
4.2 Layer Extraction
Monophonic lines are extracted from this promLines note matrix of rhythmic groupings. Lines are extracted from the top downwards using the toolbox’s extreme(promLinesMidi,'high') function. Each line is extracted and stored in the line structure with a label running from a up to the number of lines needed. Once a line is extracted, it is removed and the new line below is uncovered. It should be noted that this procedure is not intended as a part-extraction exercise – it is not trying to recreate the different voices or instruments in the music – it simply divides the material of the piece into monophonic layers that can be used in the entropy calculations.
Empty spaces in these lines are filled in with material from an upper line (shown in Figure 2 – the upper line, a, will be continuous as long as music is playing). This avoids the moving window reporting a spuriously high entropy value on encountering the start of new material, which could lead to misidentification of salient material.
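The peeling of layers from the top downwards can be sketched as below. This is an illustrative Python sketch of the idea, not the toolbox's extreme function; the (onset, duration, pitch) tuple format is assumed, and the gap-filling step is omitted for brevity.

```python
def extract_layers(notes):
    """Peel monophonic layers off a set of (onset, duration, pitch) notes,
    top line first: take the highest-pitched note at each onset, remove
    those notes, and repeat on what remains."""
    layers = []
    remaining = list(notes)
    while remaining:
        top = {}
        for note in remaining:
            onset, duration, pitch = note
            if onset not in top or pitch > top[onset][2]:
                top[onset] = note
        layer = sorted(top.values())
        layers.append(layer)
        remaining = [n for n in remaining if n not in layer]
    return layers

chords = [(0, 1, 60), (0, 1, 64), (1, 1, 62), (1, 1, 59)]
print(extract_layers(chords))
# [[(0, 1, 64), (1, 1, 62)], [(0, 1, 60), (1, 1, 59)]]
```

Note that the second layer here crosses below and then above no one part in particular: the layers are purely positional slices of the texture, not reconstructed voices, matching the caveat above.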
4.3 Calculation of Line Entropy
A moving window is used, with the parameters of window size and step set by the user. The entropy function of the MIDI Toolbox performs the calculation for each window frame of each line. The toolbox’s function reports the result in terms of predictability, which is then inverted to give the entropy. In this model, the entropy calculation is based on the pitch distributions of the window (pcdist1), as this better accounts for the properties of note durations; the distribution of intervals in the window could be used as an alternative.
At each step (as determined by the window step size), the line with the highest entropy is identified. The material in that line at that point is added to the musicalSurface note matrix, building up a complete picture of the surface throughout the piece. A final check ensures that a monophonic line is created and that no overlapping notes have accidentally been added.
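The window-and-select step can be sketched as follows. Again this is an illustrative Python sketch under simplifying assumptions: entropy is computed directly from pitch counts rather than by inverting the toolbox's predictability measure, lines are plain pitch lists indexed by note position rather than time, and the function names are hypothetical.

```python
import math
from collections import Counter

def window_entropy(pitches):
    """Shannon entropy of the pitches inside one window frame."""
    counts = Counter(pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def extract_surface(lines, window=4, step=2):
    """At each window position, find the line with the highest entropy and
    take its notes for the surface, sliding the window on by `step`."""
    surface = []
    length = min(len(line) for line in lines)
    for start in range(0, length - window + 1, step):
        best = max(lines, key=lambda line: window_entropy(line[start:start + window]))
        surface.extend(best[start:start + step])
    return surface

line_a = [72, 72, 72, 72, 72, 72, 72, 72]  # static upper line, low entropy
line_b = [60, 62, 64, 65, 67, 65, 64, 62]  # varied lower line, high entropy
print(extract_surface([line_a, line_b]))   # [60, 62, 64, 65, 67, 65]
```

Even though line_b sits lower in the texture, its higher entropy means it supplies the surface at every step, mirroring how the model can pick out salient inner or lower parts.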
5.0 Examples of the Model
The functionality of the musical surface model can be seen in its application to two works of differing style: J. S. Bach’s Brandenburg Concerto No. 3, movement 1 (bars 1–46); and Billy Joel’s The Longest Time.
5.1 Brandenburg Concerto
The Brandenburg Concerto has a complex polyphonic style – the parts often interweave, with different material at overlapping pitches. From the plot of entropy, it can be seen that two main lines function throughout the majority of the piece. It can also be seen that where the material is mostly homogeneous (at least at the quaver resolution), there is only a single line – such as in the opening material, up to beat 30, after which the lines diverge (see Figure 5). The contour of the extracted musical surface shows several features that are recognisable as surface landmarks from listening to the piece (Figure 6). In particular, the effectiveness of the model can be seen in the handing-off of material downwards through the parts in beats 32–37.
5.2 The Longest Time
Another good illustration of the effectiveness of the model is seen in its application to Billy Joel’s The Longest Time. This too is a piece with many crossing parts, where salient material frequently appears in different voices within the texture. The coloured piano-roll figure shows the number of voices created by the model to capture the content of the piece (Figure 7). This piece produces a far more complex plot of line entropies (Figure 8). Looking at the musical surface generated for the first four bars (Figure 10), a number of features indicate that the model has correctly identified the surface – especially that it has identified salient parts that are not at the top of the texture (red, blue, yellow and green selections show comparisons between Figure 10 and the score in Figure 11). The opening ‘Dum, dum, dum’ of the three-note anacrusis in the bass part is selected; this is then followed by the upper line of the tune in the first full bar, where all parts sing the same rhythm. The moving part in the bass line (at the bottom of the texture) in the second bar – ‘oh, long-est’ – is then correctly identified, as too is the high turn figure of the second vocal part in the third bar.
6.0 Limitations
The model does, however, have several flaws when faced with certain material. Both of the complex examples suffer from the relatively coarse resolution used – in the Bach, the quaver resolution leads to many parts being treated as rhythmically the same when, in reality, they are not. The Longest Time suffers from a similar problem: figures like the turn in the third bar are smoothed over, although they are still correctly identified. The model also has the potential to miss moving inner lines that have exactly the same rhythm as a simpler upper part.
There is also the possibility of occasional misidentification based on single high-entropy instances. No smoothing or approximation method is applied: the element with the highest entropy is always taken, no matter what it is. This can lead to the model’s surface jumping suddenly between layers for a single note that happens to have a marginally higher entropy.
7.0 Psychological Validity
The model may be considered to take a more music-theoretical approach (as with Lerdahl & Jackendoff, 1983) – this lets it lean on musical understanding developed over a long time. A useful and insightful future step would be to test the model in a psychological study; if the model holds there, it could provide a useful analogy to aid areas of music perception research. The model does deviate from cognitive processes in some respects, as already seen: it lacks certain approximation and smoothing mechanisms that we often employ in our listening to music. Overall, however, while so far tested only by a small qualitative examination, the model functions as a reasonable method to extract a robust musical surface that can be used for further study and in other computational models.
References:
Cambouropoulos, E. (2001). Melodic Cue Abstraction, Similarity, and Category Formation: A Formal Model. Music Perception: An Interdisciplinary Journal, 18(3), 347–370. https://doi.org/10.1525/mp.2001.18.3.347
Cambouropoulos, E. (2006). Musical Parallelism and Melodic Segmentation. Music Perception: An Interdisciplinary Journal, 23(3), 249–268. https://doi.org/10.1525/mp.2006.23.3.249
Deliège, I. (2001). Similarity Perception ↔ Categorization ↔ Cue Abstraction. Music Perception: An Interdisciplinary Journal, 18(3), 233–243. https://doi.org/10.1525/mp.2001.18.3.233
Deliège, I., Mélen, M., Stammers, D., & Cross, I. (1996). Musical Schemata in Real-Time Listening to a Piece of Music. Music Perception: An Interdisciplinary Journal, 14(2), 117–159. https://doi.org/10.2307/40285715
Krumhansl, C. L. (1991). Memory for musical surface. Memory & Cognition, 19(4), 401–411. https://doi.org/10.3758/BF03197145
Lerdahl, F., & Jackendoff, R. (1983). An Overview of Hierarchical Structure in Music. Music Perception: An Interdisciplinary Journal, 1(2), 229–252. https://doi.org/10.2307/40285257
Lerdahl, F., & Jackendoff, R. (1985). A Generative Theory of Tonal Music. Cambridge, Mass.: MIT Press.
Toiviainen, P., & Eerola, T. (2016). MIDI Toolbox 1.1. Retrieved from https://github.com/miditoolbox/1.1