0. Abstract
This project investigates the melodic discrimination test, a common experimental paradigm in music psychology research. Previous research (Harrison & Müllensiefen, 2016) has successfully modelled task performance in pre-existing melodic discrimination tests using formal measures of melodic similarity, melodic complexity, and musical training. However, this model has yet to be validated in a controlled experimental design. In this project, I construct an algorithm that automatically generates melodic discrimination items, and use it to create a set of 80 items that systematically explores the parameter space of melodic similarity and complexity. Half of these items are constructed from real Irish folk melodies, and half are constructed from pastiche folk melodies generated by an automatic composition system (Racchman-Oct2010; Collins, Laney, Willis, & Garthwaite, 2016). These items are then evaluated in a perceptual experiment with 20 participants. Results indicate significant effects of melodic complexity but not of melodic similarity. In addition, the results indicate that automatically generated melodies were not perceived as less stylistically effective than authentic folk melodies.
1. Introduction
The melodic discrimination test is a common experimental paradigm in music psychology research. For many years, melodic discrimination tests have been used to assess individuals' musical aptitude and expertise (e.g. Bentley, 1966; Gaston, 1957; Gordon, 1965, 1982; Müllensiefen, Gingras, Musil, & Stewart, 2014; Seashore, 1919; Wallentin, Nielsen, Friis-Olivarius, Vuust, & Vuust, 2010; Wing, 1961). These tests have also been used in many experimental psychology studies to investigate fundamental properties of melodic processing (e.g. Cuddy, Cohen, & Mewhort, 1981; Cuddy, Cohen, & Miller, 1979; Cuddy & Lyons, 1981; Dowling & Bartlett, 1981; Dowling & Fujitani, 1971; Dowling, 1978; Mikumo, 1992; Schulze, Dowling, & Tillmann, 2012).
In each trial of a melodic discrimination test, participants are played several similar versions of an unfamiliar melody. Their task is to detect differences between these melodies. The precise nature of this task can vary in several ways, but in general these variants seem to engage very similar cognitive processes (Harrison & Müllensiefen, 2016).
This project investigates a three-alternative forced-choice (3-AFC) oddity version of the melodic discrimination test. Participants are presented with three melody versions in each trial, exactly two of which are the same (ignoring any transposition between melodies), and the participant's task is to determine which of these versions is the odd one out. You can take an example of such a test here by selecting the "melodic memory" option (it takes about 6 minutes). Here's an example of a 3-AFC oddity trial where the third melody is the odd one out:
In this example, each melody is transposed one semitone higher than the previous one. Transpositions are often used in melodic discrimination tests, meaning that test-takers have to compare melodies in terms of relative pitch intervals rather than absolute pitch content. In the case of the example above, the three melodies can each be represented as a series of six intervals. Writing these intervals in units of semitones, and identifying ascending intervals with positive values and descending intervals with negative values, we can represent the first two melodies as follows:
{4, 3, -3, 3, 2, -2}.
The third melody, meanwhile, has a different intervallic representation:
{4, 3, 1, -2, 2, -2}.
The two modified intervals (the third and fourth) are the two intervals on either side of the single altered note. The listener's task here, as in all melodic discrimination tasks, is to detect this change in interval pattern. However, not all changes in interval pattern are equally easy to detect.
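To make this representation concrete, here is a minimal Python sketch of the interval encoding. The MIDI pitch values are hypothetical, chosen only so that they reproduce the interval patterns above; the actual pitches of the example melodies are not reproduced here.

```python
def to_intervals(midi_pitches):
    """Convert a sequence of MIDI note numbers to pitch intervals (semitones)."""
    return [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]

# Hypothetical MIDI pitches reproducing the interval patterns above.
melody_a = [62, 66, 69, 66, 69, 71, 69]
melody_b = [62, 66, 69, 70, 68, 70, 68]

print(to_intervals(melody_a))  # [4, 3, -3, 3, 2, -2]
print(to_intervals(melody_b))  # [4, 3, 1, -2, 2, -2]

# Because the comparison is interval-based, transposition leaves the
# representation unchanged:
transposed = [p + 1 for p in melody_a]
assert to_intervals(transposed) == to_intervals(melody_a)
```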
In order to understand what makes a particular item easier than another item, it is important to understand the various cognitive processes involved in carrying out the melodic discrimination task. Five main processes can be isolated: perceptual encoding, memory encoding, memory retention, similarity comparison, and decision-making (Harrison and Müllensiefen, 2016). These processes are briefly described below.
Perceptual encoding. Here the listener develops cognitive representations for a melody on the basis of its corresponding audio signals. These representations include features such as pitch interval sequence, pitch contour, tonal structure, metrical structure, and so on.
Memory encoding. The cognitive representations for a melody are translated to a working memory representation, so that they can persist in memory after the melody finishes. Working memory has limited capacity, and if these representations are too complex then they may not be retained with perfect precision. The last melody in the trial does not usually need to be encoded in memory, as will be explained later.
Memory retention. The working memory representation of the melody is retained so that it can be compared to future melodies. Retention success depends on the length of the time interval between melodies as well as any distractions within this interval.
Similarity comparison. Here the memory representation of one melody is compared with a new melody as it is being heard, and the listener makes a judgement of the similarity between the two. Similarity judgements are made along a number of dimensions; these dimensions are determined primarily by what features are available in the memory representation of the first melody.
Decision-making. Finally, the outcomes of the similarity judgements are analysed in order to decide what response to give. The particular strategy depends both on the melodic discrimination task being used and on the participant themselves. In the case of the 3-AFC oddity task, there are three possible pairwise similarity comparisons: 1 vs. 2, 1 vs. 3, and 2 vs. 3. One possible strategy is as follows: consider the similarity for each of these pairs, and determine the pair with the highest similarity; the odd one out is then the melody that is not part of that pair. For example, suppose the first melody is the odd one out. Then 1 and 2 are different, as are 1 and 3, but 2 and 3 are the same, so the pair 2 vs. 3 ought to have the highest similarity. The only melody not in the pair 2 vs. 3 is the first melody, so we (correctly) deduce that the first melody is the odd one out. Other strategies for this task are also possible.
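A minimal sketch of this strategy in Python, assuming (as a simplification of the underlying cognitive process) that the pairwise similarity judgements are already available as numbers:

```python
def odd_one_out(similarity):
    """Return the melody excluded from the most similar pair.

    `similarity` maps each pair of melody indices to a similarity judgement.
    """
    most_similar_pair = max(similarity, key=similarity.get)
    return ({1, 2, 3} - set(most_similar_pair)).pop()

# Hypothetical judgements for a trial where melody 1 is the odd one out:
# melodies 2 and 3 are (near-)identical.
judgements = {(1, 2): 0.62, (1, 3): 0.58, (2, 3): 0.97}
print(odd_one_out(judgements))  # 1
```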
Each of these steps is important for performance in the melody discrimination test. Features that impair performance in any one of these steps should also impair overall performance. For example, perceptual encoding can be impaired if the prior musical context is a harmonically distant key. Memory encoding can be impaired if the melody exceeds the capacity of working memory, which is more likely to happen if the melody is long, complex, or does not conform to Western musical schemata such as tonal organization. Memory retention can be impaired if there is a long time interval between melodies, or if this time interval contains auditory information that displaces the original information from memory. Similarity comparison is impaired if the melodies being compared are structurally similar on the dimensions available in the memory representation of the first melody. Lastly, decision-making can be impaired by the listener employing a sub-optimal strategy.
A number of previous studies have manipulated item features in melodic discrimination tests to investigate their roles in determining task difficulty. These manipulations have usually been categorical, for example comparing nine-note melodies with six-note melodies (Akiva-Kabiri, Vecchi, Granot, Basso, & Schön, 2009), tonal melodies with atonal melodies (Cuddy, Cohen, & Mewhort, 1981), and contour-preserving with contour-violating melody pairs (Dowling & Fujitani, 1971). However, treating these variables as categorical is not ideal, as features such as tonalness and contour similarity are really continuous phenomena. In particular, it prevents the results from these studies from being generalised to predict performance for new items that do not fall exactly within these categories.
One solution to this problem is to use formal measures to quantify these item features on continuous rather than discrete scales (Harrison & Müllensiefen, 2016). These measures can then be used in a regression model to predict item difficulty. These authors used formal measures of melodic complexity and melodic similarity to model difficulty in three pre-existing melodic discrimination tests originally designed to test musical aptitude and expertise (Gordon, 1989; Müllensiefen, Gingras, Musil, & Stewart, 2014; Wallentin, Nielsen, Friis-Olivarius, Vuust, & Vuust, 2010). Melodic complexity was associated with higher item difficulty, probably because complex melodies place a higher demand on working memory and hence cannot be encoded as reliably. Melodic similarity was also associated with higher item difficulty, probably because similar melodies have more similar cognitive representations and are hence harder to distinguish. The authors also found that self-reported musical training was a strong predictor of melodic discrimination ability. They hypothesised that these same relationships should generalise to other types of melodic discrimination tests, such as the 3-AFC oddity paradigm.
The ability to predict item difficulty accurately on the basis of item features can be very useful for constructing computerised adaptive tests (CATs). CATs represent a modern approach to testing where, instead of giving all test-takers the same set of items, items are instead chosen to match the test-taker’s estimated ability throughout the test. This approach to testing can be much more efficient than traditional testing, as candidates no longer have to answer questions that are too easy or too difficult. CATs are usually very expensive to construct, requiring extensive empirical pre-calibration, but this calibration process can be made much more efficient if item difficulty can be predicted on the basis of item features. This allows large pre-calibrated item banks to be constructed using automatic item generation techniques (Harrison, 2015). Harrison and Müllensiefen (2016) constructed a regression model for predicting item difficulty in the "same-different" melodic discrimination task, but unfortunately this task is unsuited for CAT construction (see Harrison, 2015 for more details).
The main purpose of this project is to develop a predictive model of item difficulty suitable for use in future CATs. To do this, it is first necessary to choose a more appropriate melodic discrimination paradigm. Here I use the 3-AFC oddity task described earlier. This particular task has not been used before in the published literature on melodic discrimination, so one aim of the present work is to determine whether the regression model of Harrison and Müllensiefen (2016) also applies here. This model predicts item difficulty using a linear combination of measures of melodic complexity and melodic similarity. Melodic complexity is proposed to impair memory encoding, as complex melodies place too much demand on the limited capacity of working memory. Melodic similarity is proposed to impair similarity comparison, as more similar melodies have cognitive representations that are more difficult to distinguish.
Melodic discrimination items need to use melodies that are unfamiliar to the test-taker, otherwise the memory retention phase is bypassed. One way of obtaining such melodies is traditional composition. However, this approach is not ideal for CAT construction, where very large item banks are required. An interesting alternative is to employ algorithmic approaches to melody construction. In this case, however, it is important to ensure that the automatic composition process produces sufficiently realistic melodies to maintain the ecological validity of the discrimination test. In this project I therefore trial an algorithmic approach to item generation, and assess the perceived stylistic success of the generated melodies as compared to "real" melodies.
Previous research has used several different approaches to assessing the stylistic success of automatically generated music. One is the musical Turing test (Marsden, 2000), where participants are instructed to categorise extracts as either human-composed or computer-generated. However, this approach has been criticised for promoting unrealistic listening modes in participants (Wiggins, Pearce, & Müllensiefen, 2009). An alternative approach has been suggested by Wiggins et al. (2009), who implemented a variant of the consensual assessment test (Amabile, 1996) to assess stylistic success in music. Here participants are not told of the computational origin of the stimuli, and instead are asked simply to rate the stylistic success of the music they hear with reference to an exemplar from that musical style. This is the approach used in the current study. However, unlike Wiggins et al. (2009), this study uses non-expert participants, and these participants are not asked to give verbal justifications for their ratings, as this task is very difficult for individuals without musical training.
In summary, this project had three aims:
develop an algorithm for automatically generating 3-AFC oddity melodic discrimination items;
develop a predictive model of melodic discrimination difficulty for the 3-AFC oddity task and validate it experimentally;
investigate the perceived stylistic success of automatically generated melodic discrimination items.
2. Method
2.a. Formal measures of complexity and similarity
Melodic complexity and similarity are operationalised using the same measures as Harrison and Müllensiefen (2016).
2.a.i. Melodic complexity
Melodic complexity is assessed using a linear combination of three measures from the FANTASTIC melody analysis toolbox (Müllensiefen, 2009): length, interval entropy, and step contour local variation. Length is defined as the number of note onsets in the melody. Interval entropy describes the amount of intervallic variation within the melody. Let $n_i$ denote the number of times that an interval of $i$ semitones occurs in the melody, where positive values of $i$ correspond to ascending intervals and negative values of $i$ to descending intervals. Define the relative frequency of each interval as

$$p_i = \frac{n_i}{\sum_j n_j},$$

with $j$ ranging over all interval sizes occurring in the melody. Interval entropy is then defined as

$$H = -\sum_i p_i \log_2 p_i.$$
Step contour local variation describes local variation in pitch. It is computed by first deriving a step contour vector $\mathbf{x}$ of length 64 for the melody, whose elements sample the melody's raw pitch values (as MIDI note numbers) at equally spaced time points across the whole melody. Step contour local variation is then defined as the mean absolute difference between adjacent elements of this vector:

$$v = \frac{1}{63} \sum_{k=1}^{63} \lvert x_{k+1} - x_k \rvert.$$
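The following Python sketch implements these two definitions directly. It follows the formulas above rather than the FANTASTIC source code, whose exact implementation details (e.g. normalisation conventions, contour sampling) may differ.

```python
import math

def interval_entropy(midi_pitches):
    """Entropy (bits) of the melody's pitch-interval distribution."""
    intervals = [b - a for a, b in zip(midi_pitches, midi_pitches[1:])]
    counts = {}
    for iv in intervals:
        counts[iv] = counts.get(iv, 0) + 1
    total = len(intervals)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def step_contour_local_variation(step_contour):
    """Mean absolute difference between adjacent elements of the
    64-element step contour vector."""
    diffs = [abs(b - a) for a, b in zip(step_contour, step_contour[1:])]
    return sum(diffs) / len(diffs)
```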
Length, interval entropy, and step contour local variation are combined in a linear combination with weights derived from a principal component analysis reported in Harrison and Müllensiefen (2016); all weights in the linear combination are positive. This principal component analysis was warranted because the three variables tend to be fairly strongly correlated in melodies of the short lengths typically used in melodic discrimination tests. Nonetheless, the three variables seem to measure complementary, if related, facets of melodic complexity.
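A sketch of the resulting composite measure is below. The weights and normalisation statistics are placeholders: the actual PCA loadings and feature means/SDs from Harrison and Müllensiefen (2016) are not reproduced here.

```python
import numpy as np

# Placeholder values, NOT the published PCA loadings or corpus statistics.
WEIGHTS = np.array([0.4, 0.3, 0.3])        # all positive, per the text
FEATURE_MEANS = np.array([15.0, 2.5, 1.0])
FEATURE_SDS = np.array([4.0, 0.5, 0.4])

def melodic_complexity(length, entropy, local_variation):
    """Linear combination of standardised complexity features."""
    features = np.array([length, entropy, local_variation])
    z = (features - FEATURE_MEANS) / FEATURE_SDS
    return float(WEIGHTS @ z)
```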
2.a.ii. Melodic similarity
Melodic similarity is assessed using a linear combination of two measures from the SIMILE toolbox (Müllensiefen & Frieler, 2007). The first measure describes the contour similarity between the two melodies, and is calculated by finding the normalised edit distance for the melodies when transformed to a contour representation according to Steinbeck's (1982) algorithm. The second measure describes the harmonic similarity between the two melodies, defined as the normalised edit distance of bar-wise harmonic symbols as produced by the Krumhansl-Schmuckler key-finding algorithm (Krumhansl, 1990). These two measures are combined in a linear combination with equal weights.
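Both measures rest on a normalised edit distance between symbol sequences. A generic Python sketch is below; the SIMILE toolbox's exact symbol alphabets and normalisation conventions may differ, so treat this as illustrative only.

```python
def normalised_edit_similarity(a, b):
    """1 - (Levenshtein distance / length of longer sequence), applied to
    symbol sequences such as contour symbols or bar-wise harmony labels."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return 1 - d[m][n] / max(m, n)

print(normalised_edit_similarity("uddud", "uudud"))  # 0.8, e.g. contour strings
```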
2.b. Stimuli
Experimental stimuli were generated algorithmically, according to the following steps:
Generate raw melodic material. Raw melodies were generated using the automatic music composition algorithm Racchman-Oct2010 (RAndom Constrained CHain of MArkovian Nodes; Collins et al., 2016). Racchman-Oct2010 trains a first-order Markov model on a corpus of source music, then generates from this Markov model using both forward and backward composition. The source corpus used here was a collection of Irish folk melodies transcribed by Damien Sagrillo (http://kern.humdrum.org/), filtered to include only triple-metre melodies. In addition to these automatically generated melodies, a set of "real" melodies was constructed by extracting phrases from the source corpus, matched for length with the automatically generated melodies.
Filter melodies for appropriate characteristics. Melodies were filtered to ensure that they contained no out-of-key notes and that their note density was moderate (i.e. not dominated by very long or very short note durations).
Make altered versions of melodies. Melodies were altered randomly to produce the odd one out for each 3-AFC trial. Any note could be altered except for the first or the last note, and the number of allowed alterations per melody increased with the length of the melody. Alterations could deviate from the original note by up to three semitones. (A simplified sketch of this step is given after this list.)
Make 3-AFC items. 3-AFC items were determined by randomly choosing which of the three melodies in a trial would be the odd one out. For each item, the first version would be in D major, the second transposed up one semitone to E flat major, and the third transposed up another semitone to E major.
Compute melodic similarity and complexity. Measures of melodic similarity and complexity were computed for each 3-AFC item using the models described above.
Select final items. After generating a large number of items according to the five steps above, 80 items were selected for the final experiment under the constraints that exactly half were derived from automatically generated melodies and half from real melodies, that within each group items were distributed uniformly over the melodic similarity and complexity parameter space, and that the two groups were matched for melodic similarity and complexity.
Synthesise audio. Each of the 80 items was synthesised to audio using a piano timbre and a tempo of 120 crotchets per minute. Successive melodies within each item were separated by a short gap.
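Below is the simplified sketch of the alteration step (step 3) promised above. It captures only the constraints named in the text; the full pipeline additionally re-applied the filtering constraints (e.g. no out-of-key notes) after alteration, and the function name is hypothetical.

```python
import random

def alter_melody(midi_pitches, n_alterations, max_shift=3, rng=random):
    """Randomly alter interior notes, leaving the first and last notes intact.

    Simplified sketch: the real algorithm also re-checked the filtering
    constraints (e.g. key membership) after alteration.
    """
    altered = list(midi_pitches)
    positions = rng.sample(range(1, len(altered) - 1), n_alterations)
    for pos in positions:
        shift = 0
        while shift == 0:  # ensure the note actually changes
            shift = rng.randint(-max_shift, max_shift)
        altered[pos] += shift
    return altered

print(alter_melody([62, 66, 69, 66, 69, 71, 69], n_alterations=1))
```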
2.c. Participants
Twenty participants took part in the experiment, none of whom reported hearing problems. The mean age was 24.7 years (SD = 3.5). Twelve participants identified as female, six as male, and two opted not to report their gender. Eighteen were students, and two were in full-time employment.
2.d. Procedure
Testing sessions were conducted in a quiet room, and lasted an average of forty minutes. Each participant was tested individually through a computer interface, using the Concerto platform (Scalise & Allen, 2015), and audio examples were presented over Bose QuietComfort 15 headphones at a comfortable volume.
The testing session began by asking the participant to rate their familiarity with Irish folk music on a scale from 1 (not at all familiar) to 7 (very familiar). They were then played an example folk melody from the source corpus of Irish folk melodies, with the constraint that this melody did not occur in the test set. Participants were then introduced to the 3-AFC oddity melodic discrimination paradigm, and given an example item to attempt along with feedback. Next, participants were instructed on how they would evaluate the perceived stylistic success of the melodies. Specifically, they were told the following:
"Your second task in each question will be to evaluate the stylistic success of the melodies used. By stylistic success, we mean 'how well does the melody fit the general style of the example Irish folk melody you heard earlier'. You will give your rating on a scale of 1 to 7."
"For each question, you will have heard three versions of the same melody. We want you to make your stylistic judgement based on the two melodies that are not the odd one out. Don't worry, however, if you aren't sure which one the odd one out is - just make your best guess."
Participants were then given a second practice question in which they performed both the melodic discrimination task and the stylistic evaluation task. At this point, they were invited to ask the experimenter about any instructions that remained unclear, and were then able to proceed to the main test.
Each participant was administered 80 experimental trials corresponding to the 80 melodic discrimination items generated previously. Each trial began by presenting the melodic discrimination item over headphones. Once the audio finished, the participant was given three response options for the melodic discrimination task, allowing them to identify which melody had been the odd one out. Participants responded by clicking on labelled buttons on the computer screen. The participant was then asked to evaluate the stylistic success of the melody on a scale from 1 to 7. The precise wording was:
"How stylistically successful (as Irish folk melodies) were the two same melodies? (Ignore the odd one out.)"
After these 80 trials, each participant was presented with a musical training questionnaire from the Goldsmiths Musical Sophistication Index (Gold-MSI; Müllensiefen et al., 2014), and a short questionnaire about age, gender, and current occupation.
3. Results
Participant scores were first screened for guessing behaviour by checking whether their scores on the 3-AFC melodic discrimination test significantly exceeded chance level (33.3%). Two participants had total scores that did not exceed chance level (binomial test, p > .05), and their results were excluded from further analysis. All other participants scored significantly above chance (binomial test, p < .05). The score distributions of the remaining participants are plotted below.
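This screening criterion can be reproduced with a one-sided binomial test, as in the following sketch; the count of 40 correct responses is purely illustrative.

```python
from scipy.stats import binomtest

# Illustrative only: a participant scoring 40/80 against a 1/3 chance level.
result = binomtest(k=40, n=80, p=1/3, alternative="greater")
print(result.pvalue)  # participant retained only if p < .05
```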
On average, participants scored 64.8% on the melodic discrimination test (SD = 47.8%). The scores appear clearly bimodal, with one low-ability group scoring an average of 52.5% and one high-ability group scoring an average of 72.6%. One possible interpretation of this bimodality is that participants tended to fall into one of two classes: musicians and non-musicians. To investigate this possibility, we can look at the distribution of musical training scores in the sample, plotted below.
As suspected, the distribution of musical training scores also shows clear bimodality: participants fall into either a low-training group (M = 12.3) or a high-training group (M = 36.8). We can now investigate whether the bimodality in these two distributions is related by plotting musical training scores against melodic discrimination scores.
In this figure, the grey shaded area corresponds to a 95% confidence interval. The results indicate that musical training, as assessed by the Gold-MSI questionnaire, was significantly positively associated with melodic discrimination scores (Pearson's r(16) = .67, p = .002).
We now investigate performance on individual items, rather than the test as a whole, using explanatory item response modelling (de Boeck & Wilson, 2004; de Boeck et al., 2011). To do this, we fit a generalised linear mixed model to the response data using a logistic link function modified to give a lower asymptote of 33.3%. First, we specify a null model containing just random intercepts for participant and for item (AIC = 1752.2, BIC = 1768.1). We then incrementally build up the model by adding fixed effects until we match the regression model of Harrison and Müllensiefen (2016), testing each model extension using likelihood ratio tests. All continuous predictors were scaled and centred before being entered into the model. First, musical training was added to the model, making a significant positive contribution to performance (chi-squared(1) = 11.1, p < .001; AIC = 1743.1, BIC = 1764.2). Next, melodic complexity was added, making a significant negative contribution to performance (chi-squared(1) = 10.3, p = .001; AIC = 1734.8, BIC = 1761.2). Lastly, melodic similarity was added, but this predictor did not significantly improve the model (chi-squared(1) = 1.33, p = .25; AIC = 1735.5, BIC = 1767.1). Adding further interactions to the model prevented it from converging. The final model explained item difficulty solely through melodic complexity: increasing melodic complexity by one standard deviation increased difficulty (on the item response theory metric; see e.g. de Ayala, 2009) by 0.95 (SE = 0.29). The residual item difficulty not explained by the effect of complexity corresponded to a standard deviation of 2.05 in item difficulty.
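The modified link function can be written down directly: the probability of a correct response never falls below the 1/3 guessing rate, however low the linear predictor. A minimal sketch, with illustrative values only:

```python
import math

def p_correct(eta, guess=1/3):
    """Logistic response function with a lower asymptote at chance level:
    guess + (1 - guess) * sigmoid(eta)."""
    return guess + (1 - guess) / (1 + math.exp(-eta))

# eta is the linear predictor (e.g. ability - difficulty, plus fixed effects
# such as musical training and melodic complexity); values illustrative only.
print(round(p_correct(-10.0), 3))  # ~0.333: floored at chance
print(round(p_correct(0.0), 3))    # 0.667
print(round(p_correct(4.0), 3))    # ~0.988
```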
Lastly, we investigate the perceived stylistic success of the automatically generated melodic discrimination items. Mean stylistic success ratings were calculated for every melody, and are plotted below, split into artificial and real melody groups.
On average, the real melodies were rated as slightly less stylistically successful than the artificially generated melodies (mean score of 3.99 versus mean score of 4.10), but this difference was not statistically significant (Welch t-test, t(74.2) = 0.71, p = .48). At the beginning of the test, participants had reported a mean familiarity level with Irish folk music of 2.38 (SD = 1.56) out of 7.
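The group comparison corresponds to a Welch (unequal-variance) t-test over per-melody mean ratings. The data below are simulated purely to show the test call; they are not the study's ratings.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Simulated per-melody mean ratings for the 40 real and 40 artificial
# melodies; NOT the study's data.
real = rng.normal(3.99, 0.8, size=40)
artificial = rng.normal(4.10, 0.8, size=40)

t, p = ttest_ind(artificial, real, equal_var=False)  # Welch t-test
print(t, p)
```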
4. Discussion
This project had three main aims: to construct an algorithm for automatically generating 3-AFC melodic discrimination items, to develop a predictive model of item difficulty for these items, and to assess the perceived stylistic success of these melodies. The first aim, algorithm construction, was achieved as part of the experimental setup. Though only 40 fully automatically generated items were used in the present experiment, the same algorithm could be used to produce an effectively unlimited number of equivalent items.
On the basis of prior research (Harrison & Müllensiefen, 2016), it was hypothesised that item difficulty should be positively related to both melodic complexity and melodic similarity. Unexpectedly, this relationship was observed only for melodic complexity, not for melodic similarity. The 3-AFC oddity task has been used once before in melodic discrimination research (Harrison, 2015), where relationships between melodic similarity and item difficulty were observed, albeit using carefully constrained and matched contour and tonality manipulations. There are two possible interpretations of the null effect observed here. One is that the 3-AFC task does not depend as much on melodic similarity as traditional melodic discrimination tasks, such as the "same-different" task. Another is that the formal measures of melodic similarity used here were not appropriate to this scenario. Future research should aim to distinguish between these two possibilities.
Independently of melodic similarity, melodic complexity did have a strong effect on item difficulty, and could therefore be used in a regression model for predicting the difficulty of automatically generated items. However, a large proportion of item difficulty still remained unexplained after taking into account melodic complexity, and this needs to be addressed before applying the model to CAT construction.
The last aim was to assess the perceived stylistic success of the automatically generated melodies. The results indicated that listeners did not perceive these artificial melodies as any less stylistically successful than real melodies. This result is exciting, as it shows that automatically generated items may not suffer from diminished ecological validity on account of not being "real" melodies. It must be acknowledged that these listeners reported rather low prior familiarity with Irish folk music. However, this does not undermine the validity of the findings: the melodic discrimination test is intended to assess abilities in the general population, and so the important fact is that these average listeners did not perceive the melodies as unrealistic.
These results bode moderately well for future CAT construction. An effective algorithm was presented for automatically generating melodic discrimination items, and these items were rated as being just as stylistically successful as items constructed from real Irish folk melodies. Furthermore, a statistical model was constructed that partially predicted item difficulty on the basis of melodic complexity. However, further work is required to improve this statistical model before it can predict item difficulties accurately enough for CAT construction.
5. Acknowledgements
Elaine Chew, for directing the Music and Speech Processing module, and giving advice on this project; Daniel Müllensiefen, for use of the FANTASTIC and SIMILE toolboxes as well as advice on this project; Tom Collins, for use of Racchman-Oct2010.
6. References
Akiva-Kabiri, L., Vecchi, T., Granot, R., Basso, D., & Schön, D. (2009). Memory for tonal pitches: A music-length effect hypothesis. Annals of the New York Academy of Sciences, 1169, 266–269. doi:10.1111/j.1749-6632.2009.04787.x
Amabile, T. M. (1996). Creativity in context. Boulder, CO: Westview Press.
Bentley, A. (1966). Measures of musical abilities. London, England: George A. Harrap.
Collins, T., Laney, R., Willis, A., & Garthwaite, P. H. (2016). Developing and evaluating computational models of musical style. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 30(1), 16–43. doi:10.1017/S0890060414000687
Cuddy, L. L., Cohen, A. J., & Mewhort, D. J. K. (1981). Perception of structure in short melodic sequences. Journal of Experimental Psychology: Human Perception and Performance, 7(4), 869–883.
Cuddy, L. L., Cohen, A. J., & Miller, J. (1979). Melody recognition: The experimental application of musical rules. Canadian Journal of Psychology, 33(3), 148–157. doi:10.1037/h0081713
Cuddy, L. L., & Lyons, H. I. (1981). Musical pattern recognition: A comparison of listening to and studying tonal structures and tonal ambiguities. Psychomusicology: A Journal of Research in Music Cognition, 1(2), 15–33. doi:10.1037/h0094283
de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.
de Boeck, P., Bakker, M., Zwitser, R., Nivard, M., Hofman, A., Tuerlinckx, F., & Partchev, I. (2011). The estimation of item response models with the lmer function from the lme4 package in R. Journal of Statistical Software, 39(12), 1–28.
de Boeck, P., & Wilson, M. (2004). Descriptive and explanatory response models. In Explanatory item response models: A generalized linear and nonlinear approach (pp. 43–74). New York, NY: Springer. doi:10.1007/978-1-4757-3990-9
Dowling, W. J. (1978). Scale and contour: Two components of a theory of memory for melodies. Psychological Review, 85(4), 341–354.
Dowling, W. J., & Bartlett, J. C. (1981). The importance of interval information in long-term memory for melodies. Psychomusicology: A Journal of Research in Music Cognition, 1(1), 30–49.
Dowling, W. J., & Fujitani, D. S. (1971). Contour, interval, and pitch recognition in memory for melodies. The Journal of the Acoustical Society of America, 49(2B), 524–531. doi:10.1121/1.1912382
Gaston, E. T. (1957). A test of musicality: Manual of directions. Lawrence, KS: Odell’s Instrumental Service.
Gordon, E. E. (1965). Musical aptitude profile. Boston, MA: Houghton Mifflin.
Gordon, E. E. (1982). Intermediate measures of music audiation. Chicago, IL: G.I.A. Publications.
Gordon, E. E. (1989). Advanced measures of music audiation. Chicago, IL: G.I.A. Publications.
Harrison, P. M. C. (2015). Constructing computerised adaptive tests of musical listening abilities (Unpublished master's thesis). Goldsmiths College, University of London.
Harrison, P. M. C., Musil, J., & Müllensiefen, D. (submitted). Modelling melodic discrimination with formal models of similarity and complexity.
Krumhansl, C. L. (1990). Cognitive foundations of musical pitch. New York, NY: Oxford University Press.
Marsden, A. (2000). Music, intelligence and artificiality. In Readings in Music and Artificial Intelligence (pp. 15–28). Amsterdam, The Netherlands: Harwood Academic Publishers.
Mikumo, M. (1992). Encoding strategies for tonal and atonal melodies. Music Perception, 10(1), 73–82.
Müllensiefen, D. (2009). FANTASTIC: Feature ANalysis Technology Accessing STatistics (In a Corpus): Technical Report v. 1.5. London: Goldsmiths, University of London. Retrieved from http://www.doc.gold.ac.uk/isms/m4s/FANTASTIC_docs.pdf
Müllensiefen, D., & Frieler, K. (2007). Modelling experts’ notions of melodic similarity. Musicae Scientiae, Discussion Forum 4A, 183–210. doi:10.1177/102986490701100108
Müllensiefen, D., Gingras, B., Musil, J., & Stewart, L. (2014). The musicality of non-musicians: An index for assessing musical sophistication in the general population. PLoS ONE, 9(2). doi:10.1371/journal.pone.0089642
Scalise, K., & Allen, D. D. (2015). Use of open-source software for adaptive measurement: Concerto as an R-based computer adaptive development and delivery platform. British Journal of Mathematical and Statistical Psychology. doi:10.1111/bmsp.12057
Schulze, K., Dowling, W. J., & Tillmann, B. (2012). Working memory for tonal and atonal sequences during a forward and backward recognition task. Music Perception, 29(3), 255–267.
Seashore, C. E. (1919). The psychology of musical talent. Boston, MA: Silver, Burdett and Company.
Steinbeck, W. (1982). Struktur und Ähnlichkeit: Methoden automatisierter Melodieanalyse. In Kieler Schriften zur Musikwissenschaft XXV. Kassel, Germany: Bärenreiter.
Wallentin, M., Nielsen, A. H., Friis-Olivarius, M., Vuust, C., & Vuust, P. (2010). The Musical Ear Test, a new reliable test for measuring musical competence. Learning and Individual Differences, 20(3), 188–196. doi:10.1016/j.lindif.2010.02.004
Wing, H. D. (1961). Standardised tests of musical intelligence. The Mere, England: National Foundation for Educational Research.