Overview

As humans, we regularly use the voice as a means of recognising people, for example when someone calls us on the telephone or shouts to us from another room. While humans are relatively good at recognising familiar, or at least predictable, voices, identifying unfamiliar voices is much more difficult. Yet this is often precisely the task in forensic and investigative contexts. In such cases, the voice of an unknown criminal is compared with that of a known suspect, with the ultimate aim of assessing the likelihood that the two belong to the same individual. Increasingly, around the world, speaker recognition machines (i.e. pieces of software) are used for this purpose. However, a critical question remains unanswered: do machines recognise speakers in the way that humans do? This question has received relatively little attention in the literature, and the studies that have examined it are all small-scale, simply comparing the results of human recognition with those of machine recognition using overall error rates.
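
To make this concrete, the sketch below shows one way a machine might score a pair of voice samples and how an overall error rate of the kind reported in those studies could be computed. It is a minimal illustration only: the use of cosine similarity between speaker embeddings, the fixed decision threshold, and all function names are assumptions made for the example, not a description of any particular system used in this project.

    import numpy as np

    def machine_score(embedding_a, embedding_b):
        # Illustrative machine comparison: cosine similarity between two
        # speaker embeddings (higher = more likely the same speaker).
        a = embedding_a / np.linalg.norm(embedding_a)
        b = embedding_b / np.linalg.norm(embedding_b)
        return float(np.dot(a, b))

    def overall_error_rate(scores, same_speaker, threshold=0.5):
        # Proportion of comparisons misclassified when scores above the
        # (assumed) threshold are treated as "same speaker" decisions.
        decisions = [score > threshold for score in scores]
        errors = sum(d != truth for d, truth in zip(decisions, same_speaker))
        return errors / len(scores)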

What is much more important, however, is understanding the contexts in which one method might outperform the other, and whether there is any benefit in combining the two approaches. In addressing these issues, our research will provide a better understanding of how speaker recognition machines work and how they might be improved. Further, previous work has overlooked the many factors that may affect human recognition performance, such as cognitive bias. In this project, we assess how human judgements vary with the amount of contextual information available, particularly in a criminal trial, where other information pertinent to the case, or even a forensic expert providing voice evidence, could influence the decision-making involved in the speaker recognition task.
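
As a minimal sketch of what combining the two approaches might involve, the example below fuses a machine similarity score with an averaged human rating at the score level using logistic regression. The example data, the choice of logistic regression as the fusion model, and all variable names are illustrative assumptions rather than the project's actual method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training comparisons: one machine score and one averaged
    # human rating per voice pair, with 1 = same speaker, 0 = different speakers.
    machine_scores = np.array([0.82, 0.31, 0.75, 0.12, 0.64, 0.22])
    human_ratings = np.array([0.90, 0.40, 0.60, 0.20, 0.70, 0.35])
    same_speaker = np.array([1, 0, 1, 0, 1, 0])

    # Fit a simple fuser on the two score streams.
    X = np.column_stack([machine_scores, human_ratings])
    fuser = LogisticRegression().fit(X, same_speaker)

    # Fused probability that a new pair of samples comes from the same speaker.
    new_pair = np.array([[0.55, 0.80]])
    print(fuser.predict_proba(new_pair)[0, 1])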

In order to compare and combine human and machine responses, we will develop a bespoke computer game that elicits human judgements that are conceptually equivalent to those produced by the machine. In doing so, we will also test the viability of using the voice as the central element in a computer game, an area of computer game development that has received relatively little attention.
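
As one possible reading of "conceptually equivalent", the sketch below maps an in-game confidence rating onto the same numeric scale as a machine comparison score, so that human and machine responses can be compared or fused directly. The nine-point scale and the linear mapping are assumptions made purely for illustration.

    def rating_to_score(rating: int) -> float:
        # Hypothetical mapping from an in-game rating (1 = "definitely different
        # speakers", 9 = "definitely the same speaker") onto a common 0-1 scale.
        if not 1 <= rating <= 9:
            raise ValueError("rating must be between 1 and 9")
        return (rating - 1) / 8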

The project addresses a number of specific research questions:

  1. How do humans and machines perform at speaker recognition relative to each other, and can we improve performance by combining the two approaches? To what extent, therefore, do these methods capture the same information?

  2. In what contexts (using speakers with different regional accents and diverse speech samples with varying durations and recording quality) do humans outperform machines?

  3. How do different listener groups perform in speaker comparison tasks? Does familiarity with the regional accent improve performance?

  4. To what extent are human judgements affected by contextual information that may occur in a forensic case, such as (i) the knowledge that it is a criminal case, (ii) other evidence from the case, or (iii) a forensic expert's opinion?