Talker change detection: A comparison of human and machine performance

J Acoust Soc Am. 2019 Jan;145(1):131. doi: 10.1121/1.5084044.

Abstract

The automatic analysis of conversational audio remains difficult, in part, due to the presence of multiple talkers speaking in turns, often with significant intonation variation and overlapping speech. The majority of prior work on psychoacoustic speech analysis and system design has focused on single-talker speech or on multi-talker speech with overlapping talkers (for example, the cocktail party effect). There has been much less focus on how listeners detect a change in talker, or on the acoustic features that characterize a talker's voice in conversational speech. This study examines human talker change detection (TCD) in multi-party speech utterances using a behavioral paradigm in which listeners indicate the moment of perceived talker change. Human reaction times in this task can be well estimated by a model of the acoustic feature distance between the speech segments before and after a change in talker, with estimation improving for models that incorporate longer durations of speech prior to the talker change. Further, human performance is superior to that of several online and offline state-of-the-art machine TCD systems.
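To make the modeling idea concrete, the sketch below scores a hypothesized talker-change point by the distance between acoustic features of the speech immediately before and immediately after it. This is not the paper's implementation: MFCC features, segment-mean pooling, and Euclidean distance are assumptions chosen for illustration, and the function name and parameters are hypothetical.

    # Minimal sketch: feature distance across a hypothesized talker-change point.
    # Assumes MFCCs and Euclidean distance; the study's actual features and
    # distance measure may differ.
    import numpy as np
    import librosa

    def change_point_distance(wav_path, change_time_s, pre_dur_s=1.0,
                              post_dur_s=1.0, sr=16000, n_mfcc=13):
        """Distance between mean MFCC vectors before vs. after change_time_s."""
        y, sr = librosa.load(wav_path, sr=sr)
        t = int(change_time_s * sr)
        pre = y[max(0, t - int(pre_dur_s * sr)):t]    # speech preceding the change
        post = y[t:t + int(post_dur_s * sr)]          # speech following the change
        mfcc_pre = librosa.feature.mfcc(y=pre, sr=sr, n_mfcc=n_mfcc)
        mfcc_post = librosa.feature.mfcc(y=post, sr=sr, n_mfcc=n_mfcc)
        # Summarize each segment by its mean feature vector and compare.
        return float(np.linalg.norm(mfcc_pre.mean(axis=1) - mfcc_post.mean(axis=1)))

Under this kind of model, larger pre/post distances would correspond to more salient talker changes (and, per the abstract, faster human reaction times), and increasing pre_dur_s mirrors the finding that longer pre-change context improves the estimate.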

MeSH terms

  • Adult
  • Female
  • Humans
  • Male
  • Natural Language Processing*
  • Psychoacoustics
  • Speech Intelligibility
  • Speech Perception*
  • Speech Recognition Software / standards