arxiv:2409.12042

ASR Benchmarking: Need for a More Representative Conversational Dataset

Published on Sep 18, 2024

Authors:

Abstract

Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured, spoken in diverse accents, and full of disfluencies such as pauses and interruptions. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate (WER) and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
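For readers unfamiliar with the metric named in the abstract, Word Error Rate is conventionally defined as the word-level edit distance between a reference transcript and the ASR output (substitutions + deletions + insertions), divided by the number of reference words. The Python sketch below illustrates that standard definition only. The function name `word_error_rate`, the lowercasing, the whitespace tokenization, and the sample sentences are assumptions made for illustration; the abstract does not describe the text normalization or scoring code the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace. Real
    evaluations usually apply a more careful normalizer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the reference omits a filler and a repetition
# that the recognizer transcribed, and the recognizer drops the last word.
reference = "i think we should go now"
hypothesis = "um i think we we should go"
print(word_error_rate(reference, hypothesis))
```

Because every filler or repeated word that appears on only one side of the comparison counts as an insertion or deletion, disfluent speech can raise WER even when the content words are recognized correctly.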
