Models only transcribe a small part and then give up

by IDontKnowWhatToNameMyself - opened Nov 19, 2022

Nov 19, 2022

It's happened repeatedly across all models: only a small part of my audio gets transcribed, far less than what I recorded. It may be because of the audio quality, since I'm recording with my phone, but it's really weird that it doesn't even try with parts that sound obvious in the audio.

sanchit-gandhi

Nov 21, 2022

Hey @IDontKnowWhatToNameMyself ! The Whisper model works on 30s windows: any audio sample longer than 30s is truncated (cut short) by the model (see this blog post for details: https://huggingface.co/blog/fine-tune-whisper#load-whisperfeatureextractor)
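
To make the 30s limit concrete, here is a minimal sketch of the pad-or-truncate step in plain Python. It is an illustration only, not the library's code: `pad_or_truncate` and the constants are hypothetical names, and it assumes 16 kHz mono audio held as a list of floats.

```python
SAMPLE_RATE = 16_000                            # assumed: 16 kHz mono input
WINDOW_SECONDS = 30                             # fixed input window length
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 480,000 samples


def pad_or_truncate(samples, window=WINDOW_SAMPLES):
    """Return exactly `window` samples: cut the tail, or pad with zeros (silence)."""
    if len(samples) >= window:
        return list(samples[:window])
    return list(samples) + [0.0] * (window - len(samples))


# A 45 s clip loses its last 15 s; a 10 s clip is padded with 20 s of silence.
long_clip = [0.1] * (45 * SAMPLE_RATE)
short_clip = [0.1] * (10 * SAMPLE_RATE)
print(len(pad_or_truncate(long_clip)) / SAMPLE_RATE)   # 30.0
print(len(pad_or_truncate(short_clip)) / SAMPLE_RATE)  # 30.0
```

Anything past the 30s mark is simply never seen by the model in this setup, which is why longer recordings come back with only their beginning transcribed.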

Adding long-form transcription to handle audio samples > 30s is a TODO! See here: https://github.com/huggingface/transformers/issues/19887
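
Until that lands, one possible workaround is to split a long recording into windows of at most 30s yourself and transcribe each window separately. The sketch below shows only the splitting step; `split_into_windows` is a hypothetical helper, not part of any library, and it assumes the audio is a list of samples at 16 kHz.

```python
def split_into_windows(samples, sample_rate=16_000, window_seconds=30):
    """Split audio into consecutive windows of at most `window_seconds` each."""
    window = sample_rate * window_seconds
    return [samples[start:start + window]
            for start in range(0, len(samples), window)]


# A 70 s recording becomes three windows: 30 s, 30 s and 10 s.
recording = [0.1] * (70 * 16_000)
windows = split_into_windows(recording)
print([len(w) / 16_000 for w in windows])  # [30.0, 30.0, 10.0]
```

Note that cutting at fixed positions can split a word across two windows, so a real solution would probably need overlapping windows or cuts placed at pauses.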

IDontKnowWhatToNameMyself

Nov 21, 2022

It's weird because my audio samples are shorter than 30 seconds and it only transcribes about 2 words of each

sanchit-gandhi

Nov 21, 2022

Do you have code to reproduce? Or is this done through the Hosted Inference API?

IDontKnowWhatToNameMyself

Nov 21, 2022

I'm using the hosted inference API

sanchit-gandhi

Nov 24, 2022

edited Nov 24, 2022

Could be a quirk of the Whisper model that it behaves badly with low-quality audio.

Here's what we could do! Could you use the same inputs and pass them through the Whisper model here: https://huggingface.co/spaces/openai/whisper

Once the predictions have been generated, you can share them in the community tab (input audio + predictions)

You can tag me in the comment once you've done that!

That would help in trying to uncover why the model stops so early

IDontKnowWhatToNameMyself

Nov 24, 2022

I actually already posted something in the community tab and it was the same issue.
https://huggingface.co/spaces/openai/whisper/discussions/50#63743abb7da0b794a582366e

sanchit-gandhi

Nov 24, 2022

Thanks, replied there!

In short, the model terminates the generation process when it reaches the long period of no speech between the two spoken sentences.
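
A rough way to check a recording for this kind of gap before transcribing it is to scan for long stretches of near-silence. The sketch below is only an illustration under stated assumptions: `find_silent_gaps` is a hypothetical helper, the audio is a list of floats in [-1, 1] at 16 kHz, and the amplitude threshold and minimum gap length are arbitrary values that would need tuning for noisy phone recordings.

```python
def find_silent_gaps(samples, sample_rate=16_000, frame_seconds=0.1,
                     threshold=0.01, min_gap_seconds=2.0):
    """Return (start_s, end_s) for each near-silent stretch of at least
    `min_gap_seconds`. A frame is silent if its peak amplitude is below
    `threshold`."""
    frame = max(1, int(sample_rate * frame_seconds))
    min_gap = min_gap_seconds * sample_rate
    gaps, run_start = [], None
    for start in range(0, len(samples), frame):
        peak = max(abs(s) for s in samples[start:start + frame])
        if peak < threshold:
            if run_start is None:
                run_start = start          # a silent run begins here
        else:
            if run_start is not None and start - run_start >= min_gap:
                gaps.append((run_start / sample_rate, start / sample_rate))
            run_start = None
    # Handle a silent run that lasts until the end of the recording.
    if run_start is not None and len(samples) - run_start >= min_gap:
        gaps.append((run_start / sample_rate, len(samples) / sample_rate))
    return gaps


# Synthetic example: 2 s of "speech", 5 s of silence, 2 s of "speech".
sr = 16_000
audio = [0.5] * (2 * sr) + [0.0] * (5 * sr) + [0.5] * (2 * sr)
print(find_silent_gaps(audio))  # [(2.0, 7.0)]
```

If a gap like this shows up, trimming it or transcribing the parts on either side separately may be worth trying, though whether that helps in a given case is untested here.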

IDontKnowWhatToNameMyself changed discussion status to closed Nov 24, 2022
