Models only transcribe a small part and then give up
It's happened repeatedly across all models: it just transcribes a small part of my audio, far less than what I recorded. It may be because of the audio quality, since I'm recording with my phone, but it's really weird how it doesn't even try with things that sound obvious in the audio.
Hey @IDontKnowWhatToNameMyself! The Whisper model works on 30s chunks: any audio sample longer than 30s is truncated (cut short) by the model (see this blog post for details: https://huggingface.co/blog/fine-tune-whisper#load-whisperfeatureextractor)
Adding long-form transcription to handle audio samples > 30s is a TODO! See here: https://github.com/huggingface/transformers/issues/19887
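To make the 30s limit concrete, here is a minimal sketch in plain Python of how a fixed-length input window behaves. The 16 kHz sampling rate is what Whisper expects; the helper name `pad_or_trim` and the list-of-floats representation are assumptions for illustration only, not the actual feature extractor code.

```python
SAMPLE_RATE = 16_000          # Whisper expects 16 kHz audio
CHUNK_SECONDS = 30            # fixed input window length
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim(samples, n_samples=N_SAMPLES):
    """Hypothetical helper: force audio to exactly n_samples.

    Longer inputs are cut short (everything past the window is lost),
    shorter inputs are padded with silence (zeros).
    """
    samples = list(samples)
    if len(samples) >= n_samples:
        return samples[:n_samples]
    return samples + [0.0] * (n_samples - len(samples))


# A 45 s recording loses its last 15 s; a 5 s recording is padded to 30 s.
long_clip = [0.1] * (45 * SAMPLE_RATE)
short_clip = [0.1] * (5 * SAMPLE_RATE)
print(len(pad_or_trim(long_clip)) / SAMPLE_RATE)
print(len(pad_or_trim(short_clip)) / SAMPLE_RATE)
```

So a clip under 30s should fit in one window; truncation only explains missing text for longer recordings.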
It's weird because my audio samples are less than 30 seconds and it only does like 2 words of it
Do you have code to reproduce? Or is this done through the Hosted Inference API?
I'm using the hosted inference API
Could be a quirk of the Whisper model that it behaves badly with low-quality audio.
Here's what we could do! Could you use the same inputs and pass them through the Whisper model here: https://huggingface.co/spaces/openai/whisper
Once the predictions have been generated, you can share them in the community tab (input audio + predictions)
You can tag me in the comment once you've done that!
That would help in trying to uncover why the model stops so early
I actually already posted something in the community tab and it was the same issue.
https://huggingface.co/spaces/openai/whisper/discussions/50#63743abb7da0b794a582366e
Thanks, replied there!
In short, the model terminates the generation process when it reaches the long period of no speech between the two spoken sentences.
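One possible workaround, not something proposed in this thread, is to split the recording at long silences and transcribe each piece separately. The sketch below is plain Python under stated assumptions: `split_on_silence`, the amplitude `threshold`, and `min_silence_s` are all hypothetical names and values, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
def split_on_silence(samples, sample_rate=16_000, threshold=0.01, min_silence_s=1.0):
    """Hypothetical helper: return (start, end) sample index pairs for
    speech regions separated by at least min_silence_s of quiet audio."""
    min_gap = int(min_silence_s * sample_rate)
    segments = []
    start = None      # start index of the current speech region
    last_loud = None  # index of the most recent sample above the threshold
    for i, x in enumerate(samples):
        if abs(x) >= threshold:
            if start is None:
                start = i
            elif i - last_loud > min_gap:
                # The quiet stretch was long enough: close the region, open a new one.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments


# Two short bursts of "speech" with a long pause between them.
audio = [0.5] * 5 + [0.0] * 20 + [0.5] * 5
for begin, end in split_on_silence(audio, sample_rate=10):
    piece = audio[begin:end]  # each piece would be sent to the model on its own
    print(begin, end, len(piece))
```

A simple amplitude threshold like this is fragile on noisy phone recordings, so the values would need tuning per recording.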