Different outputs when using the original OpenAI Whisper model and the model wrapped by Hugging Face
Hi there,
Thank you for the amazing work on the Whisper models. I observed different (worse) results when using the model wrapped inside the transformers package compared to the original OpenAI Whisper code. I am wondering if:
- There's a different decoding algorithm
- There's some configuration I missed
The most apparent problem is with overlapping speech. I realize there is a similar discussion in the OpenAI GitHub community here: https://github.com/openai/whisper/discussions/434, but in my internal testing the original OpenAI model was far better in multi-speaker, multi-turn settings. A minimal sketch of the comparison I ran is below.
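For reference, this is roughly how I compared the two (a sketch, not my exact setup: multi_speaker.wav is a placeholder for my test file, and small.en stands in for the checkpoint size I used):

>>> # Original openai-whisper package (pip install openai-whisper)
>>> import whisper
>>> model = whisper.load_model("small.en")
>>> print(model.transcribe("multi_speaker.wav")["text"])

>>> # Same checkpoint through the transformers wrapper
>>> from transformers import pipeline
>>> pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small.en")
>>> print(pipe("multi_speaker.wav")["text"])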
Thank you.
I think you can try using the pipeline for audio longer than 30 seconds.
Long-Form Transcription
The Whisper model is intrinsically designed to work on audio samples of up to 30 s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. It can also be extended to predict utterance-level timestamps by passing return_timestamps=True:
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset
>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"
>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-small.en",
...     chunk_length_s=30,
...     device=device,
... )
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> prediction = pipe(sample.copy())["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."
>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
'timestamp': (0.0, 5.44)}]
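One more thing worth checking: the original transcribe() supports beam search (beam_size) and falls back to higher temperatures on low-quality segments, while the transformers model decodes greedily by default, so some of the quality gap may come from generation settings rather than the model itself. As a hedged sketch (whether the pipeline call accepts generate_kwargs depends on your transformers version):

>>> # forward beam-search options to model.generate();
>>> # num_beams=5 mirrors openai-whisper's beam_size=5
>>> prediction = pipe(sample, generate_kwargs={"num_beams": 5})["text"]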
I confirmed that the solution is working, thanks!