Different outputs when using the original OpenAI Whisper model and the model wrapped by Hugging Face

#15
by philip30 - opened

Hi there,

Thank you for the amazing contribution of the Whisper models. I observed different (worse) results when using the model wrapped inside the transformers package compared to the original OpenAI whisper code. I am wondering whether:

  1. There's a different decoding algorithm
  2. There's some configuration I forgot to set

The most apparent problem is with overlapping speech. I realize there is a similar discussion in the OpenAI GitHub community here: https://github.com/openai/whisper/discussions/434, but in my internal testing the original OpenAI model was far better in multi-speaker, multi-turn settings.

Thank you.


I think you can try using the pipeline for audio longer than 30 seconds.

Long-Form Transcription
The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. This is possible through the Transformers pipeline method. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline. It can also be extended to predict utterance-level timestamps by passing return_timestamps=True:

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-small.en",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy())["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]
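For intuition, the chunking can be pictured as overlapping fixed-length windows slid over the audio. The sketch below is illustrative only: the window and stride values (`chunk_s`, `stride_s`) are assumptions for the example, not the pipeline's actual internals, and the real pipeline also merges the model's predictions across the overlapping regions.

```python
def chunk_bounds(total_s, chunk_s=30.0, stride_s=5.0):
    """Yield (start, end) windows in seconds covering `total_s` seconds of audio.

    Consecutive windows overlap by `stride_s` seconds so that words cut at a
    chunk boundary can still be recovered when the chunks are stitched together.
    """
    step = chunk_s - stride_s
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        if start + chunk_s >= total_s:
            break
        start += step

# 70 s of audio is covered by three overlapping 30 s windows:
print(list(chunk_bounds(70.0)))
# [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

In the real pipeline the overlap is what lets transcriptions of adjacent chunks be merged without dropping words at the boundaries; the stride can be tuned via the pipeline's `stride_length_s` argument.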

I confirmed that the solution is working, thanks!

philip30 changed discussion status to closed
