Multilingual transcription: howto w/o specifying language?

#81
by sanjaymk908 - opened

A very important use case for me is transcription (for, say, a subset of Indian languages & English). openai/whisper-large preserves the spoken language through transcription.

I have been following the excellent tutorial: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb
and a few other similar approaches (e.g. https://wandb.ai/parambharat/whisper_finetuning/reports/Fine-tuning-Whisper-for-low-resource-Dravidian-languages--VmlldzozMTYyNTg0). One of my (n00bie) observations is that the target language is specified during training, and the emitted tokens are always in this language (so, e.g., Kannada or Marathi speech would be emitted in Hindi).

I tried changing fine_tune_whisper.ipynb by omitting language="Hindi" when WhisperTokenizer & WhisperProcessor are initialized.
But the demo inference still emits Hindi transcriptions.

So my question: how does one fine-tune using a dataset for a specific language (say Kannada) & still get transcriptions for other languages (say Hindi)?

Hey @sanjaymk908 ! What happens when we fine-tune Whisper for one language is that it becomes more biased to this language. We also risk 'catastrophic forgetting' by fine-tuning on one language - the model might forget how to transcribe in other languages. If you want to preserve performance across lots of languages, it's best to use the pre-trained model.

If your fine-tuning language is similar to other languages of interest (e.g. fine-tune on Hindi, and care about Hindi and Urdu), then you'll probably see a benefit for both through fine-tuning. In this case, you can follow the fine-tuning tutorial and set the tokenizer language as required. At inference time, you should set the forced decoder ids to None, i.e. replace this line:
https://huggingface.co/spaces/whisper-event/whisper-demo/blob/5d4e526c32efcf0bdf726d84160c776d0374fd0b/app.py#L19
with:

pipe.model.config.forced_decoder_ids = None

The model will then predict the most likely language itself and transcribe in that language.
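
As a rough sketch of how this could be wrapped up when building your own pipeline (the helper names `clear_forced_language` and `load_asr_pipeline` are made up for illustration, and the extra `generation_config` handling is an assumption: some transformers versions appear to keep a second copy of the forced ids there, so it is cleared only if present):

```python
def clear_forced_language(pipe):
    """Remove forced decoder ids so Whisper predicts the language token itself."""
    pipe.model.config.forced_decoder_ids = None
    # Assumption: some transformers versions also store the forced ids on
    # model.generation_config; clear that copy too when it exists.
    generation_config = getattr(pipe.model, "generation_config", None)
    if generation_config is not None and hasattr(generation_config, "forced_decoder_ids"):
        generation_config.forced_decoder_ids = None
    return pipe


def load_asr_pipeline(checkpoint="openai/whisper-small"):
    """Build an ASR pipeline (downloads weights) with language forcing disabled."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model=checkpoint)
    return clear_forced_language(pipe)
```

You would call `load_asr_pipeline(...)` with your own fine-tuned checkpoint in place of the default and then pass audio to the returned pipeline as usual.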

Hi @sanchit-gandhi ! If I want to train it on English only, what parts of the code should I change from your Hindi example?
I followed your tutorial with my custom dataset, but the WER I get is too high. There's something I'm missing and I don't understand where I'm going wrong.

Hey @scanne ! In this case, you should load the model from one of the English-only checkpoints (detailed in the intro of the blog post: https://huggingface.co/blog/fine-tune-whisper#introduction, e.g. openai/whisper-small.en instead of openai/whisper-small), and you should also omit both the language and task arguments from the tokenizer and processor, i.e.:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small.en")

and:

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")

Then just make sure your dataset is properly formatted (see https://huggingface.co/docs/datasets/audio_dataset). Otherwise there are no further changes required to fine-tune on English (I've done this previously and it works well: https://huggingface.co/sanchit-gandhi/whisper-large-v2-ft-ls-960h/tree/main)
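
If you switch between English-only and multilingual checkpoints a lot, one possible way to keep the rule above in one place is a small helper like the following. This is only a sketch: `whisper_init_kwargs` and `load_processor` are hypothetical names, and detecting English-only checkpoints by the `.en` suffix is an assumption based on the naming of the official checkpoints.

```python
def whisper_init_kwargs(checkpoint, language=None, task="transcribe"):
    """Return the extra kwargs to pass to WhisperTokenizer/WhisperProcessor.from_pretrained."""
    # Assumption: English-only checkpoints are recognised by their ".en" suffix.
    if checkpoint.endswith(".en"):
        # English-only: omit both the language and task arguments.
        return {}
    if language is None:
        # Multilingual checkpoint without a fixed language: pass nothing extra.
        return {}
    return {"language": language, "task": task}


def load_processor(checkpoint, language=None, task="transcribe"):
    """Load a WhisperProcessor with the arguments appropriate for the checkpoint."""
    # Imported lazily so whisper_init_kwargs can be used on its own.
    from transformers import WhisperProcessor

    kwargs = whisper_init_kwargs(checkpoint, language, task)
    return WhisperProcessor.from_pretrained(checkpoint, **kwargs)
```

For example, `load_processor("openai/whisper-small.en")` ignores any language setting, while `load_processor("openai/whisper-small", language="Hindi")` passes the language and task as in the tutorial.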
