How to finetune this model?
How do I fine-tune this model? As far as I can see, it is based on the WeNet framework, and I want to train it on my own dataset in Colab using the example provided by Sanchit Gandhi: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb
That notebook is based on the Whisper base ASR model. What changes do I need to make?
hi!
I don't have a direct answer for you ... but I am curious why you want to fine-tune it.
Did you test it on some of your data and get poor performance?
Are you working with very different types of audio?
Is your data in English?
In the past, we tried to fine-tune the model for specific use cases, and we were not always successful. We found it hard to improve accuracy on a particular type of audio without reducing accuracy in the general setting.
But I will let the team answer your fine-tuning question, as they may have made progress since I last talked to them.
Thank you for the reply. My use case is personalization, since I use ASR for everything. I wrote my own streaming Python code, which is closest to the example code in the Reverb GitHub repo, except that I trigger it with a special hotkey. I use ASR for taking notes, posting comments on social media, and everything in between. Until now, I was using Whisper small.en fine-tuned on roughly 30 hours of my own recorded English data. Since I use the same setup for inference, it worked beautifully.

For the last few days, though, I have been running reverb_asr_v1 locally, and it's pretty accurate: it generalizes and transcribes even better than my fine-tuned Whisper model. But Reverb can't handle specific words, names, and nicknames of people I personally know. So either I have to fine-tune it, or add an extra layer with my personal dictionary before the final output. For example, when I say "inference," it transcribes "information"; the problem is that it sometimes transcribes "interface" instead, which slips past my dictionary layer.
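To illustrate, the dictionary layer is just a post-processing pass over the transcript (a minimal sketch; the mapping and the replace_terms name are made up, and it shows exactly the weakness I described: every misrecognized variant has to be listed by hand):

```python
import re

# Hypothetical personal dictionary: misrecognized word -> intended word.
PERSONAL_DICT = {
    "information": "inference",
    # "interface" -> "inference" is missing, so that variant slips through.
}

def replace_terms(transcript: str) -> str:
    """Replace known misrecognitions with the intended personal terms."""
    for wrong, right in PERSONAL_DICT.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript)
    return transcript

print(replace_terms("The information finished in two seconds."))
# -> "The inference finished in two seconds."
```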
Since the out-of-the-box reverb_asr_v1 is very accurate, I was thinking about using LoRA for the fine-tuning: it freezes the original weights and learns only small adapter matrices from the additional data, so the model keeps its generalization ability. I did the same with Whisper small.en and Whisper turbo, and the results were far better than fine-tuning the entire model.
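For reference, this is roughly the PEFT recipe I used on Whisper (a sketch only; since Reverb is WeNet-based rather than a Transformers model, it would not apply directly, and the rank/alpha values are just the common defaults I started from):

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model; its weights stay frozen under LoRA.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

lora_config = LoraConfig(
    r=32,                                 # adapter rank
    lora_alpha=64,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections, per the usual Whisper recipe
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```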
Hey there!
Excited to hear that you're using reverb! The only way at the moment to fine-tune the model is to use the training script in our reverb repo -- you can point the script at the config.yaml and the reverb_asr_v1.pt checkpoint. You'll have to modify the YAML file to fit the parameters you'd like to use for fine-tuning. Unfortunately the documentation isn't great, so there will be a bit of playing around with the scripts to make everything work -- you might be able to find some information on the main GitHub, but we'd also be happy to help you work through any issues you run into.
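For example, you could derive a fine-tuning config from the released one along these lines (a rough sketch; the key names assume a WeNet-style config, so verify them against the actual config.yaml):

```python
# Sketch: load the released config and override a few values for fine-tuning.
# optim_conf.lr and max_epoch are assumed WeNet-style keys -- check config.yaml.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["optim_conf"]["lr"] = 1e-5   # much smaller learning rate than pretraining
cfg["max_epoch"] = 10            # a short fine-tuning run

with open("finetune.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

You'd then launch the training script with the new YAML and the reverb_asr_v1.pt checkpoint.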
We're actually working on making it easier to fine-tune reverb and will hopefully release the code soon!
Thanks, I will try using the repo script and keep you updated.