File size: 5,367 Bytes
1a9df55 673eaaf 1a9df55 673eaaf 1a9df55 673eaaf 1a9df55 673eaaf 1a9df55 673eaaf 1a9df55 469502d 1a9df55 469502d 1a9df55 2111b26 1a9df55 2111b26 1a9df55 469502d 1a9df55 2111b26 1a9df55 469502d 2111b26 1a9df55 525cab3 5e728f8 525cab3 1a9df55 469502d 1a9df55 2111b26 1a9df55 469502d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 |
---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Medium Portuguese
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: mozilla-foundation/common_voice_11_0 pt
type: mozilla-foundation/common_voice_11_0
config: pt
split: test
args: pt
metrics:
- name: Wer
type: wer
value: 6.5785713084850626
---
# Whisper Medium Portuguese 🇧🇷🇵🇹
Bem-vindo ao whisper medium para transcrição em português 👋🏻
If you are looking to **quickly**, and **reliably**, transcribe Portuguese audio to text, you are in the right place!
With a state-of-the-art [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) of just **6.579** in Common Voice 11, this model offers an **x2** precision increase compared to prior state-of-the-art [wav2vec2](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) models. Compared to the original [whisper-medium](https://huggingface.co/openai/whisper-medium) model it delivers an **x1.2** improvement 🚀.
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [mozilla-foundation/common_voice_11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.
The following table displays a **comparison** between the results of our model and those achieved by the most downloaded models in the hub for [Portuguese Automatic Speech Recognition](https://huggingface.co/models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣:
| Model | WER | Parameters |
|--------------------------------------------------|:--------:|:------------:|
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 8.100 | 769M |
| [jlondonobo/whisper-medium-pt](https://huggingface.co/jlondonobo/whisper-medium-pt) | **6.579** 🤗 | 769M |
| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310 | 317M |
| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) | 20.080 | 317M |
### How to use
You can use this model directly with a pipeline. This is especially useful for short audio. For **long-form** transcriptions please use the code in the [Long-form transcription](#long-form-transcription) section.
```bash
pip install git+https://github.com/huggingface/transformers --force-reinstall
pip install torch
```
```python
>>> from transformers import pipeline
>>> import torch
>>> device = 0 if torch.cuda.is_available() else "cpu"
# Load the pipeline
>>> transcribe = pipeline(
... task="automatic-speech-recognition",
... model="jlondonobo/whisper-medium-pt",
... chunk_length_s=30,
... device=device,
... )
# Force model to transcribe in Portuguese
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")
# Transcribe your audio file
>>> transcribe("audio.m4a")["text"]
'Eu falo português.'
```
#### Long-form transcription
To improve the performance of long-form transcription you can convert the HF model into a `whisper` model, and use the original paper's matching algorithm. To do this, you must install `whisper` and a set of tools developed by [@bayartsogt](https://huggingface.co/bayartsogt).
```bash
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
```
Then convert the HuggingFace model and transcribe:
```python
>>> import torch
>>> import whisper
>>> from multiple_datasets.hub_default_utils import convert_hf_whisper
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
# Write HF model to local whisper model
>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")
# Load the whisper model
>>> model = whisper.load_model("local_whisper_model.pt", device=device)
# Transcribe arbitrarily long audio
>>> model.transcribe("long_audio.m4a", language="pt")["text"]
'Olá eu sou o José. Tenho 23 anos e trabalho...'
```
### Training hyperparameters
We used the following hyperparameters for training:
- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0698 | 1.09 | 1000 | 0.1876 | 7.189 |
| 0.0218 | 3.07 | 2000 | 0.2254 | 7.110 |
| 0.0053 | 5.06 | 3000 | 0.2711 | 6.969 |
| 0.0017 | 7.04 | 4000 | 0.3030 | 6.686 |
| 0.0005 | 9.02 | 5000 | 0.3205 | **6.579** 🤗 |
### Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2 |