---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1 (clean)
      type: common_voice_6_1
      config: clean
      split: test
      args:
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
---

# Sharif-wav2vec2

This is the fine-tuned version of Sharif Wav2vec2 for Farsi. The base model was fine-tuned on 108 hours of Farsi samples from Common Voice, sampled at 16 kHz. Afterward, we trained a 5-gram language model with the [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor, which increased our accuracy on online ASR. When using the model, make sure that your speech input is sampled at 16 kHz. Before using the model, you may need to install the following dependencies:

```shell
pip install pyctcdecode
pip install pypi-kenlm
```

For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio. Alternatively, you can run the model locally with the code below:

```python
import torch
import torchaudio

from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (feature extractor, tokenizer and n-gram decoder) and the model.
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio; this assumes a single-channel (mono) file.
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")

# The model expects 16 kHz input, so resample if the file uses another rate.
target_rate = processor.feature_extractor.sampling_rate
if sampling_rate != target_rate:
    speech_array = torchaudio.functional.resample(
        speech_array, sampling_rate, target_rate)
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=target_rate,
    return_tensors="pt",
    padding=True)

with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits

# Decode the logits with the language-model-backed decoder.
prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```
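
Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch that reads the rate from a PCM WAV header using only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` and the constant `EXPECTED_RATE` are hypothetical names chosen for this example and are not part of the model's API.

```python
import wave

# Sampling rate stated in this model card (assumed constant for this sketch).
EXPECTED_RATE = 16_000


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling("path/to/your.wav")` returns `True`, resample the audio to 16 kHz before passing it to the processor. Note that the `wave` module only reads uncompressed PCM WAV files; other formats need a different loader.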

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |
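
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below illustrates that definition in plain Python. It is not the evaluation script used to produce the numbers above, and the function name `word_error_rate` is chosen for this example; it also applies no text normalization, which real evaluations often do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

The function returns a fraction; multiply by 100 to express it as a percentage, which is the convention the table above appears to follow (an assumption, since the card does not state the unit).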