	---
	language: fa
	datasets:
	- common_voice_6_1
	tags:
	- audio
	- automatic-speech-recognition
	license: mit
	widget:
	- example_title: Common Voice Sample 1
	  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
	- example_title: Common Voice Sample 2
	  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
	model-index:
	- name: Sharif-wav2vec2
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice Corpus 6.1 (clean)
	      type: common_voice_6_1
	      config: clean
	      split: test
	      args:
	        language: fa
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 6.0
	---

	# Sharif-wav2vec2

	This is a fine-tuned version of Sharif Wav2vec2 for Farsi. The base model was fine-tuned on 108 hours of Common Voice Farsi samples at a sampling rate of 16 kHz. Afterward, we trained a 5-gram language model with the [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor, which increased our accuracy on online ASR. When using the model, make sure that your speech input is sampled at 16 kHz. Before using it, you may need to install the following dependencies:

	```shell
	pip install pyctcdecode
	pip install pypi-kenlm
	```

	For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio. Alternatively, you can use the code below to run the model locally:

	```python
	import torch
	import torchaudio

	from transformers import AutoProcessor, AutoModelForCTC

	# Load the processor (feature extractor + LM-backed decoder) and the CTC model
	processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
	model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

	# Load the audio; it should be mono and already sampled at 16 kHz
	speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
	speech_array = speech_array.squeeze().numpy()

	features = processor(
	    speech_array,
	    sampling_rate=processor.feature_extractor.sampling_rate,
	    return_tensors="pt",
	    padding=True)

	# Run inference without tracking gradients
	with torch.no_grad():
	    logits = model(
	        features.input_values,
	        attention_mask=features.attention_mask).logits
	    # Decode with the language model; batch_decode expects NumPy logits
	    prediction = processor.batch_decode(logits.numpy()).text

	print(prediction[0])
	# تست
	```
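For intuition about what `batch_decode` starts from: a CTC model emits one score vector per audio frame over the vocabulary plus a blank symbol. The simplest decoder takes the best symbol per frame, merges consecutive repeats, and drops blanks. The processor in this repository goes further and uses the 5-gram language model during decoding, so the sketch below shows only the plain greedy baseline, not the method used here. The toy vocabulary and scores are made up for illustration.

```python
def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank
        if best != previous and best != blank_id:
            decoded.append(vocab[best])
        previous = best
    return "".join(decoded)


# Hypothetical 4-symbol vocabulary; index 0 is the CTC blank
vocab = ["<blank>", "a", "b", "c"]
logits = [
    [0.10, 0.80, 0.05, 0.05],  # a
    [0.10, 0.80, 0.05, 0.05],  # a again (merged with the previous frame)
    [0.90, 0.05, 0.03, 0.02],  # blank (separates the two a's)
    [0.10, 0.70, 0.10, 0.10],  # a (new symbol after the blank)
    [0.10, 0.10, 0.70, 0.10],  # b
]

print(greedy_ctc_decode(logits, vocab))  # aab
```

A language-model-backed decoder keeps several candidate transcriptions alive instead of committing to the single best symbol per frame, which is where the 5-gram model can correct acoustically ambiguous words.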


	Result (WER, %):

	| "clean" | "other" |
	|---|---|
	| 3.4 | 8.6 |
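Word error rate (WER) is the word-level edit distance between the reference and the predicted transcription, divided by the number of reference words: (substitutions + deletions + insertions) / reference length. The sketch below is a minimal pure-Python version for checking individual transcriptions. The evaluation script and text normalization behind the numbers above are not given here, so this function is an assumption about the general metric rather than a reproduction of that evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}%")
```

Multiplying by 100 gives the percentage form used in the table.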