---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1 (clean)
      type: common_voice_6_1
      config: clean
      split: test
      args:
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
---

# Sharif-wav2vec2

Sharif-wav2vec2 is a fine-tuned wav2vec2 model for Farsi (Persian). The base model was fine-tuned on 108 hours of Common Voice Farsi samples at a 16 kHz sampling rate. We then trained a 5-gram language model with the [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor, which increased our accuracy on online ASR.

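
As background, the snippet below is a minimal sketch of how a KenLM 5-gram can be attached to a wav2vec2 processor via `pyctcdecode`; the file name `5gram.arpa` is a placeholder, and this is not necessarily the exact build script used for this model (the shipped processor already includes the language model):

```python
# Minimal sketch: attach a KenLM 5-gram to a wav2vec2 processor.
# "5gram.arpa" is a placeholder path, not the authors' exact artifact.
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor, Wav2Vec2ProcessorWithLM

processor = Wav2Vec2Processor.from_pretrained("SLPL/Sharif-wav2vec2")

# Order the tokenizer vocabulary by token id for the CTC decoder.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

decoder = build_ctcdecoder(labels, kenlm_model_path="5gram.arpa")
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
```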

## Usage

When using the model, make sure that your speech input is sampled at 16 kHz (a resampling sketch follows the install commands below). Before use, you may need to install the dependencies below:


```shell
pip install pyctcdecode
pip install pypi-kenlm
```
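
If your audio is not already at 16 kHz, you can resample it before feeding it to the model; here is a minimal sketch with `torchaudio` (the file path is a placeholder):

```python
import torchaudio

# Load the audio and resample it to the 16 kHz rate the model expects.
waveform, sr = torchaudio.load("path/to/your.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
```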

For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio. Alternatively, you can use the code below for a local run:


```python
import torch
import torchaudio

from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio; the model expects 16 kHz mono input.
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask,
    ).logits

# The LM-boosted processor decodes numpy logits and returns the text.
prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```
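
Alternatively, the high-level `pipeline` API should also work with this checkpoint; a minimal sketch (the file path is a placeholder):

```python
from transformers import pipeline

# The ASR pipeline handles loading, feature extraction, and decoding in one call.
asr = pipeline("automatic-speech-recognition", model="SLPL/Sharif-wav2vec2")
print(asr("path/to/your.wav")["text"])
```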

## Evaluation

For evaluation, you can use a script along the following lines. This is a minimal sketch assuming the `datasets` and `evaluate` libraries; it is not necessarily the exact script that produced the numbers below:


```python
import torch
from datasets import Audio, load_dataset
from evaluate import load
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")
wer = load("wer")

# Decode the audio column directly at the 16 kHz rate the model expects.
dataset = load_dataset("mozilla-foundation/common_voice_6_1", "fa", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def transcribe(batch):
    features = processor(
        batch["audio"]["array"],
        sampling_rate=16_000,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(
            features.input_values,
            attention_mask=features.attention_mask,
        ).logits
    batch["prediction"] = processor.batch_decode(logits.numpy()).text[0]
    return batch

dataset = dataset.map(transcribe)
print(wer.compute(predictions=dataset["prediction"], references=dataset["sentence"]))
```

*Results (WER, %)*:

| clean | other |
|---|---|
| 3.4 | 8.6 |

## Citation

If you want to cite this model, you can use the following:

```bibtex
?
```