	---
	language:
	- fr
	license: apache-2.0
	tags:
	- automatic-speech-recognition
	- hf-asr-leaderboard
	- robust-speech-event
	- mozilla-foundation/common_voice_11_0
	- facebook/multilingual_librispeech
	- facebook/voxpopuli
	- gigant/african_accented_french
	datasets:
	- common_voice
	- mozilla-foundation/common_voice_11_0
	- facebook/multilingual_librispeech
	- facebook/voxpopuli
	- gigant/african_accented_french
	metrics:
	- wer
	model-index:
	- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 11
	      type: mozilla-foundation/common_voice_11_0
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 11.44
	    - name: Test WER (+LM)
	      type: wer
	      value: 9.66
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Multilingual LibriSpeech (MLS)
	      type: facebook/multilingual_librispeech
	      args: french
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 5.93
	    - name: Test WER (+LM)
	      type: wer
	      value: 5.13
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: VoxPopuli
	      type: facebook/voxpopuli
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 9.33
	    - name: Test WER (+LM)
	      type: wer
	      value: 8.51
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: African Accented French
	      type: gigant/african_accented_french
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 16.22
	    - name: Test WER (+LM)
	      type: wer
	      value: 15.39
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Robust Speech Event - Dev Data
	      type: speech-recognition-community-v2/dev_data
	      args: fr
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 16.56
	    - name: Test WER (+LM)
	      type: wer
	      value: 12.96
	---

	# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

	This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large) for French speech recognition. It was trained on the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french), using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
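As a quick pre-check of the sampling-rate requirement, the sketch below reads the rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this repository, and the check assumes an uncompressed PCM WAV file.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate stated in this model card


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file is not already at the expected sampling rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling("example.wav")` returns `True`, the audio should be resampled to 16 kHz before it is passed to the processor.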


	## Usage

	1. To use on a local audio file with the language model

	```python
	import torch
	import torchaudio

	from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

	model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").cuda()
	processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")

	wav_path = "example.wav" # path to your audio file
	waveform, sample_rate = torchaudio.load(wav_path)
	# downmix to mono: average the channels and drop the channel dimension
	waveform = waveform.mean(dim=0)

	# resample to the 16 kHz rate the model expects
	if sample_rate != 16_000:
	    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
	    waveform = resampler(waveform)

	# normalize
	input_dict = processor_with_lm(waveform, sampling_rate=16_000, return_tensors="pt")

	with torch.inference_mode():
	    logits = model(input_dict.input_values.to("cuda")).logits

	# decode with the language model
	predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
	```

	2. To use on a local audio file without the language model

	```python
	import torch
	import torchaudio

	from transformers import AutoModelForCTC, Wav2Vec2Processor

	model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").cuda()
	processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")

	wav_path = "example.wav" # path to your audio file
	waveform, sample_rate = torchaudio.load(wav_path)
	# downmix to mono: average the channels and drop the channel dimension
	waveform = waveform.mean(dim=0)

	# resample to the 16 kHz rate the model expects
	if sample_rate != 16_000:
	    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
	    waveform = resampler(waveform)

	# normalize
	input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")

	with torch.inference_mode():
	    logits = model(input_dict.input_values.to("cuda")).logits

	# greedy CTC decoding: take the most likely token at each frame
	predicted_ids = torch.argmax(logits, dim=-1)
	predicted_sentence = processor.batch_decode(predicted_ids)[0]
	```


	## Evaluation
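The results in this card are reported as word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition, not the code used by `eval.py`. The function name `word_error_rate` is hypothetical, and the sketch applies no text normalization (casing, punctuation), which an actual evaluation would typically do first.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER of `hypothesis` against `reference` as a fraction."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Multiplying the returned fraction by 100 gives a percentage, which appears to be the scale used for the values in this card's metadata.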

	1. To evaluate on `mozilla-foundation/common_voice_11_0`

	```bash
	python eval.py \
	    --model_id "bhuang/asr-wav2vec2-french" \
	    --dataset "mozilla-foundation/common_voice_11_0" \
	    --config "fr" \
	    --split "test" \
	    --log_outputs \
	    --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
	```

	2. To evaluate on `speech-recognition-community-v2/dev_data`

	```bash
	python eval.py \
	    --model_id "bhuang/asr-wav2vec2-french" \
	    --dataset "speech-recognition-community-v2/dev_data" \
	    --config "fr" \
	    --split "validation" \
	    --chunk_length_s 30.0 \
	    --stride_length_s 5.0 \
	    --log_outputs \
	    --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
	```
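The `--chunk_length_s` and `--stride_length_s` flags in the second command suggest that long recordings are transcribed in overlapping chunks. The sketch below illustrates one common way such chunking is laid out, assuming each chunk overlaps its neighbours by the stride on both sides and that only the non-overlapping middle of each chunk is kept. The function `chunk_windows` is hypothetical and is not taken from `eval.py` or from any library.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end, keep_start, keep_end) tuples in seconds.

    `start`/`end` delimit the audio fed to the model for one chunk;
    `keep_start`/`keep_end` delimit the part of its output that is kept.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The first chunk has no left context to discard,
        # and the last chunk has no right context to discard.
        keep_start = start if is_first else start + stride_length_s
        keep_end = end if is_last else end - stride_length_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows
```

With the values from the command above, `chunk_windows(70.0)` describes three chunks whose kept regions tile the full 70 seconds without gaps or overlap, so every chunk boundary is decoded with surrounding context.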