---
license: mit
language:
- ko
metrics:
- wer
- cer
tags:
- transcribe
- whisper
---
|
|
|
# Fine-tuned Whisper-small for Korean Speech Recognition on sample data (PoC)
|
|
|
Fine-tuning was performed using sample voices recorded from the metadata in this CSV file (https://github.com/hyeonsangjeon/job-transcribe/blob/main/meta_voice_data_3922.csv).

The sample voices themselves are not published, so if you want to fine-tune from scratch, please record your own audio or use a public dataset.
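The CSV above holds only metadata; the audio files are recorded separately. If you want to prepare your own data in the same way, a minimal sketch for inspecting the metadata with pandas is shown below. It assumes the standard raw.githubusercontent.com mirror of the linked file; the column layout is not documented here, so treat the printed columns as the source of truth.

```python
import pandas as pd

# Metadata CSV referenced above (assumed raw mirror of the GitHub blob URL;
# the audio files themselves are not included in the repository)
url = (
    "https://raw.githubusercontent.com/hyeonsangjeon/job-transcribe/"
    "main/meta_voice_data_3922.csv"
)

# Inspect the metadata before recording or mapping your own audio files
df = pd.read_csv(url)
print(df.shape)
print(df.columns.tolist())
print(df.head())
```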
|
|
|
Fine-tuning followed the guide at https://huggingface.co/blog/fine-tune-whisper.

[Note] In the voice recordings used for training, the speaker spoke clearly and slowly, as if reading a textbook.
|
|
|
## Training

### Base model

OpenAI's `whisper-small` (https://huggingface.co/openai/whisper-small)
|
|
|
### Parameters

We used heuristic parameters without separate hyperparameter tuning. The sampling rate is set to 16,000 Hz. A rough mapping to training arguments is sketched after the list.

- learning_rate = 2e-5
- epochs = 5
- gradient_accumulation_steps = 4
- per_device_train_batch_size = 4
- fp16 = True
- gradient_checkpointing = True
- generation_max_length = 225
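These values map roughly onto the `Seq2SeqTrainingArguments` used in the Hugging Face fine-tuning guide linked above. The sketch below is illustrative rather than the exact training script; the epoch count is expressed via `num_train_epochs`, and the output directory is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the parameters above onto Seq2SeqTrainingArguments
# (following the Hugging Face Whisper fine-tuning guide; not the exact script)
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ko-finetuned",  # placeholder path
    learning_rate=2e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,    # generate text during eval so WER/CER can be computed
    generation_max_length=225,
)
```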
|
|
|
## Usage

You need to install the librosa package in order to load the wave file and resample it to 16 kHz (`pip install librosa`); the log-Mel spectrogram itself is computed by `WhisperProcessor`.

### inference.py
|
|
|
```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Prepare your sample data (.wav)
file = "nlp-voice-3922/data/0002d3428f0ddfa5a48eec5cc351daa8.wav"

# Load the audio and resample it to 16 kHz (the rate Whisper expects)
arr, sampling_rate = librosa.load(file, sr=16000)

# Load the Whisper processor and the fine-tuned model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("daekeun-ml/whisper-small-ko-finetuned-single-speaker-3922samples")

# Preprocessing: compute log-Mel input features
input_features = processor(arr, return_tensors="pt", sampling_rate=sampling_rate).input_features

# Prediction: force Korean transcription, then decode the generated token ids
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ko", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
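Since the model card lists WER and CER as metrics, you may also want to score transcriptions against reference text. A minimal sketch using the `evaluate` library (`pip install evaluate jiwer`) is shown below; the reference string is a made-up placeholder, and `transcription` is the output of `inference.py` above.

```python
import evaluate

# WER/CER metrics as listed in the model card (requires: pip install evaluate jiwer)
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical example: compare the model output against a reference transcript
references = ["안녕하세요 만나서 반갑습니다"]  # placeholder reference text
predictions = transcription                    # list of strings from batch_decode

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```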