
Model Details

Model Name: Whisper_Small
Model Type: Speech-to-Text (Automatic Speech Recognition)
Base Model: OpenAI Whisper Small (openai/whisper-small)
Developed By: Aventiq AI
Date: February 24, 2025
Version: 1.0

Model Description

This is a fine-tuned and quantized version of the OpenAI Whisper Small model, optimized for speech recognition tasks. The model was fine-tuned on the SpeechOcean762 dataset and subsequently quantized to FP16 (half-precision floating-point) to reduce memory usage and improve inference speed while maintaining reasonable transcription accuracy.

Intended Use: General-purpose automatic speech recognition, particularly for English speech.
Primary Users: Researchers, developers, and practitioners working on speech-to-text applications.
Input: Audio files (16 kHz sampling rate recommended).
Output: Text transcriptions of spoken content.

Training Details
Dataset
Name: SpeechOcean762 (mispeech/speechocean762)
Description: A dataset of English speech recordings with corresponding transcriptions, designed for evaluating speech quality across multiple dimensions (accuracy, completeness, fluency, prosody).
Language: English
Training Procedure
Framework: Hugging Face Transformers
Hardware: [Specify if known, e.g., Single NVIDIA GPU with FP16 support]
Hyperparameters:
Batch Size: 8 (train/eval)
Epochs: 3
Learning Rate: 1e-5
Mixed Precision: FP16
Optimizer: AdamW (default Whisper settings)
Preprocessing: Audio resampled to 16kHz, converted to input features using WhisperProcessor.
Training Time: 2+ hours on a single GPU (a training-setup sketch follows below)
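
For reference, here is a minimal sketch of how the hyperparameters above map onto a Hugging Face Seq2SeqTrainer setup. The output directory and the model/dataset/collator variables are illustrative assumptions, not the exact training script:

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Hyperparameters as listed above; paths and dataset objects are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=1e-5,
    fp16=True,  # mixed-precision training
)

trainer = Seq2SeqTrainer(
    model=model,                  # WhisperForConditionalGeneration
    args=training_args,
    train_dataset=train_dataset,  # SpeechOcean762, preprocessed with WhisperProcessor
    eval_dataset=eval_dataset,
    data_collator=data_collator,  # pads input features and label ids
)
trainer.train()
```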
Quantization
Method: Post-training quantization to FP16 using PyTorch’s .half() method (see the sketch below).
Purpose: Reduce model size and improve inference speed.
Model Size:
Original: 967 MB
Quantized: 461 MB
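
A minimal sketch of this conversion step (paths are illustrative):

```python
from transformers import WhisperForConditionalGeneration

# Load the fine-tuned FP32 checkpoint and cast all weights to half precision.
model = WhisperForConditionalGeneration.from_pretrained("./whisper-small-finetuned")
model = model.half()

# Saving the FP16 model roughly halves the on-disk size (967 MB -> 461 MB).
model.save_pretrained("./whisper-small-finetuned-fp16")
```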
Evaluation
Metrics
Evaluation was performed using Word Error Rate (WER) and Character Error Rate (CER) on a test set of audio files with known transcriptions; a metric-computation sketch follows the results.
Results:
Average WER: 3.33
Average CER: 2.62
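
A minimal sketch of how such metrics can be computed with jiwer (the reference string is illustrative, and transcribe() is the helper defined in the usage example later in this card):

```python
import jiwer

# Hypothetical reference/hypothesis pair for a single test file.
reference = "the north wind and the sun were disputing"
hypothesis = transcribe("harvard.wav")

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```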

Example Performance
| Audio File  | Reference Text                  | Predicted Text                  | WER | CER |
|-------------|---------------------------------|---------------------------------|-----|-----|
| harvard.wav | "the north wind and the sun..." | "the north wind and the son..." | [X] | [Y] |

Usage

Requirements:
Python 3.8+
Dependencies: transformers, torch, librosa, jiwer
Hardware: CPU or GPU (CUDA support recommended for faster inference)

Installation:

```bash
pip install transformers torch librosa jiwer
```

Example Code

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

model_path = "./whisper-small-finetuned-fp16"
processor = WhisperProcessor.from_pretrained(model_path)

# Load the FP16 weights directly on GPU; fall back to FP32 on CPU,
# where half-precision inference is poorly supported. Without an explicit
# torch_dtype, from_pretrained would upcast the weights to FP32.
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = WhisperForConditionalGeneration.from_pretrained(model_path, torch_dtype=dtype)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def transcribe(audio_path):
    # Whisper expects 16 kHz mono audio; librosa resamples on load.
    audio, _ = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    inputs = inputs.to(device, dtype=model.dtype)  # match the model's precision
    with torch.no_grad():
        outputs = model.generate(inputs, max_length=448, num_beams=4)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Example usage
print(transcribe("harvard.wav"))
```
Saved Model
Location: ./whisper-small-finetuned-fp16
Files: pytorch_model.bin, config.json, preprocessor_config.json, etc.

Limitations

Language: Optimized for English; performance on other languages may vary.
Audio Quality: Best performance on clean, 16 kHz audio; may degrade with noisy or low-quality inputs.
Quantization Trade-off: FP16 quantization reduces model size but may slightly impact transcription accuracy compared to the full-precision model.
Domain: Fine-tuned on SpeechOcean762, which may not generalize perfectly to all speech domains (e.g., conversational, accented, or technical speech).
