WhisperLevantineArabic

Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)

Model Description

This model is a fine-tuned version of Whisper Medium tailored specifically for transcribing Levantine Arabic, focusing on the Israeli dialect. It is designed to improve automatic speech recognition (ASR) performance for this particular variant of Arabic.

Base Model: Whisper Medium
Fine-tuned for: Levantine Arabic (Israeli Dialect)
WER on test set: 10%
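
For reference, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration, not the evaluation script behind the figure above: it assumes plain whitespace tokenization with no text normalization, `word_error_rate` is a hypothetical helper name, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Made-up example: one substituted word in a ten-word reference.
reference = "the model transcribes ten words of speech in this example"
hypothesis = "the model transcribes ten words of speech in that example"
print(word_error_rate(reference, hypothesis))  # 0.1
```

A WER of 10% therefore corresponds to roughly one word-level error (substitution, deletion, or insertion) per ten reference words.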

Training Data

The model was fine-tuned on approximately 2,200 hours of transcribed audio, primarily Israeli Levantine Arabic, along with some general Levantine Arabic content. The data sources include:

Self-maintained Collection: 2,000 hours of audio data curated by the team, covering a wide range of Israeli Levantine Arabic speech.
MGB-2 Corpus (Filtered): 200 hours of broadcast media in Arabic.
CommonVoice18 (Filtered): A filtered portion of the CommonVoice18 dataset.

The MGB-2 and CommonVoice18 subsets were filtered with the AlcLaM Arabic language model to keep content relevant to Levantine Arabic.

Total Dataset Size: ~2,200 hours
Sampling Rate: 16 kHz
Annotation: Human-transcribed and annotated for high accuracy.

How to Use

The model expects 16 kHz audio input, so resample files recorded at other rates before transcription. You can load the model and transcribe a file as follows:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
processor = WhisperProcessor.from_pretrained("HebArabNlpProject/whisperLevantine")
model = WhisperForConditionalGeneration.from_pretrained("HebArabNlpProject/whisperLevantine").to(device)

# Example usage: processing audio input
file_path = ...  # wav filepath goes here
audio_input, samplerate = torchaudio.load(file_path)  # shape: (channels, samples)
audio_input = audio_input.mean(dim=0)  # downmix to a single mono channel
if samplerate != 16000:
    # The Whisper feature extractor only accepts 16 kHz audio
    audio_input = torchaudio.functional.resample(audio_input, samplerate, 16000)
inputs = processor(audio_input.numpy(), return_tensors="pt", sampling_rate=16000).to(device)

# Run inference
with torch.no_grad():
    generated_ids = model.generate(inputs["input_features"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription[0])
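
Because the model expects 16 kHz input, it can be useful to check a file's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, so it applies to uncompressed PCM WAV files only. The helper names `wav_sample_rate`, `needs_resampling`, and `write_silence` are hypothetical, and the demo files contain generated silence rather than real speech.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sample rate the model expects, per this card


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_silence(path, rate, seconds=1):
    """Write a mono 16-bit WAV of silence; used only for the demo below."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)


# Demo: create two short files at different rates and check them.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_8k = os.path.join(tmp_dir, "clip_8k.wav")
    path_16k = os.path.join(tmp_dir, "clip_16k.wav")
    write_silence(path_8k, 8_000)
    write_silence(path_16k, 16_000)
    rate_8k = wav_sample_rate(path_8k)
    flag_8k = needs_resampling(path_8k)
    flag_16k = needs_resampling(path_16k)

print(rate_8k, flag_8k, flag_16k)  # 8000 True False
```

Files flagged by `needs_resampling` should be converted to 16 kHz before being passed to the processor.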
