wav2vec-vm-finetune

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for voicemail detection. It is trained on a dataset of call recordings to distinguish between voicemail greetings and live human responses.

Model description

This model builds on wav2vec2-xls-r-300m, a self-supervised speech model trained on large-scale multilingual data. We fine-tuned it on the first two seconds of audio from each call recording.

Intended uses & limitations

Automated voicemail detection in AI-powered call assistants.
Filtering voicemail responses in customer service and sales call automation.
Trained only on English-language audio.
Assumes the voicemail track is isolated and contains no audio from the caller.
Designed to operate on the first two seconds of call audio.
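
As a rough illustration of the two-second constraint above, the sketch below cuts a clip down to its first two seconds before it would be passed to the model. It is not taken from the training code: the helper name first_window, the 16 kHz rate (the rate wav2vec2 XLS-R checkpoints are normally used with), and the choice to right-pad short clips with zeros are all assumptions, and any resampling is assumed to happen beforehand.

```python
from typing import List, Sequence

# Assumption: audio has already been resampled to 16 kHz, the rate
# wav2vec2 XLS-R checkpoints are normally used with.
TARGET_SAMPLE_RATE = 16_000
# From the card: the model is designed for the first two seconds of audio.
WINDOW_SECONDS = 2.0


def first_window(samples: Sequence[float], sample_rate: int,
                 seconds: float = WINDOW_SECONDS) -> List[float]:
    """Return exactly `seconds` of audio taken from the start of `samples`.

    Longer clips are truncated. Shorter clips are right-padded with zeros
    (silence); this padding policy is an assumption, not something the
    card specifies.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    n = int(round(seconds * sample_rate))
    window = list(samples[:n])
    window.extend([0.0] * (n - len(window)))
    return window


if __name__ == "__main__":
    three_second_clip = [0.1] * (3 * TARGET_SAMPLE_RATE)
    clip = first_window(three_second_clip, TARGET_SAMPLE_RATE)
    print(len(clip))  # number of samples kept from the start of the call
```

The fixed-length output keeps every input the same shape, which makes batching simpler; whether short clips should instead be rejected depends on how the original training data was prepared.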

Training and evaluation data

The model was trained on a proprietary dataset of call recordings, labeled as:

Live human responses
Voicemail greetings

The dataset includes diverse voicemail recordings across multiple types to improve generalization.

Evaluation metrics

The model achieved:

98% accuracy on voicemail detection.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0003
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 10
mixed_precision_training: Native AMP
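
To show how the values above fit together, here is a minimal sketch of the effective batch size and of a linear schedule with warmup. The function names and the total_steps value in the usage lines are hypothetical (the card does not state the dataset size or step count), and num_devices=1 is an assumption that happens to be consistent with 16 x 2 = 32.

```python
BASE_LR = 3e-4          # learning_rate from the list above
WARMUP_STEPS = 500      # lr_scheduler_warmup_steps
PER_DEVICE_BATCH = 16   # train_batch_size
GRAD_ACCUM_STEPS = 2    # gradient_accumulation_steps


def effective_batch_size(per_device: int, accum_steps: int,
                         num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device * accum_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))
    # total_steps=5000 is a made-up value for illustration only.
    for step in (0, 250, 500, 2750, 5000):
        print(step, linear_lr(step, total_steps=5000))
```

With gradient accumulation, the optimizer steps once per two forward/backward passes, so the 500 warmup steps refer to optimizer updates rather than individual mini-batches.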

Framework versions

Transformers 4.48.2
PyTorch 2.5.1+cu124
Datasets 1.18.3
Tokenizers 0.21.0

jakeBland/wav2vec-vm-finetune