---
base_model: facebook/w2v-bert-2.0
language:
- uk
tags:
- automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: WER
      type: wer
      value: 6.6
    - name: CER
      type: cer
      value: 1.34
license: apache-2.0
---

🚨🚨🚨 **ATTENTION!** 🚨🚨🚨

**Use an updated model**: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

---

# w2v-bert-uk `v1`

## Community

- **Discord**: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk

See other Ukrainian models: https://github.com/egorsmkv/speech-recognition-uk

## Google Colab

You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

## Metrics

- AM (F16):
  - WER: 0.066 (6.6%)
  - CER: 0.0134 (1.34%)
  - Accuracy on words: 93.4%
  - Accuracy on chars: 98.7%
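
The accuracy figures above are simply `1 - error rate` (for example, 100% - 6.6% = 93.4% for words). As a rough illustration of how WER and CER are defined, here is a minimal pure-Python sketch based on Levenshtein distance. It is not the evaluation script used for this model (that tool is not stated here), and the sentence pair is a made-up example, not taken from the test set.

```python
def edit_distance(ref, hyp):
  """Levenshtein distance between two sequences."""
  prev = list(range(len(hyp) + 1))
  for i, r in enumerate(ref, start=1):
    curr = [i]
    for j, h in enumerate(hyp, start=1):
      cost = 0 if r == h else 1
      curr.append(min(
        prev[j] + 1,         # deletion
        curr[j - 1] + 1,     # insertion
        prev[j - 1] + cost,  # substitution or match
      ))
    prev = curr
  return prev[-1]


def wer(reference, hypothesis):
  """Word error rate: word-level edits divided by reference word count."""
  ref_words = reference.split()
  return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
  """Character error rate: char-level edits divided by reference length."""
  return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example pair (assumption, not real test data)
reference = 'привіт як справи'
hypothesis = 'привіт як справа'

word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
print(f'WER: {word_error:.3f}, accuracy on words: {1 - word_error:.3f}')
print(f'CER: {char_error:.3f}, accuracy on chars: {1 - char_error:.3f}')
```

Real evaluations usually normalize text (case, punctuation) before scoring, which this sketch leaves out.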

## Hyperparameters

This model was trained on 2 RTX A4000 GPUs with the following hyperparameters:

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
```
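
For reference, the effective batch size implied by this command can be worked out as below. This assumes that `--batch` is the per-GPU batch size, which is a guess about the training script rather than something stated here; under that assumption the result is 8 samples per optimizer step.

```python
# Values taken from the torchrun command above
nproc_per_node = 2  # --nproc-per-node (one process per GPU)
batch = 4           # --batch, ASSUMED to be the per-GPU batch size
grad_accum = 1      # --grad_accum

effective_batch_size = nproc_per_node * batch * grad_accum
print(effective_batch_size)
```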

## Usage

```python
# pip install -U torch soundfile transformers

import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
sampling_rate = 16_000

# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

paths = [
  'sample1.wav',
]

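# Load audio (files must already be sampled at 16 kHz)
audio_inputs = []
for path in paths:
  audio_input, file_sampling_rate = sf.read(path)
  if file_sampling_rate != sampling_rate:
    raise ValueError(f'{path}: expected {sampling_rate} Hz, got {file_sampling_rate} Hz')
  audio_inputs.append(audio_input)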

# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)

with torch.no_grad():
  logits = asr_model(features).logits

predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)

# Log results
print('Predictions:')
print(predictions)
```
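
The `argmax` and `batch_decode` steps in the snippet above amount to greedy CTC decoding: pick the most likely token per frame, collapse consecutive repeats, then drop the blank token. The sketch below shows that idea in plain Python. The toy vocabulary, the blank id `0`, and the `|` word delimiter are assumptions made for illustration; the real values come from the model's tokenizer, and this is not the library's implementation.

```python
from itertools import groupby


def greedy_ctc_decode(ids, id_to_token, blank_id=0, word_delimiter='|'):
  """Collapse repeated ids, drop blanks, and map the rest to text."""
  collapsed = [token_id for token_id, _ in groupby(ids)]
  tokens = [id_to_token[i] for i in collapsed if i != blank_id]
  return ''.join(tokens).replace(word_delimiter, ' ').strip()


# Hypothetical toy vocabulary (assumption, not the model's real vocabulary)
id_to_token = {0: '<pad>', 1: '|', 2: 'т', 3: 'а', 4: 'к'}

# One id per audio frame, as an argmax over the logits would produce
frame_ids = [0, 2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 4, 4]
print(greedy_ctc_decode(frame_ids, id_to_token))
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.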

## Cite this work

```
@misc {smoliakov_2025,
	author       = { {Smoliakov} },
	title        = { w2v-bert-uk (Revision e5a17ab) },
	year         = 2025,
	url          = { https://huggingface.co/Yehor/w2v-bert-uk },
	doi          = { 10.57967/hf/4560 },
	publisher    = { Hugging Face }
}
```