---
base_model: facebook/w2v-bert-2.0
language:
- uk
tags:
- automatic-speech-recognition
datasets:
- mozilla-foundation/common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: WER
      type: wer
      value: 6.6
    - name: CER
      type: cer
      value: 1.34
license: apache-2.0
---

🚨🚨🚨 **ATTENTION!** 🚨🚨🚨

**Use an updated model**: https://huggingface.co/Yehor/w2v-bert-uk-v2.1

---

# w2v-bert-uk `v1`

## Community

- **Discord**: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk

See other Ukrainian models: https://github.com/egorsmkv/speech-recognition-uk

## Google Colab

You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

## Metrics

- AM (F16):
  - WER: 0.066 (6.6%)
  - CER: 0.013 (1.34%)
- Accuracy on words: 93.4%
- Accuracy on chars: 98.7%

## Hyperparameters

This model was trained with the following hyperparameters on 2× RTX A4000 GPUs:

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
```

## Usage

```python
# pip install -U torch soundfile transformers

import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:1'  # second GPU; use 'cuda:0' or 'cpu' as needed
sampling_rate = 16_000  # the audio files must already be sampled at 16 kHz

# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

paths = [
    'sample1.wav',
]

# Extract audio (the sample rate returned by sf.read is ignored here)
audio_inputs = []
for path in paths:
    audio_input, _ = sf.read(path)
    audio_inputs.append(audio_input)

# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)

with torch.no_grad():
    logits = asr_model(features).logits

# Greedy CTC decoding: pick the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)

# Log results
print('Predictions:')
print(predictions)
```

## Cite this work

```
@misc {smoliakov_2025,
	author       = { {Smoliakov} },
	title        = { w2v-bert-uk (Revision e5a17ab) },
	year         = 2025,
	url          = { https://huggingface.co/Yehor/w2v-bert-uk },
	doi          = { 10.57967/hf/4560 },
	publisher    = { Hugging Face }
}
```
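
## How WER and CER relate to accuracy (sketch)

The Metrics section reports WER and CER together with word and character accuracy, and the accuracy figures match `1 - error rate` (for example, 100% - 6.6% = 93.4%). The evaluation script used for this model is not shown in this card, so the code below is only a minimal sketch of the standard definitions: edit distance summed over the corpus, divided by the total reference length. The helper names `edit_distance` and `error_rate` and the two sample sentences are illustrative assumptions, not part of the original training code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits / total reference units."""
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return total_edits / total_units


# Hypothetical reference transcripts and model outputs
references = ["привіт світ", "як справи"]
hypotheses = ["привіт світ", "як справа"]

wer = error_rate(references, hypotheses, unit="word")  # 1 edit / 4 words
cer = error_rate(references, hypotheses, unit="char")  # 1 edit / 20 characters

print(f"WER: {wer:.2%}, accuracy on words: {1 - wer:.2%}")
print(f"CER: {cer:.2%}, accuracy on chars: {1 - cer:.2%}")
```

For real evaluations, text normalization (casing, punctuation) applied before scoring can change these numbers noticeably, so results are only comparable when the same normalization is used.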