language:
- uk
license: mit
library_name: transformers
datasets:
- skypro1111/ubertext-2-news-verbalized
widget:
- text: >-
    Очікувалось, що цей застосунок буде запущено о 11 ранку 22.08.2025, але
    розробники затягнули святкування і запуск був відкладений на 2 тижні.
Model Card for mbart-large-50-verbalization
Model Description
mbart-large-50-verbalization is a fine-tuned version of the facebook/mbart-large-50 model, designed to verbalize Ukrainian text in preparation for Text-to-Speech (TTS) systems. It transforms structured data such as numbers and dates into their fully expanded textual representations in Ukrainian.
Architecture
This model is based on the facebook/mbart-large-50 architecture, a multilingual sequence-to-sequence (encoder-decoder) Transformer pretrained on 50 languages and commonly used for translation and text generation.
Training Data
The model was fine-tuned on a subset of 457,610 sentences from the UberText 2.0 corpus, focusing on news content. The verbalized targets were generated with Google Gemini Pro. Dataset: skypro1111/ubertext-2-news-verbalized
Training Procedure
The model underwent 410,000 training steps (1 epoch).
from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch

model_name = "facebook/mbart-large-50"

# Load a single split so that train_test_split() is available: load_dataset()
# without `split` returns a DatasetDict, which has no such method.
# Assumption: the dataset exposes its data under a "train" split.
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized", split="train")
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
    'train': dataset['train'],
    'test': dataset['test']
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"

def preprocess_data(examples):
    # Tokenize sources and targets in one call; `text_target` replaces the
    # deprecated `as_target_tokenizer()` context manager and fills "labels".
    model_inputs = tokenizer(
        examples["inputs"],
        text_target=examples["labels"],
        max_length=1024,
        truncation=True,
        padding="max_length",
    )
    return model_inputs

datasets = datasets.map(preprocess_data, batched=True)

model = MBartForConditionalGeneration.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=f"./results/{model_name}-verbalization",
    evaluation_strategy="steps",  # named `eval_strategy` in recent transformers releases
    eval_steps=5000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=40,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
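The relationship between dataset size, split, batch size, and step count in the configuration above can be sketched as follows. This is a rough estimate only: it assumes a single device and no gradient accumulation, neither of which the card states, and `estimate_training_steps` is a hypothetical helper rather than part of any library. Under those assumptions the configured two epochs come to roughly 412,000 optimizer steps (about 206,000 per epoch); a different device count or accumulation setting would change the figure.

```python
import math


def estimate_training_steps(num_examples, test_size, per_device_batch_size,
                            num_epochs, num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps for a run (assumes the last partial batch is kept)."""
    num_test = math.ceil(num_examples * test_size)  # held-out examples
    num_train = num_examples - num_test
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_train / effective_batch)
    return num_train, steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from the card; num_devices and grad_accum_steps are assumptions.
num_train, per_epoch, total = estimate_training_steps(
    num_examples=457_610, test_size=0.1, per_device_batch_size=2, num_epochs=2)
print(f"train examples: {num_train}, steps/epoch: {per_epoch}, total steps: {total}")
```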
Usage
from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "skypro1111/mbart-large-50-verbalization"
model = MBartForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    device_map=device,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"
input_text = "<verbalization>:Цей додаток вийде 15.06.2025."
encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
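The usage example prepends a `<verbalization>:` prefix to the input. A minimal sketch of preparing several sentences at once is shown below, assuming the prefix is expected on every input as in that example; `with_prefix` and `batched` are hypothetical helper names, not part of transformers.

```python
from typing import Iterable, Iterator, List

PREFIX = "<verbalization>:"  # task prefix taken from the usage example


def with_prefix(text: str, prefix: str = PREFIX) -> str:
    """Prepend the task prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def batched(texts: Iterable[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield lists of prefixed inputs, at most batch_size items each."""
    batch: List[str] = []
    for text in texts:
        batch.append(with_prefix(text))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


sentences = [
    "Цей додаток вийде 15.06.2025.",
    "Очікується, що цей застосунок буде запущено 22.08.2025.",
]
for chunk in batched(sentences, batch_size=1):
    print(chunk)
```

Each chunk can then be passed to `tokenizer(chunk, return_tensors="pt", padding=True, truncation=True, max_length=1024)` and the generated ids decoded with `tokenizer.batch_decode(output_ids, skip_special_tokens=True)`.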
ONNX usage
poetry new verbalizer
rm -rf verbalizer/tests/ verbalizer/verbalizer/ verbalizer/README.md
cd verbalizer/
poetry shell
wget https://huggingface.co/skypro1111/mbart-large-50-verbalization/resolve/main/onnx/infer_onnx_hf.py
poetry add transformers huggingface_hub onnxruntime-gpu torch
python infer_onnx_hf.py
import onnxruntime
import numpy as np
from transformers import AutoTokenizer
import time
import os
from huggingface_hub import hf_hub_download
model_name = "skypro1111/mbart-large-50-verbalization"
def download_model_from_hf(repo_id=model_name, model_dir="./"):
    """Download ONNX models from HuggingFace Hub."""
    files = ["onnx/encoder_model.onnx", "onnx/decoder_model.onnx", "onnx/decoder_model.onnx_data"]

    for file in files:
        hf_hub_download(
            repo_id=repo_id,
            filename=file,
            local_dir=model_dir,
        )

    return files
def create_onnx_session(model_path, use_gpu=True):
    """Create an ONNX inference session."""
    # Session options
    session_options = onnxruntime.SessionOptions()
    session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    session_options.enable_mem_pattern = True
    session_options.enable_mem_reuse = True
    session_options.intra_op_num_threads = 8
    session_options.log_severity_level = 1

    cuda_provider_options = {
        'device_id': 0,
        'arena_extend_strategy': 'kSameAsRequested',
        'gpu_mem_limit': 0,  # 0 means no limit
        'cudnn_conv_algo_search': 'DEFAULT',
        'do_copy_in_default_stream': True,
    }

    print(f"Available providers: {onnxruntime.get_available_providers()}")
    if use_gpu and 'CUDAExecutionProvider' in onnxruntime.get_available_providers():
        providers = [('CUDAExecutionProvider', cuda_provider_options)]
        print("Using CUDA for inference")
    else:
        providers = ['CPUExecutionProvider']
        print("Using CPU for inference")

    session = onnxruntime.InferenceSession(
        model_path,
        providers=providers,
        sess_options=session_options
    )

    return session
def generate_text(text, tokenizer, encoder_session, decoder_session, max_length=128):
    """Generate text for a single input."""
    # Prepare input
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
    input_ids = inputs["input_ids"].astype(np.int64)
    attention_mask = inputs["attention_mask"].astype(np.int64)

    # Run encoder
    encoder_outputs = encoder_session.run(
        output_names=["last_hidden_state"],
        input_feed={
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
    )[0]

    # Initialize decoder input
    decoder_input_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)

    # Generate sequence
    for _ in range(max_length):
        # Run decoder
        decoder_outputs = decoder_session.run(
            output_names=["logits"],
            input_feed={
                "input_ids": decoder_input_ids,
                "encoder_hidden_states": encoder_outputs,
                "encoder_attention_mask": attention_mask,
            }
        )[0]

        # Get next token
        next_token = decoder_outputs[:, -1:].argmax(axis=-1)
        decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)

        # Check if sequence is complete
        if tokenizer.eos_token_id in decoder_input_ids[0]:
            break

    # Decode sequence
    output_text = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
    return output_text
def main():
    # Print available providers
    print("Available providers:", onnxruntime.get_available_providers())

    # Download models from HuggingFace
    print("\nDownloading models from HuggingFace...")
    encoder_path, decoder_path, _ = download_model_from_hf()

    # Load tokenizer and models
    print("\nLoading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.src_lang = "uk_UA"
    tokenizer.tgt_lang = "uk_UA"

    # Create ONNX sessions
    print("\nLoading encoder...")
    encoder_session = create_onnx_session(encoder_path)
    print("\nLoading decoder...")
    decoder_session = create_onnx_session(decoder_path)

    # Test examples
    test_inputs = [
        "мій телефон 0979456822",
        "квартира площею 11 тис кв м.",
        "Пропонували хабар у 1 млрд грн.",
        "1 2 3 4 5 6 7 8 9 10.",
        "Крім того, парламентарій володіє шістьма ділянками землі (дві площею 25000 кв м, дві по 15000 кв м та дві по 10000 кв м) розташованими в Сосновій Балці Луганської області.",
        "Підписуючи цей документ у 2003 році, голови Росії та України мали намір зміцнити співпрацю та сприяти розширенню двосторонніх відносин.",
        "Очікується, що цей застосунок буде запущено 22.08.2025.",
        "За інформацією від Державної служби з надзвичайних ситуацій станом на 7 ранку 15 липня.",
    ]

    print("\nWarming up...")
    _ = generate_text(test_inputs[0], tokenizer, encoder_session, decoder_session)

    print("\nRunning inference...")
    for text in test_inputs:
        print(f"\nInput: {text}")
        t = time.time()
        output = generate_text(text, tokenizer, encoder_session, decoder_session)
        print(f"Output: {output}")
        print(f"Time: {time.time() - t:.2f} seconds")


if __name__ == "__main__":
    main()
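The generation loop in the script above is plain greedy decoding: at each step the decoder scores the next token, the highest-scoring id is appended, and the loop stops at the end-of-sequence token or after `max_length` steps. The sketch below isolates that logic in pure Python with a toy scoring function standing in for the ONNX decoder; `greedy_decode` and `toy_scores` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def greedy_decode(next_token_scores: Callable[[Sequence[int]], Sequence[float]],
                  start_token_id: int, eos_token_id: int,
                  max_length: int = 128) -> List[int]:
    """Append the highest-scoring token until EOS or max_length new tokens."""
    tokens = [start_token_id]
    for _ in range(max_length):
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(next_id)
        if next_id == eos_token_id:  # only the newly added token needs checking
            break
    return tokens


def toy_scores(tokens, vocab_size=6):
    """Toy stand-in for the decoder: prefers (last token + 1), capped at the last id."""
    target = min(tokens[-1] + 1, vocab_size - 1)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_scores, start_token_id=1, eos_token_id=4))  # prints [1, 2, 3, 4]
```

As in the script, every step re-scores the whole prefix, so cost grows with output length; a decoder export that accepts cached key/value states would avoid the recomputation, at the price of a more involved input feed.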
Performance
No quantitative evaluation metrics were computed for this model. Its quality has been judged mainly by its effect on the naturalness of TTS output.
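In the absence of reported metrics, one simple sanity check a user could run is the share of outputs that still contain digits. The sketch below is not part of the original evaluation and rests on the assumption that a fully verbalized sentence should contain no digit characters; `leftover_digit_rate` is a hypothetical helper.

```python
import re

_DIGIT = re.compile(r"\d")


def leftover_digit_rate(outputs):
    """Return the fraction of outputs that still contain at least one digit."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    flagged = sum(1 for text in outputs if _DIGIT.search(text))
    return flagged / len(outputs)


# One fully spelled-out string and one that still contains "11".
print(leftover_digit_rate(["двадцять два", "о 11 ранку"]))  # prints 0.5
```

A non-zero rate only flags candidates for manual review; it says nothing about whether the spelled-out forms are grammatically correct.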
Limitations and Ethical Considerations
Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.
Citation
Ubertext 2.0
@inproceedings{chaplynskyi-2023-introducing,
title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
author = "Chaplynskyi, Dmytro",
booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.unlp-1.1",
pages = "1--10",
}
mBart-large-50
@article{tang2020multilingual,
title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
year={2020},
eprint={2008.00401},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
License
This model is released under the MIT License, in line with the base mbart-large-50 model.