language:
  - uk
license: mit
library_name: transformers
datasets:
  - skypro1111/ubertext-2-news-verbalized
widget:
  - text: >-
      Очікувалось, що цей застосунок буде запущено о 11 ранку 22.08.2025, але
      розробники затягнули святкування і запуск був відкладений на 2 тижні.

Model Card for mbart-large-50-verbalization

Model Description

mbart-large-50-verbalization is a fine-tuned version of the facebook/mbart-large-50 model, specifically designed for the task of verbalizing Ukrainian text to prepare it for Text-to-Speech (TTS) systems. This model aims to transform structured data like numbers and dates into their fully expanded textual representations in Ukrainian.

Architecture

This model is based on the facebook/mbart-large-50 architecture, a multilingual encoder-decoder (sequence-to-sequence) Transformer pretrained on 50 languages and widely used for translation and text generation.

Training Data

The model was fine-tuned on a subset of 457,610 sentences from the Ubertext dataset, focusing on news content. The verbalized equivalents were generated with Google Gemini Pro. The resulting dataset is published as skypro1111/ubertext-2-news-verbalized.

Training Procedure

The model underwent 410,000 training steps (1 epoch).
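
As a rough cross-check of the step count, the sketch below estimates the number of optimizer steps per epoch from the dataset size quoted above and the split fraction and batch size used in the training script that follows. It assumes a single device and no gradient accumulation, neither of which this card states, and `steps_per_epoch` is an illustrative helper rather than part of the original training code.

```python
import math


def steps_per_epoch(num_examples, test_size=0.1, per_device_batch_size=2,
                    num_devices=1, grad_accum_steps=1):
    """Estimate optimizer steps for one pass over the training split."""
    # Hold out the test fraction first, as train_test_split(test_size=0.1) does.
    num_test = math.ceil(num_examples * test_size)
    num_train = num_examples - num_test
    # Number of examples consumed per optimizer step.
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    return math.ceil(num_train / effective_batch)


# 457,610 sentences is the dataset size quoted in this card.
one_epoch = steps_per_epoch(457_610)
print(one_epoch)      # 205925
print(2 * one_epoch)  # 411850
```

Under these assumptions one epoch comes to 205,925 steps and two epochs to 411,850. More devices or gradient accumulation would lower the count proportionally.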

from transformers import MBartForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict
import torch

model_name = "facebook/mbart-large-50"

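# Request a single split so that train_test_split is available on the result
# (assumes the dataset exposes a "train" split).
dataset = load_dataset("skypro1111/ubertext-2-news-verbalized", split="train")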
dataset = dataset.train_test_split(test_size=0.1)
datasets = DatasetDict({
    'train': dataset['train'],
    'test': dataset['test']
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"


def preprocess_data(examples):
    # Tokenize the source sentences and, via text_target, the verbalized targets.
    # text_target replaces the deprecated as_target_tokenizer() context manager
    # and stores the target token ids under the "labels" key.
    model_inputs = tokenizer(
        examples["inputs"],
        text_target=examples["labels"],
        max_length=1024,
        truncation=True,
        padding="max_length",
    )
    return model_inputs

datasets = datasets.map(preprocess_data, batched=True)

model = MBartForConditionalGeneration.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir=f"./results/{model_name}-verbalization",
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=5000,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=40,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets["train"],
    eval_dataset=datasets["test"],
)

trainer.train()
trainer.save_model(f"./saved_models/{model_name}-verbalization")
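
The preprocessing above pads the labels to `max_length` with the tokenizer's pad token. A common variation, not used in the script above, is to replace the padding positions in the labels with -100, the default `ignore_index` of PyTorch's cross-entropy loss, so that padded positions do not contribute to the loss. The sketch below shows that step on plain Python lists; `mask_label_padding` is a hypothetical helper and the token ids, including the pad id of 1, are example values only.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_label_padding(label_ids, pad_token_id):
    """Replace padding ids in a batch of label sequences with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Toy batch: two label sequences padded to length 5 with pad id 1 (example value).
batch = [[5, 17, 42, 2, 1], [5, 99, 2, 1, 1]]
masked = mask_label_padding(batch, pad_token_id=1)
print(masked)
# [[5, 17, 42, 2, -100], [5, 99, 2, -100, -100]]
```

In a real run this would be applied to the tokenized label ids before they are stored under `model_inputs["labels"]`, with the pad id taken from `tokenizer.pad_token_id`.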

Usage

from transformers import MBartForConditionalGeneration, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "skypro1111/mbart-large-50-verbalization"

# Passing device_map requires the accelerate package to be installed.
model = MBartForConditionalGeneration.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    device_map=device,
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "uk_XX"
tokenizer.tgt_lang = "uk_XX"

input_text = "<verbalization>:Цей додаток вийде 15.06.2025."

encoded_input = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=1024).to(device)
output_ids = model.generate(**encoded_input, max_length=1024, num_beams=5, early_stopping=True)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(output_text)
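
The model was fine-tuned on individual sentences, and the example above prefixes its input with `<verbalization>:`. For longer passages, one option is to split the text into sentences and prefix each one before tokenizing. The sketch below does this with a deliberately naive regular expression; `prepare_inputs` is a hypothetical helper, and applying the same prefix to every sentence is an assumption based on the example above.

```python
import re

TASK_PREFIX = "<verbalization>:"  # prefix taken from the usage example above


def prepare_inputs(text):
    """Split text into sentences (naively) and prepend the task prefix to each."""
    # Split after ., ! or ? when followed by whitespace and an uppercase letter.
    # This keeps dates such as 15.06.2025 intact, but it will still mis-split
    # on abbreviations that are followed by a capitalized word.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-ZА-ЯІЇЄҐ])", text.strip())
    return [TASK_PREFIX + sentence for sentence in sentences if sentence]


inputs = prepare_inputs("Цей додаток вийде 15.06.2025. Запуск відкладено на 2 тижні.")
for line in inputs:
    print(line)
```

The resulting list can be passed to the tokenizer as a batch (with `padding=True`), and the generated ids decoded with `tokenizer.batch_decode(output_ids, skip_special_tokens=True)`.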

ONNX usage

poetry new verbalizer
rm -rf verbalizer/tests/ verbalizer/verbalizer/ verbalizer/README.md 
cd verbalizer/
poetry shell
wget https://huggingface.co/skypro1111/mbart-large-50-verbalization/resolve/main/onnx/infer_onnx_hf.py
poetry add transformers huggingface_hub onnxruntime-gpu torch
python infer_onnx_hf.py

import onnxruntime
import numpy as np
from transformers import AutoTokenizer
import time
import os
from huggingface_hub import hf_hub_download

model_name = "skypro1111/mbart-large-50-verbalization"


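def download_model_from_hf(repo_id=model_name, model_dir="./"):
    """Download ONNX models from HuggingFace Hub and return their local paths."""

    files = ["onnx/encoder_model.onnx", "onnx/decoder_model.onnx", "onnx/decoder_model.onnx_data"]

    # hf_hub_download returns the local path of each downloaded file, which
    # stays valid even when model_dir is not the current working directory.
    paths = []
    for file in files:
        paths.append(
            hf_hub_download(
                repo_id=repo_id,
                filename=file,
                local_dir=model_dir,
            )
        )

    return paths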

def create_onnx_session(model_path, use_gpu=True):
    """Create an ONNX inference session."""
    # Session options
    session_options = onnxruntime.SessionOptions()
    session_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    session_options.enable_mem_pattern = True
    session_options.enable_mem_reuse = True
    session_options.intra_op_num_threads = 8
    session_options.log_severity_level = 1
    
    cuda_provider_options = {
        'device_id': 0,
        'arena_extend_strategy': 'kSameAsRequested',
        'gpu_mem_limit': 0,  # 0 means no limit
        'cudnn_conv_algo_search': 'DEFAULT',
        'do_copy_in_default_stream': True,
    }
    
    print(f"Available providers: {onnxruntime.get_available_providers()}")
    if use_gpu and 'CUDAExecutionProvider' in onnxruntime.get_available_providers():
        providers = [('CUDAExecutionProvider', cuda_provider_options)]
        print("Using CUDA for inference")
    else:
        providers = ['CPUExecutionProvider']
        print("Using CPU for inference")
    
    session = onnxruntime.InferenceSession(
        model_path,
        providers=providers,
        sess_options=session_options
    )
    
    return session

def generate_text(text, tokenizer, encoder_session, decoder_session, max_length=128):
    """Generate text for a single input."""
    # Prepare input
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
    input_ids = inputs["input_ids"].astype(np.int64)
    attention_mask = inputs["attention_mask"].astype(np.int64)
    
    # Run encoder
    encoder_outputs = encoder_session.run(
        output_names=["last_hidden_state"],
        input_feed={
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
    )[0]
    
    # Initialize decoder input
    decoder_input_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
    
    # Generate sequence
    for _ in range(max_length):
        # Run decoder
        decoder_outputs = decoder_session.run(
            output_names=["logits"],
            input_feed={
                "input_ids": decoder_input_ids,
                "encoder_hidden_states": encoder_outputs,
                "encoder_attention_mask": attention_mask,
            }
        )[0]
        
        # Get next token
        next_token = decoder_outputs[:, -1:].argmax(axis=-1)
        decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
        
        # Check if sequence is complete
        if tokenizer.eos_token_id in decoder_input_ids[0]:
            break
    
    # Decode sequence
    output_text = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
    return output_text

def main():
    # Print available providers
    print("Available providers:", onnxruntime.get_available_providers())
    
    # Download models from HuggingFace
    print("\nDownloading models from HuggingFace...")
    encoder_path, decoder_path, _ = download_model_from_hf()
    
    # Load tokenizer and models
    print("\nLoading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.src_lang = "uk_UA"
    tokenizer.tgt_lang = "uk_UA"
    
    # Create ONNX sessions
    print("\nLoading encoder...")
    encoder_session = create_onnx_session(encoder_path)
    print("\nLoading decoder...")
    decoder_session = create_onnx_session(decoder_path)
    
    # Test examples
    test_inputs = [
        "мій телефон 0979456822",
        "квартира площею 11 тис кв м.",
        "Пропонували хабар у 1 млрд грн.",
        "1 2 3 4 5 6 7 8 9 10.",
        "Крім того, парламентарій володіє шістьма ділянками землі (дві площею 25000 кв м, дві по 15000 кв м та дві по 10000 кв м) розташованими в Сосновій Балці Луганської області.",
        "Підписуючи цей документ у 2003 році, голови Росії та України мали намір зміцнити співпрацю та сприяти розширенню двосторонніх відносин.",
        "Очікується, що цей застосунок буде запущено 22.08.2025.",
        "За інформацією від Державної служби з надзвичайних ситуацій станом на 7 ранку 15 липня.",
    ]
    
    print("\nWarming up...")
    _ = generate_text(test_inputs[0], tokenizer, encoder_session, decoder_session)
    
    print("\nRunning inference...")
    for text in test_inputs:
        print(f"\nInput: {text}")
        t = time.time()
        output = generate_text(text, tokenizer, encoder_session, decoder_session)
        print(f"Output: {output}")
        print(f"Time: {time.time() - t:.2f} seconds")

if __name__ == "__main__":
    main() 
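
The loop in `generate_text` is plain greedy decoding: at each step the decoder is run on the tokens generated so far, and the highest-scoring next token is appended until the end-of-sequence token appears. The sketch below reproduces that control flow with a stand-in scoring function instead of the ONNX decoder, so it runs without any model. `greedy_decode`, `toy_decoder`, and the token ids are made up for illustration.

```python
def greedy_decode(next_token_scores, start_id, eos_id, max_length=128):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    sequence = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(sequence)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        sequence.append(next_id)
        if next_id == eos_id:  # stop as soon as end-of-sequence is produced
            break
    return sequence


def toy_decoder(sequence):
    """Stand-in for the decoder: favors tokens 3, 4, then EOS (id 2)."""
    script = [3, 4, 2]
    target = script[min(len(sequence) - 1, len(script) - 1)]
    return [1.0 if token_id == target else 0.0 for token_id in range(5)]


print(greedy_decode(toy_decoder, start_id=1, eos_id=2))
# [1, 3, 4, 2]
```

Unlike the PyTorch usage example, which calls `generate` with `num_beams=5`, this loop keeps only the single best token at each step, so the two paths can produce different outputs. The script also reruns the decoder over the full prefix at every step, so the cost per step grows with the length of the generated sequence.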

Performance

No quantitative evaluation metrics were computed for this model. Its quality has been judged primarily through its use in making TTS output sound more natural.
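
In the absence of reported metrics, one lightweight sanity check is to verify that a verbalized output no longer contains digits, since numbers and dates are expected to be spelled out. The sketch below is a heuristic of that kind, not an evaluation the author reports; `leftover_digit_rate` is a hypothetical helper and the sample strings are hand-written.

```python
import re


def leftover_digit_rate(outputs):
    """Fraction of output strings that still contain at least one digit."""
    if not outputs:
        return 0.0
    with_digits = sum(1 for text in outputs if re.search(r"\d", text))
    return with_digits / len(outputs)


# Hand-written examples, not real model outputs.
samples = [
    "Цей додаток вийде п'ятнадцятого червня.",  # fully spelled out
    "квартира площею 11 тисяч квадратних метрів",  # digit left over
]
print(leftover_digit_rate(samples))
# 0.5
```

A rate above zero only flags outputs worth inspecting; it says nothing about whether the spelled-out words are grammatically correct.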

Limitations and Ethical Considerations

Users should be aware of the model's potential limitations in understanding highly nuanced or domain-specific content. Ethical considerations, including fairness and bias, are also crucial when deploying this model in real-world applications.

Citation

Ubertext 2.0

@inproceedings{chaplynskyi-2023-introducing,
  title = "Introducing {U}ber{T}ext 2.0: A Corpus of Modern {U}krainian at Scale",
  author = "Chaplynskyi, Dmytro",
  booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.unlp-1.1",
  pages = "1--10",
}

mBart-large-50

@article{tang2020multilingual,
    title={Multilingual Translation with Extensible Multilingual Pretraining and Finetuning},
    author={Yuqing Tang and Chau Tran and Xian Li and Peng-Jen Chen and Naman Goyal and Vishrav Chaudhary and Jiatao Gu and Angela Fan},
    year={2020},
    eprint={2008.00401},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

This model is released under the MIT License, in line with the base mbart-large-50 model.