
ukr-t5-small

A compact mT5-small model fine-tuned for Ukrainian language tasks, while retaining the base model's English understanding.

Model Description

  • Base Model: mT5-small
  • Fine-tuning Data: Leipzig Corpora Collection (English & Ukrainian news from 2023)
  • Tasks:
    • Text summarization (Ukrainian)
    • Text generation (Ukrainian)
    • Other Ukrainian-centric NLP tasks

Technical Details

  • Model Size: ~300 MB (74.8M parameters, F32, Safetensors)
  • Framework: Transformers (Hugging Face)

Usage

Installation

pip install transformers
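
Depending on how the checkpoint's tokenizer files are packaged, the mT5 tokenizer may also require the sentencepiece package, and a backend such as PyTorch is needed to actually run the model:

pip install sentencepiece torch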

Loading the Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")
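
If a GPU is available, the model can be moved to it for faster inference; this is standard PyTorch usage rather than anything specific to this checkpoint:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Input tensors must be moved to the same device before calling generate().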

Example: Summarization

# Replace with the Ukrainian text to be summarized
text = "(Text in Ukrainian here)"

# Prepend the T5-style task prefix and tokenize
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

# Generate a summary with beam search
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode output 
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
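
Example: Text Generation

The card lists Ukrainian text generation among the supported tasks. The sketch below assumes the model responds to a plain Ukrainian prompt; the prompt format and sampling settings are illustrative, not documented behaviour of this checkpoint:

# Replace with your own Ukrainian prompt
prompt = "(Ukrainian prompt here)"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

# Sample rather than beam-search for more varied continuations
generated_ids = model.generate(
    inputs["input_ids"],
    do_sample=True,
    top_p=0.95,
    max_length=128,
)

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))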

Limitations

  • Because the model is focused on Ukrainian text processing, performance on purely English tasks may fall below that of general-purpose T5-small models.
  • Further fine-tuning may be required for optimal results on specific NLP tasks; a minimal sketch follows below.
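
Fine-tuning

A minimal continued fine-tuning sketch using the Hugging Face Seq2SeqTrainer. The training pairs, task prefix, and hyperparameters below are placeholders for illustration, not the values used to train this checkpoint:

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Hypothetical (input, target) pairs; substitute your own task data.
pairs = [
    ("summarize: (long Ukrainian text)", "(short Ukrainian summary)"),
]

def encode(source, target):
    enc = tokenizer(source, max_length=512, truncation=True)
    enc["labels"] = tokenizer(target, max_length=128, truncation=True)["input_ids"]
    return enc

train_dataset = [encode(src, tgt) for src, tgt in pairs]

args = Seq2SeqTrainingArguments(
    output_dir="ukr-t5-small-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-4,
)

trainer = Seq2SeqTrainer(
    model=model,  # model and tokenizer loaded as shown above
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()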

Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection, specifically the 2023 English and Ukrainian news subsets. For full licensing and usage information for the original dataset, please refer to the Leipzig Corpora Collection website.

Ethical Considerations

  • NLP models can reflect biases present in their training data. Be mindful of this when using this model for applications that have real-world impact.
  • It's important to test this model thoroughly across a variety of Ukrainian language samples to evaluate its reliability and fairness.