---
license: apache-2.0
language:
- uk
- en
---

# ukr-t5-small

A compact mT5-small model fine-tuned for Ukrainian language tasks, with base English understanding.

## Model Description

* **Base Model:** mT5-small
* **Fine-tuning Data:** Leipzig Corpora Collection (English & Ukrainian news from 2023)
* **Tasks:**
  * Text summarization (Ukrainian)
  * Text generation (Ukrainian)
  * Other Ukrainian-centric NLP tasks

## Technical Details

* **Model Size:** 300 MB
* **Framework:** Transformers (Hugging Face)

## Usage

**Installation**

```bash
pip install transformers
```

**Loading the Model**

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")
```

**Example: Text Summarization**

```python
text = "(Text in Ukrainian here)"

# Tokenize with the summarization task prefix
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode output
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
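**Example: Text Generation**

Ukrainian text generation is also listed among the supported tasks. The sketch below reuses the tokenizer and model loaded above; the prompt format and sampling parameters are illustrative assumptions, since this card does not document the prompts used during fine-tuning.

```python
# Hypothetical prompt; adjust to whatever prompt format the model was trained on
prompt = "(Ukrainian prompt here)"

inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

# Sampling generally yields more varied text than beam search for open-ended generation
generated_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```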
## Limitations

* The model's focus is on Ukrainian text processing, so performance on purely English tasks may fall below that of general-purpose T5-small models.
* Further fine-tuning may be required for optimal results on specific NLP tasks; a rough sketch is given below.
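As a rough illustration of such task-specific fine-tuning, the sketch below uses the Hugging Face `Seq2SeqTrainer` on a summarization-style dataset. The dataset path, column names, and hyperparameters are placeholders, not this model's actual training recipe.

```python
# Hypothetical fine-tuning sketch; dataset and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")

# Placeholder dataset assumed to have "text" and "summary" columns and a "train" split
dataset = load_dataset("path/to/your-ukrainian-dataset")

def preprocess(batch):
    # Prefix inputs with the same task prefix used at inference time
    model_inputs = tokenizer(
        ["summarize: " + t for t in batch["text"]],
        max_length=512,
        truncation=True,
    )
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="ukr-t5-small-finetuned",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```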
## Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection, specifically its 2023 English and Ukrainian news corpora. For full licensing and usage information for the original dataset, please refer to the [Leipzig Corpora Collection website](https://wortschatz.uni-leipzig.de/en/download).

## Ethical Considerations

* NLP models can reflect biases present in their training data. Be mindful of this when using this model for applications that have real-world impact.
* Test this model thoroughly across a variety of Ukrainian language samples to evaluate its reliability and fairness.