---
datasets:
- leipzig

language:
- hr
- sr

tags:
- masked-lm

widget:
- text: "Gde je <mask>."

license: apache-2.0
---

# Transformer language model for Croatian and Serbian

Trained for one epoch on a 0.7 GB dataset of Croatian and Serbian text.

The dataset comes from the Leipzig Corpora.

# Model and dataset information

| Model              | #params | Arch. | Training data                   |
|--------------------|---------|-------|---------------------------------|
| `Andrija/SRoBERTa` | 120M    | First | Leipzig Corpus (0.7 GB of text) |

# How to use in code

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer that matches the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained("Andrija/SRoBERTa")

# Load the model with its masked-language-modeling head
model = AutoModelForMaskedLM.from_pretrained("Andrija/SRoBERTa")
```
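
To get predictions for a masked token, the model can also be wrapped in the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, not part of the original card: the helper names `load_fill_mask` and `top_predictions` are hypothetical, and the `<mask>` token is assumed from the widget example above.

```python
def load_fill_mask(model_name="Andrija/SRoBERTa"):
    """Build a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so top_predictions can be used without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_name)


def top_predictions(fill_mask, text, k=3):
    """Return (token, score) pairs for the k most likely fillers of the mask.

    `fill_mask` is any callable that behaves like a fill-mask pipeline:
    it returns a list of dicts with "token_str" and "score" keys.
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in results]


if __name__ == "__main__":
    fill_mask = load_fill_mask()
    # "<mask>" is assumed to be this tokenizer's mask token (taken from the widget text).
    for token, score in top_predictions(fill_mask, "Gde je <mask>."):
        print(f"{token}\t{score}")
```

If the tokenizer uses a different mask token, `fill_mask.tokenizer.mask_token` gives the correct string to place in the input.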