---
language:
- is
- da
- sv
- 'no'
- fo
widget:
- text: Fina lilla<mask>, jag vill inte bliva stur.
- text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn..
- text: Það er vorhret á<mask>, napur vindur sem hvín.
- text: Ja, Gud signi<mask>, mítt land.
- text: Alle dyrene i<mask> må være venner.
tags:
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
license: agpl-3.0
datasets:
- vesteinn/FC3
- vesteinn/IC3
- mideind/icelandic-common-crawl-corpus-IC3
- NbAiLab/NCC
- DDSC/partial-danish-gigaword-no-twitter
---

# ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. Training used a batch size of 8.8k and ran for 72 epochs on 24 V100 GPUs, taking about 2 weeks.

| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See IceBERT paper                      | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurin + Bible               | 69 MB  |
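
To get a feel for how uneven the language mix is, the sizes listed above can be turned into per-language shares. This is only a rough sketch: it assumes raw corpus size is a fair proxy for how much of each language the model saw, treats 69 MB as 0.069 GB, and does not account for any resampling, which this card does not describe.

```python
# Corpus sizes in GB, copied from the table (69 MB is taken as 0.069 GB).
SIZES_GB = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}


def corpus_shares(sizes):
    """Return each language's fraction of the combined corpus size."""
    total = sum(sizes.values())
    return {lang: size / total for lang, size in sizes.items()}


total_gb = sum(SIZES_GB.values())
shares = corpus_shares(SIZES_GB)

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {SIZES_GB[lang]:>7.3f} GB  {share:7.2%}")
print(f"{'Total':<10} {total_gb:>7.3f} GB")
```

By this measure Norwegian makes up well over half of the data, while Faroese accounts for roughly a tenth of a percent.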

Note: A half-trained model was uploaded here at an earlier date. It has since been removed and replaced with the updated model.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
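
A minimal sketch of querying the model as a masked language model with the `transformers` fill-mask pipeline. The Hub model ID `vesteinn/ScandiBERT` is an assumption (it is not stated in this card), and `top_predictions` is a hypothetical helper name; the prompts are the widget examples from this card's metadata.

```python
# Assumption: this Hub model ID is not given in the card; replace it if it differs.
MODEL_ID = "vesteinn/ScandiBERT"

# Prompts copied from the widget examples in this card's metadata.
EXAMPLES = [
    "Fina lilla<mask>, jag vill inte bliva stur.",
    "Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn..",
    "Það er vorhret á<mask>, napur vindur sem hvín.",
    "Ja, Gud signi<mask>, mítt land.",
    "Alle dyrene i<mask> må være venner.",
]


def top_predictions(text, k=3, model_id=MODEL_ID):
    """Return (token, score) pairs for the k most likely fillers of <mask>.

    Hypothetical helper; expects exactly one <mask> in `text`.
    """
    # Imported lazily so the prompts can be inspected without transformers installed.
    from transformers import pipeline

    # Building the pipeline downloads the weights on first use.
    fill_mask = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]
```

Calling `top_predictions(EXAMPLES[0])` would return the three highest-scoring candidates for the Swedish prompt. For repeated queries it is cheaper to build the pipeline once and reuse it rather than reloading it on every call.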

If you find this model useful, please cite:

```
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn and
      Simonsen, Annika and
      Glavaš, Goran and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```