IceBERT
IceBERT was trained with fairseq using the RoBERTa-base architecture. The table below lists the training data.
| Dataset | Size | Tokens |
|---|---|---|
| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
| Icelandic Common Crawl Corpus (IC3) | 4.9 GB | 824M |
| Greynir News articles | 456 MB | 76M |
| Icelandic Sagas | 9 MB | 1.7M |
| Open Icelandic e-books (Rafbókavefurinn) | 14 MB | 2.6M |
| Data from the medical library of Landspitali | 33 MB | 5.2M |
| Student theses from Icelandic universities (Skemman) | 2.2 GB | 367M |
| Total | 15.8 GB | 2,664M |
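
The totals in the table can be cross-checked with a short script. This is a minimal sketch for illustration, not part of the original training pipeline: the `CORPORA` dictionary and the `totals` and `token_shares` helpers are hypothetical names, and the conversion 1 GB = 1000 MB is an assumption.

```python
# Sizes (in MB) and token counts (in millions), copied from the table above.
# Assumption: 1 GB is treated as 1000 MB.
CORPORA = {
    "Icelandic Gigaword Corpus v20.05 (IGC)": (8200, 1388),
    "Icelandic Common Crawl Corpus (IC3)": (4900, 824),
    "Greynir News articles": (456, 76),
    "Icelandic Sagas": (9, 1.7),
    "Open Icelandic e-books (Rafbókavefurinn)": (14, 2.6),
    "Data from the medical library of Landspitali": (33, 5.2),
    "Student theses from Icelandic universities (Skemman)": (2200, 367),
}


def totals(corpora):
    """Return (total size in MB, total tokens in millions)."""
    size_mb = sum(size for size, _ in corpora.values())
    tokens_m = sum(tokens for _, tokens in corpora.values())
    return size_mb, tokens_m


def token_shares(corpora):
    """Return each corpus's fraction of the total token count."""
    _, total_tokens = totals(corpora)
    return {name: tokens / total_tokens for name, (_, tokens) in corpora.items()}


if __name__ == "__main__":
    size_mb, tokens_m = totals(CORPORA)
    print(f"Total: {size_mb / 1000:.1f} GB, {tokens_m:.1f}M tokens")
    # List corpora from largest to smallest share of tokens.
    ranked = sorted(token_shares(CORPORA).items(), key=lambda kv: -kv[1])
    for name, share in ranked:
        print(f"{share:6.1%}  {name}")
```

Under these assumptions the sums agree with the Total row after rounding, and the IGC alone contributes roughly half of all tokens.
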
If you find this model useful, please cite the following paper:
@inproceedings{snaebjarnarson-etal-2022-warm,
title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
author = "Sn{\ae}bjarnarson, V{\'e}steinn and
S{\'\i}monarson, Haukur Barri and
Ragnarsson, P{\'e}tur Orri and
Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
J{\'o}nsson, Haukur and
Thorsteinsson, Vilhjalmur and
Einarsson, Hafsteinn",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.464",
pages = "4356--4366",
}