Update README.md
README.md
CHANGED
@@ -9,7 +9,7 @@ datasets:
 - Finnish-NLP/mc4_fi_cleaned
 - wikipedia
 widget:
-- text: "
+- text: "Tekstiä tuottava tekoäly on"
 
 ---
 
@@ -87,7 +87,7 @@ As with all language models, it is hard to predict in advance how the Finnish GP
 
 ## Training data
 
-This Finnish GPT-2 model was pretrained on the combination of
+This Finnish GPT-2 model was pretrained on the combination of six datasets:
 - [mc4_fi_cleaned](https://huggingface.co/datasets/Finnish-NLP/mc4_fi_cleaned), the dataset mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus. We used the Finnish subset of the mC4 dataset and further cleaned it with our own text data cleaning codes (check the dataset repo).
 - [wikipedia](https://huggingface.co/datasets/wikipedia) We used the Finnish subset of the wikipedia (August 2021) dataset
 - [Yle Finnish News Archive 2011-2018](http://urn.fi/urn:nbn:fi:lb-2017070501)