---
language: nl
widget:
- text: "In het jaar 2030 zullen we"
- text: "Toen ik gisteren volledig in de ban was van"
- text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
- text: "In Israël was een strenge lockdown"
tags:
- gpt2-large
- gpt2
pipeline_tag: text-generation
datasets:
- yhavinga/mc4_nl_cleaned
---
# GPT2-Large pre-trained on cleaned Dutch mC4 🇳🇱

Dataset:

* [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
* dataset config: full (33B tokens)

Tokenizer:

* Tokenizer trained on mC4 with scripts from the Hugging Face
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)

Training details:

* Training resumed from step 360K (batch size 16, perplexity 21) of an earlier model trained with the Adam optimizer.
* Currently at step 800K of 2M (38%), perplexity 15.3
* Block size: 512
* Optimizer: Adafactor
* Learning rate: 3.3e-5
* Batch size: 32
* Warmup steps: 5000
* Weight decay: 0.01
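
As a rough sanity check on the numbers above, the following Python sketch works out how many tokens the planned run would process and what loss the reported perplexities correspond to. It assumes that the batch size of 32 is the global batch size with no gradient accumulation, that every step uses it (it ignores the earlier 360K steps at batch size 16), and that perplexity is the exponential of the mean cross-entropy loss in nats. The function names are illustrative and do not come from the training scripts.

```python
import math

# Values taken from the training details listed above.
BLOCK_SIZE = 512          # tokens per sequence
BATCH_SIZE = 32           # sequences per step (assumed to be the global batch size)
TOTAL_STEPS = 2_000_000   # planned number of training steps
DATASET_TOKENS = 33e9     # approximate size of the "full" dataset config


def tokens_per_step(batch_size: int, block_size: int) -> int:
    """Number of tokens processed in one optimizer step."""
    return batch_size * block_size


def tokens_seen(steps: int, batch_size: int = BATCH_SIZE,
                block_size: int = BLOCK_SIZE) -> int:
    """Total tokens processed after `steps` steps at a fixed batch size."""
    return steps * tokens_per_step(batch_size, block_size)


def loss_from_perplexity(ppl: float) -> float:
    """Mean cross-entropy loss in nats, assuming ppl = exp(loss)."""
    return math.log(ppl)


if __name__ == "__main__":
    planned = tokens_seen(TOTAL_STEPS)
    print(f"tokens per step: {tokens_per_step(BATCH_SIZE, BLOCK_SIZE):,}")
    print(f"planned tokens:  {planned:,} "
          f"({planned / DATASET_TOKENS:.2f} passes over the dataset)")
    for ppl in (21.0, 15.3):
        print(f"perplexity {ppl} -> loss {loss_from_perplexity(ppl):.3f}")
```

Under these assumptions the planned 2M steps cover roughly 32.8B tokens, which is close to a single pass over the 33B-token dataset.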

Work in progress, December 2021 to January 2022.

* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
* Thanks to @gsarti for creating the [t5-flax-gcp
  repository](https://github.com/gsarti/t5-flax-gcp).
* Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
  [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
  for sharing their training scripts!