Update README.md

README.md CHANGED

# GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with a perplexity of 15.2 on cleaned Dutch mC4.
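
For reference, a minimal text-generation example with the 🤗 Transformers `pipeline` API is sketched below. The model id `yhavinga/gpt2-medium-dutch` is an assumption (based on the dataset author's namespace), not something stated in this card; substitute the actual repository id of this model.

```python
from transformers import pipeline

# Assumed model id -- replace with this repository's actual id if it differs.
MODEL_ID = "yhavinga/gpt2-medium-dutch"

# Text-generation pipeline backed by this Dutch GPT2-medium checkpoint.
generator = pipeline("text-generation", model=MODEL_ID)

prompt = "Op een mooie zomerdag"
outputs = generator(prompt, max_length=50, do_sample=True, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```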

## Tokenizer

* Tokenizer trained from scratch for Dutch on cleaned Dutch mC4 with scripts from the Hugging Face
  Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling); a minimal sketch of this step follows below.
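
The sketch below follows the tokenizer-training pattern from those Flax examples, using the `tokenizers` library. The vocabulary size of 50,257 (GPT2's default) and the small training slice are illustrative assumptions, not settings confirmed by this card.

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# A small slice of cleaned Dutch mC4 for illustration; the actual tokenizer was
# trained on far more text. The "text" column name is assumed from mC4's layout.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train[:100000]")

def batch_iterator(batch_size=1_000):
    """Yield batches of raw documents for tokenizer training."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Byte-level BPE, the tokenizer type GPT2 uses.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_257,        # assumption: same size as the original GPT2 vocabulary
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save("tokenizer.json")
```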

## Dataset

This model was trained on the `full` configuration (33B tokens) of
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except (a simplified sketch of these filters follows the list):

* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences with a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
  "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.

## Training details

* Trained for 320K of 520K steps (61%, 20B tokens)
* Block size: 512
* Optimizer: adam, lr 8e-4, beta1 0.9, beta2 0.98 (a configuration sketch follows the list)
* Warmup steps: 5000
* Weight decay: 0.01
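
In optax terms (the optimizer library used by the Flax examples), the settings above could be wired up roughly as sketched below. The linear decay to zero after warmup is an assumption; the card only states the peak learning rate, betas, warmup steps and weight decay.

```python
import optax

TOTAL_STEPS = 520_000   # planned number of training steps listed above
WARMUP_STEPS = 5_000
PEAK_LR = 8e-4

# Linear warmup to the peak learning rate, then (assumed) linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=PEAK_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(init_value=PEAK_LR, end_value=0.0,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with decoupled weight decay, matching the listed betas and weight decay.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98, weight_decay=0.01)
```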

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The Hugging Face 🤗 ecosystem was also
instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
and in getting an idea of sensible hyper-parameters for training gpt2 from scratch:

* [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)
* [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
* [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
* [language model training examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)