---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---

# Work in progress. Dec 2021.

# A collection of Dutch T5 models

* Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
* This is a continuation of work started during the [Hugging Face community week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) with TPU usage sponsored by Google, for the project [Pre-train T5 from scratch in Dutch](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-dutch/8109).
* Trained with an improved training script: no more exceptions during training, so no restarting is required.
* All models are trained with TensorFlow metrics.
* Thanks to @gsarti for creating the [t5-flax-gcp repository](https://github.com/gsarti/t5-flax-gcp)!
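
To try one of these checkpoints, a minimal sketch with the `transformers` library is shown below. The repository id `yhavinga/t5-v1.1-base-dutch` is an assumption based on the model names in the table that follows; these are pre-trained (not fine-tuned) models, which is why `inference: false` is set above.

```python
# Minimal sketch: load one of the pre-trained Dutch T5 checkpoints with transformers.
# The repository id is assumed from the table below; substitute any of the other names.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "yhavinga/t5-v1.1-base-dutch"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# If only Flax weights are published for a checkpoint, add from_flax=True.
model = T5ForConditionalGeneration.from_pretrained(model_id)

# These are span-corruption pre-trained checkpoints, so real use normally means
# fine-tuning first; this only demonstrates that the weights load and generate.
inputs = tokenizer("Het Nederlandse T5-model is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
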
|                        |`t5-base-dutch`          |`t5-v1.1-base-dutch`     |`t5-v1.1-large-dutch-cased`|`t5-v1.1-base-dutch-uncased`|
|------------------------|-------------------------|-------------------------|---------------------------|----------------------------|
|`tokenizer`             |`cased`                  |`uncased`                |`cased`                    |`uncased`                   |
|`source model config`   |`google/t5-base`         |`google/t5-v1_1-base`    |`google/t5-v1_1-large`     |`google/t5-v1_1-base`       |
|`dataset`               |`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`|`yhavinga/mc4_nl_cleaned`  |`yhavinga/mc4_nl_cleaned`   |
|`tpu vm`                | two                     | one                     | three                     | one                        |
|`finished`              |                         | YES                     |                           |                            |
|*Hyperparameters*       |                         |                         |                           |                            |
|`epochs`                | 1                       | 1                       | 4                         | 2                          |
|`per-device batch size` | 16                      | 16                      | 2                         | 8                          |
|`tot. batch size`       | 128                     | 128                     | 16                        | ?                          |
|`steps`                 | 508 976                 | 508 976                 | 8 428 012                 | ?                          |
|`max seq. length`       | 512                     | 512                     | 1024                      | 1024                       |
|`tot. tok. trained on`  | 33B                     | 33B                     | 138B                      | ?                          |
|`optimizer`             | adafactor               | adafactor               | adafactor                 | adafactor                  |
|`warmup steps`          | 10000                   | 10000                   | 10000                     | 10000                      |
|`learning rate`         | 0.005                   | 0.005                   | 0.005                     | 0.005                      |
|`weight decay`          | 0.01                    | 0.01                    | 0.01                      | 0.001                      |
|`tie embeds`            |`false`                  |`false`                  |`false`                    |`false`                     |
|`validation split size` | 15K examples            | 15K examples            | 15K examples              | 15K examples               |
|*Model config*          |                         |                         |                           |                            |
|`d_ff`                  | 3072                    | 2048                    | 2816                      | 2048                       |
|`d_kv`                  | 64                      | 64                      | 64                        | 64                         |
|`d_model`               | 768                     | 768                     | 1024                      | 768                        |
|`dropout rate`          | 0.1                     | 0.1                     | 0.1 (0.0 in pre-training) | 0.1 (0.0 in pre-training)  |
|`ff projection`         |`relu`                   |`gated-gelu`             |`gated-gelu`               |`gated-relu`                |
|`num decoder layers`    | 12                      | 12                      | 24                        | 12                         |
|`num heads`             | 12                      | 12                      | 16                        | 12                         |
|`num layers`            | 12                      | 12                      | 24                        | 12                         |
|`rel. attn. buckets`    | 32                      | 32                      | 32                        | 32                         |
|`vocab size`            | 32103                   | 32103                   | 32103                     | 32103                      |
|*Training time*         | ~ 100 hours             | 101 hours               | ~ 370 hours               | ?                          |
|*Evaluation*            |                         |                         |                           |                            |
|`accuracy`              |                         | 0.6976                  |                           |                            |
|`loss`                  |                         | 1.379                   |                           |                            |
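
For reference, the *Model config* rows above correspond to the standard `T5Config` arguments in `transformers`. Below is a sketch assembled from the `t5-v1.1-base-dutch` column; the published `config.json` of each model is authoritative, this only illustrates the mapping.

```python
# Sketch only: a T5Config built from the t5-v1.1-base-dutch column of the table above.
from transformers import T5Config

config = T5Config(
    vocab_size=32103,                   # `vocab size`
    d_model=768,                        # `d_model`
    d_kv=64,                            # `d_kv`
    d_ff=2048,                          # `d_ff`
    num_layers=12,                      # `num layers` (encoder)
    num_decoder_layers=12,              # `num decoder layers`
    num_heads=12,                       # `num heads`
    relative_attention_num_buckets=32,  # `rel. attn. buckets`
    dropout_rate=0.1,                   # `dropout rate` (0.0 during pre-training for some models)
    feed_forward_proj="gated-gelu",     # `ff projection`
    tie_word_embeddings=False,          # `tie embeds`
)
print(config)
```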