Update README.md
README.md
CHANGED
@@ -1,9 +1,13 @@
 # dv-wave
 
-This is a
+This is a second attempt at a Dhivehi language model trained with
+Google Research's [ELECTRA](https://github.com/google-research/electra).
 
 Tokenization and training CoLab: https://colab.research.google.com/drive/1ZJ3tU9MwyWj6UtQ-8G7QJKTn-hG1uQ9v?usp=sharing
 
+V1: similar performance to mBERT after 3 epochs
+
+V2: fixed tokenizers do_lower_case=False and strip_accents=False to preserve vowel signs of Dhivehi
 
 ## Corpus
 
@@ -14,4 +18,4 @@ of Dhivehi text (79MB deduped).
 
 ## Vocabulary
 
-Included as vocab.txt in the upload - vocab_size is
+Included as vocab.txt in the upload - vocab_size is 29874
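
The V2 note in this change says that `do_lower_case=False` and `strip_accents=False` preserve Dhivehi vowel signs. The sketch below illustrates why that matters, assuming the tokenizer strips accents the way BERT-style basic tokenizers usually do: NFD-normalize the text, then drop every character in Unicode category `Mn` (nonspacing mark). Thaana vowel signs (U+07A6 to U+07B0) fall in that category, so they would be removed along with Latin accents. The `strip_accents` helper here is a hypothetical stand-in written for illustration, not the code from the linked notebook.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Illustrative BERT-style accent stripping (assumed behaviour):
    NFD-normalize, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# "Dhivehi" written in Thaana: three consonants, each followed by a vowel sign.
word = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"
stripped = strip_accents(word)

# Show which code points survive the stripping step.
for label, value in (("original", word), ("stripped", stripped)):
    print(label, [f"U+{ord(ch):04X}" for ch in value])
```

With stripping enabled, only the consonants remain, so distinct words can collapse into the same token sequence. In the Hugging Face `transformers` BERT and ELECTRA tokenizers, leaving `strip_accents` unset makes it follow `do_lower_case`, which is presumably why the note sets both options explicitly.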