Update README.md
README.md
ADDED
@@ -0,0 +1,17 @@
# dv-wave

This is a first attempt at a Dhivehi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra).

Tokenization and training Colab notebook: https://colab.research.google.com/drive/1ZJ3tU9MwyWj6UtQ-8G7QJKTn-hG1uQ9v?usp=sharing

## Corpus

Trained on @Sofwath's 307MB corpus of Dhivehi news: https://github.com/Sofwath/DhivehiDatasets

[OSCAR](https://oscar-corpus.com/) was also considered; as of this writing, its web crawl contains 126MB of Dhivehi text (79MB deduplicated).
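
To compare a raw and a deduplicated corpus size like the figures above, something along these lines could be used. This is only a rough sketch that assumes exact line-level deduplication; OSCAR's actual pipeline may differ, and the function name `dedup_size_bytes` is hypothetical.

```python
def dedup_size_bytes(lines):
    """Return (raw_bytes, deduplicated_bytes) for an iterable of text lines.

    Assumption: a line counts as a duplicate only if it is an exact
    repeat of an earlier line; the first occurrence is kept.
    """
    seen = set()
    raw = 0
    deduplicated = 0
    for line in lines:
        size = len(line.encode("utf-8"))  # size on disk as UTF-8
        raw += size
        if line not in seen:
            seen.add(line)
            deduplicated += size
    return raw, deduplicated
```

Passing an open text file (opened with `encoding="utf-8"`) as `lines` streams the corpus line by line, although the set of seen lines still has to fit in memory.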

## Vocabulary

Included as vocab.txt in the upload; vocab_size is 29982.
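
As a quick sanity check, the vocabulary file can be read and counted. This is a minimal sketch that assumes vocab.txt uses the usual WordPiece layout for ELECTRA (one token per line, with the line index serving as the token id); the helper name `load_vocab` is hypothetical.

```python
def load_vocab(path):
    """Read a vocab file with one token per line.

    Assumption: the position of a token in the returned list is its id,
    as in the standard BERT/ELECTRA WordPiece vocab.txt layout.
    """
    with open(path, encoding="utf-8") as f:
        # Strip only the trailing newline so tokens keep any other whitespace.
        return [line.rstrip("\n") for line in f]
```

If the file matches the README, `len(load_vocab("vocab.txt"))` should equal the stated vocab_size.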