Update README.md

Committed by system (HF staff)
Commit 4430dc6 (parent: 216b0e1)

Files changed (1):
  1. README.md +12 -8

README.md CHANGED
@@ -4,12 +4,14 @@ language: ta
 
 # TaMillion
 
- This is a first attempt at a Tamil language model trained with
 Google Research's [ELECTRA](https://github.com/google-research/electra).
 
- Tokenization and pre-training CoLab: https://colab.research.google.com/drive/1GngBFn_Ge5Hd2XI2febBhZyU7GDiqw5w
 
- V2 (current): 190,000 steps; (V1 was 100,000 steps)
 
 ## Classification
 
@@ -19,22 +21,24 @@ https://www.kaggle.com/sudalairajkumar/tamil-nlp
 Notebook: https://colab.research.google.com/drive/1_rW9HZb6G87-5DraxHvhPOzGmSMUc67_?usp=sharing
 
 The model outperformed mBERT on news classification:
- (Random: 16.7%, mBERT: 53.0%, TaMillion: 69.6%)
 
 The model slightly outperformed mBERT on movie reviews:
- (RMSE - mBERT: 0.657, TaMillion: 0.627)
 
 Equivalent accuracy on the Tirukkural topic task.
 
 ## Question Answering
 
- I didn't find a Tamil-language question answering dataset, but this model could be used
 to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
 
 ## Corpus
 
- Trained on a web crawl from https://oscar-corpus.com/ (deduped version, 5.1GB) and 1 July 2020 dump of ta.wikipedia.org (476MB)
 
 ## Vocabulary
 
- Included as vocab.txt in the upload - vocab_size is 40161
 
 # TaMillion
 
+ This is the second version of a Tamil language model trained with
 Google Research's [ELECTRA](https://github.com/google-research/electra).
 
+ Tokenization and pre-training CoLab: https://colab.research.google.com/drive/1Pwia5HJIb6Ad4Hvbx5f-IjND-vCaJzSE?usp=sharing
 
+ V1: small model with GPU; 190,000 steps
+
+ V2 (current): base model with TPU and larger corpus; 224,000 steps
 
 ## Classification
 
 Notebook: https://colab.research.google.com/drive/1_rW9HZb6G87-5DraxHvhPOzGmSMUc67_?usp=sharing
 
 The model outperformed mBERT on news classification:
+ (Random: 16.7%, mBERT: 53.0%, TaMillion: 75.1%)
 
 The model slightly outperformed mBERT on movie reviews:
+ (RMSE - mBERT: 0.657, TaMillion: 0.626)
 
 Equivalent accuracy on the Tirukkural topic task.
 
 ## Question Answering
 
+ I didn't find a Tamil-language question answering dataset, but this model could be fine-tuned
 to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
 
 ## Corpus
 
+ Trained on
+ IndicCorp Tamil (11GB) https://indicnlp.ai4bharat.org/corpora/
+ and 1 October 2020 dump of https://ta.wikipedia.org (482MB)
 
 ## Vocabulary
 
+ Included as vocab.txt in the upload
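The metrics quoted in the README's Classification section can be sanity-checked with plain Python. This is a minimal sketch using toy, hypothetical predictions (not the actual evaluation data); note that the 16.7% "Random" baseline is consistent with uniform guessing over 6 news categories (1/6 ≈ 16.7%).

```python
import math

def accuracy(preds, labels):
    """Fraction of exact matches, the metric behind the news-classification numbers."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def rmse(preds, labels):
    """Root mean squared error, the metric behind the movie-review numbers."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))

# Uniform guessing over 6 classes gives the quoted random baseline.
random_baseline = 1 / 6
print(f"random baseline: {random_baseline:.1%}")  # prints "random baseline: 16.7%"

# Toy class predictions vs. gold labels (hypothetical values):
print(accuracy([1, 2, 3, 3], [1, 2, 3, 0]))  # prints 0.75

# Toy predicted vs. true movie ratings (hypothetical values):
print(round(rmse([3.5, 4.0, 2.0], [3.0, 4.5, 2.0]), 3))  # prints 0.408
```

The published comparison (mBERT 0.657 vs. TaMillion 0.626 RMSE) uses this same RMSE definition: lower is better.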