---
license: apache-2.0
datasets:
- castorini/wura
language:
- afr
- amh
- arz
- eng
- fra
- hau
- ibo
- kin
- mlg
- nya
- orm
- por
- sna
- som
- sot
- swa
- tir
- xho
- yor
- zul
---

# AfriTeVa V2 Large

AfriTeVa V2 Large is a multilingual T5 [Version 1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) model with 1B parameters, pretrained on [Wura](https://huggingface.co/datasets/castorini/wura) with a vocabulary size of 150,000. It improves over existing baselines on [Text Classification](https://huggingface.co/datasets/masakhane/masakhanews), [Machine Translation](https://huggingface.co/datasets/masakhane/mafand), [Summarization](https://huggingface.co/datasets/csebuetnlp/xlsum), and [Cross-lingual Question Answering](https://huggingface.co/datasets/masakhane/afriqa).

Paper: [Better Quality Pretraining Data & T5 Models for African Languages](https://openreview.net/forum?id=ybc9V6Cbq2)

Authors: *Akintunde Oladipo, Mofetoluwa Adeyemi, Orevaoghene Ahia, Abraham Toluwalase Owodunni, Odunayo Ogundepo, David Ifeoluwa Adelani, Jimmy Lin*

**Notes**:
* Dropout was turned off during pretraining and should be re-enabled for finetuning.
* Other checkpoints are available [here](https://huggingface.co/models?search=afriteva_v2_base).
+
40
+ ## Abstract
41
+
42
+ In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at [castorini/AfriTeVa-keji](https://github.com/castorini/AfriTeVa-keji).
43
+
44
+ ## Citation Information
45
+
46
+ ```bibtex
47
+ @article{OladipoBQPD2023EMNLP,
48
+ title = "Better Quality Pre-training Data and T5 Models for African Languages",
49
+ author = "Oladipo, Akintunde and
50
+ Adeyemi, Mofetoluwa and
51
+ Ahia, Orevaoghene and
52
+ Owodunni, Abraham and
53
+ Ogundepo, Odunayo and
54
+ Adelani, David and
55
+ Lin, Jimmy
56
+ ",
57
+ booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
58
+ publisher = "Association for Computational Linguistics",
59
+ year = "2023",
60
+ }
61
+ ```