---
language:
- bn
- gu
- hi
- mr
- ne
- or
- pa
- sa
- ur
library_name: transformers
pipeline_tag: fill-mask
---

# IA-Original

IA-Original is a multilingual RoBERTa model pre-trained exclusively on 11 Indian languages from the Indo-Aryan language family. It is pre-trained on monolingual corpora of these languages and subsequently evaluated on a diverse set of tasks.

The 11 languages covered by IA-Original are Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, and Urdu.

The code can be found [here](https://github.com/IBM/NL-FM-Toolkit). For more information, see our [paper](https://aclanthology.org/2021.emnlp-main.675/).

## Pretraining Corpus

We pre-trained IA-Original on publicly available monolingual corpora. The languages are distributed in the corpus as follows:

| **Language** | **# Sentences** | **# Tokens (Total)** | **# Tokens (Unique)** |
| :------------ | ---------------: | ------------: | ------------: |
| Hindi (hi) | 1,552.89 | 20,098.73 | 25.01 |
| Bengali (bn) | 353.44 | 4,021.30 | 6.50 |
| Sanskrit (sa) | 165.35 | 1,381.04 | 11.13 |
| Urdu (ur) | 153.27 | 2,465.48 | 4.61 |
| Marathi (mr) | 132.93 | 1,752.43 | 4.92 |
| Gujarati (gu) | 131.22 | 1,565.08 | 4.73 |
| Nepali (ne) | 84.21 | 1,139.54 | 3.43 |
| Punjabi (pa) | 68.02 | 945.68 | 2.00 |
| Oriya (or) | 17.88 | 274.99 | 1.10 |
| Bhojpuri (bh) | 10.25 | 134.37 | 1.13 |
| Magahi (mag) | 0.36 | 3.47 | 0.15 |

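
To make the imbalance in the table above easier to see, the following minimal sketch converts the sentence counts into per-language shares. The numbers are copied from the table as given (in whatever unit the table uses); the names `SENTENCES` and `sentence_shares` are illustrative and not part of the released code.

```python
# Sentence counts copied from the table above (same unit as the table).
SENTENCES = {
    "hi": 1552.89, "bn": 353.44, "sa": 165.35, "ur": 153.27,
    "mr": 132.93, "gu": 131.22, "ne": 84.21, "pa": 68.02,
    "or": 17.88, "bh": 10.25, "mag": 0.36,
}


def sentence_shares(counts):
    """Return each language's fraction of the total sentence count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = sentence_shares(SENTENCES)

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}  {share:6.2%}")
```

Hindi accounts for more than half of the sentences, while Magahi contributes only a tiny fraction, which is worth keeping in mind when comparing per-language results.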

## Evaluation Results

IA-Original is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).

## Downloads

The model can be downloaded from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-original-script-roberta).
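
Since the model card declares `library_name: transformers` and `pipeline_tag: fill-mask`, loading the checkpoint might look like the sketch below. This is an untested illustration, not the authors' own usage code: it assumes the Hub repository loads through the standard `transformers` `pipeline` API, and the helper names (`build_fill_mask`, `top_predictions`, `main`) and the Hindi example sentence are hypothetical.

```python
MODEL_ID = "ibm/ia-multilingual-original-script-roberta"


def build_fill_mask(model_id: str = MODEL_ID):
    """Return a fill-mask pipeline for the given checkpoint (downloads on first use)."""
    # Imported here so the helpers below can be used without loading transformers.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, template: str, k: int = 5):
    """Fill the `{mask}` placeholder in `template`; return (token, score) pairs."""
    # Use the tokenizer's own mask token instead of hard-coding one.
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=k)]


def main():
    fill_mask = build_fill_mask()
    # Hindi: "The capital of India is {mask}."
    for token, score in top_predictions(fill_mask, "भारत की राजधानी {mask} है।"):
        print(f"{token}\t{score:.4f}")
```

Calling `main()` downloads the checkpoint and prints the top candidate tokens with their scores. Reading the mask token from the tokenizer avoids assuming a particular mask string for this checkpoint.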

## Citing

If you use any of these resources, please cite the following paper:

```
@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas and
      Murthy, Rudra and
      Bharadwaj, Samarth and
      Sankaranarayanan, Karthik and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
```

## Contributors

- Tejas Dhamecha
- Rudra Murthy
- Samarth Bharadwaj
- Karthik Sankaranarayanan
- Pushpak Bhattacharyya

## Contact

- Rudra Murthy ([[email protected]](mailto:[email protected]))