---
language:
- bn
- gu
- hi
- mr
- ne
- or
- pa
- sa
- ur

library_name: transformers
pipeline_tag: fill-mask
---

# IA-Transliterated

IA-Transliterated is a multilingual RoBERTa model pre-trained exclusively on monolingual corpora of 11 Indian languages from the Indo-Aryan language family. The text of every language is transliterated into the Devanagari script. The model is subsequently evaluated on a diverse set of tasks.
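
Because the model sees only Devanagari text, input written in another script would presumably need to be converted first. The sketch below is a rough approximation and not necessarily the procedure used for this model: it relies on the fact that several Brahmi-derived Unicode blocks share a parallel layout, so shifting code points by a fixed offset maps most characters to their Devanagari counterparts. The names `BLOCK_STARTS` and `to_devanagari` are hypothetical, some characters have no exact counterpart, and Urdu's Perso-Arabic script cannot be handled this way.

```python
# Assumption: a simple code point offset, not the authors' documented pipeline.
DEVANAGARI_START = 0x0900

# First code point of each 128-character Unicode block (from the Unicode charts).
BLOCK_STARTS = {
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi (Punjabi)
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
}


def to_devanagari(text: str, lang: str) -> str:
    """Shift characters from the script block of `lang` into the Devanagari block."""
    start = BLOCK_STARTS[lang]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < 0x80:
            # Character lies in the source block: move it to the same slot in Devanagari.
            out.append(chr(DEVANAGARI_START + offset))
        else:
            # Spaces, digits, punctuation and other scripts pass through unchanged.
            out.append(ch)
    return "".join(out)


print(to_devanagari("বাংলা", "bn"))
```

A dedicated transliteration library is likely to handle script-specific exceptions better than this offset trick.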

The 11 languages covered by IA-Transliterated are: Bhojpuri, Bengali, Gujarati, Hindi, Magahi, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Urdu.

The code can be found [here](https://github.com/IBM/NL-FM-Toolkit). For more information, check out our [paper](https://aclanthology.org/2021.emnlp-main.675/).


## Pretraining Corpus

We pre-trained IA-Transliterated on publicly available monolingual corpora. The combined corpus has the following distribution of languages:


| **Language**  | **# Sentences** | **# Tokens (total)** | **# Tokens (unique)** |
| :------------ | --------------: | -------------------: | --------------------: |
| Hindi (hi)    | 1,552.89        | 20,098.73            | 25.01                 |
| Bengali (bn)  | 353.44          | 4,021.30             | 6.5                   |
| Sanskrit (sa) | 165.35          | 1,381.04             | 11.13                 |
| Urdu (ur)     | 153.27          | 2,465.48             | 4.61                  |
| Marathi (mr)  | 132.93          | 1,752.43             | 4.92                  |
| Gujarati (gu) | 131.22          | 1,565.08             | 4.73                  |
| Nepali (ne)   | 84.21           | 1,139.54             | 3.43                  |
| Punjabi (pa)  | 68.02           | 945.68               | 2.00                  |
| Oriya (or)    | 17.88           | 274.99               | 1.10                  |
| Bhojpuri (bh) | 10.25           | 134.37               | 1.13                  |
| Magahi (mag)  | 0.36            | 3.47                 | 0.15                  |
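
To make the imbalance in this distribution concrete, the following sketch computes each language's share of the total token count. The figures are copied from the table (in the table's own units); the names `TOTAL_TOKENS` and `token_shares` are illustrative only.

```python
# Total token counts per language, copied from the table above.
TOTAL_TOKENS = {
    "hi": 20098.73, "bn": 4021.30, "sa": 1381.04, "ur": 2465.48,
    "mr": 1752.43, "gu": 1565.08, "ne": 1139.54, "pa": 945.68,
    "or": 274.99, "bh": 134.37, "mag": 3.47,
}


def token_shares(counts):
    """Return each language's fraction of the overall token count."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = token_shares(TOTAL_TOKENS)

# Print languages from the largest to the smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>3}: {share:6.2%}")
```

Hindi alone accounts for nearly 60% of the tokens, while Magahi contributes about 0.01%.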



## Evaluation Results

IA-Transliterated is evaluated on IndicGLUE and some additional tasks. For more details about the tasks, refer to the [paper](https://aclanthology.org/2021.emnlp-main.675/).



## Downloads

You can download the model from [Hugging Face](https://huggingface.co/ibm/ia-multilingual-transliterated-roberta).
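
As a minimal sketch of how the model might be queried, the snippet below wraps the `transformers` fill-mask pipeline. The model ID is taken from the Hugging Face URL above. The helper names `load_fill_mask` and `top_predictions` are hypothetical, and the example assumes the repository ships a compatible tokenizer and that the input is already in Devanagari.

```python
MODEL_ID = "ibm/ia-multilingual-transliterated-roberta"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; the weights are downloaded on first use."""
    from transformers import pipeline  # lazy import: only needed when loading

    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask, text: str, k: int = 5):
    """Return (token, score) pairs for a text containing exactly one mask token."""
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


# Example usage (needs network access to fetch the model):
# fill_mask = load_fill_mask()
# mask = fill_mask.tokenizer.mask_token  # read the mask token instead of hard-coding it
# for token, score in top_predictions(fill_mask, f"भारत एक {mask} है।"):
#     print(token, round(score, 3))
```

Reading the mask token from the tokenizer avoids guessing its exact spelling.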



## Citing

If you use any of these resources, please cite the following paper:

```
@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas  and
      Murthy, Rudra  and
      Bharadwaj, Samarth  and
      Sankaranarayanan, Karthik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
```

## Contributors

- Tejas Dhamecha
- Rudra Murthy
- Samarth Bharadwaj
- Karthik Sankaranarayanan
- Pushpak Bhattacharyya


## Contact

- Rudra Murthy ([[email protected]](mailto:[email protected]))