---
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- uz
- vi
- zh
license: mit
---

# xmod-base

X-MOD is a multilingual masked language model trained on filtered CommonCrawl data containing 81 languages. It was introduced in the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) (Pfeiffer et al., NAACL 2022) and first released in [this repository](https://github.com/facebookresearch/fairseq/tree/main/examples/xmod).

Because it has been pre-trained with language-specific modular components (_language adapters_), X-MOD differs from previous multilingual models like [XLM-R](https://huggingface.co/xlm-roberta-base). For fine-tuning, the language adapters in each transformer layer are frozen.

# Usage

## Tokenizer

This model reuses the tokenizer of [XLM-R](https://huggingface.co/xlm-roberta-base), so you can load the tokenizer as follows:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```

## Input Language

Because this model uses language adapters, you need to specify the language of your input so that the correct adapter can be activated:

```python
from transformers import XmodModel

model = XmodModel.from_pretrained("jvamvas/xmod-base")
# Use the English adapter for all inputs unless told otherwise
model.set_default_language("en_XX")
```

A list of the language adapters in this model can be found at the bottom of this model card.

## Fine-tuning

The paper recommends freezing the embedding layer and the language adapters during fine-tuning.
A method for doing this is provided in the code:

```python
model.freeze_embeddings_and_language_adapters()
# Fine-tune the model ...
```

## Cross-lingual Transfer

After fine-tuning, zero-shot cross-lingual transfer can be tested by activating the language adapter of the target language:

```python
model.set_default_language("de_DE")
# Evaluate the model on German examples ...
```

# Bias, Risks, and Limitations

Please refer to the model card of [XLM-R](https://huggingface.co/xlm-roberta-base), because X-MOD has a similar architecture and has been trained on similar training data.

# Citation

**BibTeX:**

```bibtex
@inproceedings{pfeiffer-etal-2022-lifting,
    title = "Lifting the Curse of Multilinguality by Pre-training Modular Transformers",
    author = "Pfeiffer, Jonas  and
      Goyal, Naman  and
      Lin, Xi  and
      Li, Xian  and
      Cross, James  and
      Riedel, Sebastian  and
      Artetxe, Mikel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.255",
    doi = "10.18653/v1/2022.naacl-main.255",
    pages = "3479--3495"
}
```

# Languages

This model contains the following language adapters:

| lang_id (Adapter index) | Language code | Language              |
|-------------------------|---------------|-----------------------|
| 0                       | en_XX         | English               |
| 1                       | id_ID         | Indonesian            |
| 2                       | vi_VN         | Vietnamese            |
| 3                       | ru_RU         | Russian               |
| 4                       | fa_IR         | Persian               |
| 5                       | sv_SE         | Swedish               |
| 6                       | ja_XX         | Japanese              |
| 7                       | fr_XX         | French                |
| 8                       | de_DE         | German                |
| 9                       | ro_RO         | Romanian              |
| 10                      | ko_KR         | Korean                |
| 11                      | hu_HU         | Hungarian             |
| 12                      | es_XX         | Spanish               |
| 13                      | fi_FI         | Finnish               |
| 14                      | uk_UA         | Ukrainian             |
| 15                      | da_DK         | Danish                |
| 16                      | pt_XX         | Portuguese            |
| 17                      | no_XX         | Norwegian             |
| 18                      | th_TH         | Thai                  |
| 19                      | pl_PL         | Polish                |
| 20                      | bg_BG         | Bulgarian             |
| 21                      | nl_XX         | Dutch                 |
| 22                      | zh_CN         | Chinese (simplified)  |
| 23                      | he_IL         | Hebrew                |
| 24                      | el_GR         | Greek                 |
| 25                      | it_IT         | Italian               |
| 26                      | sk_SK         | Slovak                |
| 27                      | hr_HR         | Croatian              |
| 28                      | tr_TR         | Turkish               |
| 29                      | ar_AR         | Arabic                |
| 30                      | cs_CZ         | Czech                 |
| 31                      | lt_LT         | Lithuanian            |
| 32                      | hi_IN         | Hindi                 |
| 33                      | zh_TW         | Chinese (traditional) |
| 34                      | ca_ES         | Catalan               |
| 35                      | ms_MY         | Malay                 |
| 36                      | sl_SI         | Slovenian             |
| 37                      | lv_LV         | Latvian               |
| 38                      | ta_IN         | Tamil                 |
| 39                      | bn_IN         | Bengali               |
| 40                      | et_EE         | Estonian              |
| 41                      | az_AZ         | Azerbaijani           |
| 42                      | sq_AL         | Albanian              |
| 43                      | sr_RS         | Serbian               |
| 44                      | kk_KZ         | Kazakh                |
| 45                      | ka_GE         | Georgian              |
| 46                      | tl_XX         | Tagalog               |
| 47                      | ur_PK         | Urdu                  |
| 48                      | is_IS         | Icelandic             |
| 49                      | hy_AM         | Armenian              |
| 50                      | ml_IN         | Malayalam             |
| 51                      | mk_MK         | Macedonian            |
| 52                      | be_BY         | Belarusian            |
| 53                      | la_VA         | Latin                 |
| 54                      | te_IN         | Telugu                |
| 55                      | eu_ES         | Basque                |
| 56                      | gl_ES         | Galician              |
| 57                      | mn_MN         | Mongolian             |
| 58                      | kn_IN         | Kannada               |
| 59                      | ne_NP         | Nepali                |
| 60                      | sw_KE         | Swahili               |
| 61                      | si_LK         | Sinhala               |
| 62                      | mr_IN         | Marathi               |
| 63                      | af_ZA         | Afrikaans             |
| 64                      | gu_IN         | Gujarati              |
| 65                      | cy_GB         | Welsh                 |
| 66                      | eo_EO         | Esperanto             |
| 67                      | km_KH         | Central Khmer         |
| 68                      | ky_KG         | Kirghiz               |
| 69                      | uz_UZ         | Uzbek                 |
| 70                      | ps_AF         | Pashto                |
| 71                      | pa_IN         | Punjabi               |
| 72                      | ga_IE         | Irish                 |
| 73                      | ha_NG         | Hausa                 |
| 74                      | am_ET         | Amharic               |
| 75                      | lo_LA         | Lao                   |
| 76                      | ku_TR         | Kurdish               |
| 77                      | so_SO         | Somali                |
| 78                      | my_MM         | Burmese               |
| 79                      | or_IN         | Oriya                 |
| 80                      | sa_IN         | Sanskrit              |
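
The table above also works as a lookup from language code to adapter index, which is useful when a batch mixes languages. The following is a minimal sketch in plain Python, not part of the model's own code: `ADAPTER_CODES`, `CODE_TO_LANG_ID`, and `lang_ids_for_batch` are hypothetical names introduced here, and the list holds only the first nine rows of the table for brevity.

```python
# Excerpt of the adapter table (position in the list = lang_id).
# Hypothetical helper data: extend with the remaining rows as needed.
ADAPTER_CODES = [
    "en_XX", "id_ID", "vi_VN", "ru_RU", "fa_IR",
    "sv_SE", "ja_XX", "fr_XX", "de_DE",
]

# Map each language code to its adapter index.
CODE_TO_LANG_ID = {code: idx for idx, code in enumerate(ADAPTER_CODES)}


def lang_ids_for_batch(codes):
    """Return the adapter index for every language code in a batch.

    Raises KeyError if a code is not in the (excerpted) table, so that a
    typo does not silently select the wrong adapter.
    """
    missing = sorted({c for c in codes if c not in CODE_TO_LANG_ID})
    if missing:
        raise KeyError(f"No adapter listed for: {missing}")
    return [CODE_TO_LANG_ID[c] for c in codes]


# One entry per example in the batch: English, German, French.
print(lang_ids_for_batch(["en_XX", "de_DE", "fr_XX"]))  # [0, 8, 7]
```

If your version of the library accepts per-example adapter indices in the forward pass (an assumption to check against its documentation), a list like this could be converted to a tensor and passed alongside the tokenized inputs instead of relying on a single default language.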