license: cc-by-nc-4.0
datasets:
- slone/nllb-200-10M-sample
pipeline_tag: translation
language:
- ak # aka_Latn Akan
- am # amh_Ethi Amharic
- ar # arb_Arab Modern Standard Arabic
- awa # awa_Deva Awadhi
- azj # azj_Latn North Azerbaijani
- bm # bam_Latn Bambara
- ban # ban_Latn Balinese
- be # bel_Cyrl Belarusian
- bem # bem_Latn Bemba
- bn # ben_Beng Bengali
- bho # bho_Deva Bhojpuri
- bjn # bjn_Latn Banjar (Latin script)
- bug # bug_Latn Buginese
- bg # bul_Cyrl Bulgarian
- ca # cat_Latn Catalan
- ceb # ceb_Latn Cebuano
- cs # ces_Latn Czech
- cjk # cjk_Latn Chokwe
- ckb # ckb_Arab Central Kurdish
- crh # crh_Latn Crimean Tatar
- da # dan_Latn Danish
- de # deu_Latn German
- dik # dik_Latn Southwestern Dinka
- dyu # dyu_Latn Dyula
- el # ell_Grek Greek
- en # eng_Latn English
- eo # epo_Latn Esperanto
- et # est_Latn Estonian
- ee # ewe_Latn Ewe
- fo # fao_Latn Faroese
- fj # fij_Latn Fijian
- fi # fin_Latn Finnish
- fon # fon_Latn Fon
- fr # fra_Latn French
- fur # fur_Latn Friulian
- ff # fuv_Latn Nigerian Fulfulde
- gaz # gaz_Latn West Central Oromo
- gd # gla_Latn Scottish Gaelic
- ga # gle_Latn Irish
- gl # glg_Latn Galician
- gn # grn_Latn Guarani
- gu # guj_Gujr Gujarati
- ht # hat_Latn Haitian Creole
- ha # hau_Latn Hausa
- he # heb_Hebr Hebrew
- hi # hin_Deva Hindi
- hne # hne_Deva Chhattisgarhi
- hr # hrv_Latn Croatian
- hu # hun_Latn Hungarian
- hy # hye_Armn Armenian
- ig # ibo_Latn Igbo
- ilo # ilo_Latn Ilocano
- id # ind_Latn Indonesian
- is # isl_Latn Icelandic
- it # ita_Latn Italian
- jv # jav_Latn Javanese
- ja # jpn_Jpan Japanese
- kab # kab_Latn Kabyle
- kac # kac_Latn Jingpho
- kam # kam_Latn Kamba
- kn # kan_Knda Kannada
- ks # kas_Arab Kashmiri (Arabic script)
- ks # kas_Deva Kashmiri (Devanagari script)
- ka # kat_Geor Georgian
- kk # kaz_Cyrl Kazakh
- kbp # kbp_Latn Kabiyè
- kea # kea_Latn Kabuverdianu
- mn # khk_Cyrl Halh Mongolian
- km # khm_Khmr Khmer
- ki # kik_Latn Kikuyu
- rw # kin_Latn Kinyarwanda
- ky # kir_Cyrl Kyrgyz
- kmb # kmb_Latn Kimbundu
- kmr # kmr_Latn Northern Kurdish
- kr # knc_Arab Central Kanuri (Arabic script)
- kr # knc_Latn Central Kanuri (Latin script)
- kg # kon_Latn Kikongo
- ko # kor_Hang Korean
- lo # lao_Laoo Lao
- lij # lij_Latn Ligurian
- li # lim_Latn Limburgish
- ln # lin_Latn Lingala
- lt # lit_Latn Lithuanian
- lmo # lmo_Latn Lombard
- ltg # ltg_Latn Latgalian
- lb # ltz_Latn Luxembourgish
- lua # lua_Latn Luba-Kasai
- lg # lug_Latn Ganda
- luo # luo_Latn Luo
- lus # lus_Latn Mizo
- lv # lvs_Latn Standard Latvian
- mag # mag_Deva Magahi
- mai # mai_Deva Maithili
- ml # mal_Mlym Malayalam
- mr # mar_Deva Marathi
- min # min_Latn Minangkabau (Latin script)
- mk # mkd_Cyrl Macedonian
- mt # mlt_Latn Maltese
- mni # mni_Beng Meitei (Bengali script)
- mos # mos_Latn Mossi
- mi # mri_Latn Maori
- my # mya_Mymr Burmese
- nl # nld_Latn Dutch
- nb # nob_Latn Norwegian Bokmål
- ne # npi_Deva Nepali
- nso # nso_Latn Northern Sotho
- nus # nus_Latn Nuer
- ny # nya_Latn Nyanja
- oc # oci_Latn Occitan
- ory # ory_Orya Odia
- pag # pag_Latn Pangasinan
- pa # pan_Guru Eastern Panjabi
- pap # pap_Latn Papiamento
- pbt # pbt_Arab Southern Pashto
- fa # pes_Arab Western Persian
- plt # plt_Latn Plateau Malagasy
- pl # pol_Latn Polish
- pt # por_Latn Portuguese
- prs # prs_Arab Dari
- qu # quy_Latn Ayacucho Quechua
- ro # ron_Latn Romanian
- rn # run_Latn Rundi
- ru # rus_Cyrl Russian
- sg # sag_Latn Sango
- sa # san_Deva Sanskrit
- sat # sat_Beng Santali
- scn # scn_Latn Sicilian
- shn # shn_Mymr Shan
- si # sin_Sinh Sinhala
- sk # slk_Latn Slovak
- sl # slv_Latn Slovenian
- sm # smo_Latn Samoan
- sn # sna_Latn Shona
- sd # snd_Arab Sindhi
- so # som_Latn Somali
- st # sot_Latn Southern Sotho
- es # spa_Latn Spanish
- sc # srd_Latn Sardinian
- sr # srp_Cyrl Serbian
- ss # ssw_Latn Swati
- su # sun_Latn Sundanese
- sv # swe_Latn Swedish
- sw # swh_Latn Swahili
- szl # szl_Latn Silesian
- ta # tam_Taml Tamil
- taq # taq_Latn Tamasheq (Latin script)
- tt # tat_Cyrl Tatar
- te # tel_Telu Telugu
- tg # tgk_Cyrl Tajik
- tl # tgl_Latn Tagalog
- ti # tir_Ethi Tigrinya
- tpi # tpi_Latn Tok Pisin
- tn # tsn_Latn Tswana
- ts # tso_Latn Tsonga
- tk # tuk_Latn Turkmen
- tum # tum_Latn Tumbuka
- tr # tur_Latn Turkish
- tw # twi_Latn Twi
- tzm # tzm_Tfng Central Atlas Tamazight
- ug # uig_Arab Uyghur
- uk # ukr_Cyrl Ukrainian
- umb # umb_Latn Umbundu
- ur # urd_Arab Urdu
- uz # uzn_Latn Northern Uzbek
- vec # vec_Latn Venetian
- vi # vie_Latn Vietnamese
- war # war_Latn Waray
- wo # wol_Latn Wolof
- xh # xho_Latn Xhosa
- yi # ydd_Hebr Eastern Yiddish
- yo # yor_Latn Yoruba
- zh # zho_Hans Chinese (Simplified)
- zh # zho_Hant Chinese (Traditional)
- ms # zsm_Latn Standard Malay
- zu # zul_Latn Zulu
This model is a truncated version of the [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model
(6 layers instead of 12, 512 hidden dimensions instead of 1024) with 175M parameters (131M of which are token embeddings).
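The embedding share quoted above can be checked with a little arithmetic. The sketch below assumes that the vocabulary is the unchanged 256,206-entry NLLB-200 vocabulary and that the embedding matrix is counted once (shared between input and output); neither assumption is stated in this card.

```python
# Rough check of how much of the model is token embeddings.
VOCAB_SIZE = 256_206    # assumption: original NLLB-200 vocabulary size
HIDDEN_DIM = 512        # from this card
TOTAL_PARAMS = 175e6    # from this card (rounded)

# One embedding vector of HIDDEN_DIM floats per vocabulary entry.
embedding_params = VOCAB_SIZE * HIDDEN_DIM
non_embedding_params = TOTAL_PARAMS - embedding_params

print(f"embedding parameters: {embedding_params / 1e6:.1f}M")
print(f"share of total:       {embedding_params / TOTAL_PARAMS:.0%}")
print(f"everything else:      {non_embedding_params / 1e6:.1f}M")
```

Under these assumptions roughly three quarters of the parameters sit in the embedding table, which is why shrinking the vocabulary has such a large effect on model size.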
It was fine-tuned on the [slone/nllb-200-10M-sample](https://huggingface.co/datasets/slone/nllb-200-10M-sample) subset of
the [NLLB dataset](https://huggingface.co/datasets/allenai/nllb) with 175 languages, using only the samples with a BLASER score above 3.5.
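The quality filter described above amounts to a simple threshold on a per-sample score. The following is a minimal sketch on toy data; the field name `blaser_sim` and the sample layout are hypothetical and may not match the actual columns of the dataset.

```python
BLASER_THRESHOLD = 3.5  # threshold from this card


def keep_high_quality(samples, score_field="blaser_sim", threshold=BLASER_THRESHOLD):
    """Yield only the samples whose score is strictly above the threshold.

    `score_field` is a hypothetical column name used for illustration.
    Samples without a score are dropped.
    """
    for sample in samples:
        score = sample.get(score_field)
        if score is not None and score > threshold:
            yield sample


# Toy sentence pairs with made-up scores.
toy_samples = [
    {"src": "Hello", "tgt": "Bonjour", "blaser_sim": 4.2},
    {"src": "Cat", "tgt": "Voiture", "blaser_sim": 2.1},
    {"src": "Thanks", "tgt": "Merci", "blaser_sim": 3.5},  # not strictly above
    {"src": "Yes", "tgt": "Oui"},                          # no score
]

filtered = list(keep_high_quality(toy_samples))
print(len(filtered), "of", len(toy_samples), "samples kept")
```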
Because of its small size, its translation quality is poor, but it can serve as a base model for further fine-tuning on a small number of languages.
Before fine-tuning, it is recommended to [prune the vocabulary of this model](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90)
so that only the tokens used by the intended languages are kept.
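The idea behind vocabulary pruning can be sketched on toy data: collect the tokens that actually occur in a corpus of the target languages, then keep only the matching rows of the embedding matrix. This is an illustration of the principle, not the procedure from the linked article; the function name, the whitespace tokenizer, and the toy vocabulary are all hypothetical.

```python
def prune_vocabulary(vocab, embeddings, corpus, tokenize, always_keep=()):
    """Return (new_vocab, new_embeddings) restricted to tokens seen in `corpus`.

    vocab:       dict mapping token -> row index in `embeddings`
    embeddings:  list of rows (one vector per token)
    tokenize:    callable turning a text into a list of tokens
    always_keep: tokens retained even if unseen (special tokens, language codes)
    """
    used = {tok for tok in always_keep if tok in vocab}
    for text in corpus:
        used.update(tok for tok in tokenize(text) if tok in vocab)
    # Preserve the original relative order of the surviving tokens.
    kept = sorted(used, key=vocab.__getitem__)
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy vocabulary; each embedding row is filled with its old id for easy checking.
vocab = {"<pad>": 0, "<unk>": 1, "eng_Latn": 2, "fra_Latn": 3, "deu_Latn": 4,
         "hello": 5, "bonjour": 6, "hallo": 7, "world": 8}
embeddings = [[float(i)] * 4 for i in range(len(vocab))]

new_vocab, new_embeddings = prune_vocabulary(
    vocab, embeddings,
    corpus=["hello world", "bonjour"],
    tokenize=str.split,  # stand-in for the real subword tokenizer
    always_keep=("<pad>", "<unk>", "eng_Latn", "fra_Latn"),
)
print(f"kept {len(new_vocab)} of {len(vocab)} tokens")
```

With a real model, the tokenizer's own vocabulary file and any output projection tied to the embeddings would need to be updated consistently with the new ids.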