---
license: apache-2.0
datasets:
- daniel-scalioni/pileNER
pipeline_tag: feature-extraction
---

# GLiNER-MoE-MultiLingual: A Zero-Shot Multilingual NER Model with MoE Architecture

This repository provides **GLiNER-MoE-MultiLingual**, a zero-shot Named Entity Recognition (NER) model trained for **one epoch** using the **Mixture of Experts (MoE)** architecture from NOMIC-MOE. GLiNER-MoE-MultiLingual aims to handle zero-shot **multilingual** NER tasks across various domains.

It is inspired by my work documented in this [Medium article](https://medium.com/@mayankrakesh1/divide-specialize-and-conquer-my-ideas-on-how-moe-meetscontrastive-learning-in-nlp-part-1-8379803220d0).

---

## Overview

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) where the goal is to detect and classify named entities in text into predefined categories such as persons, locations, organizations, and more. **GLiNER** is designed to:

- Perform NER in a **zero-shot** setting (i.e., it can handle languages and domains it was not explicitly fine-tuned on).
- Leverage a **Mixture of Experts (MoE)** architecture for improved generalization across languages and domains.
- Serve as a single checkpoint for handling multiple languages, reducing the overhead of training separate models.

---

## Features

- **Zero-shot Multilingual Support**: Handle NER for many languages without separate fine-tuning.
- **Domain Agnostic**: The model can generalize across diverse domains (news, biomedical, social media, etc.).
- **Lightweight Training**: Trained for only **one epoch**, demonstrating the efficiency of the MoE approach.
- **Dataset**: Multilingual samples were generated from PileNER samples using machine translation.
- **Easy Integration**: Built on top of standard NLP frameworks (e.g., [Hugging Face Transformers](https://github.com/huggingface/transformers)) for quick integration into your pipeline.

---

## Supported Languages

Here is the complete list of **supported languages** along with their ISO codes:

| Code | Language | Code | Language | Code | Language | Code | Language |
|------|--------------|------|--------------|------|--------------|------|--------------|
| en | English | be | Belarusian | ml | Malayalam | mk | Macedonian |
| es | Spanish | kn | Kannada | ur | Urdu | fy | Frisian |
| fr | French | fi | Filipino | te | Telugu | eu | Basque |
| de | German | sw | Swahili | so | Somali | sd | Sindhi |
| it | Italian | uz | Uzbek | co | Corsican | hr | Croatian |
| pt | Portuguese | gu | Gujarati | hi-Latn | Hindi (Latin) | ceb | Cebuano |
| pl | Polish | eo | Esperanto | jv | Javanese | la | Latin |
| nl | Dutch | zu | Zulu | mn | Mongolian | si | Sinhala |
| tr | Turkish | el-Latn | Greek (Latin) | ga | Irish | ky | Kyrgyz |
| ja | Japanese | tg | Tajik | my | Burmese | km | Khmer |
| vi | Vietnamese | mg | Malagasy | pa | Punjabi | ru-Latn | Russian (Latin) |
| ru | Russian | zh-Latn | Chinese (Latin) | ha | Hausa | he | Hebrew |
| id | Indonesian | hm | Hmong | ht | Haitian | ja-Latn | Japanese (Latin) |
| ar | Arabic | su | Sundanese | bg-Latn | Bulgarian (Latin) | gd | Scots Gaelic |
| cs | Czech | ny | Nyanja | ps | Pashto | ku | Kurdish |
| ro | Romanian | sh | Serbo-Croatian | am | Amharic | ig | Igbo |
| sv | Swedish | lo | Lao | mi | Maori | nn | Norwegian Nynorsk |
| el | Greek | sm | Samoan | st | Sotho | tl | Tagalog |
| uk | Ukrainian | xh | Xhosa | yo | Yoruba | bn | Bengali |
| zh | Chinese | ko | Korean | fa | Persian | ms | Malay |
| hu | Hungarian | sl | Slovenian | lv | Latvian | mr | Marathi |
| da | Danish | no | Norwegian | hi | Hindi | fi | Finnish |
| lt | Lithuanian | ca | Catalan | cy | Welsh | bg | Bulgarian |

This list has 92 entries (including six Latin-script variants), making **GLiNER-MoE-MultiLingual** a highly versatile **zero-shot multilingual NER** model.
🚀

## Model Architecture

The **Mixture of Experts (MoE)** approach splits the model into several “experts,” each of which specializes in a subset of the input space. During inference, the MoE layer routes each token (or hidden state) to the most relevant expert(s). This helps in handling diverse languages and domains under a single unified model.

---

## Performance

GLiNER’s zero-shot performance has been evaluated on the following standard NER benchmarks across multiple domains:

| Dataset | F1 Score |
|----------------------|---------:|
| ACE 2004 | 26.2% |
| ACE 2005 | 22.5% |
| AnatEM | 31.9% |
| Broad Tweet Corpus | 65.1% |
| CoNLL 2003 | 61.5% |
| FabNER | 22.4% |
| FindVehicle | 10.6% |
| GENIA_NER | 45.1% |
| HarveyNER | 3.7% |
| MultiNERD | 60.6% |
| Ontonotes | 26.0% |
| PolyglotNER | 43.1% |
| TweetNER7 | 37.4% |
| WikiANN en | 54.7% |
| WikiNeural | 75.4% |
| bc2gm | 54.8% |
| bc4chemd | 45.0% |
| bc5cdr | 68.2% |
| ncbi | 62.9% |
| **Average** | **43.0%** |

---

## Usage

### Installation

Use this fork of the original GLiNER repository, which adds MoE support:

```bash
# Clone the fork and install its dependencies
git clone https://github.com/mayank-rakesh-mck/GLiNER.git
cd GLiNER
pip install -r requirements.txt
```

### Inference with Transformers Pipeline

```python
import json

import torch

# Assumes the cloned GLiNER folder sits in the current working directory
from GLiNER.gliner import GLiNERConfig, GLiNER

# Use the GPU when available, otherwise fall back to CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Build the model from its config file
with open('gliner_config.json') as f:
    config = json.load(f)
model_config = GLiNERConfig(**config)
model = GLiNER(model_config)

# Load the trained weights into the underlying model
state_dict = torch.load('pytorch_model.bin', map_location=device, weights_only=True)
model.model.load_state_dict(state_dict, strict=True)
model = model.to(device)

# Language: English
# Sample text for entity prediction
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team.
Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

# Labels for entity prediction
# Most GLiNER models should work best when entity types are in lower case or title case
labels = ["Person", "Award", "Date", "Competitions", "Teams"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.2)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

```
Cristiano Ronaldo => Person
5 February 1985 => Date
Al Nassr => Teams
Portugal national team => Teams
Ballon d'Or => Award
UEFA Men's Player of the Year Awards => Award
European Golden Shoes => Competitions
UEFA Champions Leagues => Competitions
UEFA European Championship => Competitions
UEFA Nations League => Competitions
Champions League => Competitions
European Championship => Competitions
international appearances => Award
```

---

## Examples

Example usage in a Jupyter notebook cell:

```python
# Language: Armenian
text = """
Կրիշտիանու Ռոնալդու դոս Սանտոս Ավեյրո (պորտուգալերեն արտասանություն՝ [kɾiʃˈtjɐnu ʁɔˈnaldu], ծնված 1985 թվականի փետրվարի 5-ին) պորտուգալացի պրոֆեսիոնալ ֆուտբոլիստ է, ով խաղում է Սաուդյան Արաբիայի Պրոֆեսիոնալ լիգայի և Պորտուգալիայի
ազգային հավաքականի հարձակվող և ավագը։ Լայնորեն համարվում է բոլոր ժամանակների լավագույն խաղացողներից մեկը՝ Ռոնալդուն արժանացել է «Ոսկե գնդակի» հինգ մրցանակների, [նշում 3]՝ ռեկորդային երեք՝ ՈւԵՖԱ-ի տարվա լավագույն խաղացողի մրցանակի և Եվրոպայի չորս «Ոսկե խաղակոշիկի»՝ ամենաշատը եվրոպացի խաղացողների կողմից: Նա իր կարիերայի ընթացքում նվաճել է 33 գավաթ, այդ թվում՝ յոթ լիգայի տիտղոս, ՈՒԵՖԱ-ի հինգ Չեմպիոնների լիգա, ՈՒԵՖԱ-ի Եվրոպայի առաջնություն և ՈՒԵՖԱ-ի Ազգերի լիգա: Ռոնալդուն ռեկորդներ ունի Չեմպիոնների լիգայում ամենաշատ խաղերի (183), գոլերի (140) և գոլային փոխանցման (42), Եվրոպայի առաջնությունում (14), միջազգային գոլերի (128) և միջազգային խաղերի (205) ռեկորդների քանակով: Նա այն սակավաթիվ խաղացողներից է, ով անցկացրել է ավելի քան 1200 պրոֆեսիոնալ կարիերա, որոնցից ամենաշատը խաղադաշտ դուրս է եկել, և ավելի քան 850 գոլ է խփել ակումբի և երկրի գլխավոր կարիերայի ընթացքում՝ դարձնելով նրան բոլոր ժամանակների լավագույն ռմբարկուն:"""

# Labels for entity prediction
# Most GLiNER models should work best when entity types are in lower case or title case
labels = ["Person", "Award", "Date", "Competitions", "Teams"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.2)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

```
Կրիշտիանու Ռոնալդու դոս Սանտոս Ավեյրո => Person
1985 => Date
փետրվարի 5-ին => Date
Սաուդյան Արաբիայի Պրոֆեսիոնալ լիգայի => Teams
Պորտուգալիայի ազգային հավաքականի => Teams
Ոսկե գնդակի => Award
Ոսկե խաղակոշիկի => Award
յոթ լիգայի տիտղոս => Award
ՈՒԵՖԱ-ի հինգ Չեմպիոնների լիգա => Competitions
ՈՒԵՖԱ-ի Ազգերի լիգա => Competitions
Ռոնալդուն => Person
Չեմպիոնների լիգայում => Competitions
Եվրոպայի առաջնությունում => Competitions
միջազգային խաղերի => Competitions
```

```python
# Language: Spanish
text = """
Cristiano Ronaldo dos Santos Aveiro (pronunciación portuguesa: [kɾiʃˈtjɐnu ʁɔˈnaldu]; nacido el 5 de febrero de 1985) es un futbolista profesional portugués que juega como
delantero y capitán tanto del club Al Nassr de la Saudi Pro League como de la selección nacional de Portugal. Ampliamente considerado como uno de los mejores jugadores de todos los tiempos, Ronaldo ha ganado cinco premios Balón de Oro, un récord de tres premios al Jugador del Año de la UEFA y cuatro Botas de Oro europeas, la mayor cantidad para un jugador europeo. Ha ganado 33 trofeos en su carrera, incluidos siete títulos de liga, cinco Ligas de Campeones de la UEFA, el Campeonato de Europa de la UEFA y la Liga de Naciones de la UEFA. Ronaldo tiene los récords de más apariciones (183), goles (140) y asistencias (42) en la Liga de Campeones, goles en la Eurocopa (14), goles internacionales (128) y apariciones internacionales (205). Es uno de los pocos jugadores que ha disputado más de 1.200 apariciones en su carrera profesional, la mayor cantidad para un jugador de campo, y ha marcado más de 850 goles oficiales en su carrera absoluta para su club y su país, lo que lo convierte en el máximo goleador de todos los tiempos.
"""

# Labels for entity prediction
# Most GLiNER models should work best when entity types are in lower case or title case
labels = ["Person", "Award", "Date", "Competitions", "Teams"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.2)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

```
Cristiano Ronaldo => Person
5 de febrero de 1985 => Date
Al Nassr => Teams
Saudi Pro League => Teams
Balón de Oro => Award
Jugador del Año => Award
Botas de Oro => Award
títulos de liga => Competitions
Ligas de Campeones => Competitions
Campeonato de Europa => Competitions
Liga de Naciones => Competitions
Liga de Campeones => Competitions
Eurocopa => Competitions
```

---

## Citation

```
@misc {mayank_rakesh_2025,
    author = { {Mayank Rakesh} },
    title = { GLiNER-MoE-MultiLingual (Revision 3ba1ed0) },
    year = 2025,
    url = { https://huggingface.co/Mayank6255/GLiNER-MoE-MultiLingual },
    doi = { 10.57967/hf/4502 },
    publisher = { Hugging Face }
}
```

## References

```
@inproceedings{zaratiana-etal-2024-gliner,
    title = "{GL}i{NER}: Generalist Model for Named Entity Recognition using Bidirectional Transformer",
    author = "Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.300",
    doi = "10.18653/v1/2024.naacl-long.300",
    pages = "5364--5376",
    abstract = "Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types.
In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrate strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks.",
}

@misc{nussbaum2025trainingsparsemixtureexperts,
    title={Training Sparse Mixture Of Experts Text Embedding Models},
    author={Zach Nussbaum and Brandon Duderstadt},
    year={2025},
    eprint={2502.07972},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.07972},
}
```

---
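
The token routing described in the Model Architecture section can be illustrated with a small, self-contained sketch. This is not the NOMIC-MOE or GLiNER implementation: the number of experts, the `top_k` value, the toy experts, and the function names (`softmax`, `route_token`) are all assumptions chosen for illustration. A router scores every expert for a token, the `top_k` highest-scoring experts are kept, and their outputs are mixed using the renormalized gate weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs.

    hidden: list of floats, the token's hidden state
    router_weights: one weight vector per expert (hypothetical linear router)
    experts: list of callables, each mapping a vector to a vector
    """
    # Router logit per expert: dot product of the hidden state and that expert's router vector
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    probs = softmax(logits)

    # Keep only the top_k experts with the highest gate probability
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize the kept probabilities so the gates sum to 1
    norm = sum(probs[i] for i in chosen)
    gates = {i: probs[i] / norm for i in chosen}

    # Weighted sum of the selected experts' outputs
    mixed = [0.0] * len(hidden)
    for i, gate in gates.items():
        expert_out = experts[i](hidden)
        mixed = [m + gate * o for m, o in zip(mixed, expert_out)]
    return mixed, gates


# Toy experts (hypothetical): each one scales the input by a different constant
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
# Toy router weights (hypothetical): one 2-dimensional vector per expert
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, gates = route_token([2.0, 1.0], router_weights, experts, top_k=2)
print(sorted(gates))  # indices of the selected experts
print(output)         # gate-weighted mix of the two selected experts' outputs
```

Only the selected experts run for a given token, which is why a sparse MoE layer can hold many specialized experts while keeping the per-token compute close to that of a much smaller dense layer.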