|
--- |
|
language: |
|
- as |
|
- bn |
|
- gu |
|
- hi |
|
- kn |
|
- ml |
|
- mr |
|
- or |
|
- pa |
|
- ta |
|
- te |
|
license: mit |
|
datasets: |
|
- Samanantar |
|
tags: |
|
- ner |
|
- Pytorch |
|
- transformer |
|
- multilingual |
|
- nlp |
|
- indicnlp |
|
--- |
|
|
|
# IndicNER |
|
IndicNER is a model trained to complete the task of identifying named entities from sentences in Indian languages. Our model is specifically fine-tuned to the 11 Indian languages mentioned above over millions of sentences. The model is then benchmarked over a human annotated testset and multiple other publicly available Indian NER datasets. |
|
The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. |
|
|
|
## Training Corpus |
|
Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it to the NER dataset mentioned previously. |
|
|
|
## Downloads |
|
Download from this same Huggingface repo. |
|
|
|
Update 20 Dec 2022: We released a new paper documenting IndicNER and Naamapadam. We have a different model reported in the paper. We will update the repo here soon with this model. |
|
|
|
## Usage |
|
|
|
You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for finetuning a pre-trained model on Naampadam dataset to build your own NER models. |
|
|
|
<!-- citing information --> |
|
## Citing |
|
|
|
If you are using IndicNER, please cite the following article: |
|
``` |
|
@misc{mhaske2022naamapadam, |
|
doi = {10.48550/ARXIV.2212.10168}, |
|
url = {https://arxiv.org/abs/2212.10168}, |
|
author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop}, |
|
title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages} |
|
publisher = {arXiv}, |
|
year = {2022}, |
|
copyright = {arXiv.org perpetual, non-exclusive license} |
|
} |
|
|
|
``` |
|
We would like to hear from you if: |
|
|
|
- You are using our resources. Please let us know how you are putting these resources to use. |
|
- You have any feedback on these resources. |
|
|
|
|
|
<!-- License --> |
|
## License |
|
|
|
The IndicNER code (and models) are released under the MIT License. |
|
|
|
<!-- Contributors --> |
|
## Contributors |
|
- Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub> |
|
- Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub> |
|
- Sumanth Doddapaneni <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub> |
|
- Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub> |
|
- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub> |
|
- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com))</sub> |
|
- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub> |
|
|
|
This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.iitm.ac.in). |
|
|
|
|
|
<!-- Contact --> |
|
## Contact |
|
- Anoop Kunchukuttan ([[email protected]](mailto:[email protected])) |
|
- Rudra Murthy V ([[email protected]](mailto:[email protected])) |
|
|