Update README.md

README.md CHANGED
@@ -30,19 +30,11 @@ The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Ka
 ## Training Corpus
 Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.
 
-## Evaluation Results
-Benchmarking on our test set.
-
-Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
-F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88
-
-The first five languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences each. The next three (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences. The final three (as, or, pa) have mined, projected test sets that were not reviewed by humans.
-
-
 ## Downloads
 Download from this same Hugging Face repo.
 
 ## Usage
 
 You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
@@ -52,13 +44,16 @@ You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-P
 
 If you are using IndicNER, please cite the following article:
 ```
-@misc{
-
-
-
-
-
-
 ```
 We would like to hear from you if:
 
@@ -71,22 +66,21 @@ We would like to hear from you if:
 
 The IndicNER code (and models) are released under the MIT License.
 
-
-
 <!-- Contributors -->
 ## Contributors
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
-
-- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
-- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 
-This work is the outcome of a volunteer effort as part of [AI4Bharat initiative](https://ai4bharat.
 
 
 <!-- Contact -->
 ## Contact
 - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
-
-
 ## Training Corpus
 Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.
 
 ## Downloads
 Download from this same Hugging Face repo.
 
+Update (20 Dec 2022): We have released a new paper documenting IndicNER and Naamapadam. The paper reports a different model, and we will update this repo with it soon.
+
 ## Usage
 
 You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
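The notebook covers model loading and inference end-to-end; the post-processing step behind any such NER pipeline can be sketched without the model itself. The following is an illustration only, not code from this repo: it assumes the model emits token-level BIO tags (the `B-`/`I-` prefix scheme with labels such as `PER` and `LOC` is a standard convention, not confirmed here) and groups them into entity spans.

```python
# Minimal sketch: collapse token-level BIO tags, as a token-classification
# NER model typically emits them, into (label, phrase) spans.
# Tag names like B-PER / I-PER / B-LOC are illustrative assumptions.

def decode_bio(tokens, tags):
    """Group a BIO-tagged token sequence into (label, phrase) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity begins at this token
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:  # "O", or an I- tag that does not match the open entity
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:  # flush an entity that runs to the end
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = ["रतन", "टाटा", "मुंबई", "में", "रहते", "हैं"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
print(decode_bio(tokens, tags))  # [('PER', 'रतन टाटा'), ('LOC', 'मुंबई')]
```

The same grouping is what aggregation utilities in token-classification pipelines perform internally; writing it out makes the handling of consecutive entities and dangling `I-` tags explicit.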
 
 If you are using IndicNER, please cite the following article:
 ```
+@misc{mhaske2022naamapadam,
+  doi = {10.48550/ARXIV.2212.10168},
+  url = {https://arxiv.org/abs/2212.10168},
+  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
+  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
 ```
 We would like to hear from you if:
 
 
 The IndicNER code (and models) are released under the MIT License.
 
 <!-- Contributors -->
 ## Contributors
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+- Sumanth Doddapaneni <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
+- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
+- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 
+This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.iitm.ac.in).
 
 
 <!-- Contact -->
 ## Contact
 - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
+- Rudra Murthy V ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))