anoopk committed
Commit: b56cb5e
1 Parent(s): 6434e40

Update README.md

Files changed (1)
  1. README.md +19 -25
README.md CHANGED
@@ -30,19 +30,11 @@ The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Ka
  ## Training Corpus
  Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) that we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.
 
- ## Evaluation Results
- Benchmarking on our test set.
-
- Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
- ---------|----|----|----|----|----|----|----|----|----|----|----
- F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88
-
- The first 5 languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences. The next 3 (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences. The final 3 (as, or, pa) have projected test sets that were mined automatically, without human supervision.
-
  ## Downloads
  Download from this same Huggingface repo.
 
+ Update 20 Dec 2022: We have released a new paper documenting IndicNER and Naamapadam. The paper reports a different model; we will update this repo with that model soon.
+
  ## Usage
 
  You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
@@ -52,13 +44,16 @@ You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-P
 
  If you are using IndicNER, please cite the following article:
  ```
- @misc{mhaske2022indicner,
-   title={Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
-   author={Arnav Mhaske, Harshit Kedia, Rudramurthy V, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Khapra},
-   year={2022},
-   eprint={to be published soon},
- }
+ @misc{mhaske2022naamapadam,
+   doi = {10.48550/ARXIV.2212.10168},
+   url = {https://arxiv.org/abs/2212.10168},
+   author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
+   title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
+   publisher = {arXiv},
+   year = {2022},
+   copyright = {arXiv.org perpetual, non-exclusive license}
+ }
+
  ```
  We would like to hear from you if:
@@ -71,22 +66,21 @@ We would like to hear from you if:
 
  The IndicNER code (and models) are released under the MIT License.
 
-
-
  <!-- Contributors -->
  ## Contributors
  - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
  - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+ - Sumanth Doddapaneni <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
- - Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/)) </sub>
- - Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
- - Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
  - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+ - Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
+ - Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
+ - Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 
- This work is the outcome of a volunteer effort as part of [AI4Bharat initiative](https://ai4bharat.org).
+ This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.iitm.ac.in).
 
 
  <!-- Contact -->
  ## Contact
  - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
- - Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
- - Pratyush Kumar ([[email protected]](mailto:[email protected]))
+ - Rudra Murthy V ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))
 
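The Usage section above only links out to a Colab notebook. As a minimal sketch of what inference with IndicNER could look like through the Hugging Face `transformers` token-classification pipeline (the repo ID `ai4bharat/IndicNER`, the label names, and the example sentence are assumptions for illustration, not taken from this README):

```python
# Minimal sketch: NER inference with a BERT-style token-classification checkpoint.
# Assumption: the model is published as "ai4bharat/IndicNER"; verify the repo ID
# and the label set in the model's config.json before relying on this.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "ai4bharat/IndicNER"  # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges word-piece predictions back into whole words
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

for entity in ner("मैं दिल्ली में रहता हूँ"):  # example Hindi sentence: "I live in Delhi"
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
```

Grouping sub-word predictions with `aggregation_strategy="simple"` is usually what you want when comparing against word-level NER annotations.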
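The Training Corpus section describes starting from bert-base-multilingual-uncased and fine-tuning it on the Naamapadam NER data. A rough sketch of that recipe with the `datasets` and `transformers` libraries follows; the per-language config name `"hi"`, the `tokens`/`ner_tags` column names, and all hyperparameters are illustrative assumptions, not values taken from this repo or the paper.

```python
# Rough sketch: fine-tune mBERT for token classification on Naamapadam.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

# Assumption: the dataset exposes per-language configs with "tokens"/"ner_tags" columns.
dataset = load_dataset("ai4bharat/naamapadam", "hi")
label_names = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-uncased", num_labels=len(label_names))

def tokenize_and_align(batch):
    # Tokenize pre-split words; label only the first sub-token of each word and
    # mark the rest (and special tokens) with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous = None
        labels = []
        for word_id in enc.word_ids(batch_index=i):
            labels.append(-100 if word_id is None or word_id == previous else tags[word_id])
            previous = word_id
        enc["labels"].append(labels)
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indicner-hi", learning_rate=2e-5,
                           per_device_train_batch_size=16, num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The -100 labels on special tokens and non-initial sub-tokens are the conventional way to keep the cross-entropy loss aligned with word-level tags.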