Update README.md

README.md CHANGED
@@ -30,19 +30,11 @@ The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Ka
 ## Training Corpus
 Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.
 
-## Evaluation Results
-Benchmarking on our test set.
-
-Language | bn | hi | kn | ml | mr | gu | ta | te | as | or | pa
------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
-F1 score | 79.75 | 82.33 | 80.01 | 80.73 | 80.51 | 73.82 | 80.98 | 80.88 | 62.50 | 27.05 | 74.88
-
-The first five languages (bn, hi, kn, ml, mr) have large human-annotated test sets of around 500-1000 sentences each. The next three (gu, ta, te) have smaller human-annotated test sets of only around 50 sentences. The final three (as, or, pa) have mined, projected test sets that were not reviewed by humans.
-
-
 ## Downloads
 Download from this same Hugging Face repo.
 
 ## Usage
 
 You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
@@ -52,13 +44,16 @@ You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-P
 
 If you are using IndicNER, please cite the following article:
 ```
-@misc{
-
-
-
-
-
-
 ```
 We would like to hear from you if:
 
@@ -71,22 +66,21 @@ We would like to hear from you if:
 
 The IndicNER code (and models) are released under the MIT License.
 
-
-
 <!-- Contributors -->
 ## Contributors
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
-
-- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
-- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 
-This work is the outcome of a volunteer effort as part of [AI4Bharat initiative](https://ai4bharat.
 
 
 <!-- Contact -->
 ## Contact
 - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
-
-
 ## Training Corpus
 Our model was trained on a [dataset](https://huggingface.co/datasets/ai4bharat/naamapadam) which we mined from the existing [Samanantar Corpus](https://huggingface.co/datasets/ai4bharat/samanantar). We used a bert-base-multilingual-uncased model as the starting point and then fine-tuned it on the NER dataset mentioned above.
 
 ## Downloads
 Download from this same Hugging Face repo.
 
+Update (20 Dec 2022): We have released a new paper documenting IndicNER and Naamapadam. The paper reports a different model, and we will update this repo with it soon.
+
 ## Usage
 
 You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples on using IndicNER or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
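The notebook covers model loading and inference end-to-end; the post-processing step behind any such NER pipeline can be sketched without the model itself. The following is an illustration only, not code from this repo: it assumes the model emits token-level BIO tags (the `B-`/`I-` prefix scheme with labels such as `PER` and `LOC` is a standard convention, not confirmed here) and groups them into entity spans.

```python
# Minimal sketch: collapse token-level BIO tags, as a token-classification
# NER model typically emits them, into (label, phrase) spans.
# Tag names like B-PER / I-PER / B-LOC are illustrative assumptions.

def decode_bio(tokens, tags):
    """Group a BIO-tagged token sequence into (label, phrase) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new entity begins at this token
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:  # "O", or an I- tag that does not match the open entity
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:  # flush an entity that runs to the end
        spans.append((current_label, " ".join(current_tokens)))
    return spans

tokens = ["रतन", "टाटा", "मुंबई", "में", "रहते", "हैं"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
print(decode_bio(tokens, tags))  # [('PER', 'रतन टाटा'), ('LOC', 'मुंबई')]
```

The same grouping is what aggregation utilities in token-classification pipelines perform internally; writing it out makes the handling of consecutive entities and dangling `I-` tags explicit.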
 
 If you are using IndicNER, please cite the following article:
 ```
+@misc{mhaske2022naamapadam,
+  doi = {10.48550/ARXIV.2212.10168},
+  url = {https://arxiv.org/abs/2212.10168},
+  author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
+  title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
 ```
 We would like to hear from you if:
 
 
 The IndicNER code (and models) are released under the MIT License.
 
 <!-- Contributors -->
 ## Contributors
 - Arnav Mhaske <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Harshit Kedia <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+- Sumanth Doddapaneni <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
 - Mitesh M. Khapra <sub> ([AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in)) </sub>
+- Pratyush Kumar <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
+- Rudra Murthy <sub> ([AI4Bharat](https://ai4bharat.org), [IBM](https://www.ibm.com)) </sub>
+- Anoop Kunchukuttan <sub> ([AI4Bharat](https://ai4bharat.org), [Microsoft](https://www.microsoft.com/en-in/), [IITM](https://www.iitm.ac.in)) </sub>
 
+This work is the outcome of a volunteer effort as part of the [AI4Bharat initiative](https://ai4bharat.iitm.ac.in).
 
 
 <!-- Contact -->
 ## Contact
 - Anoop Kunchukuttan ([[email protected]](mailto:[email protected]))
+- Rudra Murthy V ([rmurthyv@in.ibm.com](mailto:rmurthyv@in.ibm.com))