---
language:
- hi
tags:
- ner
---

# NER in Hindi
## muril_base_cased_hindi_ner

The base model is [google/muril-base-cased](https://huggingface.co/google/muril-base-cased), a BERT model pre-trained on 17 Indian languages and their transliterated counterparts.
The Hindi NER data comes from [HiNER](https://github.com/cfiltnlp/HiNER).
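
For convenience, here is a minimal sketch of loading the HiNER data with the `datasets` library; the Hub ID `cfilt/HiNER-original` and the field layout are assumptions, so treat the HiNER repository as the authoritative source.

```python
from datasets import load_dataset

# Assumed Hub ID; see the HiNER GitHub repository for official download instructions.
hiner = load_dataset("cfilt/HiNER-original")

# Inspect one training example (field names may differ across HiNER variants).
print(hiner["train"][0])
```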

## Usage
### Example
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("MichaelHuang/muril_base_cased_hindi_ner")
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")

# Define the labels dictionary
labels_dict = {
    0: "B-FESTIVAL",
    1: "B-GAME",
    2: "B-LANGUAGE",
    3: "B-LITERATURE",
    4: "B-LOCATION",
    5: "B-MISC",
    6: "B-NUMEX",
    7: "B-ORGANIZATION",
    8: "B-PERSON",
    9: "B-RELIGION",
    10: "B-TIMEX",
    11: "I-FESTIVAL",
    12: "I-GAME",
    13: "I-LANGUAGE",
    14: "I-LITERATURE",
    15: "I-LOCATION",
    16: "I-MISC",
    17: "I-NUMEX",
    18: "I-ORGANIZATION",
    19: "I-PERSON",
    20: "I-RELIGION",
    21: "I-TIMEX",
    22: "O"
}

def ner_predict(sentence, model, tokenizer, labels_dict):
    # Tokenize the input sentence
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted labels
    predicted_labels = torch.argmax(outputs.logits, dim=2)

    # Convert tokens and labels to lists
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = predicted_labels.squeeze().tolist()

    # Map numeric labels to string labels
    predicted_labels = [labels_dict[label] for label in labels]

    # Combine tokens and labels
    result = list(zip(tokens, predicted_labels))

    return result

test_sentence = "अकबर ईद पर टेनिस खेलता है"
predictions = ner_predict(test_sentence, model, tokenizer, labels_dict)

for token, label in predictions:
    print(f"{token}: {label}")
```
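
The example above prints one prediction per WordPiece, including the `[CLS]` and `[SEP]` special tokens. If word-level output is preferred, the pieces can be merged back together; the sketch below is a simple post-processing helper (not part of the original example) that assumes the standard BERT-style `##` continuation prefix and keeps the first piece's label for each word. It reuses `predictions` from the snippet above.

```python
def merge_subwords(token_label_pairs):
    """Collapse WordPiece tokens back into whole words, dropping special tokens."""
    words = []
    for token, label in token_label_pairs:
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # skip special tokens added by the tokenizer
        if token.startswith("##") and words:
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)  # attach continuation piece
        else:
            words.append((token, label))
    return words

for word, label in merge_subwords(predictions):
    print(f"{word}: {label}")
```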

### Eval results

| eval_loss | eval_precision | eval_recall | eval_f1 | eval_accuracy | epoch |
|:---------:|:--------------:|:-----------:|:-------:|:-------------:|:-----:|
| 0.11      | 0.87           | 0.89        | 0.88    | 0.97          | 3.0   |
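
These are the metrics reported at the end of fine-tuning. As a rough sketch (not the original evaluation script), comparable entity-level precision/recall/F1 numbers can be computed with the `seqeval` metric from the `evaluate` library, given gold and predicted tag sequences over the label set above:

```python
import evaluate

seqeval = evaluate.load("seqeval")

# Toy gold/predicted sequences using the tag set from labels_dict above.
references = [["B-PERSON", "O", "B-GAME", "O", "O"]]
predictions = [["B-PERSON", "O", "B-LOCATION", "O", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
```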