---
license: agpl-3.0
language:
  - de
base_model:
  - deepset/gbert-base
pipeline_tag: token-classification
---

# MEDNER.DE: Medicinal Product Entity Recognition in German-Specific Contexts

Released in December 2024, MEDNER.DE is a German BERT language model based on deepset/gbert-base and further pretrained on a pharmacovigilance-related corpus of case summaries. It was then fine-tuned for Named Entity Recognition (NER) on an automatically annotated dataset to recognize medicinal products such as medications and vaccines.
In our paper, we describe the training procedure and demonstrate the model's superior performance compared to previous approaches.


## Overview

- **Model Name:** MEDNER.DE
- **Paper:** [https://...
- **Architecture:** MLM-based BERT Base
- **Language:** German
- **Supported Labels:** Medicinal Product


## How to Use

### Use a pipeline as a high-level helper

```python
from transformers import pipeline

# Load the NER pipeline
model = pipeline("ner", model="pei-germany/MEDNER-de-fp-gbert", aggregation_strategy="none")

# Input text
text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

# Get raw predictions and merge subword tokens ("##" pieces) into whole words
merged_predictions = []
current = None

for pred in model(text):
    if pred['word'].startswith("##"):
        if current:
            current['word'] += pred['word'][2:]
            current['end'] = pred['end']
            current['score'] = (current['score'] + pred['score']) / 2
    else:
        if current:
            merged_predictions.append(current)
        current = pred.copy()

if current:
    merged_predictions.append(current)

# Filter by confidence threshold and print
threshold = 0.5
filtered_predictions = [p for p in merged_predictions if p['score'] >= threshold]
for p in filtered_predictions:
    print(f"Entity: {p['entity']}, Word: {p['word']}, Score: {p['score']:.2f}, Start: {p['start']}, End: {p['end']}")
```
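The subword-merging step above can be exercised without downloading the model by feeding it mock pipeline output. The entity label names and scores below are hypothetical placeholders; real values come from the pipeline.

```python
# Mock pipeline output for the word "Ibuprofen", split into two subword tokens.
# Label names and scores are made up for illustration.
mock_preds = [
    {"entity": "B-MEDICINAL_PRODUCT", "word": "Ibu", "score": 0.98, "start": 0, "end": 3},
    {"entity": "I-MEDICINAL_PRODUCT", "word": "##profen", "score": 0.96, "start": 3, "end": 9},
]

merged = []
current = None
for pred in mock_preds:
    if pred["word"].startswith("##"):
        if current:
            # Append the subword (minus the "##" marker) and extend the span
            current["word"] += pred["word"][2:]
            current["end"] = pred["end"]
            current["score"] = (current["score"] + pred["score"]) / 2
    else:
        if current:
            merged.append(current)
        current = pred.copy()
if current:
    merged.append(current)

print(merged[0]["word"])  # Ibuprofen
```

The two tokens collapse into a single prediction spanning characters 0 to 9, with the score averaged across the pieces.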

### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("pei-germany/MEDNER-de-fp-gbert")
model = AutoModelForTokenClassification.from_pretrained("pei-germany/MEDNER-de-fp-gbert")

text = "Der Patient bekam den COVID-Impfstoff und nahm danach Aspirin."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Map each non-special token to its predicted label
predictions = [
    (token, model.config.id2label[label.item()])
    for token, label in zip(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
        torch.argmax(outputs.logits, dim=-1)[0]
    )
    if token not in tokenizer.all_special_tokens
]

print(predictions)
```
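The label-decoding step (softmax, then argmax, then `id2label` lookup) can be illustrated on mock logits in plain Python, without loading the model. The `id2label` mapping and logit values below are hypothetical; the real mapping is `model.config.id2label`.

```python
import math

# Hypothetical id2label mapping for illustration only.
id2label = {0: "O", 1: "B-MEDICINAL_PRODUCT", 2: "I-MEDICINAL_PRODUCT"}

def softmax(logits):
    """Numerically stable softmax over one token's logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Mock logits: one row per token, one column per label.
token_logits = [
    [4.0, 0.5, 0.1],  # strongly "O"
    [0.2, 3.8, 0.3],  # strongly "B-MEDICINAL_PRODUCT"
    [0.1, 0.4, 3.5],  # strongly "I-MEDICINAL_PRODUCT"
]

labels = []
for row in token_logits:
    probs = softmax(row)
    labels.append(id2label[max(range(len(probs)), key=probs.__getitem__)])

print(labels)  # ['O', 'B-MEDICINAL_PRODUCT', 'I-MEDICINAL_PRODUCT']
```

Because softmax is monotonic, taking the argmax directly on the raw logits yields the same labels; the softmax is only needed if you also want per-token confidence scores.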

## Authors

Farnaz Zeidi, Manuela Messelhäußer, Roman Christof, Xing David Wang, Ulf Leser, Dirk Mentzer, Renate König, Liam Childs.


## License

This model is shared under the GNU Affero General Public License v3.0 (AGPL-3.0).