---
license: agpl-3.0
language:
- de
base_model:
- deepset/gbert-base
pipeline_tag: token-classification
---

# MEDNER.DE: Medicinal Product Entity Recognition in German-Specific Contexts

Released in December 2024, this is a German BERT language model further pretrained on `deepset/gbert-base` using a pharmacovigilance-related corpus of case summaries. The model was then fine-tuned for Named Entity Recognition (NER) on an automatically annotated dataset to recognize medicinal products such as medications and vaccines.
In our paper, we outline the steps taken to train this model and demonstrate its superior performance compared to previous approaches.

---

## Overview
- **Paper**: https://...
- **Architecture**: MLM-based BERT Base
- **Language**: German
- **Supported Labels**: Medicinal Product

**Model Name**: MEDNER.DE

---

## How to Use

### Use a pipeline as a high-level helper

```python
from transformers import pipeline

# Load the NER pipeline
model = pipeline("ner", model="pei-germany/MEDNER-de-fp-gbert", aggregation_strategy="none")

# Input text
text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

# Get raw predictions and merge subword tokens back into whole words
merged_predictions = []
current = None
for pred in model(text):
    if pred['word'].startswith("##"):
        # Continuation of the previous word: extend text, span, and score
        if current:
            current['word'] += pred['word'][2:]
            current['end'] = pred['end']
            current['score'] = (current['score'] + pred['score']) / 2  # running average of subword scores
    else:
        if current:
            merged_predictions.append(current)
        current = pred.copy()
if current:
    merged_predictions.append(current)

# Filter by confidence threshold and print
threshold = 0.5
filtered_predictions = [p for p in merged_predictions if p['score'] >= threshold]
for p in filtered_predictions:
    print(f"Entity: {p['entity']}, Word: {p['word']}, Score: {p['score']:.2f}, Start: {p['start']}, End: {p['end']}")
```
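If you do not need the manual merging shown above, the `transformers` pipeline can also aggregate subword tokens itself. A minimal sketch using the built-in `aggregation_strategy="simple"` option (aggregated predictions expose `entity_group` instead of `entity`):

```python
from transformers import pipeline

# Same model, but let the pipeline merge subword tokens into whole entities
ner = pipeline("ner", model="pei-germany/MEDNER-de-fp-gbert", aggregation_strategy="simple")

text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

for p in ner(text):
    # Aggregated results carry the key 'entity_group' rather than 'entity'
    print(f"Entity: {p['entity_group']}, Word: {p['word']}, Score: {p['score']:.2f}, Start: {p['start']}, End: {p['end']}")
```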
### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("pei-germany/MEDNER-de-fp-gbert")
model = AutoModelForTokenClassification.from_pretrained("pei-germany/MEDNER-de-fp-gbert")

text = "Der Patient wurde mit AstraZeneca geimpft und nahm anschließend Ibuprofen, um das Fieber zu senken."

# Tokenize and get predictions (no gradients needed for inference)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode tokens and predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
labels = [model.config.id2label[pred] for pred in predictions]

# Process and merge subwords
entities = []
current_word = ""
current_entity = None

for token, label in zip(tokens, labels):
    token = token.replace("##", "")  # Remove subword markers
    if label.startswith("B-"):  # Beginning of a new entity
        if current_entity and current_entity == label[2:]:
            # Merge consecutive B- labels of the same type
            current_word += token
        else:
            # Save the previous entity and start a new one
            if current_word:
                entities.append({"entity": current_entity, "word": current_word})
            current_word = token
            current_entity = label[2:]
    elif label.startswith("I-") and current_entity == label[2:]:
        # Continuation of the same entity
        current_word += token
    else:
        # Outside any entity
        if current_word:  # Save the previous entity
            entities.append({"entity": current_entity, "word": current_word})
        current_word = ""
        current_entity = None

if current_word:  # Append the last entity
    entities.append({"entity": current_entity, "word": current_word})

# Print results
for entity in entities:
    print(f"Entity: {entity['entity']}, Word: {entity['word']}")
```

---

## Authors
Farnaz Zeidi, Manuela Messelhäußer, Roman Christof, Xing David Wang, Ulf Leser, Dirk Mentzer, Renate König, Liam Childs.

---

## License
This model is shared under the [GNU Affero General Public License v3.0](https://choosealicense.com/licenses/agpl-3.0/).