license: cc-by-4.0
library_name: span-marker
base_model: gwlms/bert-base-token-dropping-dewiki-v1
tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
pipeline_tag: token-classification
widget:
  - text: >-
      Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU
      München .
    example_title: Wikipedia
datasets:
  - gwlms/germeval2014
language:
  - de
model-index:
  - name: >-
      SpanMarker with GWLMS Token Dropping BERT on GermEval 2014 NER Dataset by
      Stefan Schweter (@stefan-it)
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          type: gwlms/germeval2014
          name: GermEval 2014
          split: test
          revision: f3647c56803ce67c08ee8d15f4611054c377b226
        metrics:
          - type: f1
            value: 0.8744
            name: F1
metrics:
  - f1

SpanMarker for GermEval 2014 NER

This is a SpanMarker model that was fine-tuned on the GermEval 2014 NER Dataset.

The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties: The data was sampled from German Wikipedia and News Corpora as a collection of citations. The dataset covers over 31,000 sentences corresponding to over 590,000 tokens. The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]].

12 classes of named entities are annotated and must be recognized: the four main classes PERson, LOCation, ORGanisation, and OTHer, plus their subclasses, which are formed with two fine-grained labels: -deriv marks derivations from NEs such as "englisch" ("English"), and -part marks compounds that include an NE as a subsequence, such as "deutschlandweit" ("Germany-wide").
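The count of 12 follows from combining the four main classes with three variants each (plain, -deriv, -part). The sketch below enumerates them; the concatenated tag strings such as `LOCderiv` are an assumption made for illustration, and the exact label names stored in the dataset may differ.

```python
from itertools import product

# Four main classes named in the task description
MAIN_CLASSES = ["PER", "LOC", "ORG", "OTH"]

# Plain class plus the two fine-grained variants; the spelling of the
# combined tag (e.g. "LOCderiv") is an assumption, not taken from the dataset
VARIANTS = ["", "deriv", "part"]

labels = [main + variant for main, variant in product(MAIN_CLASSES, VARIANTS)]

for label in labels:
    print(label)
```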

Fine-Tuning

We use the same hyper-parameters as the "German's Next Language Model" paper, with the GWLMS Token Dropping BERT model as the backbone.

Evaluation is performed with SpanMarker's internal evaluation code, which uses seqeval.
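seqeval scores predictions at the entity level: a predicted entity counts as correct only if both its boundaries and its label match a gold entity exactly. The following is a minimal pure-Python sketch of that metric, not seqeval's or SpanMarker's actual code. The function name `span_f1` and the example spans are hypothetical, with spans written as `(start_token, end_token_exclusive, label)`.

```python
def span_f1(gold_spans, predicted_spans):
    """Exact-match entity-level F1 over sets of (start, end, label) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for the widget sentence, token-indexed:
# "Jürgen Schmidhuber" = tokens 0-1, "TU München" = tokens 10-11
gold = [(0, 2, "PER"), (10, 12, "ORG")]
# One correct span and one span with wrong boundaries and label
predicted = [(0, 2, "PER"), (11, 12, "LOC")]

f1 = span_f1(gold, predicted)
print(f"F1: {f1:.2f}")
```

A partially overlapping span earns no credit under this definition, which is why boundary errors are penalized as heavily as label errors.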

We fine-tune 5 models and upload the model with the best F1-Score on the development set. Results on the development set are in parentheses:

| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|---|---|---|---|---|---|---|
| GWLMS Token Dropping BERT | (87.85) / 87.28 | (88.09) / 87.44 | (87.59) / 87.26 | (87.71) / 87.43 | (87.83) / 87.24 | (87.81) / 87.33 |

The best model (Run 2, selected by its development F1-Score of 88.09%) achieves a final test F1-Score of 87.44%.
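The selection rule and the averages reported above can be reproduced from the per-run scores. This is a small sketch of that arithmetic only; the variable names are illustrative and it is not part of the training scripts.

```python
# (development F1, test F1) per run, copied from the results above
runs = {
    1: (87.85, 87.28),
    2: (88.09, 87.44),
    3: (87.59, 87.26),
    4: (87.71, 87.43),
    5: (87.83, 87.24),
}

# Select the run with the highest development F1; the test score is not used for selection
best_run = max(runs, key=lambda run: runs[run][0])
best_dev, best_test = runs[best_run]

# Averages over all five runs, rounded to two decimals as in the results
avg_dev = round(sum(dev for dev, _ in runs.values()) / len(runs), 2)
avg_test = round(sum(test for _, test in runs.values()) / len(runs), 2)

print(f"Best run: {best_run} (dev {best_dev}, test {best_test})")
print(f"Average: dev {avg_dev}, test {avg_test}")
```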

Scripts for training and evaluation are also available.

Usage

The fine-tuned model can be used as follows:

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("gwlms/span-marker-token-dropping-bert-germeval14")

# Run inference
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
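The result of `model.predict` can then be post-processed. The sketch below assumes each prediction is a dictionary with `span`, `label`, and `score` keys, which is how SpanMarker documents its output; the helper `format_entities` and the sample values are hypothetical stand-ins so the snippet runs without downloading the model.

```python
def format_entities(entities, min_score=0.0):
    """Return 'label<TAB>span<TAB>score' rows for predictions at or above min_score."""
    rows = []
    for entity in entities:
        if entity["score"] >= min_score:
            rows.append(f'{entity["label"]}\t{entity["span"]}\t{entity["score"]:.2f}')
    return rows


# Hypothetical stand-in for the list returned by model.predict(...);
# the labels and scores are made up for illustration, not real model output
sample_entities = [
    {"span": "Jürgen Schmidhuber", "label": "PER", "score": 0.99},
    {"span": "TU München", "label": "ORG", "score": 0.42},
]

rows = format_entities(sample_entities, min_score=0.5)
for row in rows:
    print(row)
```

With real predictions, pass the `entities` list from the usage example in place of `sample_entities`.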