DistilCamemBERT-NER

We present DistilCamemBERT-NER, a version of DistilCamemBERT fine-tuned for named entity recognition (NER) in French. The work is inspired by Jean-Baptiste/camembert-ner, which is based on the CamemBERT model. Models based on CamemBERT become problematic at scale, for example in production, where inference cost can be a limiting factor. To address this, we propose this model, which, thanks to DistilCamemBERT, halves inference time at the same power consumption.

Dataset

The dataset used is wikiner_fr, which contains ~170k sentences labeled with 5 categories:

  • PER: person;
  • LOC: location;
  • ORG: organization;
  • MISC: miscellaneous entities (movie titles, books, etc.);
  • O: background (outside any entity).

Evaluation results

class    precision (%)  recall (%)  f1 (%)  support (#sub-word)
global   98.17          98.19       98.18   378,776
PER      96.78          96.87       96.82   23,754
LOC      94.05          93.59       93.82   27,196
ORG      86.05          85.92       85.98   6,526
MISC     88.78          84.69       86.69   11,891
O        99.26          99.47       99.37   309,409
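
As a reading aid for the table above, the f1 column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table; the helper name `f1_score` is ours and is not part of the model's code. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported f1 %) copied from the table above.
reported = {
    "PER": (96.78, 96.87, 96.82),
    "LOC": (94.05, 93.59, 93.82),
    "ORG": (86.05, 85.92, 85.98),
    "MISC": (88.78, 84.69, 86.69),
    "O": (99.26, 99.47, 99.37),
}

for label, (p, r, f1) in reported.items():
    # Small last-digit gaps are expected because p and r are rounded.
    print(f"{label}: recomputed {f1_score(p, r):.2f}, reported {f1:.2f}")
```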

Benchmark

The performance of this model is compared with two reference models (see below) using the F1 score. Mean inference time was measured on an AMD Ryzen 5 4500U @ 2.3GHz with 6 cores:

model                                        time (ms)  PER (%)  LOC (%)  ORG (%)  MISC (%)  O (%)
cmarkea/distilcamembert-base-ner             43.44      96.82    93.82    85.98    86.69     99.37
Davlan/bert-base-multilingual-cased-ner-hrl  87.56      79.93    72.89    61.34    n/a       96.04
flair/ner-french                             314.96     82.91    76.17    70.96    76.29     97.65
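
The card does not describe how the mean inference time was measured. The sketch below is one plausible way to take such a measurement, not the authors' protocol: the helper `mean_latency_ms`, the warm-up count, and the per-sentence timing are all assumptions made for illustration.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def mean_latency_ms(
    predict: Callable[[str], object],
    texts: Sequence[str],
    warmup: int = 3,
) -> float:
    """Mean wall-clock time per call, in milliseconds (hypothetical helper)."""
    if not texts:
        raise ValueError("texts must not be empty")
    # Warm-up calls are not timed, so one-off start-up costs
    # (lazy initialisation, caches) do not skew the mean.
    for text in texts[:warmup]:
        predict(text)
    durations = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        durations.append((time.perf_counter() - start) * 1000.0)
    return mean(durations)


# Hypothetical usage with a transformers pipeline named `ner`
# and your own list of French sentences:
#   latency = mean_latency_ms(ner, sentences)
```

Absolute numbers depend on hardware, sentence length, and library versions, so only ratios between models measured under the same conditions are comparable.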

How to use DistilCamemBERT-NER

from transformers import pipeline

ner = pipeline(
    task='ner',
    model="cmarkea/distilcamembert-base-ner",
    tokenizer="cmarkea/distilcamembert-base-ner",
    aggregation_strategy="simple"
)
result = ner(
    "Le Crédit Mutuel Arkéa est une banque Française, elle comprend le CMB "
    "qui est une banque située en Bretagne et le CMSO qui est une banque "
    "qui se situe principalement en Aquitaine. C'est sous la présidence de "
    "Louis Lichou, dans les années 1980 que différentes filiales sont créées "
    "au sein du CMB et forment les principales filiales du groupe qui "
    "existent encore aujourd'hui (Federal Finance, Suravenir, Financo, etc.)."
)

result
[{'entity_group': 'ORG',
  'score': 0.9974479,
  'word': 'Crédit Mutuel Arkéa',
  'start': 3,
  'end': 22},
 {'entity_group': 'LOC',
  'score': 0.9000358,
  'word': 'Française',
  'start': 38,
  'end': 47},
 {'entity_group': 'ORG',
  'score': 0.9788757,
  'word': 'CMB',
  'start': 66,
  'end': 69},
 {'entity_group': 'LOC',
  'score': 0.99919766,
  'word': 'Bretagne',
  'start': 99,
  'end': 107},
 {'entity_group': 'ORG',
  'score': 0.9594884,
  'word': 'CMSO',
  'start': 114,
  'end': 118},
 {'entity_group': 'LOC',
  'score': 0.99935514,
  'word': 'Aquitaine',
  'start': 169,
  'end': 178},
 {'entity_group': 'PER',
  'score': 0.99911094,
  'word': 'Louis Lichou',
  'start': 208,
  'end': 220},
 {'entity_group': 'ORG',
  'score': 0.96226394,
  'word': 'CMB',
  'start': 291,
  'end': 294},
 {'entity_group': 'ORG',
  'score': 0.9983959,
  'word': 'Federal Finance',
  'start': 374,
  'end': 389},
 {'entity_group': 'ORG',
  'score': 0.9984454,
  'word': 'Suravenir',
  'start': 391,
  'end': 400},
 {'entity_group': 'ORG',
  'score': 0.9985084,
  'word': 'Financo',
  'start': 402,
  'end': 409}]

Optimum + ONNX

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

# Hub ID of this NER model
HUB_MODEL = "cmarkea/distilcamembert-base-ner"
tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL)
model = ORTModelForTokenClassification.from_pretrained(HUB_MODEL)
onnx_ner = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Quantized onnx model
quantized_model = ORTModelForTokenClassification.from_pretrained(
    HUB_MODEL, file_name="model_quantized.onnx"
)

Citation

@inproceedings{delestre:hal-03674695,
  TITLE = {{DistilCamemBERT : une distillation du mod{\`e}le fran{\c c}ais CamemBERT}},
  AUTHOR = {Delestre, Cyrile and Amar, Abibatou},
  URL = {https://hal.archives-ouvertes.fr/hal-03674695},
  BOOKTITLE = {{CAp (Conf{\'e}rence sur l'Apprentissage automatique)}},
  ADDRESS = {Vannes, France},
  YEAR = {2022},
  MONTH = Jul,
  KEYWORDS = {NLP ; Transformers ; CamemBERT ; Distillation},
  PDF = {https://hal.archives-ouvertes.fr/hal-03674695/file/cap2022.pdf},
  HAL_ID = {hal-03674695},
  HAL_VERSION = {v1},
}