INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a named-entity recognition (NER) model built on top of camembert-large embeddings. It is trained to predict nested entities in French, specifically in literary texts.

The predicted entities are:

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

NER_tag    precision  recall   f1_score  support  support %
PER        90.58%     93.52%   92.03%    31,570   83.87%
FAC        70.49%     71.75%   71.12%    2,294    6.09%
TIME       58.40%     58.68%   58.54%    1,670    4.44%
GPE        76.69%     74.05%   75.35%    871      2.31%
LOC        60.92%     44.37%   51.35%    773      2.05%
VEH        66.18%     49.25%   56.47%    465      1.24%
micro_avg  86.70%     88.64%   87.61%    37,643   100.00%
macro_avg  70.55%     65.27%   67.48%    37,643   100.00%
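
As a reading aid, the sketch below recomputes the per-class F1 scores and the macro averages from the precision and recall values in the table above. It assumes that F1 is the harmonic mean of precision and recall and that the macro_avg row is the unweighted mean over the six classes. Because the inputs are already rounded, the results agree with the table only to within about 0.01 points.

```python
# Per-class (precision, recall) in %, copied from the table above.
SCORES = {
    "PER": (90.58, 93.52),
    "FAC": (70.49, 71.75),
    "TIME": (58.40, 58.68),
    "GPE": (76.69, 74.05),
    "LOC": (60.92, 44.37),
    "VEH": (66.18, 49.25),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(values) -> float:
    """Unweighted mean: every class counts equally, whatever its support."""
    values = list(values)
    return sum(values) / len(values)


per_class_f1 = {tag: f1(p, r) for tag, (p, r) in SCORES.items()}

# Assumption: the macro_avg row is a plain mean over the six classes.
macro_precision = macro_average(p for p, _ in SCORES.values())
macro_recall = macro_average(r for _, r in SCORES.values())
macro_f1 = macro_average(per_class_f1.values())

for tag, score in per_class_f1.items():
    print(f"{tag:<5} f1 = {score:.2f}%")
print(f"macro precision = {macro_precision:.2f}%, "
      f"recall = {macro_recall:.2f}%, f1 = {macro_f1:.2f}%")
```

The gap between the micro and macro averages reflects the class imbalance: PER accounts for 83.87% of the support, so it dominates the micro average, while the macro average gives LOC and VEH the same weight as PER.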

TRAINING PARAMETERS:

  • Entity types: ['PER', 'LOC', 'FAC', 'TIME', 'VEH', 'GPE']
  • Tagging scheme: BIOES
  • Nested entity levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (28 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
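
To illustrate the tagging scheme and the two nesting levels listed above, here is a minimal sketch that converts entity spans into BIOES tags, one tag sequence per level. The function name, the example sentence, its annotation, and the convention that level 0 holds the outer mention and level 1 the mention nested inside it are all assumptions made for illustration; the model card does not specify them.

```python
from typing import List, Tuple

# (start token index, end token index inclusive, entity type)
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping spans of one nesting level as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"        # single-token entity
        else:
            tags[start] = f"B-{label}"        # first token
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"        # inside tokens
            tags[end] = f"E-{label}"          # last token
    return tags


# Hypothetical example, not taken from the training corpus.
tokens = ["le", "capitaine", "de", "la", "calèche"]
level_0 = [(0, 4, "PER")]  # assumed outer mention: "le capitaine de la calèche"
level_1 = [(3, 4, "VEH")]  # assumed nested mention: "la calèche"

tags_level_0 = spans_to_bioes(len(tokens), level_0)
tags_level_1 = spans_to_bioes(len(tokens), level_1)

for token, outer, inner in zip(tokens, tags_level_0, tags_level_1):
    print(f"{token:<10} {outer:<7} {inner}")
```

Keeping one flat tag sequence per level is one common way to represent nested entities with a sequence labeller; whether this model uses exactly this layout is not stated in the card.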

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

  • Locked Dropout: 0.5

  • Projection layer:

    • layer type: highway layer
    • input: 1024 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:

    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:

    • input: 256 dimensions
    • output: 25 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer

Model Output: BIOES labels sequence
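
The 25 output dimensions of the linear layer are consistent with six entity types times four positional prefixes (B, I, E, S) plus the O label. The sketch below builds such a label inventory and decodes a BIOES label sequence back into spans. The label order and the helper name `bioes_to_spans` are assumptions for illustration, not the model's actual label mapping or API.

```python
ENTITY_TYPES = ["PER", "LOC", "FAC", "TIME", "VEH", "GPE"]

# 6 types x 4 prefixes + "O" = 25 labels (order here is arbitrary).
LABELS = ["O"] + [f"{prefix}-{etype}"
                  for etype in ENTITY_TYPES
                  for prefix in "BIES"]


def bioes_to_spans(tags):
    """Return (start, end_inclusive, type) spans from a BIOES sequence.

    Inconsistent sequences (for example an I- tag with no preceding B- tag)
    are dropped rather than repaired.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, etype))
            start, current = None, None
        elif prefix == "B":
            start, current = i, etype
        elif prefix == "E" and start is not None and etype == current:
            spans.append((start, i, etype))
            start, current = None, None
        elif prefix == "I" and start is not None and etype == current:
            continue  # still inside the open span
        else:
            start, current = None, None  # "O" or an inconsistent tag
    return spans


print(len(LABELS))
print(bioes_to_spans(["B-PER", "E-PER", "O", "S-VEH"]))
```

A CRF output layer typically makes inconsistent sequences rare, so the defensive branch above mostly matters when decoding labels from other sources.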

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

#   Document                                                        Token count     Included in model eval
0   1836_Gautier-Theophile_La-morte-amoureuse                       14,299 tokens   True
1   1840_Sand-George_Pauline                                        12,315 tokens   True
2   1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote              24,776 tokens   True
3   1844_Balzac-Honore-de_La-Maison-Nucingen                        30,987 tokens   True
4   1844_Balzac-Honore-de_Sarrasine                                 15,408 tokens   True
5   1856_Cousin-Victor_Madame-de-Hautefort                          11,768 tokens   True
6   1863_Gautier-Theophile_Le-capitaine-Fracasse                    11,834 tokens   True
7   1873_Zola-Emile_Le-ventre-de-Paris                              12,557 tokens   True
8   1881_Flaubert-Gustave_Bouvard-et-Pecuchet                       12,281 tokens   True
9   1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI  5,425 tokens    True
10  1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE    2,554 tokens    True
11  1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE         2,929 tokens    True
12  1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA            4,067 tokens    True
13  1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE           2,251 tokens    True
14  1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE         2,034 tokens    True
15  1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU                1,864 tokens    True
16  1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL             2,141 tokens    True
17  1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE           2,441 tokens    True
18  1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL           2,860 tokens    True
19  1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON       2,343 tokens    True
20  1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis    12,703 tokens   True
21  1903_Conan-Laure_Elisabeth_Seton                                13,023 tokens   True
22  1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube               10,982 tokens   True
23  1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin            10,305 tokens   True
24  1917_Adèle-Bourgeois_Némoville                                  12,389 tokens   True
25  1923_Radiguet-Raymond_Le-diable-au-corps                        14,637 tokens   True
26  1926_Audoux-Marguerite_De-la-ville-au-moulin                    11,902 tokens   True
27  1937_Audoux-Marguerite_Douce-Lumiere                            12,285 tokens   True
    TOTAL                                                           275,360 tokens  28 files used for cross-validation
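
The leave-one-out strategy over the 28 documents listed above can be sketched as follows: each document is held out once for evaluation, and the rest is divided into training and validation parts. Applying the 0.85 / 0.15 split at document level, the fixed seed, the function name, and the placeholder document names are all assumptions; the card does not say whether the split is made over documents, sentences, or samples.

```python
import random


def loocv_folds(documents, val_fraction=0.15, seed=0):
    """Yield one fold per document, holding that document out for testing.

    Assumption: the train/validation split is applied to whole documents.
    """
    rng = random.Random(seed)
    for held_out in documents:
        remaining = [doc for doc in documents if doc != held_out]
        rng.shuffle(remaining)
        n_val = round(len(remaining) * val_fraction)
        yield {
            "test": held_out,
            "validation": remaining[:n_val],
            "train": remaining[n_val:],
        }


# Placeholder names standing in for the 28 corpus files.
documents = [f"doc_{i:02d}" for i in range(28)]
folds = list(loocv_folds(documents))

first = folds[0]
print(len(folds), len(first["train"]), len(first["validation"]), first["test"])
```

Under these assumptions each fold trains on 23 documents, validates on 4, and tests on 1, and the reported scores aggregate the predictions made on the 28 held-out documents.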

PREDICTIONS CONFUSION MATRIX:

Rows are gold labels; columns are predicted labels.

Gold labels  PER     FAC     TIME    GPE     LOC     VEH     O       support
PER          29,525  27      13      6       7       26      1,966   31,570
FAC          43      1,646   0       17      12      2       574     2,294
TIME         5       1       980     1       1       0       682     1,670
GPE          18      28      1       645     27      0       152     871
LOC          5       63      0       54      343     0       308     773
VEH          58      8       1       0       0       229     169     465
O            2,902   532     682     110     167     89      0       4,482
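
The recall column of the performance table can be recovered from this matrix. The sketch below reads each row as a gold label and each column as a predicted label (an interpretation inferred from the fact that every row sums to its support), then computes recall as the diagonal count divided by the row total, plus the share of gold mentions that were missed entirely (predicted as O).

```python
LABELS = ["PER", "FAC", "TIME", "GPE", "LOC", "VEH", "O"]

# Rows: gold label. Columns: predicted label, in the order of LABELS.
CONFUSION = {
    "PER":  [29525, 27, 13, 6, 7, 26, 1966],
    "FAC":  [43, 1646, 0, 17, 12, 2, 574],
    "TIME": [5, 1, 980, 1, 1, 0, 682],
    "GPE":  [18, 28, 1, 645, 27, 0, 152],
    "LOC":  [5, 63, 0, 54, 343, 0, 308],
    "VEH":  [58, 8, 1, 0, 0, 229, 169],
}


def recall(label: str) -> float:
    """Correctly predicted mentions divided by all gold mentions."""
    row = CONFUSION[label]
    return row[LABELS.index(label)] / sum(row)


def missed_rate(label: str) -> float:
    """Share of gold mentions that received no entity label at all."""
    row = CONFUSION[label]
    return row[LABELS.index("O")] / sum(row)


for label in CONFUSION:
    print(f"{label:<5} recall = {100 * recall(label):.2f}%  "
          f"missed = {100 * missed_rate(label):.2f}%")
```

Read this way, most errors are missed mentions rather than confusions between entity types; the largest cross-type confusion relative to class size is LOC predicted as FAC or GPE.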

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
