wissamantoun
committed on
Create README.md
README.md
ADDED
@@ -0,0 +1,61 @@
1 |
+
---
|
2 |
+
license: mit
|
3 |
+
language: fr
|
4 |
+
library_name: transformers
|
5 |
+
pipeline_tag: feature-extraction
|
6 |
+
datasets:
|
7 |
+
- uonlp/CulturaX
|
8 |
+
- oscar
|
9 |
+
- almanach/HALvest
|
10 |
+
- wikimedia/wikipedia
|
11 |
+
tags:
|
12 |
+
- deberta-v2
|
13 |
+
- deberta-v3
|
14 |
+
- debertav2
|
15 |
+
- debertav3
|
16 |
+
- camembert
|
17 |
+
---
|
18 |
+
# CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection
|
19 |
+
|
20 |
+
[CamemBERTav2](https://arxiv.org/abs/2411.08868) is a French language model pretrained on 275B tokens of French text. It is the second version of the CamemBERTa model, which is based on the DebertaV2 architecture. CamemBERTav2 was trained with the Replaced Token Detection (RTD) objective at a 20% mask rate on 32 H100 GPUs. The training data combines French [OSCAR](https://oscar-project.org/) dumps from the [CulturaX Project](https://huggingface.co/datasets/uonlp/CulturaX), French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia.
|
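To make the RTD objective concrete, here is a toy sketch of how a training example could be built at a 20% mask rate. It is not the actual training code: the function name `corrupt_for_rtd`, the word-level tokens, and the random replacement (standing in for a learned generator) are all assumptions made for illustration.

```python
import random


def corrupt_for_rtd(tokens, vocab, mask_rate=0.2, seed=0):
    """Toy Replaced Token Detection example builder (illustrative only).

    A real RTD setup samples replacements from a small generator model;
    here a random draw from `vocab` stands in for that generator.
    """
    rng = random.Random(seed)
    # Select roughly `mask_rate` of the positions (at least one).
    n_selected = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_selected))

    corrupted = list(tokens)
    for pos in positions:
        corrupted[pos] = rng.choice(vocab)  # stand-in for a generator sample

    # The discriminator predicts, per token, whether it was replaced.
    # A replacement that happens to equal the original counts as "original".
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels, positions


if __name__ == "__main__":
    sentence = "le fromage est affiné dans une cave fraîche et humide".split()
    toy_vocab = ["pain", "vin", "lait", "beurre"]
    corrupted, labels, positions = corrupt_for_rtd(sentence, toy_vocab)
    print(corrupted)
    print(labels)
```

The discriminator is trained on the `labels` for every position, not only the selected ones, which is what distinguishes RTD from masked language modeling.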
21 |
+
|
22 |
+
The model is a drop-in replacement for the original CamemBERTa model. Note that the new tokenizer differs from the original CamemBERTa tokenizer, so you need to load it with a fast tokenizer. It works with `DebertaV2TokenizerFast` from the `transformers` library, even though `DebertaV2TokenizerFast` was originally SentencePiece-based.
|
23 |
+
|
24 |
+
## Model Checkpoints
|
25 |
+
|
26 |
+
This repository contains all intermediate model checkpoints, with corresponding TensorFlow (TF) and PyTorch (PT) versions, structured as follows:
|
27 |
+
|
28 |
+
```
|
29 |
+
├── checkpoints/
|
30 |
+
│   ├── iter_ckpt_rank_XX/  # Contains all iterator checkpoints from a specific rank
|
31 |
+
│   ├── summaries/  # TensorBoard logs
|
32 |
+
│   ├── ckpt-YYYYY.data-00000-of-00001
|
33 |
+
│   ├── ckpt-YYYYY.index
|
34 |
+
├── post/
|
35 |
+
│   ├── ckpt-YYYYY/
|
36 |
+
│   │   ├── pt/
|
37 |
+
│   │   │   ├── discriminator/
|
38 |
+
│   │   │   │   ├── config.json
|
39 |
+
│   │   │   │   ├── pytorch_model.bin
|
40 |
+
│   │   │   │   ├── special_tokens_map.json
|
41 |
+
│   │   │   │   ├── tokenizer.json
|
42 |
+
│   │   │   │   ├── tokenizer_config.json
|
43 |
+
│   │   │   ├── generator/
|
44 |
+
│   │   │   │   ├── ...
|
45 |
+
│   │   ├── tf/
|
46 |
+
│   │   │   ├── ...
|
47 |
+
```
|
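Given the layout above, one way to load a specific intermediate PyTorch checkpoint is through the `subfolder` argument of `from_pretrained` in `transformers`. This is a minimal sketch, not an official recipe: the helper names `pt_checkpoint_subfolder` and `load_pt_checkpoint` are hypothetical, the repository id must be supplied by the caller, and the checkpoint folder name (`ckpt-YYYYY`) is passed through exactly as it appears in the repository, since the naming of the step suffix is not specified here.

```python
from pathlib import PurePosixPath


def pt_checkpoint_subfolder(ckpt_name, component="discriminator"):
    """Build the repo-relative folder of a PyTorch checkpoint.

    `ckpt_name` is the folder name under post/ (for example "ckpt-YYYYY"),
    used as-is. `component` is "discriminator" or "generator".
    """
    if component not in {"discriminator", "generator"}:
        raise ValueError(f"unknown component: {component!r}")
    return str(PurePosixPath("post") / ckpt_name / "pt" / component)


def load_pt_checkpoint(repo_id, ckpt_name, component="discriminator"):
    """Load tokenizer and model from one intermediate checkpoint.

    `repo_id` is the Hub id of this repository (an assumption: supply it
    yourself). Requires `transformers` and network access to the Hub.
    """
    # Imported lazily so the path helper works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    subfolder = pt_checkpoint_subfolder(ckpt_name, component)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModel.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model


if __name__ == "__main__":
    # Only builds the path; no download happens here.
    print(pt_checkpoint_subfolder("ckpt-YYYYY"))
```

For feature extraction the discriminator is the component to load; the generator is the auxiliary model used during RTD pretraining.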
48 |
+
|
49 |
+
## Citation
|
50 |
+
|
51 |
+
```bibtex
|
52 |
+
@misc{antoun2024camembert20smarterfrench,
|
53 |
+
title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
|
54 |
+
author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
|
55 |
+
year={2024},
|
56 |
+
eprint={2411.08868},
|
57 |
+
archivePrefix={arXiv},
|
58 |
+
primaryClass={cs.CL},
|
59 |
+
url={https://arxiv.org/abs/2411.08868},
|
60 |
+
}
|
61 |
+
```
|