magistermilitum/bert_medieval_multilingual

Model Details

This is a fine-tuned version of the multilingual BERT model, trained on medieval texts. The model is intended to serve as a foundation for other ML tasks in NLP and HTR environments.
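As an illustration of using the model as a starting point, the sketch below prepares a masked sentence and builds a fill-mask pipeline with the Hugging Face `transformers` library. It assumes the checkpoint ships a masked-language-model head and uses BERT's default `[MASK]` token; the helper names `mask_word` and `load_fill_mask` are hypothetical and not part of this model card.

```python
from typing import Any

# Model identifier as given in this card.
MODEL_ID = "magistermilitum/bert_medieval_multilingual"


def mask_word(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of `word` with the mask token.

    "[MASK]" is the BERT default and an assumption here; the tokenizer's
    `mask_token` attribute is the authoritative value.
    """
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, mask_token, 1)


def load_fill_mask(model_id: str = MODEL_ID) -> Any:
    """Build a fill-mask pipeline; the checkpoint is downloaded on first use."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)
```

Calling `load_fill_mask()(mask_word("In nomine Domini amen", "Domini"))` should return a list of candidate tokens with scores. For downstream tasks, the same identifier can be passed to the relevant `AutoModel...` class instead.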

The training dataset comprises 650M tokens drawn from texts in Classical and Medieval Latin, Old French, and Old Spanish, covering a period from the 5th century BC to the 16th century.

Several large corpora were cleaned and transformed for use in training:

Dataset	Size	Language	Dates (century)
CC100 [1]	3.2 GB	la	5th BC - 18th
Corpus Corporum [2]	3.0 GB	la	5th BC - 16th
CEMA [3]	320 MB	la+fro	9th - 15th
HOME-Alcar [4]	38 MB	la+fro	12th - 15th
BFM [5]	34 MB	fro	13th - 15th
AND [6]	19 MB	fro	13th - 15th
CODEA [7]	13 MB	spa	12th - 16th
Total	~6.5 GB
Used for training	650M tokens (4.5 GB)*

* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolorem...") was iteratively deleted.
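The card does not describe how overlap and filler text were removed. The following is only a minimal sketch of one common approach: exact deduplication of normalized lines by hash, plus a regular-expression filter for "Lorem ipsum" filler. The `FILLER` pattern and the function names are assumptions for illustration, not the authors' pipeline.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

# Hypothetical filler pattern; the card only mentions "Lorem ipsum" text.
FILLER = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Apply NFKC, lowercase, and collapse whitespace so trivial variants match."""
    line = unicodedata.normalize("NFKC", line).lower()
    return " ".join(line.split())


def deduplicate(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line once, skipping empty lines, filler text, and repeats."""
    seen = set()
    for line in lines:
        key_text = normalize(line)
        if not key_text or FILLER.search(key_text):
            continue
        # Store a fixed-size digest rather than the full line to save memory.
        digest = hashlib.sha1(key_text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield line
```

Exact hashing only catches copies that are identical after normalization. Overlap between editions of the same charter with differing transcription conventions would need a fuzzy method such as MinHash, which this sketch does not attempt.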

[1] CC-NET Repository: https://huggingface.co/datasets/cc100

[2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/

[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/

[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884

[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/

[6] Anglo-Norman Dictionary: https://anglo-norman.net/

[7] Corpus de Documentos Españoles Anteriores a 1900: https://www.corpuscodea.es/
