hmByT5 - Language Models

Historical Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

  • English (British Library Corpus - Books)
  • German (Europeana Newspaper)
  • French (Europeana Newspaper)
  • Finnish (Europeana Newspaper)
  • Swedish (Europeana Newspaper)
  • Dutch (Delpher Corpus)
  • Norwegian (NCC)

More details can be found in our GitHub repository.
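
All checkpoints can be loaded with the Hugging Face transformers library. Here is a minimal sketch using the hmbyt5/byt5-small-english checkpoint from the leaderboard below; the example sentence is made up for illustration:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Model id taken from the leaderboard below
model_id = "hmbyt5/byt5-small-english"

# ByT5 operates directly on UTF-8 bytes, so the tokenizer needs no vocabulary file
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Encode a short sentence into byte-level input ids
inputs = tokenizer("The Rights of Man.", return_tensors="pt")
print(inputs.input_ids.shape)
```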

Leaderboard

We evaluate our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used.

| Language | Dataset         | Additional Dataset |
|----------|-----------------|--------------------|
| English  | AjMC            | -                  |
| German   | AjMC            | -                  |
| French   | AjMC            | ICDAR-Europeana    |
| Finnish  | NewsEye         | -                  |
| Swedish  | NewsEye         | -                  |
| Dutch    | ICDAR-Europeana | -                  |

Current best models:

| Model | English AjMC | German AjMC | French AjMC | Finnish NewsEye | Swedish NewsEye | Dutch ICDAR | French ICDAR | Avg. |
|-------|--------------|-------------|-------------|-----------------|-----------------|-------------|--------------|------|
| hmbyt5/byt5-small-english | 85.65 ± 1.21 | 87.27 ± 0.50 | 84.44 ± 0.79 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-english-german | 85.74 ± 0.72 | 87.45 ± 0.67 | 84.23 ± 0.65 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-english-german-french | 85.61 ± 0.96 | 87.24 ± 0.76 | 84.39 ± 0.68 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-english-german-french-finnish | 85.30 ± 1.14 | 87.37 ± 0.53 | 84.12 ± 0.42 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish | 85.40 ± 0.78 | 87.12 ± 0.19 | 84.41 ± 0.34 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish-dutch | 85.51 ± 0.68 | 87.58 ± 0.39 | 84.39 ± 0.83 | 55.46 ± 1.99 | 73.38 ± 2.45 | 84.80 ± 0.44 | 75.97 ± 0.55 | 78.16 |
| hmbyt5-preliminary/byt5-small-multilingual-4g | 83.49 ± 0.96 | 87.65 ± 0.63 | 84.16 ± 0.90 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-multilingual-4g-2e | 83.86 ± 0.61 | 87.54 ± 0.19 | 84.29 ± 0.41 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-multilingual-4g-3e | 83.49 ± 0.99 | 87.38 ± 0.53 | 84.30 ± 0.51 | - | - | - | - | - |
| hmbyt5-preliminary/byt5-small-historic-multilingual-flax | 83.28 ± 1.67 | 86.98 ± 0.71 | 83.49 ± 1.06 | 76.96 ± 1.58 | 78.80 ± 1.89 | 86.47 ± 0.79 | 77.43 ± 0.51 | 81.92 |
| hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax | 84.91 ± 0.86 | 88.02 ± 0.35 | 84.78 ± 0.75 | 77.77 ± 1.83 | 79.94 ± 0.60 | 86.85 ± 0.91 | 77.45 ± 0.54 | 82.82 |
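
Each cell reports a score as mean ± standard deviation over repeated fine-tuning runs. A minimal sketch of how one such cell can be computed; the scores and the number of runs here are hypothetical, chosen only for illustration:

```python
import statistics

# Hypothetical F1-scores from five fine-tuning runs with different random seeds
f1_scores = [84.2, 86.1, 85.7, 85.9, 86.4]

mean = statistics.mean(f1_scores)
stdev = statistics.stdev(f1_scores)  # sample standard deviation

# Format the result as it appears in the leaderboard
print(f"{mean:.2f} ± {stdev:.2f}")
```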

More recent results on additional datasets can be found in the hmLeaderboard.

Acknowledgements

We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
