Normalization Model for Medieval Latin
Overview
This repository contains a PyTorch-based sequence-to-sequence model with attention designed to normalize orthographic variations in medieval Latin texts. It uses the Normalized Georges 1913 Dataset, which provides approximately 5 million word pairs of orthographic variants and their normalized forms.
The model is part of the Burchards Dekret Digital project (www.burchards-dekret-digital.de) and was developed to support text normalization tasks in historical document processing.
Model Architecture
The model is a sequence-to-sequence (Seq2Seq) architecture with attention. Key components include:
Embedding Layer:
- Converts character indices into dense vector representations.
Bidirectional LSTM Encoder:
- Encodes the input sequence and captures bidirectional context.
Attention Mechanism:
- Aligns decoder outputs with relevant encoder outputs for better context-awareness.
LSTM Decoder:
- Decodes the normalized sequence character-by-character.
Projection Layer:
- Maps decoder outputs to character probabilities.
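The authoritative implementation lives in the project's GitHub repository; the following is a minimal sketch of how these components could fit together in PyTorch. Class name, attention variant (dot-product), and the way the decoder is initialized are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class Seq2SeqNormalizer(nn.Module):
    """Character-level seq2seq: embedding, BiLSTM encoder, attention, LSTM decoder, projection."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, dropout):
        super().__init__()
        # Embedding layer: character indices -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Bidirectional LSTM encoder
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers,
                               batch_first=True, bidirectional=True, dropout=dropout)
        # LSTM decoder; input is previous character embedding plus attention context
        self.decoder = nn.LSTM(embed_dim + 2 * hidden_dim, 2 * hidden_dim,
                               num_layers, batch_first=True, dropout=dropout)
        # Projection layer: decoder state -> character logits
        self.proj = nn.Linear(2 * hidden_dim, vocab_size)

    def attention(self, dec_state, enc_outputs):
        # Dot-product attention between the current decoder state and all encoder outputs
        scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2))    # (B, T_src, 1)
        weights = torch.softmax(scores, dim=1)
        return torch.bmm(weights.transpose(1, 2), enc_outputs)     # (B, 1, 2H)

    def forward(self, src, tgt):
        enc_outputs, _ = self.encoder(self.embedding(src))          # (B, T_src, 2H)
        dec_state = enc_outputs.mean(dim=1)                         # simple initial query
        hidden, logits = None, []
        for t in range(tgt.size(1)):                                # teacher forcing over tgt
            emb = self.embedding(tgt[:, t]).unsqueeze(1)            # (B, 1, E)
            context = self.attention(dec_state, enc_outputs)        # (B, 1, 2H)
            out, hidden = self.decoder(torch.cat([emb, context], dim=2), hidden)
            dec_state = out.squeeze(1)
            logits.append(self.proj(dec_state))
        return torch.stack(logits, dim=1)                           # (B, T_tgt, vocab)
```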
Model Parameters
- Embedding Dimension: 64
- Hidden Dimension: 128
- Number of Layers: 3
- Dropout: 0.3
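Plugged into the sketch above, these values would be used roughly as follows (the vocabulary size is a placeholder; the real value comes from vocab.pkl):

```python
# Hypothetical instantiation with the published hyperparameters.
model = Seq2SeqNormalizer(vocab_size=100,   # placeholder; actual size comes from vocab.pkl
                          embed_dim=64, hidden_dim=128,
                          num_layers=3, dropout=0.3)
```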
Dataset
The model is trained on the Normalized Georges 1913 Dataset. The dataset contains tab-separated word pairs of orthographic variants and their normalized forms, generated with systematic transformations. For detailed dataset information, refer to the dataset page.
Sample Data
| Orthographic Variant | Normalized Form |
|---|---|
| circumcalcabicis | circumcalcabitis |
| peruincaturi | pervincaturi |
| tepidaremtur | tepidarentur |
| exmovemdis | exmovendis |
| comvomavisset | convomavisset |
| permeiemdis | permeiendis |
| permeditacissime | permeditatissime |
| conspersu | conspersu |
| pręviridancissimę | praeviridantissimae |
| relaxavisses | relaxavisses |
| edentaveratis | edentaveratis |
| amhelioris | anhelioris |
| remediatae | remediatae |
| discruciavero | discruciavero |
| imterplicavimus | interplicavimus |
| peraequata | peraequata |
| ignicomantissimorum | ignicomantissimorum |
| pręfvltvro | praefulturo |
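A minimal sketch of reading such tab-separated pairs and building a character vocabulary is shown below. The file name and the special tokens are assumptions, not necessarily what the repository's scripts use.

```python
def load_pairs(path):
    """Read tab-separated (orthographic variant, normalized form) word pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                pairs.append((parts[0], parts[1]))
    return pairs

def build_vocab(pairs):
    """Map every character occurring in the data to an integer index."""
    specials = ["<pad>", "<sos>", "<eos>"]          # assumed special tokens
    chars = sorted({c for src, tgt in pairs for c in src + tgt})
    return {tok: i for i, tok in enumerate(specials + chars)}

pairs = load_pairs("georges_1913_normalization.tsv")  # assumed file name
vocab = build_vocab(pairs)
```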
Training
The model is trained using the following parameters:
- Loss: CrossEntropyLoss (ignores padding index).
- Optimizer: Adam with a learning rate of 0.0005.
- Scheduler: ReduceLROnPlateau, reducing the learning rate on validation loss stagnation.
- Gradient Clipping: Max norm of 1.0.
- Batch Size: 4096.
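A condensed sketch of a training loop that follows these settings is given below; the actual train_model.py may be structured differently. It assumes a `model`, a `train_loader` yielding padded character batches, and a `validation_loss` computed on a held-out split are defined elsewhere.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)          # ignores padding positions
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for epoch in range(num_epochs):
    model.train()
    for src, tgt in train_loader:            # batches of 4096 padded character sequences
        optimizer.zero_grad()
        logits = model(src, tgt[:, :-1])     # predict each character from the previous ones
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    # validation_loss: assumed to be computed on a held-out split each epoch
    scheduler.step(validation_loss)          # reduce LR when validation loss stagnates
```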
Use Cases
This model can be used for:
- Applying normalization based on Georges 1913.
Known limitations
The dataset has not been subjected to data augmentation and may contain substantial bias, particularly against irregular forms, such as Greek loanwords like "presbyter."
How to Use
Saved Files
- normalization_model.pth: Trained PyTorch model weights.
- vocab.pkl: Vocabulary mapping for the dataset.
- config.json: Configuration file with model hyperparameters.
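The exact contents of vocab.pkl and config.json are not documented here; the following is a hedged sketch of loading these artifacts, reusing the Seq2SeqNormalizer sketch from above. The config key names are assumptions.

```python
import json
import pickle
import torch

with open("config.json") as f:
    config = json.load(f)                    # model hyperparameters

with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)                   # character-to-index mapping

model = Seq2SeqNormalizer(vocab_size=len(vocab),
                          embed_dim=config["embedding_dim"],   # key names are assumptions
                          hidden_dim=config["hidden_dim"],
                          num_layers=config["num_layers"],
                          dropout=config["dropout"])
model.load_state_dict(torch.load("normalization_model.pth", map_location="cpu"))
model.eval()
```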
Training
To train the model, run the train_model.py script from the GitHub repository.
Usage for Inference
For inference, use the test_model.py script from the GitHub repository.
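For illustration only, a greedy character-by-character decoding loop against the sketch above might look like this; the start/end tokens and decoding interface are assumptions, and test_model.py is the authoritative reference.

```python
import torch

def normalize(word, model, vocab, max_len=40):
    """Greedy decoding of a single orthographic variant (illustrative sketch)."""
    idx2char = {i: c for c, i in vocab.items()}
    src = torch.tensor([[vocab[c] for c in word]])        # (1, T_src)
    decoded = [vocab["<sos>"]]                             # assumed start-of-sequence token
    with torch.no_grad():
        for _ in range(max_len):
            logits = model(src, torch.tensor([decoded]))   # re-run with the growing prefix
            next_idx = logits[0, -1].argmax().item()
            if next_idx == vocab["<eos>"]:                 # assumed end-of-sequence token
                break
            decoded.append(next_idx)
    return "".join(idx2char[i] for i in decoded[1:])

# Example pair taken from the sample data above.
print(normalize("pręviridancissimę", model, vocab))        # target form: praeviridantissimae
```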
Acknowledgments
The dataset was created by Michael Schonhardt (https://orcid.org/0000-0002-2750-1900) for the project Burchards Dekret Digital.
Creation was made possible thanks to the lemmata from Georges 1913, kindly provided via www.zeno.org by 'Henricus - Edition Deutsche Klassik GmbH'. Please consider using and supporting this valuable service.
License
CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode.en)
Citation
If you use this model, please cite: Michael Schonhardt, Model: Normalized Georges 1913, https://huggingface.co/mschonhardt/georges-1913-normalization-model, DOI: 10.5281/zenodo.14264956.