---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: "Gallia est omnis divisa in [MASK] tres ."
  example_title: "Commentary on Gallic Wars"
- text: "[MASK] sum Caesar ."
  example_title: "Who is Caesar?"
- text: "[MASK] it ad forum ."
  example_title: "Who is going to the forum?"
- text: "Ovidius paratus est ad [MASK] ."
  example_title: "What is Ovidius up to?"
- text: "[MASK], veni!"
  example_title: "Calling someone to come closer"
- text: "Roma in Italia [MASK] ."
  example_title: "Ubi est Roma?"
---




# Model Card for Simple Latin BERT 

<!-- Provide a quick summary of what the model is/does. [Optional] -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on the Latin corpora freely available from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.  
This model's performance is very poor, and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** all input.

Check the `training notebooks` folder for the preprocessing and training scripts.

Inspired by
- [This repo](https://github.com/dbamman/latin-bert), which has a BERT model for Latin that is actually useful!
- [This tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples)
- [This tutorial](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=VNZZs-r6iKAV)
- [This tutorial](https://huggingface.co/blog/how-to-train)

# Table of Contents

- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on the Latin corpora freely available from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.  
This model's performance is very poor, and it has not been evaluated.

This model comes with its own tokenizer!

Check the `notebooks` folder for the preprocessing and training scripts.

- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** Latin (la)
- **License:** MIT



# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

This model can be used directly for Masked Language Modelling.
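
A minimal usage sketch with the `fill-mask` pipeline. The repository ID below is a placeholder; replace it with the actual path of this model on the Hub.

```python
from transformers import pipeline

# "user/simple-latin-bert" is a placeholder repository ID, not the real one.
fill_mask = pipeline("fill-mask", model="user/simple-latin-bert")

# The bundled tokenizer lowercases the input automatically.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], round(prediction["score"], 4))
```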


## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
 
This model could be used as a base model for other NLP tasks, for example, text classification (that is, using transformers' `BertForSequenceClassification`).
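
A sketch of how the encoder could be reused for classification. The repository ID and label count below are placeholders, and the classification head would still need fine-tuning on labelled Latin data.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_id = "user/simple-latin-bert"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The pretrained encoder weights are reused; the classification head is freshly initialized.
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Gallia est omnis divisa in partes tres .", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
```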






# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comes from the corpora freely available from the [Classical Language Toolkit](http://cltk.org/):

- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)




## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

For preprocessing, the raw text of each corpus was extracted by parsing, then **lowercased** and written to `txt` files, ideally with one sentence per line.

Other data from the corpora, such as entity tags and POS tags, were discarded.
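
A rough sketch of that last step, assuming the sentences have already been extracted from a corpus; the function name and file name are made up for illustration.

```python
from pathlib import Path

def write_corpus(sentences, out_path):
    """Write one lowercased sentence per line, as expected by the training scripts."""
    with Path(out_path).open("w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence.strip().lower() + "\n")

write_corpus(["Gallia est omnis divisa in partes tres ."], "latin_library.txt")
```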

Training hyperparameters (see the configuration sketch below):
- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens
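
A minimal configuration sketch matching the hyperparameters above. The vocabulary size is an assumption; in practice it would come from the trained tokenizer.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,            # assumed; set to len(tokenizer) in practice
    num_hidden_layers=12,         # hidden layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # max input size in tokens
)
model = BertForMaskedLM(config)   # randomly initialized, ready for MLM pretraining
```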

### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

With the dataset ready, training this model on a 16 GB NVIDIA graphics card took around 10 hours.
 
# Evaluation

No evaluation has been performed on this model.