---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: "Gallia est omnis divisa in [MASK] tres ."
example_title: "Commentary on Gallic Wars"
- text: "[MASK] sum Caesar ."
example_title: "Who is Caesar?"
- text: "[MASK] it ad forum ."
example_title: "Who is going to the forum?"
- text: "Ovidius paratus est ad [MASK] ."
example_title: "What is Ovidius up to?"
- text: "[MASK], veni!"
example_title: "Calling someone to come closer"
- text: "Roma in Italia [MASK] ."
example_title: "Ubi est Roma?"
---
# Model Card for Simple Latin BERT
<!-- Provide a quick summary of what the model is/does. [Optional] -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on corpora freely available from the [Classical Language Toolkit](http://cltk.org/).
**NOT** suitable for production or commercial use.
This model's performance is poor, and it has not been evaluated.
This model comes with its own tokenizer, which automatically **lowercases** its input (a quick tokenizer sketch follows).
Check the `training notebooks` folder for the preprocessing and training scripts.
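A minimal, untested sketch of loading the bundled tokenizer and checking the lowercasing, assuming the `transformers` library and a placeholder Hub ID for this repository:

```python
from transformers import AutoTokenizer

# Placeholder: replace with this repository's actual Hugging Face Hub ID.
model_id = "LuisAVasquez/simple-latin-bert-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The tokenizer lowercases automatically, so both inputs tokenize the same way.
print(tokenizer.tokenize("Gallia est omnis divisa in partes tres ."))
print(tokenizer.tokenize("GALLIA EST OMNIS DIVISA IN PARTES TRES ."))
```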
Inspired by
- [This repo](https://github.com/dbamman/latin-bert), which has a BERT model for Latin that is actually useful!
- [This tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples)
- [This tutorial](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=VNZZs-r6iKAV)
- [This tutorial](https://huggingface.co/blog/how-to-train)
# Table of Contents
- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Downstream Use](#downstream-use)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)
# Model Details
## Model Description
<!-- Provide a longer summary of what this model is/does. -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on corpora freely available from the [Classical Language Toolkit](http://cltk.org/).
**NOT** suitable for production or commercial use.
This model's performance is poor, and it has not been evaluated.
This model comes with its own tokenizer!
Check the `notebooks` folder for the preprocessing and training scripts.
- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** Latin (la)
- **License:** MIT
# Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
## Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
This model can be used directly for Masked Language Modelling, for example through the `fill-mask` pipeline (see the sketch below).
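A minimal inference sketch with the `fill-mask` pipeline; the Hub ID below is a placeholder for this repository's actual ID:

```python
from transformers import pipeline

# Placeholder: replace with this repository's actual Hugging Face Hub ID.
fill_mask = pipeline("fill-mask", model="LuisAVasquez/simple-latin-bert-uncased")

# One of the widget examples from this card.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], round(prediction["score"], 4))
```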
## Downstream Use
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
This model could be used as a base model for other NLP tasks, for example Text Classification (that is, using transformers' `BertForSequenceClassification`), as sketched below.
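A sketch of what that could look like, assuming a hypothetical binary classification task; the Hub ID and label count are placeholders:

```python
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder: replace with this repository's actual Hugging Face Hub ID.
model_id = "LuisAVasquez/simple-latin-bert-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

# The classification head is freshly initialised; the model still needs
# fine-tuning on a labelled Latin dataset (e.g. with transformers' Trainer).
```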
# Training Details
## Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The training data comes from the corpora freely available from the [Classical Language Toolkit](http://cltk.org/)
- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)
## Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
### Preprocessing
For preprocessing, the raw text of each corpus was extracted by parsing, then **lowercased** and written to `txt` files, ideally with one sentence per line (a rough sketch follows).
Other data from the corpora, such as entity tags and POS tags, were discarded.
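The following is only a rough illustration of that step, not the original preprocessing script:

```python
from pathlib import Path

def write_corpus(sentences, output_path):
    """Lowercase the extracted sentences and write them one per line."""
    lines = (s.strip().lower() for s in sentences if s.strip())
    Path(output_path).write_text("\n".join(lines), encoding="utf-8")

write_corpus(
    ["Gallia est omnis divisa in partes tres .", "Roma in Italia est ."],
    "latin_corpus.txt",
)
```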
Training hyperparameters (an illustrative configuration follows the list):
- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens
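An illustrative `transformers` configuration matching the hyperparameters above; the remaining values (vocabulary size, hidden size, and so on) are assumptions rather than the ones actually used:

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,            # assumption: depends on the custom tokenizer
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=512,  # max input size of 512 tokens
)
model = BertForMaskedLM(config)

# Training then runs for 1 epoch with a batch size of 64,
# e.g. with transformers' Trainer on the preprocessed txt corpora.
```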
### Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
With the dataset ready, training this model on a 16 GB NVIDIA graphics card took around 10 hours.
# Evaluation
No evaluation was performed on this model.