---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: "Gallia est omnis divisa in [MASK] tres ."
  example_title: "Commentary on Gallic Wars"
- text: "[MASK] sum Caesar ."
  example_title: "Who is Caesar?"
- text: "[MASK] it ad forum ."
  example_title: "Who is going to the forum?"
- text: "Ovidius paratus est ad [MASK] ."
  example_title: "What is Ovidius up to?"
- text: "[MASK], veni!"
  example_title: "Calling someone to come closer"
- text: "Roma in Italia [MASK] ."
  example_title: "Ubi est Roma?"
---
# Model Card for Simple Latin BERT

<!-- Provide a quick summary of what the model is/does. [Optional] -->
A simple BERT masked language model for Latin, built for my portfolio and trained on corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.
The model's performance is poor and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** all input.
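As a quick illustration (the repository id below is a placeholder for this model's actual Hub id), loading the tokenizer with `AutoTokenizer` shows the lowercasing:

```python
from transformers import AutoTokenizer

# "your-username/simple-latin-bert" is a placeholder; use this repository's actual Hub id.
tokenizer = AutoTokenizer.from_pretrained("your-username/simple-latin-bert")

# Input is lowercased automatically before being split into wordpieces.
print(tokenizer.tokenize("Gallia est omnis divisa in partes tres ."))
```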
Check the `training notebooks` folder for the preprocessing and training scripts.

Inspired by:

- [This repo](https://github.com/dbamman/latin-bert), which has a BERT model for Latin that is actually useful!
- [This tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples)
- [This tutorial](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=VNZZs-r6iKAV)
- [This tutorial](https://huggingface.co/blog/how-to-train)
# Table of Contents

- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Downstream Use](#downstream-use)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)
# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
A simple BERT masked language model for Latin, built for my portfolio and trained on corpora from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.
The model's performance is poor and it has not been evaluated.

This model comes with its own tokenizer!

Check the `notebooks` folder for the preprocessing and training scripts.
- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** la
- **License:** MIT
# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

This model can be used directly for masked language modelling (fill-mask).
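For example, a minimal sketch using the `transformers` fill-mask pipeline, with a placeholder repository id standing in for this model's actual Hub id:

```python
from transformers import pipeline

# "your-username/simple-latin-bert" is a placeholder; use this repository's actual Hub id.
fill_mask = pipeline("fill-mask", model="your-username/simple-latin-bert")

# The tokenizer lowercases input automatically, so casing does not matter.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], round(prediction["score"], 3))
```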
## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

This model could be used as a base model for other NLP tasks, for example text classification (that is, using transformers' `BertForSequenceClassification`).
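A rough sketch of that setup, where the repository id and the two-label head are placeholders rather than anything shipped with this model:

```python
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder Hub id and label count; substitute your own.
model_id = "your-username/simple-latin-bert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

# The pretrained encoder weights are reused; the classification head is
# randomly initialised and still needs fine-tuning on labelled data.
inputs = tokenizer("arma virumque cano .", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```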
# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comes from the corpora freely available from the [Classical Language Toolkit](http://cltk.org/):

- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)
## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

For preprocessing, the raw text of each corpus was extracted by parsing, **lowercased**, and written to `txt` files, ideally with one sentence per line.

Other data from the corpora, such as entity tags and POS tags, were discarded.
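As a rough illustration of this step, the sketch below lowercases a corpus and writes roughly one sentence per line; the naive regex split is an assumption, not the actual notebook code:

```python
import re
from pathlib import Path

def corpus_to_lines(raw_text: str) -> list[str]:
    """Lowercase raw corpus text and split it into roughly one sentence per line."""
    text = raw_text.lower()
    # Naive split on sentence-final punctuation; the real notebooks may differ.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def write_training_file(raw_text: str, out_path: Path) -> None:
    """Write one (approximate) sentence per line, as the MLM training expects."""
    out_path.write_text("\n".join(corpus_to_lines(raw_text)), encoding="utf-8")
```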
Training hyperparameters:

- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens

A rough sketch of a matching configuration is shown below.
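The vocabulary size and the `TrainingArguments` wiring are assumptions, not values taken from the training notebooks:

```python
from transformers import BertConfig, BertForMaskedLM, TrainingArguments

# Architecture matching the hyperparameters listed above.
config = BertConfig(
    vocab_size=30_000,             # assumed; set to the actual tokenizer vocabulary size
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)

# Optimisation settings matching the list above (1 epoch, batch size 64).
training_args = TrainingArguments(
    output_dir="simple-latin-bert",
    num_train_epochs=1,
    per_device_train_batch_size=64,
)
```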
### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

With the dataset ready, training this model on a 16 GB NVIDIA GPU took around 10 hours.
# Evaluation

No evaluation of this model has been performed.