---
license: mit
language:
- la
pipeline_tag: fill-mask
tags:
- latin
- masked language modelling
widget:
- text: "Gallia est omnis divisa in [MASK] tres ."
  example_title: "Commentary on Gallic Wars"
- text: "[MASK] sum Caesar ."
  example_title: "Who is Caesar?"
- text: "[MASK] it ad forum ."
  example_title: "Who is going to the forum?"
- text: "Ovidius paratus est ad [MASK] ."
  example_title: "What is Ovidius up to?"
- text: "[MASK], veni!"
  example_title: "Calling someone to come closer"
- text: "Roma in Italia [MASK] ."
  example_title: "Ubi est Roma?"
---




# Model Card for Simple Latin BERT 

<!-- Provide a quick summary of what the model is/does. [Optional] -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on the Latin corpora freely available from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.  
This model's performance is very poor, and it has not been evaluated.

This model comes with its own tokenizer! It automatically **lowercases** all input.

Check the `training notebooks` folder for the preprocessing and training scripts.

Inspired by
- [This repo](https://github.com/dbamman/latin-bert), which has a BERT model for Latin that is actually useful!
- [This tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples)
- [This tutorial](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb#scrollTo=VNZZs-r6iKAV)
- [This tutorial](https://huggingface.co/blog/how-to-train)

# Table of Contents

- [Model Card for Simple Latin BERT](#model-card-for-simple-latin-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
- [Evaluation](#evaluation)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
A simple BERT Masked Language Model for Latin, built for my portfolio and trained on the Latin corpora freely available from the [Classical Language Toolkit](http://cltk.org/).

**NOT** suitable for production or commercial use.  
This model's performance is very poor, and it has not been evaluated.

This model comes with its own tokenizer!

Check the `notebooks` folder for the preprocessing and training scripts.

- **Developed by:** Luis Antonio VASQUEZ
- **Model type:** Language model
- **Language(s) (NLP):** Latin (la)
- **License:** MIT



# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->

This model can be used directly for Masked Language Modelling.
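
A minimal usage sketch with the `fill-mask` pipeline. The repository ID below is a placeholder; replace it with the actual path of this model on the Hub.

```python
from transformers import pipeline

# "user/simple-latin-bert" is a placeholder repository ID, not the real one.
fill_mask = pipeline("fill-mask", model="user/simple-latin-bert")

# The bundled tokenizer lowercases the input automatically.
for prediction in fill_mask("Gallia est omnis divisa in [MASK] tres ."):
    print(prediction["token_str"], round(prediction["score"], 4))
```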


## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
<!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
 
This model could be used as a base model for other NLP tasks, for example, text classification (that is, using transformers' `BertForSequenceClassification`).
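
A sketch of how the encoder could be reused for classification. The repository ID and label count below are placeholders, and the classification head would still need fine-tuning on labelled Latin data.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

model_id = "user/simple-latin-bert"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The pretrained encoder weights are reused; the classification head is freshly initialized.
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Gallia est omnis divisa in partes tres .", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)
```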






# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The training data comes from the corpora freely available from the [Classical Language Toolkit](http://cltk.org/):

- [The Latin Library](https://www.thelatinlibrary.com/)
- Latin section of the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- Latin section of the [Tesserae Project](https://tesserae.caset.buffalo.edu/)
- [Corpus Grammaticorum Latinorum](https://cgl.hypotheses.org/)




## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing

For preprocessing, the raw text of each corpus was extracted by parsing, then **lowercased** and written to `txt` files, ideally with one sentence per line.

Other data from the corpora, such as entity tags and POS tags, were discarded.
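
A rough sketch of that last step, assuming the sentences have already been extracted from a corpus; the function name and file name are made up for illustration.

```python
from pathlib import Path

def write_corpus(sentences, out_path):
    """Write one lowercased sentence per line, as expected by the training scripts."""
    with Path(out_path).open("w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence.strip().lower() + "\n")

write_corpus(["Gallia est omnis divisa in partes tres ."], "latin_library.txt")
```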

Training hyperparameters (see the configuration sketch below):
- Epochs: 1
- Batch size: 64
- Attention heads: 12
- Hidden layers: 12
- Max input size: 512 tokens
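
A minimal configuration sketch matching the hyperparameters above. The vocabulary size is an assumption; in practice it would come from the trained tokenizer.

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,            # assumed; set to len(tokenizer) in practice
    num_hidden_layers=12,         # hidden layers
    num_attention_heads=12,       # attention heads
    max_position_embeddings=512,  # max input size in tokens
)
model = BertForMaskedLM(config)   # randomly initialized, ready for MLM pretraining
```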

### Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

With the dataset ready, training this model on a 16 GB NVIDIA graphics card took around 10 hours.
 
# Evaluation

No evaluation has been performed on this model.