---
language:
- sk
tags:
- Slovak GPT-J
- pytorch
- causal-lm
license: gpl-3.0
---

# Slovak GPT-J-405M

Slovak GPT-J-405M is the second model released in the Slovak GPT-J series, after its smaller variant [Slovak GPT-J-162M](https://huggingface.co/Milos/slovak-gpt-j-162M).

## Model Description

The model is based on [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/) and has over 405M trainable parameters.

<figure>

| Hyperparameter       | Value                                                                                                                                   |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| \\(n_{parameters}\\) | 405,677,136                                                                                                                             |
| \\(n_{layers}\\)     | 24                                                                                                                                      |
| \\(d_{model}\\)      | 1024                                                                                                                                    |
| \\(d_{ff}\\)         | 16384                                                                                                                                   |
| \\(n_{heads}\\)      | 16                                                                                                                                      |
| \\(d_{head}\\)       | 256                                                                                                                                     |
| \\(n_{ctx}\\)        | 2048                                                                                                                                    |
| \\(n_{vocab}\\)      | 50256 (same tokenizer as GPT-2/3†)                                                                                                      |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)                                                                   |
| RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |

<p><strong>†</strong> ByteLevelBPETokenizer was trained on the same Slovak corpus.</p></figure>
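
As a quick sanity check of the parameter count in the table, here is a minimal sketch (it assumes the model is loaded exactly as in the [How to use] section below; the expectation of roughly 405.7M is taken from the \\(n_{parameters}\\) row):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")

# Count all trainable parameters; this should land at roughly 405.7M,
# matching the n_parameters row in the table above.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,}")
```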

## Training data

Slovak GPT-J models were trained on a privately collected dataset consisting predominantly of Slovak text spanning different categories, e.g. web, news articles or even biblical texts - in total, over 40GB of text data was used to train this model.
The dataset was preprocessed and cleaned in a specific way that introduces a few minor caveats, so in order to achieve the expected performance, refer to the [How to use] section. Please keep in mind that despite the effort to remove inappropriate parts of the corpus, the model might still generate sensitive content or leak sensitive information.

## Training procedure

This model was trained for a bit more than 36.5 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.821.

## Intended Use

Like the original GPT-J, Slovak GPT-J learns an inner representation of the language that can be used to extract features useful for downstream tasks; however, the intended use is text generation from a prompt.
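
As an illustration of the feature-extraction use case, here is a minimal sketch; the mean pooling at the end is an assumption of this example, not a recommendation from this card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")

# Run a forward pass and keep the hidden states of every layer.
inputs = tokenizer("Mám rád slovenčinu", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last layer's hidden states into a single 1024-dimensional
# vector (d_model = 1024 per the table above) usable as a downstream feature.
features = outputs.hidden_states[-1].mean(dim=1)
print(features.shape)  # torch.Size([1, 1024])
```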

### How to use

This model, along with its tokenizer, can be easily loaded using the `AutoModelForCausalLM` functionality:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")
```

When writing a prompt, keep these three things in mind and you should be good to go:
1. Never leave trailing whitespace. There's a difference between how the tokenizer encodes "Mám rád slovenčinu" (no space after `slovenčinu`) and "Mám rád slovenčinu " (trailing space after `slovenčinu`), i.e. `[12805, 2872, 46878]` != `[12805, 2872, 46878, 221]`; see the short sketch after this list.
2. Always use good ol' US English primary double quotation marks, i.e. `""` instead of `„“`.
3. For a new line, always enter `\n\n` instead of a single `\n`.

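A minimal sketch of the first point (the token IDs in the comments are the ones quoted above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")

# A trailing space yields a different (longer) token sequence.
ids_no_space = tokenizer("Mám rád slovenčinu").input_ids   # [12805, 2872, 46878]
ids_trailing = tokenizer("Mám rád slovenčinu ").input_ids  # [12805, 2872, 46878, 221]
assert ids_no_space != ids_trailing
```
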
To illustrate basic text generation:
```
>>> prompt = "Tradičné jedlo na Orave sú"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Tradičné jedlo na Orave sú bryndzové halušky\n\nNa Orave sa v minulosti varilo viac druhov'
```

### Capabilities, Limitations, and Biases

The capability of this particular model is fairly decent despite its small size of 405M parameters. With relative ease it can generate interesting and grammatically correct content.
For example, you can try a few of the following prompts. (For the sake of simplicity, I have omitted all the boilerplate code and swapped `\n` for actual new lines.)

Try generating "How to" articles:
```
PROMPT
Ako napísať pôsobivú esej krok za krokom
OUTPUT
Ako napísať pôsobivú esej krok za krokom

V tomto článku sa dozviete, ako napísať esej, ktorá bude mať v sebe niečo, čo zaujme a bude sa vám páčiť.

V prvom rade si musíte uvedomiť, že esej je vlastne písaný text, ktorý má byť napísaný na počítači.
```
However, relying on the model to produce factually correct information isn't recommended.

Or try to generate chat conversations:
```
PROMPT
Peter: Ako sa mas?

Lucia: Ale celkom dobre. Co si robil?

Peter:
OUTPUT
Peter: Ako sa mas?

Lucia: Ale celkom dobre. Co si robil?

Peter: No, bol som na chate.

Lucia: A co si tam robil?

Peter: No, bol som tam s kamošmi.

Lucia: A co si tam robil?

Peter: No, bol som tam s kamošmi.
```
Apparently, either Peters are just more likely to act suspiciously in this particular context or there's a problem with the model. Let's assume the second explanation holds some merit. In general, GPT models can (and often will) get stuck in a cycle of generating the same content over and over. This is a common problem beyond the scope of this README; however, see [generate's documentation](https://huggingface.co/docs/transformers/master/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate) on how to introduce a frequency/repetition penalty. A minimal sketch follows below.
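
The sketch below reuses `tokenizer` and `model` from the [How to use] section and the chat prompt from the example above; the parameter values are illustrative assumptions, not tuned recommendations:

```python
# Reuse the chat-style prompt from the example above (note the \n\n newlines).
prompt = "Peter: Ako sa mas?\n\nLucia: Ale celkom dobre. Co si robil?\n\nPeter:"
encoded_input = tokenizer(prompt, return_tensors='pt')

# repetition_penalty and no_repeat_ngram_size both discourage the model from
# looping on the same phrases; the values here are illustrative.
output = model.generate(
    **encoded_input,
    max_length=64,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(output[0]))
```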

Since the dataset contains profanity, politically incorrect language, and (unintentionally) even bits of Czech text, the model can generate such content to some extent too. Here's an example of the model's output when the prompt is in Czech:
```
>>> prompt = "Věta nesmí být sprostá a musí být zcela"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input, max_length=16)
>>> tokenizer.decode(output[0])
'Věta nesmí být sprostá a musí být zcela pravdivá.'
```

## Citation and Related Information

This was done as a moonlighting project during the summer of 2021 to better understand transformers. I didn't have much free time to open source it properly, so it all sat on my hard drive until now :) Based on the popularity and interest in this model, I might release _substantially_ larger versions of Slovak GPT-J models that are way more capable.

If you use this model or have any questions about it, feel free to hit me up on [Twitter](https://twitter.com/miloskondela) or check out my [GitHub](https://github.com/kondela) profile.

### BibTeX entry
To cite this model:
```bibtex
@misc{slovak-gpt-j-405m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-405M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-405M}},
  year = 2022,
  month = February
}
```

To cite the codebase that trained this model:
```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```

## Acknowledgements

This project was generously supported by the [TPU Research Cloud (TRC) program](https://sites.research.google/trc/about/). A shoutout also goes to [Ben Wang](https://github.com/kingoflolz) and the great [EleutherAI community](https://www.eleuther.ai/).