---
language:
- sk
tags:
- Slovak GPT-J
- pytorch
- causal-lm
license: gpl-3.0
---

# Slovak GPT-J-405M
Slovak GPT-J-405M is the second model released in the Slovak GPT-J series, after its smaller variant [Slovak GPT-J-162M](https://huggingface.co/Milos/slovak-gpt-j-162M).
## Model Description
The model is based on [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/) and has over 405M trainable parameters.

<figure>

| Hyperparameter       | Value                                                                                                                                    |
|----------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| \\(n_{parameters}\\) | 405,677,136                                                                                                                              |
| \\(n_{layers}\\)     | 24                                                                                                                                       |
| \\(d_{model}\\)      | 1024                                                                                                                                     |
| \\(d_{ff}\\)         | 16384                                                                                                                                    |
| \\(n_{heads}\\)      | 16                                                                                                                                       |
| \\(d_{head}\\)       | 256                                                                                                                                      |
| \\(n_{ctx}\\)        | 2048                                                                                                                                     |
| \\(n_{vocab}\\)      | 50256 (same tokenizer as GPT-2/3&dagger;)                                                                                                |
| Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)                                                                     |
| RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223)  |

<p><strong>&dagger;</strong> ByteLevelBPETokenizer was trained on the same Slovak corpus.</p></figure>

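If you want to double-check the parameter count from the table, here is a minimal sketch (it assumes `transformers` and `torch` are installed; the expected total is the value stated above):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")

# Count trainable parameters; this should come out to roughly 405,677,136.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")
```
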
## Training data

Slovak GPT-J models were trained on a privately collected dataset consisting of predominantly Slovak text spanning different categories, e.g. web pages, news articles, or even biblical texts - in total, over 40GB of text data was used to train this model.
The dataset was preprocessed and cleaned in a specific way that involves a few minor caveats, so in order to achieve the expected performance, feel free to refer to the [How to use](#how-to-use) section. Please keep in mind that, despite the effort to remove inappropriate parts of the corpus, the model may still generate sensitive content or leak sensitive information.

## Training procedure

This model was trained for a bit more than 36.5 billion tokens over 69,001 steps on a TPU v3-8 pod. The cross-entropy validation loss at the last step was 2.821.

## Intended Use

Like the original GPT-J, Slovak GPT-J learns an inner representation of the language that can be used to extract features useful for downstream tasks, as sketched below; however, its intended use is text generation from a prompt.

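As an illustration of the feature-extraction idea (not an official recipe; it loads the model the same way as the How to use section below, and mean-pooling is just one common choice):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")

inputs = tokenizer("Mám rád slovenčinu", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden layer into a single 1024-dimensional feature vector.
features = outputs.hidden_states[-1].mean(dim=1)
print(features.shape)  # torch.Size([1, 1024])
```
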
### How to use

This model, along with the tokenizer, can be easily loaded using the `AutoModelForCausalLM` functionality:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Milos/slovak-gpt-j-405M")
model = AutoModelForCausalLM.from_pretrained("Milos/slovak-gpt-j-405M")
```

When constructing a prompt, keep these three things in mind and you should be good to go:
1. Never leave trailing whitespace. There's a difference between how the tokenizer encodes "Mám rád slovenčinu" (no space after `slovenčinu`) and "Mám rád slovenčinu " (trailing space after `slovenčinu`), i.e. `[12805, 2872, 46878]` != `[12805, 2872, 46878, 221]` (see the quick check after this list).
2. Always use good ol' US English primary double quotation marks, i.e. `""` instead of `„“`.
3. For a new line, always enter `\n\n` instead of a single `\n`.

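As a quick check of the first point, you can compare the token ids with and without the trailing space (the ids in the comments are the ones quoted above):

```python
# Trailing whitespace changes the token ids the model sees.
print(tokenizer("Mám rád slovenčinu")["input_ids"])   # [12805, 2872, 46878]
print(tokenizer("Mám rád slovenčinu ")["input_ids"])  # [12805, 2872, 46878, 221]
```
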
To illustrate, here is an example of basic text generation:
```python
>>> prompt = "Tradičné jedlo na Orave sú"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input)
>>> tokenizer.decode(output[0])
'Tradičné jedlo na Orave sú bryndzové halušky\n\nNa Orave sa v minulosti varilo viac druhov'
```

### Capabilities, Limitations, and Biases

Despite its small size of 405M parameters, this particular model is reasonably capable: with relative ease, it manages to generate interesting and grammatically correct content.
For example, you can try a few of the following prompts. (For the sake of simplicity, I have omitted all the boilerplate code and swapped `\n` for new lines; a sketch of that boilerplate is shown right below.)

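The omitted boilerplate is essentially the loading and `generate` calls from the snippets above, just with a longer output budget; here is a minimal sketch for the first prompt (illustrative only - the decoding settings are not the exact ones used for the samples, so outputs will vary):

```python
prompt = "Ako napísať pôsobivú esej krok za krokom"
encoded_input = tokenizer(prompt, return_tensors="pt")
# A longer max_length lets the model produce a few sentences instead of a short snippet.
output = model.generate(**encoded_input, max_length=64)
print(tokenizer.decode(output[0]))
```
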
Try generating "How to" articles:
```
PROMPT
Ako napísať pôsobivú esej krok za krokom
OUTPUT
Ako napísať pôsobivú esej krok za krokom

V tomto článku sa dozviete, ako napísať esej, ktorá bude mať v sebe niečo, čo zaujme a bude sa vám páčiť.

V prvom rade si musíte uvedomiť, že esej je vlastne písaný text, ktorý má byť napísaný na počítači.
```
However, relying on the model to produce factually correct information isn't recommended.

Or try to generate chat conversations:
```
PROMPT
Peter: Ako sa mas?

Lucia: Ale celkom dobre. Co si robil?

Peter:
OUTPUT
Peter: Ako sa mas?

Lucia: Ale celkom dobre. Co si robil?

Peter: No, bol som na chate.

Lucia: A co si tam robil?

Peter: No, bol som tam s kamošmi.

Lucia: A co si tam robil?

Peter: No, bol som tam s kamošmi.
```
Apparently, either Peters are more likely to act suspiciously in this particular context, or there's a problem with the model. Let's assume that the second explanation may hold some merit. In general, GPT models can (and often will) get into a repeating cycle of generating the same content. This is a common problem beyond the scope of this README; see [generate's documentation](https://huggingface.co/docs/transformers/master/en/main_classes/model#transformers.generation_utils.GenerationMixin.generate) on how to introduce a frequency/repetition penalty, as sketched below.

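A minimal sketch of such a penalty, reusing the `tokenizer` and `model` loaded above (the values are illustrative, not tuned):

```python
prompt = "Peter: Ako sa mas?\n\nLucia: Ale celkom dobre. Co si robil?\n\nPeter:"
encoded_input = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **encoded_input,
    max_length=64,
    repetition_penalty=1.3,   # down-weight tokens that have already appeared
    no_repeat_ngram_size=3,   # never repeat the same 3-gram verbatim
)
print(tokenizer.decode(output[0]))
```
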
Since the dataset contains profanity, politically incorrect language, and (unintentionally) even bits of text in Czech, the model can generate such content to some extent too. Here's an example of the model output when the prompt is in Czech:
```python
>>> prompt = "Věta nesmí být sprostá a musí být zcela"
>>> encoded_input = tokenizer(prompt, return_tensors='pt')
>>> output = model.generate(**encoded_input, max_length=16)
>>> tokenizer.decode(output[0])
'Věta nesmí být sprostá a musí být zcela pravdivá.'
```

## Citation and Related Information

This was done as a moonlighting project during the summer of 2021 to better understand transformers. I didn't have much free time to open source it properly, so it all sat on my hard drive until now :) Based on the popularity of and interest in this model, I might release _substantially_ larger and more capable versions of Slovak GPT-J models.

If you use this model or have any questions about it, feel free to hit me up on [twitter](https://twitter.com/miloskondela) or check out my [github](https://github.com/kondela) profile.

### BibTeX entry
To cite this model:
```bibtex
@misc{slovak-gpt-j-405m,
  author = {Kondela, Milos},
  title = {{Slovak GPT-J-405M}},
  howpublished = {\url{https://huggingface.co/Milos/slovak-gpt-j-405M}},
  year = 2022,
  month = February
}
```

To cite the codebase that trained this model:
```bibtex
@misc{mesh-transformer-jax,
  author = {Wang, Ben},
  title = {{Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX}},
  howpublished = {\url{https://github.com/kingoflolz/mesh-transformer-jax}},
  year = 2021,
  month = May
}
```


## Acknowledgements
This project was generously supported by the [TPU Research Cloud (TRC) program](https://sites.research.google/trc/about/). Shoutout also goes to [Ben Wang](https://github.com/kingoflolz) and the great [EleutherAI community](https://www.eleuther.ai/).