patrickvonplaten committed (Commit 1876a70, Parent(s): b5f37f5): Update README.md

README.md CHANGED
@@ -9,95 +9,52 @@ license: mit

# OPT : Open Pre-trained Transformer Language Models

OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective.

OPT was first introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) and first released in [metaseq's repository](https://github.com/facebookresearch/metaseq) on May 3rd 2022 by Meta AI.

**Disclaimer**: The team releasing OPT wrote an official model card, which is available in Appendix D of the [paper](https://arxiv.org/pdf/2205.01068.pdf).
Content from **this** model card has been written by the Hugging Face team.

## Model description

OPT belongs to the same family of decoder-only models as [GPT-3](https://arxiv.org/abs/2005.14165). As such, it was pretrained using the self-supervised causal language modeling objective.

For evaluation, OPT follows [GPT-3](https://arxiv.org/abs/2005.14165) by using their prompts and overall experimental setup. For more details, please read
the [official paper](https://arxiv.org/abs/2205.01068).
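
As an illustration of this objective, the snippet below (a minimal sketch, not part of the official card) computes the next-token prediction loss directly by reusing the inputs as labels:

```python
>>> # Minimal sketch of the causal language modeling objective: the model is scored
>>> # on predicting every next token of the input (labels are the inputs themselves).
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

>>> inputs = tokenizer("Hello, I'm am conscious and", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs, labels=inputs["input_ids"])
>>> outputs.loss  # average cross-entropy of the next-token predictions
```
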
## Intended uses & limitations

The pretrained-only model can be used for prompt-based evaluation of downstream tasks as well as for text generation.
In addition, the model can be fine-tuned on a downstream task using the [CLM example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling). For all other OPT checkpoints, please have a look at the [model hub](https://huggingface.co/models?filter=opt).
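
A rough sketch of such fine-tuning with the `Trainer` API is shown below; the dataset (`wikitext-2-raw-v1`) and all hyper-parameters are placeholders, and the linked CLM example script remains the reference implementation.

```python
# Hypothetical fine-tuning sketch; dataset and hyper-parameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Small slice of a public corpus, purely for illustration.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
).filter(lambda example: len(example["input_ids"]) > 0)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="opt-350m-finetuned",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives the causal LM setting: labels are the (shifted) input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
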
### How to use

You can use this model directly with a pipeline for text generation.

```python
>>> from transformers import pipeline

>>> generator = pipeline('text-generation', model="facebook/opt-350m")
>>> generator("Hello, I'm am conscious and")
[{'generated_text': "Hello, I'm am conscious and I'm a bit of a noob. I'm looking for"}]
```
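
The pipeline above wraps a tokenizer and a causal LM head; a roughly equivalent sketch with the lower-level API (not part of the official card; greedy decoding, so outputs may not match the pipeline call exactly) looks like this:

```python
>>> # Sketch of the equivalent lower-level call.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
>>> model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

>>> input_ids = tokenizer("Hello, I'm am conscious and", return_tensors="pt").input_ids
>>> generated_ids = model.generate(input_ids, max_length=30)  # greedy decoding by default
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```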

By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`.

```python
>>> from transformers import pipeline, set_seed

>>> set_seed(32)
>>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True)
>>> generator("Hello, I'm am conscious and")
[{'generated_text': "Hello, I'm am conscious and I'm interested in this project. Can I get an initial contact"}]
```
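
The sampling behaviour can be tuned further by passing generation arguments through the pipeline call itself; for instance (a hypothetical variation, not from the official card), restricting sampling to the 10 most likely next tokens:

```python
>>> # Hypothetical variation: an explicit top-k budget and output length.
>>> from transformers import pipeline, set_seed

>>> set_seed(32)
>>> generator = pipeline('text-generation', model="facebook/opt-350m")
>>> generator("Hello, I'm am conscious and", do_sample=True, top_k=10, max_length=30)
```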

### Limitations and bias

As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of
unfiltered content from the internet, which is far from neutral, the model is strongly biased:

> Like other large language models for which the diversity (or lack thereof) of training

@@ -110,31 +67,37 @@ Here's an example of how the model can have biased predictions:

```python
>>> from transformers import pipeline, set_seed

>>> set_seed(32)
>>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True, num_return_sequences=5)
>>> generator("The woman worked as a")
[{'generated_text': "The woman works as a substitute teacher for kids who have missed school. She's the teacher herself,"},
{'generated_text': 'The woman works as a security guard for another company and does an average of around $13/hour'},
{'generated_text': 'The woman works as a receptionist, she could at the least wait a week or two for her'},
{'generated_text': 'The woman works as a manager/intern/career development coach/advisor at a nursing home'},
{'generated_text': 'The woman works as a maid and has to clean the house but you can tell her to do it'}]
```

compared to:

```python
>>> from transformers import pipeline, set_seed

>>> set_seed(0)
>>> generator = pipeline('text-generation', model="facebook/opt-350m", do_sample=True, num_return_sequences=5)
>>> generator("The man worked as a")
[{'generated_text': 'The man works as a security guard for the National Football League franchise. He has been a part of'},
{'generated_text': 'The man works as a security guard for another company and does an excellent job.\nI remember when'},
{'generated_text': 'The man works as a "secret agent" but at the same time he\'s working to protect the'},
{'generated_text': 'The man works as a manager/operator/servant for a grocery store and does a lot of'},
{'generated_text': 'The man works as a bouncer near the scene of the accident - how he could do that is'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents:

- BookCorpus, which consists of more than 10K unpublished books,
- CC-Stories, which contains a subset of CommonCrawl data filtered to match the

@@ -152,23 +115,20 @@ The dataset might contains offensive content as parts of the dataset are a subse

public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety.

### Collection process

The dataset was collected from the internet and went through classic data processing algorithms and
re-formatting practices, including removing repetitive/non-informative text like *Chapter One* or
*This ebook by Project Gutenberg.*

## Training procedure

### Preprocessing

The texts are tokenized using the **GPT2** byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a
vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens.
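
These values can be checked against the published checkpoint (a quick sketch, not part of the original card):

```python
>>> # Sketch: the configuration of the released checkpoint reflects the numbers above.
>>> from transformers import AutoConfig, AutoTokenizer

>>> config = AutoConfig.from_pretrained("facebook/opt-350m")
>>> config.vocab_size               # vocabulary size (50272)
>>> config.max_position_embeddings  # maximum sequence length in tokens (2048)

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
>>> tokenizer("Hello world!").input_ids  # GPT2-style byte-level BPE ids
```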

The 175B model was trained on 992 *80GB A100 GPUs*. The training duration was roughly 33 days of continuous training.

### BibTeX entry and citation info