	---
	language: en
	tags:
	- exbert

	---


	# OLM GPT-2 October 2022

	This is a more up-to-date version of the [original GPT-2](https://huggingface.co/gpt2).
	In addition to being more up-to-date, it also tends to perform better than the original GPT-2 on standard zero-shot benchmarks (see the evaluation results below).
	It was trained on a cleaned October 2022 snapshot of Common Crawl and Wikipedia.

	## Intended uses

	You can use the raw model for text generation or fine-tune it to a downstream task.

	## How to use

	You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we
	set a seed for reproducibility:

	```python
	>>> from transformers import pipeline, set_seed
	>>> # It is important to pass bad_words_ids=[[0,2]] if you want this model to stay on topic.
	>>> # Otherwise, the model may generate start and end tokens followed by text that is
	>>> # unrelated to the preceding text.
	>>> generator = pipeline('text-generation', model='olm/olm-gpt2-oct-2022', bad_words_ids=[[0,2]])
	>>> set_seed(42)
	>>> # This example also illustrates that our model sometimes generates
	>>> # bloggy/spammy/web-like text, even though it gets higher evaluation results
	>>> # than the original GPT-2 across a variety of benchmarks. See the first output.
	>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
	Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
	[
	{'generated_text': "Hello, I'm a language model, but you can take me if I want.\nReplyDelete\nReplies\nReply\nAnonymous October 17, 2011"},
	{'generated_text': "Hello, I'm a language model, and here's some useful news for you all: The release date for the new release of"},
	{'generated_text': "Hello, I'm a language model, I'm not a developer or anybody who's working on those. I'm a freelancer... I"},
	{'generated_text': "Hello, I'm a language model, a language analyst, and a language system designer. I'm just curious about the"},
	{'generated_text': "Hello, I'm a language model, I'm passionate about languages, but I don't understand how my system works, the interaction"}
	]
	```

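	Here is how to run this model on a given text in PyTorch and read out its next-token logits and final-layer hidden states (features):

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained('olm/olm-gpt2-oct-2022')
	model = AutoModelForCausalLM.from_pretrained('olm/olm-gpt2-oct-2022')
	text = "Replace me by any text you'd like."
	encoded_input = tokenizer(text, return_tensors='pt')
	# Inference only, so gradient tracking is not needed.
	with torch.no_grad():
	    output = model(**encoded_input, output_hidden_states=True)
	logits = output.logits               # shape: (batch, seq_len, vocab_size)
	features = output.hidden_states[-1]  # shape: (batch, seq_len, hidden_size)
	```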

	## Dataset

	The model and tokenizer were trained with this [October 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295) plus this [October 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221001).
	The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-october-2022-tokenized-1024).
	The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).

	## Training

	The model was trained according to the GPT2 instructions at this [repo](https://github.com/huggingface/olm-training).

	## Evaluation results

	The model achieves the following results without any fine-tuning (zero-shot):

	| Task          | Metric       | Original GPT2  | OLM GPT2 (Ours) | Significance (two-tailed p-value) |
	|:--------------|:-------------|---------------:|----------------:|----------------------------------:|
	| rte           | acc          | 0.5307         | 0.5415          | 0.7188                            |
	| piqa          | acc/acc_norm | 0.6289/0.6251  | 0.6638/0.6670   | 0.0020/0.0002                     |
	| copa          | acc          | 0.6400         | 0.6900          | 0.3000                            |
	| record        | f1/em        | 0.7094/0.7026  | 0.6874/0.6810   | 0.0000/0.0000                     |
	| boolq         | acc          | 0.4872         | 0.5606          | 0.0000                            |
	| cb            | acc/f1       | 0.4101/0.2619  | 0.3571/0.1754   | 0.4193/NA                         |
	| hellaswag     | acc/acc_norm | 0.2892/0.3114  | 0.3076/0.3491   | 0.0000/0.0000                     |
	| mrpc          | acc/f1       | 0.5662/0.6911  | 0.6495/0.7741   | 0.0007/0.0002                     |
	| multirc       | acc          | 0.0189         | 0.0115          | 0.0959                            |
	| lambada       | ppl/acc      | 40.0554/0.3256 | 28.6733/0.3625  | 0.0000/0.0000                     |
	| wsc           | acc          | 0.4327         | 0.3654          | 0.1679                            |
	| wic           | acc          | 0.4922         | 0.5000          | 0.6924                            |
	| mnli          | acc          | 0.3372         | 0.3471          | 0.0384                            |
	| qnli          | acc          | 0.5017         | 0.4981          | 0.5884                            |
	| cola          | mcc          | 0.0126         | 0.0181          | 0.8614                            |
	| triviaqa      | acc          | 0.0151         | 0.0182          | 0.0048                            |
	| winogrande    | acc          | 0.5162         | 0.5114          | 0.7360                            |
	| webqs         | acc          | 0.0030         | 0.0108          | 0.0000                            |
	| arc_easy      | acc/acc_norm | 0.4381/0.3948  | 0.4651/0.4247   | 0.0082/0.0029                     |
	| arc_challenge | acc/acc_norm | 0.1903/0.2270  | 0.1997/0.2329   | 0.4132/0.6256                     |

	To get these results, we used the EleutherAI evaluation harness [here](https://github.com/EleutherAI/lm-evaluation-harness).
	The harness can produce results that differ slightly from those reported in the GPT-2 paper.
	The p-values are computed from the standard errors (stderr) reported by the evaluation harness, under the assumption that the scores are normally distributed.
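
To make the significance column easier to interpret, here is a minimal sketch of how a two-tailed p-value can be derived from a score difference and a standard error under a normal approximation. It is not the script used to produce the table: the function name `two_tailed_p_value` is hypothetical, the standard error in the usage line is a made-up placeholder rather than harness output, and whether one standard error or both should be used is an assumption left to the caller.

```python
import math


def two_tailed_p_value(score_a, score_b, stderr_a, stderr_b=None):
    """Two-tailed p-value for score_a vs. score_b under a normal approximation.

    If stderr_b is given, the two standard errors are combined in quadrature
    (treating the two estimates as independent); otherwise stderr_a alone is
    used. Which variant corresponds to the table is an assumption.
    """
    stderr = stderr_a if stderr_b is None else math.hypot(stderr_a, stderr_b)
    if stderr <= 0:
        raise ValueError("standard error must be positive")
    z = abs(score_a - score_b) / stderr
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# copa accuracies from the table; the stderr of 0.05 is a placeholder value.
p = two_tailed_p_value(0.6400, 0.6900, stderr_a=0.05)
print(f"p = {p:.4f}")
```

A small p-value suggests the gap between the two models is unlikely to be sampling noise on that benchmark, while a large one (as for several rows in the table) means the difference could plausibly be due to chance.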