michaelfeil committed on
Commit a35a608 · 1 Parent(s): 8fe89bd

Upload mosaicml/mpt-7b-instruct ctranslate fp16 weights

Files changed (5)
  1. README.md +61 -34
  2. config.json +55 -4
  3. model.bin +2 -2
  4. requirements.txt +2 -0
  5. vocabulary.json +0 -0
README.md CHANGED
@@ -16,38 +16,40 @@ Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on
16
 
17
  quantized version of [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
18
  ```bash
19
- pip install hf-hub-ctranslate2>=2.0.8 ctranslate2>=3.14.0
20
- ```
21
- Converted on 2023-05-31 using
22
- ```
23
- ct2-transformers-converter --model mosaicml/mpt-7b-instruct --output_dir /home/michael/tmp-ct2fast-mpt-7b-instruct --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization float16 --trust_remote_code
24
  ```
25
 
26
- Checkpoint compatible to [ctranslate2>=3.14.0](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.0.8](https://github.com/michaelfeil/hf-hub-ctranslate2)
27
- - `compute_type=int8_float16` for `device="cuda"`
28
- - `compute_type=int8` for `device="cpu"`
29
-
30
  ```python
31
- from hf_hub_ctranslate2 import TranslatorCT2fromHfHub, GeneratorCT2fromHfHub
32
- from transformers import AutoTokenizer
33
-
34
  model_name = "michaelfeil/ct2fast-mpt-7b-instruct"
35
- # use either TranslatorCT2fromHfHub or GeneratorCT2fromHfHub here, depending on model.
 
 
36
  model = GeneratorCT2fromHfHub(
37
  # load in int8 on CUDA
38
- model_name_or_path=model_name,
39
  device="cuda",
40
  compute_type="int8_float16",
41
- # tokenizer=AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")
42
  )
43
  outputs = model.generate(
44
- text=["How do you call a fast Flan-ingo?", "User: How are you doing? Bot:"],
45
- max_length=64,
46
  include_prompt_in_result=False
47
  )
48
  print(outputs)
49
  ```
50
 
51
  # Licence and other remarks:
52
  This is just a quantized version. Licence conditions are intended to be idential to original huggingface repo.
53
 
@@ -57,7 +59,7 @@ This is just a quantized version. Licence conditions are intended to be idential
57
  # MPT-7B-Instruct
58
 
59
  MPT-7B-Instruct is a model for short-form instruction following.
60
- It is built by finetuning [MPT-7B](https://huggingface.co/spaces/mosaicml/mpt-7b) on a [dataset](https://huggingface.co/datasets/sam-mosaic/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
61
  * License: _CC-By-SA-3.0_
62
  * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)
63
 
@@ -100,37 +102,41 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
100
  trust_remote_code=True
101
  )
102
  ```
103
- Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
104
  This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
105
  `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
106
 
107
- To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model with `attn_impl='triton'` and move the model to `bfloat16`:
108
  ```python
109
- config = transformers.AutoConfig.from_pretrained(
110
- 'mosaicml/mpt-7b-instruct',
111
- trust_remote_code=True
112
- )
 
 
113
  config.attn_config['attn_impl'] = 'triton'
 
114
 
115
  model = transformers.AutoModelForCausalLM.from_pretrained(
116
- 'mosaicml/mpt-7b-instruct',
117
  config=config,
118
- torch_dtype=torch.bfloat16,
119
  trust_remote_code=True
120
  )
121
- model.to(device='cuda:0')
122
  ```
123
 
124
  Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
125
 
126
  ```python
127
- config = transformers.AutoConfig.from_pretrained(
128
- 'mosaicml/mpt-7b-instruct',
129
- trust_remote_code=True
130
- )
131
- config.update({"max_seq_len": 4096})
 
 
132
  model = transformers.AutoModelForCausalLM.from_pretrained(
133
- 'mosaicml/mpt-7b-instruct',
134
  config=config,
135
  trust_remote_code=True
136
  )
@@ -143,6 +149,22 @@ from transformers import AutoTokenizer
143
  tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
144
  ```
145
 
146
  ### Formatting
147
 
148
  This model was trained on data formatted in the dolly-15k format:
@@ -193,6 +215,11 @@ For more details on the pretraining process, see [MPT-7B](https://huggingface.co
193
 
194
  The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
195
 
196
  ## Limitations and Biases
197
 
198
  _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
@@ -227,4 +254,4 @@ Please cite this model using the following format:
227
  note = {Accessed: 2023-03-28}, % change this date
228
  urldate = {2023-03-28} % change this date
229
  }
230
- ```
 
16
 
17
  quantized version of [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
18
  ```bash
19
+ pip install hf-hub-ctranslate2>=2.12.0 ctranslate2>=3.16.0
 
 
 
 
20
  ```
21
 
 
 
 
 
22
  ```python
23
+ # from transformers import AutoTokenizer
 
 
24
  model_name = "michaelfeil/ct2fast-mpt-7b-instruct"
25
+
26
+
27
+ from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
28
  model = GeneratorCT2fromHfHub(
29
  # load in int8 on CUDA
30
+ model_name_or_path=model_name,
31
  device="cuda",
32
  compute_type="int8_float16",
33
+ # tokenizer=AutoTokenizer.from_pretrained("{ORG}/{NAME}")
34
  )
35
  outputs = model.generate(
36
+ text=["def fibonnaci(", "User: How are you doing? Bot:"],
37
+ max_length=64,
38
  include_prompt_in_result=False
39
  )
40
  print(outputs)
41
  ```
42
 
43
+ Checkpoint compatible to [ctranslate2>=3.16.0](https://github.com/OpenNMT/CTranslate2)
44
+ and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
45
+ - `compute_type=int8_float16` for `device="cuda"`
46
+ - `compute_type=int8` for `device="cpu"`
47
+
48
+ Converted on 2023-06-27 using
49
+ ```
50
+ ct2-transformers-converter --model mosaicml/mpt-7b-instruct --output_dir ~/tmp-ct2fast-mpt-7b-instruct --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json requirements.txt .gitattributes --quantization int8_float16 --trust_remote_code
51
+ ```
52
+
53
  # Licence and other remarks:
54
  This is just a quantized version. Licence conditions are intended to be idential to original huggingface repo.
55
 
 
59
  # MPT-7B-Instruct
60
 
61
  MPT-7B-Instruct is a model for short-form instruction following.
62
+ It is built by finetuning [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) on a [dataset](https://huggingface.co/datasets/sam-mosaic/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
63
  * License: _CC-By-SA-3.0_
64
  * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)
65
 
 
102
  trust_remote_code=True
103
  )
104
  ```
105
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
106
  This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
107
  `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
108
 
109
+ To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
110
  ```python
111
+ import torch
112
+ import transformers
113
+
114
+ name = 'mosaicml/mpt-7b-instruct'
115
+
116
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
117
  config.attn_config['attn_impl'] = 'triton'
118
+ config.init_device = 'cuda:0' # For fast initialization directly on GPU!
119
 
120
  model = transformers.AutoModelForCausalLM.from_pretrained(
121
+ name,
122
  config=config,
123
+ torch_dtype=torch.bfloat16, # Load model weights in bfloat16
124
  trust_remote_code=True
125
  )
 
126
  ```
127
 
128
  Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
129
 
130
  ```python
131
+ import transformers
132
+
133
+ name = 'mosaicml/mpt-7b-instruct'
134
+
135
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
136
+ config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
137
+
138
  model = transformers.AutoModelForCausalLM.from_pretrained(
139
+ name,
140
  config=config,
141
  trust_remote_code=True
142
  )
 
149
  tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
150
  ```
151
 
152
+ The model can then be used, for example, within a text-generation pipeline.
153
+ Note: when running Torch modules in lower precision, it is best practice to use the [torch.autocast context manager](https://pytorch.org/docs/stable/amp.html).
154
+
155
+ ```python
156
+ from transformers import pipeline
157
+
158
+ pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
159
+
160
+ with torch.autocast('cuda', dtype=torch.bfloat16):
161
+ print(
162
+ pipe('Here is a recipe for vegan banana bread:\n',
163
+ max_new_tokens=100,
164
+ do_sample=True,
165
+ use_cache=True))
166
+ ```
167
+
168
  ### Formatting
169
 
170
  This model was trained on data formatted in the dolly-15k format:
 
215
 
216
  The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
217
 
218
+ ### Training Configuration
219
+
220
+ This model was trained on 8 A100-40GBs for about 2.3 hours using the [MosaicML Platform](https://www.mosaicml.com/platform).
221
+ The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the AdamW optimizer.
222
+
223
  ## Limitations and Biases
224
 
225
  _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
 
254
  note = {Accessed: 2023-03-28}, % change this date
255
  urldate = {2023-03-28} % change this date
256
  }
257
+ ```
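The install line in the README hunk above, `pip install hf-hub-ctranslate2>=2.12.0 ctranslate2>=3.16.0`, leaves the `>=` specifiers unquoted. In a POSIX shell an unquoted `>` is parsed as an output redirection, so the version constraint may never reach pip. A minimal sketch of how a safely quoted command could be produced with the standard-library `shlex` module (the variable names are illustrative and not part of the repository):

```python
import shlex

# Requirement specifiers exactly as written in the README's install line.
requirements = ["hf-hub-ctranslate2>=2.12.0", "ctranslate2>=3.16.0"]

# shlex.join quotes every argument that contains shell metacharacters such as '>'.
install_command = shlex.join(["pip", "install", *requirements])
print(install_command)
```

Pasting the printed command into a shell should pass both specifiers to pip intact.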
config.json CHANGED
@@ -1,5 +1,56 @@
1
  {
2
- "bos_token": "<|endoftext|>",
3
- "eos_token": "<|endoftext|>",
4
- "unk_token": "<|endoftext|>"
5
- }
 
1
  {
2
+ "architectures": [
3
+ "MPTForCausalLM"
4
+ ],
5
+ "attn_config": {
6
+ "alibi": true,
7
+ "alibi_bias_max": 8,
8
+ "attn_impl": "torch",
9
+ "attn_pdrop": 0,
10
+ "attn_type": "multihead_attention",
11
+ "attn_uses_sequence_id": false,
12
+ "clip_qkv": null,
13
+ "prefix_lm": false,
14
+ "qk_ln": false,
15
+ "softmax_scale": null
16
+ },
17
+ "auto_map": {
18
+ "AutoConfig": "configuration_mpt.MPTConfig",
19
+ "AutoModelForCausalLM": "modeling_mpt.MPTForCausalLM"
20
+ },
21
+ "d_model": 4096,
22
+ "emb_pdrop": 0,
23
+ "embedding_fraction": 1.0,
24
+ "expansion_ratio": 4,
25
+ "init_config": {
26
+ "emb_init_std": null,
27
+ "emb_init_uniform_lim": null,
28
+ "fan_mode": "fan_in",
29
+ "init_div_is_residual": true,
30
+ "init_gain": 0,
31
+ "init_nonlinearity": "relu",
32
+ "init_std": 0.02,
33
+ "name": "kaiming_normal_",
34
+ "verbose": 0
35
+ },
36
+ "init_device": "cpu",
37
+ "learned_pos_emb": true,
38
+ "logit_scale": null,
39
+ "max_seq_len": 2048,
40
+ "model_type": "mpt",
41
+ "n_heads": 32,
42
+ "n_layers": 32,
43
+ "no_bias": true,
44
+ "norm_type": "low_precision_layernorm",
45
+ "resid_pdrop": 0,
46
+ "tokenizer_name": "EleutherAI/gpt-neox-20b",
47
+ "torch_dtype": "bfloat16",
48
+ "transformers_version": "4.28.1",
49
+ "use_cache": false,
50
+ "verbose": 0,
51
+ "vocab_size": 50432,
52
+ "bos_token": "<|endoftext|>",
53
+ "eos_token": "<|endoftext|>",
54
+ "layer_norm_epsilon": null,
55
+ "unk_token": "<|endoftext|>"
56
+ }
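The values added to `config.json` are enough for a back-of-the-envelope parameter count. The sketch below assumes a decoder layout that the diff itself does not spell out: a fused QKV projection plus an output projection, an MLP with two matrices scaled by `expansion_ratio`, two weight-only norms per block plus one final norm, an input embedding tied to the output head, and no bias terms (suggested by `"no_bias": true`). Treat the result as an estimate under those assumptions, not as a count read from the checkpoint.

```python
# Values copied from the config.json added in this commit.
config = {
    "d_model": 4096,
    "n_heads": 32,
    "n_layers": 32,
    "expansion_ratio": 4,
    "vocab_size": 50432,
}


def estimate_parameters(cfg):
    """Rough parameter count; the layer layout is an assumption, not read from the model."""
    d = cfg["d_model"]
    attention = 4 * d * d                      # fused QKV (3*d*d) + output projection (d*d)
    mlp = 2 * cfg["expansion_ratio"] * d * d   # up- and down-projection
    norms = 2 * d                              # two weight-only norms per block
    per_block = attention + mlp + norms
    embedding = cfg["vocab_size"] * d          # assumed tied with the output head
    final_norm = d
    return cfg["n_layers"] * per_block + embedding + final_norm


head_dim = config["d_model"] // config["n_heads"]
total = estimate_parameters(config)
print(f"head_dim={head_dim}, parameters={total:,} (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 6.65 billion parameters, in line with the "7B" in the model name.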
model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:411576f8c03aa73bc7faa2d241ef5090e16abc73583c610f592dd36798c4b198
3
- size 13298599938
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1adb227bbf42f844b27c853a902aa384a770b246c764ce45b4ac836f9cdc9884
3
+ size 6654505904
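The two Git LFS pointers above record only a hash and a byte size, so the effect of this commit on `model.bin` can be read off the `size` fields. A small sketch that parses both pointers (text copied from the diff) and compares them; the helper name `parse_lfs_pointer` is made up for illustration:

```python
def parse_lfs_pointer(text):
    """Split the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    return fields


old_pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:411576f8c03aa73bc7faa2d241ef5090e16abc73583c610f592dd36798c4b198
size 13298599938
""")
new_pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:1adb227bbf42f844b27c853a902aa384a770b246c764ce45b4ac836f9cdc9884
size 6654505904
""")

old_size = int(old_pointer["size"])
new_size = int(new_pointer["size"])
ratio = new_size / old_size
print(f"old: {old_size / 1e9:.2f} GB, new: {new_size / 1e9:.2f} GB, ratio: {ratio:.3f}")
```

The new file is about half the size of the old one. That is consistent with the converter command in the README changing from `--quantization float16` to `--quantization int8_float16`, that is, from 16-bit to 8-bit stored weights.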
requirements.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ einops==0.5.0
2
+ triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir_sm90#subdirectory=python
vocabulary.json ADDED
The diff for this file is too large to render.