michaelfeil
committed on
Commit
·
a35a608
1
Parent(s):
8fe89bd
Upload mosaicml/mpt-7b-instruct ctranslate fp16 weights
- README.md +61 -34
- config.json +55 -4
- model.bin +2 -2
- requirements.txt +2 -0
- vocabulary.json +0 -0
README.md
CHANGED
@@ -16,38 +16,40 @@ Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on
|
|
16 |
|
17 |
quantized version of [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
|
18 |
```bash
|
19 |
-
pip install hf-hub-ctranslate2>=2.0
|
20 |
-
```
|
21 |
-
Converted on 2023-05-31 using
|
22 |
-
```
|
23 |
-
ct2-transformers-converter --model mosaicml/mpt-7b-instruct --output_dir /home/michael/tmp-ct2fast-mpt-7b-instruct --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json .gitattributes --quantization float16 --trust_remote_code
|
24 |
```
|
25 |
|
26 |
-
Checkpoint compatible to [ctranslate2>=3.14.0](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.0.8](https://github.com/michaelfeil/hf-hub-ctranslate2)
|
27 |
-
- `compute_type=int8_float16` for `device="cuda"`
|
28 |
-
- `compute_type=int8` for `device="cpu"`
|
29 |
-
|
30 |
```python
|
31 |
-
from
|
32 |
-
from transformers import AutoTokenizer
|
33 |
-
|
34 |
model_name = "michaelfeil/ct2fast-mpt-7b-instruct"
|
35 |
-
|
|
|
|
|
36 |
model = GeneratorCT2fromHfHub(
|
37 |
# load in int8 on CUDA
|
38 |
-
model_name_or_path=model_name,
|
39 |
device="cuda",
|
40 |
compute_type="int8_float16",
|
41 |
-
# tokenizer=AutoTokenizer.from_pretrained("
|
42 |
)
|
43 |
outputs = model.generate(
|
44 |
-
text=["
|
45 |
-
max_length=64,
|
46 |
include_prompt_in_result=False
|
47 |
)
|
48 |
print(outputs)
|
49 |
```
|
50 |
|
51 |
# Licence and other remarks:
|
52 |
This is just a quantized version. Licence conditions are intended to be identical to the original Hugging Face repo.
|
53 |
|
@@ -57,7 +59,7 @@ This is just a quantized version. Licence conditions are intended to be identical
|
|
57 |
# MPT-7B-Instruct
|
58 |
|
59 |
MPT-7B-Instruct is a model for short-form instruction following.
|
60 |
-
It is built by finetuning [MPT-7B](https://huggingface.co/
|
61 |
* License: _CC-By-SA-3.0_
|
62 |
* [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)
|
63 |
|
@@ -100,37 +102,41 @@ model = transformers.AutoModelForCausalLM.from_pretrained(
|
|
100 |
trust_remote_code=True
|
101 |
)
|
102 |
```
|
103 |
-
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
|
104 |
This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
|
105 |
`MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
|
106 |
|
107 |
-
To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model with `attn_impl='triton'` and
|
108 |
```python
|
109 |
-
|
110 |
-
|
111 |
-
|
112 |
-
|
|
|
|
|
113 |
config.attn_config['attn_impl'] = 'triton'
|
|
|
114 |
|
115 |
model = transformers.AutoModelForCausalLM.from_pretrained(
|
116 |
-
|
117 |
config=config,
|
118 |
-
torch_dtype=torch.bfloat16,
|
119 |
trust_remote_code=True
|
120 |
)
|
121 |
-
model.to(device='cuda:0')
|
122 |
```
|
123 |
|
124 |
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
|
125 |
|
126 |
```python
|
127 |
-
|
128 |
-
|
129 |
-
|
130 |
-
|
131 |
-
config.
|
|
|
|
|
132 |
model = transformers.AutoModelForCausalLM.from_pretrained(
|
133 |
-
|
134 |
config=config,
|
135 |
trust_remote_code=True
|
136 |
)
|
@@ -143,6 +149,22 @@ from transformers import AutoTokenizer
|
|
143 |
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
|
144 |
```
|
145 |
|
146 |
### Formatting
|
147 |
|
148 |
This model was trained on data formatted in the dolly-15k format:
|
@@ -193,6 +215,11 @@ For more details on the pretraining process, see [MPT-7B](https://huggingface.co
|
|
193 |
|
194 |
The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
|
195 |
|
|
|
|
|
|
|
|
|
|
|
196 |
## Limitations and Biases
|
197 |
|
198 |
_The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
|
@@ -227,4 +254,4 @@ Please cite this model using the following format:
|
|
227 |
note = {Accessed: 2023-03-28}, % change this date
|
228 |
urldate = {2023-03-28} % change this date
|
229 |
}
|
230 |
-
```
|
|
|
16 |
|
17 |
quantized version of [mosaicml/mpt-7b-instruct](https://huggingface.co/mosaicml/mpt-7b-instruct)
|
18 |
```bash
|
19 |
+
pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.16.0"
|
|
|
|
|
|
|
|
|
20 |
```
|
21 |
|
|
|
|
|
|
|
|
|
22 |
```python
|
23 |
+
# from transformers import AutoTokenizer
|
|
|
|
|
24 |
model_name = "michaelfeil/ct2fast-mpt-7b-instruct"
|
25 |
+
|
26 |
+
|
27 |
+
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
|
28 |
model = GeneratorCT2fromHfHub(
|
29 |
# load in int8 on CUDA
|
30 |
+
model_name_or_path=model_name,
|
31 |
device="cuda",
|
32 |
compute_type="int8_float16",
|
33 |
+
# tokenizer=AutoTokenizer.from_pretrained("{ORG}/{NAME}")
|
34 |
)
|
35 |
outputs = model.generate(
|
36 |
+
text=["def fibonacci(", "User: How are you doing? Bot:"],
|
37 |
+
max_length=64,
|
38 |
include_prompt_in_result=False
|
39 |
)
|
40 |
print(outputs)
|
41 |
```
|
42 |
|
43 |
+
Checkpoint compatible with [ctranslate2>=3.16.0](https://github.com/OpenNMT/CTranslate2)
|
44 |
+
and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
|
45 |
+
- `compute_type=int8_float16` for `device="cuda"`
|
46 |
+
- `compute_type=int8` for `device="cpu"`
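The two recommendations above can be captured in a small helper. This is only a sketch: `pick_compute_type` is a hypothetical name that is not part of `hf-hub-ctranslate2` or `ctranslate2`, and it covers only the two device/compute-type pairs listed in this README.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute_type this README recommends for a device string.

    Hypothetical helper for illustration; not an API of any library.
    """
    if device.startswith("cuda"):
        # int8 weights with float16 activations on GPU
        return "int8_float16"
    if device == "cpu":
        # plain int8 on CPU
        return "int8"
    raise ValueError(f"no recommendation for device {device!r}")


print(pick_compute_type("cuda"))
print(pick_compute_type("cpu"))
```

The returned string could then be passed as the `compute_type` argument shown in the loading example.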
|
47 |
+
|
48 |
+
Converted on 2023-06-27 using
|
49 |
+
```
|
50 |
+
ct2-transformers-converter --model mosaicml/mpt-7b-instruct --output_dir ~/tmp-ct2fast-mpt-7b-instruct --force --copy_files tokenizer.json README.md tokenizer_config.json generation_config.json special_tokens_map.json requirements.txt .gitattributes --quantization int8_float16 --trust_remote_code
|
51 |
+
```
|
52 |
+
|
53 |
# Licence and other remarks:
|
54 |
This is just a quantized version. Licence conditions are intended to be identical to the original Hugging Face repo.
|
55 |
|
|
|
59 |
# MPT-7B-Instruct
|
60 |
|
61 |
MPT-7B-Instruct is a model for short-form instruction following.
|
62 |
+
It is built by finetuning [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) on a [dataset](https://huggingface.co/datasets/sam-mosaic/dolly_hhrlhf) derived from the [Databricks Dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.
|
63 |
* License: _CC-By-SA-3.0_
|
64 |
* [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-7b-instruct)
|
65 |
|
|
|
102 |
trust_remote_code=True
|
103 |
)
|
104 |
```
|
105 |
+
Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
|
106 |
This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
|
107 |
`MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
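To give a feel for how ALiBi replaces positional embeddings with a distance penalty, here is a minimal pure-Python sketch. It assumes the common formulation for power-of-two head counts, where head `i` gets slope `2^(-alibi_bias_max * i / n_heads)`. The function names are illustrative and the actual MPT code may differ in details. The values `n_heads=32` and `alibi_bias_max=8` are taken from this commit's `config.json`.

```python
from typing import List


def alibi_slopes(n_heads: int, alibi_bias_max: int = 8) -> List[float]:
    """Per-head slopes 2^(-alibi_bias_max * i / n_heads) for i = 1..n_heads."""
    return [2.0 ** (-alibi_bias_max * i / n_heads) for i in range(1, n_heads + 1)]


def alibi_bias_row(slope: float, seq_len: int) -> List[float]:
    """Bias added to the attention scores of the newest query position.

    The newest key gets 0; each step further into the past subtracts `slope`.
    """
    return [-slope * (seq_len - 1 - j) for j in range(seq_len)]


slopes = alibi_slopes(n_heads=32, alibi_bias_max=8)
print(len(slopes), slopes[0], slopes[-1])
print(alibi_bias_row(slopes[-1], seq_len=4))
```

Because the bias depends only on the distance between query and key, the same rule applies at any `seq_len`, which is why the maximum sequence length can be raised without adding new learned parameters.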
|
108 |
|
109 |
+
To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
|
110 |
```python
|
111 |
+
import torch
|
112 |
+
import transformers
|
113 |
+
|
114 |
+
name = 'mosaicml/mpt-7b-instruct'
|
115 |
+
|
116 |
+
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
|
117 |
config.attn_config['attn_impl'] = 'triton'
|
118 |
+
config.init_device = 'cuda:0' # For fast initialization directly on GPU!
|
119 |
|
120 |
model = transformers.AutoModelForCausalLM.from_pretrained(
|
121 |
+
name,
|
122 |
config=config,
|
123 |
+
torch_dtype=torch.bfloat16, # Load model weights in bfloat16
|
124 |
trust_remote_code=True
|
125 |
)
|
|
|
126 |
```
|
127 |
|
128 |
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. For example:
|
129 |
|
130 |
```python
|
131 |
+
import transformers
|
132 |
+
|
133 |
+
name = 'mosaicml/mpt-7b-instruct'
|
134 |
+
|
135 |
+
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
|
136 |
+
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
|
137 |
+
|
138 |
model = transformers.AutoModelForCausalLM.from_pretrained(
|
139 |
+
name,
|
140 |
config=config,
|
141 |
trust_remote_code=True
|
142 |
)
|
|
|
149 |
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
|
150 |
```
|
151 |
|
152 |
+
The model can then be used, for example, within a text-generation pipeline.
|
153 |
+
Note: when running Torch modules in lower precision, it is best practice to use the [torch.autocast context manager](https://pytorch.org/docs/stable/amp.html).
|
154 |
+
|
155 |
+
```python
|
156 |
+
from transformers import pipeline
|
157 |
+
|
158 |
+
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
|
159 |
+
|
160 |
+
with torch.autocast('cuda', dtype=torch.bfloat16):
|
161 |
+
print(
|
162 |
+
pipe('Here is a recipe for vegan banana bread:\n',
|
163 |
+
max_new_tokens=100,
|
164 |
+
do_sample=True,
|
165 |
+
use_cache=True))
|
166 |
+
```
|
167 |
+
|
168 |
### Formatting
|
169 |
|
170 |
This model was trained on data formatted in the dolly-15k format:
|
|
|
215 |
|
216 |
The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
|
217 |
|
218 |
+
### Training Configuration
|
219 |
+
|
220 |
+
This model was trained on 8 A100-40GBs for about 2.3 hours using the [MosaicML Platform](https://www.mosaicml.com/platform).
|
221 |
+
The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the AdamW optimizer.
|
222 |
+
|
223 |
## Limitations and Biases
|
224 |
|
225 |
_The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
|
|
|
254 |
note = {Accessed: 2023-03-28}, % change this date
|
255 |
urldate = {2023-03-28} % change this date
|
256 |
}
|
257 |
+
```
|
config.json
CHANGED
@@ -1,5 +1,56 @@
|
|
1 |
{
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
1 |
{
|
2 |
+
"architectures": [
|
3 |
+
"MPTForCausalLM"
|
4 |
+
],
|
5 |
+
"attn_config": {
|
6 |
+
"alibi": true,
|
7 |
+
"alibi_bias_max": 8,
|
8 |
+
"attn_impl": "torch",
|
9 |
+
"attn_pdrop": 0,
|
10 |
+
"attn_type": "multihead_attention",
|
11 |
+
"attn_uses_sequence_id": false,
|
12 |
+
"clip_qkv": null,
|
13 |
+
"prefix_lm": false,
|
14 |
+
"qk_ln": false,
|
15 |
+
"softmax_scale": null
|
16 |
+
},
|
17 |
+
"auto_map": {
|
18 |
+
"AutoConfig": "configuration_mpt.MPTConfig",
|
19 |
+
"AutoModelForCausalLM": "modeling_mpt.MPTForCausalLM"
|
20 |
+
},
|
21 |
+
"d_model": 4096,
|
22 |
+
"emb_pdrop": 0,
|
23 |
+
"embedding_fraction": 1.0,
|
24 |
+
"expansion_ratio": 4,
|
25 |
+
"init_config": {
|
26 |
+
"emb_init_std": null,
|
27 |
+
"emb_init_uniform_lim": null,
|
28 |
+
"fan_mode": "fan_in",
|
29 |
+
"init_div_is_residual": true,
|
30 |
+
"init_gain": 0,
|
31 |
+
"init_nonlinearity": "relu",
|
32 |
+
"init_std": 0.02,
|
33 |
+
"name": "kaiming_normal_",
|
34 |
+
"verbose": 0
|
35 |
+
},
|
36 |
+
"init_device": "cpu",
|
37 |
+
"learned_pos_emb": true,
|
38 |
+
"logit_scale": null,
|
39 |
+
"max_seq_len": 2048,
|
40 |
+
"model_type": "mpt",
|
41 |
+
"n_heads": 32,
|
42 |
+
"n_layers": 32,
|
43 |
+
"no_bias": true,
|
44 |
+
"norm_type": "low_precision_layernorm",
|
45 |
+
"resid_pdrop": 0,
|
46 |
+
"tokenizer_name": "EleutherAI/gpt-neox-20b",
|
47 |
+
"torch_dtype": "bfloat16",
|
48 |
+
"transformers_version": "4.28.1",
|
49 |
+
"use_cache": false,
|
50 |
+
"verbose": 0,
|
51 |
+
"vocab_size": 50432,
|
52 |
+
"bos_token": "<|endoftext|>",
|
53 |
+
"eos_token": "<|endoftext|>",
|
54 |
+
"layer_norm_epsilon": null,
|
55 |
+
"unk_token": "<|endoftext|>"
|
56 |
+
}
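As a rough sanity check on the configuration above, the parameter count can be estimated from a handful of its fields. This is a back-of-the-envelope sketch under stated assumptions that the diff does not confirm: a fused QKV projection plus an output projection, two MLP matrices, no bias terms (`no_bias: true`), weight-only norms, no learned position table because `alibi` is enabled, and tied input/output embeddings.

```python
# Values copied from the config.json shown in this commit.
d_model = 4096
n_layers = 32
vocab_size = 50432
expansion_ratio = 4

# Assumed layer layout (see the lead-in): these formulas are an estimate,
# not something read from the model code.
attn_params = 4 * d_model * d_model                    # QKV (3x) + output projection (1x)
mlp_params = 2 * expansion_ratio * d_model * d_model   # up and down projections
norm_params = 2 * d_model                              # two weight-only norms per block
per_layer = attn_params + mlp_params + norm_params

# Embedding table + transformer blocks + final norm.
total_params = vocab_size * d_model + n_layers * per_layer + d_model

# Size recorded in the model.bin Git LFS pointer of this commit.
model_bin_bytes = 6654505904

print(f"estimated parameters: {total_params:,}")
print(f"bytes per parameter:  {model_bin_bytes / total_params:.3f}")
```

Under these assumptions the estimate comes out at about 6.65 billion parameters, and the recorded `model.bin` size works out to roughly one byte per parameter, which would be consistent with the `int8_float16` quantization used in the converter command.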
|
model.bin
CHANGED
@@ -1,3 +1,3 @@
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
-
oid sha256:
|
3 |
-
size
|
|
|
1 |
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:1adb227bbf42f844b27c853a902aa384a770b246c764ce45b4ac836f9cdc9884
|
3 |
+
size 6654505904
|
requirements.txt
ADDED
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
1 |
+
einops==0.5.0
|
2 |
+
triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir_sm90#subdirectory=python
|
vocabulary.json
ADDED
The diff for this file is too large to render.
|
|