mwitiderrick committed a7d46ed (parent: bae0a3d) · Update README.md

README.md
This repo contains model files for [TinyLlama 1.1B Chat](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4).

This model was quantized and pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

## Inference
Install [DeepSparse LLM](https://github.com/neuralmagic/deepsparse) for fast inference on CPUs:
```bash
pip install deepsparse-nightly[llm]
```
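To confirm the install, you can optionally print the runtime version:
```bash
python -c "import deepsparse; print(deepsparse.__version__)"
```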
Run in a [Python pipeline](https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md):
```python
from deepsparse import TextGeneration
```
Prompts follow the chat format and end with the assistant turn marker:
```
<|im_start|>assistant\n
```
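A minimal end-to-end call might look like the sketch below; the `hf:` model stub and the generation settings are illustrative assumptions, not values from this card:
```python
from deepsparse import TextGeneration

# Assumed HF stub for this repo's deployment files (illustrative)
model = TextGeneration(model="hf:nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant")

# Chat-formatted prompt ending with the assistant turn marker shown above
formatted_prompt = "<|im_start|>user\nHow do I choose a university?<|im_end|>\n<|im_start|>assistant\n"

output = model(formatted_prompt, max_new_tokens=200)
print(output.generations[0].text)
```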
## Sparsification
For details on how this model was sparsified, see the `recipe.yaml` in this repo and follow the instructions below.
```bash
git clone https://github.com/neuralmagic/sparseml
wget https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py TinyLlama/TinyLlama-1.1B-Chat-v0.4 open_platypus --recipe recipe.yaml --save True
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
cp deployment/model.onnx deployment/model-orig.onnx
```
Then run this kv-cache injection script:
```python
import os

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"

# Load the exported graph; external weight data stays on disk
model = onnx.load(input_file, load_external_data=False)

# Rewrite the graph with KV-cache inputs/outputs for autoregressive decoding
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)

onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
```
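As a quick smoke test of the injected model, you can point DeepSparse at the exported directory; this sketch assumes `deployment/` also contains the tokenizer and config files produced by the export step:
```python
from deepsparse import TextGeneration

# Load the locally exported model with the freshly injected KV cache
verify = TextGeneration(model="deployment")
print(verify("<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n", max_new_tokens=16).generations[0].text)
```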
Follow our [One Shot With SparseML](https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq) guide for step-by-step instructions on performing one-shot quantization of your own large language models.
## Slack
For further support, and for discussions on these models and AI in general, join us at [Neural Magic's Slack server](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).