---
license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
- name: Octopus-V2-2B
results: []
tags:
- function calling
- on-device language model
- android
inference: false
space: false
spaces: false
language:
- en
---
# Quantized Octopus V2: On-device language model for super agent
This repo includes two types of quantized models, **GGUF** and **AWQ**, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2).
<p align="center" width="100%">
<a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
</p>
# GGUF Quantization
Run with [Ollama](https://github.com/ollama/ollama)
```bash
ollama run NexaAIDev/octopus-v2-Q4_K_M
```
Input example:
```text
"Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera \n\nResponse:"
```
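The same prompt can also be sent programmatically to a locally running Ollama server. The sketch below is not part of the original card; it assumes the model has already been pulled with the `ollama run` command above and that Ollama is serving on its default port 11434.
```python
# Minimal sketch (assumption: Ollama is running locally with the model pulled as shown above).
import requests

prompt = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: Take a selfie for me with front camera \n\nResponse:"
)

resp = requests.post(
    "http://localhost:11434/api/generate",       # Ollama's local REST endpoint
    json={
        "model": "NexaAIDev/octopus-v2-Q4_K_M",  # model name used with `ollama run`
        "prompt": prompt,
        "stream": False,                          # return the whole completion at once
        "options": {"temperature": 0},            # deterministic output for function calling
    },
    timeout=300,
)
print(resp.json()["response"])
```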
Output function example:
```python
def get_trending_news(category=None, region='US', language='en', max_results=5):
"""
Fetches trending news articles based on category, region, and language.
Parameters:
- category (str, optional): News category to filter by, by default use None for all categories. Optional to provide.
- region (str, optional): ISO 3166-1 alpha-2 country code for region-specific news, by default, uses 'US'. Optional to provide.
- language (str, optional): ISO 639-1 language code for article language, by default uses 'en'. Optional to provide.
- max_results (int, optional): Maximum number of articles to return, by default, uses 5. Optional to provide.
Returns:
- list[str]: A list of strings, each representing an article. Each string contains the article's heading and URL.
"""
```
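Alternatively, the GGUF files can be loaded directly in Python. The snippet below is a minimal sketch using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) rather than Ollama; the local file name is an assumption, so point `model_path` at whichever quantization you downloaded.
```python
# Minimal sketch (assumption: Octopus-v2-Q4_K_M.gguf was downloaded to the working directory).
from llama_cpp import Llama

llm = Llama(model_path="./Octopus-v2-Q4_K_M.gguf", n_ctx=2048)

prompt = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: Take a selfie for me with front camera \n\nResponse:"
)

out = llm(prompt, max_tokens=256, temperature=0.0)  # greedy decoding for stable function calls
print(out["choices"][0]["text"])
```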
# AWQ Quantization
Input Python example:
```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np


def inference(input_text):
    # Tokenize the prompt and move it to the GPU
    tokens = tokenizer(
        input_text,
        return_tensors='pt'
    ).input_ids.cuda()
    input_length = tokens.shape[1]  # prompt length, used to strip the prompt from the output

    start_time = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=False,  # greedy decoding for deterministic function calls
        max_new_tokens=512
    )
    end_time = time.time()

    # Keep only the newly generated tokens and decode them
    generated_sequence = generation_output[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])
    latency = end_time - start_time
    num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency
    return {"output": res, "latency": latency, "throughput": throughput}


model_id = "NexaAIDev/Octopus-v2-gguf-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])

print("avg throughput:", np.mean(avg_throughput))
```
# Quantized GGUF & AWQ Models Benchmark
| Name | Quant method | Bits | Size | Response (t/s) | Use Cases |
| ---------------------- | ------------ | ---- | -------- | -------------- | ----------------------------------- |
| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf   | Q3_K         | 3    | 1.38 GB  | 57.81          | strongly not recommended            |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S       | 3    | 1.19 GB  | 52.13          | strongly not recommended            |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M       | 3    | 1.38 GB  | 58.67          | moderate loss, not recommended      |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L       | 3    | 1.47 GB  | 56.92          | not recommended                     |
| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S       | 4    | 1.56 GB  | 62.16          | fast and accurate, highly recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
| Octopus-v2-Q6_K.gguf   | Q6_K         | 6    | 2.06 GB  | 55.94          | very big, not recommended           |
| Octopus-v2-Q8_0.gguf   | Q8_0         | 8    | 2.67 GB  | 56.35          | very big, not recommended           |
| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
| Octopus-v2.gguf | | | 10.00 GB | | |
_Quantized with llama.cpp_
**Acknowledgement**:
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.