File size: 6,056 Bytes
ae9f583 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 |
---
license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
- name: Octopus-V2-2B
results: []
tags:
- function calling
- on-device language model
- android
inference: false
space: false
spaces: false
language:
- en
---
# Quantized Octopus V2: On-device language model for super agent
This repo includes two types of quantized models: **GGUF** and **AWQ**, for ourOctopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2)
# GGUF Qauntization
## Run with [Ollama](https://github.com/ollama/ollama)
```bash
ollama run NexaAIDev/octopus-v2-Q4_K_M
```
Input example:
```json
def get_trending_news(category=None, region='US', language='en', max_results=5):
"""
Fetches trending news articles based on category, region, and language.
Parameters:
- category (str, optional): News category to filter by, by default use None for all categories. Optional to provide.
- region (str, optional): ISO 3166-1 alpha-2 country code for region-specific news, by default, uses 'US'. Optional to provide.
- language (str, optional): ISO 639-1 language code for article language, by default uses 'en'. Optional to provide.
- max_results (int, optional): Maximum number of articles to return, by default, uses 5. Optional to provide.
Returns:
- list[str]: A list of strings, each representing an article. Each string contains the article's heading and URL.
"""
```
# AWQ Quantization
Input Python example:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AutoTokenizer, GemmaForCausalLM
import torch
import time
import numpy as np
def inference(input_text):
tokens = tokenizer(
input_text,
return_tensors='pt'
).input_ids.cuda()
start_time = time.time()
generation_output = model.generate(
tokens,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=40,
max_new_tokens=512
)
end_time = time.time()
res = tokenizer.decode(generation_output[0])
res = res.split(input_text)
latency = end_time - start_time
output_tokens = tokenizer.encode(res)
num_output_tokens = len(output_tokens)
throughput = num_output_tokens / latency
return {"output": res[-1], "latency": latency, "throughput": throughput}
model_id = "path/to/Octopus-v2-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]
avg_throughput = []
for prompt in prompts:
out = inference(prompt)
avg_throughput.append(out["throughput"])
print("nexa model result:\n", out["output"])
print("avg throughput:", np.mean(avg_throughput))
```
## Quantized GGUF & AWQ Models
| Name | Quant method | Bits | Size | Response (t/s) | Use Cases |
| ---------------------- | ------------ | ---- | -------- | -------------- | ----------------------------------- |
| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf | Q3_K | 3 | 1.38 GB | 57.81 | extremely not recommended |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S | 3 | 1.19 GB | 52.13 | extremely not recommended |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M | 3 | 1.38 GB | 58.67 | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L | 3 | 1.47 GB | 56.92 | not very recommended |
| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S | 4 | 1.56 GB | 62.16 | fast and accurate, very recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
| Octopus-v2-Q6_K.gguf | Q6_K | 6 | 2.06 GB | 55.94 | very big, not very recommended |
| Octopus-v2-Q8_0.gguf | Q8_0 | 8 | 2.67 GB | 56.35 | very big, not very recommended |
| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
| Octopus-v2.gguf | | | 10.00 GB | | |
_Quantized with llama.cpp_
**Acknowledgement**:
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.
|