license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
  - name: Octopus-V2-2B
    results: []
tags:
  - function calling
  - on-device language model
  - android
inference: false
space: false
spaces: false
language:
  - en

Quantized Octopus V2: On-device language model for super agent

This repo includes two types of quantized models, GGUF and AWQ, for our Octopus V2 model at NexaAIDev/Octopus-v2.

GGUF Quantization

Run with Ollama

ollama run NexaAIDev/octopus-v2-Q4_K_M
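
Besides the interactive CLI, a running Ollama server can usually be queried over its local REST API. The following is a minimal sketch, not part of the original instructions: it assumes Ollama is listening on its default address (`http://localhost:11434`) and that the model tag above has already been pulled. The helper names `build_generate_request` and `query_ollama` are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; adjust if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Model tag taken from the `ollama run` command above.
MODEL_NAME = "NexaAIDev/octopus-v2-Q4_K_M"


def build_generate_request(prompt, model=MODEL_NAME):
    """Build the JSON payload for a single, non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def query_ollama(prompt, url=OLLAMA_URL, timeout=60):
    """POST the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


if __name__ == "__main__":
    # Requires a local Ollama server with the model available.
    print(query_ollama("Can you take a photo using the back camera?"))
```

Setting `"stream": False` asks the server for one JSON object instead of a stream of partial responses, which keeps the client code short.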

Input example:

def get_trending_news(category=None, region='US', language='en', max_results=5):
    """
    Fetches trending news articles based on category, region, and language.

    Parameters:
    - category (str, optional): News category to filter by, by default use None for all categories. Optional to provide.
    - region (str, optional): ISO 3166-1 alpha-2 country code for region-specific news, by default, uses 'US'. Optional to provide.
    - language (str, optional): ISO 639-1 language code for article language, by default uses 'en'. Optional to provide.
    - max_results (int, optional): Maximum number of articles to return, by default, uses 5. Optional to provide.

    Returns:
    - list[str]: A list of strings, each representing an article. Each string contains the article's heading and URL.
    """

AWQ Quantization

Input Python example:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch
import time
import numpy as np

def inference(input_text):

    tokens = tokenizer(
        input_text,
        return_tensors='pt'
    ).input_ids.cuda()

    start_time = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=512
    )
    end_time = time.time()

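    # Decode the full sequence and keep only the text generated after the prompt.
    res = tokenizer.decode(generation_output[0])
    output_text = res.split(input_text)[-1]
    latency = end_time - start_time
    # Encode the generated text (a string, not the split list) to count output tokens.
    output_tokens = tokenizer.encode(output_text, add_special_tokens=False)
    num_output_tokens = len(output_tokens)
    throughput = num_output_tokens / latency

    return {"output": output_text, "latency": latency, "throughput": throughput}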


model_id = "path/to/Octopus-v2-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])

print("avg throughput:", np.mean(avg_throughput))
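
The prompt in the example follows a fixed template: an instruction line, the user query, and a trailing `Response:` marker. As a small sketch of how that template could be reused for other queries, the helpers below rebuild the same string and strip the prompt from decoded output. `PROMPT_TEMPLATE`, `build_prompt`, and `extract_response` are hypothetical names introduced here, not part of the model's API.

```python
# Template copied from the example prompt above, with the query left as a placeholder.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query} \n\nResponse:"
)


def build_prompt(query):
    """Wrap a user query in the instruction template used in the example."""
    return PROMPT_TEMPLATE.format(query=query.strip())


def extract_response(decoded_text, prompt):
    """Return the text that follows the prompt in the decoded model output."""
    return decoded_text.split(prompt)[-1]


if __name__ == "__main__":
    prompt = build_prompt(
        "Can you take a photo using the back camera and save it to the default location?"
    )
    print(prompt)
```

Passing `build_prompt(...)` results to `inference` keeps every query in the same format as the example prompt.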

Quantized GGUF & AWQ Models

| Name | Quant method | Bits | Size | Response (t/s) | Use Cases |
| --- | --- | --- | --- | --- | --- |
| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf | Q3_K | 3 | 1.38 GB | 57.81 | extremely not recommended |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S | 3 | 1.19 GB | 52.13 | extremely not recommended |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M | 3 | 1.38 GB | 58.67 | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L | 3 | 1.47 GB | 56.92 | not very recommended |
| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S | 4 | 1.56 GB | 62.16 | fast and accurate, very recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
| Octopus-v2-Q6_K.gguf | Q6_K | 6 | 2.06 GB | 55.94 | very big, not very recommended |
| Octopus-v2-Q8_0.gguf | Q8_0 | 8 | 2.67 GB | 56.35 | very big, not very recommended |
| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
| Octopus-v2.gguf | | | 10.00 GB | | |
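
To illustrate how the size and speed columns can be traded off, here is a small sketch that picks the fastest listed model under a given size budget. The figures are copied from the table above; `MODELS` and `fastest_under` are hypothetical names for this example. The selection looks at speed only, so the quality notes in the "Use Cases" column should still be checked before settling on a file.

```python
# (name, size in GB, response speed in tokens/s), copied from the table above.
# Octopus-v2.gguf is omitted because the table lists no speed for it.
MODELS = [
    ("Octopus-v2-AWQ", 3.00, 63.83),
    ("Octopus-v2-Q2_K.gguf", 1.16, 57.81),
    ("Octopus-v2-Q3_K.gguf", 1.38, 57.81),
    ("Octopus-v2-Q3_K_S.gguf", 1.19, 52.13),
    ("Octopus-v2-Q3_K_M.gguf", 1.38, 58.67),
    ("Octopus-v2-Q3_K_L.gguf", 1.47, 56.92),
    ("Octopus-v2-Q4_0.gguf", 1.55, 68.80),
    ("Octopus-v2-Q4_1.gguf", 1.68, 68.09),
    ("Octopus-v2-Q4_K.gguf", 1.63, 64.70),
    ("Octopus-v2-Q4_K_S.gguf", 1.56, 62.16),
    ("Octopus-v2-Q4_K_M.gguf", 1.63, 64.74),
    ("Octopus-v2-Q5_0.gguf", 1.80, 64.80),
    ("Octopus-v2-Q5_1.gguf", 1.92, 63.42),
    ("Octopus-v2-Q5_K.gguf", 1.84, 61.28),
    ("Octopus-v2-Q5_K_S.gguf", 1.80, 62.16),
    ("Octopus-v2-Q5_K_M.gguf", 1.71, 61.54),
    ("Octopus-v2-Q6_K.gguf", 2.06, 55.94),
    ("Octopus-v2-Q8_0.gguf", 2.67, 56.35),
    ("Octopus-v2-f16.gguf", 5.02, 36.27),
]


def fastest_under(max_size_gb):
    """Return the (name, size, speed) entry with the highest speed that fits the budget."""
    candidates = [m for m in MODELS if m[1] <= max_size_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m[2])


if __name__ == "__main__":
    print(fastest_under(1.6))
```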

Quantized with llama.cpp

Acknowledgement:
We sincerely thank our community members Mingyuan, Zoey, Brian, Perry, Qi, and David for their extraordinary contributions to this quantization effort.