metadata

license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
  - name: Octopus-V2-2B
    results: []
tags:
  - function calling
  - on-device language model
  - android
inference: false
space: false
spaces: false
language:
  - en

Quantized Octopus V2: On-device language model for super agent

This repo includes two types of quantized models, GGUF and AWQ, for our Octopus V2 model at NexaAIDev/Octopus-v2.

GGUF Quantization

Run with Ollama

ollama run NexaAIDev/octopus-v2-Q4_K_M

Input example:

def get_trending_news(category=None, region='US', language='en', max_results=5):
    """
    Fetches trending news articles based on category, region, and language.

    Parameters:
    - category (str, optional): News category to filter by, by default use None for all categories. Optional to provide.
    - region (str, optional): ISO 3166-1 alpha-2 country code for region-specific news, by default, uses 'US'. Optional to provide.
    - language (str, optional): ISO 639-1 language code for article language, by default uses 'en'. Optional to provide.
    - max_results (int, optional): Maximum number of articles to return, by default, uses 5. Optional to provide.

    Returns:
    - list[str]: A list of strings, each representing an article. Each string contains the article's heading and URL.
    """

AWQ Quantization

Input Python example:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import time
import numpy as np

def inference(input_text):
    # Tokenize the prompt and move the input IDs to the GPU.
    tokens = tokenizer(
        input_text,
        return_tensors='pt'
    ).input_ids.cuda()

    # Time only the generation call.
    start_time = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=512
    )
    end_time = time.time()

    # Decode the full sequence and keep only the text after the prompt.
    res = tokenizer.decode(generation_output[0])
    output = res.split(input_text)[-1]
    latency = end_time - start_time
    # Count tokens in the generated text only; encode() expects a string, not a list.
    output_tokens = tokenizer.encode(output, add_special_tokens=False)
    num_output_tokens = len(output_tokens)
    throughput = num_output_tokens / latency

    return {"output": output, "latency": latency, "throughput": throughput}


model_id = "path/to/Octopus-v2-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])

print("avg throughput:", np.mean(avg_throughput))

Quantized GGUF & AWQ Models

Name	Quant method	Bits	Size	Response (tokens/s)	Use Cases
Octopus-v2-AWQ	AWQ	4	3.00 GB	63.83	fast, high quality, recommended
Octopus-v2-Q2_K.gguf	Q2_K	2	1.16 GB	57.81	fast but high loss, not recommended
Octopus-v2-Q3_K.gguf	Q3_K	3	1.38 GB	57.81	strongly discouraged
Octopus-v2-Q3_K_S.gguf	Q3_K_S	3	1.19 GB	52.13	strongly discouraged
Octopus-v2-Q3_K_M.gguf	Q3_K_M	3	1.38 GB	58.67	moderate loss, not very recommended
Octopus-v2-Q3_K_L.gguf	Q3_K_L	3	1.47 GB	56.92	not very recommended
Octopus-v2-Q4_0.gguf	Q4_0	4	1.55 GB	68.80	moderate speed, recommended
Octopus-v2-Q4_1.gguf	Q4_1	4	1.68 GB	68.09	moderate speed, recommended
Octopus-v2-Q4_K.gguf	Q4_K	4	1.63 GB	64.70	moderate speed, recommended
Octopus-v2-Q4_K_S.gguf	Q4_K_S	4	1.56 GB	62.16	fast and accurate, very recommended
Octopus-v2-Q4_K_M.gguf	Q4_K_M	4	1.63 GB	64.74	fast, recommended
Octopus-v2-Q5_0.gguf	Q5_0	5	1.80 GB	64.80	fast, recommended
Octopus-v2-Q5_1.gguf	Q5_1	5	1.92 GB	63.42	very big, prefer Q4
Octopus-v2-Q5_K.gguf	Q5_K	5	1.84 GB	61.28	big, recommended
Octopus-v2-Q5_K_S.gguf	Q5_K_S	5	1.80 GB	62.16	big, recommended
Octopus-v2-Q5_K_M.gguf	Q5_K_M	5	1.71 GB	61.54	big, recommended
Octopus-v2-Q6_K.gguf	Q6_K	6	2.06 GB	55.94	very big, not very recommended
Octopus-v2-Q8_0.gguf	Q8_0	8	2.67 GB	56.35	very big, not very recommended
Octopus-v2-f16.gguf	f16	16	5.02 GB	36.27	extremely big
Octopus-v2.gguf			10.00 GB

Quantized with llama.cpp
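To choose a file for a given device, the table can be filtered by size and ranked by speed. The sketch below is only an illustration: `QUANTS` copies a subset of rows from the table above, and `fastest_within` is a hypothetical helper that ignores the quality notes in the "Use Cases" column, so its pick should be checked against those notes.

```python
# (file name, size in GB, response speed in tokens/s), copied from a subset of the table rows
QUANTS = [
    ("Octopus-v2-Q2_K.gguf", 1.16, 57.81),
    ("Octopus-v2-Q4_0.gguf", 1.55, 68.80),
    ("Octopus-v2-Q4_K_S.gguf", 1.56, 62.16),
    ("Octopus-v2-Q4_K_M.gguf", 1.63, 64.74),
    ("Octopus-v2-Q5_0.gguf", 1.80, 64.80),
    ("Octopus-v2-Q8_0.gguf", 2.67, 56.35),
]


def fastest_within(budget_gb):
    """Return the fastest listed file whose size fits the budget, or None if none fits."""
    candidates = [q for q in QUANTS if q[1] <= budget_gb]
    if not candidates:
        return None
    return max(candidates, key=lambda q: q[2])


print(fastest_within(1.6))
```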

Acknowledgement:
We sincerely thank our community members Mingyuan, Zoey, Brian, Perry, Qi, and David for their extraordinary contributions to this quantization effort.