This repo includes two types of quantized models: **GGUF** and **AWQ**, for our Octopus-v2 model.

# GGUF Quantization
## (Recommended) Run with [llama.cpp](https://github.com/ggerganov/llama.cpp)

1. **Clone and compile:**

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
```

2. **Prepare the Input Prompt File:**

Navigate to the `prompts` folder inside the `llama.cpp` repository and create a new file named `chat-with-octopus.txt`.

`chat-with-octopus.txt`:

```bash
User:
```

3. **Execute the Model:**

Run the following command in the terminal:

```bash
./main -m ./path/to/octopus-v2-Q4_K_M.gguf -c 512 -b 2048 -n 256 -t 1 --repeat_penalty 1.0 --top_k 0 --top_p 1.0 --color -i -r "User:" -f prompts/chat-with-octopus.txt
```

Example prompt to interact with the model:

```bash
<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>
```
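
If you'd rather drive the GGUF model from Python than through the interactive CLI above, here is a minimal sketch using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings. This is an illustration, not part of the official workflow: the binding package, its `Llama` API, and reusing `<nexa_end>` as a stop token (as in the Ollama `Modelfile` below) are assumptions to verify against your setup.

```python
# Sketch only: assumes `pip install llama-cpp-python` and a local GGUF file.
from llama_cpp import Llama

# Load the quantized model; n_ctx mirrors the -c 512 context size used above.
llm = Llama(model_path="./path/to/octopus-v2-Q4_K_M.gguf", n_ctx=512)

prompt = (
    "<|system|>You are a router. Below is the query from the users, "
    "please call the correct function and generate the parameters to call "
    "the function.<|end|><|user|>Query: Take a selfie for me with front "
    "camera<|end|><|assistant|>"
)

# Stop at <nexa_end> so generation ends once the function call is emitted.
output = llm(prompt, max_tokens=256, stop=["<nexa_end>"])
print(output["choices"][0]["text"])
```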

## Run with [Ollama](https://github.com/ollama/ollama)
1. Create a `Modelfile` in your directory and include a `FROM` statement with the path to your local model:

```bash
FROM ./path/to/octopus-v2-Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER num_ctx 1024
PARAMETER stop <nexa_end>
```

2. Use the following command to add the model to Ollama:

```bash
ollama create octopus-v2-Q4_K_M -f Modelfile
```

3. Verify that the model has been successfully imported:

```bash
ollama ls
```

### Run the model

```bash
ollama run octopus-v2-Q4_K_M "<|system|>You are a router. Below is the query from the users, please call the correct function and generate the parameters to call the function.<|end|><|user|>Query: Take a selfie for me with front camera<|end|><|assistant|>"
```
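
Once the model is imported, Ollama also serves a local REST API (port 11434 by default), so the same query can be scripted. A minimal sketch, assuming a running Ollama server and the third-party `requests` package:

```python
# Sketch only: requires the Ollama server running and `pip install requests`.
import requests

prompt = (
    "<|system|>You are a router. Below is the query from the users, "
    "please call the correct function and generate the parameters to call "
    "the function.<|end|><|user|>Query: Take a selfie for me with front "
    "camera<|end|><|assistant|>"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "octopus-v2-Q4_K_M", "prompt": prompt, "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```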

# AWQ Quantization
Python example:

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np

def inference(input_text):
    start_time = time.time()
    input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
    input_length = input_ids["input_ids"].shape[1]
    generation_output = model.generate(
        input_ids["input_ids"],
        do_sample=False,
        max_length=1024
    )
    end_time = time.time()

    # Decode only the newly generated tokens, not the echoed prompt
    generated_sequence = generation_output[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])

    # Latency covers tokenization plus generation; throughput is output tokens/s
    latency = end_time - start_time
    num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency

    return {"output": res, "latency": latency, "throughput": throughput}

# Initialize tokenizer and model
model_id = "path/to/Octopus-v2-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]
```
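
To run the example end to end, loop the helper over `prompts` and print the measured stats; the reporting format below is a sketch, not the original script:

```python
for prompt in prompts:
    result = inference(prompt)
    print(f"Output: {result['output']}")
    print(f"Latency: {result['latency']:.2f} s, "
          f"throughput: {result['throughput']:.2f} tokens/s")
```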
 
 
_Quantized with llama.cpp_

  **Acknowledgement**:
  We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.