How to generate token by token?
Awesome model, but when I use it, it doesn't generate token by token; the output only appears once the whole response is complete. Is there a way to do token by token?
This is my current code; I'm running it with llama-cpp-python on Colab right now.
!pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin")
message = "Hello, how are you?"
output = llm("USER:" + message + "ASSISTANT:", max_tokens=32, stop=["USER:", "\n"], echo=True)
print(output)
Here's a llama-cpp-python script that prints the response token by token as it is generated, and then all at once at the end (you can remove that last print if you don't want it):
from llama_cpp import Llama
import random
llm = Llama(model_path="tulu-7b.ggmlv3.q2_K.bin", n_gpu_layers=40, seed=random.randint(1, 2**31))
tokens = llm.tokenize(b"### Instruction: write a story about llamas\n### Response:")
output = b""
count = 0
for token in llm.generate(tokens, top_k=50, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    text = llm.detokenize([token])
    # print each token as soon as it is generated
    print(text.decode(), end='', flush=True)
    output += text
    count += 1
    # stop after 500 tokens or when the end-of-sequence token appears
    if count >= 500 or token == llm.token_eos():
        break
print("\n\nFull response:", output.decode())
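If you prefer the high-level call from your original snippet, it can also stream: passing stream=True makes it return an iterator of partial completion chunks instead of a single finished result. A minimal sketch, assuming the same model file as your first post (the prompt, stop string, and token limit are just examples):

from llama_cpp import Llama

llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin")
# stream=True turns the call into a generator of partial completion chunks
for chunk in llm("USER: Hello, how are you? ASSISTANT:", max_tokens=32, stop=["USER:"], stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()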
Awesome, it seems to work perfectly! However, I just did llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin"), so what does n_gpu_layers do? And seed=random.randint(1, 2**31)? I'm also running on CPU right now; should I add something?
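For reference, a minimal CPU-only sketch of the constructor call (the n_threads value below is only an example; n_gpu_layers only has an effect when llama-cpp-python is built with GPU support, and seed just fixes the sampling RNG so runs are reproducible):

import random
from llama_cpp import Llama

# On CPU you can omit n_gpu_layers (or set it to 0); it only offloads layers in a GPU-enabled build.
# seed controls sampling randomness; a random seed gives different output each run.
llm = Llama(
    model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin",
    n_threads=4,  # example value: number of CPU threads to use
    seed=random.randint(1, 2**31),
)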