How to generate token by token?
Awesome model, but when I use it, it doesn't generate token by token; the output only appears once the whole response is complete. Is there a way to do token by token?
This is my current code; I'm running it with llama-cpp-python on Colab right now.
!pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin")
message = "Hello, how are you?"
output = llm("USER:" + message + "ASSISTANT:", max_tokens=32, stop=["USER:", "\n"], echo=True)
print(output)
Here's a llama-cpp-python script that prints the response token by token as it is generated, and then all at once at the end (you can remove that last print if you don't want it):
from llama_cpp import Llama
import random
llm = Llama(model_path="tulu-7b.ggmlv3.q2_K.bin", n_gpu_layers=40, seed=random.randint(1, 2**31))
tokens = llm.tokenize(b"### Instruction: write a story about llamas\n### Response:")
output = b""
count = 0
for token in llm.generate(tokens, top_k=50, top_p=0.73, temp=0.72, repeat_penalty=1.1):
    text = llm.detokenize([token])
    # print each token as soon as it is generated
    print(text.decode(), end='', flush=True)
    output += text
    count += 1
    # stop after 500 tokens or when the end-of-sequence token appears
    if count >= 500 or token == llm.token_eos():
        break
print("\n\nFull response:", output.decode())
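If you prefer the high-level call from your original snippet, it can also stream: passing stream=True makes it return an iterator of partial completion chunks instead of a single finished result. A minimal sketch, assuming the same model file as your first post (the prompt, stop string, and token limit are just examples):

from llama_cpp import Llama

llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin")
# stream=True turns the call into a generator of partial completion chunks
for chunk in llm("USER: Hello, how are you? ASSISTANT:", max_tokens=32, stop=["USER:"], stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
print()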
Awesome, it seems to work perfectly! However, I just did llm = Llama(model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin"), so what does n_gpu_layers do? And seed=random.randint(1, 2**31)? I'm also running on CPU right now; should I add something?
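For reference, a minimal CPU-only sketch of the constructor call (the n_threads value below is only an example; n_gpu_layers only has an effect when llama-cpp-python is built with GPU support, and seed just fixes the sampling RNG so runs are reproducible):

import random
from llama_cpp import Llama

# On CPU you can omit n_gpu_layers (or set it to 0); it only offloads layers in a GPU-enabled build.
# seed controls sampling randomness; a random seed gives different output each run.
llm = Llama(
    model_path="/content/Wizard-Vicuna-13B-Uncensored.ggmlv3.q2_K.bin",
    n_threads=4,  # example value: number of CPU threads to use
    seed=random.randint(1, 2**31),
)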