The generated answer is not terminated correctly
I am using the 8-bit quantized version:
import torch
import transformers

# Load the model with 8-bit weights to reduce GPU memory usage
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_8bit=True,
)
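The pipeline itself is a standard text-generation pipeline built on top of this model, roughly like this (with the tokenizer loaded from the same model_path):

# Build the text-generation pipeline from the already-loaded 8-bit model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)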
And inference:
fmt_ex = """
Instruction: Please write a poem with less than 200 words.
Response:
"""
with torch.autocast('cuda', dtype=torch.float16):
print(
pipe(fmt_ex,
max_new_tokens=256,
do_sample=True,
use_cache=True))
I can see that the last word of the answer is followed by "#", but then random characters keep appearing:
He will know I am here and I am meant to be by his side\n#d-link dl6000 # dlink dl6000 \n# dl6000\n- #n\n#d-link #dlinkdl6000 # dl60\n#dlink # dlinkdl6000 # dl6000 #dl\n- #dlk\n#d-link dl6\n#dlk\n##dlink dl6\n\n#d-link #dlink #dlinkdl6000 #dl\n#d
It seems to be a problem with the special tokens.
It turns out that I needed to set "stopping_criteria" when building the pipeline. I had not realized this because many Hugging Face models already implement it in their custom code.
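For reference, this is roughly what I ended up with. It is only a sketch: the "###" stop string is just an example, and the right terminator depends on the model's prompt format (it may instead be a marker like "### End" or simply the EOS token).

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the output ends with one of the given token-id sequences."""
    def __init__(self, stop_token_ids):
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids, scores, **kwargs):
        for stop_ids in self.stop_token_ids:
            if input_ids[0][-len(stop_ids):].tolist() == stop_ids:
                return True
        return False

# Example stop sequences: a "###" marker and the tokenizer's EOS token
stop_token_ids = [
    tokenizer("###", add_special_tokens=False).input_ids,
    [tokenizer.eos_token_id],
]

pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    stopping_criteria=StoppingCriteriaList([StopOnTokens(stop_token_ids)]),
)

With that in place, generation stops at the marker instead of rambling on with the hashtag-like noise shown above.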