It doesn't stop generating text.
I have faced this problem with llama-2-7B-32k where it continues producing text until the max number of tokens is reached.
Is there a solution to this problem?
Set the EOS token to the corresponding value in the vocabulary
@macadeliccc
Can you give me a code example?
This is the example from amazon/MistralLite:
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "amazon/MistralLite"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,
    device_map="auto",
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
This line can be referenced like this:
eos_token_id=tokenizer.eos_token_id,
Or like this:
eos_token_id=32101,
The number I selected is arbitrary; I just wanted to show that it references an index in the vocabulary.
Given that this is a Mistral fine-tune, I think this should suffice. Regardless, this is the logic that stops generation and prevents run-on output, and it can be found in most (if not all) text-generation models.
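If you want to check which token and id your tokenizer actually treats as EOS, here is a minimal sketch (assuming the amazon/MistralLite tokenizer from the example above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amazon/MistralLite")
print(tokenizer.eos_token)                                   # the EOS string, e.g. "</s>"
print(tokenizer.eos_token_id)                                # its index in the vocabulary
print(tokenizer.convert_tokens_to_ids(tokenizer.eos_token))  # same id, looked up explicitly

Whatever id this prints is the value you would otherwise hard-code in place of a number like 32101.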
@macadeliccc It worked, thanks!
@macadeliccc Can you tell me what the ideal value for eos_token_id is? Also, I read somewhere that adding model.config.pad_token_id = tokenizer.eos_token_id to the code helps in solving this issue. But I'm confused: do I have to specify a particular value for eos_token_id? If so, what's the ideal value to set? Please let me know. Thanks!
The EOS token should always be set like this:
eos_token_id=tokenizer.eos_token_id,
This way it will just use whatever the EOS token is in the tokenizer you're using with the model.
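Putting that together with the pad_token_id question above, here is a minimal sketch of a plain generate() call (the model id and prompt are placeholders, not the exact setup from this thread):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("[INST] Why do models keep generating past the answer? [/INST]", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    eos_token_id=tokenizer.eos_token_id,  # stop as soon as the tokenizer's EOS token is produced
    pad_token_id=tokenizer.eos_token_id,  # common workaround when the tokenizer has no pad token
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))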
@macadeliccc
Here's my code snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

base_model = "mistralai/Mistral-7B-Instruct-v0.2"
eos_token = "[/INST]"

# 4-bit quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    # device_map="auto",
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.add_eos_token = True
tokenizer.max_new_tokens = 2000
tokenizer.max_length = 200
tokenizer.max_new_length = 200
tokenizer.pad_token_id = 2041
tokenizer.pad_token = tokenizer.unk_token
eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
I'm trying to solve two issues here:
1- Massive repetition, self-talk, etc.
2- The generation stopping abruptly
Let me know your comments. Thanks!
I had a similar problem and solved it rather crudely:
def dataset_to_dialog_format(text, annotations):
    dialogs = []
    for i in range(len(text)):
        chat = [
            {"role": "user", "content": text[i]},
            {"role": "assistant", "content": annotations[i] + " ##################"},
        ]
        dialogs.append(chat)
    return dialogs
I append " ##################" to the assistant annotations during training, and at inference time I cut the generated text at the first "#" and keep only the text before it.
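For completeness, a minimal sketch of that inference-side cut (the function and example strings here are hypothetical, just to illustrate the idea):

def cut_at_marker(generated_text: str, marker: str = "#") -> str:
    # Keep only what was generated before the first marker character;
    # everything after it is the run-on text we want to drop.
    return generated_text.split(marker, 1)[0].strip()

print(cut_at_marker("POSITIVE ################## and then the model keeps rambling..."))
# -> "POSITIVE"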