HuggingChat: Input validation error: `inputs` tokens + `max_new_tokens` must be..

#430
by Kostyak - opened

I use the meta-llama/Meta-Llama-3-70B-Instruct model. After a certain number of moves, the AI refuses to walk and gives an error : "Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6391 inputs tokens and 2047 max_new_tokens". Is this a bug or some new limitation? I still don't get it to be honest and I hope I get an answer here. I'm new to this site.
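
For readers landing here, the arithmetic behind the message can be sketched in a few lines. This only illustrates the inequality that the error text describes and is not the server's actual code. The function name `fits_budget` is made up for this example, and the numbers are the ones from the error above.

```python
def fits_budget(input_tokens: int, max_new_tokens: int, total_limit: int = 8192) -> bool:
    """Return True if the request satisfies inputs + max_new_tokens <= limit."""
    return input_tokens + max_new_tokens <= total_limit


# Numbers taken from the error message in this post.
input_tokens = 6391
max_new_tokens = 2047

requested = input_tokens + max_new_tokens  # 8438 tokens asked for in total
overshoot = requested - 8192               # how far over the limit the request is

print(fits_budget(input_tokens, max_new_tokens))  # False
print(overshoot)                                  # 246
```

So the prompt does not have to be close to 8192 tokens on its own: with 2047 tokens reserved for the reply, any prompt above 6145 input tokens is rejected.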

Kostyak changed discussion status to closed
Kostyak changed discussion title from Input validation error: `inputs` tokens + `max_new_tokens` must be.. to HuggingChat: Input validation error: `inputs` tokens + `max_new_tokens` must be..
Kostyak changed discussion status to open
Kostyak changed discussion status to closed
Kostyak changed discussion status to open

Same issue all of a sudden today

Hugging Chat org

Can you see if this still happens? Should be fixed now.

Still same error, except numbers have changed a little.
Screenshot_20.png

I keep getting this error as well. Using CohereForAI

Same error, "Meta-Llama-3-70B-Instruct" model.

I have also been running into this error. Is there a workaround or solution at all?

"Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6474 inputs tokens and 2047 max_new_tokens"

Using the meta-llama/Meta-Llama-3-70B-Instruct model.

Keep getting same error on llama3-70b. If the message prompt crosses context length shouldn't it automatically truncate or something like that?

It happens more often than not, even when using like 7 words.

Using the meta-llama/Meta-Llama-3-70B-Instruct model.
Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6477 inputs tokens and 2047 max_new_tokens

Happening to me right now:
Input validation error: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 6398 `inputs` tokens and 2047 `max_new_tokens`

Hugging Chat org

Just to check, are you having long conversations and/or using the websearch? Sorry for the inconvenience, trying to find a fix.

Happens even without web search, just long conversation.

No web search, not really long. The old conversation should be somewhere around 8000 tokens. Like the error says:
Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 8015 inputs tokens and 2047 max_new_tokens

In new chat length was less before getting the same error. Again, like it is stated in the error it should be somewhere around 6150 tokens.
Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6150 inputs tokens and 2047 max_new_tokens

I was having a long conversation without web search

Likewise, a long conversation without a web search.

There is no inconvenience at all, we appreciate your time and effort trying to fix this. On my end, no web search here, just default. "Assistant will not use internet to do information retrieval and will respond faster. Recommended for most Assistants."

This is def a weird bug. It doesn't matter how many words you use in the context, it just throws the error and blocks you. You can reduce the prompt to one word and it will still throw the error.

Seems to happen with long conversations. Like I'm hitting a hard limit. I could do a token count if that helps.

Hugging Chat org

I'd really appreciate if you could count the tokens indeed. You can grab the raw prompt by clicking the bottom right icon on the message that gave you the error. It will open a JSON with a field called prompt which contains the raw prompt.

Screenshot 2024-05-04 at 09.17.43.png

Otherwise if someone feels comfortable sharing a conversation, I can have a look directly.
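
For anyone doing the count requested above, here is a rough sketch. It assumes the JSON you get from that icon has a top-level `prompt` string, as described in this post. The `count_prompt_tokens` helper and the whitespace tokenizer are stand-ins invented for this example. For a real count you would pass the model's own tokenizer instead (for example the `encode` method of a tokenizer loaded with `transformers.AutoTokenizer.from_pretrained`), since whitespace splitting usually undercounts.

```python
import json
from typing import Callable, Sequence


def count_prompt_tokens(raw_json: str, tokenize: Callable[[str], Sequence]) -> int:
    """Count tokens in the `prompt` field of the exported JSON."""
    data = json.loads(raw_json)
    return len(tokenize(data["prompt"]))


# Stand-in tokenizer: splits on whitespace. Only a rough estimate;
# swap in the model's real tokenizer for an accurate number.
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()


# Made-up example payload with the shape described above (hypothetical content).
example = json.dumps({"prompt": "You are a helpful assistant. Hello there!"})
print(count_prompt_tokens(example, whitespace_tokenize))
```

Comparing that number with the "Given: ... inputs tokens" figure in the error shows whether the prompt really is as long as the server claims.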

Here you go: https://hf.co/chat/r/v_U0GXB

Here you go, please: https://hf.co/chat/r/7MLJ8EX

Hi there! It happens here too. Here's my conversation https://hf.co/chat/r/1yeBRAV

Thanks in advance!

Anyone know if this is fixed?

nope, still same error.

Hugging Chat org

I asked internally, trying to get to the bottom of this, sorry for the inconvenience!

I'm also getting this problem. It's very annoying. I know the service is free, but I wouldn't mind paying for it if it got rid of this error.

When will they fix this error? It's literally annoying especially when I was trying to make LLama 3 fix the code

Any news on fixing this bug?

Hi. I also have the same problem when sending a link to a Facebook page, but I've already done it in other chats and there was no problem.

The issue for me is that I need to change conversations because I can't use the chat anymore, and that's a problem because I was using it to deliver a business service.

I would very much appreciate you trying; I can share the conversation as well.

Running into the same issue: I'm iterating over a defined set of strings to find the best prompting strategy, and it gives me this error with random strings at random times. Can't make sense of it. Using the meta-llama/Meta-Llama-3-8B-Instruct model.

Any updates?

I'm also getting the same problem. can i help in any way?

I am getting the same error as well, usually in long conversations that involve code reviews, documentation, etc.

Yeah, still getting this issue, it's so annoying.

Bruh it is never going to be fixed I guess 😭

Same issueee: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 4076 `inputs` tokens and 100 `max_new_tokens`

I've got the same issue in a long conversation. If I branch a prompt, the AI answers me, but I can't add any request after the branch. I tried several models and got the same result: Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 6269 inputs tokens and 2047 max_new_tokens. If I go to another conversation, it works.

Any updates?

It seems the issue lies in how the context is handled. I think the best approach here would be to clear the context after a few messages so that there are always enough tokens left to keep the conversation going, for example by sending only the last 3-4 messages. That gives the model less context but would probably avoid this error, which seems to occur when the context is full; at that point you have to start a new chat and begin all over again until it happens again.

I think the best approach here would be to clear the context after a few messages

Can you give an example on how to do this?

The error we're encountering is probably due to the limitation on the total number of tokens that can be processed by the LLaMA model. To resolve this issue, developers can implement a mechanism to truncate the conversation context after a certain number of messages.

Something like this could work for let's say the last 5 messages, but this has to be done in the backend:


conversation_history = []

def process_message(user_input):
    global conversation_history
    
    # Add the user's input to the conversation history
    conversation_history.append(user_input)
    
    # Truncate the conversation history to keep only the last 5 messages
    if len(conversation_history) > 5:
        conversation_history = conversation_history[-5:]
    
    # Prepare the input for the LLaMA model
    input_text = "\n".join(conversation_history)
    
    # Call the LLaMA model with the truncated input
    # (llama_model is a placeholder for your own inference call)
    response = llama_model(input_text)
    
    # Append the response to the conversation history
    conversation_history.append(response)
    
    return response

Or this for the frontend


let conversationHistory = [];

function processMessage(userInput) {
  conversationHistory.push(userInput);

  // Truncate the conversation history to keep only the last 5 messages
  if (conversationHistory.length > 5) {
    conversationHistory = conversationHistory.slice(-5);
  }

  // Prepare the input for the LLaMA model
  let inputText = conversationHistory.join("\n");

  // Call the LLaMA model with the truncated input
  $.ajax({
    type: "POST",
    url: "/llama-endpoint", // Replace with your LLaMA model endpoint
    data: { input: inputText },
    success: function(response) {
      // Append the response to the conversation history
      conversationHistory.push(response);

      // Update the conversation display (use .text() so the response is not parsed as HTML)
      $("#conversation-display").append($("<p>").text(response));
    }
  });
}

That could potentially fix this bug.
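
One caveat on the two snippets above: keeping the last five messages caps the number of messages, not the number of tokens, so five long messages can still overflow the limit. A budget-based variant might look like the sketch below. Everything in it is an assumption for illustration: `truncate_to_budget` and `rough_token_count` are made-up names, the whitespace counter stands in for the model's real tokenizer, and the default limits are just the numbers from the error messages in this thread. It is not how HuggingChat or TGI actually handle it.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Stand-in token counter (whitespace split); use the real tokenizer in practice."""
    return len(text.split())


def truncate_to_budget(
    messages: list[str],
    total_limit: int = 8192,
    max_new_tokens: int = 2047,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Keep the most recent messages whose combined size leaves room for the reply."""
    budget = total_limit - max_new_tokens  # tokens available for the prompt
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["one two three", "four five", "six seven eight nine", "ten"]
# Tiny limits so the effect is visible: 10 - 3 = 7 tokens for the prompt.
print(truncate_to_budget(history, total_limit=10, max_new_tokens=3))
```

With these toy limits the oldest message is dropped and the three newest are kept. A real implementation would also have to count the tokens added by the system prompt and the chat template.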

Thanks for the help, but I’m using the Hugging Face chat website. I’ve no clue how to input this code.

I know, I mean developers have to check if that can fix the issue on their end FYI @nsarrazin

Any updates, friends?

None, it is still getting stuck with an error every time the chat log reaches some limit. Like someone stated earlier, it seems it will not be fixed.

Yeah, unfortunately there isn't any way to fix it from the frontend once it reaches that error. Editing a message a few above the one that caused the error, asking for a summary of the chat context, and then starting a new chat works, but this isn't a solution.

I don't know how the chat is deployed or what language but if I could help to fix it, I would. I use the chat on a daily basis.

Hugging Chat org

We should have a fix for it in TGI, will make sure it's deployed tomorrow!

I’m pretty new to Hugging Face chat. Will it update automatically, or would I need to do something manually? Also, when will the fix drop, and do you know what’s causing it?

Amazing! Thank you

Have you run into the error again? So far so good for me

still getting it
image.png

Still same error for me.

Well done devs! Great stuff @nsarrazin thank you!

Seems fixed to me. I couldn't see that error anymore.

Hugging Chat org

Yep the issue should be fixed on all models, if you still see it feel free to ping me!

nsarrazin changed discussion status to closed

@nsarrazin Getting this error on codellama-7b-instruct, and llama2-70b-chat models
ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 6144. Given: 4183 inputs tokens and 2023 max_new_tokens

@nsarrazin I am getting the same error with Qwen/Qwen2-72B-Instruct using Inference Endpoints:
Input validation error: inputs tokens + max_new_tokens must be <= 1512. Given: 970 inputs tokens and 5000 max_new_tokens

The model works if I set max_new_tokens to 500 (970+500 <= 1512), though. Is this a limitation of the model or Hugging Face Inference Endpoints?

Edit: I just noticed that for a Text Generation task, Max Number of Tokens (per Query) can be set under the Advanced Configuration settings of a dedicated Inference Endpoint. The default value is 1512 and increasing it to let's say 3000 fixed my issue.
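
The arithmetic in that edit generalizes: the largest `max_new_tokens` that can pass validation is whatever is left of the limit after the prompt. Below is a small sketch with a made-up helper name, `clamp_max_new_tokens`, using the numbers from this post. It only mirrors the inequality in the error message and is not part of any Hugging Face library.

```python
def clamp_max_new_tokens(input_tokens: int, requested: int, total_limit: int) -> int:
    """Largest max_new_tokens <= requested that keeps inputs + max_new_tokens <= limit."""
    room = total_limit - input_tokens
    if room <= 0:
        # The prompt alone fills the window; only shortening the input helps.
        raise ValueError(
            f"prompt uses {input_tokens} tokens, limit is {total_limit}"
        )
    return min(requested, room)


# Numbers from the post above: 970 input tokens, 5000 requested, limit 1512.
print(clamp_max_new_tokens(970, 5000, 1512))  # 542
# After raising the endpoint limit to 3000 there is room for 2030 new tokens.
print(clamp_max_new_tokens(970, 5000, 3000))  # 2030
```

When the prompt by itself is larger than the limit, lowering `max_new_tokens` cannot help, which is why the sketch raises an error in that case instead of returning zero.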

@nsarrazin
You mentioned that the issue is solved for all models, but I am still facing it with the meta-llama/Meta-Llama-3-8B-Instruct model.
{
"error": "Input validation error: inputs tokens + max_new_tokens must be <= 4096. Given: 4092 inputs tokens and 16 max_new_tokens",
"error_type": "validation"
}

I think this is still an error. "'error': 'Input validation error: inputs tokens + max_new_tokens must be <= 4096". Using the dockerized TGI, with params --model-id Qwen/Qwen2-72B-Instruct-GPTQ-Int8 --quantize gptq. This limits to 4096 when the context should be allowed to be much bigger than that.

If you are using the Dockerized TGI, try setting the --max-total-tokens parameter. The default is 4096 and that may be the origin of the issue.

Hi, I saw the above thread and was wondering if it's an issue or a limitation.

I am using meta-llama/Meta-Llama-3.1-70B-Instruct which has a context window of 128k. But I get this when I send large input.

Input validation error: inputs tokens + max_new_tokens must be <= 8192. Given: 12682 inputs tokens and 4000 max_new_tokens

Using Hugging Chat, https://huggingface.co/chat/

Model: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Input validation error: inputs tokens + max_new_tokens must be <= 16384. Given: 14337 inputs tokens and 2048 max_new_tokens

Hello,

I have exactly the same error when calling Meta-Llama-3.1-70B-Instruct using Haystack v2.0's HuggingFaceTGIGenerator in the context of a RAG application:

cmd.png

It is very puzzling because Meta-Llama-3.1-70B-Instruct should have a context window size of 128k tokens. This, and the multilingual capabilities, are major upgrades with respect to the previous iteration of the model.

Still, here's the result:

error 422.png

I am calling the model using serverless API. Perhaps creating a dedicated, paid API endpoint would solve the issue? Did anyone try this?

Hi,

I had the same problem when using the Serverless Inference API and meta-llama/Meta-Llama-3.1-8B-Instruct. The problem is that the API only supports a context length of 8k for this model, while the model supports 128k. I got around the problem by running a private endpoint and changing the 'Container Configuration', specifically the token settings to whatever length I required.

Hi AlbinLidback,

Yes, I ended up doing the same thing and it solved the problem. Hugging Face could save users a lot of frustration by explicitly mentioning this on the model cards.

Hi @AlbinLidback , @JulienGuy

I'm totally new to the Hugging Face.
I also got the same problem with meta-llama/Meta-Llama-3.1-8B-Instruct and 70B-Instruct.

Could you share how to do the "running a private endpoint and changing the 'Container Configuration'" part with the 128k token length?

Hi @pineapple96 ,

This part is relatively straightforward. Go to the model card (e.g. https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), click on "Deploy" in the top right corner and select "Inference Endpoint". On the next page you can choose what hardware you want to run the model on, which will impact how much you will pay per hour. Set "Automatic Scale to Zero" to some value other than "never" to switch off the endpoint after X amount of time without a request, so that you won't be paying for the endpoint while it's not in use. Then go to "Advanced Configuration" and set the maximum number of tokens to whatever makes sense for your use case. With this procedure you will be able to make full use of the larger context windows of Llama 3 models.

Thanks a lot for the detailed how-to guide, JulienGuy. Appreciate it!
