System configuration for deploying a web app like HuggingChat
I would like to know the cloud configuration required to deploy the Llama-2-70b-chat model that is used by HuggingChat.
If you want to deploy Llama 2 on your own infrastructure, you can try using Text Generation Inference (TGI).
If you don't have access to enough local compute to run it yourself, you can deploy it on AWS SageMaker, for example, and follow the guide on how to set it up with chat-ui.
Regarding the chat-ui parameters for Llama 2, here is what we use:
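Once you have a TGI endpoint up, a quick way to smoke-test it before wiring it into chat-ui is the `huggingface_hub` client. This is a minimal sketch, assuming TGI is serving the model at `http://localhost:8080` (the URL is a placeholder for your own deployment); the generation values mirror the chat-ui parameters below, not any official production setup:

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server is already running; replace the URL with your endpoint.
client = InferenceClient("http://localhost:8080")

# Llama 2 chat expects its special [INST] / <<SYS>> prompt format
# (see the token settings below for how chat-ui produces it).
prompt = "<s>[INST] <<SYS>>\n\n<</SYS>>\n\nHello, who are you? [/INST] "

output = client.text_generation(
    prompt,
    max_new_tokens=1024,
    temperature=0.1,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
)
print(output)
```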
"userMessageToken": "",
"userMessageEndToken": " [/INST] ",
"assistantMessageToken": "",
"assistantMessageEndToken": " </s><s>[INST] ",
"preprompt": "<s>[INST] <<SYS>>\n\n<</SYS>>\n\n",
"promptExamples": [
{
"title": "Write an email from bullet list",
"prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
}, {
"title": "Code a snake game",
"prompt": "Code a basic snake game in python, give explanations for each step."
}, {
"title": "Assist in a task",
"prompt": "How do I make a delicious lemon cheesecake?"
}
],
"parameters": {
"temperature": 0.1,
"top_p": 0.95,
"repetition_penalty": 1.2,
"top_k": 50,
"truncate": 1000,
"max_new_tokens": 1024
},
You can find docs about the rest of the parameters here.
I hope that's enough to get started; let me know if you need anything else.
Hey @nsarrazin, thanks for your reply.
I also wanted to know the NCCL GPU configuration you used to run text-generation-inference. Also, how many concurrent requests can it handle?