[FEEDBACK and SHOWCASE] PRO subscription
Feel free to add your feedback about the Inference API for PRO users as well as other features.
I am getting a very partial response. I am using HuggingFaceHub like:
```python
repo_id = "meta-llama/Llama-2-70b-chat-hf"
args = {
    "temperature": 1,
    "max_length": 1024,
}
HuggingFaceService.llm = HuggingFaceHub(repo_id=repo_id, model_kwargs=args)
```
The prompt is: You are an assistant who can generate the response based on the prompt.
Use the following pieces of context to answer the question at the end.
If you don't find the answer, just say Sorry I didn't understand, can you rephrase please.
[Document(page_content='Types of workflow in the DigitalChameleon platform There are two types of workflows that can be created in the platform which include: 1.\tConversation: A series of nodes with questions or text displayed to the customer in a sequence one by one, to capture the response of Customer, is referred to as Conversation workflow. The nodes of a workflow of conversation type are loaded on the webpage to the customer one at a time. The flow can be modified to return to a previous flow or allow customer to resume work at a later point in time. Workflow will go to the next node only when the customer performs the desired action in the previous node as configured in the workflow. 2.\tForm: A one time loading of the nodes/questions/messages to the end customer all at once in the UI of a form. The form will be created in the similar manner as we create for conversation in the CMS except for the workflow type in the journey properties should be selected as Form while creating/copying the workflow.
Question: explain the Types of workflow in the DigitalChameleon platform
result : "result": ". \n ')]] Sure, I'd be happy to explain the types of"
I am using langchain to get the answers based on a text file.
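One possible cause (a sketch under assumptions, not verified against this exact setup): with Hugging Face text-generation models, `max_length` typically counts the prompt tokens as well, so a long retrieved context can leave almost no budget for the answer and you get a truncated result like the one above. Passing `max_new_tokens` instead budgets only the generated text. The parameter values below are illustrative guesses:

```python
# Hypothetical fix: budget the answer with "max_new_tokens" instead of
# "max_length", which also counts the (long) prompt/context tokens.
args = {
    "temperature": 0.7,          # illustrative value
    "max_new_tokens": 512,       # reserved for the answer, independent of prompt length
}

# In the setup quoted above this would be wired in the same way
# (commented out here so the sketch stays self-contained):
# HuggingFaceService.llm = HuggingFaceHub(
#     repo_id="meta-llama/Llama-2-70b-chat-hf", model_kwargs=args
# )
```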
@it-chameleoncx can you format your post with codeblocks (```) thanks
Are you planning to add more models to PRO interfaces like for example teknium/OpenHermes-2.5-Mistral-7B?
Hi,
Please add PRO interface for mistralai/Mixtral-8x7B-Instruct-v0.1. It would also be nice to have interfaces for other models that are available through HuggingChat and are not available for PRO subscribers.
Thank you 🙂
Hi, can you please provide a link to a privacy policy that applies to the PRO Inference API?
Hello, can you please add https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B to the PRO subscription?
Sorry if I didn't understand: are there any limitations on requests for PRO / Free accounts, like a limit on tokens?
I am trying to access meta-llama/Llama-2-70b-chat-hf, which was previously available as a PRO subscriber, but it seems the model does not respond.
Can you please reactivate it?
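One way to see whether the model is still hosted before retrying is to query the Inference API's status endpoint. A minimal sketch; the `/status/<model>` path is an assumption about the public API base URL, and the token is a placeholder:

```python
import json
import urllib.request

API_BASE = "https://api-inference.huggingface.co"

def status_url(model_id: str) -> str:
    # Build the status endpoint URL for a hosted model
    # (assumed endpoint shape, see lead-in above).
    return f"{API_BASE}/status/{model_id}"

def check_model(model_id: str, token: str) -> dict:
    # Fetch the raw status payload for the model (network call).
    req = urllib.request.Request(
        status_url(model_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# check_model("meta-llama/Llama-2-70b-chat-hf", "hf_xxx")  # placeholder token
```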
I can't apply spaces.GPU to async functions, and I can't apply spaces.GPU to wrapped functions. It would be nice if both were possible.
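Until that is supported, one common workaround sketch is to keep the GPU-bound function synchronous (so `@spaces.GPU` can decorate it directly in a real Space) and hop to a thread from async code via an executor. The `spaces` decorator itself is omitted here so the sketch stays self-contained; everything else is standard library:

```python
import asyncio
import functools

def run_sync_in_executor(fn):
    # Wrap a synchronous function so it can be awaited from async code.
    # In a real Space, fn would be the function already decorated with
    # @spaces.GPU; the decorator is left out here (assumption: it returns
    # a plain callable).
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(fn, *args, **kwargs)
        )
    return wrapper

# @spaces.GPU              # applied in a real Space; omitted here
def generate(x: int) -> int:
    # Stand-in for a GPU-bound task.
    return x * 2

generate_async = run_sync_in_executor(generate)

async def main() -> int:
    return await generate_async(21)
```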
Hello Hugging Face Support Team,
I’m interested in using the models available through the PRO subscription and have reviewed the details on Inference for PRO in the following link: https://huggingface.co/blog/inference-pro.
Specifically, I would like to use the following model:
https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo
I’d like to know what the monthly usage limits are for this model under the PRO subscription. Specifically, how many requests can I make in a month, and what other limitations might apply?
Could you please provide detailed information regarding rate limits, monthly request quotas, response times, and any other restrictions associated with the PRO plan?
Thank you for your assistance.
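For reference, calling that endpoint looks roughly like the sketch below. The payload shape (raw audio bytes in, a JSON body with a `"text"` field out) is an assumption based on common automatic-speech-recognition usage, and the token and file path are placeholders; this does not answer the quota question, which only the team can:

```python
import json
import urllib.request

# Endpoint quoted in the post above.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo"

def transcribe(audio_path: str, token: str) -> str:
    # POST the raw audio bytes and return the transcribed text
    # (assumed response shape: {"text": ...}).
    with open(audio_path, "rb") as f:
        req = urllib.request.Request(
            API_URL,
            data=f.read(),
            headers={"Authorization": f"Bearer {token}"},
        )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read()).get("text", "")

# transcribe("sample.flac", "hf_xxx")  # placeholder file and token
```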
Hey folks, I'm not sure where the best place to put this is, but I'd like some clarity on the models that have increased inference usage for PROs ~
Current Inference Docs
The recently published serverless inference docs mention these models as having higher rate limits:
| Model | Size | Supported Context Length | Use |
|---|---|---|---|
| Meta Llama 3.1 Instruct | 8B, 70B | 70B: 32k tokens / 8B: 8k tokens | High quality multilingual chat model with large context length |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Meta Llama Guard 3 | 8B | 4k tokens | |
| Llama 2 Chat | 7B, 13B, 70B | 4k tokens | One of the best conversational models |
| DeepSeek Coder v2 | 236B | 16k tokens | A model with coding capabilities |
| Bark | 0.9B | - | Text to audio generation |
Old Blog Article
But there's also this old blog post that introduces the feature with these models and it hasn't been updated:
| Model | Size | Context Length | Use |
|---|---|---|---|
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Mixtral 8x7B Instruct | 45B MOE | 32k tokens | Performance comparable to top proprietary models |
| Nous Hermes 2 Mixtral 8x7B DPO | 45B MOE | 32k tokens | Further trained over Mixtral 8x7B MoE |
| Zephyr 7B β | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Llama 2 Chat | 7B, 13B | 4k tokens | One of the best conversational models |
| Mistral 7B Instruct v0.2 | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Code Llama Base | 7B and 13B | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | 34B | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | 3B UNet | - | Generate images |
| Bark | 0.9B | - | Text to audio generation |
I assume that the new inference docs have the correct supported models list but could be updated to avoid confusion.
My Suggestions
If the inference docs are correct, I think it could use some updating!
- Llama-3-70B could be swapped out for Llama-3.3-70B-Instruct, while keeping Llama-3.1-8B-Instruct.
- We probably don't need two large Llama 3.x models, so I'd suggest replacing Llama-3.1-70B with Qwen2.5-72B-Instruct.
- It's time to retire Llama-2... In 2025 we have plenty of great reasoning models to prioritize like QwQ-32B-Preview or DeepSeek-R1-Distill-Qwen-32B.
- Although I do like the novel nature of suno/bark, it is starting to show its age. I'd suggest replacing it with hexgrad/Kokoro-82M for its small size, exceptional quality, and long inputs.
- DeepSeek-Coder-V2 is a very large model that is matched or outperformed by Qwen2.5-Coder-32B-Instruct. If there are any concerns about the size of models or potential load, I think aiming to replace DeepSeek-Coder-V2 would be a wise use of resources.
Notable mentions and other thoughts
I tried to keep my suggestions limited to the current paradigm of serverless inference so that each model is a drop-in replacement for existing ones, while being realistic about size. However, it would be awesome to have a text-to-image model available on this list. The best and most agreeable image gen model is either FLUX.1-schnell or stabilityai/stable-diffusion-3.5-medium. Both models are relatively smol, and all above models are commercially permissive or already available on HuggingChat.
Thanks for reading :)