kingabzpro posted an update (Sep 7):
How can I make my RAG application generate real-time responses? Up until now, I have been using Groq for fast LLM generation and the Gradio Live function. I am looking for a better solution that can help me build a real-time application without any delay. @abidlabs

kingabzpro/Real-Time-RAG

This all depends on your use case(s), but here are some options (rough sketches for several of them follow the list):

  • Profile your code to find where it is slowest, and troubleshoot those areas first (profiling sketch below).
  • Speculative decoding (Qwen2-0.5B helping Qwen2-7B, for example; sketch below)
  • Model preloading
  • Preloading and/or caching data
  • Caching query responses (sketch below)
  • Use smaller models for embedding/retrieval
  • Experiment with inference optimizations like torch.compile() and Unsloth/Liger/Marlin (sketch below)
  • Use an fp8, bfloat16, or float16 torch dtype instead of float32 on GPU (covered in the same sketch).
  • Consider a smaller vector DB of summarized data for the first retrieval instead of searching an entire full-text DB up front (sketch below).
  • Use async code where appropriate
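
A quick way to act on the profiling bullet is Python's built-in cProfile. In this sketch, `answer` is a hypothetical stand-in for your pipeline's end-to-end entry point:

```python
# Profile one end-to-end request to see which stage dominates latency.
# `answer` is a hypothetical placeholder for your RAG entry point.
import cProfile
import pstats

cProfile.run("answer('What is retrieval-augmented generation?')", "rag.prof")

# Show the ten slowest calls by cumulative time; embedding, retrieval,
# and generation will each show up under their own function names.
pstats.Stats("rag.prof").sort_stats("cumulative").print_stats(10)
```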
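
For the caching bullets, a minimal sketch; `embed_query` and `rag_answer` are hypothetical placeholders for your embedding model and your retrieve-and-generate step:

```python
# Minimal embedding + response caching sketch. `embed_query` and
# `rag_answer` are placeholders: wire in your own model and pipeline.
import functools
import hashlib

def embed_query(query: str) -> tuple[float, ...]:
    raise NotImplementedError("call your embedding model here")

def rag_answer(query: str, query_vec: tuple[float, ...]) -> str:
    raise NotImplementedError("retrieve context and generate here")

@functools.lru_cache(maxsize=4096)
def cached_embedding(query: str) -> tuple[float, ...]:
    # Repeated queries skip the embedding model entirely.
    return embed_query(query)

_responses: dict[str, str] = {}

def answer(query: str) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _responses:  # cache miss: run the full pipeline once
        _responses[key] = rag_answer(query, cached_embedding(query))
    return _responses[key]
```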
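
The speculative-decoding bullet maps onto transformers' assisted generation, where the small model drafts tokens and the large model verifies them. A sketch with the Qwen2 pair from the list, assuming both models fit in your GPU memory:

```python
# Assisted (speculative) decoding: the 0.5B model drafts candidate
# tokens and the 7B model verifies them, cutting generation latency.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Summarize the retrieved context: ...", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```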
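
The dtype and torch.compile() bullets combine naturally. A sketch using a small model for brevity; the static KV cache is the pattern recent transformers versions document for compiled generation:

```python
# Load in bfloat16 instead of float32, then compile the forward pass.
# The first generation is slow (compilation); later ones are faster.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda").eval()
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tok("Hello!", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(
        **inputs, max_new_tokens=32, cache_implementation="static"
    )
print(tok.decode(out[0], skip_special_tokens=True))
```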
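
And the summarized-first-pass idea, sketched with plain NumPy. Here `summary_vecs` (one vector per document summary), `chunk_vecs_by_doc`, and the pre-embedded `query_vec` are assumptions standing in for your own index:

```python
# Two-stage retrieval: search a small index of per-document summary
# vectors first, then search full-text chunks only within the top docs.
import numpy as np

def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    # Cosine similarity between the query and every row of `matrix`.
    scores = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(scores)[::-1][:k]

def two_stage_search(query_vec, summary_vecs, chunk_vecs_by_doc,
                     k_docs=3, k_chunks=5):
    doc_ids = top_k(query_vec, summary_vecs, k_docs)        # cheap first pass
    candidates = np.vstack([chunk_vecs_by_doc[d] for d in doc_ids])
    return doc_ids, top_k(query_vec, candidates, k_chunks)  # narrow second pass
```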

Please note that RAG may not be the best choice for real-time use cases. The key thing to remember is to keep the data as close to the user as possible if you want to get it to them faster.


I'm having some issues with the RAG pipeline. It generally takes 0.2-2 seconds to respond, and most of the time the embedding model is the slowest part. I can implement prompt caching, but I was considering a more hardware-related solution. What do you think about using Ray for distributed serving? Also, what do you think about GraphQL?