Latency issue when running inference with HuggingFacePipeline from LangChain
#2 opened by prasoons075
It takes around 5 minutes on average to respond. Are there any hacks to reduce the model response time?
The best way is to use https://github.com/huggingface/text-generation-inference to host an optimized endpoint and then use that endpoint in LangChain. The built-in HF inference pipeline is incredibly slow, especially for Falcon models.
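For reference, here is a minimal sketch of wiring a running text-generation-inference (TGI) endpoint into LangChain. It assumes the TGI server is already serving your model (e.g. via the TGI Docker image) and is reachable at the URL below; the class and parameter names may differ depending on your LangChain version.

```python
# Minimal sketch: point LangChain at an existing TGI endpoint.
# Assumes a text-generation-inference server is already running and
# reachable at http://localhost:8080/ (adjust the URL to your setup).
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # your TGI endpoint
    max_new_tokens=256,
    temperature=0.7,
)

# The LLM can now be used directly or inside any LangChain chain.
print(llm("Summarize what text-generation-inference does in one sentence."))
```

Because generation then happens on the optimized TGI server (continuous batching, optimized kernels) rather than through the plain HF pipeline, response times should drop substantially.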
psinger changed discussion status to closed