Latency issue when running inference with HuggingFacePipeline from LangChain
#2 opened by prasoons075
It takes around 5 minutes on average to respond. Are there any hacks to reduce the model response time?
The best way is to use https://github.com/huggingface/text-generation-inference to host an optimized endpoint and then use that endpoint in LangChain. The built-in HF inference pipeline is incredibly slow, especially for Falcon models.
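For reference, here is a minimal sketch of wiring a running text-generation-inference (TGI) endpoint into LangChain. It assumes the TGI server is already serving your model (e.g. via the TGI Docker image) and is reachable at the URL below; the class and parameter names may differ depending on your LangChain version.

```python
# Minimal sketch: point LangChain at an existing TGI endpoint.
# Assumes a text-generation-inference server is already running and
# reachable at http://localhost:8080/ (adjust the URL to your setup).
from langchain.llms import HuggingFaceTextGenInference

llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # your TGI endpoint
    max_new_tokens=256,
    temperature=0.7,
)

# The LLM can now be used directly or inside any LangChain chain.
print(llm("Summarize what text-generation-inference does in one sentence."))
```

Because generation then happens on the optimized TGI server (continuous batching, optimized kernels) rather than through the plain HF pipeline, response times should drop substantially.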
psinger changed discussion status to closed