Help! Hoping to get an inference configuration that runs on multiple GPUs.

#25
by Lokis - opened

I have an 8×A100 80GB server, but after a lot of testing I still can't get a stable multi-GPU configuration. When using Auto, the output is very slow. Are there any documents or open-source configuration files I can learn from?
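For context, "Auto" here presumably refers to loading with `device_map="auto"` in transformers (an assumption, since the thread does not show the actual code). A minimal sketch of that kind of setup, with a placeholder model ID, would look roughly like this; note that `device_map="auto"` shards layers across GPUs pipeline-style, so only one GPU is typically busy at a time, which can explain slow generation:

```python
# Minimal sketch, assuming transformers + accelerate and an unnamed causal-LM checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-id"  # placeholder; the thread does not name a model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spreads layers across the 8 GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```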

Hey @Lokis and everyone else watching, check out vLLM for faster inference on multiple GPUs.
In my experience the transformers implementation achieves low GPU utilization, hence its slow inference speed.
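As a rough sketch of what that looks like (assuming a recent vLLM release and a placeholder model ID, since the thread doesn't name one), tensor parallelism across all 8 A100s is a single argument:

```python
# Minimal sketch, not a tuned configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder; replace with your model
    tensor_parallel_size=8,             # shard the model across all 8 A100s
    dtype="float16",
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

Unlike `device_map="auto"`, tensor parallelism keeps all GPUs active on every forward pass, which is usually where the throughput gain comes from.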
