Inference is very slow (about 3 secs/token)

#11
by rfernand - opened
Microsoft org

Great to have this model in HF! The inference is super slow - makes it hard to do real-time experiments. Can this be sped up easily?

Microsoft org

As measured on Windows 11, CPU: i9-13900KF, 128 GB RAM, GPU: RTX 3090 (24 GB).

Use a quant. Which doesn't exist yet...

@rfernand your best bet is to use quantization; it should boost speed by a large amount and take up less VRAM. I'd use the GPTQ quant format and load it with Hugging Face transformers to get good speed. Transformers is the simpler route, but something like ExLlamaV2 should get you the fastest speed.
https://huggingface.co/TheBloke/Orca-2-13B-GPTQ

Use the 8-bit one for maximum quality.

heh yeah and now they do exist ;)

Microsoft org

Thanks @YaTharThShaRma999 and @PsiPi.

This is great - I tried the 4-bit version (https://huggingface.co/TheBloke/Orca-2-13B-GGUF) with the following results:
- model loading: 4x faster
- inference: 12x faster

TLDR

1. `pip install ctransformers[cuda]`
2. Python script for inference:

```python
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to the GPU.
# Set it to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Orca-2-13B-GGUF",
    model_file="orca-2-13b.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=50,
)

print(llm("AI is going to"))
```
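
If you want to check the speedup yourself, here's a rough timing sketch using the same ctransformers model; each streamed chunk roughly corresponds to one token, and the numbers will vary with hardware and gpu_layers:

```python
# Rough tokens/sec measurement using ctransformers' streaming output.
import time
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Orca-2-13B-GGUF",
    model_file="orca-2-13b.Q4_K_M.gguf",
    model_type="llama",
    gpu_layers=50,
)

start = time.time()
count = 0
for chunk in llm("AI is going to", max_new_tokens=128, stream=True):
    count += 1  # each streamed chunk is roughly one token
elapsed = time.time() - start
print(f"{count} tokens in {elapsed:.1f}s (~{count / elapsed:.1f} tokens/sec)")
```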
rfernand changed discussion status to closed

Yeah, LoneStriker offers an excellent version as well.

For inference, I get the following error:

`GLIBC_2.29' not found

Anyone know how to resolve this?

Specifically:

```
OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /local/home/user_name/anaconda3/envs/odi-ds/lib/python3.9/site-packages/ctransformers/lib/cuda/libctransformers.so)
```

Says you have the wrong version of glibc? Not to be glib, but... get the right one? You might need to wrap it in an env. Don't know your situation. Good luck @Vulfgang
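
If it helps, a quick way to check which glibc your Python is actually running against (the prebuilt libctransformers.so in that error wants >= 2.29):

```python
# Print the glibc version the running interpreter sees; the prebuilt CUDA lib requires >= 2.29.
import platform
print(platform.libc_ver())  # e.g. ('glibc', '2.28') on older RHEL/CentOS-based systems
```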

Thank you for replying. I think I have the right glibc now, but every time I run the code in Jupyter my kernel dies as soon as I try to download the model from the repo.

Wait, never mind the last comment, all good.
