Speeds compared to llama_cpp_python?
I'm running on an A6000, and I've made sure the GPU is fully utilized, but the PyTorch version is WAY slower than the GGUF/GGML version at a comparable 4-bit quantization.
Is anyone else seeing this?
Examples:
gptq-4bit-32g-actorder_True, PyTorch/Hugging Face AutoModelForCausalLM = 1-2 tok/s
mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, llama_cpp_python = 50 tok/s
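In case it matters, here is roughly how each side is being loaded (the GPTQ repo name is a placeholder/assumption for my setup, sampling settings omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = "Write a haiku about GPUs."

# Transformers/AutoGPTQ path (gptq-4bit-32g-actorder_True branch; repo name assumed)
gptq_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
tok = AutoTokenizer.from_pretrained(gptq_id)
model = AutoModelForCausalLM.from_pretrained(
    gptq_id, revision="gptq-4bit-32g-actorder_True", device_map="auto"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

# llama_cpp_python path, all layers offloaded to the GPU
llm = Llama(model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)
print(llm(prompt, max_tokens=128)["choices"][0]["text"])
```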
Does this discrepancy sound remotely correct? Is llama.cpp really that much faster, or is it just that the quantization is implemented better there?
Despite auto-gptq claiming to be installed, I think it is running without the CUDA backend. My PyTorch is the CUDA 12.1 build, my CUDA toolkit install is 12.x-something, and pip install auto-gptq claims to install the torch2.1.1+cu121 build of auto-gptq, but when I go to run, it kicks me to the slow path, which I've read is about 20x slower (2 tok/s x 20 = 40 tok/s... so that tracks).
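For anyone else debugging this, the basic sanity check I'm running (note: this only confirms the PyTorch CUDA build itself, not that auto-gptq's own CUDA/ExLlama kernels compiled):

```python
import torch

print(torch.__version__)           # expect a +cu121 build, e.g. 2.1.1+cu121
print(torch.version.cuda)          # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())   # False means everything falls back to slow paths
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the A6000
```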
I might try a side-by-side install with CUDA 11.8 to see if that fixes the issue.
Documenting in case anyone else hits this issue.
@SpaceCowboy850 Well, it shouldn't be that slow, but llama_cpp_python is much faster than transformers/AutoGPTQ.
However, if you use GPTQ models with the ExLlama kernels, it will probably be faster than llama_cpp_python.
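For example, something like this should route a GPTQ model through the ExLlama v2 kernels via transformers (untested sketch; the repo name is a placeholder and the exact GPTQConfig options depend on your transformers/optimum versions):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # placeholder GPTQ repo

# Request the ExLlama v2 kernels for the already-quantized GPTQ weights
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```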
But instead of GPTQ models, I recommend using EXL2 models, as they are higher quality and faster.
Only exllamav2 (and anything built on it, like text-generation-webui) supports EXL2; normal/v1 exllama does not.
With exllamav2 and the EXL2 quant format, you will get the fastest single-prompt inference speed.
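A minimal exllamav2 sketch, along the lines of the library's own inference example (the model dir is a placeholder; exact class/argument names may differ slightly between exllamav2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/mixtral-8x7b-instruct-exl2"  # placeholder EXL2 model dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so load_autosplit can spread layers across VRAM
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("[INST] Write a haiku about GPUs. [/INST]", settings, 200))
```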