Speeds compared to llama_cpp_python?
I'm running on an A6000, and I've made sure the GPU is fully utilized, but the PyTorch version is WAY slower than the GGUF/GGML version at a comparable 4-bit quantization.
Is anyone else seeing this?
Examples:
gptq-4bit-32g-actorder_True, PyTorch/Hugging Face AutoModelForCausalLM = 1-2 tok/s
mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf, llama_cpp_python = 50 tok/s
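In case it matters, here is roughly how each side is being loaded (the GPTQ repo name is a placeholder/assumption for my setup, sampling settings omitted):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = "Write a haiku about GPUs."

# Transformers/AutoGPTQ path (gptq-4bit-32g-actorder_True branch; repo name assumed)
gptq_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
tok = AutoTokenizer.from_pretrained(gptq_id)
model = AutoModelForCausalLM.from_pretrained(
    gptq_id, revision="gptq-4bit-32g-actorder_True", device_map="auto"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

# llama_cpp_python path, all layers offloaded to the GPU
llm = Llama(model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)
print(llm(prompt, max_tokens=128)["choices"][0]["text"])
```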
Does this discrepancy sound remotely correct? Is llama.cpp really that much faster, or is it just that the quantization is implemented better there?
Despite auto-gptq claiming to be installed, I think it is running without the CUDA backend. My PyTorch is the CUDA 12.1 build, my CUDA toolkit install is 12.x-something, and pip install auto-gptq claims to install the torch2.1.1+cu121 build of auto-gptq, but when I go to run, it kicks me to the slow path, which I've read is about 20x slower (2 tok/s x 20 = 40 tok/s... so that tracks).
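For anyone else debugging this, the basic sanity check I'm running (note: this only confirms the PyTorch CUDA build itself, not that auto-gptq's own CUDA/ExLlama kernels compiled):

```python
import torch

print(torch.__version__)           # expect a +cu121 build, e.g. 2.1.1+cu121
print(torch.version.cuda)          # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())   # False means everything falls back to slow paths
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the A6000
```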
I might try a side-by-side install with CUDA 11.8 to see if that fixes the issue.
Documenting in case anyone else hits this issue.
@SpaceCowboy850 Well, it shouldn't be that slow, but llama_cpp_python is much faster than transformers/AutoGPTQ.
However, if you use GPTQ models with the ExLlama kernels, it will probably be faster than llama_cpp_python.
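For example, something like this should route a GPTQ model through the ExLlama v2 kernels via transformers (untested sketch; the repo name is a placeholder and the exact GPTQConfig options depend on your transformers/optimum versions):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # placeholder GPTQ repo

# Request the ExLlama v2 kernels for the already-quantized GPTQ weights
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
```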
But instead of GPTQ models, I recommend using EXL2 models, as they are higher quality and faster.
Only exllamav2 (and anything built on it, like text-generation-webui) supports EXL2; normal/v1 exllama does not.
With exllamav2 and the EXL2 quant format, you will get the fastest single-prompt inference speed.
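A minimal exllamav2 sketch, along the lines of the library's own inference example (the model dir is a placeholder; exact class/argument names may differ slightly between exllamav2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/mixtral-8x7b-instruct-exl2"  # placeholder EXL2 model dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so load_autosplit can spread layers across VRAM
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("[INST] Write a haiku about GPUs. [/INST]", settings, 200))
```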