Why is the response slower than the 70B model?

#9 · opened by shalene

Is it unique to GGUF, or does it also happen with GPTQ? I tested it with llama.cpp 12.14.

For me it is as fast as a 13B model.

I get varied results. When I first tested it with llama.cpp, it ran at about 5 t/s, but on later runs it seems it can get 'stuck' processing the prompt, leading to very long response times. I haven't had time to look into this further.

It seems slow on an Nvidia A100 with 80 GB of VRAM. Does anyone know why?

I noticed prompt processing is slow with the Q4_K_M version but much faster with Q5_K_M.

On an RTX 4090 and i9-14900K, benchmarked with llama-bench from llama.cpp (an example invocation follows the table).

| model | size | params | backend | ngl | threads | pp 512 (t/s) | tg 128 (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 8 | 205.07 | 83.16 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 16 | 204.48 | 83.21 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 24 | 204.28 | 83.22 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 32 | 203.82 | 83.17 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 8 | 145.54 | 27.75 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 16 | 121.58 | 25.57 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 24 | 147.14 | 26.41 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 32 | 145.23 | 9.36 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 8 | 58.18 | 15.12 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 16 | 49.28 | 13.8 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 24 | 64.25 | 15.07 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 32 | 73.69 | 12.02 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 8 | 33.86 | 10.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 16 | 31.75 | 9.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 24 | 40.37 | 10.58 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 32 | 45.39 | 8.8 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 8 | 18.02 | 7.1 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 16 | 19.74 | 5.9 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 24 | 24.81 | 6.74 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 32 | 28.31 | 5.62 |
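For reference, a minimal sketch of the kind of llama-bench command that produces a sweep like the one above. The model filename is a placeholder, and the exact ngl value depends on which quant you load; the flags are llama-bench's standard options.

```sh
# Hypothetical sweep: -p 512 / -n 128 correspond to the pp 512 and tg 128 columns,
# -ngl is the number of layers offloaded to the GPU, and -t takes a comma-separated
# list of thread counts to benchmark one after another.
./llama-bench -m mixtral-q3_k_m.gguf -p 512 -n 128 -ngl 33 -t 8,16,24,32
```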

Maybe try 1 or 4 threads, thanks. I only have a Ryzen 5 5600G and an RTX 4060 Ti 16 GB.
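A minimal sketch of what trying a lower thread count could look like, assuming a recent llama.cpp build (the binary is `main` in older builds, `llama-cli` in newer ones) and a placeholder model path:

```sh
# Hypothetical quick check with only 4 CPU threads; adjust -ngl to however many
# layers fit in your VRAM for the quant you are using.
./llama-cli -m mixtral-q4_k_m.gguf -ngl 27 -t 4 -p "Hello" -n 128
```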
