Why is the response slower than the 70B model? #9
opened by shalene
Is this unique to GGUF, or does it also affect GPTQ? I tested it with llama.cpp 12.14.
For me it is as fast as a 13B model.
I get varied results. When I first tested it with llama.cpp it ran at about 5 t/s. But when trying it later, it seems it can get 'stuck' parsing the prompt, leading to very long response times. I haven't had time to look into this further.
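If it helps with debugging, here is a minimal sketch of how to check whether the stall is in prompt processing or in generation. The binary name, model file name, and flag values are assumptions for illustration, not what was actually run:

```sh
# Minimal sketch: see where the time goes (prompt processing vs. generation).
# Assumes a recent llama.cpp build (binary llama-cli; older builds use ./main)
# and a placeholder model path -- adjust both for your setup.
#   -ngl  layers offloaded to the GPU
#   -t    CPU threads for the layers that stay on the CPU
#   -c    context size
#   -n    number of tokens to generate
./llama-cli -m ./models/mixtral-8x7b.Q4_K_M.gguf \
    -ngl 27 -t 8 -c 4096 -n 128 \
    -p "Write a haiku about benchmarks."
# The timing summary printed at the end reports "prompt eval time" and "eval time"
# separately, so a prompt-parsing stall shows up as a very large prompt eval time.
```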
It seems slow on an NVIDIA A100 with 80 GB of VRAM; does anyone know why?
I noticed prompt parsing is slow with the Q4_K_M version but much faster with Q5_K_M.
On an RTX 4090 & i9-14900K. Benchmarked using llama-bench from llama.cpp; a rough sketch of the command is below the table.
| model | size | params | backend | ngl | threads | pp 512 (t/s) | tg 128 (t/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 8 | 205.07 | 83.16 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 16 | 204.48 | 83.21 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 24 | 204.28 | 83.22 |
| llama 7B mostly Q3_K - Medium | 18.96 GiB | 46.70 B | CUDA | 33 | 32 | 203.82 | 83.17 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 8 | 145.54 | 27.75 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 16 | 121.58 | 25.57 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 24 | 147.14 | 26.41 |
| llama 7B mostly Q4_K - Medium | 24.62 GiB | 46.70 B | CUDA | 27 | 32 | 145.23 | 9.36 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 8 | 58.18 | 15.12 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 16 | 49.28 | 13.8 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 24 | 64.25 | 15.07 |
| llama 7B mostly Q5_K - Medium | 30.02 GiB | 46.70 B | CUDA | 22 | 32 | 73.69 | 12.02 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 8 | 33.86 | 10.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 16 | 31.75 | 9.5 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 24 | 40.37 | 10.58 |
| llama 7B mostly Q6_K | 35.74 GiB | 46.70 B | CUDA | 19 | 32 | 45.39 | 8.8 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 8 | 18.02 | 7.1 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 16 | 19.74 | 5.9 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 24 | 24.81 | 6.74 |
| llama 7B mostly Q8_0 | 46.22 GiB | 46.70 B | CUDA | 15 | 32 | 28.31 | 5.62 |
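The exact command isn't shown here, but a llama-bench run along these lines should produce the same kind of markdown table. The model paths are placeholders, and the -ngl values are just read off the table above (lowered for the larger quants so they fit in the 4090's 24 GB of VRAM):

```sh
# Rough sketch of a llama-bench run matching the table above:
#   -p 512       prompt-processing benchmark with a 512-token prompt (pp 512)
#   -n 128       text-generation benchmark over 128 tokens (tg 128)
#   -t 8,16,...  sweep over CPU thread counts
#   -ngl         GPU layers; one value per quant, taken from the table
./llama-bench -m ./models/mixtral-8x7b.Q3_K_M.gguf -ngl 33 -p 512 -n 128 -t 8,16,24,32
./llama-bench -m ./models/mixtral-8x7b.Q4_K_M.gguf -ngl 27 -p 512 -n 128 -t 8,16,24,32
./llama-bench -m ./models/mixtral-8x7b.Q5_K_M.gguf -ngl 22 -p 512 -n 128 -t 8,16,24,32
./llama-bench -m ./models/mixtral-8x7b.Q6_K.gguf   -ngl 19 -p 512 -n 128 -t 8,16,24,32
./llama-bench -m ./models/mixtral-8x7b.Q8_0.gguf   -ngl 15 -p 512 -n 128 -t 8,16,24,32
```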
Maybe try 1 or 4 threads, thanks. I only have a Ryzen 5 5600G and an RTX 4060 Ti 16 GB.