TheBloke committed
Commit 012fbc4
1 Parent(s): 8c469a0

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -20,11 +20,13 @@ This repo contains 4bit GPTQ models for GPU inference, quantised using [GPTQ-for
 
 ## PERFORMANCE ISSUES
 
-I am currently working on re-creating these GPTQs due to performance issues reported by many people.
+For reasons I can't yet understand, there are performance problems with these 4bit GPTQs that I have not experienced with any other GPTQ 7B or 13B models.
 
-If you've not yet downloaded the models you might want to wait an hour to see if the new files I'm making now will fix this problem.
+I have re-made the GPTQs several times, trying various versions of GPTQ-for-LLaMa code, but I currently can't resolve it.
 
-This message will disappear once the problem is resolved.
+Using the act-order.safetensors file on Triton code performs acceptably for me, testing on a 4090 (e.g. 10-13 tokens/s). But the no-act-order.safetensors file, tested on the older CUDA oobabooga GPTQ-for-LLaMa code, returns only 4 tokens/s.
+
+I will keep investigating and trying to work out what's happening here. But for the moment, if you're not able to use Triton GPTQ-for-LLaMa, you may want to try another 7B GPTQ model.
 
 ## GIBBERISH OUTPUT IN `text-generation-webui`?
 
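For readers affected by the slow no-act-order path, a minimal sketch of how the act-order.safetensors file might be loaded on the Triton branch of GPTQ-for-LLaMa through text-generation-webui. The model directory name is a placeholder, `--groupsize 128` is an assumption (check the actual quantisation parameters in the repo), and exact flags vary between webui versions:

```bash
# Hedged sketch, not from this repo's README: launch text-generation-webui
# with its GPTQ flags, assuming the Triton branch of GPTQ-for-LLaMa is
# installed in repositories/. <model-dir> is a placeholder for the folder
# containing the act-order.safetensors file; groupsize 128 is an assumption.
python server.py \
    --model <model-dir> \
    --wbits 4 \
    --groupsize 128 \
    --model_type llama
```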