Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ Converted to HF with `transformers 4.30.0.dev0`, then quantized to 4 bit with GPTQ
 
 `python llama.py ../llama-65b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors 4bit-32g.safetensors`
 
-PPL should be marginally better than group size 128
+PPL should be marginally better than group size 128 at the cost of more VRAM. An A6000 should still be able to fit it all at full 2048 context.
 
 ---
 Note that this model is quantized under GPTQ's `cuda` branch, which means this model should work with 0cc4m's KoboldAI fork:
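A quick way to sanity-check the `4bit-32g.safetensors` file produced by the command above is to list its tensors with the `safetensors` Python API. This is a minimal sketch, assuming the `safetensors` and `torch` packages are installed; the exact tensor names and shapes depend on the GPTQ branch used and are not taken from this repo.

```python
# Minimal sketch: list the tensors in the quantized checkpoint written by llama.py.
# Assumes `safetensors` and `torch` are installed; tensor names and shapes depend
# on the GPTQ branch used and are not taken from this repo.
from safetensors import safe_open

with safe_open("4bit-32g.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)
```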
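The added VRAM note can also be checked with a back-of-envelope estimate. The sketch below assumes the commonly cited LLaMA-65B architecture figures (80 layers, hidden size 8192) and an fp16 KV cache; none of these numbers come from this repo, and real usage will be somewhat higher once activations and allocator overhead are included.

```python
# Back-of-envelope VRAM estimate for 4-bit, group-size-32 LLaMA-65B.
# The architecture numbers (80 layers, hidden size 8192) are the commonly cited
# LLaMA-65B figures and are assumptions here, not read from the checkpoint.

params = 65e9          # total weights being quantized
wbits = 4              # --wbits 4
groupsize = 32         # --groupsize 32
n_layers = 80          # assumed LLaMA-65B depth
hidden = 8192          # assumed LLaMA-65B hidden size
ctx = 2048             # full context length
gib = 1024 ** 3

weights = params * wbits / 8                    # packed 4-bit weights
# per group of 32 weights: one fp16 scale (2 bytes) + one packed 4-bit zero (0.5 bytes)
group_overhead = params / groupsize * (2 + 0.5)
# fp16 KV cache: 2 tensors (K and V) * 2 bytes * ctx tokens * hidden dim * layers
kv_cache = 2 * 2 * ctx * hidden * n_layers

print(f"weights        ~{weights / gib:5.1f} GiB")
print(f"group overhead ~{group_overhead / gib:5.1f} GiB")
print(f"kv cache @2048 ~{kv_cache / gib:5.1f} GiB")
print(f"total          ~{(weights + group_overhead + kv_cache) / gib:5.1f} GiB (A6000: 48 GiB)")
```

Under these assumptions the total lands around 40 GiB, which is consistent with the claim that an A6000 can hold the model at full 2048 context.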