---
license: other
library_name: transformers
tags:
- safetensors
- llama
---
Converted to HF with transformers 4.30.0.dev0, then quantized to 4-bit with GPTQ (group size 32):

```
python llama.py ../llama-65b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors 4bit-32g.safetensors
```
PPL should be marginally better than with group size 128, at the cost of more VRAM. An A6000 should still fit it all at the full 2048-token context.
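
If you want to load the quantized weights directly in Python, a library such as AutoGPTQ can usually read GPTQ-for-LLaMa safetensors output. The sketch below is an untested assumption for this particular checkpoint: the directory path is a placeholder, and the quantization settings are restated by hand because GPTQ-for-LLaMa does not write a `quantize_config.json`.

```python
# Minimal loading sketch (assumption, not a tested recipe for this checkpoint).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./llama-65b-4bit-32g"  # placeholder: directory containing 4bit-32g.safetensors

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# Restate the settings used for quantization above: 4 bits, group size 32, act-order.
quantize_config = BaseQuantizeConfig(bits=4, group_size=32, desc_act=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    quantize_config=quantize_config,
    model_basename="4bit-32g",  # matches the --save_safetensors filename
    use_safetensors=True,
    device="cuda:0",
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Whether a given AutoGPTQ version handles the act-order + group size 32 combination may vary; the KoboldAI fork noted below is the intended route.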
Note that this model was quantized with GPTQ's `cuda` branch, which means it should work with 0cc4m's KoboldAI fork:
https://github.com/0cc4m/KoboldAI