---
license: other
library_name: transformers
tags:
- safetensors
- llama
---
Converted to HF with `transformers 4.30.0.dev0`, then quantized to 4-bit with GPTQ (group size `32`):
`python llama.py ../llama-65b-hf c4 --wbits 4 --true-sequential --act-order --groupsize 32 --save_safetensors 4bit-32g.safetensors`
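As an optional sanity check (not part of the original workflow), the packed tensors in the resulting file can be listed with the `safetensors` library; GPTQ-for-LLaMa checkpoints store each quantized layer as `qweight`, `qzeros`, and `scales` tensors:

```python
# Optional sanity check: list a few tensors from the quantized checkpoint.
from safetensors import safe_open

with safe_open("4bit-32g.safetensors", framework="pt") as f:
    for name in list(f.keys())[:8]:
        print(name, f.get_slice(name).get_shape())
```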
PPL (perplexity) should be marginally better than with group size 128, at the cost of more VRAM. A 48 GB A6000 should still be able to fit the whole model at the full 2048-token context.
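For loading outside of KoboldAI, AutoGPTQ can usually read GPTQ-for-LLaMa style safetensors checkpoints. The snippet below is a minimal sketch under that assumption (this file was produced by the `cuda` branch, so compatibility is not guaranteed); `model_dir` is a hypothetical local copy of this repo, and the quantization parameters mirror the command above:

```python
# Minimal loading sketch with AutoGPTQ (an assumption; the author's own
# workflow targets GPTQ-for-LLaMa's cuda branch / KoboldAI, not AutoGPTQ).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_dir = "./llama-65b-4bit-32g"  # hypothetical local checkout of this repo

# GPTQ-for-LLaMa does not write a quantize_config.json, so the parameters
# from the quantization command are restated here (--act-order -> desc_act).
quantize_config = BaseQuantizeConfig(bits=4, group_size=32, desc_act=True)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="4bit-32g",  # matches the --save_safetensors file name
    use_safetensors=True,
    quantize_config=quantize_config,
    device="cuda:0",
)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```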
---
Note that this model was quantized with GPTQ's `cuda` branch, which means it should work with 0cc4m's KoboldAI fork:
https://github.com/0cc4m/KoboldAI