How to load this model?

#1
by Frz614 - opened

It seems that I can only load this quantized model with vLLM. I want to modify quantize.py, so I tried to load the quantized checkpoint with "AutoFP8ForCausalLM.from_pretrained(local_model_path, quantize_config=quantize_config, local_files_only=True)", but it fails with: "ValueError: Unknown quantization type, got fp8 - supported types are: ['awq', 'bitsandbytes_4bit', 'bitsandbytes_8bit', 'gptq', 'aqlm', 'quanto', 'eetq', 'hqq']". It looks like the "BaseQuantizeConfig" class is not accepted. Is there a way to load the model so I can modify the model file?
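
For reference, the failing call described above looks roughly like this (a minimal sketch; the local path and the activation_scheme value are assumptions not shown in the original post, and the error most likely comes from transformers rejecting the fp8 quantization_config it finds in the checkpoint):

```python
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Hypothetical local copy of the already-quantized FP8 checkpoint
local_model_path = "./Meta-Llama-3-8B-Instruct-FP8-KV"

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",  # assumption; the original post does not show the config
)

# AutoFP8 delegates model loading to transformers, which sees the checkpoint's
# fp8 quantization_config and raises:
#   ValueError: Unknown quantization type, got fp8 - supported types are: [...]
model = AutoFP8ForCausalLM.from_pretrained(
    local_model_path,
    quantize_config=quantize_config,
    local_files_only=True,
)
```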

Neural Magic org

Hi @Frz614, there is no way to load already-quantized checkpoints back into AutoFP8 at the moment. vLLM is the intended place for inference.
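
In practice that means changes to quantize.py are exercised by re-quantizing from the original BF16 model rather than by re-opening the FP8 checkpoint. A minimal sketch of that producer flow, assuming the auto_fp8 package's AutoFP8ForCausalLM / BaseQuantizeConfig interface and a dynamic activation scheme (directory names are illustrative):

```python
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Start from the original, unquantized model (assumed source checkpoint)
pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

# Dynamic activation scales: no calibration examples are required
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize([])  # pass tokenized calibration samples here when using static scales
model.save_quantized(quantized_model_dir)
```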

Neural Magic org

Use it in vLLM with: vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV --kv-cache-dtype fp8
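
For offline inference the same thing can be done through vLLM's Python API; a minimal sketch, assuming the LLM constructor's model and kv_cache_dtype arguments (prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint and keep the KV cache in fp8 as well,
# mirroring the --kv-cache-dtype fp8 flag of the serve command above.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",
)

outputs = llm.generate(
    ["What does FP8 quantization change about inference?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```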

mgoin changed discussion status to closed
