---
tags:
- fp8
---

Meta-Llama-3-8B-Instruct quantized to FP8 weights and activations using per-tensor quantization, ready for inference with vLLM >= 0.4.3.

Produced using https://github.com/neuralmagic/AutoFP8/blob/b0c1f789c51659bb023c06521ecbd04cea4a26f6/quantize.py

```bash
python quantize.py --model-id meta-llama/Meta-Llama-3-8B-Instruct --save-dir Meta-Llama-3-8B-Instruct-FP8
```

Accuracy on MMLU:

```
vllm (pretrained=meta-llama/Meta-Llama-3-8B-Instruct,gpu_memory_utilization=0.4), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 16
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6569|±  |0.0038|
| - humanities     |N/A    |none  |     5|acc   |0.6049|±  |0.0068|
| - other          |N/A    |none  |     5|acc   |0.7203|±  |0.0078|
| - social_sciences|N/A    |none  |     5|acc   |0.7663|±  |0.0075|
| - stem           |N/A    |none  |     5|acc   |0.5652|±  |0.0085|

vllm (pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8,quantization=fp8,gpu_memory_utilization=0.4), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 16
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6567|±  |0.0038|
| - humanities     |N/A    |none  |     5|acc   |0.6072|±  |0.0068|
| - other          |N/A    |none  |     5|acc   |0.7206|±  |0.0078|
| - social_sciences|N/A    |none  |     5|acc   |0.7618|±  |0.0075|
| - stem           |N/A    |none  |     5|acc   |0.5649|±  |0.0085|
```
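The results above appear to come from lm-evaluation-harness with the vLLM backend; the exact invocation is an assumption reconstructed from the header lines printed in the output (the `--model_args` mirror the `pretrained=...` strings above):

```bash
# Assumed lm-evaluation-harness invocation for the FP8 results above;
# swap pretrained= for the meta-llama checkpoint (and drop quantization=fp8)
# to reproduce the baseline run.
lm_eval --model vllm \
  --model_args pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8,quantization=fp8,gpu_memory_utilization=0.4 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size 16
```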
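Usage with vLLM:

A minimal inference sketch using the standard vLLM offline API. The prompt and sampling parameters are illustrative, not part of the original card.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "nm-testing/Meta-Llama-3-8B-Instruct-FP8"

# Format the request with the Llama 3 chat template so the Instruct model
# sees the prompt structure it was trained on.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What are the benefits of FP8 quantization?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# The FP8 scheme is stored in the checkpoint config, so recent vLLM versions
# detect it automatically; passing quantization="fp8" explicitly also works.
llm = LLM(model=model_id, quantization="fp8")
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```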