nm-research committed
Commit dc7681f · verified · 1 Parent(s): 08b31c0

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -37,7 +37,7 @@ It achieves an average score of 84.29 on the [OpenLLM](https://huggingface.co/sp
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to FP8 data type, ready for inference with vLLM built from source.
+This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) to FP8 data type.
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
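For context on the scheme the changed paragraph describes, below is a minimal sketch of symmetric per-tensor FP8 (E4M3) quantization in PyTorch. It assumes a PyTorch build that provides `torch.float8_e4m3fn`; the function names and the `FP8_E4M3_MAX` constant are illustrative and not taken from the model card or its actual quantization recipe.

```python
import torch

# Largest finite value representable in float8_e4m3fn.
FP8_E4M3_MAX = 448.0

def fp8_quantize(tensor: torch.Tensor):
    # Symmetric per-tensor quantization: one scale for the whole tensor,
    # chosen so the largest-magnitude element maps to the FP8 max.
    scale = tensor.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (tensor / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def fp8_dequantize(q: torch.Tensor, scale: torch.Tensor):
    # The same single linear scaling maps the FP8 values back to real values.
    return q.to(torch.float32) * scale

# Illustrative round trip on a random weight matrix.
w = torch.randn(4096, 4096, dtype=torch.float32)
q, s = fp8_quantize(w)
w_hat = fp8_dequantize(q, s)
print((w - w_hat).abs().max())  # per-tensor FP8 quantization error
```

Because the scaling is symmetric (no zero point) and per-tensor (a single scale rather than one per channel), dequantization is a single multiply, which is what makes the scheme cheap to apply inside the linear operators of the transformer blocks.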